Regaining Lost Knowledge

A recent Python newsgroup query asked for an efficient solution to the problem of computing a running median as a large sliding window advances over a stream of data

One category of replies can be classified as clever. The respondents used their innate intelligence and knowledge of Python for a fresh look at the problem. Their solutions focused on the fact the position of the median doesn’t move much between successive updates. Unfortunately, these solutions were catastrophically slow for large data windows.

Another category of reply relied on education. A respondent remembered that QuickSelect is a fast O(n) way of finding a median in unsorted data. I responded with an ASPN recipe implementing QuickSelect (written by yours truly). These posts represented progress, a triumph of education over cleverness, but even that improved solution was unusably slow for large window sizes.

A more promising type of reply relied on research. Surely, this problem had been solved before. Indeed, there is a published paper: Efficient Algorithm for Computing a Running Median by Soymya D. Mohanty with an O(sqrt(n)) solution. Score one for science!

However, that solution was trumped by respondents who characterized the solution mathematically, “the obvious way to compute a running median involves a tree structure so you can quickly insert and delete elements, and find the median. That would be asymptotically O(log n) but messy to implement.” Fortunately, such an implementation exists using the blist Python extension. Alas, we had a good solution but not a portable one. Without the extension module, the B+ tree structure is non-trivial to implement.

When I thought about the problem, the mathematical characterization suggested data structures that maintained sorted data with O(log n) updates, and previous education indicated a skiplist would fit the bill, but it took cleverness to discover that indexing the skiplist to find the median could be reduced to O(log n) time by adding link widths to the structure. This thinking led to my solution which is easily portable across languages and scales well to very large window sizes.

Had I discovered something new under the sun? Yes and no.

Yes, as far as I can tell the idea of using an indexable skiplist to solve the running median problem in O(log n) time had never been presented before anywhere else. The best published solution was Mohanty’s O(sqrt n) solution. Score one for combining mathematical characterization with education and cleverness.

And no, the big inspiration of figuring out how to make a skiplist indexable was not a new result. Score a big failure for research. Everywhere I had looked for skiplist resources, only the basics were presented (insertion and deletion in O(log n) time). No resource mentioned indexable skiplists. The previous work on the problem had effectively been lost. An entire generation of programmers was learning about skiplists but not being taught that they could be made efficiently indexable.

To help the world regain this lost knowledge, I updated the wikipedia entry for skiplists to show how to make them indexable with my Python recipe and I’ve added a link to Pugh’s earlier research on the problem.

Will that wikipedia entry really solve the problem of lost knowledge? The page view statistics suggest that it will. Only time will tell.

Explore posts in the same categories: Algorithms

This entry was posted on February 6, 2010 at 11:28 pm and is filed under Algorithms. You can subscribe via RSS 2.0 feed to this post's comments.

Tags: Lost Knowledge, Running Median, Skiplists, Wikipedia

You can comment below, or link to this permanent URL from your own site.

7 Comments on “Regaining Lost Knowledge”

Vorlath Says:

March 26, 2010 at 8:55 am
I came up with my own indexed skip list several years ago. I like that you updated the wikipedia page. Skip lists are great. My current version of skip lists is where you can operate on it via both index and object. For example, if you want to search for a node both by ID or by index, this list will do it for you in O(log n) in both cases. Note that the objects can be inserted in random order (and not sorted by name).

It’s great for implementing ordered (indexed) sets where you would also like to retrieve elements by name. Again, sorting by index and sorting by name would give two different lists.

Reply
Pole Says:

March 31, 2010 at 7:40 am
(Sorry for double post…)
Uh…
I think we can use 2 heaps (one for ), that’s a lot easier than skip lists.
When you add a value, just put the value in the right heap. If the median isn’t the median anymore, just put a value from a heap to the other.
To delete value in a heap, set “deleted” in a table. When the top of a heap is set “deleted”, delete it (that’s “lazy” heap).

Reply
rrenaud Says:

April 29, 2011 at 7:16 pm
Maybe the reason you didn’t find any published literature is because you were searching for the wrong terms?

The CLRS algorithms book has a chapter dedicated to order statistics for binary search trees. They support insertion/deletion and finding the i’th item in sorted order all in O(log(n)). It’s clear how to implement the sliding window median if you have one of these already built.

When you get a new item in the stream, delete the trailing part of the window, insert the new item into the window, and then ask for the median.

A lecture on order statistics for BSTs

http://www.catonmat.net/blog/mit-introduction-to-algorithms-part-seven

And a homework problem assigned (I think for undergrads?) in 2004 to add order statistics to skip lists: Problem 5-3 here:

Click to access ps5.pdf

Reply
- rhettinger Says:
  
  May 3, 2011 at 1:22 pm
  The world of binary search trees seems to be very well documented and throughly covers searching and indexing.
  
  In contrast, the world of skiplists is sparsely documented. Aside from the homework problem in some course handouts and a brief mention (without source code) in the original skiplist paper, it seems to have been completely forgotten that skiplists can augmented for O(log n) indexing.
  
  Both data structures can solve the running median problem. The purported advantage of skiplists over binary search trees is that balancing is accomplished trivially through randomization. That makes the implementation comparatively simple.
  
  Reply
Bulwersator Says:

May 3, 2011 at 11:13 am
Thanks for updating wikipedia!

Reply
Edsko de Vries Says:

May 6, 2011 at 2:51 am
I for one appreciated your addition to the Wikipedia article — without it Pugh’s article would have remained unknown to me!

Reply
Josiah C. Says:

December 31, 2011 at 6:50 pm
It’s a great addition to the article, and it’s great to have an easily portable one.

Incidentally, you are not the first. Redis (an in-memory data structure server written in C) has had an indexable skiplist as part of it’s available data structures for over 2 years.

Reply