Posted tagged ‘Lost Knowledge’

Regaining Lost Knowledge

February 6, 2010

A recent Python newsgroup query asked for an efficient solution to the problem of computing a running median as a large sliding window advances over a stream of data

One category of replies can be classified as clever.  The respondents used their innate intelligence and knowledge of Python for a fresh look at the problem.  Their solutions focused on the fact the position of the median doesn’t move much between successive updates.  Unfortunately, these solutions were catastrophically slow for large data windows.

Another category of reply relied on education.  A respondent remembered that QuickSelect is a fast O(n) way of finding a median in unsorted data. I responded with an ASPN recipe implementing QuickSelect (written by yours truly). These posts represented progress, a triumph of education over cleverness, but even that improved solution was unusably slow for large window sizes.

A more promising type of reply relied on research. Surely, this problem had been solved before. Indeed, there is a published paper: Efficient Algorithm for Computing a Running Median by Soymya D. Mohanty with an O(sqrt(n)) solution. Score one for science!

However, that solution was trumped by respondents who characterized the solution mathematically, “the obvious way to compute a running median involves a tree structure so you can quickly insert and delete elements, and find the median. That would be asymptotically O(log n) but messy to implement.” Fortunately, such an implementation exists using the blist Python extension. Alas, we had a good solution but not a portable one. Without the extension module, the B+ tree structure is non-trivial to implement.

When I thought about the problem, the mathematical characterization suggested data structures that maintained sorted data with O(log n) updates, and previous education indicated a skiplist would fit the bill, but it took cleverness to discover that indexing the skiplist to find the median could be reduced to O(log n) time by adding link widths to the structure.  This thinking led to my solution which is easily portable across languages and scales well to very large window sizes.

Had I discovered something new under the sun?  Yes and no.

Yes, as far as I can tell the idea of using an indexable skiplist to solve the running median problem in O(log n) time had never been presented before anywhere else.  The best published solution was Mohanty’s O(sqrt n) solution. Score one for combining mathematical characterization with education and cleverness.

And no, the big inspiration of figuring out how to make a skiplist indexable was not a new result.  Score a big failure for research.  Everywhere I had looked for skiplist resources, only the basics were presented (insertion and deletion in O(log n) time).  No resource mentioned indexable skiplists.  The previous work on the problem had effectively been lost.  An entire generation of programmers was learning about skiplists but not being taught that they could be made efficiently indexable.

To help the world regain this lost knowledge, I updated the wikipedia entry for skiplists to show how to make them indexable with my Python recipe and I’ve added a link to Pugh’s earlier research on the problem.

Will that wikipedia entry really solve the problem of lost knowledge?  The page view statistics suggest that it will.  Only time will tell.