Python Sorted Collections
Hey folks! This is a guest post by Grant Jenks. Let’s give him a warm welcome and get right on into what he has to say. 🙂
Hello all! I’m Grant Jenks and I’m guest-posting about one of my favorite topics: Python Sorted Collections.
Python is a little unusual regarding sorted collection types as compared with other programming languages. Three of the top five programming languages in the TIOBE Index include sorted list, sorted dict or sorted set data types. But neither Python nor C include these. For a language heralded as “batteries included” that’s a little strange.
The reasoning is a bit circular but boils down to: the standard library covers most use cases, for everything else there’s PyPI, the Python Package Index. But PyPI works only so well. In fact, some peculiarities of the Python community make PyPI’s job quite difficult. For example, Python likes Monty Python references which many find unusual or obscure. And as Phil Karlton would point out, naming things is hard.
As an aside, it’s worth noting
collections.OrderedDict in the Python standard library. OrderedDict maintains the order that items were added to the dictionary. Sometimes that order is sorted:
>>> from collections import OrderedDict >>> letters = [('a', 0), ('b', 1), ('c', 2), ('d', 3)] >>> values = OrderedDict(letters) >>> print(values) OrderedDict([('a', 0), ('b', 1), ('c', 2), ('d', 3)]) >>> print(list(values.keys())) ['a', 'b', 'c', 'd']
We can continue editing this OrderedDict. Depending on the key we add, the order may remain sorted.
>>> values['e'] = 4 >>> print(list(values.keys())) ['a', 'b', 'c', 'd', 'e']
But sort order won’t always be maintained. If we remove an existing key and add it back, then we’ll see it appended to the end of the keys.
>>> del values['a'] >>> values['a'] = 0 >>> print(list(values.keys())) ['b', 'c', 'd', 'e', 'a']
Ooops! Notice now that ‘a’ is at the end of the list of keys. That’s the difference between ordered and sorted. While OrderedDict maintains order based on insertion order, a SortedDict would maintain order based on the sorted order of the keys.
A few years ago I set out to select a sorted collections library from PyPI. I was initially overwhelmed by the options. There are many data types in computer science theory that can be used and each has various tradeoffs. For example, Red-Black Trees are used in the Linux Kernel but Tries are often more space efficient and used in embedded systems. Also B-Trees work very well with a huge number of items and are commonly used in databases.
What I really wanted was a pure-Python solution that was fast-enough. Finding a solution at the intersection of those requirements was really tough. Most fast implementations were written in C and many lacked benchmarks or documentation.
I couldn’t find the right answer so I built it: Sorted Containers. The right answer is pure-Python. It’s Python 2 and Python 3 compatible. It’s fast. It’s fully-featured. And it’s extensively tested with 100% coverage and hours of stress. SortedContainers includes SortedList, SortedDict, and SortedSet implementations with a familiar API.
>>> from sortedcontainers import SortedList, SortedDict, SortedSet >>> values = SortedList('zaxycb') >>> values 'a' >>> values[-1] 'z' >>> list(values) # Sorted order is automatic. ['a', 'b', 'c', 'x', 'y', 'z'] >>> values.add('d') >>> values 'd' >>> del values >>> list(values) # Sorted order is maintained. ['b', 'c', 'd', 'x', 'y', 'z']
Each of the
SortedSet data types looks, swims, and quacks like its built-in counterpart.
>>> items = SortedDict(zip('dabce', range(5))) >>> list(items.keys()) # Keys iterated in sorted order. ['a', 'b', 'c', 'd', 'e'] >>> items['b'] 2 >>> del items['c'] >>> list(items.keys()) # Sorted order is automatic. ['a', 'b', 'd', 'e'] >>> items['c'] = 10 >>> list(items.keys()) # Sorted order is maintained. ['a', 'b', 'c', 'd', 'e']
Each sorted data type also plays nicely with other data types.
>>> keys = SortedSet('dcabef') >>> list(keys) ['a', 'b', 'c', 'd', 'e', 'f'] >>> 'c' in keys True >>> list(keys | 'efgh') ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'] >>> list(keys & 'cde') ['c', 'd', 'e'] >>> list(keys & 'yzab') ['a', 'b']
In addition to the familiar API of the built-ins, maintaining sorted order affords efficient opportunities for searching and indexing.
- You can very quickly and efficiently lookup the presence or index of a value. What would previously require a linear scan is now done in logarithmic time.
>>> import string >>> values = SortedList(string.lowercase) >>> 'q' in values True >>> values.index('r') 17
- You can slice containers by index or by value. Even mappings and sets support numeric indexing and iteration.
>>> items = SortedDict(zip(string.lowercase, range(26))) >>> list(items.irange('g', 'j')) ['g', 'h', 'i', 'j'] >>> items.index('g') 6 >>> items.index('j') 9 >>> list(items.islice(6, 10)) ['g', 'h', 'i', 'j']
- Mappings also support numeric indexing using a Pandas-like iloc interface.
>>> items.iloc 'a' >>> items.iloc 'f' >>> items.iloc[:5] ['a', 'b', 'c', 'd', 'e'] >>> items.iloc[-3:] ['x', 'y', 'z']
Using these features, you can easily duplicate the advanced features found in Pandas DataFrame indexes, SQLite column indexes, and Redis sorted sets.
On top of it all, performance is very good across the API and faster-than-C implementations for many methods. There are extensive benchmarks comparing alternative implementations, load-factors, runtimes, and simulated workloads. SortedContainers has managed to unseat the decade-old incumbent “blist” module and convinced authors of alternatives to recommend SortedContainers over their own package.
How does it work? I’m glad you asked! In addition to the implementation details, I’ll be giving a talk at PyCon 2016 in Portland, Oregon on Python Sorted Collections that will get into the gory details. We’ll see why benchmarks matter most in claims about performance and why the strengths and weakness of modern processors affect how you choose your data structures. It’s possible to write fast code in pure-Python!
Your feedback on the project is welcome!