Python Uniq(sequence)
UPDATE (2008.04.13): This has gotten to be a very popular hit for people looking for a quick and easy way to uniqify their lists. The code that was originally posted here is still available here. However, it is ugly and buggy and was only meant for me to find when I needed it. If you liked it, there it is. However, I have spent some time to simplify things. This function uses generators in a lot of places. If you immediately want a list, just apply list() to the output of the function. This does not seem to cause a large hit in performance. Read the docstring for more info. Table of performance characteristics (using l = [ randint(0,1000) for x in xrange(10000000) ] as the sample dataset, randint from random, of course), all values in seconds:
| Order\Hashable | Guess | True | False |
| Ordered | 1.506 | 0.979 | 279 |
| Unordered | 0.560 | 0.560 | 6.997 |
Another Update: I’ve gotten some requests to include a “count” keyword that returns either a list of pairs (original value, count) or a dictionary of the same. This isn’t included in the code (it might be the same return type while meaning something different), but it isn’t hard. I didn’t include the order keyword, but you should get the idea. For people extending this, try not to use the list’s method “count”. In the functions I wrote making use of it, they were more than 100x slower than the below. The number in a comment to the right is the runtime on the same dataset as above. The asymptotic behavior of both of these should be pretty good. Unfortunately, my understanding is that comparisons other than (in)equality between different data types will be dropped in P3K. Meaning, sorting a sequence of objects with different base types (None, int, string) will raise an exception. This will make the second function fail in those situations. I still fail to see why this is undesirable behavior for a dynamically typed language:
def uniqCountDict(seq): # 2.07
d = {}
for i in seq:
try: d[i] +=1
except KeyError: d[i] =1
return d #.items() # if you want a list of tuple pairs, or iteritems
def uniqCountSorted(seq): #6.53
seq, tups, count, item = sorted(seq), [], 0, None
append = tups.append
for j in seq:
if j == item: count+=1
else:
append((item,count))
count, item =1, j
if tups[0][0] == None:
c = tups.pop(0)
tups.insert(0,(None,c[1]-1))
return tups
Download from here or copy and paste from below (unfortunately, wordpress 2.5 seems to be retarded and is stripping whitespace from the preformatted text. I’m leaving the code below so you can glance at it, but the download will have to do if you want to run it without re-indenting manually).
def uniq(seq, order=0, hashable=-1):
u"""return a copy of the given sequence with no repeated items.
Set order to True if you would like order preservation
Setting hashable to -1, 0, or 1 determines whether the function
tries to guess at whether all the times are hashable,
assumes that the items are not hashable,
or assumes that the items are hashable, respectively.
May return generator functions. Apply list() (or similar) if you want all
the values NOW.
The fall back option, for when order preservation isn't needed,
is to sort the input sequence with sorted(). I believe it is possible that
the items in the given sequence do not sort in a way such that the equality
operator gives sane results for consecutive items. If you know the objects
in the list do not sort in a sane manner, set order to True.
This will lead to the worst case algorithm and
should only be used if you know this bad behavior is possible because it is
also probably rare. However, if the items are hashable, you are OK.
Inspired by:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560 and
http://www.peterbe.com/plog/uniqifiers-benchmark"""
def hashOrder(seq):
"""generator function, returning unique items from sequence in order
requires that all the items be hashable."""
a, seq = set(), iter(seq)
for next in seq:
if next not in a:
a.add(next)
yield next
def hashNoOrder(seq):
"""returns items from sequence in whatever order
requires that all the items be hashable."""
return (x for x in set(seq))
def sortUniq(seq):
"""Sorts the sequence, then returns the items as they become unique"""
seq = sorted(seq)
last = seq[0]
yield last
for i in seq[1:]:
if i != last:
last = i
yield last
def brute(seq):
noDupes = []
count, append = noDupes.count, noDupes.append
for i in seq:
if not count(i):
append(i)
yield i
if hashable == True:
if order: return hashOrder(seq)
else: return hashNoOrder(seq)
if not hashable:
if order: return brute(seq)
else:
try: return sortUniq(seq)
except (TypeError, ValueError): return brute(seq)
else:
if order:
try:
set(seq) #make sure all items are hashable; this is far from ideal
return hashOrder(seq)
except (TypeError, ValueError): return brute(seq)
else:
try: return hashNoOrder(seq)
except (TypeError, ValueError):
try: return sortUniq(seq)
except (TypeError, ValueError): return brute(seq)
OLD POST:
This post is not likely to interest those reading this with interest in economics or social sciences.
Often we want to trim a sequence so that each value is unique. Sometimes we want to preserve order. Depending on the various properties of the sequence, this task can be fast, or it can be very, very slow. It is also a pain to decide which one to use when. Thus, I am working on my own uniq function that will take care of these logics for me. Without specifying any of the various options, thus far, the function gets me results on par with the fastest uniq’s I could find lying Google around for various purposes. Of course, many of the algorithms are very close to exactly the same as mine, which would explain the performance similarities. I haven’t tested every branch of the code yet, though. Nevertheless, this function *should* be able to take any python sequence and give you back what you want, nearly as fast as possible without much thought to using it. With a bit more information, you might get some kind of performance or memory friendliness increases. Or you might just be able to delay when the performance hit occurs (e.g. between iterations instead of waiting for the entire list at once). Anyway, here is the code. If you see any places where things can be improved, I am more than happy to hear about it. Also, like I said, I haven’t fully tested or profiled the function, so certain code branches may be broken.
I think PEP270, despite being rejected, should make use of a function like this as a method for as many sequence types possible. The interface is optionally very simple, with a good deal of control for users who want it. And the need arises relatively often. Granted, the set(), used extensively here, makes this moot in many cases. But it depends on having hashable elements, which is where things get complicated and difficult to program efficiently, thus the utility in such a method. It would be very Pythonic, imo, to have this builtin. A uniq and uniq_update could also be used for returning the uniq’d object, or updating the object with the uniq’d result.
edit: code snipped here


February 19th, 2008 at 16:22 -0600
This line in _retGen smells:
if isinstance(gen, types.GeneratorType): return True
shouldn’t it test seq’s type?
Also, _hashOrderFast is lazy too if gen is True.
February 19th, 2008 at 22:46 -0600
yes, it would appear you are right, on both points.