Learn To Loop The Python Way: Iterators And Generators Explained


If you’ve ever written any Python at all, the chances are you’ve used iterators without even realising it. Writing your own and using them in your programs can provide significant performance improvements, particularly when handling large datasets or running in an environment with limited resources. They can also make your code more elegant and give you “Pythonic” bragging rights.

Here we’ll walk through the details and show you how to roll your own, illustrating along the way just why they’re useful.

You’re probably familiar with looping over objects in Python using English-style syntax like this:


people = [['Sam', 19], ['Laura', 34], ['Jona', 23]]
for name, age in people:
    ...

info_file = open('info.txt')
for line in info_file:
    ...

hundred_squares = [x**2 for x in range(100)]

", ".join(["Punctuated", "by", "commas"])

These kinds of statements are possible due to the magic of iterators. To explain the benefits of being able to write your own iterators, we first need to dive into some details and de-mystify what’s actually going on.

Iterators and iterables are two different concepts. The definitions seem finickity, but they’re well worth understanding as they will make everything else much easier, particularly when we get to the fun of generators. Stay with us!

Iterators

An iterator is an object which represents a stream of data. More precisely, it is an object that has a __next__ method. When you use a for-loop, list comprehension or anything else that iterates over an object, in the background the __next__ method is being called on an iterator.

Ok, so let’s make an example. All we have to do is create a class which implements __next__. Our iterator will just spit out multiples of a specified number.


class Multiple:
    def __init__(self, number):
        self.number = number
        self.counter = 0

    def __next__(self):
        self.counter += 1
        return self.number * self.counter

if __name__ == '__main__':
    m = Multiple(463)
    print(next(m))
    print(next(m))
    print(next(m))
    print(next(m))

When this code is run, it produces the following output:


$ python iterator_test.py
463
926
1389
1852

Let’s take a look at what’s going on. We made our own class and defined a __next__ method, which returns the next value in the sequence every time it’s called. An iterator always has to keep a record of where it is in the sequence, which we do using self.counter. Instead of calling the object’s __next__ method directly, we passed the object to the built-in next function. This is the recommended way of doing things since it’s nicer to read as well as being more flexible.

Cool. But if we try to use this in a for-loop instead of calling next manually, we’ll discover something’s amiss.


if __name__ == '__main__':
    for number in Multiple(463):
        print(number)

$ python iterator_test.py
Traceback (most recent call last):
  File "iterator_test.py", line 11, in <module>
    for number in Multiple(463):
TypeError: 'Multiple' object is not iterable

What? Not iterable? But it’s an iterator!

This is where the difference between iterators and iterables becomes apparent. The for loop we wrote above expected an iterable.

Iterables

An iterable is something which can be iterated over. In practice, an iterable is an object which has an __iter__ method, which returns an iterator. This seems like a bit of a strange idea, but it does make for a lot of flexibility; let us explain why.

When __iter__ is called on an object, it must return an iterator. That iterator can be an external object which can be re-used between different iterables, or the iterator could be self. That’s right: an iterable can simply return itself as the iterator! This makes for an easy way to write a compact jack-of-all-trades class which does everything we need it to.
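To make the first of those options concrete, here is a minimal sketch of an iterable that hands the work to a separate iterator object instead of returning itself. The Team class is hypothetical, invented purely for this illustration; it borrows the iterator that a list already provides.

```python
class Team:
    """A hypothetical iterable that delegates iteration to a list iterator."""

    def __init__(self, members):
        self.members = list(members)

    def __iter__(self):
        # Return a separate iterator object rather than self.
        return iter(self.members)


team = Team(['Sam', 'Laura', 'Jona'])

# Each loop calls __iter__ again and so gets a fresh iterator,
# which means the same object can be looped over more than once.
first_pass = [name for name in team]
second_pass = [name for name in team]

print(first_pass)
print(second_pass)
```

Note that Team has no __next__ method of its own: it is an iterable, but not an iterator.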

To clarify: strings, lists, files, and dictionaries are all examples of iterables. They are datatypes in their own right, but they all play nicely when you loop over them in any way, because each one can supply an iterator over its own contents.
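You can watch this happen by hand with the built-in iter and next functions. The short sketch below uses a plain list; the variable names are ours and only there for illustration.

```python
# Ask a list (an iterable) for an iterator, then pull values from it by hand.
letters = ['a', 'b', 'c']
letters_iter = iter(letters)      # calls letters.__iter__() behind the scenes

first = next(letters_iter)        # calls letters_iter.__next__()
second = next(letters_iter)

# A list hands out a separate iterator object rather than itself...
list_is_own_iterator = iter(letters) is letters

# ...whereas an iterator returns itself from __iter__.
iterator_is_own_iterator = iter(letters_iter) is letters_iter

print(first, second)
print(list_is_own_iterator, iterator_is_own_iterator)
```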

With this in mind, let’s patch up our Multiple example, by simply adding an __iter__ method that returns self.


class Multiple:
    def __init__(self, number):
        self.number = number
        self.counter = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.counter += 1
        return self.number * self.counter

if __name__ == '__main__':
    for number in Multiple(463):
        print(number)

It now runs as we would expect it to. It also goes on forever! We created an infinite iterator, since we didn’t specify any kind of maximum condition. This kind of behaviour is sometimes useful, but often our iterator will need to provide a finite number of items before becoming exhausted. Here’s how we would implement a maximum limit:


class Multiple:
    def __init__(self, number, maximum):
        self.number = number
        self.maximum = maximum
        self.counter = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.counter += 1
        value = self.number * self.counter

        if value > self.maximum:
            raise StopIteration
        else:
            return value

if __name__ == '__main__':
    for number in Multiple(463, 3000):
        print(number)

To signal that our iterator has been exhausted, the defined protocol is to raise StopIteration. Any construct which deals with iterators will be prepared for this, like the for loop in our example. When this is run, it correctly stops at the appropriate point.


$ python iterator_test.py
463
926
1389
1852
2315
2778
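The way a for loop deals with StopIteration can be sketched roughly as follows. This is a simplified picture of the behaviour, not the interpreter’s real implementation, and it uses a plain list so that it runs on its own.

```python
values = [463, 926, 1389]
seen = []

iterator = iter(values)          # step 1: get an iterator from the iterable
while True:
    try:
        number = next(iterator)  # step 2: ask for the next value
    except StopIteration:
        break                    # step 3: the iterator is exhausted, so stop
    seen.append(number)          # the body of the for loop would go here

print(seen)
```

The exception never reaches your own code; the loop catches it and simply ends.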

So why is it worthwhile to be able to write our own iterators?

Many programs have a need to iterate over a large list of generated data. The conventional way to do this would be to calculate the values for the list and populate it, then loop over the whole thing. However, if you’re dealing with big datasets, this can tie up a pretty sizeable chunk of memory.

As we’ve already seen, iterators can work on the principle of lazy evaluation: as you loop over an iterator, values are generated as required. In many situations, the simple choice to use an iterator or generator can markedly improve performance, and ensure that your program doesn’t bottleneck when used in the wild with bigger datasets or smaller memory than it was tested on.

Now that we’ve had a quick poke around under the hood and understand what’s going on, we can move onto a much cleaner and more abstracted way to work: generators.

You may have noticed that there’s a fair amount of boilerplate code in the example above. Generators make it far easier to build your own iterators. There’s no fussing around with __iter__ and __next__, and we don’t have to keep track of an internal state or worry about raising exceptions.

Let’s re-write our multiple-machine as a generator.


def multiple_gen(number, maximum):
    counter = 1
    value = number * counter

    while value <= maximum:
        yield value
        counter += 1
        value = number * counter

if __name__ == '__main__':
    for number in multiple_gen(463, 3000):
        print(number)

Wow, that’s a lot shorter than our iterator example. The main thing to note is a new keyword: yield. yield is similar to return, but instead of terminating the function, it simply pauses execution until another value is required. Pretty neat.
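To see the pausing in action, here is a small sketch that records what happens and when. The two_steps function and the events list are hypothetical, made up for this illustration.

```python
events = []

def two_steps():
    events.append("started")
    yield 1
    events.append("resumed after first yield")
    yield 2
    events.append("finished")

gen = two_steps()            # nothing runs yet: this only creates the generator
events.append("created")

first = next(gen)            # runs the body up to the first yield
events.append("got first")

second = next(gen)           # resumes where it paused, up to the second yield
events.append("got second")

leftover = next(gen, None)   # runs to the end; None is returned instead of raising StopIteration

for event in events:
    print(event)
```

The recorded order shows the function body and the calling code taking turns, rather than the function running from start to finish in one go.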

In most cases where you generate values, append them to a list and then return the whole list, you can simply yield each value instead! It’s more readable, there’s less code, and it performs better in most cases.

With all this talk about performance, it’s time we put iterators to the test!

Here’s a really simple program comparing our multiple-machine from above with a ‘traditional’ list approach. We generate multiples of 463 up to 100,000,000,000 and time how long each strategy takes.


import time

def multiple(number, maximum):
    counter = 1
    multiple_list = []
    value = number * counter

    while value <= maximum:
        multiple_list.append(value)
        # Advance the counter before recalculating,
        # so that the first value is not appended twice
        counter += 1
        value = number * counter

    return multiple_list


def multiple_gen(number, maximum):
    counter = 1
    value = number * counter

    while value <= maximum:
        yield value
        counter += 1
        value = number * counter


if __name__ == '__main__':
    MULTIPLE = 463
    MAX = 100_000_000_000

    # Time the generator: values are produced one at a time
    start_time = time.perf_counter()
    for number in multiple_gen(MULTIPLE, MAX):
        pass
    print(f"Generator took {time.perf_counter() - start_time:.2f}s")

    # Time the list: every value is built and stored before the loop starts
    start_time = time.perf_counter()
    for number in multiple(MULTIPLE, MAX):
        pass
    print(f"Normal list took {time.perf_counter() - start_time:.2f}s")

We ran this on a few different Linux and Windows boxes with various specs. On average, the generator approach was about three times faster, using barely any memory, whilst the normal list method quickly gobbled all the RAM and a decent chunk of swap as well. A few times we got a MemoryError when the normal list approach was running on Windows.
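If you’d like a rough feel for the memory difference without filling your RAM, the sketch below compares the two approaches on a much smaller scale. The squares_list and squares_gen functions are hypothetical stand-ins, and sys.getsizeof only measures the container object itself, not the integers it refers to, so treat the figures as a loose indication, not a proper measurement.

```python
import sys

def squares_list(count):
    # Build every value up front and keep them all in memory.
    result = []
    for i in range(count):
        result.append(i * i)
    return result

def squares_gen(count):
    # Produce one value at a time, on demand.
    for i in range(count):
        yield i * i

as_list = squares_list(100_000)
as_gen = squares_gen(100_000)

list_bytes = sys.getsizeof(as_list)   # grows with the number of items
gen_bytes = sys.getsizeof(as_gen)     # stays small however many items it will produce

print(f"list object:      {list_bytes} bytes")
print(f"generator object: {gen_bytes} bytes")
```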

You might be familiar with list comprehensions: concise syntax for creating a list from an iterable. Here’s an example where we compute the cube of each number in a list.


nums = [2512, 37, 946, 522, 7984]

cubes = [number**3 for number in nums]

It just so happens that we have a similar construct to create generators (officially called “generator expressions”, but they’re nearly identical to list comprehensions). It’s as easy as swapping [] for (). A quick session at a Python prompt confirms this.


>>> nums = [2512, 37, 946, 522, 7984]
>>> cubes = [number**3 for number in nums]
>>> type(cubes)
<class 'list'>
>>> cubes_gen = (number**3 for number in nums)
>>> type(cubes_gen)
<class 'generator'>
>>>

It’s not likely to make much difference in a small example like the one above, but it’s a two-second change which does come in handy with larger inputs.
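A generator expression is most useful when it is fed straight into something that consumes values one by one. The sketch below reuses the nums list from the session above; the other variable names are ours.

```python
nums = [2512, 37, 946, 522, 7984]

# Pass a generator expression straight to sum: no intermediate list is built.
# The extra parentheses can be dropped when it is the only argument.
total = sum(number**3 for number in nums)

cubes_gen = (number**3 for number in nums)
first_cube = next(cubes_gen)    # values are produced one at a time
remaining = list(cubes_gen)     # consumes the other four values
exhausted = list(cubes_gen)     # nothing left: a generator can only be consumed once

print(total)
print(first_cube, len(remaining), exhausted)
```

The last line is worth remembering: unlike a list, a generator is used up once you’ve looped over it.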

When you’re dealing with lots of data, it’s essential to be smart about how you use resources, and if you can process the data one item at a time, iterators and generators are just the ticket. A lot of the techniques we’ve talked about above are just common sense, but the fact that they are built into Python in a defined way is great. Once you dip your toe into iterators and generators, you’ll find yourself using them surprisingly often.


