Learn To Loop The Python Way: Iterators And Generators Explained

If you’ve ever written any Python at all, the chances are you’ve used iterators without even realising it. Writing your own and using them in your programs can provide significant performance improvements, particularly when handling large datasets or running in an environment with limited resources. They can also make your code more elegant and give you “Pythonic” bragging rights.

Here we’ll walk through the details and show you how to roll your own, illustrating along the way just why they’re useful.

You’re probably familiar with looping over objects in Python using English-style syntax like this:

people = [['Sam', 19], ['Laura', 34], ['Jona', 23]]
for name, age in people:
    ...

info_file = open('info.txt')
for line in info_file:
    ...

hundred_squares = [x**2 for x in range(100)]

", ".join(["Punctuated", "by", "commas"])

These kind of statements are possible due to the magic of iterators. To explain the benefits of being able to write your own iterators, we first need to dive into some details and de-mystify what’s actually going on.

Iterators and Iterables

Iterators and iterables are two different concepts. The definitions seem finickity, but they’re well worth understanding as they will make everything else much easier, particularly when we get to the fun of generators. Stay with us!

Iterators

An iterator is an object which represents a stream of data. More precisely, an object that has a __next__ method. When you use a for-loop, list comprehension or anything else that iterates over an object, in the background the __next__ method is being called on an iterator.

Ok, so let’s make an example. All we have to do is create a class which implements __next__. Our iterator will just spit out multiples of a specified number.

class Multiple:
    def __init__(self, number):
        self.number = number
        self.counter = 0

    def __next__(self):
        self.counter += 1
        return self.number * self.counter

if __name__ == '__main__':
    m = Multiple(463)
    print(next(m))
    print(next(m))
    print(next(m))
    print(next(m))

When this code is run, it produces the following output:

$ python iterator_test.py
463
926
1389
1852

Let’s take a look at what’s going on. We made our own class and defined a __next__ method, which returns a new iteration every time it’s called. An iterator always has to keep a record of where it is in the sequence, which we do using self.counter. Instead of calling the object’s __next__ method, we called next on the object. This is the recommended way of doing things since it’s nicer to read as well as being more flexible.

Cool. But if we try to use this in a for-loop instead of calling next manually, we’ll discover something’s amiss.

if __name__ == '__main__':
    for number in Multiple(463):
        print(number)
$ python iterator_test.py
Traceback (most recent call last):
  File "iterator_test.py", line 11, in <module>
    for number in Multiple(463):
TypeError: 'Multiple' object is not iterable

What? Not iterable? But it’s an iterator!

This is where the difference between iterators and iterables becomes apparent. The for loop we wrote above expected an iterable.

Iterables

An iterable is something which is able to iterate. In practice, an iterable is an object which has an __iter__ method, which returns an iterator. This seems like a bit of a strange idea, but it does make for a lot of flexibility; let us explain why.

When __iter__ is called on an object, it must return an iterator. That iterator can be an external object which can be re-used between different iterables, or the iterator could be self. That’s right: an iterable can simply return itself as the iterator! This makes for an easy way to write a compact jack-of-all-trades class which does everything we need it to.

To clarify: strings, lists, files, and dictionaries are all examples of iterables. They are datatypes in their own right, but will all automatically play nicely if you try and loop over them in any way because they return an iterator on themselves.

With this in mind, let’s patch up our Multiple example, by simply adding an __iter__ method that returns self.

class Multiple:
    def __init__(self, number):
        self.number = number
        self.counter = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.counter += 1
        return self.number * self.counter

if __name__ == '__main__':
    for number in Multiple(463):
        print(number)

It now runs as we would expect it to. It also goes on forever! We created an infinite iterator, since we didn’t specify any kind of maximum condition. This kind of behaviour is sometimes useful, but often our iterator will need to provide a finite amount of items before becoming exhausted. Here’s how we would implement a maximum limit:

class Multiple:
    def __init__(self, number, maximum):
        self.number = number
        self.maximum = maximum
        self.counter = 0

    def __iter__(self):
        return self

    def __next__(self):
        self.counter += 1
        value = self.number * self.counter

        if value > self.maximum:
            raise StopIteration
        else:
            return value

if __name__ == '__main__':
    for number in Multiple(463, 3000):
        print(number)

To signal that our iterator has been exhausted, the defined protocol is to raise StopIteration. Any construct which deals with iterators will be prepared for this, like the for loop in our example. When this is run, it correctly stops at the appropriate point.

$ python iterator_test.py
463
926
1389
1852
2315
2778

It’s good to be lazy

So why is it worthwhile to be able to write our own iterators?

Many programs have a need to iterate over a large list of generated data. The conventional way to do this would be to calculate the values for the list and populate it, then loop over the whole thing. However, if you’re dealing with big datasets, this can tie up a pretty sizeable chunk of memory.

As we’ve already seen, iterators can work on the principle of lazy evaluation: as you loop over an iterator, values are generated as required. In many situations, the simple choice to use an iterator or generator can markedly improve performance, and ensure that your program doesn’t bottleneck when used in the wild with bigger datasets or smaller memory than it was tested on.

Now that we’ve had a quick poke around under the hood and understand what’s going on, we can move onto a much cleaner and more abstracted way to work: generators.

Generators

You may have noticed that there’s a fair amount of boilerplate code in the example above. Generators make it far easier to build your own iterators. There’s no fussing around with __iter__ and __next__, and we don’t have to keep track of an internal state or worry about raising exceptions.

Let’s re-write our multiple-machine as a generator.

def multiple_gen(number, maximum):
    counter = 1
    value = number * counter

    while value <= maximum:
        yield value
        counter += 1
        value = number * counter

if __name__ == '__main__':
    for number in multiple_gen(463, 3000):
        print(number)

Wow, that’s a lot shorter than our iterator example. The main thing to note is a new keyword: yield. yield is similar to return, but instead of terminating the function, it simply pauses execution until another value is required. Pretty neat.

In most cases where you generate values, append them to a list and then return the whole list, you can simply yield each value instead! It’s more readable, there’s less code, and it performs better in most cases.

With all this talk about performance, it’s time we put iterators to the test!

Here’s a really simple program comparing our multiple-machine from above with a ‘traditional’ list approach. We generate multiples of 463 up to 100,000,000,000 and time how long each strategy takes.

import time

def multiple(number, maximum):
    counter = 1
    multiple_list = []
    value = number * counter

    while value <= maximum:
        multiple_list.append(value)
        value = number * counter
        counter += 1

    return multiple_list


def multiple_gen(number, maximum):
    counter = 1
    value = number * counter

    while value <= maximum:
        yield value
        counter += 1
        value = number * counter


if __name__ == '__main__':
    MULTIPLE = 463
    MAX = 100_000_000_000

    start_time = time.time()
    for number in multiple_gen(MULTIPLE, MAX):
        pass
    print(f"Generator took {time.time() - start_time :.2f}s")

    start_time = time.time()
    for number in multiple(MULTIPLE, MAX):
        pass
    print(f"Normal list took {time.time() - start_time :.2f}s")

We ran this on a few different Linux and Windows boxes with various specs. On average, the generator approach was about three times faster, using barely any memory, whilst the normal list method quickly gobbled all the RAM and a decent chunk of swap as well. A few times we got a MemoryError when the normal list approach was running on Windows.

Generator comprehensions

You might be familiar with list comprehensions: concise syntax for creating a list from an iterable. Here’s an example where we compute the cube of each number in a list.

nums = [2512, 37, 946, 522, 7984]

cubes = [number**3 for number in nums]

It just so happens that we have a similar construct to create generators (officially called “generator expressions”, but they’re nearly identical to list comprehensions). It’s as easy as swapping [] for (). A quick session at a Python prompt confirms this.

>>> nums = [2512, 37, 946, 522, 7984]
>>> cubes = [number**3 for number in nums]
>>> type(cubes)
<class 'list'>
>>> cubes_gen = (number**3 for number in nums)
>>> type(cubes_gen)
<class 'generator'>
>>>

Again, not likely to make much difference in the example above, but it’s a two-second change which does come in handy.

Summary

When you’re dealing with lots of data, it’s essential to be smart about how you use resources, and if you can process the data one item at a time, iterators and generators are just the ticket. A lot of the techniques we’ve talked about above are just common sense, but the fact that they are built into Python in a defined way is great. Once you dip your toe into iterators and generators, you’ll find yourself using them surprisingly often.

17 thoughts on “Learn To Loop The Python Way: Iterators And Generators Explained

  1. Looks like there is something wrong with these lines:

    33: print(f”Generator took {time.time() – start_time :.2f}s”)
    38: print(f”Normal list took {time.time() – start_time :.2f}s”)

    Can’t find such syntax for print.

  2. This is only scratching the surface of generators. They also have two other properties, the send and return values. Generators can also receive values during their operation, this is the basis of asyncio.

  3. Technically speaking comprehensions yield inside parenthesis, and the same thing would be return inside square brackets. Also it is worthwhile to make sure that iterator is all good iterable with functions __iter__, __next__, and so on and so forth. This is easy to understand, understandable, grasp the meaning to put into practical coding. Well, I would say ‘Disabled’ configuration settings can be more nerve-wracking situation!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.