Common Concepts - Iterators

By Clemens Tiedt

2021-09-29
#rust
#python
#iterator
#common concepts


Iterators are a concept that almost any modern high-level programming language implements and you have probably used them already, even if you didn't notice. In this article, I want to pull back the proverbial curtain and look at why iterators are so useful and how they work.

What is an iterator?

Iterators are a generalization of the most basic properties of sequence data types. A sequence has a current state and a way to get from this state to the next one. This can be used, for example, to access the items of an array in order (the state is the current index and you change it by increasing the index). This iterator also has a natural stopping point when the index is greater than the length of the array. However, you can also create infinite iterators. For example, you could take away the array from our previous example and simply return the next higher natural number every time your iterator is polled.

To an outsider, an iterator only exposes some sort of next function or method to poll it, which can either return an item or signal that the iterator is exhausted¹.

Code example in Python:

i = iter([0, 1, 2])
next(i) # returns 0
next(i) # returns 1
next(i) # returns 2
next(i) # raises exception StopIteration

Code example in Rust:

let mut i = [0, 1, 2].into_iter();
println!("{:?}", i.next()); // prints Some(0)
println!("{:?}", i.next()); // prints Some(1)
println!("{:?}", i.next()); // prints Some(2)
println!("{:?}", i.next()); // prints None

History

Iterators were first introduced in Barbara Liskov's programming language CLU. In her 1992 article A History of CLU she describes how the people working on CLU realized that different collection data structures (e.g. arrays, linked lists, and sets) all used different methods to access elements. However, it would be useful to be able to take a collection of any type and go through its elements in a for loop like this²:

for element in collection:
    do_stuff_with(element)

This example written in CLU actually looks remarkably similar to modern Python:

odds = iter(n: int) yields int
    i: int

    i := 1
    while i < n do
        yield i
        i := i + 2
    end
end odds

As you can see, we define an iterator odds that "yields" an integer (in contrast to a procedure which would "return" a value). The logic in the iterator is relatively straightforward. We have an integer i that is initialized with 1 and increased by 2 as long as it is smaller than the limit given by the parameter n.

What's interesting is the yield i inside the loop. What looks like a normal function is really a stateful object that saves its internal state in i and advances that state every time it is polled (e.g. by a for loop).

Iterators in Python

Since I wrote that the CLU code looked similar to Python, I should probably talk about iterators in Python. Actually, Python distinguishes between iterators and generators. Let's rewrite our example from before using a generator:

def odds(n):
    i = 1
    while i < n:
        yield i
        i += 2

As you can see, Python doesn't require us to explicitly declare this function as a generator, and we save a few lines that were only used to end blocks in CLU, but you should still see the similarity.

Let's paste this into our Python REPL and see what happens when we call odds:

>>> odds(13)
<generator object odds at 0x7f79a26eccf0>

As we see, calling a generator function creates a new object of type generator. But how do we do anything with it? Looking at some Python documentation tells us we can call next on it:

>>> next(odds(13))
1

That's a result! Let's choose a smaller number and call next until we exhaust the generator:

>>> o = odds(6)
>>> next(o)
1
>>> next(o)
3
>>> next(o)
5
>>> next(o)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

So our generator object is stateful and returns a new number every time we call next on it. And once the condition in our while loop becomes false, it stops. More precisely, it raises the StopIteration exception - in Python, an exception is not necessarily an error; here it is used to signal that the iterator is exhausted.

This is our first glimpse into the inner workings of Python iterators: Raise StopIteration if the iterator is exhausted. Actually, the documentation from before tells us even more: Our custom iterator needs to have __iter__ and __next__ methods where __iter__ returns the iterator itself and __next__ returns the next item or raises StopIteration when the iterator is exhausted.

So let's rewrite odds in this explicit version:

class odds:
    def __init__(self, n):
        self.i = 1
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.i < self.n:
            i = self.i
            self.i += 2
            return i
        else:
            raise StopIteration

We have a normal Python class and don't use a loop anymore: instead of relying on Python magic to track the state of our iterator, we track it ourselves using normal attributes. Let's create an instance of this class:

>>> odds(13)
<__main__.odds object at 0x7f79a267ed30>

This is not a generator like our previous version, but it implements what Python calls the iterator protocol. So, it works just the same as our generator:

>>> o = odds(6)
>>> next(o)
1
>>> next(o)
3
>>> next(o)
5
>>> next(o)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 13, in __next__
StopIteration
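
And because __iter__ returns the object itself, our class works directly in a for loop, just like the generator - which is exactly the use case iterators were originally designed for:

>>> for n in odds(6):
...     print(n)
...
1
3
5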

As you may have noticed, I have been talking about iterators and generators and sometimes switching between these terms. What's the difference?

The Python glossary tells us that a generator is "a function which returns a generator iterator". A generator iterator, in turn, is the object created by calling a generator function; it implements the iterator protocol for the function's body. It runs the function and suspends execution whenever it encounters a yield statement, saving the function's state and resuming once next is called on it again. If the function exits, the generator iterator raises StopIteration.

What happens when you use our first Python implementation of odds is this:

  1. By calling odds(6) we create a new generator iterator and assign it to o
  2. Calling next(o) starts the function described by odds
  3. The function comes across a yield statement and returns the yielded expression
  4. When we call next(o) again, the function resumes, increasing the variable i and reentering the while loop
  5. When we call next(o) and the while loop (and therefore the function) exits, o raises a StopIteration exception
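
Incidentally, you can watch this lifecycle from the outside with the standard library's inspect module. A small sketch, assuming the generator version of odds from above:

import inspect

o = odds(6)
inspect.getgeneratorstate(o)  # 'GEN_CREATED' - the body hasn't started yet
next(o)                       # returns 1
inspect.getgeneratorstate(o)  # 'GEN_SUSPENDED' - paused at the yield
for _ in o:                   # drain the remaining items
    pass
inspect.getgeneratorstate(o)  # 'GEN_CLOSED' - the function has exited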

But why should you even use generators? The answer is that they can be a lot more efficient than a collection-based approach. The keyword here is laziness. Unlike a list, which needs to be kept in memory in its entirety, the generator only has to know its current state, saving memory and often even being faster.

Here are some runtimes (in nanoseconds) for our odds iterators compared to the following function that simply creates a list:

def odds_list(n):
    o = []
    i = 1
    while i < n:
        o.append(i)
        i += 2
    return o

n          Generator   Self-Implemented Iterator       List
10              9744                       12705       8925
100             8206                       14766       8542
1000           38032                      106564      49902
10000         389495                      932232     488954
100000       3084526                    10046116    4852884
1000000     32289332                    93365725   44623061

As you can see, the generator version has a slight advantage in speed for bigger inputs whereas our custom implementation falls behind. I would guess that this implementation is harder for Python to optimize than the generator version.

However, if we look at the result of getsizeof, our custom implementation is the smallest at 48 bytes. The generator consistently uses 112 bytes, whereas the list grows up to 4167352 bytes (or about 4.2MB) for an input of n=1000000.

This shows that if you care about speed, generators using the yield expression are the way to go. If you need to optimize for memory usage, you can implement the iterator yourself, though in many cases the gain is probably not significant enough to warrant the additional effort compared to a generator.
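
If you want to verify these trade-offs yourself, here is a rough sketch of how such measurements could be taken with the standard library's timeit and sys.getsizeof (odds being the generator version and odds_list the list-building function from above; the exact numbers will of course differ between machines and Python versions):

import sys
import timeit

n = 1000

# Time how long it takes to drain each variant completely.
print(timeit.timeit(lambda: list(odds(n)), number=1000))
print(timeit.timeit(lambda: odds_list(n), number=1000))

# Compare the memory footprint of the objects themselves.
print(sys.getsizeof(odds(n)))       # generator object: small and constant
print(sys.getsizeof(odds_list(n)))  # list: grows with n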

Iterators in Rust

Of course, Python is not the only language to feature iterators - as I showed in the introduction, Rust also has them. In Rust, iterators exist as a trait (similar to the "iterator protocol" used by Python, but formalized in the language). Let's look at the definition:

trait Iterator {
    type Item;

    fn next(&mut self) -> Option<Self::Item>;
}

Since Rust is statically typed, you also need to tell it the type of the items the iterator yields, but next looks very similar to Python's __next__. Unlike Python, where exceptions are part of the normal control flow, Rust only has panics, which by default take the whole program down with them. Instead, if an action can fail, you use Result<T, E> (which can carry error information) or Option<T> if a value might not be defined. This is actually a very good fit for iterators: if there is a next item, the iterator returns it, otherwise it returns None - no need for exceptions here!

Our odds iterator could look something like this:

struct Odds {
    state: u32,
    limit: u32,
}

impl Odds {
    fn new(limit: u32) -> Self {
        Self {
            state: 1,
            limit,
        }
    }
}

impl Iterator for Odds {
    type Item = u32;

    fn next(&mut self) -> Option<Self::Item> {
        if self.state < self.limit {
            let tmp = self.state;
            self.state += 2;
            Some(tmp)
        } else {
            None
        }
    }
}

And then you can use it like this:

let mut o = Odds::new(6);
println!("{:?}", o.next()); // Some(1)
println!("{:?}", o.next()); // Some(3)
println!("{:?}", o.next()); // Some(5)
println!("{:?}", o.next()); // None

As you can see, the logic is more or less identical to the odds class we built in Python. If you are already familiar with Rust, you may have wondered about the definition of the Iterator trait: it seems like a prime candidate for generics (i.e. Iterator<T>), but instead uses an associated type. Let's see what happens if we define our own GenericIter<T> trait:

trait GenericIter<T> {
    fn next(&mut self) -> Option<T>;
}

impl GenericIter<u32> for Odds {
    fn next(&mut self) -> Option<u32> {
        if self.state < self.limit {
            let tmp = self.state;
            self.state += 2;
            Some(tmp)
        } else {
            None
        }
    }
}

The implementation is identical, except for the type signatures. So what makes the version with an associated type superior? Well, with our generic version we could add code like this:

impl GenericIter<u64> for Odds {
    fn next(&mut self) -> Option<u64> {
        if self.state < self.limit {
            let tmp = self.state;
            self.state += 2;
            Some(tmp as u64)
        } else {
            None
        }
    }
}

Now we have two implementations of GenericIter for Odds. Since their generics differ, they can both coexist peacefully. Except, if we try using an instance of our iterator, this happens:

error[E0282]: type annotations needed
  --> src/main.rs:66:31
   |
66 |         println!("{:?}", o.next());
   |                          --^^^^--
   |                          | |
   |                          | cannot infer type for type parameter `T` declared on the trait `GenericIter`
   |                          this method call resolves to `Option<T>`

For more information about this error, try `rustc --explain E0282`.
error: could not compile `rust-iter` due to previous error

The compiler cannot decide whether to call GenericIter::<u32>::next or GenericIter::<u64>::next. We could, of course, forgo the method syntax and call next as e.g. GenericIter::<u64>::next(&mut o), but that is less than elegant. A reasonable way to solve this would be to allow only one implementation of Iterator per type. And this brings us back to the question of why the Rust standard library uses an associated type for the item type of the Iterator trait: since the trait is not generic, you can only have one implementation per type and therefore only one next implementation to call. But what if we want an iterator with, say, an internal state of type u64 that outputs u32s?
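
With an associated type, that last wish is easy to grant: the Item type is chosen freely per implementation, independent of the struct's fields. A minimal sketch (Truncating is a made-up name for illustration):

struct Truncating {
    // Internal state is 64 bits wide...
    state: u64,
}

impl Iterator for Truncating {
    // ...but the items handed out are 32 bits wide.
    type Item = u32;

    fn next(&mut self) -> Option<Self::Item> {
        self.state += 1;
        // Hand out only the lower 32 bits of the counter.
        Some(self.state as u32)
    }
}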

One of the cool things you can do with iterators is combine and modify them using adapters. For example, we could create a new iterator by applying a function to each element - we could map an iterator of item type T to one of type U. Let's try to build such a thing:

struct Map<I, T>
where
    I: Iterator,
{
    i: I,
    f: Box<dyn Fn(I::Item) -> T>,
}

As you can see, our map iterator contains another iterator and has a function that maps an argument of I's item type to the generic type T. Let's continue by implementing Iterator for this new type:

impl<I: Iterator, T> Iterator for Map<I, T> {
    type Item = T;

    fn next(&mut self) -> Option<Self::Item> {
        self.i.next().map(&mut self.f)
    }
}

To get the next item of a Map, we get the next element of the wrapped iterator and apply the mapping function to it. Since self.i.next() returns an Option, we use the Option::map method, which transforms an Option<T> into an Option<U>: if the Option is Some, the mapping function is applied to its contents, otherwise it stays None.
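
In isolation, Option::map behaves like this:

let some_number: Option<u32> = Some(2);
assert_eq!(some_number.map(|n| n * 2), Some(4));

let no_number: Option<u32> = None;
assert_eq!(no_number.map(|n| n * 2), None);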

We can now build a function that creates a new Map given an iterator and a function³:

fn map<I, B>(i: I, f: Box<dyn Fn(I::Item) -> B>) -> Map<I, B>
where
    I: Iterator + Sized,
{
    Map { i, f }
}

If you look at the Rust standard library implementation of Iterator::map, you will see that it doesn't require you to box the mapping function by making use of the way the compiler deals with generics. If you are interested in the nitty-gritty details, have a look at the source code. But we do have a working map implementation:

let mut o = map(Odds::new(6), Box::new(|n| n * 2));

println!("{:?}", o.next()); // Some(2)
println!("{:?}", o.next()); // Some(4)
println!("{:?}", o.next()); // Some(10)
println!("{:?}", o.next()); // None

From here, you can do tons of things with iterators, for example:

  - transform every element, as our map does
  - drop unwanted elements with filter
  - walk two iterators in lockstep with zip
  - glue two iterators together with chain
  - cut an iterator short with take

Beyond just those, Rust brings many more built-in adapters for iterators.

All of them follow the same basic pattern as our Map, though. When you call the relevant function on an Iterator, a new iterator is created that usually takes ownership of the iterator(s) it adapts and somehow transforms their outputs in its own next implementation.
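
In everyday Rust you rarely implement these adapters yourself, though; you simply chain the built-in ones, which our Odds type gets for free by implementing Iterator:

let result: Vec<u32> = Odds::new(20)
    .map(|n| n * n)         // square every odd number below 20
    .filter(|n| n % 3 != 0) // drop multiples of three
    .take(3)                // stop after three items
    .collect();             // gather the results into a Vec

println!("{:?}", result); // prints [1, 25, 49]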

Wrapping up

As I wrote at the beginning of this article, iterators are a very common concept. I hope that you now know how they provide a generalized interface for interacting with sequences and how they can make your code more efficient by working lazily. Writing your own iterator is often easier than you might think, no matter which language you use, and can help you write cleaner and even more performant code. Happy coding!

  1. Technically, an iterator could yield more items after once signaling exhaustion, but the more common semantics are that once an iterator doesn't yield an item, it is exhausted. The Python documentation even states that "Implementations that do not obey this property are deemed broken".

  2. Of course, Liskov described this in CLU syntax, but I find Python syntax easier to read.

  3. In the standard library, map is provided as a default implementation of the Iterator trait. Since we don't want to modify or build the trait from scratch here, we're just writing it as a function.

