Notes on "Learning Rust with entirely too many linked lists"

Published on: 2019-09-29

Home | Archive

The next step to Rust mastery, after studying the official Rust book, is to work out a wonderful document called learning Rust with entirely too many linked lists. This post is a small collection of notes I took down while reading the “entirely too many linked lists” document.

Note: Creating linked data structures in Rust is strictly not an exercise suited for beginners.

The Box

Let’s create a simple “Node” structure (in C) to represent a node of a linked list:

struct Node {
    int dat;
    struct Node *next;
}

You create a new node by:

struct Node *h;
h = malloc(sizeof(struct Node));

Now, how do you do this in Rust?

struct Node {
    dat: i32,
    next: Box<Node>,
}

What is this “Box” and how do we use it?

struct Foo {
    x: i32,
    y: i32,
}

let h = Box::new(Foo{x: 1, y: 2});
println!("{}", (*h).x);

This is almost equivalent to the C program:

struct Foo {
    int x, y;
};

struct Foo *p;
p = malloc(sizeof(struct Foo));
p->x = 1; p->y = 2;

Box::new allocates space on the heap and “places” the structure in the newly allocated space, returning a pointer. Almost like malloc!

“almost” because there are some important differences.

Here is a common bug in C: say the variable “p” goes out of scope (or is overwritten by soemthing else) before you call “free(p)” and de-allocate the block. If this address is not stored anywhere else, you will never be able to use the block or de-allocate it - we have a cute little “memory leak”.

Here is a nastier one: suppose you copy the address in “p” to the variable “q” - now we have two pointers pointing to the same block. Imagine you de-allocate by calling free(p) at one point in the code and at some other point, accidentally use “q” to access the de-allocated block. We now have a “use-after-free” bug.

The Rust “Box” does not suffer from any of these problems.

The pointer variable “h” is supposed to “own” the allocated storage. When “h” goes out of scope, the allocated storage is automatically freed (“dropped”, in Rust terminology). No memory leaks.

Try to compile this:

let h1 = Box::new(Foo{x:1, y:2});
let h2 = h1;
println!("{}", (*h1).x);

You will get a compiler error … the “Box” type obeys Rust “move” semantics - when we copy the address in h1 to h2, ownership of the allocated block also gets transferred to h2 - you can no longer access the block through h1. It is as if h1 has vanished into thin air!

This is important: you can never have a situation where two variables of type “Box” refer to the same block of memory. This means you will never get errors like use-after-free, double-free etc.

Here is a nice way to visualize a Box: as a C pointer to heap allocated storage - storage which automatically gets de-allocated when the pointer goes out of scope, storage which can never have two pointers pointing to it at the same time.

Unconstrained use of pointers is the root of a lot of evil in the C/C++ world; by restricting the use of pointers in certain specific ways, Rust manages to gain memory safety without sacrificing performance.

Boxes also have another cute feature: you need not write (*h).x to access the field “x” of the struct pointed to by h. You can simply write “h.x” and the compiler will do the rest for you.

Coming back to the linked list node, we need a “next pointer” in the node, which our Box is capable of providing. One small issue is: we want this pointer to either point to a Node or be “null”.

There are no segfaulting, undefined-behaviour-inducing stupid null pointers in “safe” Rust. You can get the same effect in a completely type-safe way by using the “Option” type.

struct Node {
    dat: i32,
    next: Option<Box<Node>>
}

fn main() {
    let h1 = Node {dat:10, next: None};
    let h2 = Node {dat:20, next: Some(Box::new(h1))};

    println!("{}", h2.next.unwrap().dat);
}

We have a simple two-node linked list!

Using references instead of “Box”

Rust has “references”, which are superficially like the pointers in C/C++. You can embed these references in structures - like this:

struct Foo {
    x: i32,
    y: i32,
}

struct Node {
    p: &Foo,
}

[note: above code will NOT compile]

So why not create linked lists by using references embedded in structures? The problem is that Rust references do much more than “point-to-things”. All manipulations involving pointers are statically checked to make sure we don’t do stupid things like keep a pointer to a memory location which has been deallocated (and use this pointer to access the deallocated location). And a few other similar things. Sometimes your code will have really tricky bugs hidden inside because of which the compiler refuses to compile the code. Some other times, your code will be perfectly okay, but the compiler will decide to err on the side of safety and refuse to compile the code because the static analyzer was not sophisticated enough to pass judgement on the code you wrote. In general, you will have to struggle a lot to make the compiler believe in the correctness of your code if it involves the kind of pointer manipulations that are typical when creating linked data structures (doubly linked lists, trees, graphs etc).

This is the reason why beginning Rust programmer’s are explicitly advised not to attempt “simple” programmig assignments like creating linked lists, trees etc in Rust. That doesn’t mean that you can’t do it in Rust; it simply means you need to put in much more effort and understand Rust at a much deeper level to get the code to compile.

If you still wish to play with Linked lists in Rust, you can always start by using “owned pointers”, the stuff that you get when you use a Box (and a few other such things, as we will see later). The key difference between a Box and the Rust references (obtained by using the & operator) is that there is a very clear notion of ownership in the case of a Box. Consider the following program:

fn main() {
    let x: &mut i32;
    {
        let mut y = 10;
        x = &mut y;
    }
    *x = 20;
}

Because the variable “x” has no control over the lifetime of variable “y”, the compiler is forced to perform static analysis to determine that all the statements in the code that derefer x (in the above case, the line “*x = 20”) are valid. This can become very tricky as soon as you start embedding references in data structures and sending them across function (and returning them back). Even in the above case, the dereference is invalid because “y” is defined in an inner scope and gets deallocated as soon as the scope ends.

There is absolutely no such problem with a Box.

fn main() {
    let mut x: Box<i32>;
    {
       let y = Box::new(10);
       x = y;
    }
    *x = 20; 
 }

The statement “y = Box::new(10)” allocates some space on the heap, stores 10 in that space and returns a pointer which gets stored in y. Now, the statement “x=y” results in a “move”; x is now pointing to the same heap location, and “y” is no longer valid. This means when the local scope terminates, it has no effect on the heap location containing 10 - as the variable “y” is no longer valid, termination of the inner scope does not result in de-allocation of the heap location which “y” had pointed to. As long as variable “x” is live, the heap location containing 10 will also be valid.

Moving things out of borrowed content

Here is something which you will frequently do when iplementing linked lists in C:

//newnode: points to new node to be added
//addafter: points to node after which newnode is to be added

newnode->next = addafter->next;
addafter->next = newnode;

Let’s focus on the first line: “newnode->next = addafter->next”. We are basically copying the content of the field “next” of the structure pointed to by “addafter” to the field “next” of the structure pointed to by “newnode”.

Note that in Rust, what looks like a copy is often a “move”.

Try compiling this Rust program:

struct Foo {
    x: i32,
    y: Vec<i32>,
}
fn main() {
    let f = Foo {x:1, y:vec![1,2,3]};
    let p = &mut f;
    let q = f.y;
}

We are trying to move the field “y” of the structure pointed to by “p” to another variable called “q”.

The code doesn’t compile. Moving the field “y” to “q” basically invalidates that field; if you now try to access the struct through p, the acess will be invalid. This is not allowed.

The inability to do something like this will prevent us from writing the Rust equivalent of the C code fragment shown above.

But there is a way you can do this in Rust: use the function mem::replace.

use std::mem;
let a = mem::replace(&mut p.y, vec![]);
println!("{:?}", a); // prints [1,2,3]

mem::replace replaces the field “y” with vec![], and returns the original value of the field.

A similar trick can be used to move a value out of an “Option”, at the same time setting the Option’s value to “None”:

let mut p = Some(10);
let r = p.take(); // r is Some(10)
println!("{:?}", p); // p is None

Here are two other very useful operations on Option values:

fn main() {
    let a:Option<Vec<i32>> = Some(vec![1,2,3]);
    let b = a.map(|v| v[0]);
    println!("{:?}", b); // prints Some(1)

    let c:Option<Vec<i32>> = None;
    let d = a.map(|v| v[0]);
    println!("{:?}", d); // prints None
}

This is useful in situations where you need to apply a function on the value wrapped inside the Option (and return a new option), if the Option is not None. Note that the “map” function moves the variable “a” so that it becomes unavailable to the rest of the code.

fn main() {

    let a:Option<Vec<i32>> = Some(vec![1,2,3]);
    let b = a.and_then(|v| Some(v[0]));
    println!("{:?}", b);

    let a:Option<Vec<i32>> = None;
    let b = a.and_then(|v| Some(v[0]));
    println!("{:?}", b);

}

The difference between “map” and “and_then” is that and_then simply returns the value generated by the function which it gets as its argument (this function must return an Option) whereas map will first unwrap the option, apply the function on the unwrapped value, then wrap the result back in an Option and returns that Option.

Reference Counted pointers

What if you wish to have two pointers pointing to one node of a linked list?

The pointed-to node can’t be allocated using a “Box” because a Box will always have only a single pointer which “owns” it; try to copy this pointer and it will move, invalidating the first one.

The next option is to have a Rust “&” reference; this brings with it the problem of handling lifetimes which can sometimes be very tricky.

Reference-counted pointers solve this problem by providing owned, immutable pointers which are very similar to the references that you see in high level garbage-collected languages like Python, Java etc.

Here is a code fragment in Python:

a = [1,2,3]
b = a
b.append(10)
b = 0
a = 0

The variable “a” is in fact a pointer to a block of memory which holds the list [1,2,3]. Besides the numbers 1,2 and 3, this block of memory embeds a reference count which is initially equal to one. The assignment “b = a” copies the address in “a” to “b” and simultaneously increments the reference count. Now, when you replace the address in “b” with the value 0, Python will decrement the reference count (the ref count is now 1). The next assignment (a = 0) decrements the refcount to 0. Whenever the ref-count becomes 0, Python automatically deallocates the block.

Rust provides a special Box-like pointer type “Rc” with a crucial difference - you can have many Rc’s pointing to the same block of memory. Each “Rc” owns the pointed to block (just like a “Box”). But does this not create a problem? When each “Rc” goes out of scope, does it not result in the pointed to block getting de-allocated (“dropped”), resulting in multiple attempts to deallocate the same block, which is an error?

No, the block of memory pointed to by an Rc gets deallocated only when the last Rc pointing to that block goes out of scope. The “Rc” behaves like a reference in Python!

Here is an example:

use std::rc::Rc;

struct Foo{
    x:i32,
    y:i32,
    z:Vec<i32>,
}

fn main() {
    let mut f = Foo{x:1, y:2, z:vec![10,20]};
    let mut a = Rc::new(f);
    println!("{:?}", Rc::strong_count(&a)); // 1
    {
        let b = Rc::clone(&a);
        println!("{:?}", Rc::strong_count(&a)); // 2
    }
    println!("{:?}", Rc::strong_count(&a)); // again 1, as b has gone out of scope.
    
}

The “Rc::clone” function simply increments the reference count of the location pointed to by “a” and returns the same address, which gets stored in “b”.

The Rc::strong_count function returns the embedded reference count.

Note that “Rc” is an immutable reference; statements like “a.x = 10” will not compile.

Mutability

Consider the following program:

struct Hello {
    x: i32,
    y: i32,
}

fn main() {
    let mut h1 = Hello{x:1, y:2};
    let p1 = &h1;
    p1.x = 10;
}

Even though “h1” is mutable, the statement “p1.x=10” generates a compiler error becuase “p1” is an immutable reference.

Mutating objects behind immutable references

Consider the following program:

use std::cell::RefCell;
struct Hello {
    x: i32,
    y: i32,
}

fn main() {
    let  h1 = RefCell::new(Hello{x:1, y:2});
    let p = &h1;
    p.borrow_mut().x = 10;
    println!("{}",p.borrow_mut().x);
    
}

RefCell::new returns a value of type struct RefCell; a field of this struct will be an object of type struct Hello (RefCell has nothing to do with pointers - it is simply a struct with one of the fields storing the value passed as argument to RefCell::new).

The variable “p” is an immutable borrow of this RefCell struct. Normally, we will not be able to modify a value that is behind an immutable “&” reference. But the “borrow_mut” method on a refcell magically returns a mutable reference to the value stored as part of the RefCell.

In what way is this useful?

As we are talking about linked lists, let’s think of two nodes with the “next” pointer of both these nodes pointing to a third node.

We have seen that we can do this easily using a reference-counted pointer. But there is one problem: the Rc is an immutable pointer - you can’t modify the object that the Rc points to.

That’s where the RefCell comes into the picture.

Make the Rc point to a RefCell that wraps over your object and you can mutate the object using the Rc!

use std::cell::RefCell;
struct Hello {
    x: i32,
    y: i32,
}

fn main() {
    let  h1 = RefCell::new(Hello{x:1, y:2});
    let p = Rc::new(h1);
    let q = Rc::clone(&p);

    p.borrow_mut().x = 10;
}

The next program is a bit weird; it compiles fine, but generates a panic when it is executed!

use std::cell::RefCell;
struct Hello {
    x: i32,
    y: i32,
}

fn main() {
    let  h1 = RefCell::new(Hello{x:1, y:2});
    let p = h1.borrow_mut();
    let q = h1.borrow_mut();
}

You are actually able have two mutable borrows in the text of your program, the compiler is not able to prevent that from happening because the “borrow_mut” function doesn’t return an object of type “&mut T”, but something else (of type std::cell::RefMut). The pointer aliasing rules are applicable only for objects of type “&T” and “&mut T”. At run time, if you try to get a mutable borrow while another mutable borrow is in scope, the program will crash preventing anything unsafe from happening.

RefCell’s basically shift the burden of ensuring exclusive mutable access to the runtime, rather than the static analyzer. This allows us to play with pointers in a much more carefree way, with of course much greater chance of our code crashing and burning at runtime!

Here is a useful method on RefCell’s:

use std::cell::RefCell;

#[derive(Debug)]
struct Hello {
    x: i32,
    y: i32,
}

fn main() {
    let  h1 = RefCell::new(Hello{x:1, y:2});
    let q = h1.into_inner();
    println!("{:?}", q);
}

“into_inner” consumes the RefCell and returns the inner value (in this case, a value of type struct Hello).

Unsafe Rust

The restrictions Rust places on pointer manipulations are necessary for ensuring memory safety - but sometimes you will need to be back in C-land and go wild with pointers!

Imagine you need to write to a memory-mapped I/O port on a microcontroller (with address 0x10):

fn main() {
    let p: *mut i32;
    p = 0x10 as *mut i32;
    *p = 0;
}

“p” is declared as a special pointer type: *mut i32. Think of this as somewhat like your normal C pointer; it is called a “raw pointer” in Rust (in this case, a raw pointer to a 32 bit integer).

This is C-like code: stuff in a random number in a pointer variable and derefer it.

This code doesn’t compile. Dereferencing a pointer containing random memory addresses is dangerous and Rust doesn’t let you do it without some ceremony.

Now try this:

fn main() {
    let p: *mut i32;
    p = 0x10 as *mut i32;
    unsafe {
        *p = 0;
    }
}

The program segfaults when it is executed on a Linux system! We are back in the land of C!

Dereferencing a raw pointer is an “unsafe” operation and must be put inside an “unsafe” block. It is our responsibility to make sure that this code does not create any problems. Most Rust programs will never have unsafe blocks in them. Lower levels of code which interacts with the hardware and performs complex memory manipulation will have some “unsafe” blocks in them - as these are explicitly marked as “unsafe”, it is easier for us to identify and review these code fragments more thoroughly. Unsafe blocks must be avoided for all practical purposes and should be used only when they are absolutely unavoidable, that too with extreme caution as they can trigger undefined behaviour and all sorts of weird things might happen …