An interesting approach for bytecode dispatch

Today I implemented a way to perform bytecode dispatch that I found interesting. It's supposedly called direct threaded code or threaded code interpretation, but my implementation is a lot simpler than that name suggests... What's in a name?

Before I get into it, let's first have some context. Following the second part of Robert Nystrom's Crafting Interpreters, I have reached the part where he shows how to build a virtual machine.

My implementation is written in Zig, but it is easily applicable to other languages.

There's an enum that is used to determine which instruction is currently being executed:

const Opcode = enum(u8) {
    Constant,
    Return,
};

I have kept it short for brevity, but assume these are all of the opcodes supported by our virtual machine. The VM is given a value containing a slice of bytes ([]u8). Some of the bytes in that slice correspond to instructions (opcode and operand) and some are just data. Here's a rough sketch of how the VM is structured:

const VM = struct {
    chunk: []u8,
    pub fn init(chunk: []u8) VM {
        return VM {
            .chunk = chunk,
        };
    }
    pub fn run(self: *VM) !void {
        // interpret instructions from `chunk`
    }
};

The VM just stores a slice of bytes for the run() function to use. The run() function is the actual meat of our VM: it fetches, decodes and executes one instruction at a time.
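To make the chunk layout concrete, here is a hypothetical three-byte chunk. The one-byte operand following Constant is an assumption for illustration; the real encoding depends on how your VM lays out its instructions:

const chunk = [_]u8{
    // a Constant instruction: opcode followed by a one-byte operand
    // (the operand layout is an illustrative assumption)
    @intFromEnum(Opcode.Constant), 0,
    // a Return instruction: opcode only, no operand
    @intFromEnum(Opcode.Return),
};

Note how the byte at index 1 is plain data, not an opcode: only the interpreter's position in the chunk tells the two apart.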

Typically, we would implement the function like so:

const VM = struct {
    ...
    fn run(self: *VM) !void {
        // instruction pointer
        var ip: usize = 0;
        while (ip < self.chunk.len) {
            // fetch a byte and convert it into an instance of `Opcode`
            const opcode: Opcode = @enumFromInt(self.chunk[ip]);
            // advance past the opcode byte; each case below also
            // advances `ip` past any operands it reads
            ip += 1;
            switch (opcode) {
                .Constant => {
                    // possibly a large block of code
                },
                .Return => {
                    // possibly a large block of code
                },
            }
        }
    }
};

We use a virtual instruction pointer to keep track of where we are inside the chunk. In a loop, we fetch a byte and convert it with @enumFromInt, a built-in function provided by Zig that turns an integer into an enum value [1].

Then follows the dispatcher code. We look up which Opcode variant the byte represents, and jump to the appropriate label for handling that variant.

Since every instruction is handled from that function, it is the most performance-critical part of our virtual machine.

However, this approach has several inefficiencies. First of all, it is difficult for a processor to predict where a jump is going to land. The switch may translate to a jump that covers most, perhaps even all, of the opcodes. I may have shown only two jump targets, but a real VM would need to handle a lot more than that, and there are limits to how many jump targets a branch predictor can remember. Moreover, the target-address calculation is re-executed at every iteration, which is mostly redundant work.

Modern processors use instruction caches to drastically improve performance. However, the code to handle each opcode variant may be arbitrarily large, so our interpreter loop may not fit in the cache.

Such inefficiencies have led programmers to find clever approaches to bytecode dispatch. The direct threaded code approach involves creating a buffer holding the addresses of all the jump targets. The buffer is populated once, in advance and outside the interpreter loop. Each element in the buffer is indexed by the numerical value of an opcode. Bytecode dispatch then amounts to fetching the opcode and jumping directly to the target it indexes.

For our implementation, I first defined separate functions to handle each of the opcode variants. Our VM structure may look something like:

const VM = struct {
    chunk: []u8,
    pub fn init(chunk: []u8) VM {
        return VM {
            .chunk = chunk,
        };
    }
    pub fn run(self: *VM) !void {
        // interpret instructions from `chunk`
    }
    pub fn handle_return(self: *VM, ip: *usize) void {
        // code to handle Return
    }
    pub fn handle_constant(self: *VM, ip: *usize) void {
        // code to handle Constant
    }
};

Then we add a field to VM that holds function pointers. For convenience, I have also defined a new type, OpcodeHandler: a constant pointer to a function that handles an opcode. The new handlers field stores a slice of those OpcodeHandlers.

const VM = struct {
    // a function that handles a particular Opcode variant
    const OpcodeHandler = *const (fn (self: *VM, ip: *usize) void);
    chunk: []u8,
    handlers: []OpcodeHandler,
    ...
};

I have updated the initializer slightly so that it accepts an allocator. In the initializer, we call a new function, get_handlers(), that allocates an array and populates it with the required values. I am using a feature of Zig that tells us the number of variants contained in an enum. Since the variants are numbered contiguously from zero, that count is one more than the largest integer value an Opcode variant can hold, which is exactly the length our handlers array needs. The implementation of get_handlers() should make this clearer.

const VM = struct {
    const OpcodeHandler = *const (fn (self: *VM, ip: *usize) void);
    chunk: []u8,
    handlers: []OpcodeHandler,
    pub fn init(allocator: Allocator, chunk: []u8) VM {
        return VM {
            .chunk = chunk,
            .handlers = get_handlers(allocator),
        };
    }
    ...
    fn get_handlers(allocator: Allocator) []OpcodeHandler {
        // total number of variants in Opcode
        const total = @typeInfo(Opcode).Enum.fields.len;
        var handlers = allocator.alloc(OpcodeHandler, total) catch @panic("OOM");
        // store the function pointers at the index corresponding to the opcode value
        for (0..total) |i| {
            const opcode: Opcode = @enumFromInt(i);
            switch (opcode) {
                .Return => {
                    handlers[i] = VM.handle_return;
                },
                .Constant => {
                    handlers[i] = VM.handle_constant;
                },
            }
        }
        return handlers;
    }
    pub fn handle_return(self: *VM, ip: *usize) void {
        // code to handle Return
    }
    pub fn handle_constant(self: *VM, ip: *usize) void {
        // code to handle Constant
    }
};

Now the interpreter loop only needs to fetch an opcode and call the function at the index given by the opcode value, instead of doing the switch.

const VM = struct {
    ...
    fn run(self: *VM) !void {
        var ip: usize = 0;
        while (ip < self.chunk.len) {
            const opcode: Opcode = @enumFromInt(self.chunk[ip]);
            // each handler advances `ip` past its own opcode and operands
            self.handlers[@intFromEnum(opcode)](self, &ip);
        }
    }
};

And there you go, a much simpler and more efficient solution!
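Putting it all together, here is a minimal, self-contained sketch of the technique. The one-byte operand layout, the debug printing, and the handler bodies are my own assumptions for demonstration; a real VM would push constants onto a stack, pop a return value, and so on. (Note that @typeInfo(Opcode).Enum is the pre-0.14 Zig spelling, matching the snippets above; newer Zig spells it @typeInfo(Opcode).@"enum".)

const std = @import("std");

const Opcode = enum(u8) {
    Constant,
    Return,
};

const VM = struct {
    const OpcodeHandler = *const fn (self: *VM, ip: *usize) void;

    chunk: []const u8,
    handlers: []OpcodeHandler,

    fn init(allocator: std.mem.Allocator, chunk: []const u8) VM {
        return VM{
            .chunk = chunk,
            .handlers = get_handlers(allocator),
        };
    }

    fn get_handlers(allocator: std.mem.Allocator) []OpcodeHandler {
        // one slot per enum variant; variants are numbered 0..total-1
        const total = @typeInfo(Opcode).Enum.fields.len;
        var handlers = allocator.alloc(OpcodeHandler, total) catch @panic("OOM");
        for (0..total) |i| {
            const opcode: Opcode = @enumFromInt(i);
            switch (opcode) {
                .Constant => handlers[i] = VM.handle_constant,
                .Return => handlers[i] = VM.handle_return,
            }
        }
        return handlers;
    }

    fn run(self: *VM) void {
        var ip: usize = 0;
        while (ip < self.chunk.len) {
            const opcode: Opcode = @enumFromInt(self.chunk[ip]);
            // each handler advances `ip` past its opcode and operands
            self.handlers[@intFromEnum(opcode)](self, &ip);
        }
    }

    fn handle_constant(self: *VM, ip: *usize) void {
        // assumption for this sketch: Constant carries a one-byte operand
        const operand = self.chunk[ip.* + 1];
        std.debug.print("Constant {d}\n", .{operand});
        ip.* += 2; // skip opcode + operand
    }

    fn handle_return(self: *VM, ip: *usize) void {
        _ = self;
        std.debug.print("Return\n", .{});
        ip.* += 1; // skip opcode only
    }
};

pub fn main() void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    const allocator = gpa.allocator();
    // [Constant 42, Return] encoded as raw bytes
    const chunk = [_]u8{ @intFromEnum(Opcode.Constant), 42, @intFromEnum(Opcode.Return) };
    var vm = VM.init(allocator, &chunk);
    vm.run();
}

One nice property of building the table with a switch over @enumFromInt(i): the compiler still checks exhaustiveness, so adding a new Opcode variant without a handler is a compile error rather than a hole in the table.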

[1] Assuming errors aren't possible and the sun rises from the west.