| # Final Encoding for RPython Interpreters
|
|
|
| An RPython interpreter for a programming language generally does three or four
|
| things, in order:
|
|
|
| 1. Read and parse input programs
|
| 1. Encode concrete syntax as abstract syntax
|
| 1. *Optionally*, optimize or reduce the abstract syntax
|
| 1. Evaluate the abstract syntax: read input data, compute, print output data,
|
| etc.
|
|
|
| Today we'll look at abstract syntax. Most programming languages admit a
|
| [concrete parse tree](https://en.wikipedia.org/wiki/Parse_tree) which is
|
| readily abstracted to provide an [abstract syntax
|
| tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree) (AST). The AST is
|
| usually encoded with the *initial* style of encoding. An initial encoding can
|
| be transformed into any other encoding for the same AST, looks like a
|
| hierarchy of classes, and is implemented as a static structure on the heap.
|
|
|
| In contrast, there is also a *final* encoding. A final encoding can be
|
| transformed into by any other encoding, looks like an interface for the
|
| actions of the interpreter, and is implemented as an unwinding structure on
|
| the stack.
|
|
|
| In RPython, an initial encoding is built from a hierarchy of classes. Each
|
| class represents a type of tree nodes, corresponding to a parser production in
|
| the concrete parse tree. Each class instance therefore represents an
|
| individual tree node and the fields and methods of each instance store/compute
|
| properties of that node. This seems like an obvious and simple approach; what
|
| other approaches could there be? We need an example.
|
|
|
| ## Final Encoding of Brainfuck
|
|
|
| We will consider [Brainfuck](https://esolangs.org/wiki/Brainfuck), a simple
|
| Turing-complete programming language. An example Brainfuck program might be:
|
|
|
| [-]
|
|
|
| This program is built from a loop and a decrement, and sets a cell to zero. In
|
| an initial encoding which follows the [algebraic semantics of
|
| Brainfuck](https://esolangs.org/wiki/Algebraic_Brainfuck), the program could
|
| be expressed by applying class constructors to build a structure on the heap:
|
|
|
| Loop(Plus(-1))
|
|
|
| A final encoding is similar, except that class constructors are replaced by
|
| methods, the structure is built on the stack, and we are parameterized over
|
| the choice of class:
|
|
|
| lambda cls: cls.loop(cls.plus(-1))
|
|
|
| In ordinary Python, transforming between these would be trivial, and mostly is
|
| a matter of passing around the appropriate class. Indeed, initial and final
|
| encodings are equivalent; we'll return to that fact later. However, in RPython,
|
| all of the types must line up, and classes must be determined before
|
| translation. To facilitate this, we'll use some monomorphic tricks, but first
|
| let's see what an actual Brainfuck interface looks like, so that we can cover
|
| all of the difficulties with final encoding.
|
|
|
| Before we embark, please keep in mind that local code doesn't know what `cls`
|
| is. There's no type-safe way to inspect an arbitrary semantic domain. In the
|
| initial-encoded version, we can ask `isinstance(bf, Loop)` to see whether an
|
| AST node is a loop, but there simply isn't an equivalent for final-encoded
|
| ASTs. So, there is an implicit challenge to think about: how do we evaluate a
|
| program in an arbitrary semantic domain? For bonus points, how do we optimize
|
| a program without inspecting the types of its AST nodes?
|
|
|
| ### Core Functionality
|
|
|
| Final encoding is given as methods on an interface. These five methods
|
| correspond precisely to the summands of the algebra of Brainfuck.
|
|
|
| def plus(self, i): pass
|
| def right(self, i): pass
|
| def input(self): pass
|
| def output(self): pass
|
| def loop(self, bfs): pass
|
|
|
| Note that the `.loop()` method takes another program as an argument.
|
| Initially-encoded ASTs have other initially-encoded ASTs as fields on class
|
| instances; finally-encoded ASTs have other finally-encoded ASTs as parameters
|
| to interface methods.
|
|
|
| ### Monoid
|
|
|
| In order to optimize input programs, we'll need to represent the underlying
|
| [monoid](https://en.wikipedia.org/wiki/Monoid) of Brainfuck programs. To do
|
| this, we add the signature for a monoid:
|
|
|
| def unit(self): pass
|
| def join(self, l, r): pass
|
|
|
| This is technically a [unital
|
| magma](https://en.wikipedia.org/wiki/Magma_(algebra)), since RPython doesn't
|
| support algebraic laws, but we will enforce the algebraic laws later on during
|
| optimization. We also want to make use of the folklore that [free
|
| monoids](https://en.wikipedia.org/wiki/Free_monoid) are lists, allowing
|
| callers to pass a list of actions which we'll reduce with recursion:
|
|
|
| def joinList(self, bfs):
|
| if not bfs: return self.unit()
|
| elif len(bfs) == 1: return bfs[0]
|
| elif len(bfs) == 2: return self.join(bfs[0], bfs[1])
|
| else:
|
| i = len(bfs) >> 1
|
| return self.join(self.joinList(bfs[:i]), self.joinList(bfs[i:]))
|
|
|
| ### Idioms
|
|
|
| Finally, our interface includes a few high-level idioms, like the zero program
|
| shown earlier, which are defined in terms of low-level behaviors. In an
|
| initial encoding, these could be defined as module-level functions; here, we
|
| define them on a mixin class `BF`.
|
|
|
| def zero(self): return self.loop(self.plus(-1))
|
| def move(self, i): return self.scalemove(i, 1)
|
| def move2(self, i, j): return self.scalemove2(i, 1, j, 1)
|
| def scalemove(self, i, s):
|
| return self.loop(self.joinList([
|
| self.plus(-1), self.right(i), self.plus(s), self.right(-i)]))
|
| def scalemove2(self, i, s, j, t):
|
| return self.loop(self.joinList([
|
| self.plus(-1), self.right(i), self.plus(s), self.right(j - i),
|
| self.plus(t), self.right(-j)]))
|
|
|
| ## Interface-oriented Architecture
|
|
|
| ### Applying Interfaces
|
|
|
| Now, we hack at RPython's object model until everything translates. First,
|
| consider the task of pretty-printing. For Brainfuck, we'll simply regurgitate
|
| the input program as a Python string:
|
|
|
| class AsStr(object):
|
| import_from_mixin(BF)
|
| def unit(self): return ""
|
| def join(self, l, r): return l + r
|
| def plus(self, i): return '+' * i if i > 0 else '-' * -i
|
| def right(self, i): return '>' * i if i > 0 else '<' * -i
|
| def loop(self, bfs): return '[' + bfs + ']'
|
| def input(self): return ','
|
| def output(self): return '.'
|
|
|
| Via `rlib.objectmodel.import_from_mixin`, no stressing with covariance of
|
| return types is required. Instead, we shift from a Java-esque view of classes
|
| and objects, to an OCaml-ish view of prebuilt classes and constructors.
|
| `AsStr` is monomorphic, and any caller of it will have to create their own
|
| covariance somehow. For example, here are the first few lines of the parsing
|
| function:
|
|
|
| @specialize.argtype(1)
|
| def parse(s, domain):
|
| ops = [domain.unit()]
|
|
|
| By invoking `rlib.objectmodel.specialize.argtype`, we make copies of the
|
| parsing function, up to one per call site, based on our choice of semantic
|
| domain. [Oleg](https://okmij.org/ftp/tagless-final/) calls these "symantics"
|
| but I prefer "domain" in code. Also, note how the parsing stack starts with
|
| the unit of the monoid, which corresponds to the empty input string.
|
|
|
| ### Composing Interfaces
|
|
|
| Earlier, we noted that an interpreter can optionally optimize input programs
|
| after parsing. To support this, we'll precompose a [peephole
|
| optimizer](https://en.wikipedia.org/wiki/Peephole_optimization) onto an
|
| arbitrary domain. We could also postcompose with a parser instead, but that
|
| sounds more difficult. Here are the relevant parts:
|
|
|
| def makePeephole(cls):
|
| domain = cls()
|
| def stripDomain(bfs): return domain.joinList([t[0] for t in bfs])
|
| class Peephole(object):
|
| import_from_mixin(BF)
|
| def unit(self): return []
|
| def join(self, l, r): return l + r
|
| # Actual definition elided... for now...
|
| return Peephole, stripDomain
|
|
|
| Don't worry about the actual optimization yet. What's important here is the
|
| pattern of initialization of semantic domains. `makePeephole` is an
|
| [SML](https://en.wikipedia.org/wiki/Standard_ML)-style functor on semantic
|
| domains: given a final encoding of Brainfuck, it produces another final
|
| encoding of Brainfuck which incorporates optimizations. The helper
|
| `stripDomain` is a finalizer which performs the extraction from the
|
| optimizer's domain to the underlying `cls` that was passed in. For example,
|
| let's optimize pretty-printing:
|
|
|
| AsStr, finishStr = makePeephole(AsStr)
|
|
|
| Now, it only takes one line to parse and print an optimized AST without ever
|
| building it on the heap:
|
|
|
| print finishStr(parse(text, AsStr()))
|
|
|
| ## Performance
|
|
|
| But is it fast? Yes. It's faster than the prior version, which was
|
| initial-encoded, and also faster than Andrew Brown's classic version ([part
|
| 1](https://pypy.org/posts/2011/04/tutorial-writing-interpreter-with-pypy-3785910476193156295.html),
|
| [part
|
| 2](https://pypy.org/posts/2011/04/tutorial-part-2-adding-jit-8121732841568309472.html)),
|
| which did not perform optimizations.
|
|
|
| ### JIT
|
|
|
| First, why is it faster than the same interpreter with initial encoding? Well,
|
| it still has initial encoding! There is an `Op` class with a hierarchy of
|
| subclasses implementing individual behaviors. A sincere tagless-final student,
|
| or those who remember [Stop Writing
|
| Classes](https://pyvideo.org/pycon-us-2012/stop-writing-classes.html), will
|
| recognize that the following classes could be plain functions, and should
|
| think of the classes as a concession to RPython's lack of support for lambdas
|
| with closures rather than an initial encoding. We aren't ever going to
|
| directly typecheck any `Op`, but the JIT will guard them anyway, so we
|
| effectively get a fully-promoted AST inlined into each JIT trace. First, some
|
| simple behaviors:
|
|
|
| class Op(object): _immutable_ = True
|
|
|
| class _Input(Op):
|
| _immutable_ = True
|
| def runOn(self, tape, position):
|
| tape[position] = ord(os.read(0, 1)[0])
|
| return position
|
| Input = _Input()
|
|
|
| class _Output(Op):
|
| _immutable_ = True
|
| def runOn(self, tape, position):
|
| os.write(1, chr(tape[position]))
|
| return position
|
| Output = _Output()
|
|
|
| class Add(Op):
|
| _immutable_ = True
|
| _immutable_fields_ = "imm",
|
| def __init__(self, imm): self.imm = imm
|
| def runOn(self, tape, position):
|
| tape[position] += self.imm
|
| return position
|
|
|
| The JIT does technically have less information than before; it no longer knows
|
| that a sequence of immutable operations is immutable enough to be worth
|
| unrolling, but a bit of `rlib.jit.unroll_safe` fixes that:
|
|
|
| class Seq(Op):
|
| _immutable_ = True
|
| _immutable_fields_ = "ops[*]",
|
| def __init__(self, ops): self.ops = ops
|
| @unroll_safe
|
| def runOn(self, tape, position):
|
| for op in self.ops: position = op.runOn(tape, position)
|
| return position
|
|
|
| Finally, the JIT entry point is at the head of each loop, just like with prior
|
| interpreters. Since Brainfuck doesn't support mid-loop jumps, there's no
|
| penalty for only allowing merge points at the head of the loop.
|
|
|
| class Loop(Op):
|
| _immutable_ = True
|
| _immutable_fields_ = "op",
|
| def __init__(self, op): self.op = op
|
| def runOn(self, tape, position):
|
| op = self.op
|
| while tape[position]:
|
| jitdriver.jit_merge_point(op=op, position=position, tape=tape)
|
| position = op.runOn(tape, position)
|
| return position
|
|
|
| That's the end of the implicit challenge. There's no secret to it; just
|
| evaluate the AST. Here's part of the semantic domain for evaluation, as well
|
| as the "functor" to optimize it and the top-level code to run the operation.
|
| In `AsOps.join()` are the *only* `isinstance()` calls in the entire
|
| interpreter!
|
|
|
| class AsOps(object):
|
| import_from_mixin(BF)
|
| def unit(self): return Shift(0)
|
| def join(self, l, r):
|
| if isinstance(l, Seq) and isinstance(r, Seq):
|
| return Seq(l.ops + r.ops)
|
| elif isinstance(l, Seq): return Seq(l.ops + [r])
|
| elif isinstance(r, Seq): return Seq([l] + r.ops)
|
| return Seq([l, r])
|
| # Other methods elided!
|
|
|
| AsOps, finishOps = makePeephole(AsOps)
|
|
|
| tape = bytearray("\x00" * cells)
|
| finishOps(parse(text, AsOps())).runOn(tape, 0)
|
|
|
| ### Peephole Optimization
|
|
|
| Our peephole optimizer is an abstract interpreter with one instruction of
|
| lookahead/rewrite buffer. It implements the aforementioned algebraic laws of
|
| the Brainfuck monoid. It also implements idiom recognition for loops. First,
|
| the abstract interpreter. The abstract domain has six elements:
|
|
|
| class AbstractDomain(object): pass
|
| meh, aLoop, aZero, theIdentity, anAdd, aRight = [AbstractDomain() for _ in range(6)]
|
|
|
| We'll also tag everything with an integer, so that `anAdd` or `aRight` can be
|
| exact annotations. *This* is the actual `Peephole.join()` method:
|
|
|
| def join(self, l, r):
|
| if not l: return r
|
| rv = l[:]
|
| bfHead, adHead, immHead = rv.pop()
|
| for bf, ad, imm in r:
|
| if ad is theIdentity: continue
|
| elif adHead is aLoop and ad is aLoop: continue
|
| elif adHead is theIdentity:
|
| bfHead, adHead, immHead = bf, ad, imm
|
| elif adHead is anAdd and ad is aZero:
|
| bfHead, adHead, immHead = bf, ad, imm
|
| elif adHead is anAdd and ad is anAdd:
|
| immHead += imm
|
| if immHead: bfHead = domain.plus(immHead)
|
| elif rv: bfHead, adHead, immHead = rv.pop()
|
| else:
|
| bfHead = domain.unit()
|
| adHead = theIdentity
|
| elif adHead is aRight and ad is aRight:
|
| immHead += imm
|
| if immHead: bfHead = domain.right(immHead)
|
| elif rv: bfHead, adHead, immHead = rv.pop()
|
| else:
|
| bfHead = domain.unit()
|
| adHead = theIdentity
|
| else:
|
| rv.append((bfHead, adHead, immHead))
|
| bfHead, adHead, immHead = bf, ad, imm
|
| rv.append((bfHead, adHead, immHead))
|
| return rv
|
|
|
| If this were to get much longer, then [implementing a
|
| DSL](https://pypy.org/posts/2024/10/jit-peephole-dsl.html) would be worth it,
|
| but this is a short-enough method to inline. The abstract interpretation is
|
| assumed by induction for the left-hand side of the join, save for the final
|
| instruction, which is loaded into a rewrite register. Each instruction on the
|
| right-hand side is inspected exactly once. The logic for `anAdd` followed by
|
| `anAdd` is exactly the same as for `aRight` followed by `aRight` because they
|
| both have underlying [Abelian
|
| groups](https://en.wikipedia.org/wiki/Abelian_group) given by the integers.
|
| The rewrite register is carefully pushed onto and popped off from the
|
| left-hand side in order to cancel out `theIdentity`, which itself is merely a
|
| unifier for `anAdd` or `aRight` of 0.
|
|
|
| Note that we generate a lot of garbage. For example, parsing a string of *n*
|
| '+' characters will cause the peephole optimizer to allocate *n* instances of
|
| the underlying `domain.plus()` action, from `domain.plus(1)` up to
|
| `domain.plus(n)`. An older initial-encoded version of this interpreter used
|
| [hash consing](https://en.wikipedia.org/wiki/Hash_consing) to avoid ever
|
| building an op more than once, even loops. It appears more efficient to
|
| generate lots of immutable garbage than to repeatedly hash inputs and search
|
| mutable hash tables, at least for optimizing Brainfuck incrementally during
|
| parsing.
|
|
|
| Finally, let's look at idiom recognition. RPython lists are initial-coded, so
|
| we can dispatch based on the length of the list, and then inspect the abstract
|
| domains of each action.
|
|
|
| def isConstAdd(bf, i): return bf[1] is anAdd and bf[2] == i
|
|
|
| def oppositeShifts(bf1, bf2):
|
| return bf1[1] is bf2[1] is aRight and bf1[2] == -bf2[2]
|
|
|
| def oppositeShifts2(bf1, bf2, bf3):
|
| return (bf1[1] is bf2[1] is bf3[1] is aRight and
|
| bf1[2] + bf2[2] + bf3[2] == 0)
|
|
|
| def loop(self, bfs):
|
| if len(bfs) == 1:
|
| bf, ad, imm = bfs[0]
|
| if ad is anAdd and imm in (1, -1):
|
| return [(domain.zero(), aZero, 0)]
|
| elif len(bfs) == 4:
|
| if (isConstAdd(bfs[0], -1) and
|
| bfs[2][1] is anAdd and
|
| oppositeShifts(bfs[1], bfs[3])):
|
| return [(domain.scalemove(bfs[1][2], bfs[2][2]), aLoop, 0)]
|
| if (isConstAdd(bfs[3], -1) and
|
| bfs[1][1] is anAdd and
|
| oppositeShifts(bfs[0], bfs[2])):
|
| return [(domain.scalemove(bfs[0][2], bfs[1][2]), aLoop, 0)]
|
| elif len(bfs) == 6:
|
| if (isConstAdd(bfs[0], -1) and
|
| bfs[2][1] is bfs[4][1] is anAdd and
|
| oppositeShifts2(bfs[1], bfs[3], bfs[5])):
|
| return [(domain.scalemove2(bfs[1][2], bfs[2][2],
|
| bfs[1][2] + bfs[3][2],
|
| bfs[4][2]), aLoop, 0)]
|
| if (isConstAdd(bfs[5], -1) and
|
| bfs[1][1] is bfs[3][1] is anAdd and
|
| oppositeShifts2(bfs[0], bfs[2], bfs[4])):
|
| return [(domain.scalemove2(bfs[0][2], bfs[1][2],
|
| bfs[0][2] + bfs[2][2],
|
| bfs[3][2]), aLoop, 0)]
|
| return [(domain.loop(stripDomain(bfs)), aLoop, 0)]
|
|
|
| This ends the bonus question. How do we optimize an unknown semantic domain?
|
| We must maintain an abstract context which describes elements of the domain.
|
| In initial encoding, we ask an AST about itself. In final encoding, we already
|
| know everything relevant about the AST.
|
|
|
| ## Discussion
|
|
|
| Given that initial and final encodings are equivalent, and noting that
|
| RPython's toolchain is written to prefer initial encodings, what did we
|
| actually gain? Did we gain anything?
|
|
|
| One obvious downside to final encoding in RPython is interpreter size. The
|
| example interpreter shown here is a rewrite of an initial-encoded interpreter
|
| which can be seen
|
| [here](https://github.com/rpypkgs/rpypkgs/blob/659c8a26d428a1e04fdff614b28e464a50d4647b/bf/bf.py)
|
| for comparison. Final encoding adds about 20% more code in this case.
|
|
|
| Final encoding is not necessarily more code than initial encoding, though. All
|
| AST encodings in interpreters are subject to the [Expression
|
| Problem](https://en.wikipedia.org/wiki/Expression_problem), which states that
|
| there is generally a quadratic amount of code required to implement multiple
|
| behaviors for an AST with multiple types of nodes; specifically, *n* behaviors
|
| for *m* types of nodes require *n* × *m* methods. Initial encodings improve the
|
| cost of adding new types of nodes; final encodings improve the cost of adding
|
| new behaviors. Final encoding may tend to win in large codebases for mature
|
| languages, where the language does not change often but new behaviors are added
|
| frequently and maintained for long periods.
|
|
|
| Optimizations in final encoding require a bit of planning. The
|
| abstract-interpretation approach is solid but relies upon the monoid and its
|
| algebraic laws. In the worst case, an entire class hierarchy could be required
|
| to encode the abstraction.
|
|
|
| Final encoding was popularized via the tagless-final movement. A "tag", in
|
| this jargon, is a runtime identifier for an object's type or class; a tagless
|
| encoding effectively doesn't allow `isinstance()` at all. In the above
|
| presentation, tags could be hacked in, but were not materially relevant to
|
| most steps. Tags were required for the final evaluation step, though, and the
|
| tagless-final insight is that certain type systems can express type-safe
|
| evaluation without those tags. We won't go further in this direction because
|
| tags also communicate valuable information to the JIT.
|
|
|
| ### Summarizing Table
|
|
|
| Initial Encoding | Final Encoding
|
| ---|---
|
| hierarchy of classes | signature of interfaces
|
| class constructors | method calls
|
| built on the heap | built on the stack
|
| traversals allocate stack | traversals allocate heap
|
| tags are available with `isinstance()` | tags are only available through hacks
|
| cost of adding a new AST node: one class | cost of adding a new AST node: one method on every other class
|
| cost of adding a new behavior: one method on every other class | cost of adding a new behavior: one class
|