Building lambda calculus in Go - Part 1


Recently I got quite interested in the lambda calculus, more specifically its untyped variant. The lambda calculus is a model of computation that forms the basis of many functional programming languages such as Haskell (which is typed) or Scheme. For quite some time I had also wanted to build an interpreter for a programming language, so I decided to give it a try and build one for the lambda calculus.

I will call this interpreter lambda.

Overview

Before we get started, let's get a quick overview of how the interpreter for lambda (and most interpreters for other languages, really) will be built. It consists of three basic parts, a lexer, a parser, and an evaluator, each of them serving a different purpose (a rough sketch of the whole pipeline follows the list):

  • The lexer takes a program in the form of an input string and turns it into a list of tokens, also called lexemes.
  • The parser takes these tokens and composes them into an abstract syntax tree (or "AST") according to the grammar of the language.
  • The evaluator finally takes the AST and evaluates its statements and expressions according to the language's evaluation rules, yielding some behaviour.
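
To make this pipeline concrete, here is a rough sketch in Go. The type and function names are placeholders for what we will build over the course of this series, not the final API:

package main

// Placeholder types; the real ones are defined in this and later parts.
type Token struct{}
type Node struct{}

func lex(input string) []Token   { return nil }  // part 1: characters -> tokens
func parse(tokens []Token) *Node { return nil }  // part 2: tokens -> AST
func eval(node *Node) *Node      { return node } // part 3: AST -> evaluated term

func main() {
    _ = eval(parse(lex(`\x.\y.(x y)`)))
}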

This project is somewhat large and so my writing about it will be split into multiple parts. In this part, we will focus on the grammar and lexer of lambda.

The Grammar

The grammar for the untyped Lambda calculus is fairly simple. Here it is in EBNF:

variable    = "v" | variable, "'";
application = term, " ", term;
abstraction = "λ", letter, ".", term;
term        = variable | "(", application, ")" | "(", abstraction, ")";

Examples of valid terms ("lambda-terms") based on this grammar are:

v
(v v')
(λv.v)
(λv.(v v'))
((λv.((λv'.(v' v'')) v)) v''')

This actually looks quite complicated. There are many parentheses and the variables are only distinguished by the number of ' suffixes. So let's make some small changes to the lambda calculus' grammar for our convenience:

First, we replace the variable rule by the following one:

variable = "a" | "b" | ... | "z";

While the original rule technically allowed for an infinite number of variables, which is important for some proofs, such as Barendregt's Substitution Lemma, we very likely won't run into issues by limiting ourselves to just 26 variable names.

Secondly, we add a rule to allow omitting the outermost parentheses:

lambdaterm  = variable | application | abstraction | term;

Now, we make a small change to the abstraction rule:

abstraction = "\", letter, "." , ( abstraction | term );

This allows us to write nested abstractions more easily, without having to use many parentheses, for example λx.λy.(x y).

Finally, we also replace the λ symbol with \ (as already done in the rule above) to make it easier to type.

As a result, valid lambda-terms in our modified grammar look like this:

v
x y 
\x.x
\x.(x y)
(x y) (x y)
\x.\y.(x y)

Of course we can still use the parentheses if we want to.

The final grammar now looks like this:

variable    = "a" | "b" | ... | "z";
application = term, " ", term;
abstraction = "\", letter, "." , ( abstraction | term );
term        = variable | "(", application, ")" | "(", abstraction, ")";
lambdaterm  = variable | application | abstraction | term;

Implementing the lexer

Now that we have the grammar settled, we can start implementing the lexer. Its main purpose is to split the incoming stream of characters into so-called "tokens" or "lexemes". These tokens will later be consumed by the parser to compose the abstract syntax tree according to our grammar.

Let's start by defining a Token structure and all the tokens we need:

type Kind int

type Token struct {
    Kind     Kind
    Literal  string
    Position int
}

const (
    ILLEGAL Kind = iota // Illegal token
    EOF                 // EOF
    LAMBDA              // "\"
    LPAREN              // "("
    RPAREN              // ")"
    DOT                 // "."
    IDENT               // any ASCII Char
    SPACE               // " "
)

The Token struct will hold the three things we need later: the Kind of token we are dealing with, the Literal, that is, the "value" of the token as it appears in the input stream, and the Position of the token's first character. We have a Kind for each type of token that can appear according to our grammar. Notice how we do not differentiate between upper- and lower-case letters for identifiers yet. EOF and ILLEGAL are special: they signify the end of the input stream and illegal characters (such as numeric digits), respectively.
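
For example, for the input \x.x the lexer we are about to build would produce a token stream along these lines (written here as Go struct literals):

token.Token{Kind: token.LAMBDA, Literal: "\\", Position: 0}
token.Token{Kind: token.IDENT, Literal: "x", Position: 1}
token.Token{Kind: token.DOT, Literal: ".", Position: 2}
token.Token{Kind: token.IDENT, Literal: "x", Position: 3}
token.Token{Kind: token.EOF, Literal: "", Position: 4}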

The lexer itself is pretty straightforward. It exposes a single method, Next(), which emits the next token in the input. Internally, it reads the input byte by byte, consuming only as much as it needs.

type Lexer struct {
    input   string
    pos     int
    readPos int
    ch      byte
}

func New(input string) *Lexer {
    l := &Lexer{input: input}
    l.readChar()
    return l
}

Here we define our Lexer, which holds a copy of the input string, the current and next read positions, and the current byte we are inspecting.

readChar() is defined like this:

func (l *Lexer) readChar() {
    if l.readPos >= len(l.input) {
        l.ch = 0
    } else {
        l.ch = l.input[l.readPos]
    }
    l.pos = l.readPos
    l.readPos += 1
}

It reads exactly one byte of input and updates the internal state of the lexer accordingly.
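
To illustrate, this is how the internal state evolves for the input \x (a small sketch from inside the lexer package, since the fields and readChar() are unexported):

l := New(`\x`)  // pos=0 readPos=1 ch='\\'
l.readChar()    // pos=1 readPos=2 ch='x'
l.readChar()    // pos=2 readPos=3 ch=0 (end of input)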

Finally, let's have a look at Next():

func (l *Lexer) Next() token.Token {
    var tok token.Token

    switch l.ch {
    case '\\':
        tok = newToken(token.LAMBDA, l.ch, l.pos)
    // [...]
    case '.':
        tok = newToken(token.DOT, l.ch, l.pos)
    case 0:
        tok.Literal = ""
        tok.Kind = token.EOF
        tok.Position = l.pos
    default:
        if isLetter(l.ch) {
            tok = newToken(token.IDENT, l.ch, l.pos)
        } else {
            tok = newToken(token.ILLEGAL, l.ch, l.pos)
        }
    }

    l.readChar()
    return tok
}

It is basically just a switch statement that returns the correct kind of token based on the current char under inspection. There are two special cases here:

In the case 0 branch we check for the zero byte. readChar() sets it once we have reached the end of the input, so the token to return is token.EOF.

Secondly, in the default case we check whether l.ch (the current byte) is actually a letter we want to allow. If it is, we can plug the byte right into the token's Literal. This works because, according to our grammar, every variable identifier is a single character. Other lexers usually have a function called consumeBytes() or similar, which reads a sequence of bytes that qualifies as a single identifier and returns it as the token literal; a rough sketch of such a helper follows below. If the byte is not a valid (lowercase) letter, we return token.ILLEGAL, as we have already checked all other valid cases.
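
lambda does not need such a helper, since its identifiers are single bytes, but a sketch of one built on top of our Lexer could look roughly like this (hypothetical code, not part of lambda):

func (l *Lexer) readIdentifier() string {
    start := l.pos
    // Keep consuming bytes for as long as they qualify as part of an identifier.
    for isLetter(l.ch) {
        l.readChar()
    }
    return l.input[start:l.pos]
}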

For completeness, here are the definitions of newToken and isLetter:

func newToken(kind token.Kind, ch byte, p int) token.Token {
    return token.Token{Kind: kind, Literal: string(ch), Position: p}
}

func isLetter(ch byte) bool {
    return 'a' <= ch && ch <= 'z'
}
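
To tie everything together, here is a small example of driving the lexer from a main package and printing every token until EOF. The import paths are an assumption about how the code is split into lexer and token packages, so adjust them to the actual repository layout:

package main

import (
    "fmt"

    "lambda/lexer" // assumed import paths
    "lambda/token"
)

func main() {
    l := lexer.New(`\x.\y.(x y)`)
    for {
        tok := l.Next()
        // Print position, kind and literal of each token.
        fmt.Printf("%d\t%d\t%q\n", tok.Position, tok.Kind, tok.Literal)
        if tok.Kind == token.EOF {
            break
        }
    }
}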

Conclusion

And that is already it for this part of the series. In the next part we will build the parser, which takes the tokens emitted by the lexer and composes them into the abstract syntax tree. You can find all the code shown here in the lambda GitHub repository.