Adventure in parserland – parsing lambda expressions in F# – Part V

We are now going to look at a solution which is concise, efficient and gives sophisticated error messages. It is also less than 20 lines of code. We’ll be using FParsec.

FParsec is a port of a Haskell library (Parsec). It is a parser combinator library or, as I like to think of it, an internal DSL to build parsers in F#. My usual disclaimer: I’m not an expert in FParsec. It is likely that, if you are an expert, you can come up with a more maintainable/efficient/elegant version of this parser.

The whole code is below:

let ws = " \t\n"
let specialChars = ".)(\\\n"

let pWs = spaces
let pName = manyChars (noneOf (ws + specialChars)) |>> EName

let pExpr, pExprRef = createParserForwardedToRef<Expression, Unit>()

let curry2 f a b = f(a,b)
let pFunction = pchar '\\' >>. pipe2 pName (pchar '.' >>. pExpr) (curry2 Function)

let pApplication = pchar '(' >>. pipe2 pExpr (pWs >>. pExpr) (curry2 Application)
                          .>> pWs .>> pchar ')'

do pExprRef := pFunction <|> pApplication <|> pName

let pExpressions = sepBy pExpr spaces1

This mirrors pretty closely the grammar we are trying to parse. A program is a bunch of expressions separated by whitespace.

let pExpressions = sepBy pExpr spaces1

sepBy is a combinator that takes a parser that defines what to parse and a parser that defines what the separator should be. Here the separator (spaces1) is one or more whitespace characters …
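
To get a feel for it, here is sepBy with a simpler parser (pint32 is FParsec’s built-in integer parser; the snippet is just an illustration, not part of our parser):

run (sepBy pint32 (pchar ',')) "1,2,3"
// Success: [1; 2; 3]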

pExpr is either a function, an application or a name. The operator <|> is a choice combinator that tries each parser in order. It tries the right parser only if the left parser fails without consuming input. So it doesn’t backtrack. If you need a backtracking parser, you’ll need to get acquainted with the attempt combinator (sketched after the code below).

do pExprRef := pFunction <|> pApplication <|> pName
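
Since it might not be obvious how this bites, here is a minimal sketch of the difference (pAB, pAC and pEither are made-up names for illustration, not part of the parser):

let pAB = pchar 'a' >>. pchar 'b'
let pAC = pchar 'a' >>. pchar 'c'

// On "ac", pAB <|> pAC fails outright: pAB consumes the 'a' before failing
// on 'c', so <|> never tries pAC. attempt restores the input on failure:
let pEither = attempt pAB <|> pAC     // parses both "ab" and "ac"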

A function starts with a ‘\’, then comes a name, a dot and an expression.

let pFunction = pchar '\\' >>. pipe2 pName (pchar '.' >>. pExpr) (curry2 Function)

I know that might look crazy (and maybe it is), but just bear with me. Someone, who I’m not sure I can name, once told me that functional programming is great for writing code, but terrible for reading and debugging it. The phrase stayed with me as containing a grain of truth. In any case, in the code above:

  • >>. is a combinator that says “use the left parser, discard its value and then use the right one, returning its value”. Try to guess what .>> does …
  • pipe2 is a combinator that says “apply the first parser, then the second parser, and then call a function passing as parameters the values returned by the two parsers”
  • curry2 is a function combinator that transforms a function that takes a tuple into a function that takes its parameters one by one (see the sketch after its definition below)
let curry2 f a b = f(a,b)
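
To make curry2 concrete, here it is applied to an ordinary tupled function (add and addCurried are made-up names for illustration):

let add (a, b) = a + b
let addCurried = curry2 add
addCurried 1 2                       // 3

// Function is a tupled constructor (Expression * Body -> Expression), so
// curry2 Function is the two-argument function that pipe2 expects.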

An application works similarly, but differently …

let pApplication = pchar '(' >>. pipe2 pExpr (pWs >>. pExpr) (curry2 Application)
                              .>> pWs .>> pchar ')'

The only difference is that now we have to consume the optional whitespace and the ‘)’ at the end. A rule of thumb that I use: reach for >>. and .>> to flow a single result through, and for pipeX when I need more than one result.
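
Hypothetically, if all we needed were the middle result between two delimiters, the dot operators alone would do and no pipeX would be required (pParenExpr is a made-up name, not used above):

// Keep only the expression between the parentheses, discard the delimiters
let pParenExpr = pchar '(' >>. pExpr .>> pchar ')'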

The last thing is pName, which consumes chars until it finds either a whitespace or a special char.

let ws = " \t\n"
let specialChars = ".)(\\\n"

let pWs = spaces
let pName = manyChars (noneOf (ws + specialChars)) |>> EName

And there you have it, a lexer and a parser, all in 20 lines of code. I don’t like the code that I wrote above much. I’m sure I could refine it plenty and it probably contains some bugs, but it gives an idea of what is possible with FParsec.
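
To try it out, FParsec’s run function does the job. On my reading of the code above, a quick session should look something like this:

run pExpressions @"\x.x (\y.y z)"
// Success: [Function (EName "x", EName "x");
//           Application (Function (EName "y", EName "y"), EName "z")]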

Adventure in parserland – parsing lambda expressions in F# – Part IV

Let’s now look at the parser. First let’s review the grammar:

    (*
        <expression> ::= <name> | <function> | <application> 
        <name> ::= non-blank character sequence 
        <function> ::= \ <name> . <body> 
        <body> ::= <expression> 
        <application> ::= ( <function expression> <argument expression> ) 
        <function expression> ::= <expression> 
        <argument expression> ::= <expression> 
    *)

And the data type to represent it:

type Name = string
and Body = Expression
and Function = Name * Expression
and FunctionExpression = Expression
and ArgumentExpression = Expression
and Expression = 
| EName of string
| Function of Expression * Body
| Application of FunctionExpression * ArgumentExpression
| EOT

In essence, the data type needs to store all the information required for subsequent stages of computation (i.e. beta reductions and such). The closer it is to the grammar, the better. In this case it looks pretty close.
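
For example, the program (\x.x y), the identity function applied to y, should come out of the parser as:

Application (Function (EName "x", EName "x"), EName "y")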

Remember the main goal of our parser:

let parseTextReader: TextReader -> seq<Expression> =
                    textReaderToLazyList >> tokenStream >> parseExpressions

We have already looked at textReaderToLazyList and tokenStream. Now it is time to look at parseExpressions. Its goal is to parse the LazyList<Token> and return a sequence of expressions. The choice of returning a sequence at this point is to make parseTextReader, which is the main function in the program, return a more ‘standard’ type.

and parseExpressions tokens = seq {
   let tokens = parseOptionalWs tokens
   let expr, tokens = parseExpr tokens
   let tokens = parseOptionalWs tokens
   match expr with
    | EOT   -> yield EOT
    | exp   -> yield exp; yield! parseExpressions tokens }

parseOptionalWs simply skips ahead over whatever whitespaces it finds.

and parseOptionalWs tokens = match tokens with
                                | LazyList.Nil -> LazyList.empty
                                | LazyList.Cons(h, t) -> 
                                    match h with
                                       | Ws _ -> parseOptionalWs t
                                       | _ -> tokens

parseExpr is more interesting. It is the main switch that creates expression kinds.

let rec parseExpr tokens = match tokens with
                            | LazyList.Nil -> EOT, LazyList.empty
                            | LazyList.Cons(h, t) ->
                                match h with
                                    | EOF -> parseEOF tokens
                                    | Name _ -> parseName  tokens
                                    | Lambda -> parseFunction tokens
                                    | OpenParens -> parseApplication tokens
                                    | token -> errorAtStart "Expression" token

parseEOF is not.

and parseEOF tokens = EOT, LazyList.empty

parseName just returns an EName, unwrapping it from the Name token.

and parseName tokens = EName (head tokens |> unwrapName), tail tokens

unwrapName just unwraps it.

let unwrapName = function
    | Name(s) -> s
    | tok -> errorExpecting "a Name" <| writeToken tok

parseFunction just consumes a Lambda, a name, a Dot token and a body (e.g. \x.x) and assembles them in a Function:

and parseFunction tokens =
    let tokens = consumeToken Lambda tokens
    let name, tokens = parseName tokens
    let tokens = consumeToken Dot tokens
    let body, tokens = parseExpr tokens
    Function(name, body), tokens

consumeToken tries to consume a token generating an error if it doesn’t find it:

let consumeToken token = 
    genericConsumeToken
        // report the token actually found, not the expected one
        (fun token' tokens' -> errorExpecting (writeToken token') (tokens' |> LazyList.head |> writeToken))
        token

genericConsumeToken is just a generalization of the function above:

let genericConsumeToken noMatch token = function
    | LazyList.Nil -> errorEOF (writeToken token)
    | LazyList.Cons(h, t) as originalTokens ->
        match h with
        | tok when tok = token -> t
        | tok -> noMatch token originalTokens

The last thing left to consume is an application, which is of the form (func arg):

and parseApplication tokens =
    let tokens = consumeToken OpenParens tokens
    let funExpr, tokens = parseExpr tokens
    let tokens = parseOptionalWs tokens
    let argExpr, tokens = parseExpr tokens
    let tokens = consumeToken CloseParens tokens
    Application(funExpr, argExpr), tokens

Various error and utility functions are defined below:

let errorEOF expecting = failwith ("Expected " + expecting + ", got EOF")
let errorExpecting expecting gotToken = failwith ("Expected " + expecting + ", got " + gotToken)
let errorAtStart expecting gotToken = failwith ("Expected " + expecting + ", which cannot start with " + writeToken gotToken)

let tail = LazyList.tail
let head = LazyList.head
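
As a hypothetical smoke test, putting it all together with the functions from the previous parts (System.IO is needed for StringReader):

open System.IO

use reader = new StringReader(@"\x.x")
parseTextReader reader |> Seq.toList
// [Function (EName "x", EName "x"); EOT]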

And that is the parser. All 100+ lines of it. As you can tell, it is rather formulaic to go from a grammar to a lexer and a parser, which is why you shouldn’t do it by hand: let a tool generate the code for you given the grammar, or use FParsec.

We have written 200+ lines of code and I don’t think we can be too proud of our achievement. It is:

  • Certainly buggy
  • Primitive in error handling
  • Not tail recursive (a big input is likely to blow up our stack)
  • Probably inefficient

So let’s look next at a better way to do it.

Adventure in parserland – parsing lambda expressions in F# – Part III

Let’s start from the lexer. Remember, I wrote this code based on my memory of how a lexer ought to look. I didn’t go back and reread the relevant chapters in the Dragon book. But I think it came out all right after all.

The tokenStream function we looked at last time takes a LazyList<char> and returns a LazyList<Token>. It uses the unfold method on LazyList to repeatedly call matchToken until the stream of characters is empty.

let rec tokenStream chars = 
    LazyList.unfold
        (fun chList ->
            match chList with
            | LazyList.Nil -> None
            | chList ->  
                let token, chList' = matchToken chList
                Some(token, chList')
        )
        chars 

A token is what gets passed up to the parser to do syntactic analysis on. It is the vocabulary of our language. The lexer divides a phrase into words; the parser puts the words together into phrases. So, these are the words.

type Token =
    | Name of string
    | Dot
    | OpenParens
    | CloseParens
    | Lambda
    | Def
    | Ws of string
    | NewLine
    | EOF

Matching is a process whereby you try to return the token that you have read plus the list of characters yet to be read. Matching a Token is defined below:

let matchToken = function 
    | LazyList.Nil                 -> EOF, LazyList.empty 
    | LazyList.Cons(h, t) as chars ->
        match h with
        | ch when isWs ch -> matchWs chars
        | ch when isSpecialChar ch -> matchSpecialChar ch t
        | _ -> matchString chars

A token is either nothing (end of input), a whitespace, a special char or anything else (a name).

Let’s look at what matching each one of them means.  Matching whitespaces means consuming them and remembering what was consumed.

let matchWs chars = 
    let value, remainingChars = matchSeriesOfChars isWs chars
    Ws value, remainingChars

matchSeriesOfChars takes a predicate and a LazyList of chars and returns the string composed of all the consecutive chars for which the predicate is true, plus, as always, the remaining chars to be matched. In this case the predicate returns true if the char is a whitespace.

To write matchSeriesOfChars I needed a function that reverses a LazyList. Not having found such a thing, I wrote it.

let reversell l =
    let rec go l' a = match l', a with
                        | LazyList.Nil, a -> a
                        | LazyList.Cons(h, t), a -> go t (LazyList.cons h a)
    go l LazyList.empty

Then I wrote matchSeriesOfChars. The function uses an accumulator: it adds to the front whenever the predicate is true, then reverses the result and translates it to a string (I could have reversed the string instead, which might have been better).

let matchSeriesOfChars comparer chars =
    let rec go result = function
        | LazyList.Nil    -> charListToString(reversell result), LazyList.empty 
        | LazyList.Cons(h, t) -> if comparer h then go (LazyList.cons h result) t
                                 else charListToString (reversell result), LazyList.cons h t
    go LazyList.empty chars
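
charListToString is not shown in the post; presumably it is something like this sketch:

// Assumed helper: turn the accumulated chars into a string
let charListToString (chars: LazyList<char>) = System.String(Array.ofSeq chars)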

These are the predicates we’ll use later on to recognize characters:

let isInString (ch: char) (s: string) = s.IndexOf(ch) <> -1
let isWs (chr: char) = isInString chr wsChars
let isNameChar (chr: char) = not (isInString chr (wsChars + specialChars))
let isSpecialChar ch = isInString ch specialChars

wsChars and specialChars are defined below:

let wsChars = " \t"
let charTokens =
    Map.ofList [
        '.' , Dot
        '(' , OpenParens
        ')' , CloseParens
        '\\', Lambda
        '\n', NewLine
    ]
let specialChars = charTokens |> Map.fold (fun s k v -> s + k.ToString()) ""

Getting back to the more important matching functions, matching a special character is defined as a simple lookup in the charTokens map:

let matchSpecialChar ch chars = Map.find ch charTokens, chars

We are left with matchString, which simply matches characters until it finds a char that cannot be part of a name. It then looks the result up in a map of special strings. If it finds one, it returns it; otherwise it just returns the name.

let stringTokens =
    Map.ofList [
        "Def", Def    
    ]
let matchString chars =
    let value, remainingChars = matchSeriesOfChars isNameChar chars
    let specialString = Map.tryFind value stringTokens
    if specialString.IsSome
        then specialString.Value, remainingChars
        else Name(value), remainingChars

And we are done with the lexer, all of 100+ lines of it …
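
As a quick hypothetical check of the lexer at the F# prompt (LazyList.ofSeq comes with the LazyList implementation):

@"\x.x" |> LazyList.ofSeq |> tokenStream |> LazyList.toList
// [Lambda; Name "x"; Dot; Name "x"]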