Funky C for literate programming

1 Main ideas

This is a port of LLIte in C. The reason for it is to experiment with writing functional code in standard C and compare the experience with using a functional language like F#. It is in a way a continuation of this and this posts.

I will be using glib and an header of convenient macros/functions to help me (lutils.h). I don’t think that is cheating. Any modern C praticoner has its bag of tricks …

Don’t tell me this is not idiomatic C. I already know that.

#include <string.h>
#include <stdbool.h>

#include <glib.h>
#include <glib/gprintf.h>

#ifdef ARENA
#include "arena.h"
#endif

#include "lutils.h"

2 Lack of tuples

In the snippet below I overcomed such deficiency by declaring a struct. Using the new constructor syntax makes initializing a static table simple.

typedef struct LangSymbols { char language[40]; char start[10]; char end[10];} LangSymbols;

static
LangSymbols* s_lang_params_table[] = {
    &(LangSymbols) {.language = "fsharp",   .start = "(*" "*", .end = "*" "*)"},
    &(LangSymbols) {.language = "c",        .start = "/*" "*", .end = "*" "*/"},
    &(LangSymbols) {.language = "csharp",   .start = "/*" "*", .end = "*" "*/"},
    &(LangSymbols) {.language = "java",     .start = "/*" "*", .end = "*" "*/"},
    NULL
};

3 Folding over arrays

I need to gather all the languages, aka perform a fold over the array. You might have noticed the propensity to add a NULL terminator marker to arrays (as for strings). This allows me to avoid passing a size to functions and makes simpler writing utility macros (as foreach below) more simply.

In the rest of the program, every time I end a function with _z, it is because I consider it generally usable and I add a version of it without the _z to lutils.h.

#define array_foreach_z(p) for(; *symbols != NULL; ++symbols)

static
char* summary(LangSymbols** symbols) {

    GString* langs = g_string_sized_new(20);
    array_foreach(symbols) g_string_append_printf(langs, "%s ", (*symbols)->language);

    g_string_truncate(langs, strlen(langs->str) - 1);

    GString* usage = g_string_sized_new(100);

    g_string_printf(usage,
        "You should specify:nt. either -l or -o and -pn"
        "t. either -indent or -P and -Cn"
        "t. -l supports: %s"
        ,langs->str);

    return usage->str;
}

Find an item in an array based on some expression. Returns NULL if not found. Again, this is a common task, hence I’ll abstract it out with a macro (that ends up being a cute use of gcc statment expressions).

#define array_find_z(arr, ...)                          
    ({                                                  
        array_foreach(arr) if (__VA_ARGS__) break;      
        *arr;                                           
    })

static
LangSymbols* lang_find_symbols(LangSymbols** symbols, char* lang) {
    g_assert(symbols);
    g_assert(lang);

    return array_find(symbols, !strcmp((*symbols)->language, lang));
}

4 Deallocating stuff

You might wonder why I don’t seem overly worried about deallocating the memory that I allocate. I haven’t gone crazy(yet). You’ll see.

5 Discriminated unions

Here are the discriminated unions macros from a previous blog post of mine. I’ll need a couple of these and pre-declare two functions.

union_decl(CodeSymbols, Indented, Surrounded)
    union_type(Indented,    int indentation;)
    union_type(Surrounded,  char* start_code; char* end_code;)
union_end(CodeSymbols);

typedef struct Options {
    char*           start_narrative;
    char*           end_narrative;
    CodeSymbols*    code_symbols;
} Options;

static
gchar* translate(Options*, gchar*);

union_decl(Block, Code, Narrative)
    union_type(Code,        char* code)
    union_type(Narrative,   char* narrative)
union_end(Block);

6 Main data structure

We want to use higher level abstractions that standard C arrays, hence we’ll pick a convenient data structure to use in the rest of the code. A queue lets you to insert at the front and back, with just a one pointer overhead over a single linked list. Hence it is my data structure of choice for this program.

static
GQueue* blockize(Options*, char*);

There is already a function in glib to check if a string has a certain prefix (g_str_has_prefix). We need one that returns the remaining string after the prefix. We also define a g_slow_assert that is executed just if G_ENABLE_SLOW_ASSERT is defined

static
char* str_after_prefix(char* src, char* prefix) {
    g_assert(src);
    g_assert(prefix);
    g_slow_assert(g_str_has_prefix(src, prefix));

    while(*prefix != '0')
        if(*src == *prefix) ++src, ++prefix;
        else break;

    return src;
}

7 Tokenizer

The structure of the function is identical to the F# version. The big bread-winners are statement expressions and local functions …

It is interesting how you can replicate the shape of an F# function by substituting ternary operators for match statements.

It is nothing magic, just a way to have a case statment as an expression, but it is suggestive of its more functional counterpart.

#define NL "n"

union_decl(Token, OpenComment, CloseComment, Text)
    union_type(OpenComment, int line)
    union_type(CloseComment,int line)
    union_type(Text,        char* text)
union_end(Token);

GQueue* tokenize(Options* options, char* source) {
    g_assert(options);
    g_assert(source);

    struct tuple { int line; GString* acc; char* rem;};

    bool is_opening(char* src)      { return g_str_has_prefix(src, options->start_narrative);}
    bool is_closing(char* src)      { return g_str_has_prefix(src, options->end_narrative);}
    char* remaining_open (char* src){ return str_after_prefix(src, options->start_narrative);}
    char* remaining_close(char* src){ return str_after_prefix(src, options->end_narrative);}

    struct tuple text(char* src, GString* acc, int line) {
        inline struct tuple stop_parse_text()
            { return (struct tuple) {.line = line, .acc = acc, .rem = src};}

        return  str_empty (src)? stop_parse_text() :
                is_opening(src)? stop_parse_text() :
                is_closing(src)? stop_parse_text() :
                                ({
                                  int line2         = g_str_has_prefix(src, NL) ? line + 1
                                                                                : line;
                                  GString* newAcc   = g_string_append_c(acc, *src);
                                  char* rem         = src + 1;
                                  text(rem, newAcc, line2);
                                });
    }

    GQueue* tokenize_rec(char* src, GQueue* acc, int line) {
        return  str_empty(src)  ?   acc                     :
                is_opening(src) ?   tokenize_rec(remaining_open(src),
                                        g_queue_push_back(acc, union_new(
                                                    Token, OpenComment, .line = line)),
                                        line)        :
                is_closing(src) ?   tokenize_rec(remaining_close(src),
                                               g_queue_push_back(acc, union_new(
                                                    Token, CloseComment, .line = line)),
                                        line)        :
                                ({
                                    struct tuple t = text(src, g_string_sized_new(200), line);
                                    tokenize_rec(t.rem,
                                        g_queue_push_back(acc, union_new(
                                                    Token, Text, .text = t.acc->str)), t.line);
                                 });
    }

    return tokenize_rec(source, g_queue_new(), 1);
}

8 Parser

This again has a similar structure to the F# version, just longer. It is very long because it contains 3 (nested) functions which are on the verbose side in C.

The creation of a error macro is unfortunate. I just don’t know how to adapt g_assert_e so that it works for not pointer returning functions.

I also need a simple function report_error to exit gracefully giving a message to the user. I didn’t found such thing in glib (?)

#define report_error_z(...) G_STMT_START { g_print(__VA_ARGS__); exit(1); } G_STMT_END                                                            

union_decl(Chunk, NarrativeChunk, CodeChunk)
    union_type(NarrativeChunk,  GQueue* tokens)
    union_type(CodeChunk,       GQueue* tokens)
union_end(Chunk);

static
GQueue* parse(Options* options, GQueue* tokens) {
    g_assert(options);
    g_assert(tokens);

    struct tuple { GQueue* acc; GQueue* rem;};

    #define error(...) 
        ({ report_error(__VA_ARGS__); (struct tuple) {.acc = NULL, .rem = NULL}; })

    struct tuple parse_narrative(GQueue* acc, GQueue* rem) {

        bool isEmpty    = g_queue_is_empty(rem);
        Token* h        = g_queue_pop_head(rem);
        GQueue* t       = rem;

        return  isEmpty                 ?
                                    error("You haven't closed your last narrative comment") :
                h->kind == OpenComment  ?
                    error("Don't open narrative comments inside narrative comments at line %i",
                          h->OpenComment.line)                                              :
                h->kind == CloseComment ? (struct tuple) {.acc = acc, .rem = t}             :
                h->kind == Text         ? parse_narrative(g_queue_push_back(acc, h), t)     :
                                          error("Should never get here");
    };

    struct tuple parse_code(GQueue* acc, GQueue* rem) {

        bool isEmpty    = g_queue_is_empty(rem);
        Token* h    = g_queue_pop_head(rem);
        GQueue* t   = rem;

        return  isEmpty                 ? (struct tuple) {.acc = acc, .rem = t}         :
                h->kind == OpenComment  ?
                    (struct tuple) {.acc = acc, .rem = g_queue_push_front(rem, h)}      :
                h->kind == CloseComment ? parse_code(g_queue_push_back(acc, h), rem)    :
                h->kind == Text         ? parse_code(g_queue_push_back(acc, h), rem)    :
                                          error("Should never get here");
    };
    #undef error

    GQueue* parse_rec(GQueue* acc, GQueue* rem) {

        bool isEmpty    = g_queue_is_empty(rem);
        Token* h    = g_queue_pop_head(rem);
        GQueue* t   = rem;

        return  isEmpty                 ? acc                                           :
                h->kind == OpenComment  ? ({
                                           GQueue* emp = g_queue_new();
                                           struct tuple tu = parse_narrative(emp, t);
                                           Chunk* ch = union_new(
                                                Chunk, NarrativeChunk, .tokens = tu.acc );
                                           GQueue* newQ = g_queue_push_back(acc, ch);
                                           parse_rec(newQ, tu.rem);
                                           })                                            :
                h->kind == CloseComment ?
                    report_error_e(
                        "Don't insert a close narrative comment at the start of your"
                        " program at line %i",
                                            h->OpenComment.line)                         :
                h->kind == Text         ?
                                        ({
                                           GQueue* emp = g_queue_new();
                                           struct tuple tu =
                                                parse_code(g_queue_push_front(emp, h), t);
                                           parse_rec(g_queue_push_back
                                            (acc,
                                             union_new(Chunk, CodeChunk, .tokens = tu.acc)),
                                             tu.rem);
                                          })                                                               :
                                          g_assert_no_match;
    }

    return parse_rec(g_queue_new(), tokens);
}

9 Flattener

This follows the usual practice of representing fold as foreach statments (and maps to). Pheraps I shall build better abstractions for them at some point. I also introduce a little macro to simplify writing of GFunc lambdas, given how pervasive they are.

Again, note how heavy ternary operated this is …

#define g_func_z(type, name, ...) lambda(void,                                              
                                        (void* private_it, G_GNUC_UNUSED void* private_no){ 
                                       type name = private_it;                              
                                       __VA_ARGS__                                          
                                })

static
GQueue* flatten(Options* options, GQueue* chunks) {
    GString* token_to_string_narrative(Token* tok) {
        return  tok->kind == OpenComment ||
                tok->kind == CloseComment   ?
                    report_error_e("Cannot nest narrative comments at line %i",
                                   tok->OpenComment.line)                                   :
                tok->kind == Text           ? g_string_new(tok->Text.text)                  :
                                              g_assert_no_match;
    }
    GString* token_to_string_code(Token* tok) {
        return  tok->kind == OpenComment    ?
                report_error_e(
                    "Open narrative comment cannot be in code at line %i."
                    " Pheraps you have an open comment "
                    "in a code string before this comment tag?"
                    , tok->OpenComment.line)                                                :
                tok->kind == CloseComment   ? g_string_new(options->end_narrative)          :
                tok->kind == Text           ? g_string_new(tok->Text.text)                  :
                                              g_assert_no_match;
    }
    Block* flatten_chunk(Chunk* ch) {
        return  ch->kind == NarrativeChunk  ? ({
                               GQueue* tokens = ch->NarrativeChunk.tokens;
                               GString* res = g_string_sized_new(256);
                               g_queue_foreach(tokens, g_func(Token*, tok,
                                                g_string_append(
                                                    res,
                                                    token_to_string_narrative(tok)->str);
                                                ), NULL);
                               union_new(Block, Narrative, .narrative = res->str);
                                               })   :
                ch->kind == CodeChunk       ? ({
                               GQueue* tokens = ch->CodeChunk.tokens;
                               GString* res = g_string_sized_new(256);
                               g_queue_foreach(tokens, g_func(Token*, tok,
                                                        g_string_append(
                                                            res,
                                                            token_to_string_code(tok)->str);
                                                        ), NULL);
                               union_new(Block, Code, .code = res->str);
                                               })   :
                               g_assert_no_match;
    }

    GQueue* res = g_queue_new();
    g_queue_foreach(chunks, g_func(Chunk*, ch,
                                Block* b = flatten_chunk(ch);
                                g_queue_push_tail(res, b);
                                ) ,NULL);
    return res;
}

Now we can tie everything together to build blockize, which is our parse tree.

static
GQueue* blockize(Options* options, char* source) {
    GQueue* tokens  = tokenize(options, source);
    GQueue* blocks  = parse(options, tokens);
    return flatten(options, blocks);
}

10 Define the phases

In C you can easily forward declare function, so you don’t have to come up with some clever escabotage like we had to do in F#.

static
GQueue* remove_empty_blocks(Options*, GQueue*);
static
GQueue* merge_blocks(Options*, GQueue*);
static
GQueue* add_code_tags(Options*, GQueue*);

static
GQueue* process_phases(Options* options, GQueue* blocks) {

    blocks          = remove_empty_blocks(options, blocks);
    blocks          = merge_blocks(options, blocks);
    blocks          = add_code_tags(options, blocks);
    return blocks;
}

static
char* extract(Block* b) {
    return  b->kind == Code         ? b->Code.code          :
            b->kind == Narrative    ? b->Narrative.narrative:
                                      g_assert_no_match;
}

There must be a higher level way to write this utility function …

static
bool is_str_all_spaces(const char* str) {
    g_assert(str);
    while(*str != '0') {
        if(!g_ascii_isspace(*str))
            return false;
        str++;
    }
    return true;
}

static
GQueue* remove_empty_blocks(G_GNUC_UNUSED Options* options, GQueue* blocks) {

    g_queue_foreach(blocks, g_func(Block*, b,
        if(is_str_all_spaces(extract(b)))
            g_queue_remove(blocks, b);
                                   ), NULL);
    return blocks;
}

static
GQueue* merge_blocks(G_GNUC_UNUSED Options*options, GQueue* blocks) {
    return  g_queue_is_empty(blocks)            ? blocks            :
            g_queue_get_length(blocks) == 1     ? blocks            :
                ({
                 Block* h1 = g_queue_pop_head(blocks);
                 Block* h2 = g_queue_pop_head(blocks);
                 h1->kind == Code && h2->kind == Code ? ({
                     char* newCode =
                        g_strjoin("", h1->Code.code, NL, h2->Code.code, NULL);
                     Block* b = union_new(Block, Code, .code = newCode);
                     merge_blocks(options, g_queue_push_front(blocks, b));
                                                         })         :
                 h1->kind == Narrative && h2->kind == Narrative ? ({
                     char* newNarr =
                        g_strjoin(
                            "", h1->Narrative.narrative, NL, h2->Narrative.narrative, NULL);
                     Block* b = union_new(Block, Narrative, .narrative = newNarr);
                     merge_blocks(options, g_queue_push_front(blocks, b));
                                                         })         :
                                                         ({
                     GQueue* newBlocks =
                        merge_blocks(options, g_queue_push_front(blocks, h2));
                     g_queue_push_front(newBlocks, h1);
                                                         });
                 });
}

This really should be in glib …

inline static
gint g_asprintf_z(gchar** string, gchar const *format, ...) {
	va_list argp;
	va_start(argp, format);
	gint bytes = g_vasprintf(string, format, argp);
	va_end(argp);
    return bytes;
}

static
char* indent(int n, char* s) {
    g_assert(s);

    char* ind       = g_strnfill(n, ' ');
    char* tmp;
    g_asprintf(&tmp, "%s%s", ind, s);

    char* withNl;
    g_asprintf(&withNl, "n%s", ind);

    return g_strjoinv(withNl, g_strsplit(tmp, NL, -1));
}

And finally I ended up defining map. See if you like how the usage looks in the function below.

#define g_queue_map_z(q, type, name, ...) ({                                
        GQueue* private_res = g_queue_new();                                
        g_queue_foreach(q, g_func(type, name,                               
            name = __VA_ARGS__;                                             
            g_queue_push_tail(private_res, name);                           
            ), NULL);                                                       
        private_res;                                                        
                                      })

static
GQueue* add_code_tags(Options* options, GQueue* blocks) {

    GQueue* indent_blocks(GQueue* blocks) {
        return g_queue_map(blocks, Block*, b,
                b->kind == Narrative ? b                                                                                                    :
                b->kind == Code      ?
                    union_new(Block, Code, .code =
                        indent(options->code_symbols->Indented.indentation, b->Code.code))    :
                    g_assert_no_match;);
    }

    GQueue* surround_blocks(GQueue* blocks) {
        return g_queue_map(blocks, Block*, b,
                b->kind == Narrative ?
                    union_new(Block, Narrative, .narrative =
                        g_strjoin("", NL, g_strstrip(b->Narrative.narrative), NL, NULL))   :
                b->kind == Code      ?
                    union_new(Block, Code, .code = g_strjoin("",
                                                 NL,
                                                 options->code_symbols->Surrounded.start_code,
                                                 NL,
                                                 g_strstrip(b->Code.code),
                                                 NL,
                                                 options->code_symbols->Surrounded.end_code,
                                                 NL,
                                                 NULL))    :
                                       g_assert_no_match;);

    }

    return  options->code_symbols->kind == Indented     ?   indent_blocks(blocks)   :
            options->code_symbols->kind == Surrounded   ?   surround_blocks(blocks) :
                                                            g_assert_no_match;
}

char* stringify(GQueue* blocks) {
    GString* res = g_string_sized_new(2048);
    g_queue_foreach(blocks, g_func(Block*, b,
        g_string_append(res, extract(b));
    ), NULL);
    return g_strchug(res->str);
}

void deb(GQueue* q);

static
char* translate(Options* options, char* source) {
    g_assert(options);
    g_assert(source);

    GQueue* blocks  = blockize(options, source);
    blocks          = process_phases(options, blocks);
    return stringify(blocks);
}

11 Parsing the command line

In glib there is a command line parser that accept options in unix-like format and automatically produces professional --help messages and such. We shoudl really have something like this in .NET. Pheraps we do and I’m not aware of it?

typedef struct CmdOptions { char* input_file; char* output_file; Options* options;} CmdOptions;

static
CmdOptions* parse_command_line(int argc, char* argv[]);

static char *no = NULL, *nc = NULL, *l = NULL, *co = NULL, *cc = NULL, *ou = NULL;
static char** in_file;

static int ind = 0;
static bool tests = false;

// this is a bug in gcc, fixed in 2.7.0 not to moan about the final NULL
#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wmissing-field-initializers"

static GOptionEntry entries[] =
{
  { "language"          , 'l', 0, G_OPTION_ARG_STRING, &l ,
                                "Language used", "L"  },
  { "output"            , 'o', 0, G_OPTION_ARG_FILENAME, &ou,
                                "Defaults to the input file name with mkd extension", "FILE" },
  { "narrative-open"    , 'p', 0, G_OPTION_ARG_STRING, &no,
                                "String opening a narrative comment",   "NO" },
  { "narrative-close"   , 'c', 0, G_OPTION_ARG_STRING, &nc,
                                "String closing a narrative comment",   "NC" },
  { "code-open"         , 'P', 0, G_OPTION_ARG_STRING, &co,
                                "String opening a code block",          "CO" },
  { "code-close"        , 'C', 0, G_OPTION_ARG_STRING, &cc,
                                "String closing a code block",          "CC" },
  { "indent"            , 'i', 0, G_OPTION_ARG_INT,    &ind,
                                "Indent the code by N whitespaces",    "N"  },
  { "run-tests"         , 't', G_OPTION_FLAG_HIDDEN, G_OPTION_ARG_NONE,   &tests,
                                "Run all the testcases", NULL },
  { G_OPTION_REMAINING  ,   0, 0, G_OPTION_ARG_FILENAME_ARRAY, &in_file,
                                "Input file to process",   "FILE" },
  { NULL }
};
#pragma GCC diagnostic pop

Brain damaged way to run tests with a -t hidden option. Not paying the code size price in release.

#ifndef NDEBUG
#include "tests.c"
#endif

Here is my big ass command parsing function. It could use a bit of refactoring …

void destroy_arena_allocator();

static
CmdOptions* parse_command_line(int argc, char* argv[]) {

    GError *error = NULL;
    GOptionContext *context;

    context =
        g_option_context_new ("- translate source code with comemnts to an annotated file");
    g_option_context_add_main_entries (context, entries, NULL);
    g_option_context_set_summary(context, summary(s_lang_params_table));

    if (!g_option_context_parse (context, &argc, &argv, &error))
        report_error("option parsing failed: %s", error->message);

    CmdOptions* opt = g_new(CmdOptions, 1);
    opt->options = g_new(Options, 1);

    #ifndef NDEBUG
    if(tests) {
        int i = run_tests(argc, argv);
        exit(i);
    }
    #endif

    if(!in_file) report_error("No input file");
    opt->input_file = *in_file;

    // Uses input file without extension, adding extension .mkd (assume markdown)
    opt->output_file = ou ? ou :  ({
                                  char* output      = g_strdup(*in_file);
                                  char* extension   = g_strrstr(output, ".");
                                  extension ? ({
                                               *extension = '0';
                                               g_strjoin("", output, ".mkd", NULL);
                                                }) :
                                               g_strjoin("", output, ".mkd", NULL);
                                  });

    if(l) { // user passed a language
        LangSymbols* lang = lang_find_symbols(s_lang_params_table, l);
        if(!lang) report_error("%s is not a supported language", l);

        opt->options->start_narrative  = lang->start;
        opt->options->end_narrative    = lang->end;

    } else {
        if(!no || !nc) report_error("You need to specify either -l, or both -p and -c");

        opt->options->start_narrative  = no;
        opt->options->end_narrative    = nc;
    }

    if(ind) { // user pass    g_option_context_free();
        opt->options->code_symbols = union_new(CodeSymbols, Indented, .indentation = ind);
    } else {
        if(!co || !cc) report_error("You need to specify either -indent, or both -P and -C");
        opt->options->code_symbols =
            union_new(CodeSymbols, Surrounded, .start_code = co, .end_code = cc);
    }

    return opt;
}

Some windows programs (i.e. notepad, VS, …) add a 3 bytes prelude to their utf-8 files, C doesn’t know anything about it, so you need to strip it. On this topic, I suspect the program works on UTF-8 files that contain non-ASCII chars, even if when I wrote it I didn’t know anything about localization.

It should work because I’m just splitting the file when I see a certain ASCII string and in UTF-8 ASCII chars cannot appear anywhere else than in their ASCII position.

char* skip_utf8_bom(char* str) {
    unsigned char* b = (unsigned char*) str;
    return  b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF    ? (char*) &b[3]  : // UTF-8
                                                              (char*) b;
}

12 Not freeing memory (again)

The reason I haven’t been freeing memory all along is because I was planning on using an arena allocator (a kind of linear allocator).

Memory management is fully hortogonal to the style of programming described in this post. You can do it whatever way you prefer, but there is a certain affinity between an arena allocator (or garbage collection) and functional programming because of the temporary objects created in expressions. You could create the temporary objects explicitely, but that would diminish the conciseness of the paradigm.

I have an arena allocator implementation here. In the code below I comment it out so that you don’t have a dependency from that code if you want to try this. The program runs so quickly and it does so little that you can probably let the operating system reclame memory at the end of the process life.

If you ended up integrating this with an editor (i.e. literate programming editing), you’d need to be more careful.

#ifdef ARENA

Arena_T the_arena;

inline static
gpointer arena_malloc(gsize n_bytes) {
    return Arena_alloc(the_arena, n_bytes, __FILE__, __LINE__);
}

inline static
gpointer arena_calloc(gsize n_blocks, gsize n_block_bytes) {
    return Arena_calloc(the_arena, n_blocks, n_block_bytes, __FILE__, __LINE__);
}

inline static
gpointer arena_realloc(gpointer mem, gsize n_bytes) {
    return Arena_realloc(the_arena, mem, n_bytes, __FILE__, __LINE__);
}

void arena_free(G_GNUC_UNUSED gpointer mem) {
    // NOP
}

void set_arena_allocator() {
    GMemVTable vt = (GMemVTable) { .malloc = arena_malloc,      .calloc = arena_calloc,
                                   .realloc = arena_realloc,    .free = arena_free,
                                   .try_malloc = arena_malloc,  .try_realloc = arena_realloc};
    g_mem_set_vtable(&vt);

    the_arena = Arena_new();
}

void destroy_arena_allocator() {
    Arena_dispose(&the_arena);
}

#endif

13 Summary

I have to say, it didn’t feel too cumbersome to structure C code in a functional way, assuming that you can use GLib and a couple of GCC extensions to the language. It certainly doesn’t have the problems that C++ has in terms of debugging STL failures.

There are a couple of things I don’t like about GLib and I’m working on an hobby project to overcome them. Eventually I’ll post it.

int main(int argc, char* argv[])
{
#ifdef ARENA
    set_arena_allocator();
#endif

    CmdOptions* opt = parse_command_line(argc, argv);

    char* source    = NULL;
    GError* error   = NULL;

    if(!g_file_get_contents(opt->input_file, &source, NULL, &error))
        report_error(error->message);

    source = skip_utf8_bom(source);

    char* text              = translate(opt->options, source);

    if(!g_file_set_contents(opt->output_file, text, -1, &error))
        report_error(error->message);

#ifdef ARENA
    destroy_arena_allocator();
#endif

    return 0;
}

LLite : language friendly literate programming

1 Main ideas

The code for this post is here. The source used to generate it is here. I also attached a pdf file to give an idea of the final result.

My interest in literate programming comes from some realizations on my part:

  • When I go back to code that I have written some time ago, I don’t remember my reasoning
  • When I write a blog post, my code seems to be better. Perhaps explaining things to people encourages me to be more precise
  • I like to think top down, but the compiler forces me to write code bottom up, starting from details and going to higher level concepts

1.1 Unhappiness with existing tools

Many of the existing literate programming tools work similarly to the original CWeb.

  • They have a tangle program that goes over your file and extract something that the compiler can understand
  • They have a weave program that extracts from your file something that the document generator can understand

This scheme has the unfortunate limitation of breaking your code editor. Given that your file is not a valid code file anymore, the editor starts misbehaving (i.e. intellisense breaks). The debugger starts to get confused (albeit people tried to remediate that with cleaver use of #line. If your language has an interactive console, that would not work either.

1.2 A different interpretation

The main idea of this program is to add your narrative to the comment part of a code file by extending the comment tag (i.e. in C you could use /** ). This keeps editor, debugger and interactive console working.

The weave phase as been retained and what you are reading is the program that goes over your code file and extracts a nicely formatted (for this program in markdown format) file that can then be translated to HTML, PDF, latex, etc…

You got that? The document you are reading now is the program.

1.3 Multi-language, multi-document format

LLite works for any programming language, assuming it has open and close comment character sequences, and any documentation format, assuming it has open and close code character sequences (aka allows you to delimitate your code somehow), or it needs the code to be indented. This document uses markdown (with Pandoc extensions to generate table of contents and titles).

1.4 Usage

You invoke the program as documented below. The first set of parameters lets you choose the symbols that delimitate your language comments (or the default symbols below). The second set of parameters lets you choose how your target documentation language treats code. Either it delimits it with some symbols or it indents it.

module LLite

let langParamsTable     = [ "fsharp", ("(*" + "*", "*" + "*)") // The + is not to confuse the parser
                            "c", ("/**", "**/")
                            "csharp", ("/**", "**/")
                            "java", ("/**", "**/")] |> Map.ofList

let languages = langParamsTable |> Map.fold (fun state lang _ -> state + lang + " ") ""

let usage   = sprintf @"
Usage: llite inputFile parameters
where:
One of the following two sets of parameters is mandatory
    -no string : string opening a narrative comment
    -nc string : string closing a narrative comment
or
    -l language: where language is one of (%s)

One of the following two sets of parameters is mandatory
    -co string : string opening a code block
    -cc string : string closing a code block
or
    -indent N  : indent the code by N whitespaces

The following parameters are optional:
    -o outFile : defaults to the input file name with mkd extension" languages

let getLangNoNC lang    =
    match Map.tryFind lang langParamsTable with
    | Some(no, nc) -> no, nc
    | None -> failwith (lang + " is not a valid programming language")

1.5 Programming Languages limitations

One of the main tenets of literate programming is that the code should be written in the order that facilitates exposition to a human reader, not in the order that makes the compiler happy. This is very important.

If you have written a blog post or tried to explain a codebase to a new joiner, you must have noticed that you don’t start from the top of the file and go down, but jump here and there trying to better explain the main concepts. Literate programming says that you should write your code the same way. But in our version of it, the compiler needs to be kept happy because the literate file is the code file.

Some ingenuity is required to achieve such goal:

  • In C and C++ you can forward declare functions and classes, also class members can be in any order
  • In C#, Java, VB.NET, F# (the object oriented part) you can write class members in any order
  • In the functional part of F# you do have a problem (see later in this doc)

The F# trick below is used in the rest of the program. You’ll understand its usage naturally by just reading the code

let declare<'a>  = ref Unchecked.defaultof<'a>

2 Implementation

At the core, this program is a simple translator that takes some code text and return a valid markdown/whatever text. We need to know:

  • The strings that start and end a narrative comment (input symbols)
  • How to translate a code block into a document. We support these variations:
    • Indented: indent them by N spaces
    • Surrounded by startCode/endCode strings
type CodeSymbols =
    | Indent of int                 // indentation level in whitespaces
    | Surrounded of string * string // start code * end code

type Options = {
    startNarrative  : string
    endNarrative    : string
    codeSymbols     : CodeSymbols
}

let translate   = declare<Options -> string -> string>

2.1 Going over the parse tree

We need a function that takes a string and returns a list with the various blocks. We can then go over each block, perform some operations and, in the end, transform it back to text

type Block =
| Code      of string
| Narrative of string

let blockize = declare<Options -> string -> Block list>

I could have used regular expressions to parse the program, but it seemed ugly. I could also have used FsParsec, but that brings with it an additional dll. So I decided to roll my own parser. This has several problems:

  • It is probably very slow
  • It doesn’t allow narrative comments inside comments, in particular it doesn’t allow the opening comment
  • It doesn’t allow opening comments in the program code (not even inside a string)

The latter in particular is troublesome. You’ll need to use a trick in the code (i.e. concatenating strings) to foul this program in not seeing an opening comment, but it is inconvenient.

With all of that, it works.

TODO: consider switching to FsParsec

2.1.1 Lexer

The lexer is going to process list of characters. We need functions to check if a list of characters starts with certain chars and to return the remaining list after having removed such chars.

BTW: these functions are polymorphic and tail recursive

let rec startWith startItems listToCheck =
    match startItems, listToCheck with
    | [], _             -> true
    | _ , []            -> false
    | h1::t1, h2::t2  when h1 = h2  -> startWith t1 t2
    | _, _              -> false

let rec remove itemsToRemove listToModify =
    match itemsToRemove, listToModify with
    | [], l             -> l
    | _ , []            -> failwith "Remove not defined on an empty list"
    | h1::t1, h2::t2  when h1 = h2  -> remove t1 t2
    | _, _              -> failwith "itemsToRemove are not in the list"

let isOpening options       = startWith (List.ofSeq options.startNarrative) 
let isClosing options       = startWith (List.ofSeq options.endNarrative)
let remainingOpen options   = remove (List.ofSeq options.startNarrative)
let remainingClose options  = remove (List.ofSeq options.endNarrative)

This is a pretty basic tokenizer. It just analyzes the start of the text and returns what it finds. It also keeps track of the line number for the sake of reporting it in the error message.

let NL = System.Environment.NewLine

type Token =
| OpenComment   of int
| CloseComment  of int
| Text          of string

let tokenize options source =

    let startWithNL = startWith (Seq.toList NL)

    let rec text line acc = function
        | t when isOpening options t    -> line, acc, t 
        | t when isClosing options t    -> line, acc, t
        | c :: t as full                ->
            let line' = if startWithNL full then line + 1 else line
            text line' (acc + c.ToString()) t
        | []                            -> line, acc, [] 
    let rec tokenize' line acc = function
        | []                            -> List.rev acc
        | t when isOpening options t    -> tokenize' line
                                            (OpenComment(line)::acc)  (remainingOpen options t)
        | t when isClosing options t    -> tokenize' line
                                            (CloseComment(line)::acc) (remainingClose options t)
        | t                             ->
            let line, s, t'= text line "" t
            tokenize' line (Text(s) :: acc) t'

    tokenize' 1 [] (List.ofSeq source)

2.1.2 Parser

The parse tree is just a list of Chunks, where a chunk can be a piece of narrative or a piece of code.

type Chunk =
| NarrativeChunk    of Token list
| CodeChunk         of Token list

let parse options source =

    let rec parseNarrative acc = function
        | OpenComment(l)::t         ->
            failwith ("Don't open narrative comments inside narrative comments at line "
                                                                                    + l.ToString())
        | CloseComment(_)::t        -> acc, t
        | Text(s)::t                -> parseNarrative (Text(s)::acc) t
        | []                        -> failwith "You haven't closed your last narrative comment"

    let rec parseCode acc = function
        | OpenComment(_)::t as t'   -> acc, t'
        | CloseComment(l)::t        -> parseCode (CloseComment(l)::acc) t
        | Text(s)::t                -> parseCode (Text(s)::acc) t
        | []                        -> acc, []
    let rec parse' acc = function
        | OpenComment(_)::t         ->
            let narrative, t' = parseNarrative [] t
            parse' (NarrativeChunk(narrative)::acc) t' 
        | Text(s)::t                ->
            let code, t' = parseCode [Text(s)] t
            parse' (CodeChunk(code)::acc) t'
        | CloseComment(l)::t           ->
            failwith ("Don't insert a close narrative comment at the start of your program at line "
                                                                                    + l.ToString())
        | []                -> List.rev acc

    parse' [] (List.ofSeq source)

2.1.3 Flattener

The flattening part of the algorithm is a bit unusual. At this point we have a parse tree that contains tokens, but we want to reduce it to two simple node types containing all the text in string form.

TODO: consider managing nested comments and comments in strings (the latter has to happen in earlier phases)

let flatten options chunks =
    let tokenToStringNarrative = function
    | OpenComment(l) | CloseComment(l)  -> failwith ("Narrative comments cannot be nested at line "
                                                                                    + l.ToString())
    | Text(s)                           -> s

    let tokenToStringCode = function
    | OpenComment(l)                -> failwith ("Open narrative comment cannot be in code at line"
                                                                + l.ToString()) +
                                                 ". Perhaps you have an open comment in" +
                                                 " a code string before this comment tag?"
    | CloseComment(_)               -> string(options.endNarrative |> Seq.toArray)
    | Text(s)                       -> s

    let flattenChunk = function
    | NarrativeChunk(tokens)             ->
        Narrative(tokens |> List.fold (fun state token -> state + tokenToStringNarrative token) "")
    | CodeChunk(tokens)                  ->
        Code(tokens |> List.fold (fun state token -> state + tokenToStringCode token) "")

    chunks |> List.fold (fun state chunk -> flattenChunk chunk :: state) [] |> List.rev

We are getting there, now we have a list of blocks we can operate upon

blockize := fun options source -> source |> tokenize options |> parse options |> flatten options
 

2.2 Narrative comments phases

Each phase is a function that takes the options and a block list and returns a block list that has been processed in some way.

type Phase = Options -> Block List -> Block List

let removeEmptyBlocks   = declare<Phase>
let mergeBlocks         = declare<Phase>
let addCodeTags         = declare<Phase>

let processPhases options blockList = 
    blockList
    |> !removeEmptyBlocks   options
    |> !mergeBlocks         options
    |> !addCodeTags         options

We want to manage how many newlines there are between different blocks, because we don’t trust the programmer to have a good view of how many newline to keep from comment blocks and code blocks. We’ll trim all newlines from the start and end of a block, and then add our own.

let newLines = [|'\n';'\r'|]

type System.String with
    member s.TrimNl () = s.Trim(newLines) 

2.2.1 Remove empty blocks

There might be empty blocks (i.e. between two consecutive comment blocks) in the file. For the sake of formatting the file beautifully, we want to remove them.

let extract = function
    | Code(text)        -> text
    | Narrative(text)   -> text

removeEmptyBlocks := fun options blocks ->
                        blocks |> List.filter (fun b -> (extract b).TrimNl().Trim() <> "")

2.2.2 Merge blocks

Consecutive blocks of the same kind need to be merged, for the sake of formatting the overall text correctly.

TODO: make tail recursive

let rec mergeBlockList = function
    | []        -> []
    | [a]       -> [a]
    | h1::h2::t -> match h1, h2 with
                   | Code(t1), Code(t2)             -> mergeBlockList (Code(t1 + NL + t2)::t)
                   | Narrative(n1), Narrative(n2)   -> mergeBlockList(Narrative(n1 + NL + n2)::t)
                   | _, _                           -> h1::mergeBlockList(h2::t)

mergeBlocks := fun options blocks -> mergeBlockList blocks

2.2.3 Adding code tags

Each code block needs a tag at the start and one at the end or it needs to be indented by N chars.

let indent n (s:string) =
    let pad = String.replicate n " "
    pad + s.Replace(NL, NL + pad)

addCodeTags := fun options blocks ->
    match options.codeSymbols with
    | Indent(n)         ->
        blocks |> List.map (function Narrative(s) as nar -> nar | Code(s) -> Code(indent n s))
    | Surrounded(s, e)  -> 
        blocks |> List.map (function
                            | Narrative(text)   -> Narrative(NL + text.TrimNl() + NL)
                            | Code(text)        -> Code(NL + s + NL + text.TrimNl() + NL + e + NL))

2.2.4 Flatten again

Once we have the array of blocks, we need to flatten them (transform them in a single string), which is trivial, and then finally implement our original translate function.

let sumBlock s b2 = s + extract b2

let flattenB blocks = (blocks |> List.fold sumBlock "").TrimStart(newLines)

translate := fun options text -> text |> !blockize options |> processPhases options |> flattenB

2.3 Parsing command line arguments

Parsing command lines involves writing a function that goes from a sequence of strings to an input file name, output file name and Options record

let parseCommandLine = declare<string array -> string * string * Options>

To implement it, we are going to use a command line parser taken from here. The parseArgs function takes a sequence of argument values and map them into a (name,value) tuple. It scans the tuple sequence and put command name into all subsequent tuples without name and discard the initial (,) tuple. It then groups tuples by name and converts the tuple sequence into a map of (name,value seq)

For now, I don’t need the value seq part as all my parameters take a single argument, but I left it in there in case I will need to pass multiple args later on.

open  System.Text.RegularExpressions

let (|Command|_|) (s:string) =
  let r = new Regex(@"^(?:-{1,2}|\/)(?<command>\w+)[=:]*(?<value>.*)$",RegexOptions.IgnoreCase)
  let m = r.Match(s)
  if m.Success
  then 
    Some(m.Groups.["command"].Value.ToLower(), m.Groups.["value"].Value)
  else
    None

let parseArgs (args:string seq) =
  args 
  |> Seq.map (fun i -> 
                    match i with
                    | Command (n,v) -> (n,v) // command
                    | _ -> ("",i)            // data
                  )
  |> Seq.scan (fun (sn,_) (n,v) -> if n.Length>0 then (n,v) else (sn,v)) ("","")
  |> Seq.skip 1
  |> Seq.groupBy (fun (n,_) -> n)
  |> Seq.map (fun (n,s) -> (n, s |> Seq.map (fun (_,v) -> v) |> Seq.filter (fun i -> i.Length>0)))
  |> Map.ofSeq

let paramRetrieve (m:Map<string,_>) (p:string) = 
  if Map.containsKey p m
  then Some(m.[p])
  else None

This is the main logic of parameter passing. Note that we give precedence to the -l and -indent parameters, if present.

This is a function that goes from the map of command line parameters to the input file name, output file name and options. With that we can finally define the original parseCommandLine.

let safeHead errMsg s = if s |> Seq.isEmpty then failwith errMsg else s |> Seq.head 

let paramsToInputs paramsMap =
    let single p er     = match paramRetrieve paramsMap p with | Some(k) -> Some(k |> safeHead er)
                                                               | None -> None
    let get p s         = match paramRetrieve paramsMap p with |Some(k) -> k |> safeHead s
                                                               | None -> failwith s
    let defaultP p q er = match paramRetrieve paramsMap p with | Some(k) -> k |> safeHead er
                                                               | None -> q

    let inputFile       = get "" "You need to pass an input file"
    let outputFile      = defaultP  "o"
                                    (System.IO.Path.GetFileNameWithoutExtension(inputFile) + ".mkd")
                                    "You must pass a parameter to -o"

    let no, nc          = match single "l" "You must pass a language parameter to -l" with
                          | Some(l) -> getLangNoNC l
                          | None    ->
                                get "no" "The no (narrative open) parameter is mandatory, if no -l specified",
                                get "nc" "The nc (narrative close) parameter is mandatory, if no -l specified"

    let codeSymbs       = match single "indent" "You must pass a whitespace indentation number to -indent" with
                          | Some(n)     ->
                                let success, value = System.Int32.TryParse n
                                if success
                                    then Indent(value)
                                    else failwith "-i accepts just an integer value as parameter"                          
                          | None        ->
                                Surrounded(
                                    get "co" "The co (code open) parameter is mandatory, if no -indent specified",
                                    get "cc" "The cc (code close) parameter is mandatory")
    inputFile, outputFile, {
        startNarrative  = no
        endNarrative    = nc
        codeSymbols     = codeSymbs
        }

parseCommandLine := parseArgs >> paramsToInputs

2.4 Main method

We can then write main as the only side effect function in the program. Here is where the IO monad would live …

let banner  = "LLite : language friendly literate programming\n"

let myMain args =
    try
        printfn "%s" banner

        let inputFile, outputFile, options = !parseCommandLine args
        let input       = System.IO.File.ReadAllText inputFile
        let output      = !translate options input
        System.IO.File.WriteAllText (outputFile, output)
        0
    with
    | e ->
        printfn "%s" "Failure"
        printfn "%s" e.Message 
        printfn "%s" usage
#if DEBUG 
        printfn "\nDetailed Error Below:\n%A" e
#endif
        -1

3 An aside: forward declaring functions in F#

3.1 A simple solution

You can achieve something somehow similar to forward declaration by the ’declare ’dirty trick used in this program. Whenever you want to do a forward declaration of a function , or variable, you can type:

let testDeclare() =

    let add = declare<float -> float>

    let ``function where I want to use add without having defined it`` nums = nums |> Seq.map !add

This creates a ref to a function from float to float. It looks a bit like an Haskell type declaration. You can then use such function as if it were actually define and delay his definition to a later point in time when you are ready to explain it.

When you are ready to talk about it, you can then define it with:

    add := fun x -> x + 1.

The syntax is not too bad. You get that often-sought Haskell like explicit type declaration and you can regex the codebase to create an index at the end of the program (maybe).

But is it too slow? After all, there is one more indirection call for each function call.

Let’s test it: enable #time in F# interactive and execute timeNormalF and timeIndirectF varying sleepTime and howManyIter until you convince yourself that it is ok (or not).

    let sleepTime   = 50
    let howManyIter = 100
    let normalF x   = System.Threading.Thread.Sleep sleepTime
    let indirectF   = declare<int -> unit>
    indirectF      := fun x -> System.Threading.Thread.Sleep sleepTime

    let timeNormalF     = [1..howManyIter] |> List.iter normalF
    let timeIndirectF   = [1..howManyIter] |> List.iter !indirectF
    ()

3.2 A correct solution (but ugly)

Unfortunately, there is a big problem with all of the above: it doesn’t work with generic functions and curried function invocations. The code below works in all cases, but it is ugly for the user to use. In this program I’ve used the beautiful, but incorrect, syntax.

type Literate() =
    static member Declare<'a, 'b>  (ref : obj ref) (x : 'a) : 'b =
        unbox <| (unbox<obj -> obj> !ref) x
    static member Define<'a, 'b> (func : 'a -> 'b) (ref : obj ref) (f : 'a -> 'b) =
        ref := box (unbox<'a> >> f >> box)

// Declaration    
let rec id (x : 'a) : 'a = Literate.Declare idImpl x
and idImpl = ref null

// Usage
let f () = id 100 + id 200

// Definition
Literate.Define id idImpl (fun x -> x)

3.3 The traditional way

The traditional way of doing something like this is by passing the function as a parameter in a manner similar to the below.

// Declaration and usage intermingled
let calculate' (add: int -> int -> int) x y = add x y * add x y

// Definition
let add x y = x + y

let calculate = calculate' add

To my way of seeing, this is the most cumbersome solution and conceptually dishonest because you are not parametrizing your function for technical reasons, but just for the sake of explaining things. In a way, you are changing the signature of your functions for the sake of writing a book. That can’t be right …

Exceptions vs. Return Values to represent errors (in F#) – IV – Implementation

The Critical monad is defined as follows. First there is the type that propagates through the monad:

type Result<'a, 'b> =
| Success of 'a
| Failure of 'b


Then we define the usual computation expression methods.

type Critical() =
       // a -> m a
        member o.Return x       = Success x
        // m a -> (a -> m b) -> m b
        member o.Bind (m, f)    = match m with
                                    | Failure e -> Failure e
                                    | Success x -> f x
        // m a -> m a
        member o.ReturnFrom m   = m

Explaining how computational expressions work in F# is a blog onto itself. And several chapters in many books. Sufficient to say that conceptually this propagates the success value and returns the failure value.

We then define an instance of this type, so that we can use the nice ‘critical { … }’ syntax.

let critical = Critical()


We then go and define the functions that the user needs to use to annotate their function calls. The simplest one is the one to propagate any exception coming from the function ‘f’.

let fault f = f


Then it comes the one to manage contingencies. This will trap any exception for which ‘stopF ex’ is ‘true’, call ‘errF ex’ to construct the error return value and wrap it in a ‘Failure’. Otherwise it will rethrow the exception.

let contingentGen stopF errF f =
    try
        Success(f ())
    with
        | ex when stopF ex -> Failure(errF ex)
        | _                -> reraise ()

Albeit very simple, the above is the core of the system. Everything else is just details. Let’s look at them.

First we want a function that takes as parameter a list of (Exception, ReturnValue) and gives back the correct stopF errF to plug into ‘contingentGen’.

let exceptionMapToFuncs exMap =
    let tryFind ex = exMap |> List.tryFind (fun (k, _) -> k.GetType() = ex.GetType())
    (fun ex ->
        let found = tryFind ex
        match found with Some(_) -> true | None -> false),
    (fun ex ->
        let found = tryFind ex
        match found with
        | Some(k, v)    -> v ex
        | None          -> raise ex)

Then ugliness comes. For the sake of getting a decent syntax (not great) on the calling site, we need to fake overloading of functions by the old trick of adding a number at the end. Thanks to Tobias to point out this (my api was even worse earlier).

I often wondered about the trade-off between currying and overloading for functions. I seem to always paint myself in a situation where I need overloading. In any case, here it goes:

let contingent1 exMap f x =
    let stopF, errF = exceptionMapToFuncs exMap
    contingentGen stopF errF (fun _ -> f x)

let contingent2 exMap f x y =
    let stopF, errF = exceptionMapToFuncs exMap
    contingentGen stopF errF (fun _ -> f x y)

let contingent3 exMap f x y z =
    let stopF, errF = exceptionMapToFuncs exMap
    contingentGen stopF errF (fun _ -> f x y z)


Sometimes you want to trap all exceptions from a function and return your own error value:

let neverThrow1 exc f x     = contingentGen (fun _ -> true) (fun ex -> exc ex) (fun _ -> f x)
let neverThrow2 exc f x y   = contingentGen (fun _ -> true) (fun ex -> exc ex) (fun _ -> f x y)
let neverThrow3 exc f x y z = contingentGen (fun _ -> true) (fun ex -> exc ex) (fun _ -> f x y z)


Other times you need to go from a function that returns return values to one that throws exceptions. You need translating from contingencies to faults:

let alwaysThrow exc f x =
    match f x with
    | Success(ret)              -> ret
    | Failure(e)                -> raise (exc e)      

And that’s it. Hopefully we have bridged the gap between exceptions and return values without making the code too ugly (just a little bit). Or perhaps not.

I need to add that I haven’t used this library myself (yet). I’m sure when I do I’ll discover many things to change.

Exceptions vs. Return Values to represent errors (in F#) – III–The Critical monad

Code for this post is here.

In the last post we looked at some Critical code and decided that, albeit correct, it is convoluted. The error management path obfuscates the underlying logic. Also we have no way of knowing if a developer had thought about the error path or not when invoking a function.

Let’s tackle the latter concern first as it is easier. We want the developer to declaratively tag each method call with something that represents his intent about managing the Contingencies or Faults of the function.  Moreover if the function has contingencies, we want to force the developer to manage them explicitly.

We cannot use attributes for this as function calls happen in the middle of the code, so there is no place to stick attributes into. So we are going to use higher level functions to wrap the function calls.

The first case is easy. If the developer thinks that the caller of his code has no way to recover from all the exceptions thrown by a function, he can prepend his function call with the ‘fault’ word as in:

fault parseUser userText

That signals readers of the code that the developer is willing to propagate up all the exceptions thrown by the function parseUser. Embarrassingly, ‘fault’ is implemented as:

let fault f = f

So it is just a tag. Things get trickier when the function has contingencies. We want to find a way to manage them without introducing undue complexity in the code.

We’d like to catch some exceptions thrown by the function and convert them to return values and then either return such return values or manage the contingency immediately after the function call. On top of that, we’d want all of the code written after the function call to appear as clean as if no error management were taking place. Monads (computation values) can be used to achieve these goals.

Last time we introduced a type to represent error return values:

type Result<'a, 'b> =
| Success of 'a
| Failure of 'b
type UserFetchError =
| UserNotFound  of exn
| NotAuthorized of int * exn 

We can then create a computation expression that ‘abstracts out’ the Failure case and let you write the code as cleanly as if you were not handling errors. Let’s call such thing ‘critical’. Here is how the final code looks like:

let tryFetchUser3 userName =
    if String.IsNullOrEmpty userName then invalidArg "userName" "userName cannot be null/empty"

    critical {
        let Unauthorized (ex:exn) = NotAuthorized (ex.Message.Length, ex)
        let! userText = contingent1
                            [FileNotFoundException()        :> exn, UserNotFound;
                             UnauthorizedAccessException()  :> exn, Unauthorized]
                            dbQuery (userName + ".user")
        
        return fault parseUser userText
    }

You can compare this with the code you would have to write without the ‘critical’ library (from last post):

let tryFetchUser1 userName =
    if String.IsNullOrEmpty userName then invalidArg "userName" "userName cannot be null/empty"

    // Could check for file existence in this case, but often not (i.e. db)
    let userResult =    try
                            Success(dbQuery(userName + ".user"))
                        with
                        | :? FileNotFoundException as ex        -> Failure(UserNotFound ex)
                        | :? UnauthorizedAccessException as ex  -> Failure(NotAuthorized(2, ex))
                        | ex                                    -> reraise ()
    
    match userResult with
    | Success(userText) -> 
        let user        = Success(parseUser(userText))
        user
    | Failure(ex)       -> Failure(ex)

And with the original (not critical) function:

let fetchUser userName =
    let userText            = dbQuery (userName + ".user")
    let user                = parseUser(userText)
    user

Let’s go step by step and see how it works. First of all, you need to enclose the Critical parts of your code (perhaps your whole program) in a ‘critical’ computation:

    critical {
}

This allows you to call functions that return a Result and manage the return result as if it were the successful result. If an error were generated, it would be returned instead. We will show how to manage contingencies immediately after the function call later.

The above is illustrated by the following:

        let! userText = contingent1
                            [FileNotFoundException()        :> exn, UserNotFound;
                             UnauthorizedAccessException()  :> exn, Unauthorized]
                            dbQuery (userName + ".user")

Here ‘contingent1’ is a function that returns a Result, but userText has type string. The Critical monad, and in particular the usage of ‘let!’ is what allows the magic to happen.

‘contingentN’ is a function that you call when you want to manage certain exceptions thrown by a function as contingencies. The N part represents how many parameters the function takes.

The first parameter to ‘contingent1’ is a list of pairs (Exception, ErrorReturnConstructor). That means: when an exception of type Exception is thrown, return the result of calling ‘ErrorReturnConstructor(Exception)’ wrapped inside a ‘Failure’ object. The second parameter to ‘contingent1’ is the function to invoke and the third is the argument to pass to it.

Conceptually, ‘ContingentN’ is a tag that says: if the function throws one of these exceptions, wrap them in these return values and propagate all the other exceptions. Notice that Unauthorized takes an integer and an exception as parameters while the ErrorReturnConstructor takes just an exception. So we need to add this line of code:

        let Unauthorized (ex:exn) = NotAuthorized (ex.Message.Length, ex) 

After the contingent1 call, we can then write code as if the function returned a normal string:

        return fault parseUser userText

This achieves that we set up to do at the start of the series:

  • Contingencies are now explicit in the signature of tryFetchUser3
  • The developer needs to indicate for each function call how he intend to manage contingencies and faults
  • The code is only slightly more complex than the non-critical one

You can also decide to manage your contingencies immediately after calling a function. Perhaps there is a way to recover from the problem. For example, if the user is not in the database, you might want to add a standard one:

let createAndReturnUser userName = critical { return {Name = userName; Age = 43}}
let tryFetchUser4 userName = if String.IsNullOrEmpty userName then invalidArg "userName" "userName cannot be null/empty" critical { let Unauthorized (ex:exn) = NotAuthorized (ex.Message.Length, ex) // depends on ex let userFound = contingent1 [FileNotFoundException() :> exn, UserNotFound; UnauthorizedAccessException() :> exn, Unauthorized] dbQuery (userName + ".user") match userFound with | Success(userText) -> return fault parseUser userText | Failure(UserNotFound(_)) -> return! createAndReturnUser(userName) | Failure(x) -> return! Failure(x) }

The only difference in this case is the usage of ‘let’ instead of ‘let!’. This exposes the real return type of the function allowing you to pattern match against it.

Sometimes a simple exception to return value mapping might not be enough and you want more control on which exceptions to catch and how to convert them to return values. In such cases you can use contingentGen:

let tryFetchUser2 userName =
    if String.IsNullOrEmpty userName then invalidArg "userName" "userName cannot be null/empty"

    critical {
        let! userText = contingentGen
                            (fun ex -> ex :? FileNotFoundException || ex :? UnauthorizedAccessException)
                            (fun ex ->
                                match ex with
                                       | :? FileNotFoundException       -> UserNotFound(ex)
                                       | :? UnauthorizedAccessException -> NotAuthorized(3, ex)
                                       | _ -> raise ex)
                            (fun _ -> dbQuery (userName + ".user"))
        
        return fault parseUser userText
    }

The first parameter is a lambda describing when to catch an exception. The second lambda translate between exceptions and return values. The third lambda represents which function to call.

Sometimes you might want to catch all the exceptions that a function might throw and convert them to a single return value:

type GenericError = GenericError of exn

 // 1. Wrapper that prevents exceptions for escaping the method by wrapping them in a generic critical result
let tryFetchUserNoThrow userName =
    if String.IsNullOrEmpty userName then invalidArg "userName" "userName cannot be null/empty"

    critical {
        let! userText = neverThrow1 GenericError dbQuery (userName + ".user")
        
        return fault parseUser userText
    }

And sometimes you might want to go the opposite way. Given a function that exposes some contingencies, you want to translate them to faults because you don’t know how to recover from them.

let operateOnExistingUser userName =
    let user = alwaysThrow GenericException tryFetchUserNoThrow userName
    ()

Next time we’ll look at how the Critical computation expression is implemented.

Exceptions vs. Return Values to represent errors (in F#) – II– An example problem

In the previous post, we talked about the difference between Critical and Normal code. In this post we are going to talk about the Critical code part. Ideally, we want:

  • A way to indicate that a particular piece of code (potentially the whole program) is Critical
  • A way to force/encourage the programmer to make an explicit decision on the call site of a function on how he wants to manage the error conditions (both contingencies and faults)
  • A way to force/encourage the programmer to expose contingencies/faults that are appropriate for the conceptual level of the function the code is in (aka don’t expose implementation details for the function,  i.e. don’t throw SQLException from a getUser method where the caller is supposed to catch it)

Remember that I can use the word ‘force’ here because the programmer has already taken the decision to analyse each line of code for error conditions. As we discussed in the previous post, In many/most cases, such level of scrutiny is unwarranted.

Let’s use the below scenario to unravel the design:

type User = {Name:string; Age:int}
let fetchUser userName =
    let userText            = dbQuery (userName + ".user")
    let user                = parseUser(userText)
    user

This looks like a very reasonable .NET function and it is indeed reasonable in Normal code, but not in Critical code. Note that the caller likely needs to handle the user-not-in-repository case because there is no way for the caller to check such condition beforehand without incurring the performance cost of two network roundtrips.

Albeit the beauty and simplicity, there are issues with this function in a Critical context:

  • The function throws implementation related exceptions, breaking encapsulation when the user needs to catch them
  • It is not clear from the code if the developer thought about error management (do you think he did?)
  • Preconditions are not checked, what about empty or null strings?

To test our design let’s define a fake dbQuery:

let dbQuery     = function
    | "parseError.user"     -> "parseError"
    | "notFound.user"       -> raise (FileNotFoundException())
    | "notAuthorized.user"  -> raise (UnauthorizedAccessException())
    | "unknown.user"        -> failwith "Unknown error reading the file"
    | _                     -> "FoundUser"

The first two exceptions are contingencies, the caller of fetchUser is supposed to manage them. The unknown.user exception is a fault in the implementation. parseError triggers a problem in the parseUser function.

ParseUser looks like this:

let parseUser   = function
    | "parseError"          -> failwith "Error parsing the user text"
    | u                     -> {Name = u; Age = 43}

Let’s now create a test function to test the different versions of fetchUser that we are going to create:

let test fetchUser =
    let p x                 = try printfn "%A" (fetchUser x) with ex -> printfn "%A %s" (ex.GetType()) ex.Message
    p "found"
    p "notFound"
    p "notAuthorized"
    p "parseError"
    p "unknown"

Running the function exposes the problems described above. From the point of view of the caller, there is no way to know what to expect by just inspecting the signature of the function. There is no differentiation between contingencies and faults. The only way to achieve that is to catch some implementation-specific exceptions.

How would we translate this to Critical code?

First, we would define a type to represent the result of a function:

type Result<'a, 'b> =
| Success of 'a
| Failure of 'b

This is called the Either type, but the names have been customized to represent this scenario. We then need to define which kind of contingencies our function could return.

type UserFetchError =
| UserNotFound  of exn
| NotAuthorized of int * exn

So we assume that the caller can manage the fact that the user is not found or not authorized. This type contains an Exception member.  This is useful in cases where the caller doesn’t want to manage a contingency, but wants to treat it like a fault (for example when some Normal code is calling some Critical code).

In such cases, we don’t lose important debugging information. But we still don’t break encapsulation because the caller is not supposed to ‘catch’ a fault.

Notice that NotAuthorized contains an int member. This is to show that contingencies can carry some more information than just their type. For example, a caller could match on both the type and the additional data.

With that in place, let’s see how the previous function looks like:

let tryFetchUser1 userName =
    if String.IsNullOrEmpty userName then invalidArg "userName" "userName cannot be null/empty"

    // Could check for file existence in this case, but often not (i.e. db)
    let userResult =    try
                            Success(dbQuery(userName + ".user"))
                        with
                        | :? FileNotFoundException as ex        -> Failure(UserNotFound ex)
                        | :? UnauthorizedAccessException as ex  -> Failure(NotAuthorized(2, ex))
                        | ex                                    -> reraise ()
    
    match userResult with
    | Success(userText) -> 
        let user        = Success(parseUser(userText))
        user
    | Failure(ex)       -> Failure(ex)

Here is what changed:

  • Changed name to tryXXX to convey the fact that the method has contingencies
  • Added precondition test, which generates a fault
  • The signature of the function now conveys the contingencies that the user is supposed to know about

But still, there are problems:

  • The code became very long and convoluted obfuscating the success code path
  • Still, has the developer thought about the error conditions in parseUser and decided to let exceptions get out, or did she forget about it?

The return value crowd at this point is going to shout: “Get over it!! Your code doesn’t need to be elegant, it needs to be correct!”. But I disagree, obfuscating the success code path is a problem because it becomes harder to figure out if your business logic is correct. It is harder to know if you solved the problem you set out to solve in the first place.

In the next post we’ll see what we can do about keeping beauty and being correct.

Exceptions vs. Return Values to represent errors (in F#) – I – Conceptual view

Recently I’ve been reading numerous articles on the age old question of exceptions vs. return values. There is a vast literature on the topic with very passionate opinions on one side or the other. Below is my view on it.

First of all, I’ll define my terms.

  • Success code path: chunk of code that is responsible to perform the main task of a function, without any regard for error conditions
  • Contingency: an event that happens during the success code path execution that is of interest for the caller of the function.
    • It happens rarely
    • It can be described in terms that the caller understands, it doesn’t refer to the implementation of the function
    • The caller stands a chance to recover from it
  • Fault: an event that happens during the success code path execution that is not expected by the caller of the function.
    • It happens rarely
    • It cannot be described in terms that the caller understands, it requires reference to the implementation of the function
    • The caller has no way to recover from it

Examples of the above for a FetchUser(userName) function:

  • Success code path: the code that retrieves the user from the database
  • Contingency: the fact that the requested user is not in the database
  • Fault: userName = null, divide by zero, stack overflow, …

The difference between Contingency and Fault is not sharp in practice and requires common sense, but it is useful none less. When in doubt, it would appear prudent to consider an event as a Contingency, so that the caller gets a chance to recover.

Ideally, you would like a Contingency to be part of the signature of a function, so that the caller knows about it. On the other end, a Fault shouldn’t be part of the signature of a function for two reasons:

  • The user cannot recover from it
  • Being part of the signature would break encapsulation by exposing implementation details of the function

The above seems to suggest that Contingencies should be represented as return values and Faults as exceptions. As an aside, in Java the former is represented as checked exceptions, which is part of the signature. We’ll tackle checked exceptions later on.

An important point that is often neglected in the discussions on this topic is that there are two categories of applications: applications that care about Contingencies (Critical apps) and applications that don’t (Normal apps). I am of the opinion that the latter category is the largest.

In many cases you can indeed write just the success code path and, if anything goes wrong, you just clean up after yourself and exit. That is a perfectly reasonable thing to do for very many applications. You are trading off speed of development with stability.  Your application can be anywhere on that continuum.

Examples of Normal apps are: build scripts, utility applications, departmental applications where you can fix things quickly on the user machine, intranet web sites, internet web sites that are purely informative, etc …

Examples of Critical apps are: servers, databases, operating systems, web site that sell stuff,  etc …

For Normal apps, treating Contingencies as Fault is the right thing to do. You just slap a try … catch around your event loop/ thread/ process and you do your best to get the developer to fix the problem quickly. I think a lot of the angst of the ‘return value crowd’ is predicated on not having this distinction in mind. They are making very valid point regarding Critical apps to a crowd that is thinking about Normal apps. So the two sides are cross-talking.

Also, in my opinion, the main problem with Java checked exceptions is that they make writing Normal apps as cumbersome as writing Critical apps. So, reasonably, people complain.

The .NET framework decided to use Exceptions as the main way to convey both Faults and Contingencies. By doing so, it makes it easier to write Normal apps, but more difficult to write Critical apps.

For a Critical app or section of code, you’d want to:

  • Express contingencies in the function signature
  • Force/Encourage people to take an explicit decision at the calling site on how they want to manage both Contingencies and Faults for each function call

In the next post, let’s see how we can represent some of this in F#.

Retrieving SQL Server data with type providers and exposing it with ASP.NET Web APIs in F#

For a good introduction on how to use Web APIs in F#, read here. The starting point for type providers is here. This post is about how I solved a practical problem using these technologies.

First, let’s discuss the scenario. In my company, we needed to log usage information for our various applications to a central repository and build a web site to access such information. I went through three different architectures for such a requirement, ending with the set of technologies described in the title.

I found the intersection of the SQL type provider with the ASP.NET Web Api to be very sweet. Personally, I think this mix is much better than using the WFC Data Services, because the code you have to write and the infrastructure to maintain is significantly less.

I suspect that the F# team and ASP.Net team didn’t talk to each other. It all happens because of a well defined interface (IQueryable) that both teams happen to work against.

1st version, heavy Javascript

·         SQL Server Express backend

·         WFC Data Services middle tier in C# (autogenerated REST service from a table, you can query from the JS code)

·         Plenty of Javascript in the browser, calling back to the REST web services through Ajax

 

Here is an example of the kind of manipulation done in the JS layer:

 

function extractVersions(sessions) {

    var versions = {}

    $.each(sessions, function (i, s) {

        if (typeof versions[s.Version] == ‘undefined’) {

            versions[s.Version] = 0

        } else {

            versions[s.Version] = versions[s.Version] + 1

        }

    })

    var vs = []

    for (var prop in versions) {

        if (versions.hasOwnProperty(prop))

            vs.push({ Version: prop, Sessions: versions[prop] })

    }

    return vs;

}

 

Pros:

·         Back-end autogenerated

·         I’ve learned JS pretty well

Cons:

·         Plenty of autogenerated code (crap) to debug into when things go wrong

·         Difficult to customize the back-end

·         Somehow slow, even if today’s JS engines are pretty good

·         More time consuming to create and maintain JS code (compared to F# or C#)

·         Aesthetically unsatisfying, as you really would like to do most manipulations on the server

 

2nd version, SOAP like web services

·         SQL Server Express backend

·         SOAP like web services middle tier in C# (coded by hand), still using the WCF Data Services infrastructure

·         Little Javascript in the browser exclusively for presentation, calling back to the REST web services through Ajax

 

Here is how one of the web services would look like:

 

        [WebGet]

        public IEnumerable<VersionSessions> GetVersionsForMonth(string application,
                                                                bool isQS, bool isAD)

        {

            var sessions = GetSessionsForMonth(application, isQS, isAD);

 

            var hash = new Dictionary<string, VersionSessions>();

 

            foreach (var s in sessions)

            {

                VersionSessions vs;

                var found = hash.TryGetValue(s.Version, out vs);

                if (!found)

                    hash[s.Version] = new VersionSessions { Version = s.Version,
                                                                       Sessions = 0 };

                else

                    vs.Sessions += 1;

            }

            return hash.Values;

        }

 

Pros:

·         Maintainable business logic code (C# instead of JS)

·         Somehow faster

·         Easier to customize the back-end (just add another web service with the signature you want)

Cons:

·         Still plenty of manipulation code (at least now on the back-end), not very RESTy

·         Feels that the webservice are very ad-hoc for this particular presentation (i.e. a ViewModel, not a model)

·         Still not trivial to customize the back-end logic, mostly because I was using WCF Data Services, which are opaque

 

3rd version, using SQL Views and F# type providers

·         Moved the whole app to a dedicated VM

·         SQL Server backend with a significant layer of Views and inline table-value functions

·         Proper REST web services created through ASP.NET Web API and F# type providers

·         Little Javascript in the browser exclusively for presentation, calling back to the REST web services through Ajax

·         Moved all external ‘reference file’ (i.e. people in QS) to the DB. Before I was keeping business info in the database and config info in files.

 

Here is an example of an inline table value function:

 

ALTER FUNCTION [dbo].[FUsers] (@App nvarchar(40))

RETURNS TABLE

AS

RETURN

(

SELECT     UserAlias, MAX(Application) AS Application, MAX(Start) AS LastSessionStart, MAX(Finish) AS LastSessionFinish, DATEDIFF(MINUTE, MAX(Start), MAX(Finish))

                      AS LastSessionTotalTime, COUNT(*) AS Sessions, MAX(Machine) AS Machine, MAX(Version) AS Version, MAX(Qs) AS Qs

FROM         dbo.VTotalSessions AS s

WHERE     (TotalTime > 30) and (Application = @App)

GROUP BY UserAlias)

 

And here is an example of a type provider based web service. Most of my code works like this:

 

type usersController() =

    inherit ApiController()

 

    let db = dbSchema.GetDataContext()

   

    member x.Get() =

        query {

            for u in db.VUsers do

            select u

        }

    member x.Get(app) =

        query {

            for u in db.FUsers(app) do

            select u

        }

 

    member x.Post (value:string) = ()

    member x.Put (id:int) (value:string) = ()

    member x.Delete (id:int) = ()

 

Here is how a query is represented in JS (the whole query gets passed through to SQL, both the JS part and the F# part, IQueryable magick):

 

function loadOpenSessions(app, qs,  cont) {

    var query = new oQuery()

        .From("/users/" + app)

        .Let("lastMonth", lastMonth)

        .Let("twoMinutesAgo", twoMinutesAgo)

        .Where("item => item.LastSessionFinish > $twoMinutesAgo")

        .Orderby("UserAlias")

    commonCall(query, qs, cont)

 

}

 

Pros:

·         Purely functional business logic code (SQL), very easy to debug problems by just running the query/view

·         Maximum usage of SQL Server optimizer. As they say in the SQL Engine Team: “we’ve spent decades optimizing sql query engines. You are unlikely to do better with your ‘for’ loops …”

·         Very easy to customize the back-end, just write F# code to implement GET/POST/PUT/DELETE etc…

·         Moving all the state for the app (reference files included) in the DB, makes it much easier to integrate it into the business logic. It all becomes a big query for the optimizer to sort out.

·         No autogenerated code anywhere in the architecture

·         More separation between model and viewmodel. Tables are model, Views are ViewModel, F# is just a way to expose such view model to the world at large

Cons:

·       Routing algorithm for ASP.NET Web API is mysterious. It’s the only black box piece in the architecture (aka that I don’t understand it). 

·        Sometimes either SQL is not powerful enough (I abhor Stored procedures on religious grounds) or something doesn’t map well to a REST metaphor. In such cases I have diluted the metaphor as in the code below. The good thing is that, being just code, I can do that. I’m not blocked.

type statsController() =

    inherit ApiController()

 

    let db = dbSchema.GetDataContext()

 

    member x.Get(app:string, isQs:string, call:string) =

        match call with

        | "WeeklyUsers" ->

            let sessions = query {

                for s in db.VTotalSessionsAndWeeks do

                where (s.Application = app && s.Qs = Nullable (Int32.Parse(isQs)))

                select s

            }

            weeklyUsers sessions

        | "WeeklyTime"  ->

            let sessions = query {

                for s in db.FTotalTime(app, Nullable (Int32.Parse(isQs))) do

                select s

            }

            weeklyTime sessions

        | "HourlyUsers"->

            let users = query {

                for u in db.FHourlyUsers(app, Nullable (Int32.Parse(isQs))) do

                select u }

            hourlyUsers users

        | _ -> failwith "No call with that name"

Adventure in parserland – parsing lambda expressions in F# – Part V

We are now going to look at a solution which is concise, efficient and gives sophisticated error messages. It is also less than 20 lines of code. We’ll be using FParsec.

FParsec is a port of an Haskell library. It is a parser combinator library or, as I like to think of it, an internal DSL to build parsers in F#. My usual disclaimer: I’m not an expert in FParsec. It is likely that, if you are an expert, you can come up with more maintainable/efficient/elegant version of this parser.

The whole code is below:

let ws = " \t\n"
let specialChars = ".)(\\\n"

let pWs = spaces
let pName = manyChars (noneOf (ws + specialChars)) |>> EName

let pExpr, pExprRef = createParserForwardedToRef<Expression, Unit>()

let curry2 f a b = f(a,b)
let pFunction = pchar '\\' >>. pipe2 pName (pchar '.' >>. pExpr) (curry2 Function)

let pApplication = pchar '(' >>. pipe2 pExpr (pWs >>. pExpr) (curry2 Application)
                          .>> pWs .>> pchar ')'

do pExprRef := pFunction <|> pApplication <|> pName

let pExpressions = sepBy pExpr spaces1

This mirrors pretty closely the grammar we are trying to parse. A program is a bunch of expressions separated by whitespaces.

let pExpressions = sepBy pExpr spaces1

sepBy is a combinator that takes a parser that defines what to parse and a parser that defines what the separator should be. And the separator should be one or more spaces …

pExpr is either a function, an application or a name. the operator <|> is a choice combinator that tries each parser in order. It tries the right parser just if the left parser fails and it doesn’t consume input. So it doesn’t backtrack. If you need a backtracking parser, you’ll need to get acquainted with the attempt combinator.

do pExprRef := pFunction <|> pApplication <|> pName

A function starts with a ‘\’, then comes a name, a dot and an expression.

let pFunction = pchar '\\' >>. pipe2 pName (pchar '.' >>. pExpr)
(curry2 Function)

I know that might look crazy (and maybe it is), but just bear with me. Someone , who I’m not sure I can name, once told me that functional programming is great to write code, but terrible to read it and debug it. The phrase stayed with me as containing a grain of truth. In any case, in the code above:

  • >>. is a combinator that says “use the left parser, discard its value and then use the right one, returning its value”. Try to guess what .>> does …
  • pipe2 is a combinator that says “apply the first parser, the second parser and then call a function passing as parameters the values returned by the two parsers
  • curry2 is a function combinator that transform a function that takes a tuple to a function that takes the parameters as untupled
let curry2 f a b = f(a,b)

An application works similarly, but differently …

let pApplication = pchar '(' >>. pipe2 pExpr (pWs >>. pExpr) (curry2 Application)
                              .>> pWs .>> pchar ')'

The only difference is that now we have to consume the optional whitespaces and the ‘)’ at the end. A rule of thumb that I use is to use >>.  to flow the result through and pipeX when I need more than one result.

The last thing is pName, which consume chars until it finds either a whitespace or a special char.

let ws = " \t\n"
let specialChars = ".)(\\\n"

let pWs = spaces
let pName = manyChars (noneOf (ws + specialChars)) |>> EName

And there you have it, a lexer, a parser all in 20 lines of code. I don’t like the code that I wrote above much. I’m sure I could refine it plenty and it probably contains some bugs, but it gives an idea of what is possible with FParsec.

Adventure in parserland – parsing lambda expressions in F# – Part IV

Let’ now look at the parser. First let’s review the grammar:

    (*
        <expression> ::= <name> | <function> | <application> 
        <name> ::= non­blank character sequence 
        <function> ::= \ <name> . <body> 
        <body> ::= <expression> 
        <application> ::= ( <function expression> <argument expression> ) 
        <function expression> ::= <expression> 
        <argument expression> ::= <expression> 
    *)

And the data type to represent it:

type Name = string
and Body = Expression
and Function = Name * Expression
and FunctionExpression = Expression
and ArgumentExpression = Expression
and Expression = 
| EName of string
| Function of Expression * Body
| Application of FunctionExpression * ArgumentExpression
| EOT

In essence, the data type need to store all the information needed for subsequent stages of computation (i.e. beta reductions and such). The closer it is to the grammar, the better. In this case it looks pretty close.

Remember what is the main goal of our parser:

let parseTextReader: TextReader -> seq<Expression> =
                    textReaderToLazyList >> tokenStream >> parseExpressions

We have already looked at TextReaderToLazyList and tokenStream. Now it is the time to look at parseExpressions. It’s goal is to  parse the LazyList<Token> and return a sequence of expressions. The choice of returning a sequence at this point is to make the parseTextReader, which is the main function in the program, return a more ‘standard’ type.

and parseExpressions tokens = seq {
   let tokens = parseOptionalWs tokens
   let expr, tokens = parseExpr tokens
   let tokens = parseOptionalWs tokens
   match expr with
    | EOT   -> yield EOT
    | exp   -> yield exp; yield! parseExpressions tokens }

parseOtionalWs simply skips ahead whatever whitespaces it finds.

and parseOptionalWs tokens = match tokens with
                                | LazyList.Nil -> LazyList.empty
                                | LazyList.Cons(h, t) -> 
                                    match h with
                                       | Ws _ -> parseOptionalWs t
                                       | _ -> tokens

parseExpr is more interesting. It is the main switch that creates expression kinds.

let rec parseExpr tokens = match tokens with
                            | LazyList.Nil -> EOT, LazyList.empty
                            | LazyList.Cons(h, t) ->
                                match h with
                                    | EOF -> parseEOF tokens
                                    | Name _ -> parseName  tokens
                                    | Lambda -> parseFunction tokens
                                    | OpenParens -> parseApplication tokens
                                    | token -> errorAtStart "Expression" token

parseEOF is not.

and parseEOF tokens = EOT, LazyList.empty

parseName just returns a EName, unwrapping it from Name.

and parseName tokens = EName (head tokens |> unwrapName), tail tokens

Unwrap just unwraps it.

let unwrapName = function
    | Name(s) -> s
    | tok -> errorExpecting "a Name" <| writeToken tok

parseFunction just conumes a Lambda, a name, a Dot token, a body (i.e. \x.x)and assembles them in a Function:

and parseFunction tokens =
    let tokens = consumeToken Lambda tokens
    let name, tokens = parseName tokens
    let tokens = consumeToken Dot tokens
    let body, tokens = parseExpr tokens
    Function(name, body), tokens

consumeToken tries to consume a token generating an error if it doesn’t find it:

let consumeToken token = 
    genericConsumeToken (fun token' _ -> errorExpecting (writeToken token') (writeToken token)) token

genericConsumeToken is just a generalization of the function above:

let genericConsumeToken noMatch token = function
    | LazyList.Nil -> LazyList.empty
    | LazyList.Cons(h, t) as originalTokens ->
        match h with
        | tok when tok = token -> t
        | tok -> noMatch token originalTokens

The last thing left to consume is an application which is in this form (func args):

and parseApplication tokens =
    let tokens = consumeToken OpenParens tokens
    let funExpr, tokens = parseExpr tokens
    let tokens = parseOptionalWs tokens
    let argExpr, tokens = parseExpr tokens
    let tokens = consumeToken CloseParens tokens
    Application(funExpr, argExpr), tokens

Various error and utility functions are defined below:

let errorEOF expecting = failwith  ("Expected " + expecting + ", got EOF")
let errorExpecting expecting gotToken = failwith ("Expected " + expecting + ", got" + gotToken)
let errorAtStart expecting gotToken = failwith ("Expected " + expecting + " which cannot start with" + writeToken gotToken)

let tail = LazyList.tail
let head = LazyList.head

And that is the parser. All 100+ lines of it. As you can tell it is rather formulaic to go from a grammar to a lexer and a parser, which is why you shouldn’t do it, but instead let a tool generate the code for you given the grammar or use FParsec.

We have written 200+ code and I don’t think we can be too proud of our achievement. It is:

  • Certainly buggy
  • Primitive in error handling
  • Not tail recursive (big text is likely to blow up our stack)
  • Probably inefficient

So let’s look next at a better way to do it.

Adventure in parserland – parsing lambda expressions in F# – Part III

Let’s start from the lexer. Remember, I wrote this code based on my memory of how a lexer ought to look like. I didn’t read again the relevant chapters in the Dragon book. But I think it came out all right after all.

The tokenStream function we looked at last time takes a LazyList<char> and returns a LazyList<Token>. It uses the unfold method on LazyList to call matchToken on each char until the stream is empty.

let rec tokenStream chars = 
    LazyList.unfold
        (fun chList ->
            match chList with
            | LazyList.Nil -> None
            | chList ->  
                let token, chList' = matchToken chList
                Some(token, chList')
        )
        chars 

A token is what gets passed up to the parser to do syntactic analysis on. It is the vocabulary of our language. The lexer divide a phrase in words, the parser put together the words in a phrase. So, these are the words.

type Token =
    | Name of string
    | Dot
    | OpenParens
    | CloseParens
    | Lambda
    | Def
    | Ws of string
    | NewLine
    | EOF

Matching is a process whereas you try to return the token that you have read plus the list of characters yet to be read. Matching a Token is defined below:

let matchToken = function 
    | LazyList.Nil                 -> EOF, LazyList.empty 
    | LazyList.Cons(h, t) as chars ->
        match h with
        | ch when isWs ch -> matchWs chars
        | ch when isSpecialChar ch -> matchSpecialChar ch t
        | _ -> matchString chars

A token is either nothing, a whitespace, a special char or anything else.

Let’s look at what matching each one of them means.  Matching whitespaces means consuming them and remembering what was consumed.

let matchWs chars = 
    let value, remainingChars = matchSeriesOfChars isWs chars
    Ws value, remainingChars

matchSeriesOfChars takes a predicate and a LazyList of chars and returns the string composed of all the consecutive chars for which the predicate is true, plus, as always, the remaining chars to be matched. In this case the predicate returns true if the char is a whitespace.

To write matchSeriesOfChars I need a function that reverses a LazyList. Not having found such thing, I wrote it.

let reversell l =
    let rec go l' a = match l', a with
                        | LazyList.Nil, a -> a
                        | LazyList.Cons(h, t), a -> go t (LazyList.cons h a)
    go l LazyList.empty

Then I wrote matchSeriesOfChars. The function uses an accumulator. It adds to the front whenever the predicate is true, it reverses it and translates it to a string (I could have reversed the string instead, it might have been better).

let matchSeriesOfChars comparer chars =
    let rec go result = function
        | LazyList.Nil    -> charListToString(reversell result), LazyList.empty 
        | LazyList.Cons(h, t) -> if comparer h then go (LazyList.cons h result) t
                                 else charListToString (reversell result), LazyList.cons h t
    go LazyList.empty chars

These are  predicates we’ll use later on to recognize characters:

let isInString (ch: char) (s: string) = s.IndexOf(ch) <> -1
let isWs (chr: char) = isInString chr wsChars
let isNameChar (chr: char) = not (isInString chr (wsChars + specialChars))
let isSpecialChar ch = isInString ch specialChars

wsChar and specialChars are defined below:

    let wsChars = " \t"
    let charTokens =
        Map.ofList [
            '.' , Dot
            '(' , OpenParens
            ')' , CloseParens
            '\\', Lambda
            '\n', NewLine
         ]
let specialChars = charTokens |> Map.fold (fun s k v -> s + k.ToString()) ""

Getting back to the more important matching functions, matching a special character is defined as a simple lookup in the charToken map:

let matchSpecialChar ch chars = Map.find ch charTokens, chars

We are left with matchString, this simply matches the characters until it finds a char that cannot be part of a name. It then looks it up in a list of special strings. If it finds it, it returns it, otherwise it just returns the name.

let stringTokens =
    Map.ofList [
        "Def", Def    
    ]
let matchString chars =
    let value, remainingChars = matchSeriesOfChars isNameChar chars
    let specialString = Map.tryFind value stringTokens
    if specialString.IsSome
        then specialString.Value, remainingChars
        else Name(value), remainingChars

And we are done with the lexer, all of 100+ lines of it …