Metalua Manual


Chapter 2  Meta-programming libraries and extensions

2.1  gg, the grammar generator



gg is the grammar generator, the library with which the Metalua parser is built. Knowing it allows you to write your own parsers easily, and to plug them into mlp, the existing Metalua source parser. It defines a couple of generators, which take parsers as parameters and return a more complex parser as a result by combining them.

Notice that gg sources are thoroughly commented, and try to be readable as a secondary source of documentation.

There are four main classes in gg, which allow you to generate:
  • sequences of keywords and sub-parsers;
  • keyword-driven sequence sets, i.e. parsers which select a sequence parser depending on an initial keyword;
  • lists, i.e. series of an undetermined number of elements of the same type, optionally separated by a keyword (typically ``,'');
  • expressions, built with infix, prefix and suffix operators around a primary expression element.

2.1.1  Sequences

A sequence parser combines sub-parsers together, calling them one after the other. A lot of these sub-parsers will simply read a keyword, and do nothing with it except making sure that it is indeed there: these can be specified by a simple string. For instance, the following declarations create parsers that read function declarations, thanks to some sub-parsers:
  • func_stat_name reads a function name (a list of identifiers separated by dots, plus optionally a colon and a method name);
  • func_params_content reads a possibly empty list of identifiers separated with commas, plus an optional ``...'';
  • mlp.block reads a block of statements (here the function body);
  • mlp.id reads an identifier.

-- Read a function definition statement
func_stat = gg.sequence{ "function", func_stat_name, "(",
                         func_params_content, ")", mlp.block, "end" }

-- Read a local function definition statement
func_stat = gg.sequence{ "local", "function", mlp.id, "(",
                         func_params_content, ")", mlp.block, "end" }

Constructor gg.sequence (config_table)

This function returns a sequence parser. config_table contains, in its array part, a sequence of string representing keyword parsers, and arbitrary sub-parsers. Moreover, the following fields are allowed in the hash part of the table:
  • name = <string>: the parser's name, used to generate error messages;
  • builder = <function>: if present, whenever the parser is called, the list of the sub-parsers' results is passed to this function, and the function's return value is returned as the parser's result. If absent, the list of sub-parser results is simply returned. It can be updated at any time with:
    x.builder = <newval>.
  • builder = <string>: the string is added as a tag to the list of sub-parser results. builder = "foobar" is equivalent to:
    builder = function(x) x.tag="foobar"; return x end
  • transformers = <function list>: applies all the functions of the list, in sequence, to the result of the parsing: these functions must be of type AST -> AST. For instance, if the transformers list is {f1, f2, f3}, and the builder returns x, then the whole parser returns f3(f2(f1(x))).
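
For instance, a sequence parser combining these options could be sketched as follows (the name, the "Return" tag and the use of mlp.expr_list are illustrative, not a description of the actual mlp definition):

return_stat = gg.sequence{
   "return", mlp.expr_list,
   name    = "return statement",
   builder = "Return" }  -- tags the result list with tag "Return"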

Method :parse(lexstream)

Read a sequence from the lexstream. If the sequence can't be entirely read, an error occurs. The result is either the list of results of sub-parsers, or the result of builder if it is non-nil. In the func_stat example above, the result would be a list of 3 elements: the results of func_stat_name, func_params_content and mlp.block.

It can also be directly called as simply x(lexstream) instead of x:parse(lexstream).

Method .transformers:add(f)

Adds a function at the end of the transformers list.

2.1.2  Sequence sets

In many cases, several sequence parsers can be applied at a given point, and the choice of the right parser is determined by the next keyword in the lexstream. This is typically the case for Metalua's statement parser. All the sequence parsers must start with a keyword rather than a sub-parser, and that initial keyword must be different for each sequence parser in the sequence set parser. The sequence set parser takes care of selecting the appropriate sequence parser. Moreover, if the next token in the lexstream is not a keyword, or if it is a keyword but no sequence parser starts with it, the sequence set parser can have a default parser which is used as a fallback.

For instance, the declaration of mlp.stat, the Metalua statement parser, looks like:
mlp.stat = gg.multisequence{
  mlp.do_stat, mlp.while_stat, mlp.repeat_stat, mlp.if_stat... }

Constructor gg.multisequence (config_table)

This function returns a sequence set parser. The array part of config_table contains a list of parsers. It also accepts tables instead of sequence parsers: in this case, these tables are supposed to be config tables for gg.sequence constructor, and are converted into sequence parsers on-the-fly by calling gg.sequence on them. Each of these sequence parsers has to start with a keyword, distinct from all other initial keywords, e.g. it's illegal for two sequences in the same multisequence to start with keyword do; it's also illegal for any parser but the default one to start with a subparser, e.g. mlp.id.

It also accepts the following fields in the table's hash part:
  • default = <parser>: if no sequence can be chosen, because the next token is not a keyword, or no sequence parser in the set starts with that keyword, then the default parser is run instead; if no default parser is provided and no sequence parser can be chosen, an error is generated when parsing.
  • name = <string>: the parser's name, used to generate error messages;
  • builder = <function>: if present, whenever the parser is called, the selected parser's result is passed to this function, and the function's return value is returned as the parser's result. If absent, the selected parser's result is simply returned. It can be updated at any time with x.builder = <newval>.
  • builder = <string>: the string is added as a tag to the list of sub-parser results. builder = "foobar" is equivalent to:
    builder = function(x) return { tag = "foobar"; unpack (x) } end
  • transformers = <function list>: applies all the functions of the list, in sequence, to the result of the parsing: these functions must be of type AST -> AST.
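
Putting the fields together, a small statement multisequence might be sketched as follows (my_block, my_expr and expr_stat are assumed to be parsers defined elsewhere; the two inner tables are converted into sequence parsers on-the-fly):

my_stat = gg.multisequence{
   name    = "my statement",
   default = expr_stat,  -- fallback when no keyword matches
   { "do", my_block, "end", builder = "Do" },
   { "while", my_expr, "do", my_block, "end", builder = "While" } }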

Method :parse(lexstream)

Read from the lexstream. The result returned is the result of the selected parser (one of the sequence parsers, or the default parser). That result is either returned directly, or passed through builder if this field is non-nil.

It can also be directly called as simply x(lexstream) instead of x:parse(lexstream).

Method :add(sequence_parser)

Takes a sequence parser, or a config table that would be accepted by gg.sequence to build a sequence parser, and adds that parser to the set of sequence parsers handled by x. Causes an error if the parser doesn't start with a keyword, if that initial keyword is already reserved by a registered sequence parser, or if the parser is not a sequence parser.
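
For instance, a new statement could be plugged into mlp.stat this way (a sketch: the `unless` keyword and its translation are invented for the example, and the `Not operator tag is assumed, following the `Op{ `Add, a, b } convention used elsewhere in this manual):

mlp.lexer:add "unless"
mlp.stat:add{ "unless", mlp.expr, "then", mlp.block, "end",
              builder = |x| `If{ `Op{ `Not, x[1] }, x[2] } }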

Field .default

This field contains the default parser, and can be set to another parser at any time.

Method .transformers:add

Adds a function at the end of the transformers list.

Method :get(keyword)

Takes a keyword (as a string), and returns the sequence in the set starting with that keyword, or nil if there is no such sequence.

Method :del(keyword)

Removes the sequence parser starting with the given keyword.

2.1.3  List parser

Sequence parsers let you chain several different sub-parsers. Another common need is to read a series of identical elements into a list, without knowing in advance how many of these elements will be found. This lets you read lists of arguments in a call, lists of parameters in a function definition, lists of statements in a block...

Another common feature of such lists is that elements of the list are separated by keywords, typically semicolons or commas. These are handled by the list parser generator.

A list parser needs a way to know when the list is finished. When elements are separated by keyword separators, this is easy to determine: the list stops when an element is not followed by a separator. But two cases remain unsolved:
  • when a list is allowed to be empty, no separator keyword lets the parser realize that it is in front of an empty list: it would call the element parser, and that parser would fail;
  • when no separator keywords are specified, they can't be used to determine the end of the list.
For these two cases, a list parser can specify a set of terminator keywords: in separator-less lists, the parser returns as soon as a terminator keyword is found where an element would otherwise have been read. In lists with separators, if terminators are specified, and such a terminator is found at the beginning of the list, then no element is parsed, and an empty list is returned. For instance, for argument lists, ``)'' would be specified as a terminator, so that empty argument lists ``()'' are handled properly.

Beware that separators are consumed from the lexstream, but terminators are not.

Constructor gg.list (config_table)

This function returns a list parser. config_table can contain the following fields:
  • primary = <parser> (mandatory): the parser used to read elements of the list;

  • separators = <list> : list of strings representing the keywords accepted as element separators. If only one separator is allowed, then the string can be passed outside the list:
    separators = "foo" is the same as separators = { "foo" }.

  • terminators = <list> : list of strings representing the keywords accepted as list terminators. If only one terminator is allowed, then the string can be passed outside the list.

  • name = <string>: the parser's name, used to generate error messages;

  • builder = <function>: if present, whenever the parser is called, the list of primary parser results is passed to this function, and the function's return value is returned as the parser's result. If absent, the list of sub-parser results is simply returned. It can be updated at any time with x.builder = <newval>.

  • builder = <string>: the string is added as a tag to the list of sub-parser results. builder = "foobar" is equivalent to:
    builder = function(x) return { tag = "foobar"; unpack (x) } end

  • transformers = <function list>: applies all the functions of the list, in sequence, to the result of the parsing: these functions must be of type AST -> AST.

  • if a keyless element is found in config_table, and there is no primary key in the table, then it is expected to be a parser, and is taken as the primary parser.
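
For instance, a comma-separated list of expressions terminated by a closing parenthesis, suitable for reading call arguments, could be sketched as (the name my_args is illustrative):

my_args = gg.list{
   primary     = mlp.expr,
   separators  = ",",
   terminators = ")",
   name        = "argument list" }

The ")" terminator lets empty argument lists "()" be handled properly, as explained above; since terminators are not consumed, the ")" is left in the lexstream for an enclosing sequence parser to read.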

Method :parse (lexstream)

Read a list from the lexstream. The result is either the list of elements as read by the primary parser, or the result of that list passed through builder if it is specified.

It can also be directly called as simply x(lexstream) instead of x:parse(lexstream).

Method .transformers:add

Adds a function at the end of the transformers list.

2.1.4  Method .separators:add

Adds a string to the list of separators.

2.1.5  Method .terminators:add

Adds a string to the list of terminators.

2.1.6  Expression parser

This is a very powerful parser generator, but as a consequence its API is quite large. An expression parser relies on a primary parser, and the elements read by this parser can be:
  • combined pairwise by infix operators;
  • modified by prefix operators;
  • modified by suffix operators.
All operators are described in a way analogous to sequence config tables: a sequence of keywords-as-strings and sub-parsers in a table, plus the usual builder and transformers fields. Each kind of operator has its own signature for builder functions, and some specific additional information such as precedence or associativity. As in multisequences, the choice among operators is determined by the initial keyword. Therefore, it is illegal to have two operator sequences which start with the same keyword (for instance, two infix sequences starting with keyword ``$'').

Most of the time, the sequences representing operators will have a single keyword, and no subparser. For instance, the addition is represented as:
{ "+", prec=60, assoc="left", builder= |a, _, b| `Op{ `Add, a, b } }

Infix operators
Infix operators are described by a table whose array-part works as for sequence parsers. Besides this array-part and the usual transformers list, the table accepts the following fields in its hash-part:
  • prec = <number> its precedence. The higher the precedence, the tighter the operator binds with respect to other operators. For instance, in Metalua, addition precedence is 60, whereas multiplication precedence is 70.

  • assoc = <string> is one of "left", "right", "flat" or "none", and specifies how an operator associates. If not specified, the default associativity is "left".

    Left and right describe how to read sequences of operators with the same precedence, e.g. addition is left associative (1+2+3 reads as (1+2)+3), whereas exponentiation is right-associative (1^2^3 reads as 1^(2^3)).

    If an operator is non-associative and an ambiguity is found, a parsing error occurs.

    Finally, series of flat operators are collected in a list, which is passed to the corresponding builder as a single parameter. For instance, if ++ is declared as flat and its builder is f, then whenever 1++2++3++4 is parsed, the result returned is f{1, 2, 3, 4}.

  • builder = <function> the usual result transformer. The function takes as parameters the left operand, the result of the sequence parser (i.e. { } if the sequence contains no subparser), and the right operand; it must return the resulting AST.
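As a sketch, a flat operator declaration could look like this (the ++ keyword, its precedence and the list_concat function are invented for the example):

{ "++", prec = 55, assoc = "flat",
  builder = function (operands)
     -- operands holds the whole series, e.g. {1, 2, 3, 4} for 1++2++3++4
     return `Call{ `Id "list_concat", unpack (operands) }
  end }
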
Prefix operators
These operators are placed before the sub-expression they modify. They have the same properties as infix operators, except that they don't have an assoc field, and builder takes |operator, operand| instead of |left_operand, operator, right_operand|.

Suffix operators
Same as prefix operators, except that builder takes |operand, operator| instead of |operator, operand|.

Constructor gg.expr (config_table)

This function returns an expression parser. config_table is a table of fields which describes the kind of expression to be read by the parser. The following fields can appear in the table:
  • primary (mandatory): the primary parser, which reads the primary elements linked by operators. In Metalua expression parsers, that would be numbers, strings, identifiers... It is often a multisequence parser, although that's not mandatory.
  • prefix: a list of tables representing prefix operator sequences, as described above. It supports a default parser: this parser is considered to have successfully parsed a prefix operator if it returns a non-false result.
  • infix: a list of tables representing infix operator sequences, as described above. Supports a default parser.
  • suffix: a list of tables representing suffix operator sequences, as described above. Supports a default parser.
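
Putting it together, a minimal arithmetic expression parser could be assembled as follows (a sketch: my_primary is assumed to read numbers and identifiers, and the `Unm tag for unary minus is an assumption, following the `Op{ `Add, a, b } convention used above):

my_expr = gg.expr{
   name    = "my expression",
   primary = my_primary,
   prefix  = { { "-", prec=65, builder = |_, e| `Op{ `Unm, e } } },
   infix   = {
      { "+", prec=60, assoc="left",  builder = |a, _, b| `Op{ `Add, a, b } },
      { "*", prec=70, assoc="left",  builder = |a, _, b| `Op{ `Mul, a, b } },
      { "^", prec=90, assoc="right", builder = |a, _, b| `Op{ `Pow, a, b } } } }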

Methods .prefix:add(), .infix:add(), .suffix:add()

Add an operator in the relevant table. The argument must be an operator sequence table, as described above.

Method :add()

This is just a shortcut for primary.add. Unspecified behavior if primary doesn't support method add.

Method :parse (lexstream)

Read an expression from the lexstream. The result is built by builder calls.

It can also be directly called as simply x(lexstream) instead of x:parse(lexstream).

Method :tostring()

Returns a string representing the parser. Mainly useful for error message generation.

2.1.7  onkeyword parser

Takes a list of keywords and a parser: if the next token is one of the keywords in the list, runs the parser; if not, simply returns false.

Notice that by default, the keyword is consumed by the onkeyword parser. If you want it not to be consumed, but instead passed to the internal parser, add a peek=true entry in the config table.

Constructor gg.onkeyword (config_table)

Create a keyword-conditional parser. config_table can contain:
  • strings, representing the triggering keywords (at least one);
  • the parser to run if one of the keywords is found (exactly one);
  • peek=<boolean> to indicate whether recognized keywords must be consumed or passed to the inner parser.
The order of elements in the list is not relevant.
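
For instance, an optional `else` clause could be read with a sketch like:

opt_else = gg.onkeyword{ "else", mlp.block }
-- later, during parsing:
local else_block = opt_else (lexstream)  -- a block, or false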

Method :parse (lexstream)

Run the parser. The result is the internal parser's result, or false if the next token in the lexstream wasn't one of the specified keywords.

It can also be directly called as simply x(lexstream) instead of x:parse(lexstream).

2.1.8  optkeyword parser

Watch for optional keywords: an optkeyword parser has a list of keyword strings as a configuration. If such a keyword is found as the next lexstream element upon parsing, the keyword is consumed and that string is returned. If not, false is returned.

Constructor gg.optkeyword (keyword1, keyword2, ...)

Return a gg.optkeyword parser, which accepts all of the keywords given as parameters, and returns either the found keyword, or false if none is found.
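
A sketch of typical use:

opt_local = gg.optkeyword "local"
local kw = opt_local (lexstream)  -- "local" if found and consumed, false otherwise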

2.2  mlp, the metalua parser

The Metalua parser is built on top of gg, and cannot be understood without some knowledge of it. Basically, gg allows you not only to build parsers, but to build extensible parsers. Depending on a parser's type (sequence, sequence set, list, expression...), different extension methods are available, which are documented in the gg reference. The current section gives the information needed to extend Metalua syntax:
  • which mlp entries are accessible for extension;
  • what they parse;
  • what the underlying parser type is (and therefore, which extension methods are supported).

2.2.1  Parsing expressions

  • mlp.expr (gg.expr): the top-level expression parser, and the main extension point for Metalua expressions. Supports all of the methods defined by gg.expr.
  • mlp.func_val (gg.sequence): reads a function definition, from the arguments' opening parenthesis to the final end, but excluding the initial function keyword, so that it can be used both for anonymous functions, for function some_name(...) end and for local function some_name(...) end.
  • mlp.expr_list
  • mlp.table_content (gg.list): reads the content of a table, excluding the surrounding braces.
  • mlp.table (gg.sequence): reads a literal table, including the surrounding braces.
  • mlp.table_field (custom function): reads a table entry: [foo]=bar, foo=bar or bar.
  • mlp.opt_id (custom function): tries to read an identifier, or an identifier splice; on failure, returns false.
  • mlp.id (custom function): reads an identifier, or an identifier splice; causes an error if there is no identifier.

2.2.2  Parsing statements

  • mlp.block (gg.list): reads a sequence of statements, optionally separated by semicolons. When introducing syntax extensions, it's often necessary to add block terminators with mlp.block.terminators:add().
  • mlp.for_header (custom function): reads a for header, from just after the ``for'' to just before the ``do''.
  • mlp.stat (gg.multisequence): reads a single statement.
Actually, mlp.stat is an extended version of a multisequence: it supports easy addition of new assignment operators. It has a field assignments, whose keys are assignment keywords and whose values are assignment builders taking the left-hand side and right-hand side as parameters. For instance, C's ``+='' operator could be added as:
mlp.lexer:add "+="
mlp.stat.assignments["+="] = function (lhs, rhs)
  assert(#lhs==1 and #rhs==1)
  local a, b = lhs[1], rhs[1]
  return +{stat: (-{a}) = -{a} + -{b} }
end 

2.2.3  Other useful functions and variables

  • mlp.gensym() generates a unique identifier. The uniqueness is guaranteed, therefore this identifier cannot capture another variable; it is useful to write hygienic macros.
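
As a sketch, a macro can use it to create a temporary that cannot clash with user variables (get_value and use are hypothetical functions, invented for the example):

local tmp = mlp.gensym ()  -- a fresh, unique identifier node
return +{block: local -{tmp} = get_value(); use (-{tmp}, -{tmp}) }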

2.3  Extension match: structural pattern matching

Pattern matching is an extremely pleasant and powerful way to handle tree-like structures, such as ASTs. Unsurprisingly, it's a feature found in most ML-inspired languages, which excel at compilation-related tasks. There is a pattern matching extension for metalua, which is extremely useful for most meta-programming purposes.

2.3.1  Purpose

First, to clear a common misconception: structural pattern matching has absolutely nothing to do with regular expression matching on strings: it works on arbitrary structured data. When manipulating trees, you want to check whether they have a certain structure (e.g. a `Local node with, as first child, a list of variables whose tags are `Id); when you've found data that matches a certain pattern, you want to name the interesting sub-parts of it, so that you can manipulate them easily; finally, most of the time, you have a series of different possible patterns, and you want to apply the one that matches a given piece of data. These are the needs addressed by pattern matching: it lets you give a list of (pattern -> code_to_execute_if_match) associations, selects the first matching pattern, and executes the corresponding code. Patterns both describe the expected structures and bind local variables to interesting parts of the data. Those variables' scope is obviously the code to execute upon matching success.
Match statement
A match statement has the form:
match <some_value> with
| <pattern_1> -> <block_1>
| <pattern_2> -> <block_2>
...
| <pattern_n> -> <block_n>
end
The first vertical bar after the "with" is optional; moreover, a pattern can actually be a list of patterns, separated by bars. In this case, it's enough for one of them to match for the block to be executed:
match <some_value> with
| <pattern_1> | <pattern_1_bis > -> <block_1>
...
end
When the match statement is executed, the first pattern which matches <some_value> is selected, the corresponding block is executed, and all other patterns and blocks are ignored. If no pattern matches, an error "mismatch" is raised. However, we'll see it's easy to add a catch-all pattern at the end of the match, when we want it to be failproof.
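
For instance, here is a failproof match statement with a final catch-all pattern (the patterns themselves are detailed in the next section):

match ast with
| `Id{ name }  -> print ("identifier " .. name)
| `Number{ n } -> print ("number " .. n)
| _            -> print "something else"  -- catch-all, never mismatches
end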

2.3.2  Patterns definition

Atomic literals
Syntactically, a pattern is mostly identical to the values it matches: numbers, booleans and strings, when used as patterns, match identical values.
match x with
| 1 -> print 'x is one'
| 2 -> print 'x is two'
end
Tables
Tables as patterns match tables with the same number of array-part elements, if each pattern field matches the corresponding value field. For instance, {1, 2, 3} as a pattern matches {1, 2, 3} as a value. Pattern {1, 2, 3} matches value {1, 2, 3, foo=4}, but pattern {1, 2, 3, foo=4} doesn't match value {1, 2, 3}: there can be extra hash-part fields in the value, not in the pattern. Notice that field 'tag' is a regular hash-part field, therefore {1, 2, 3} matches `Foo{1, 2, 3} (but not the other way around). Of course, table patterns can be nested. The table keys must currently be integers or strings; it's not difficult to add more, but the need hasn't yet emerged. If you want to match tables of arbitrary array-part size, you can add a "..." as the pattern's final element. For instance, pattern {1, 2, ...} will match all tables with at least two array-part elements whose first two elements are 1 and 2.
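For instance, a sketch illustrating the rules above:

match `Foo{ 1, 2, 3 } with
| { 1, 2, ... } -> print "matched: the tag is just a hash-part field"
end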
Identifiers
The other fundamental kind of patterns are identifiers: they match everything, and bind whatever they match to themselves. For instance, pattern 1, 2, x will match value 1, 2, 3, and in the corresponding block, local variable x will be set to 3. By mixing tables and identifiers, we can already do interesting things, such as getting the identifier list out of a local statement, as mentioned above:
match stat with
| `Local{ identifiers, values } ->
   table.foreach(identifiers, |x| print(x[1]))
... -- other cases
end
When a variable appears several times in a single pattern, all the elements it matches must be equal, in the sense of the "==" operator. For instance, pattern { x, x } will match value { 1, 1 }, but not { 1, 2 }. Both values would be matched by pattern { x, y }, though. A special identifier is "_", which doesn't bind its content: even if it appears more than once in the pattern, the matched value parts aren't required to be equal. The pattern "_" is therefore the simplest catch-all, and a match statement with a final "| _ ->" case will never throw a "mismatch" error.
Guards
Some tests can't be performed by pattern matching. For these cases, the pattern can be followed by an "if" keyword, followed by a condition.
match x with
| n if n%2 == 0 -> print 'even'
| _ -> print 'odd'
end
Notice that the identifiers bound by the pattern are available in the guard condition. Moreover, the guard can apply to several patterns:
match x with
| n | {n} if n%2 == 0 -> print 'even'
| _ -> print 'odd'
end
Multi-match
If you want to match several values, let's say 'a' and 'b', there's an easy way:
match {a,b} with
| {pattern_for_a, pattern_for_b} -> block
...
end
However, it introduces quite a lot of useless tables and checks. Since this kind of multiple match is fairly common, it is supported natively:
match a, b with
| pattern_for_a, pattern_for_b -> block
...
end
This will save some useless tests and computation, and the compiler will complain if the number of patterns doesn't match the number of values.
String regular expressions
There is a way to use Lua's regular expressions with match, through the division operator ``/'': the left operand is expected to be a literal string, interpreted as a regular expression. The variables it captures are stored in a table, which is matched as a value against the right-hand-side operand. For instance, the following case succeeds when foo is a string composed of 3 words separated by spaces. In case of success, these words are bound to variables w1, w2 and w3 in the executed block:
match foo with
| "^(%w+) +(%w+) +(%w+)$"/{ w1, w2, w3 } -> 
   do_stuff (w1, w2, w3)
end

2.3.3  Examples

There are quite a lot of samples using match in the metalua distribution, and more will come in the next one. Dig in the samples for fairly simple usages, or in the standard libs for more advanced ones. Look for instance at examples provided with the walk library.

2.4  walk, the code walker

When you write advanced macros, or when you're analyzing some code to check for some property, you often need to design a function that walks through an arbitrary piece of code and does complex stuff on it. Such a function is called a code walker. Code walkers can be used for localized adjustments, e.g. changing a function's return statements into something else, or checking that a loop performs no break, up to pretty advanced transformations, such as CPS transformation (a way to encode full continuations into a language that doesn't support them natively; see Paul Graham's On Lisp for an accessible description of how it works), lazy semantics... Anyway, code walkers are tricky to write, can involve a lot of boilerplate code, and are generally brittle. To ease things as much as possible, Metalua comes with a walk library, which intends to accelerate code walker implementation. Since code walking is intrinsically tricky, the lib won't magically make it trivial, but at least it will save you a lot of time and code, when compared to writing all walkers from scratch. Moreover, other people who took the time to learn the walker generator's API will get into your code much faster.

2.4.1  Principles

Code walking is about traversing a tree, first from root to leaves, then from leaves back to the root. This tree is not uniform: some nodes are expressions, some others statements, some others blocks; and each of these node kinds is subdivided in several sub-cases (addition, numeric for loop...). The basic code walker just goes from root to leaves and back to root without doing anything. Then it's up to you to plug some action callbacks in that walker, so that it does interesting things for you. Without entering into the details of AST structure, here is a simplified version of the walking algorithm, pretending to work on a generic tree:
function traverse(node)
   local down_result = down(node)
   if down_result ~= 'break' then 
      for c in children(node) do
         traverse(c)
      end
   end
   up(node)
end
The algorithm behind 'walk' is almost as simple as the one above, except that it's specialized for AST trees. You can essentially specialize it by providing your own up() and down() functions. These visitor functions perform whatever action you want on the AST; moreover, down() has the option to return 'break': in that case, the sub-nodes are not traversed. This allows the user to skip parts of the tree, or to provide their own special traversal method in the down() visitor. The real walker generator is only marginally more complex than that:
  • It lets you define your up() and down(), and down() can return 'break' to cut the tree traversal; however, these up() and down() functions are specialized by node kind: there can be separate callbacks for expr nodes, stat nodes, block nodes.
  • There's also a binder() visitor for identifier binders. Binders are variables which declare a new local variable; you'll find them in nodes `Local, `Localrec, `Forin, `Fornum, `Function. The binders are visited just before the variable's scope begins, i.e. before traversing a loop or a function's body, after traversing a `Local's right-hand side, before traversing a `Localrec's right-hand side.
    Notice that separate up() and down() visitors wouldn't make sense for binders, since they're leaf nodes.
  • Visitor functions don't only take the visited node as parameter: they also take the list of all expr, stat and block nodes above it, up to the AST's root. This allows a child node to act on its parent nodes.

2.4.2  API

There are 3 main tree walkers: walk.expr(), walk.stat() and walk.block(), to walk through the corresponding kinds of ASTs. Each of these walkers takes as parameters a table cfg containing the various visitor functions, and the AST to walk through. The configuration table cfg can contain the following fields:
  • cfg.stat.down(node, parent, grandparent...), which applies when traversing a statement down, i.e. before its children nodes are parsed, and can modify the tree, and return nil or 'break'. The way children are traversed is decided after the down() visitor has been run: this point matters when the visitor modifies its children nodes.
  • cfg.stat.up(node, parent, grandparent...), which applies on the way back up. It is applied even if cfg.stat.down() returned 'break', but in that case, the children have not been (and will not be) traversed.
  • cfg.expr.down() and cfg.expr.up(), which work just as their stat equivalent, but apply to expression nodes.
    Notice that in Lua, function calls and method invocations can be used as statements as well as expressions: in such cases, they are visited only by the statement visitor, not by the expression visitor.
  • cfg.block.down() and cfg.block.up() do the same for statements blocks: loops, conditional and function bodies.
  • cfg.binder(identifier, id_parent, id_grandparent...): this is run on identifiers which create a new local variable, just before that variable's scope begins.
Moreover, there is a walk.guess(cfg, ast) walker which tries to guess the type of the AST it receives, and applies the appropriate walker. When an AST can be either an expression or a statement (nodes `Call and `Invoke), it is interpreted as an expression.

2.4.3  Examples

A bit of practice now. Let's build the simplest walker possible, that does nothing:
cfg = { }
walker = |ast| walk.block(cfg, ast)
Now, let's say we want to catch and remove all statement calls to function assert(). This can be done by removing the node's tag and content: an empty list is simply ignored in an AST. So we're only interested in `Call nodes, and within these nodes, we want the function to be `Id 'assert'. All of this is only relevant to stat nodes:
function cfg.stat.down (x)
   match x with
   | `Call{ `Id 'assert', ... } -> x.tag=nil; x <- { }
   | _ -> -- not interested by this node, do nothing
   end
end
You'll almost always want to use the 'match' extension to implement visitors. The imperative table overrider (x <- y, a.k.a. table.override(x, y)) also often comes in handy to modify an AST. We'll now remove assert() calls in non-statement positions; we cannot replace an expression with nothing, so these nodes will simply be replaced by `Nil:
function cfg.expr.down (x)
   match x with
   | `Call{ `Id 'assert', ... } -> x <- `Nil
   | _ -> -- not interested by this node, do nothing
   end
end
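Putting the two visitors together (cfg.stat.down() and cfg.expr.down() as defined above), the walker can then be applied to a block; the sample chunk below is ours:

local ast = +{ block:
   assert(x > 0)          -- statement call: emptied in place
   local y = assert(f())  -- expression call: replaced by `Nil
}
walk.block(cfg, ast)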
Here's a remark for functional programmers: this API is very imperative; you might cringe at seeing the `Call nodes transformed in-place. Well, I tend to agree but it's generally counter-productive to work against the grain of the wood: Lua is imperative at heart, and design attempts at doing this functionally sucked more than approaches that embraced imperativeness.
Cuts
By making down() return 'break', you can prevent the traversal from going further down. This might be either because you're not interested in the subtrees, or because you want to traverse them in a special way. In the latter case, just do the traversal yourself in the down() function, and cut the walking by returning 'break', so that the nodes aren't re-traversed by the default walking algorithm.

We'll see that in the next, more complex example: listing free variables. This example is exclusively there for demonstration purposes; for actual work on identifiers that requires awareness of an identifier's binder or freedom, there is a dedicated walk.id library.

We'll progressively build a walker that gathers all global variables used in a given AST. This involves keeping, at all times, a set of the identifiers currently bound by a "local" declaration, as function parameters, as for loop variables, etc. Then, every time an identifier is found in the AST, its presence is checked in the current set of bound variables. If it isn't in it, then it's a free (global) identifier.

The first thing we'll need is a scope handling system: something that keeps track of which identifiers are currently in scope. It must also allow us to save the current scope (e.g. before we enter a new block) and restore it afterwards (e.g. after we leave the block). This is quite straightforward and unrelated to code walking; here is the code:

require 'std'
require 'walk'

-{ extension 'match' }

--------------------------------------------------------------------------------
-- Scope handling: ':push()' saves the current scope, ':pop()'
-- restores the previously saved one. ':add(identifiers_list)' adds
-- identifiers to the current scope. Current scope is stored in
-- '.current', as a string->boolean hashtable.
--------------------------------------------------------------------------------

local scope = { }
scope.__index = scope

function scope:new()
   local ret = { current = { } }
   ret.stack = { ret.current }
   setmetatable (ret, self)
   return ret
end

function scope:push()
   table.insert (self.stack, table.shallow_copy (self.current))
end

function scope:pop()
   self.current = table.remove (self.stack)
end

function scope:add (vars)
   for id in values (vars) do
      match id with `Id{ x } -> self.current[x] = true end
   end
end
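Before plugging this class into a walker, here is a quick sketch of how it is meant to be used; the sample identifiers are ours:

local s = scope:new()
s:add{ +{x}, +{y} }  -- +{x} is the AST `Id 'x'
s:push()
s:add{ +{z} }        -- 'z' is only visible until the next :pop()
s:pop()
-- s.current now contains 'x' and 'y', but not 'z'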
(There is an improved version of this class in the library walk.scope; cf. its documentation for details.) Now let's start designing the walker. We'll keep a scope object up to date, as well as a set of the free identifiers found so far, gathered every time we find an `Id node. To slightly simplify matters, we'll consider that the AST represents a block.

local function fv (term)
   local freevars = { }
   local scope    = scope:new()
   local cfg      = { expr  = { } }

   function cfg.expr.down(x)
      match x with
      | `Id{ name } -> if not scope.current[name] then freevars[name] = true end
      | _ -> -- pass
      end
   end

   walk.guess(cfg, term)
   return freevars
end
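A quick test of this first version; the sample AST is ours:

local ast = +{ block: local x = 1; print(x + y) }
for name in pairs(fv(ast)) do print(name) end
-- prints 'print', 'x' and 'y'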
Since we never add any identifier to the scope, this will just list all the identifiers used in the AST, bound or not. Now let's start doing more interesting things:
  • We'll save the scope's state before entering a new block, and restore it when we leave it. That will be done by providing functions cfg.block.down() and cfg.block.up(). Saving and restoring will be performed by methods :push() and :pop().
  • Whenever we find a local declaration, we'll add the list of identifiers to the current scope, so that they won't be gathered when parsing the `Id expression nodes. Notice that the identifiers declared by the 'local' statement only start being valid after the statement, so we'll add them in the cfg.stat.up() function rather than cfg.stat.down().

local cfg = { expr  = { },
              stat  = { },
              block = { } }
   [...]
   function cfg.stat.up(x)
      match x with
      | `Local{ vars, ... } -> scope:add(vars)
      | _ -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Create a separate scope for each block, close it when leaving.
   -----------------------------------------------------------------------------
   function cfg.block.down() scope:push() end
   function cfg.block.up()   scope:pop()  end  
This starts to be useful. We can also easily add the case for `Localrec nodes (the ones generated by "local function foo() ... end"), where the variable is already bound in the `Localrec statement's right-hand side; so we do the same as for `Local, but we do it in the down() function rather than in the up() one. We'll also take care of `Function, `Forin and `Fornum nodes, which introduce new bound identifiers as function parameters or loop variables. This is quite straightforward; the only thing to take care of is to save the scope before entering the function/loop body (in down()), and restore it when leaving (in up()). The code becomes:

local function fv (term)
   local freevars = { }
   local scope    = scope:new()
   local cfg      = { expr  = { },
                      stat  = { },
                      block = { } }

   -----------------------------------------------------------------------------
   -- Check identifiers; add functions parameters to newly created scope.
   -----------------------------------------------------------------------------
   function cfg.expr.down(x)
      match x with
      | `Id{ name } -> if not scope.current[name] then freevars[name] = true end
      | `Function{ params, _ } -> scope:push(); scope:add (params)
      | _ -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Close the function scope opened by 'down()'.
   -----------------------------------------------------------------------------
   function cfg.expr.up(x)  
      match x with
      | `Function{ ... } -> scope:pop()
      | _ -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Create a new scope and register loop variable[s] in it
   -----------------------------------------------------------------------------
   function cfg.stat.down(x)
      match x with
      | `Forin{ vars, ... }    -> scope:push(); scope:add(vars)
      | `Fornum{ var, ... }    -> scope:push(); scope:add{var}
      | `Localrec{ vars, ... } -> scope:add(vars)
      | `Local{ ... }          -> -- pass
      | _ -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Close the scopes opened by 'down()'
   -----------------------------------------------------------------------------
   function cfg.stat.up(x)
      match x with
      | `Forin{ ... } | `Fornum{ ... } -> scope:pop()
      | `Local{ vars, ... }            -> scope:add(vars)
      | _ -> -- pass
      end
   end

   -----------------------------------------------------------------------------
   -- Create a separate scope for each block, close it when leaving.
   -----------------------------------------------------------------------------
   function cfg.block.down() scope:push() end
   function cfg.block.up()   scope:pop()  end

   walk.guess(cfg, term)
   return freevars
end
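Trying this version on a chunk that mixes local declarations and a loop (the sample AST is ours):

local ast = +{ block:
   local x = 1
   for i = 1, 10 do print(x + i + y) end
}
for name in pairs(fv(ast)) do print(name) end
-- only 'print' and 'y' are listed: 'x' and 'i' are bound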
This is almost correct now. There's one last tricky point of Lua's semantics that we need to address: in repeat foo until bar loops, "bar" is included in "foo"'s scope. For instance, if we write repeat local x=foo() until x>3, the "x" in the condition is the local variable "x" declared inside the body. This violates our way of handling block scopes: the scope must be kept alive after the block is finished. We'll fix this by walking the block inside `Repeat ourselves, and preventing the normal walking from happening by returning 'break':

   -----------------------------------------------------------------------------
   -- Create a new scope and register loop variable[s] in it
   -----------------------------------------------------------------------------
   function cfg.stat.down(x)
      match x with
      | `Forin{ vars, ... }    -> scope:push(); scope:add(vars)
      | `Fornum{ var, ... }    -> scope:push(); scope:add{var}
      | `Localrec{ vars, ... } -> scope:add(vars)
      | `Local{ ... }          -> -- pass
      | `Repeat{ block, cond } -> -- 'cond' is in the scope of 'block'
         scope:push()
         for s in values (block) do walk.stat(cfg)(s) end -- no new scope
         walk.expr(cfg)(cond)
         scope:pop()
         return 'break' -- No automatic walking of the subparts 'block' and 'cond'
      | _ -> -- pass
      end
   end
That's it, we've now got a full free variables lister, and have demonstrated most APIs offered by the basic 'walk' library. If you want to walk through identifiers in a scope-aware way, though, you'll want to look at the walk.id library.

2.4.4  Library walk.id, the scope-aware walker

This library walks an AST to gather information about the identifiers in it. It calls distinct visitor functions depending on whether an identifier is bound or free; moreover, when an identifier is bound, the visitor also receives its binder node as a parameter. For instance, in +{function(x) print(x) end}, the bound identifier visitor will be called on the +{x} in the print call, and the visitor's second parameter will be the `Function node which created the local variable x.
API
The library is loaded with require 'walk.id'. The walkers provided are:
  • walk_id.expr();
  • walk_id.stat();
  • walk_id.block();
  • walk_id.guess().
They take the same config tables as regular walkers, except that they also recognize the following entries:
  • cfg.id.free(identifier, parent, grandparent...), which is run on free variables;
  • cfg.id.bound(identifier, binder, parent, grandparent...), which is run on bound variables. The statement or expression which created this bound variable's scope is passed as a second parameter, before the parent nodes.
Examples
Let's rewrite the free variables walker above, with the id walker:

function fv (term)
   local cfg = { id = { } }
   local freevars = { }
   function cfg.id.free(id)
      local id_name = id[1]
      freevars[id_name] = true
   end
   walk_id.guess (cfg, term)
   return freevars
end
Now, let's alpha-rename all bound variables in a term. This is slightly trickier than one might think: we need to first walk the whole tree, then perform all the replacements. If we renamed binders as we went, they would stop binding their variables, and something messy would happen. For instance, if we took +{function(x) print(x) end} and renamed the binder x into foo, we'd then start walking the function body on the tree +{function(foo) print(x) end}, where x isn't bound anymore.

--------------------------------------------------------------------------------
-- bound_vars keeps a binder node -> old_name -> new_name dictionary.
-- It allows to associate all instances of an identifier together,
-- with the binder that created them
--------------------------------------------------------------------------------
local bound_vars    = { }

--------------------------------------------------------------------------------
-- local_renames will keep all identifier nodes to rename as keys,
-- and their new name as values. The renaming must happen after 
-- the whole tree has been visited, in order to avoid breaking captures.
--------------------------------------------------------------------------------
local local_renames = { }

--------------------------------------------------------------------------------
-- Associate a new name in bound_vars when a local variable is created.
--------------------------------------------------------------------------------
function cfg.binder (id, binder)
   local old_name         = id[1]
   local binder_table     = bound_vars[binder]
   if not binder_table then
      -- Create a new entry for this binder:
      binder_table        = { }
      bound_vars[binder]  = binder_table
   end
   local new_name         = mlp.gensym(old_name)[1] -- generate a new name
   binder_table[old_name] = new_name -- remember name for next instances
   local_renames[id]      = new_name -- add to the rename todo-list
end

--------------------------------------------------------------------------------
-- Add a bound variable to the rename todo-list
--------------------------------------------------------------------------------
function cfg.id.bound (id, binder)
   local old_name    = id[1]
   local new_name    = bound_vars[binder][old_name]
   local_renames[id] = new_name
end

-- walk the tree and fill local_renames:
walk_id.guess(cfg, ast)

-- perform the renaming actions:
for id, new_name in pairs(local_renames) do id[1] = new_name end
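The fragments above can be packaged into a single self-contained function; this is merely a repackaging sketch, and the name alpha_rename is ours:

function alpha_rename (ast)
   local cfg           = { id = { } }
   local bound_vars    = { }
   local local_renames = { }
   function cfg.binder (id, binder)
      local binder_table  = bound_vars[binder] or { }
      bound_vars[binder]  = binder_table
      local new_name      = mlp.gensym(id[1])[1]
      binder_table[id[1]] = new_name
      local_renames[id]   = new_name
   end
   function cfg.id.bound (id, binder)
      local_renames[id] = bound_vars[binder][id[1]]
   end
   walk_id.guess (cfg, ast)
   for id, new_name in pairs (local_renames) do id[1] = new_name end
   return ast
end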

2.4.5  Library walk.scope, the scope helper

This library allows you to easily store data, in an AST-walking function, in a scope-aware way. Cf. the comments in the sources for details.

2.5  dollar: generic syntax for macros

When you write a short-lived macro which takes reasonably short arguments, you generally don't want to write a supporting syntax for it. The dollar library allows you to call it in a generic way: if you store your macro in the table dollar, say as function dollar.foobar, then you can call it in your code as $foobar(arg1, arg2...): it will receive as parameters the ASTs of the pseudo-call's arguments.
Example
 
-{ block:
   require 'dollar'
   function dollar.LOG(id)
      match id with `Id{ id_name } -> 
         return +{ printf("%s = %s", id_name, 
                          table.tostring(-{id})) }
      end
   end }

local x = { 1, 2, 3 }
$LOG(x) -- prints "x = { 1, 2, 3 }" when executed



1
Hygienic macros are macros which take care not to use names that might interfere with user-provided names. The typical non-hygienic macro in C is #define SWAP(a, b) { int c=a; a=b; b=c; }: this macro will miserably fail if you ever call it with a parameter named c. There are well-known techniques to automatically make a macro hygienic. Without them, you'd have to generate a unique name for the temporary variable by hand, since there is no gensym() operator in C's preprocessor.

Previous Up Next