| DSL-WIP Home |

WORK-IN-PROGRESS: - this material is still under development

Syntax Directed Translation

Translate source text by defining a grammar, and using that grammar to structure translation.

Computer languages naturally tend to follow a hierarchic structure with multiple levels of context. We can define the legal syntax of such a language by writing a grammar that describes how elements of a language get broken down into sub-elements.

Syntax Directed Translation uses this grammar to define the creation of a parser that can turn input text into a parse tree that mimics the definition in the grammar.

How it Works

If you've read any book on programming languages, you'll have come across the notion of a grammar. A grammar is a way of defining the legal syntax of a programming language. Consider the part of my opening state machine example that declares events and commands.

events
  doorClosed  D1CL
  drawOpened  D2OP
#  ...
end

commands
  unlockPanel PNUL
  lockPanel   PNLK
#  ...
end

These declarations have a syntactic form that can be defined using the following grammar.

declarations : eventList commandList;
eventBlock    : Event-keyword eventDec* End-keyword;
eventDec     : Identifier Identifier;
commandBlock  : Command-keyword commandDec* End-keyword;
commandDec   : Identifier Identifier; 

A grammar like this provides a human-readable definition of a language. This makes it easier for people to understand what is legal syntax in the language. With Syntax Directed Translation we can take the grammar further and use it as the basis for how we design a program to process this language.

This processing can be derived from the grammar in a couple of ways. One approach is to use the grammar as a specification and implementation guide for a hand-written parser. If you build a Recursive Descent Parser, the design of the parser flows naturally from the grammar definition.

Since the link is so strong between the grammar and the parser implementation, the next step is to treat the grammar as a DSL and generate a parser directly from the grammar description. This is the approach that parser-generator tools take. In this case, you don't write any of the core parser code yourself, all of that is generated from the grammar.

Useful though the grammar is, it only handles part of the problem. It can tell you how to turn the input text into a parse-tree data structure. Almost always you'll need to do more with the input than that - so parser generators also provide ways to embed further behavior in the parser, so you can do something like populate a Semantic Model. So although the parser-generator does a lot of work for you, you still have to do a fair bit of programming to create something truly useful. In this way, as in many others, parser generators are an excellent example of the practical use of DSLs - one with a long history.

For the rest of this discussion, I'm going to assume you're using a parser-generator. There are times when writing your own parser by hand is useful, but I'll leave the hows and whys of that to Recursive Descent Parser. There are many parser generators out there, with many differences on how they work. In this discussion I'll try to concentrate on core principles, but it's important to be aware that there are a lot of different approaches to building a parser generator.

In particular I'm going to follow the traditional approach of parser generators, where the DSL is an external DSL and the parser generator generates code to create the parser. A particularly interesting alternative to this approach is the use of Parser Combinators - which is essentially expressing the grammar as an internal DSL. Parser Combinators are popular in the functional programming community and begining to make their way into other communities. [TBD: Consider adding more on parser combinators.]

The Lexer

Almost always when you use parser generators you'll see a separation between a lexer and the parser. A lexer, also called a tokenizer or scanner, is the first stage in processing the input text. The lexer splits the characters of the input into tokens, which represent more reasonable chunks of the input.

The tokens generally are defined using regular expressions; here's what such lexing rules might look like for the commands and events example above.

    event-keyword: 'events';
    command-keyword: 'commands';
    end-keyword: 'end';
    identifier: [a-z A-Z 0-9]*;
  

Here's a very small bit of input.

  events
    doorClosed  D1CL
    drawOpened  D2OP
  end

The lexer rules would turn this input into a series of tokens

[Event-keyword: "events"] [Identifier: "doorClosed"] [Identifier: "D1CL"] [Identifier: "drawOpened"] [Identifier:"D2OP"] [End-keyword: "end"]

Each token is an object with essentially two properties: type and payload. The type is the kind of token we have, eg event-keyword, or identifier. The payload is the text that was matched as part of the lexer: "events" or "doorClosed". For keywords the payload is pretty much irrelevant, all that matters is the type. For identifiers the payload does matter, as that's data that will be important later on in the parse.

Lexing is separated out for a few reasons. One is that is makes the parser simpler, because the parser can now be written in terms of tokens rather than raw characters. Another is efficiency: the implementation needed to chunk up characters into tokens is different to that used by the parser. (In automata theory the lexer is a state machine while the parser is a push-down stack machine.) As a result, most parser generators make this split.

The lexer rules are tested in order with the first match succeeding. Thus you can't use the string "events" as an identifier because the lexer will always recognize it as a keyword. This is generally considered a Good Thing to reduce confusion (avoiding such things as PL/1's notorious if if = then then then = if;).

If you've been suitably careful in comparing the tokens to the input text, you'll notice something is missing from the token list. Nothing is missing - the nothing here being white-space: spaces, tabs, and newlines. For many languages the lexer will strip out whitespace so that the parser doesn't have to deal it. This is a big difference to Delimiter Directed Translation where the white-space usually plays a key structuring role.

If the white-space is syntactically significant - such as new-lines as statement separators or indentation signifying block structure - then the lexer can't just ignore it. Instead it must generate some form of token to indicate what's happening - such as a newline token. Often, however, languages that are intended to be processed with Syntax Directed Translation try to make whitespace be ignorable. Indeed many DSLs can do without any form of statement separator - the state machine language can safely discard all white-space in the lexer.

Another thing that the lexer can discard is comments. It's always useful to have comments in even the smallest DSL, the lexer can easily get rid of these.

I've said that tokens have properties for type and payload. In practice they may carry more. Often this information is useful for error diagnostics such as line number and character position.

When deciding on tokens, there is often a tension to fine tune the matching process. In the state controller example, I've said that the event codes are a four character sequence of capital letters and numbers. So I might consider using a specific type of token for this, something like.

    code: [A-Z 0-9]{4}
  

The problem with this is that the tokenizer would result in the wrong tokens for this case

    events
      FAIL FZ17
    end
  

In that input FAIL would be tokenized as a code rather than an identifier because the lexer only looks at the characters not the overall context of the expression. This kind of distinction is best left to the parser to deal with as it has the information to tell the difference between the name and the code. This does mean that checking that the code matches the four character rule needs to be done later on in the parse. In general it's best to keep lexing as simple as possible.

Most of the time, I like to think of the lexer as dealing with three different kinds of tokens:

Most parser generator tools provide generators for the lexer, using the kinds of regular expression rules I've shown above. However many people prefer to write their own lexers. They are fairly straightforward to write, particularly with languages that have good support for regular expressions. With hand-written lexers you have more flexibility for more complex interactions between the parser and the lexer, which often can be useful.

One particular parser-lexer interaction that can be useful is that of supporting multiple modes in the lexer and allowing the parser to indicate a switch between modes. This allows the parser to alter how tokenizing occurs within certain points of the language. [TBD: consider example here]

Syntactic Analyzer

Once you have a stream of tokens, the next part of Syntax Directed Translation is the parser itself. The parser's behavior can be divided into two main sections: which I'll call syntactic analysis and actions. Syntactic analysis takes the stream of tokens and arranges them into a parse tree. This work can be derived entirely from the grammar itself, and in a parser generator system this code will be automatically generated by the tool. The actions take that syntax tree and does more with it - such as populating a Semantic Model.

The actions cannot be generated from a grammar, and are usually executed while the parse tree is being built up. Usually a parser generator grammar file combines the grammar definition with additional code to specify the actions. Often these actions are in a general purpose programming language, although some actions can be expressed in additional DSLs.

For the moment, I'll ignore the actions and just look at the syntactic analysis. If we build a parser using only the grammar and thus only doing syntactic analysis the result of the parse will be either a successful run or a failure. This tells us if the input text matches the grammar or not.

So with the text I've used so far, I have this grammar.

declarations : eventList commandList;
eventBlock    : Event-keyword eventDec* End-keyword;
eventDec     : Identifier Identifier;
commandBlock  : Command-keyword commandDec* End-keyword;
commandDec   : Identifier Identifier; 

And this input.

  events
    doorClosed  D1CL
    drawOpened  D2OP
  end

I've shown above that the tokenizer splits the input into this token stream.

[Event-keyword: "events"] [Identifier: "doorClosed"] [Identifier: "D1CL"] [Identifier: "drawOpened"] [Identifier:"D2OP"] [End-keyword: "end"]

Syntactic analysis then takes these tokens, and the grammar and arranges them into the following tree structure.

Figure 1

Figure 1: Parse tree for the event input.

As you see the syntactic analysis introduces extra nodes (which I've shown as rectangles) in order to form the parse tree. These nodes are defined by the grammar.

It's important to realize that any given language can be matched by many grammars. So for our case here we could also use the following grammar.

eventBlock    : Event-keyword eventList End-keyword;
eventList     : eventDec*
eventDec     : Identifier Identifier;

This will match all the inputs that the previous formation will match, however it will produce this different parse tree.

Figure 2

Figure 2: Alternative parse tree for the event input.

So in Syntax Directed Translation a grammar defines how an input text is turned into a parse tree and we can often choose different grammars depending on how we want to control the parse. Different grammars also appear due to differences in parser generator tools.

So far I've talked about the parse tree as if it's something that is explicitly produced by the parser as an output of the parse. However this is usually not the case. In most cases you don't ever access the parse tree directly. The parser will build up pieces of the parse tree and executes actions in the middle of the parsing. Once it's done with bits of the parse tree it'll discard them (historically important to reduce memory consumption). If you are doing Tree Construction, then you will produce a full syntax tree. However in this case you usually don't produce the full parse tree, but a simplification of it called an abstract syntax tree.

You may run into a particular terminological confusion around this point. Academic books in this area often use "parse" as a synonym for syntactic analysis only, calling the overall process something like translation, interpretation or compilation. I tend to use "parse" much more broadly in this book, reflecting what I see as common usage in the field. Parser generator tools tend to refer to the parser as the activity that consumes tokens - so they talk about the lexer and parser as separate tools. Since that's overwhelmingly common I do that in this section too, although you could argue that to be completely consistent with other sections of this book parsing should include lexing too. As ever the terminology in software is much less precise than we would like.

Embedding Actions

Syntactic analysis produces a parse tree, to do something with that tree we need to embed further code. We place the code in the grammar using Foreign Code. Where we place it in the grammar indicates when the code is executed. Embedded code is placed in rule expressions to be executed as a consequence of the recognition of that rule.

Lets take an example of registering events when we see event declarations.

eventBlock    : Event-keyword eventDec* End-keyword;
eventDec     : Identifier Identifier {registerEvent($1, $2);}
             ;

This would tell us to invoke the registerEvent method just after the parser has recognized the second identifier in an eventDec. In order to pass data from the parse tree into registerEvent we need some way to refer to the tokens mentioned in the rule. In this case I'm using $1, $2 to indicate the identifiers by position - which is the style of the yacc parser generator.

The actions are usually woven into the generated parser while the parser is being generated. As a result the embedded code is usually in the same language as the language generated for the parser.

Different parser generator tools have different facilities for how to embed code and how to link the actions to the rules. While I don't want to go through all the different features that the many tools have, I think it's worth highlighting a couple. I've already talked about linking the embedded code to identifiers. Since the nature of parsers is to create a parse tree, it's often useful to move data around this tree. A common and useful facility is thus to allow a sub-rule to return data to its parent. To illustrate consider the following grammar using the Antlr parser generator.

eventBlock 
  :  K_EVENT (e = eventDec {registerEvent($e.result);})* K_END
  ;    
eventDec returns [Event result]
  : name = ID code = ID {$result = createEvent($name, $code);}
  ;

Here the eventDec rule is set to return a value which the higher level rule can access and use. (With Antlr the actions refer to grammar elements by name, which is usually better than by position.) The ability to return values from rules can make it much easier to write parsers - in particular it can remove a lot of Context Variables. Some parser generators, including Antlr, also have the ability to push data down as arguments to sub-rules - which allows a lot of flexibility in providing context to sub-rules.

This fragment also illustrates that the placement of the actions in the grammar drives when it's called. Here the action in eventBlock is in the middle of the right hand side, indicating that it should be called after each eventDec sub-rule is recognized. Placing actions like this is a common feature in parser generators.

The actions define what we do with the DSL and how you write them therefore depends on the overall DSL parsing approach: Tree Construction, Embedded Interpretation, or Embedded Translation. As Syntax Directed Translation isn't really too interesting without one of these you won't find any examples in this pattern and should take a look at those for examples.

Top Down and Bottom Up Parsing

[TBD: Consider separating these into their own topics. This would include longer and slower examples of how they work.]

There are many kinds of parser generators out there, and many differences between them, most of which I won't attempt to detail. One kind of difference that I do think is worth discussing here is that between top down and bottom up parsing. This affects not just they way parsers work, but also the kinds of grammars they can work with.

A top down parser begins with the highest level rule in the grammar, and then uses that to decide what to try and match. So in an event list grammar like this

eventBlock    : Event-keyword eventDec* End-keyword;
eventDec     : Identifier Identifier;

With input like this

  events
    doorClosed  D1CL
    drawOpened  D2OP
  end

The parser would work by first knowing it wanted to match an eventBlock and therefore look for an event keyword. Once it saw the event keyword it then knows it wants to match an eventDec and looks inside that rule to know it needs to match an identifier. In short a top down parser looks uses the rules as goals to direct it what to look for.

It'll not shock you to know that a bottom up parser does the opposite. It starts by reading the event keyword, and looks to see if the input so far is enough to match a rule. As it isn't (yet) it puts it to one side (called shifting) and takes the next token (and identifier). This is still not enough to match anything, so it shifts again. But with the second identifier it can now match the eventDec rule so it can now reduce the two identifiers to an eventDec. It similarly can recognize the second line of input, then when it reaches the end keyword it can reduce the whole expression to an event block.

You often hear top-down parsers referred to as LL parsers and bottom up parsers as LR parsers. The first letter refers to which direction the input is scanned and the second to how the rules are recognized (L is left-to-right, ie top-down, R is right to left hence bottom-up). You also hear bottom-up parsing referred to as shift-reduce parsing as the shift-reduce approach is usually what you run into with bottom-up parsing. There are a number of variants of LR parsers, such as LALR, GLR, and SLR. I'm not going to go into the details of these variations here.

Bottom-up parsers are usually considered to harder to write and understand than top-down ones. This is because most people find it harder to visualize the order in which the rules are processed with a bottom-up approach. While you don't have to worry about writing a parser when you use a parser-generator, you often have to understand how it works in order to debug problems. Probably the most well known family of parser-generators is the yacc family, which is a bottom-up (LALR) parser.

The big disadvantage for top down parsers is that they cannot deal with left recursion, which is a rule of the form

    expr: expr '+' expr;
  

Something like this forces the parser to get in an endless recurse trying to match expr. People disagree about how big a problem this limitation is in practice. There is a simple, mechanical technique called left-factoring that you can use to get rid of left recursion. However the result is a grammar that isn't as easy to follow. This is mostly an issue for infix operators that allow the same sub-rules: such as addition above. So as a result left-recursion is an issue with arithmetic and boolean expressions, but it's arguable that you rarely run into it outside those cases.

In general different parser generators have various restrictions on the kind of grammars they can handle, these restrictions are driven by the parsing algorithms they use. There are also plenty of other differences such as how you write actions, how you can move data up or down the parse tree, and what the grammar syntax is like (BNF vs EBNF). All of these effect how you write your grammar. Perhaps the most important point is to realize that you shouldn't treat the grammar as a fixed definition of the DSL. Often you'll need to alter the grammar to make the actions work better, so like any other code the grammar will change depending on what you want to do with it.

When to use it

Syntax Directed Translation is an alternative approach to Delimiter Directed Translation. The principle disadvantage of Syntax Directed Translation is the need to get used to driving parsing via a grammar, while chopping up via delimiters is usually a more familiar approach. It doesn't take long, however, to get used to grammars and once you have they provide a technique which is much easier to use as the DSLs get more complex.

In particular the grammar file, a DSL itself, provides a clear documentation of the syntactic structure of the DSL it's processing. This makes it easier to evolve the DSL's syntax over time.

Significant Revisions

09 Sep 08: First draft