WORK-IN-PROGRESS: - this material is still under development
The parser creates and returns a syntax tree representation of the source text that is manipulated by later tree walking code.
Any parser using Syntax Directed Translation builds up a syntax tree while it's doing the parsing. It builds the tree up on the stack, pruning branches when it's done with them. With Tree Construction we create parser actions that build up a syntax tree in memory during the parse. Once the parse is complete, we have a syntax tree for the DSL script. We can then carry out further manipulations based on that syntax tree. If we are using a Semantic Model, we run a separate routine that walks our syntax tree and populates the Semantic Model.
The syntax tree that we create in memory doesn't need to correspond directly to the actual parse tree that the parser creates as it goes. Instead we can build up an abstract syntax tree that does simplifications of the parse tree, and may even ignore large parts of the parse if we are only looking to sample a subset of the information in the DSL script.
In some cases we may be able to treat the AST as our Semantic Model. This is a good approach if we can easily build the Semantic Model we like using the tree building facilities of the parser. I would encourage you to be wary of doing this, however. If the transformation has any wrinkles you are usually better off having two simple transformations rather than over-complicating the building of the AST. As a result my default approach is to have a separate Semantic Model and only use the AST as an Semantic Model if they really are the same.
You can do Tree Construction with any parser by writing actions that populate the AST you'd like. However many parser generators now provide specific facilities that help you build ASTs more easily. Some parser generators (eg SableCC) build a syntax tree automatically. Antlr provides special syntax that you can use either to build a parse tree automatically or to specify simple transformations to create an AST.
Once you've created the AST you then have to walk it to do any further processing. If you build the AST yourself, you can create any kind of object you like within the tree. These object can contain the further processing behavior, or be simple tree node objects that only act as data sources for separate processing.
If you use a the facilities of a parser generator, then that parser generator will provide you with the data structure. Exactly what you get will depend on the parser generator. SableCC provides a strongly types set of nodes, meaning a different node class for each kind of node in the tree with specific fields for that node's attributes. Antlr provides a generic node type with fields for parents and children. The basic node class has very limited behavior but Antlr allows you to substitute your own subclass of the tree node which allows you to add more useful behavior to your nodes.
Both self and Embedded Translation are useful approaches to populating a Semantic Model while parsing. Embedded Translation does the transformation in a single step, while Tree Construction uses two steps with the AST as an intermediate model. The argument for using Tree Construction is that it breaks down a single transformation into two, simpler transformations. Whether this is worth the effort of dealing with an intermediate model depends very much on the complexity of the transformation. The complex the transformation, the more useful it is to use an intermediate model.
A particular driver of complexity is the need to make several passes through the DSL script. Things like forward references can be a bit more awkward to use if you are doing all the processing in a single step. With Tree Construction it's easy to walk the tree many times as part of later processing.
Another factor that encourages you to use Tree Construction is if your parser generator provides useful tools to allow you to build an AST really easily. Some parser generators give you no choice, you have to use Tree Construction. Most give you the option to use Embedded Translation, but if the parser generator makes it really easy to build an AST, then that makes Tree Construction a more attractive option.
Tree Construction is likely to consume more memory than alternative approaches, as it needs to store the AST. In most cases, however, this won't make any appreciable difference. (Although that certainly used to be a big factor in earlier days.)
You process the same AST in several different ways to populate different Semantic Models if you need them and reuse the parser. This can be handy, but if the parsers' tree construction is easy then it may be easier to use different ASTs for different purposes.
I'll use the state machine DSL from the opening chapter, specifically with Miss Grant's controller. Here's the specific text that I'll be using for the example.
events
doorClosed D1CL
drawOpened D2OP
lightOn L1ON
doorOpened D1OP
panelClosed PNCL
end
resetEvents
doorOpened
end
commands
unlockPanel PNUL
lockPanel PNLK
lockDoor D1LK
unlockDoor D1UL
end
state idle
actions {unlockDoor lockPanel}
doorClosed => active
end
state active
drawOpened => waitingForLight
lightOn => waitingForDraw
end
state waitingForLight
lightOn => unlockedPanel
end
state waitingForDraw
drawOpened => unlockedPanel
end
state unlockedPanel
actions {unlockPanel lockDoor}
panelClosed => idle
end
Tokenizing for this is very simple. We have a few keywords ("event", "end" etc) and a bunch of identifiers. Antlr allows us to put the keywords as literal text in the grammar rules, which generally reads more easily. So we only need lexer rules for identifiers.
fragment LETTER : ('a'..'z' | 'A'..'Z' | '_');
fragment DIGIT : ('0'..'9');
ID : LETTER (LETTER | DIGIT)* ;
Strictly the lexing rules for names and codes are different -
names can be any length but codes must be four upper case
letters. So we could define different lexer rules for
them. However in this case it gets tricky. The string 'ABC1' is a
valid code, but is also a valid name. If we see ABC1 in the DSL
program we can tell which it is by its context: state
ABC1 is different to event unlockDoor
ABC1. The parser will also be able to use this context to
tell the difference, but the lexer can't. So the best option here
is to use the same token for both of them and let the parser sort
it out. This does mean that the parser won't generate an error
for five letter codes, we have to sort that out in our own
semantic processing.
We also need lexer rules to strip out the white-space
WHITE_SPACE : (' ' |'\t' | '\r' | '\n')+ {skip();} ;
COMMENT : '#' ~'\n'* '\n' {skip();};
In this case white space includes newlines. I have laid the DSL out in a fashion that suggests there are meaningful line endings to end statements in the DSL, but as you can now see this isn't true. All white space, including the line endings, are removed. This allows me to layout the DSL code in any fashion I like. This is a notable contrast to Delimiter Directed Translation. Indeed it's worth remarking that there are no statement separators at all, unlike most general purpose languages that need something like a newline or ';' to end statements. Often DSLs can get away with no statement separators because the statements are very limited. Things like infix expressions will prevent you from avoiding statement separators, but for many DSLs you can do without them. As with most things, don't put them in until you actually need them.
For this example I'm skipping the white space which means that it's lost completely to the parser. This is reasonable as the parser doesn't need it - all it needs are the meaningful tokens. However there is a case when white space ends up being handy again - when things go wrong. To give good error reports you need line numbers and column numbers, to provide these we need to keep the white space. Antlr allows you to do this by sending white space tokens on a different channel with syntax like WS : ('\r' | '\n' | ' ' | '\t')+ {$channel=HIDDEN}. This sends the white space through on a separate channel so it can be used for error handling but doesn't affect the parsing rules.
The lexing rules work the same in Antlr whether you are using Tree Construction or not, it's the parser that operates differently. In order to use Tree Construction we need to tell Antlr to produce an AST.
options {
output=AST;
ASTLabelType = MfTree;
}
As well as telling Antlr to produce an AST, I've also told it
to populate the AST with nodes of a particular type:
MfTree. This is a subclass of the generic Antlr
CommonTree class that allows me to add some
additional behavior that I prefer to have on my tree nodes. The
naming here is a little confusing. The class represents a node
and its children, so you can either think of it as a node or as a
(sub)tree. Antlr's naming calls it a tree, so I've followed that
in my code although I think of it as a node in the tree.
Now we'll move onto the grammar rules. I'll begin with the top level rule that defines the structure of the whole DSL file.
machine : eventList resetEventList? commandList? state*;
This rule lists the main clauses in sequence. If I don't give Antlr any tree construction rules then it will just return a node for each term on the right-hand side in sequence. Usually this isn't what we want, but it does work right here.
Taking the terms in order, the first one gets into events.
eventList : 'events' event* 'end' -> ^(EVENT_LIST event*); event : n=ID c=ID -> ^(EVENT $n $c);
There's a few new things to talk about with these two rules. The first thing is that these rules introduce Antlr's syntax for tree construction, which is the code following "->" in each rule.
The eventList rule uses two string constants -
this is us putting the keyword tokens directly into the parser
rules without making separate lexer rules for them.
The tree construction rules allow us to say what goes in the
AST. In both cases here we use "^(list...)" to create and return a new node in the AST. The first
item in the parenthesized list is the token type of the node. In
this case we've made a new token type. Each item following the
token type are the other nodes in the tree. For the event list, we
just put all the events as siblings in the list. For the event we
name tokens in the BNF and reference them in the tree construction
to show how they are placed.
The EVENT_LIST and EVENT tokens are
special tokens that I've created as part of parsing - they
weren't tokens produced by the lexer. When I create tokens like
this I need to declare them in the grammar file
tokens { EVENT_LIST; EVENT; COMMAND_LIST; COMMAND; STATE; TRANSITION_LIST; TRANSITION; ACTION_LIST; RESET_EVENT_LIST;}
The commands are treated in the same way, and the reset events are just a simple list.
commandList : 'commands' command* 'end' -> ^(COMMAND_LIST command*); command : ID ID -> ^(COMMAND ID+); resetEventList : 'resetEvents' ID* 'end' -> ^(RESET_EVENT_LIST ID*);
The states are a bit more involved, but really are only the same basic approach.
state
: 'state' ID actionList? transition* 'end'
-> ^(STATE ID ^(ACTION_LIST actionList?) ^(TRANSITION_LIST transition*) )
;
transition : ID '=>' ID -> ^(TRANSITION ID+);
actionList : 'actions' '{' ID* '}' -> ID*;
Each time what I'm doing is collecting together appropriate clumps of the DSL and putting them under a node that describes what that clump represents. The result is an AST that is very similar to the parse tree, but not exactly the same. My aim is to keep my tree construction rules very simple and my syntax tree easy to walk.
Once the parser has made a tree, the next step is to walk this tree and populate the Semantic Model. The Semantic Model is the same state machine model that I used in the introduction. The interface for building it is pretty simple, so I won't go into it more here.
I create a loader class to populate the Semantic Model.
class StateMachineLoader...
private Reader input;
private MfTree ast;
private StateMachine machine;
public StateMachineLoader(Reader input) {
this.input = input;
}
I use the loader as a command class, here is its run method, that indicates the sequence of steps I use to carry out the translation.
class StateMachineLoader...
public void run() {
loadAST();
loadSymbols();
createMachine();
}
Explaining that a touch in English. First I use the Antlr generated parser to parse the input stream and create an AST. Then I navigate through the AST to build Symbol Tables. Finally I assemble the objects into a state machine.
The first step is just the incantations to get Antlr to build the AST.
class StateMachineLoader...
private void loadAST() {
try {
stateMachineLexer lexer = new stateMachineLexer(new ANTLRReaderStream(input));
stateMachineParser parser = new stateMachineParser(new CommonTokenStream(lexer));
parser.helper = this;
parser.setTreeAdaptor(new MyNodeAdaptor());
ast = (MfTree) parser.machine().getTree();
} catch (IOException e) {
throw new RuntimeException(e);
} catch (RecognitionException e) {
throw new RuntimeException(e);
}
}
class MyNodeAdaptor extends CommonTreeAdaptor {
public Object create(Token token) {
return new MfTree(token);
}
}
MyNodeApaptor is a second step in telling Antlr to create
the AST with MfTree rather than
CommonTree
The next step is to build the symbol table. This involves navigating the AST to find all the events, commands, and states and load them into maps so we can look them up easily to make the links when we create the state machine.
class StateMachineLoader...
private void loadSymbols() {
loadEvents();
loadCommands();
loadStateNames();
}
Here's the code for the events.
class StateMachineLoader...
private Map<String, Event> events = new HashMap<String, Event>();
private void loadEvents() {
MfTree eventList = ast.getSoleChild(EVENT_LIST);
for (MfTree eventNode : eventList.getChildren()) {
String name = eventNode.getText(0);
String code = eventNode.getText(1);
events.put(name, new Event(name, code));
}
}
class MfTree...
List<MfTree> getChildren() {
List<MfTree> result = new ArrayList<MfTree>();
for (int i = 0; i < getChildCount(); i++)
result.add((MfTree) getChild(i));
return result;
}
MfTree getSoleChild(int nodeType) {
List<MfTree> matchingChildren = getChildren(nodeType);
assert 1 == matchingChildren.size();
return matchingChildren.get(0);
}
List<MfTree> getChildren(int nodeType) {
List<MfTree> result = new ArrayList<MfTree>();
for (int i = 0; i < getChildCount(); i++)
if (getChild(i).getType() == nodeType)
result.add((MfTree) getChild(i));
return result;
}
String getText(int i) {
return getChild(i).getText();
}
The node types are defined in the generated code in the parser. When I use them in the loader I can use a static import to make them easy to refer to.
The commands are loaded in a similar way (I'm sure you can guess what the code for that looks like.) I load the states in a similar way, but at this point I only have the name in the state objects.
class StateMachineLoader...
private void loadStateNames() {
for (MfTree node : ast.getChildren(STATE))
states.put(stateName(node), new State(stateName(node)));
}
I have to do something like this because the states are used in forward references. In the DSL I can mention a state in a transition before I declare the state. This is a case where tree construction works very nicely - there's no problem in taking as many passes through the AST as I need to wire things up.
The final step is to actually create the state machine.
class StateMachineLoader...
private void createMachine() {
machine = new StateMachine(getStartState());
for (MfTree node : ast.getChildren(stateMachineParser.STATE)) loadState(node);
loadResetEvents();
}
The start state is the first declared state.
class StateMachineLoader...
private State getStartState() {
return states.get(getStartStateName());
}
private String getStartStateName() {
return stateName((MfTree) ast.getFirstChildWithType(STATE));
}
Now we can wire up the transitions and actions for all the states.
class StateMachineLoader...
private void loadState(MfTree stateNode) {
for (MfTree t : stateNode.getSoleChild(TRANSITION_LIST).getChildren()) {
getState(stateNode).addTransition(events.get(t.getText(0)), states.get(t.getText(1)));
}
for (MfTree t : stateNode.getSoleChild(ACTION_LIST).getChildren())
getState(stateNode).addAction(commands.get(t.getText()));
}
private State getState(MfTree stateNode) {
return states.get(stateName(stateNode));
}
And finally we add the reset events, which the state machine API expects us to do at the end.
class StateMachineLoader...
private void loadResetEvents() {
if (!ast.hasChild(RESET_EVENT_LIST)) return;
MfTree resetEvents = ast.getSoleChild(RESET_EVENT_LIST);
for (MfTree e : resetEvents.getChildren())
machine.addResetEvents(events.get(e.getText()));
}
class MfTree...
boolean hasChild(int nodeType) {
List<MfTree> matchingChildren = getChildren(nodeType);
return matchingChildren.size() != 0;
}