WORK-IN-PROGRESS: - this material is still under development
Embed translation code into the grammar.
In Syntax Directed Translation a pure parser merely creates an internal parse tree, you need to do more to populate a Semantic Model.
Embedded Translation populates the Semantic Model by embedding code into the parser that populates the Semantic Model at appropriate points in the parse.
Syntactic analyzers are all about recognizing syntactic structures. Using Embedded Translation we place code to populate a Semantic Model into the parser so that we gradually populate the Semantic Model as we carry out the parse. Most times this implies that model population code is placed once an clause of the input language is recognized, although in practice you may place hunks of population code at various points.
When using Embedded Translation with parser generators, you usually see Host Code Embedment to incorporate the population code. Most parser generators provide a facility to use Host Code Embedment, the only one I've used that doesn't is intended to work with Tree Construction.
The biggest appeal of Embedded Translation is that it provides a simple way to handle both syntactic analysis and model population in one pass. With Tree Construction you both provide code to build up the AST and write a populator that walks the tree. Particularly for simple cases, which many DSLs are, this two stage process can be more trouble than it's worth.
The facilities of your parser generator have an impact on your choice. The better the tree building features of your parser generator, the more appealing Tree Construction becomes.
One of the biggest problems with Embedded Translation is that it can encourage complex grammar files, usually due to a poor use of Host Code Embedment. If you are disciplined in using Host Code Embedment well, this is less likely to be a problem - but a strength of Tree Construction is that it helps to enforce the discipline.
Embedded Translation fits in with a single pass parse, as all the work is done during syntactic analysis. This means that things like forward references, which are tricky on a single pass, are also tricky with Embedded Translation. To handle them you often need Context Variable, which can further complicate parsing.
The upshot of all this is that the simpler the language and parser, the more Embedded Translation is appealing.
With this example I'll take the same example I used for Tree Construction, with the same tools (Java and Antlr) but this time handle the parse with Embedded Translation. The first thing to say is that this only changes the syntactic analysis - to the tokenizing is the same. I won't repeat the tokenizing discussion here, I'm sure you can pop over to that section if you need a refresher.
Another similarity between the two examples is the core BNF grammar. Most of the time the BNF rules don't vary if you use different parsing patterns, what changes is the supporting code around the BNF. While Tree Construction uses Antlr's facilities for declaring ASTs, Embedded Translation uses Host Code Embedment to put in java snippets to populate the Semantic Model directly.
Embedded Translation involves putting arbitrary general purpose code into a grammar file. As with most cases when embedding one language into another I like to use an Embedment Helper. I've read many grammar files and I find this pattern helps keep the grammar clear so that it isn't buried in the translation code. I do this by declaring a helper in my grammar.
@members {
StateMachineLoader helper;
//...
The top level of the grammar defines the state machine file.
machine : eventList resetEventList commandList state*;
It shows the same sequence of declarations.
To see some real translation going on, let's look at the handling for the event list.
eventList : 'events' event* 'end';
event : n=ID c=ID {helper.addEvent($n, $c);};
Here we see the typical nature of using embedded translation. Much of the grammar file remains plain, but at appropriate points we introduce general purpose code to perform the translation. Since I'm using an Embedment Helper all I do is call a single method on that helper.
class StateMachineLoader...
void addEvent(Token name, Token code) {
events.put(name.getText(), new Event(name.getText(), code.getText()));
}
private Map<String, Event> events = new HashMap<String, Event>(); private Map<String, Command> commands = new HashMap<String, Command>(); private Map<String, State> states = new HashMap<String, State>(); private List<Event> resetEvents = new ArrayList<Event>();
The call creates a new event object and places it in the symbol table - which is a collection of dictionaries on the loader. The call to the helper passes in the tokens for the name and code. Antlr uses assignment syntax to mark elements of the grammar so the embedded code can refer to them. The placement of the embedded code indicates when it runs - in this case the embedded code is run once both child nodes have been recognized.
The commands are done in exactly the same way. The states, however introduce a couple of interesting issues: hierarchic context and forward references.
I'll start with talking about hierarchic context. The issue here is that the various elements of a state - actions and transitions - occur within the state definition so when we want to process an action we need to know which state the action is declared in.
Earlier on I drew an analogy to embedded translation being rather like processing XML with SAX. This is somewhat true, in that the embedded code just works with a rule at a time. But it's also misleading in that parser generators provide facilities to give you much more context during the execution of the code so you don't need to keep it around so much yourself.
In Antlr, you can pass parameters into rules in order to push down this kind of context.
state : 'state' n=ID {helper.addState($n);}
actionList[$n]?
transition[$n]*
'end';
actionList [Token state]
: 'actions' '{' actions+=ID* '}' {helper.addAction($state, $actions);}
;
Here the state token is passed into the rule for recognizing an action. This way the embedded translation code can pass in both the state token and the command tokens (the '*' indicates they are a list). This provides the right context for the helper.
class StateMachineLoader...
public void addAction(Token state, List actions) {
for (Token action : (Iterable<Token>) actions)
getState(state).addAction(getCommand(action));
}
private State getState(Token token) {
return states.get(token.getText());
}
The second issue is that the transition declarations involve forward references to states that haven't been declared yet. In many DSLs you can arrange things so that no item refers to an identifier that hasn't been declared yet, but the state model can't do this, resulting in forward references. Tree Construction allows us to process the AST in multiple passes, so we can make one pass to pick up the declarations and another pass to populate the states. With multiple passes, forward references aren't a problem since they are resolved on later passes through the AST. With Embedded Translation we don't have that option.
Our solution here is to use an obtain (the term I use for find-or-create) operation on both the references and declarations. Essentially this means that whenever we mention a state we implicitly declare it if it doesn't already exist.
stateMachine.g...
transition [Token state]
: trigger = ID '=>' target = ID {helper.addTransition($state, $trigger, $target);};
class StateMachineLoader...
public void addTransition(Token state, Token trigger, Token target) {
getState(state).addTransition(getEvent(trigger), obtainState(target));
}
private State obtainState(Token token) {
String name = token.getText();
if (!states.containsKey(name))
states.put(name, new State(name));
return states.get(name);
}
One of the consequences of this approach is that if we misspell a state in our transition, we will just get a blank state as the transition target. If we're happy with that, we can leave it. It's common, however, to check declarations against usage. In which case we need to keep track of states created by use and ensure that they are all declared too.
Our language defines the start state as the first state mentioned in the program. This kind of context isn't so well handled by the parser generator, so we need to resort to what is effectively a context variable.
class StateMachineLoader...
public void addState(Token n) {
obtainState(n);
if (null == machine)
machine = new StateMachine(getState(n));
}
Handling of reset events is pretty trivial, we just add them to a separate list.
stateMachine.g...
resetEventList : 'resetEvents' resetEvent* 'end' ;
resetEvent : n=ID {helper.addResetEvent($n);};
class StateMachineLoader...
public void addResetEvent(Token name) {
resetEvents.add(getEvent(name));
}
The single pass nature of the parser also complicates reset events, as they can be defined before we get the first state, and thus before we have a machine to put them on. So I keep them in a field to add them at the end.
class StateMachineLoader...
public void addResetEvent(Token name) {
resetEvents.add(getEvent(name));
}
The run method of the loader shows the overall sequence of tasks: lexing, running the generated parser, and finishing the model population with the reset events.
class StateMachineLoader...
public StateMachine run() {
try {
stateMachineLexer lexer = new stateMachineLexer(new ANTLRReaderStream(input));
stateMachineParser parser = new stateMachineParser(new CommonTokenStream(lexer));
parser.helper = this;
parser.machine();
machine.addResetEvents(resetEvents.toArray(new Event[0]));
return machine;
} catch (IOException e) {
throw new RuntimeException(e);
} catch (RecognitionException e) {
throw new RuntimeException(e);
}
}
private Reader input;
private StateMachine machine;
It's not unusual to have code like this following the syntactic analysis - this is also where any semantic analysis would happen. [TBD: consider showing this - checking the model for correct states.]