| EAA-dev Home |

WORK-IN-PROGRESS: - this material is still under development

Symbol Table

A location to store all identifiable objects during a parse to resolve references.

Many languages need to refer to objects at multiple points within the code. If we have a language that defines a configuration of tasks and their dependencies, we need a way for one task to refer to its dependent tasks in its definition.

In order to do this we come up with some form of symbol for the task and while processing the DSL script we store these symbol in a Symbol Table that stores the link between the symbol and an underlying object that holds the full information.

How it Works

The essential purpose of a Symbol Table is to map between a symbol used to refer to an object in a DSL script and an object that that symbol refers to. A mapping like this naturally fits with the notion of a map data structure, so it's no surprise to find that the most common implementation of a Symbol Table is a map with the symbol as a key and the framework object as a value.

One question to consider is the kind of object that should be used as the key in the Symbol Table. For many languages the most obvious choice is a string, since the text of the DSL is a string.

The main case to use something other than a string is with languages that support a symbol data type. Symbols are like strings in a structural sense, the are fundamentally a sequence of characters, but usually differ in behavior. Many string operations - concatenation, sub-strings etc - do not make sense for symbols. The vital task for symbols is to be used for lookup, so symbol types are usually designed with that in mind. So while "foo" and "foo" strings are often different objects and are compared for equality by looking at their contents, :foo and :foo symbols always resolve to the same object and can be compared for equality much faster.

Performance can be a good reason for preferring symbol data types to strings, but often it isn't going to make a difference for small DSLs. The big reason to prefer to use a symbol data-type is that it clearly communicates your intention in using it. By declaring something as a symbol, you state clearly what you are using it for and thus make your code easier to understand.

Languages that support symbols usually have a particular literal syntax for them. Ruby uses :aSymbol, Smalltalk #aSymbol, and Lisp treats any bare identifier as a symbol. This makes symbols stand out all the more in internal DSLs - a further reason to use them.

The values of a symbol table can either be final framework object or an intermediate builder. Using a framework object makes the Symbol Table act as result data, which is good for simple situations. Often putting a builder object as the value provides more flexibility, at the cost of a bit more work.

Many languages have different kinds of objects that need to be referred to. The introductory state model needs to identify states, commands, and events. Having multiple kinds of things to refer to means you have to choose between a single map, multiple maps, or a special class.

Using a single map for your Symbol Table means that all lookups for any symbol use a single map. An immediate consequence of this is that you can't use the same symbol name for different kinds of things - you can't have an event with the same name a state. This may be a useful constraint to reduce confusion in the DSL. However using a single map makes it harder to read the processing code as it's less clear what kind of thing is being manipulated as you refer to the symbol. I thus don't suggest this option.

With multiple maps, you have a separate map for each kind of object you are referring to. So a state model may have three maps for events, commands, and states. You may think of this as one logical Symbol Table or three Symbol Tables. Either way I prefer this option over a single map as it's now clear in the code which kind of object you are referring to in the processing steps.

Using a special class means having a single object for the Symbol Table with different methods to refer to the different kinds of object stored in there: getEvent(String code), getState(String code), registerEvent(String code, Event object) etc. This can sometimes be useful, and it does give a natural place to add any specific symbol processing behavior. Most of the time, however, I don't find a compelling need to do this.

In some cases objects are referred to before they are properly defined - these are called forward references. DSLs don't usually have strict rules about declaring identifiers before you use them, so forward references often make sense. If you allow forward references you need to ensure that a reference will populate the entry in the symbol table if it isn't already there. This will often push you to use builders as the value in the symbol table, unless the framework objects are very flexible.

Statically Typed Field Implementation

If you are doing an internal DSL in a statically typed language, such as C# or Java, you can easily use a hashmap as your Symbol Table with strings as keys. A line of such a DSL might look something like this

task("drinkCoffee").dependsOn("make_coffee", "wash");

Using strings like this certainly will work, but it comes with some disadvantages

An alternative implementation, that overcomes these problems, is to use a class and fields as a Symbol Table. To let this work you need to define all of your DSL script in a single Expression Builder. You then define fields for each symbol that you need, typed according to their types. So for our tasks example you need fields for each task and a type for them to belong to, something like this.

	Tasks drinkCoffee, makeCoffee, wash;

A class named Tasks is, like so many things in DSL processing, an unconventional name. Again the readability of the DSL is trumping my usual code style rules. By defining fields like this I can now refer to them in the DSL script as fields, the IDE will offer completion for them, and the compiler will check them.

Just defining the fields, however, is not enough. When we refer to fields in the DSL script, it refers to the contents of the field not the field definition. Just defining the field leaves with a blank value, usually null, and there's no way for the processing code to figure out which field is actually being referred to in the script. To deal with this we need to populate each field field with a suitable object before the script is executed. A good way to do this is to use the the class instance as the active script, put code in the constructor to populate the fields and the script inside an instance method. The contents of the fields are usually small Expression Builders that link to the underlying framework object and also contain the field name to help with cross-referencing. In terms of a Symbol Table the field name acts as the key and the builder acts as the value, but occasionally you will need another kind of key access, which is why it's handy for the builders in the field to keep the field name.

The DSL script will usually refer to the field by the field literal itself - which is the whole point. To refer to the wash task I can just type the wash field name in the DSL script. However as we're processing the DSL script we'll need the builders in the fields to refer to each other. This will sometimes involve looking up fields by name, or iterating through all fields of a certain type. Doing this will require more tricky code, usually using reflection. Usually there's not too much of it and providing it's well encapsulated it shouldn't make the language too difficult to process.

When to use it

Example: Hashmap Symbol Table with symbolic keys (Ruby)

[TBD: I'll probably do another simple example with an external DSL before this one. I'll use the same task dependency topic.]

With languages that have a symbol data type, this makes a natural choice for a symbol table. Here's a simple DSL script with breakfast tasks and prerequisites.

task :go_to_work => [:drink_coffee, :dress]
task :drink_coffee => [:make_coffee, :wash]
task :dress => [:wash]

Each task is referred to in the DSL by ruby's symbol data type. I use Function Sequence to declare the list of tasks with the details of each task shown using a hash Literal Collection Expression.

The framework is simple to describe, just a single task class.

class Task
  attr_reader :name
  attr_accessor :prerequisites
  
  def initialize name, *prereqs
    @name = name
    @prerequisites = prereqs
  end

  def to_s
    name
  end 
end

The DSL script is read by an Expression Builder that uses uses Object Scoping with instance_eval.

class TaskBuilder...
  def load aStream
    instance_eval aStream
    return self
  end

The Symbol Table is a simple dictionary.

class TaskBuilder...
  def initialize
    @tasks = {}
  end

The task clause takes the single hash association argument and uses to populate the task information.

class TaskBuilder...
  def task argMap
    raise "syntax error" if argMap.keys.size != 1
    key = argMap.keys[0]
    newTask = obtain_task(key)
    prereqs = argMap[key].map{|s| obtain_task(s)}
    newTask.prerequisites = prereqs    
  end
  def obtain_task aSymbol
    @tasks[aSymbol] = Task.new(aSymbol.to_s) unless @tasks[aSymbol]
    return  @tasks[aSymbol]
  end

The implementation of a Symbol Table using symbols is the same as it is if your identifiers are strings. However you should use symbols if they are available.

Example: Statically Typed Class Symbol Table (java)

If you are doing an internal DSL in a statically typed language, such as C# or Java, you can easily use a hashmap as your Symbol Table with strings as keys. This approach has the problems that the strings are a noisy datatype, you don't have any static type checking, and most importantly you don't get any IDE assistance. You can deal with this by making a class be your Symbol Table with fields for each symbol you use.

I used this approach for the Java example in the introduction, so this seems like a good example to use to show how this works.

[TBD: There will be another discussion of the basic implementation of this state machine using string identifiers. I will need to link back there for comparisons and for discussion of the framework]

The DSL script is in a specific class.

public class BasicStateMachine extends StateMachineBuilder {

  Events doorClosed, drawOpened, lightOn, panelClosed;
  Commands unlockPanel, lockPanel, lockDoor, unlockDoor;
  States idle, active, waitingForLight, waitingForDraw, unlockedPanel;
  ResetEvents doorOpened;

  protected void defineStateMachine() {
    doorClosed. code("D1CL");
    drawOpened. code("D2OP");
    lightOn.    code("L1ON");
    panelClosed.code("PNCL");

    doorOpened. code("D1OP");

    unlockPanel.code("PNUL");
    lockPanel.  code("PNLK");
    lockDoor.   code("D1LK");
    unlockDoor. code("D1UL");

    idle
        .actions(unlockDoor, lockPanel)
        .transition(doorClosed).to(active)
        ;

    active
        .transition(drawOpened).to(waitingForLight)
        .transition(lightOn).   to(waitingForDraw)
        ;

    waitingForLight
        .transition(lightOn).to(unlockedPanel)
        ;

    waitingForDraw
        .transition(drawOpened).to(unlockedPanel)
        ;

    unlockedPanel
        .actions(unlockPanel, lockDoor)
        .transition(panelClosed).to(idle)
        ;
 }
}

The DSL script is housed in its own class. The script itself is in one method and the fields of the class represent the symbol table. I've set things up so the DSL script class is a subclass of a builder - this way I can have the superclass builder control the way the script is run. (Using a subclass like this also allows me to use Object Scoping, although I don't need it here.)

class StateMachineBuilder...
  public StateMachine build() {
    initializeIdentifiers(Events.class, Commands.class, States.class, ResetEvents.class);
    defineStateMachine();
    return produceStateMachine();
  }
  abstract protected void defineStateMachine();

I define the public method to run the script on the superclass; it executes code to setup the Symbol Table fields before running the script. In this case running the DSL script provides a basic preparation of the information for the state machine and a second pass actually produces the framework objects. This echoes the style I used for the example with string identifiers.[TBD: or rather will use] So running a script has three distinct stages: initializing the identifiers (generic), running the DSL script (specific), and finally producing the framework state machine (generic).

I need the first step of initializing the identifiers since any reference to a field in the DSL script refers to the contents of the field rather than the field itself. In this case the suitable objects are specific identifier objects that hold the name of the identifier and refer to the underlying framework object. Doing this ends up being a bit more messy than I'd like as I want to write generic code for setting up the identifiers to avoid duplicating setup code. However any generic code doesn't know about the specific type of the identifier that's being setup and so has to determine it dynamically.

Hopefully this will become a little clearer when we look at an example, in this case the event builder class (Events). The first thing to talk about here is the name of the class. Any style book on object will wisely tell you to avoid plural class names, and I agree with that advice. However here a plural name reads better in the context of the DSL, so this is another case of general code rules being broken to make a good DSL script. The DSL naming doesn't alter the fact that it is truly a builder of events, so I'll refer to it as the event builder class in my text (and similarly for its siblings).

The event builder extends a general identifier class.

class Identifier...
  private String name;
  protected StateMachineBuilder builder;

  public Identifier(String name, StateMachineBuilder builder) {
    this.name = name;
    this.builder = builder;
  }
  public String getName() {
    return name;
  }
public class Events extends Identifier {
  private Event event;
  public Events(String name, StateMachineBuilder builder) {
    super(name, builder);
  }
  Event getEvent() {
    return event;
  }

There is a simple division of responsibility here, with the identifier class carrying responsibilities needed for all identifiers and the subclasses carrying what's needed for specific types.

Let's look at the first step of running the script - initializing the identifiers. Since many identifier classes need to be initialize, I have some generic code to do that. This way I can provide a list of classes which are identifiers and the code will initialize all fields of those classes.

class StateMachineBuilder...
  private void initializeIdentifiers(Class... identifierClasses) {
    List<Class> identifierList = Arrays.asList(identifierClasses);
    for (Field f : this.getClass().getDeclaredFields()) {
      try {
        if (identifierList.contains(f.getType())) {
          f.setAccessible(true);
          f.set(this, Identifier.create(f.getType(), f.getName(), this));
        }
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }
class Identifier...
  static Identifier create(Class type, String name, StateMachineBuilder builder)
      throws NoSuchMethodException, InvocationTargetException, IllegalAccessException, InstantiationException
  {
      Constructor ctor = type.getConstructor(String.class, StateMachineBuilder.class);
      return (Identifier) ctor.newInstance(name, builder);
  }

Doing it this way is more tricky code than I like, but avoids having to write duplicate initializing methods. Essentially I look through every field on the DSL script object and if the type of the field is one of those I've passed in I initialize it with a special static utility method that finds and calls the right constructor. As a result once I've called initializeIdentifiers I have all of these fields populated with objects that will help me construct the state machine.

The next step is to execute the DSL script itself. The DSL script executes by building up suitable intermediate objects to capture all the information about the state machine.

The first step is defining the codes for the events and commands.

class Events...
  public void code(String code) {
    event = new Event(getName(), code);
  }

Since the code is all the information I need to create a framework event object, I can create it on calling code and put it inside the identifier (the command builder looks just the same).

The event and command builders are degenerately simple Expression Builders. The state builder is a bit more of a builder as it needs several steps.

Since a state framework object isn't immutable I can create it in the constructor.

class States...
  private State content;
  private List<TransitionBuilder> transitions = new ArrayList<TransitionBuilder>();
  private List<Commands> commands = new ArrayList<Commands>();

  public States(String name, StateMachineBuilder builder) {
    super(name, builder);
    content = new State(name);
  }

The first building behavior I'll show is creating the actions. The basic behavior here is simple - I go through the supplied command identifiers and store them in the state builder.

class States...
  public States actions(Commands... identifiers) {
    builder.definingState(this);
    commands.addAll(Arrays.asList(identifiers));
    return this;
  }

If the DSL script always defines the codes before it defines the states (as I've done here) I could save myself the need to store command builders in the state builder and instead put the command framework objects into the state framework object. However this would lead to errors if I define a state before the its action codes. Using the builder as an intermediate object allows me to work it either way.

There is a bit of trickiness here. The DSL makes the assumption that the first mentioned state is the start state. As a result I have to check whenever I begin defining a state to see if this is the first state I start to define and if so make it the start state. Since it's only the overall state machine builder that can really tell if a state is the first one to be defined I want the machine builder to make the decision about whether to set a state as first.

class StateMachineBuilder...
  protected void definingState(States identifier) {
    if (null == start) start = identifier.getState();
  }

The state builder does need to call the machine builder to tell it that it's being defined but it shouldn't know what the machine builder is going to do with that information as that's the secret of the machine builder. So I make what is effectively an event notification call from the state builder (since that is all it knows) and let the machine builder decide what to do on that event. This is a good example of where naming is important in communicating what I think I think the responsibilities and relative knowledge in the objects should be.

The other thing we can do with a state builder is to define a transition. As this requires a couple of steps, it's a dash more complicated. I begin with the transition method, which creates a separate transition builder object.

class States...
  public TransitionBuilder transition(Events identifier) {
    builder.definingState(this);
    return new TransitionBuilder(this, identifier);
  }
class TransitionBuilder...
  private Events trigger;
  private States targetState;
  private States source;

  TransitionBuilder(States state, Events trigger) {
    this.trigger = trigger;
    this.source = state;
  }

Since I don't need to mention the transition builder's type in the DSL script, I can give it a more meaningful name. Its only builder method is the "to" clause.

class TransitionBuilder...
  public States to(States targetState) {
    this.targetState = targetState;
    source.addTransition(this);
    return source;
  }

These are all the elements I need to capture all the specific information in the DSL script. When the script is run I have a data structure of intermediate data: the builders - captured in the fields of the DSL script object itself. I now need to run through this structure to produce a fully wired up framework state machine.

class StateMachineBuilder...
  private StateMachine produceStateMachine() {
    assert null != start;
    StateMachine result = new StateMachine(start);
    for (States s : getStateIdentifers())
      s.produce();
    produceResetEvents(result);
    return result;
  }

Most of the work here is going through all the state builders and getting them to produce their wired-up framework objects. To find all these states I need to get all the objects out of the fields of the script class, so I again I use some reflective trickery to find all fields of the state builder's type.

class StateMachineBuilder...
  private List<States> getStateIdentifers() {
    return getIdentifiers(States.class);
  }
  private <T extends Identifier> List<T> getIdentifiers(Class<T> klass) {
    List<T> result = new ArrayList<T>();
    for (Field f : this.getClass().getDeclaredFields()) {
      if (f.getType().equals(klass))
        try {
          f.setAccessible(true);
          result.add(((T) f.get(this)));
        } catch (IllegalAccessException e) {
          throw new RuntimeException(e);
        }
    }
    return result;
  }

To produce its framework object, the state builder wires up the commands and produces its transitions.

class States...
  void produce() {
    for (Commands c : commands)
      content.addCommand(c.getCommand());
    for (TransitionBuilder t : transitions)
      t.produce();
  }
class TransitionBuilder...
  void produce() {
    source.getState().addTransition(trigger.getEvent(), getTargetState().getState());
  }

The last step of the production is to produce the reset events.

class StateMachineBuilder...
  private void produceResetEvents(StateMachine result) {
    result.addResetEvents(getResetEvents());
  }
  private Event[] getResetEvents() {
    List<Event> result = new ArrayList<Event>();
    for (Events identifier : getIdentifiers(ResetEvents.class))
      result.add(identifier.getEvent());
    return result.toArray(new Event[result.size()]);
  }

Using a class and fields as a symbol table does involve a bit of tricky code in places, but the benefit is full static typing and IDE support. That's usually a worthwhile trade off.

Significant Revisions

27 Mar 08: First Draft