Hello Cup

13 May 2007

As I explore parser generator tools for external DomainSpecificLanguages, I've said HelloAntlr and HelloSablecc. If you spend much time with parser generators you can't really avoid the old stalwarts lex and yacc (or their gnu counterparts flex and bison). I want to explore the way lex and yacc operate, but my C has got too rusty. As Erich Gamma quipped, I've got too lazy to take out my own garbage. Fortunately there is an implementation of a yaccish system for Java, which is just what I need.

Like the classic lex and yacc, the Java implementation comes as two independent tools: JFlex and CUP. Although they are developed separately, they provide hooks to work together.

As with my previous posts along these lines, this is an overly simple example just to get the tools working. I take an input file which says:

item camera
item laser

and turn them into item objects inside a configuration, using the following model:

import java.util.HashMap;
import java.util.Map;

public class Configuration {
  private Map<String, Item> items = new HashMap<String, Item>();
  public Item getItem(String key) {
    return items.get(key);
  }
  public void addItem(Item arg) {
    items.put(arg.getName(), arg);
  }
}

public class Item {
  private String name;
  public Item(String name) {
    this.name = name;
  }
  public String getName() {
    return name;
  }
}

to pass the following test:

    @Test public void itemsAddedToItemList() {
      Reader input = null;
      try {
        input = new FileReader("rules.txt");
      } catch (FileNotFoundException e) {
        throw new RuntimeException(e);
      }
      Configuration config = CatalogParser.parse(input);
      assertNotNull(config.getItem("camera"));
      assertNotNull(config.getItem("laser"));
    }
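
I haven't shown CatalogParser itself; it's just a little glue to wire the generated lexer and parser together. A minimal sketch might look like this (the result field it reads is the one I define in the grammar below; the wiring is my assumption rather than anything the tools generate for you):

package parser;

import java.io.Reader;
import model.Configuration;

public class CatalogParser {
  public static Configuration parse(Reader input) {
    try {
      // CUP calls its generated class "parser" by default;
      // Lexer is named by the %class directive in catalog.l
      parser p = new parser(new Lexer(input));
      p.parse();
      return p.result;
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
  }
}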

The first issue is just to get the build going. As with my previous examples I want to take the grammar input files and generate the lexer and parser into a gen directory. Unlike my previous examples I don't do this directly in ant; instead I use ant to call a ruby script.

--- build.xml
<target name = "gen">
  <exec executable="ruby" failonerror="true">
    <arg line = "gen.rb"/>
  </exec>
</target>

--- gen.rb
require 'fileutils'
include FileUtils

# generate the lexer straight into the gen directory
system "java -cp lib/JFlex.jar JFlex.Main -d gen/parser src/parser/catalog.l"

# CUP writes its output to the current directory, so move it into place
system "java -jar lib/java-cup-11a.jar src/parser/catalog.y"
%w[parser.java sym.java].each {|f| mv f, 'gen/parser'}

Yes, I know it's a long way around, but with a lot of source files I'm using the approach in FlexibleAntlrGeneration to do my generation and I can't be bothered to sort it out in ant as well.

(When I attended CITCON recently, I was surprised to find out that people were much happier with ant than I thought. Grumpy me thinks it's a case of Stockholm Syndrome. Even when less grumpy I'm keeping my eye on things like Raven and BuildR (which has now got some documentation). I'm so ready to ditch ant.)

You'll notice that CUP puts its output files in the current directory and I couldn't see how to override that behavior. So I generated them and moved them with a separate command.

Once I've generated the code, I compile and test it with ant.

<target name = "compile" depends = "gen">
    <mkdir dir="${dir.build}"/>
    <javac destdir="${dir.build}" classpathref="path.compile">
      <src path = "${dir.src}"/>
      <src path = "${dir.gen}"/>
      <src path = "${dir.test}"/>
    </javac>
  </target>

  <target name = "test" depends="compile">
     <junit haltonfailure = "on" printsummary="on">
      <formatter type="brief"/>
      <classpath refid = "path.compile"/>
      <batchtest todir="${dir.build}" >
        <fileset dir = "test" includes = "**/*Test.java"/>
      </batchtest>
     </junit>
   </target>

Lex and yacc separate the lexer and parser into different files. Each is generated independently and combined during compilation. I'll start with the lexer file (catalog.l). The opening declares the output file's package and imports.

package parser;
import java_cup.runtime.*;

JFlex uses %% markers to break the file into sections. The second section consists of various declarations. The first bit names the output class and tells it to interface with CUP.

%%
%class Lexer
%cup
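
The %cup directive makes the generated Lexer implement CUP's java_cup.runtime.Scanner interface, which is how the parser pulls tokens from it. For reference, the interface amounts to:

public interface Scanner {
  Symbol next_token() throws Exception;
}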

The next bit is code to be folded into the lexer. Here I define a function to create Symbol objects - again to hook into CUP.

%{
  private Symbol symbol(int type) {
    return new Symbol(type, yytext());
  }
%}

The Symbol class is defined in CUP and is part of its runtime jar. There are several constructors, taking varying amounts of information about the symbol and where it appeared in the input.
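
If I wanted position information for error messages, I could turn on JFlex's %line and %column directives and use one of those richer constructors - a variation I haven't wired into this example:

%{
  private Symbol symbol(int type) {
    // Symbol(id, left, right, value): left and right here carry line and column
    return new Symbol(type, yyline, yycolumn, yytext());
  }
%}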

Next up are some macros to define words and whitespace.

Word = [:jletter:]+
WS = [ \t\r\n]

The final section is the actual lexer rules. I define one to return the item keyword and another to return simple words to the parser. Note that "item" also matches {Word}; when two rules match text of the same length, JFlex picks whichever is listed first, so the keyword rule has to come before the word rule.

%%
"item"      {return symbol(sym.K_ITEM);}
{Word}      {return symbol(sym.WORD);}
{WS}        {/* ignore */}
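
Run over the sample file, these rules produce the token sequence (in an informal notation):

K_ITEM WORD("camera") K_ITEM WORD("laser")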

So the lexer will send a stream of K_ITEM and WORD tokens to the parser. I define the parser in catalog.y. Again it starts with package and import declarations.

package parser;
import java_cup.runtime.*;
import model.*;

I'm parsing the data into a configuration object, so I need to declare a place to put that result. Again this code is copied directly into the parser object.

parser code {: Configuration result = new Configuration(); :}
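
CUP copies that block verbatim into the generated class, so the effect is roughly this (a sketch of the generated code, not something I write):

public class parser extends java_cup.runtime.lr_parser {
  Configuration result = new Configuration();
  // ... plus the generated tables and parsing machinery ...
}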

With CUP I need to define all the rule elements that I'll use in the productions.

terminal K_ITEM;
terminal String WORD;
non terminal catalog, item;

The terminals are the tokens I get from the lexer; the non terminals are the rules I'll build up myself. If I want to get a payload from a token, I need to declare its type - so WORD is declared as a String.

The catalog is a list of items. Unlike with antlr or sablecc I don't have EBNF here, so I can't say item*; instead I need a recursive rule.

catalog ::= item | item catalog;
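
As an aside, LR-based tools like yacc and CUP generally prefer left recursion, since it lets the parser reduce each item as it arrives rather than piling the whole list onto the stack first. An equivalent left-recursive rule would be:

catalog ::= item | catalog item;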

The item rule itself contains the embedded action to put the item into the configuration.

item ::= K_ITEM WORD:w {: parser.result.addItem(new Item(w)); :}
          ;

A little wrinkle to note here is that the actions are put into a separate class from the parser object, so to get to the result field I defined earlier I have to use the parser field of the actions object. I should also mention that once I go much further with this I start to use an EmbedmentHelper to keep the action code simple.

People who have used yacc before might notice that I can label the elements of a rule and refer to them in the actions, instead of the $1, $2 convention used in yacc. Similarly, if a rule returns a value, CUP uses RESULT rather than $$.
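
So had I given the item non terminal a type, the action could pass a value up through RESULT rather than poking the configuration directly - a variation on what I did above:

non terminal Item item;

item ::= K_ITEM WORD:w {: RESULT = new Item(w); :};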

My memories of lex and yacc are dim, but these tools do seem to mimic the style of using them pretty well. My biggest beef so far is the error handling, which caused me much more fuss than it did with antlr. My feeling is that if you're new to parser generators then antlr is the better choice (particularly due to its book). However, if you're familiar with lex and yacc, these two are similar enough to build on that knowledge.