Regex Table Lexer

by Rebecca Parsons

Implement a lexical analyzer using a list of regular expressions.

Parsers primarily deal with the structure of a language, specifically the way components of the language can be combined. The most basic language components, such as keywords, numbers, and names, could be recognized by the parser itself. However, we generally separate this work out into a lexical analyzer. By using a separate pass to recognize these terminal symbols, we simplify the construction of the parser.

Directly implementing a lexical analyzer, also referred to as a lexer, is relatively straightforward. Lexical analyzers stay firmly in the space of regular languages, which means we can use standard regular expression APIs to implement them. For a Regex Table Lexer, we use a list of regular expressions, each associated with a particular terminal symbol. We scan the input, matching successive pieces of it against the regular expressions and generating a stream of tokens that name the terminal symbols recognized. This token stream is the input to the parser.
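The scanning loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the book's implementation: the token names, the patterns in TOKEN_TABLE, and the tokenize function are all hypothetical, chosen only to show a table of regular expressions driving the scan. It assumes that earlier table entries take priority and that whitespace is matched but not emitted.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # name of the terminal symbol
    text: str  # the piece of input that matched


# Hypothetical table for a toy language. Order matters: the first
# pattern that matches at the current position wins, so keywords are
# listed before identifiers. A kind of None means "match and discard".
TOKEN_TABLE = [
    (None, re.compile(r"\s+")),
    ("KEYWORD", re.compile(r"(?:state|end)\b")),
    ("NUMBER", re.compile(r"\d+")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("ARROW", re.compile(r"=>")),
]


def tokenize(source: str) -> Iterator[Token]:
    """Scan source left to right, yielding one Token per match."""
    pos = 0
    while pos < len(source):
        for kind, pattern in TOKEN_TABLE:
            match = pattern.match(source, pos)  # anchored at pos
            if match:
                if kind is not None:
                    yield Token(kind, match.group())
                pos = match.end()
                break
        else:
            # No entry in the table matched at this position.
            raise SyntaxError(
                f"unexpected character {source[pos]!r} at offset {pos}"
            )
```

Calling list(tokenize("state idle 42 => end")) yields tokens of kind KEYWORD, IDENTIFIER, NUMBER, ARROW, and KEYWORD, in that order. Because the table is an ordinary list, adding a terminal symbol means adding one entry; the trade-off in this sketch is that priority is decided purely by position in the list, not by the longest match.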

For more details, see chapter 20 of the DSL book.
