ANTLR Specification: Meta Language

ANTLR 2 accepts three types of grammar specifications -- parsers, lexers, and tree-parsers (also called tree-walkers). Because ANTLR 2 uses LL(k) analysis for all three grammar variants, the grammar specifications are similar, and the generated lexers and parsers behave similarly.

Note: in this document, the word "parser" usually includes tree-parsers as well as token stream parsers, except where noted.

Meta-Language Vocabulary

Whitespace. Spaces, tabs, and newlines are separators in that they can separate ANTLR vocabulary symbols such as identifiers, but are ignored beyond that. For example, "FirstName LastName" appears as a sequence of two token references to ANTLR not token reference, space, followed by token reference.

Comments. ANTLR accepts C-style block comments and C++-style line comments. Java-style documenting comments are allowed on grammar classes and rules, which are passed to the HTML output if requested. For example,

Characters. Character literals are specified just like in Java. They may contain octal-escape characters (e.g., '\377'), Unicode sequences (e.g., '\uFFFF'), and the usual special character escapes recognized by Java ('\b', '\r', '\t', '\n', '\f', '\'', '\\'). In lexer rules, single quotes represent a character to be matched on the input character stream. Single-quoted characters are not yet supported in parser rules. In the future, ANTLR will optimize for UNICODE character sets.

End of file. The EOF token is automatically generated for use in parser rules:

Strings. String literals are sequences of characters enclosed in double quotes. The characters in the string may be represented using the same escapes (octal, Unicode, etc.) that are valid in character literals.

In lexer rules, strings are interpreted as sequences of characters to be matched on the input character stream (e.g., "for" is equivalent to 'f' 'o' 'r').

In parser rules, strings represent tokens, and each unique string is assigned a token type. However, ANTLR does not create lexer rules to match the strings. Instead, ANTLR enters the strings into a literals table in the associated lexer. ANTLR will generate code to test the text of each token against the literals table, and change the token type when a match is encountered before handing the token off to the parser. You may also perform the test manually -- the automatic code-generation is controllable by a lexer option.

You may want to use the token type value of a string literal in your actions, for example in the synchronization part of an error-handler. For string literals that consist of alphabetic characters only, the string literal value will be a constant with a name like LITERAL_xxx, where xxx is the name of the token. For example, the literal "return" will have an associated value of LITERAL_return. You may also assign a specific label to a literal using the tokens section.

Token references. Identifiers beginning with an uppercase letter are token references. The subsequent characters may be any letter, digit, or underscore. A token reference in a parser rule results in matching the specified token. A token reference in a lexer rule results in a call to the lexer rule for matching the characters of the token. In other words, token references in the lexer are treated as rule references.

Token definitions. Token definitions in a lexer have the same syntax as parser rule definitions, but refer to tokens, not parser rules. For example,

Rule references. Identifiers beginning with a lowercase letter are references to ANTLR parser rules. The subsequent characters may be any letter, digit, or underscore. Lexical rules may not reference parser rules.

Actions. Character sequences enclosed in (possibly nested) curly braces are semantic actions. Curly braces within string and character literals are not action delimiters.

Arguments Actions. Character sequences in (possibly nested) square brackets are rule argument actions. Square braces within string and character literals are not action delimiters. The arguments within [] are specified using the syntax of the generated language, and should be separated by commas.

Many people would prefer that we use normal parentheses for arguments, but parentheses are best used as grammatical grouping symbols for EBNF.

Symbol	Description
`(...)`	subrule
`(...)*`	closure subrule
`(...)+`	positive closure subrule
`(...)?`	optional
`{...}`	semantic action
`[...]`	rule arguments
`{...}?`	semantic predicate
`(...)=>`	syntactic predicate
`\|`	alternative operator
`..`	range operator
`~`	not operator
`.`	wildcard
`=`	assignment operator
`:`	label operator, rule start
`;`	rule end
`class`	grammar class
`extends`	specifies grammar base class
`returns`	specifies return type of rule
`options`	options section
`tokens`	tokens section
`header`	header section
`tokens`	token definition section

Header Section

A header section contains source code that must be placed before any ANTLR-generated code in the output parser. This is mainly useful for C or C++ output due to their requirement that elements be declared before being referenced. In Java, this can be used to specify a package for the resulting parser, and any imported classes. A header section looks like:

Parser Class Definitions

All parser rules must be associated with a parser class. A grammar (.g) file may contain an arbitrary number of parser class definitions (along with lexers and tree-parsers). Each parser class specification precedes the options and rule definitions of the parser. A parser specification in a grammar file often looks like:

When generating code in an object-oriented language, parser classes result in classes in the output, and rules become member methods of the class. In C, classes would result in structs, and some name-mangling would be used to make the resulting rule functions globally unique.

The optional class preamble is some arbitrary text enclosed in {}. The preamble, if it exists, will be output to the generated class file immediately before the definition of the class.

Enclosing curly braces are not used to delimit the class because it is hard to associate the trailing right curly brace at the bottom of a file with the left curly brace at the top of the file. Instead, a parser class is assumed to continue until the next class statement.

Lexical Analyzer Class Definitions

A parser class results in parser objects that know how to apply the associated grammatical structure to an input stream of tokens. To perform lexical analysis, you need to specify a lexer class that describes how to break up the input character stream into a stream of tokens. The syntax is similar to that of a parser class:

Lexical rules contained within a lexer class become member methods in the generated class. Each grammar (.g) file may contain an arbitrary number of lexer classes. The parser and lexer classes may appear in any order.

The optional class preamble is some arbitrary text enclosed in {}. The preamble, if it exists, will be output to the generated class file immediately before the definition of the class.

Tree-parser Class Definitions

A tree-parser is like a parser, except that is processes a two-dimensional tree of AST nodes instead of a one-dimensional stream of tokens. Tree parsers are specified identically to parsers, except that the rule definitions may contain a special form to indicate descent into the tree.

Options Section

Rather than have the programmer specify a bunch of command-line arguments to the parser generator, an options section within the grammar itself serves this purpose. This solution is preferable because it associates the required options with the grammar rather than ANTLR invocation. The section is preceded by the options keyword and contains a series of option/value assignments. An options section may be specified on both a per-file, per-grammar, and per-rule basis.

Tokens Section

If you need to define an "imaginary" token, one that has no corresponding real input symbol, use the tokens section to define them. Imaginary tokens are used often for tree nodes that mark or group a subtree resulting from real input. For example, you may decide to have an EXPR node be the root of every expression subtree and DECL for declaration subtrees for easy reference during tree walking. Because there is no corresponding input symbol for EXPR, you cannot reference it in the grammar to implicitly define it. Use the following to define those imaginary tokens.

You can also define literals in this section and, most importantly, assign to them a valid label as in the following example.

Strings defined in this way are treated just as if you had referenced them in the parser.

If a grammar imports a vocabulary containing a token, say T, then you may attach a literal to that token type simply by adding T="a literal" to the tokens section of the grammar. Similarly, if the imported vocabulary defines a literal, say "_int32", without a label, you may attach a label via INT32="_int32" in the tokens section.

Grammar Inheritance

Object-oriented programming languages such as C++ and Java allow you to define a new object as it differs from an existing object, which provides a number of benefits. "Programming by difference" saves development/testing time and future changes to the base or superclass are automatically propagated to the derived or subclass. ANTLR 2.0 supports grammar inheritance as a mechanism for creating a new grammar class based on a base class. Both the grammatical structure and the actions associated with the grammar may be altered independently.

Rule Definitions

Because ANTLR considers lexical analysis to be parsing on a character stream, both lexer and parser rules may be discussed simultaneously. When speaking generically about rules, we will use the term atom to mean an element from the input stream (be they characters or tokens).

The structure of an input stream of atoms is specified by a set of mutually-referential rules. Each rule has a name, optionally a set of arguments, optionally an init-action, optionally a return value, and an alternative or alternatives. Each alternative contains a series of elements that specify what to match and where.

where type is a type specifier of the generated language, and id is a valid identifier of the generated language. In Java, a single type identifier would suffice most of the time, but returning an array of strings, for example, would require brackets:

The ids of the returns statement will be passed to the output code. An action may assign directly to these ids to affect the return values.

Init-actions are specified before the colon. Init-actions differ from normal actions because they are always executed regardless of guess mode. In addition, they are suitable for variable declarations in languages like C where all declarations must be at the start of the rule.

Lexer rules. Rules defined within a lexer grammar must have a name beginning with an uppercase letter. These rules implicitly match characters on the input stream instead of tokens on the token stream. Referenced grammar elements include token references (implicit lexer rule references), characters, and strings. Lexer rules are processed in the exact same manner as parser rules and, hence, may specify arguments and return values; further, lexer rules can also have local variables and use recursion. See more about lexical analysis with ANTLR.

Parser rules. Parser rules apply structure to a stream of tokens whereas lexer rules apply structure to a stream of characters. Parser rules, therefore, must not reference character literals. Double-quoted strings in parser rules are considered token references and force ANTLR to squirrel away the string literal into a table that can be checked by actions in the associated lexer.

Tree-parser rules. In a tree-parser, an additional special syntax is allowed to specify the match of a two-dimensional structure. Whereas a parser rule may look like:

which means "match A B and C sequentially", a tree-parser rule may also use the syntax:

which means "match a node of type A, and then descend into its list of children and match B and C". This notation can be nested arbitrarily, using #(...) anywhere an EBNF construct could be used, for example:

Atomic Production elements

Character literal. A character literal can only be referred to within a lexer rule. The single character is matched on the character input stream. There are no need to escape regular expression meta symbols because regular expressions are not used to match lexical atoms. For example, '{' need not have an escape as you are specifying the literal character to match. Meta symbols are used outside of characters and string literals to specify lexical structure.

String literal. Referring to a string literal within a parser rule defines a token type for the string literal, and causes the string literal to be placed in a hash table of the associated lexer. The associated lexer must check the hash table in an action. Referring to a string within a lexer rule matches the indicated sequence of characters and is a shorthand notation. For example, consider the following lexer rule definition:

There are no need to escape regular expression meta symbols because regular expressions are not used to match characters in the lexer.

Token reference. Referencing a token in a parser rule implies that you want to recognize a token with the specified token type. This does not actually call the associated lexer rule--the lexical analysis phase delivers a stream of tokens to the parser.

A token reference within a lexer rule implies a method call to that rule, and carries the same analysis semantics as a rule reference within a parser. In this situation, you may specify rule arguments and return values. See the next section on rule references.

Wildcard. The "." wildcard within a parser rule matches any single token; within a lexer rule it matches any single character. For example, this matches any two tokens between the B and C:

Simple Production elements

Rule reference. Referencing a rule implies a method call to that rule at that point in the parse. You may pass parameters and obtain return values. For example, formal and actual parameters are specified within square brackets:

Semantic action. Actions are blocks of source code (expressed in the target language) enclosed in curly braces. The code is executed after the preceding production element has been recognized and before the recognition of the following element. Actions are typically used to generate output, construct trees, or modify a symbol table. An action's position dictates when it is recognized relative to the surrounding grammar elements.

If the action is the first element of a production, it is executed before any other element in that production, but only if that production is predicted by the lookahead.

The first action of an EBNF subrule may be followed by ':'. Doing so designates the action as an init-action and associates it with the subrule as a whole, instead of any production. It is executed immediately upon entering the subrule -- before lookahead prediction for the alternates of the subrule -- and is executed even while guessing (testing syntactic predicates). For example:

The init-action would be executed regardless of what (if anything) matched in the optional subrule.

Production Element Operators

Element complement. The "~" not unary operator must be applied to an atomic element such as a token identifier. For some token atom T, ~T matches any token other than T except end-of-file. Within lexer rules, ~'a' matches any character other than character 'a'. The sequence ~. ("not anything") is meaningless and not allowed.

The vocabulary space is very important for this operator. In parsers, the complete list of token types is known to ANTLR and, hence, ANTLR simply clones that set and clears the indicated element. For characters, you must specify the character vocabulary; the only other choice is to add elements 0..255 and clear the indicated character. For Unicode, complementing a character means creating a set with 2^16 elements unless the programmer specifies the vocabulary. The character vocabulary includes the characters specified in the charVocabulary option and any characters referenced in the lexer rules. Here is a sample use of the character vocabulary option:

Set complement. the not operator can also be used to construct a token set or character set by complementing another set. This is most useful when you want to match tokens or characters until a certain delimiter set is encountered. Rather than invent a special syntax for such sets, ANTLR allows the placement of ~ in front of a subrule containing only simple elements and no actions. In this specific case, ANTLR will not generate a subrule, and will instead create a set-match. The simple elements may be token references, token ranges, character literals, or character ranges. For example:

Range operator. The range binary operator implies a range of atoms may be matched. The expression 'c1'..'c2' in a lexer matches characters inclusively in that range. The expression T..U in a parser matches any token whose token type is inclusively in that range, which is of dubious value unless the token types are generated externally.

AST root operator. When generating abstract syntax trees (ASTs), token references suffixed with the "^" root operator force AST nodes to be created and added as the root of the current tree. This symbol is only effective when the buildAST option is set. More information about ASTs is also available.

AST exclude operator. When generating abstract syntax trees, token references suffixed with the "!" exclude operator are not included in the AST constructed for that rule. Rule references can also be suffixed with the exclude operator, which implies that, while the tree for the referenced rule is constructed, it is not linked into the tree for the referencing rule. This symbol is only effective when the buildAST option is set. More information about ASTs is also available.

Token Classes

By using a range operator, a not operator, or a subrule with purely atomic elements, you implicitly define an "anonymous" token or character class--a set that is very efficient in time and space. ANTLR will soon let you define explicit or labeled token/character classes. For example, you can define a lexer rule such as:

These describe sets of tokens and characters respectively that are easily optimized to simple, single, bit-set rather than series of token and character comparisons.

Predicates

Semantic predicate. Semantics predicates are conditions that must be met at parse-time before parsing can continue past them. The functionality of semantic predicates is explained in more detail later. The syntax of a semantic predicate is a semantic action suffixed by a question operator:

The expression must not have side-effects and must evaluate to true or false (boolean in Java, int in C, or bool in C++). Since semantic predicates can be executed while guessing, they should not rely upon the results of actions or rule parameters.

Syntactic predicate. Syntactic predicates specify the lookahead language needed to predict an alternative. Syntactic predicates are explained in more detail later. The syntax of a syntactic predicate is a subrule with a => operator suffix:

Where the lookahead-language can be any valid ANTLR construct including references to other rules. Actions are not executed, however, during the evaluation of a syntactic predicate.

Element Labels

Any atomic or rule reference production element can be labeled with an identifier (case not significant). In the case of a labeled atomic element, the identifier is used within a semantic action to access the associated Token object or character. For example,

No "$" operator is needed to reference the label from within an action as was the case with ANTLR 1.xx.

The AST node constructed for a token reference or rule reference may be accessed from within actions as label_AST.

Labels on token references can also be used in association with parser exception handlers to specify what happens when that token cannot be matched.

Labels on rule references are used for parser exception handling so that any exceptions generated while executing the labeled rule can be caught.

EBNF Rule Elements

ANTLR supports extended BNF notation according to the following four subrule syntax / syntax diagrams:

Interpretation Of Semantic Actions

Semantic actions are copied to the appropriate position in the output parser verbatim with the exception of AST action translation.

None of the $-variable notation from ANTLR 1.xx carries forward into ANTLR 2.00.

Semantic Predicates

A semantic predicate specifies a condition that must be met (at run-time) before parsing may proceed. We differentiate between two types of semantic predicates: (i) validating predicates that throw exceptions if their conditions are not met while parsing a production and (ii) disambiguating predicates that are hoisted into the prediction expression for the associated production.

Semantic predicates are syntactically semantic actions suffixed with a question mark operator:

The expression may use any symbol provided by the programmer or generated by ANTLR that is visible at the point in the output the expression appears.

The position of a predicate within a production determines which type of predicate it is. For example, consider the following validating predicate (which appear at any non-left-edge position) that ensures an identifier is semantically a type name:

Validating predicates generate parser exceptions when they fail. The thrown exception is is of type SemanticException. You can catch this and other parser exceptions in an exception handler.

Disambiguating predicates are always the first element in a production because they cannot be hoisted over actions, token, or rule references. For example, the first production of the following rule has a disambiguating predicate that would be hoisted into the prediction expression for the first alternative:

If we restrict this grammar to LL(1), it is syntactically nondeterministic because of the common left-prefix: ID. However, the semantic predicate correctly provides additional information that disambiguates the parsing decision. The parsing logic would be:

ANTLR provides a guarded predicate to allow you to specify the lookahead context under which a predicate should be evaluated. The syntax is:

A guarded predicate is useful in situations where the semantic predicate should be hoisted into the prediction decision only when the lookahead is consistent with some context. For example:

Here, the predicate is only applicable when an ID is found on the input stream. It should not be evaluated when an INT is found.

Formally, in ANTLR 1.xx, semantic predicates represented the semantic context of a production. As such, the semantic AND syntactic context (lookahead) could be hoisted into other rules. In the public-domain version of ANTLR 2.00, predicates are not currently be hoisted outside of their enclosing rule. Consequently, rules such as:

are meaningless. On the other hand, this "semantic context" feature caused considerable confusion to many ANTLR 1.xx folks.

Syntactic Predicates

There are occasionally parsing decisions that cannot be rendered deterministic with finite lookahead. For example:

The common left-prefix renders these two productions nondeterministic in the LL(k) sense for any value of k. Clearly, these two productions can be left-factored into:

without changing the recognized language. However, when actions are embedded in grammars, left-factoring is not always possible. Further, left-factoring and other grammatical manipulations do not result in natural (readable) grammars.

The solution is simply to use arbitrary lookahead in the few cases where finite LL(k) for k>1 is insufficient. ANTLR allows you to specify a lookahead language with possibly infinite strings using the following syntax:

For example, consider the following rule that distinguishes between sets (comma-separated lists of words) and parallel assignments (one list assigned to another):

If a list followed by an assignment operator is found on the input stream, the first production is predicted. If not, the second alternative production is attempted.

Syntactic predicates are a form of selective backtracking and, therefore, actions are turned off while evaluating a syntactic predicate so that actions do not have to be undone.

Syntactic predicates are implemented using exceptions in the target language if they exist. When generating C code, longjmp would have to be used.

We could have chosen to simply use arbitrary lookahead for any non-LL(k) decision found in a grammar. However, making the arbitrary lookahead explicit in the grammar is useful because you don't have to guess what the parser will be doing. Most importantly, there are language constructs that are ambiguous for which there exists no deterministic grammar! For example, the infamous if-then-else construct has no LL(k) grammar for any k. The following grammar is ambiguous and, hence, nondeterministic:

Given a choice between two productions in a nondeterministic decision, we simply choose the first one. This works out well is most situations. Forcing this decision to use arbitrary lookahead would simply slow the parse down.

Fixed depth lookahead and syntactic predicates

ANTLR cannot be sure what lookahead can follow a syntactic predicate (the only logical possibility is whatever follows the alternative predicted by the predicate, but erroneous input and so on complicates this), hence, ANTLR assumes anything can follow. This situation is similar to the computation of lexical lookahead when it hits the end of the token rule definition.

Consider a predicate with a (...)* whose implicit exit branch forces a computation attempt on what follows the loop, which is the end of the syntactic predicate in this case.

The lookahead is artificially set to "any token" for the exit branch. Normally, the P and the "any token" would conflict, but ANTLR knows that what you mean is to match a bunch of P tokens if they are present--no warning is generated.

If more than one path can lead to the end of the predicate in any one decision, ANTLR will generate a warning. The following rule results in two warnings.

The empty alternative can indirectly be the start of the loop and, hence, conflicts with the P. Further, ANTLR detects the problem that two paths reach end of predicate. The resulting parser will compile but never terminate the (P|)* loop.

The situation is complicated by k>1 lookahead. When the nth lookahead depth reaches the end of the predicate, it records the fact and then code generation ignores the lookahead for that depth.

ANTLR generates a decision of the following form inside the (..)* of the predicate:

ANTLR 2.x.x Meta-Language Grammar

See antlr/antlr.g for the grammar that describes ANTLR 2.x.x input grammar syntax in ANTLR 2.x.x meta-language itself.