MAGELANG!

ANTLR Tree Construction

ANTLR helps you build intermediate form trees, or abstract syntax trees (ASTs), by providing grammar annotations that indicate what tokens are to be treated at subtree roots, which are to be leaves, and which are to be ignored with respect to tree construction. In a future version, you will also be able to specify tree rewrite rules. As with 1.33, you may manipulate trees using grammar actions.

It is often the case that programmers either have existing tree definitions or need a special physical structure, thus, preventing ANTLR from specifically defining the implementation of AST nodes. ANTLR specifies only an interface describing minimum behavior. Your tree implementation must implement this interface so ANTLR knows how to work with your trees. Further, you must tell the parser the name of your tree nodes or provide a tree "factory" so that ANTLR knows how to create nodes with the correct type (rather than hardcoding in a new AST() expression everywhere).

Notation

In this and other documents, tree structures are represented by a LISP-like notation, for example:

#(A B C)

is a tree with A at the root, and children B and C. This notation can be nested to describe trees of arbitrary structure, for example:

#(A B #(C D E))

is a tree with A at the root, B as a first child, and an entire subtree as the second child. The subtree, in turn, has C at the root and D,E as children.

Controlling AST construction

AST construction in an ANTLR Parser, or AST transformation in a Tree-Parser, is turned on and off by the buildAST option.

Grammar annotations for building ASTs

Leaf nodes

ANTLR assumes that any nonsuffixed token reference or token-range is a leaf node in the resulting tree for the enclosing rule. If no suffixes at all are specified in a grammar, then a Parser will construct a linked-list of the tokens (a degenerate AST), and a Tree-Parser will copy the input AST.

Root nodes

Any token suffixed with the "^" operator is considered a root token. A tree node is constructed for that token and is made the root of whatever portion of the tree has been built

a : A B^ C^ ;

results in tree #(C #(B A)).

First A is matched and made a lonely child, followed by B which is made the parent of the current tree, A. Finally, C is matched and made the parent of the current tree, making it the parent of the B node. Note that the same rule without any operators results in the flat tree A B C.

Turning off standard tree construction

Suffix a token reference with "!" to prevent incorporation of the node for that token into the resulting tree (the AST node for the token is still constructed and may be referenced in actions, it is just not added to the result tree automatically). Suffix a rule reference "!" to indicate that the tree constructed by the invoked rule should not be linked into the tree constructed for the current rule.

Suffix a rule definition with "!" to indicate that tree construction for the rule is to be turned off. Rules and tokens referenced within that rule still create ASTs, but they are not linked into a result tree. The following rule does no automatic tree construction. Actions must be used to set the return AST value, for example:

begin!
    :   INT PLUS i:INT
        { #begin = #(PLUS INT i); }
    ;

For finer granularity, prefix alternatives with "!" to shut off tree construction for that alternative only. This granularity is useful, for example, if you have a large number of alternatives and you only want one to have manual tree construction:

stat:   
        ID EQUALS^ expr   // automatic construction
    ... some alternatives ...
    |!  RETURN expr
        { #stat = #([IMAGINARY_TOKEN_TYPE] expr); }
    ... more alternatives ...
    ;

Tree node construction

With automatic tree construction off (but with buildAST on), you must construct your own tree nodes and combine them into tree structures within embedded actions. There are several ways to create a tree node in an action:

use new T(arg) where T is your tree node type and arg is either a single token type, a token type and token text, or a Token.
use ASTFactory.create(arg) where T is your tree node type and arg is either a single token type, a token type and token text, or a Token. Using the factory is more general than creating a new node directly, as it defers the node-type decision to the factory, and can be easily changed for the entire grammar.
use the shorthand notation #[TYPE] or #[TYPE,"text"]. The shorthand notation results in a call to ASTFactory.create().
use the shorthand notation #id, where id is either a token matched in the rule, a label, or a rule-reference.

To construct a tree structure from a set of nodes, you can set the first-child and next-sibling references yourself or call the factory make method or use #(...) notation described below.

AST Action Translation

In parsers and tree parsers with buildAST set to true, ANTLR will translate portions of user actions in order to make it easier to build ASTs within actions. In particular, the following constructs starting with '#' will be translated:

#label

The AST associated with a labeled token-reference or rule-reference may be accessed as #label. The translation is to a variable containing the AST node built from that token, or the AST returned from the rule.

#rule

When rule is the name of the enclosing rule, ANTLR will translate this into the variable containing the result AST for the rule. This allows you to set the return AST for a rule or examine it from within an action. This can be used when AST generation is on or suppressed for the rule or alternate. For example:

r! : a:A { #r = #a; }

Setting the return tree is very useful in combination with normal tree construction because you can have ANTLR do all the work of building a tree and then add an imaginary root node such as:

decl : ( TYPE ID )+
       { #decl = #([DECL,"decl"], #decl); }
     ;

ANTLR allows you to assign to #rule anywhere within an alternative of the rule. ANTLR ensures that references of and assignments to #rule within an action force the parser's internal AST construction variables into a stable state. After you assign to #rule, the state of the parser's automatic AST construction variables will be set as if ANTLR had generated the tree rooted at #rule. For example, any children nodes added after the action will be added to the children of #rule.

#label_in

In a tree parser, the input AST associated with a labeled token reference or rule reference may be accessed as #label_in. The translation is to a variable containing the input-tree AST node from which the rule or token was extracted. Input variables are seldom used. You almost always want to use #label instead of #label_in.

#id

ANTLR supports the translation of unlabeled token references and rule references as a shorthand notation, as long as the token or rule name is unique within the scope of a single alternative. In these cases, the use of an unlabeled token reference or rule reference is identical to using a label. For example, this:


r! : A { #r = #A; }

is equivalent to:


r! : a:A { #r = #a; }

#id_in is given similar treatment to #label_in.

#[TOKEN_TYPE] or #[TOKEN_TYPE,"text"]

AST node constructor shorthand. The translation is a call to the ASTFactory.create() method. For example, #[T] is translated to:

ASFFactory.create(T)

#(root, c1, ..., cn)

AST tree construction shorthand. ANTLR looks for the comma character to separate the tree arguments. Commas within method call tree elements are handled properly; i.e., an element of "foo(#a,34)" is ok and will not conflict with the comma separator between the other tree elements in the tree. This tree construct is translated to a "make tree" call. The "make-tree" call is complex due to the need to simulate variable arguments in languages like Java, but the result will be something like:

ASTFactory.make(root, c1, ..., cn);

In addition to the translation of the #(...) as a whole, the root and each child c1..cn will be translated. Within the context of a #(...) construct, you may use:

id or label as a shorthand for #id or #label.
[...] as a shorthand for #[...].
(...) as a shorthand for #(...).

The target code generator performs this translation with the help of a special lexer that parses the actions and asks the code-generator to create appropriate substitutions for each translated item.

Invoking parsers that build trees

Assuming that you have defined a lexer L and a parser P in your grammar, you can invoke them sequentially on the system input stream as follows.

L lexer = new L(System.in);
P parser = new P(lexer);
parser.setASTNodeType("MyAST"); // if special AST node
parser.startRule();

If you have set buildAST=true in your parser grammar, then it will build an AST, which can be accessed via parser.getAST(). If you have defined a tree parser called T, you can invoke it with:

T walker = new T();
walker.startRule(parser.getAST());    // walk tree

If, in addition, you have set buildAST=true in your tree-parser to turn on transform mode, then you can access the resulting AST of the tree-walker:

AST results = walker.getAST();
DumpASTVisitor visitor = new DumpASTVisitor();
visitor.visit(results);

Where DumpASTVisitor is a predefined ASTVisitor implementation that simply prints the tree to the standard output.

You can also use get a LISP-like print out of a tree via

String s = parser.getAST().toStringList();

AST Factories

ANTLR uses a factory pattern to create and connect AST nodes. This is done to primarily to separate out the tree construction facility from the parser, but also gives you a hook in between the parser and the tree node construction. Subclass ASTFactory to alter the create methods.

If you are only interested in specifying the AST node type at runtime, use the

setASTNodeType(String className)

method on the parser or factory. By default, trees are constructed of nodes of type CommonAST.

The ASTFactory has some generically useful methods:

/** Copy a single node. clone() is not used because
 * we want to return an AST not a plain object...type
 * safety issue. Further, we want to have all AST node
 * creation go through the factory so creation can be
 * tracked. Returns null if t is null.
 */
public AST dup(AST t);

/** Duplicate tree including siblings of root. */
public AST dupList(AST t);

/**Duplicate a tree, assuming this is a root node
 * of a tree--duplicate that node and what's below;
 * ignore siblings of root node.
 */
public AST dupTree(AST t);

AST enumerations

The AST findAll and findAllPartial methods return enumerations of tree nodes that you can walk. Interface

antlr.collections.ASTEnumeration

and

class antlr.Collections.impl.ASTEnumerator

implement this functionality. Here is an example:

// Print out all instances of a-subtree-of-interest
// found within tree 't'.
ASTEnumeration enum;
enum = t.findAll(a-subtree-of-interest);
while ( enum.hasMoreNodes() ) {
  System.out.println(
    enum.nextNode().toStringList()
  );
}

A few examples


sum :term ( PLUS^ term)*
    ;

The "^" suffix on the PLUS tells ANTLR to create an additional node and place it as the root of whatever subtree has been constructed up until that point for rule sum. The subtrees returned by the term references are collected as children of the addition nodes. If the subrule is not matched, the associated nodes would not be added to the tree. The rule returns either the tree matched for the first term reference or a PLUS-rooted tree.

The grammar annotations should be viewed as operators, not static specifications. In the above example, each iteration of the (...)* will create a new PLUS root, with the previous tree on the left, and the tree from the new term on the right, thus preserving the usual associatively for "+".

Look at the following rule that turns off default tree construction.

decl!: 
    modifiers type ID SEMI;
	{ #decl = #([DECL], ID, ([TYPE] type),
                    ([MOD] modifiers) ); }
    ;

In this example, a declaration is matched. The resulting AST has an "imaginary" DECL node at the root, with three children. The first child is the ID of the declaration. The second child is a subtree with an imaginary TYPE node at the root and the AST from the type rule as its child. The third child is a subtree with an imaginary MOD at the root and the results of the modifiers rule as its child.

Labeled subrules

[THIS WILL NOT BE IMPLEMENTED AS LABELED SUBRULES...We'll do something else eventually.]

In 2.00 ANTLR, each rule has exactly one tree associated with it. Subrules simply add elements to the tree for the enclosing rule, which is normally what you want. For example, expression trees are easily built via:


expr: ID ( PLUS^ ID )*
    ;

However, many times you want the elements of a subrule to produce a tree that is independent of the rule's tree. Recall that exponents must be computed before coefficients are multiplied in for exponent terms. The following grammar matches the correct syntax.


// match exponent terms such as "3*x^4"
eterm
    :   expr MULT ID EXPONENT expr
    ;

However, to produce the correct AST, you would normally split the ID EXPONENT expr portion into another rule like this:


eterm:
    expr MULT^ exp
    ;

exp:
	ID EXPONENT^ expr
    ;

In this manner, each operator would be the root of the appropriate subrule. For input 3*x^4, the tree would look like:


#(MULT 3 #(EXPONENT ID 4))

However, if you attempted to keep this grammar in the same rule:


eterm
    :   expr MULT^ (ID EXPONENT^ expr)
    ;

both "^" root operators would modify the same tree yielding


#(EXPONENT #(MULT 3 ID) 4)

This tree has the operators as roots, but they are associated with the wrong operands.

Using a labeled subrule allows the original rule to generate the correct tree.


eterm
    :   expr MULT^ e:(ID EXPONENT^ expr)
    ;

In this case, for the same input 3*x^4, the labeled subrule would build up its own subtree and make it the operand of the MULT tree of the eterm rule. The presence of the label alters the AST code generation for the elements within the subrule, making it operate more like a normal rule. Annotations of "^" make the node created for that token reference the root of the tree for the e subrule.

Labeled subrules have a result AST that can be accessed just like the result AST for a rule. For example, we could rewrite the above decl example using labeled subrules (note the use of ! at the start of the subrules to suppress automatic construction for the subrule):


decl!: 
    m:(! modifiers { #m = #([MOD] modifiers); } )
    t:(! type { #t = #([TYPE] type); } )
    ID
    SEMI;
    { #decl = #( [DECL] ID t m ); }
    ;

What about subrules that are closure loops? The same rules apply to a closure subrule--there is a single tree for that loop that is built up according to the AST operators annotating the elements of that loop. For example, consider the following rule.


term:   T^ i:(OP^ expr)+
    ;

For input T OP A OP B OP C, the following tree structure would be created:


#(T #(OP #(OP #(OP A) B) C) )

which can be drawn graphically as


T
|
OP
|
OP--C
|
OP--B
|
A

The first important thing to note is that each iteration of the loop in the subrule operates on the same tree. The resulting tree, after all iterations of the loop, is associated with the subrule label. The result tree for the above labeled subrule is:


#(OP #(OP #(OP A) B) C)

The second thing to note is that, because T is matched first and there is a root operator after it in the rule, T would be at the bottom of the tree if it were not for the label on the subrule.

Loops will generally be used to build up lists of subtree. For example, if you want a list of polynomial assignments to produce a sibling list of ASSIGN subtrees, then the following rule you would normally split into two rules.


interp
    :   ( ID ASSIGN poly ";" )+
    ;

Normally, the following would be required


interp
    :   ( assign )+
    ;
assign
    :   ID ASSIGN^ poly ";"!
    ;

Labeling a subrule allows you to write the above example more easily as:


interp
    :   ( r:(ID ASSIGN^ poly ";") )+
    ;

Each recognition of a subrule results in a tree and if the subrule is nested in a loop, all trees are returned as a list of trees (i.e., the roots of the subtrees are siblings). If the labeled subrule is suffixed with a "!", then the tree(s) created by the subrule are not linked into the tree for the enclosing rule or subrule.

Labeled subrules within labeled subrules result in trees that are linked into the surrounding subrule's tree. For example, the following rule results in a tree of the form X #( A #(B C) D) Y.


a   :   X r:( A^ s:(B^ C) D) Y
    ;

Labeled subrules within nonlabeled subrules result in trees that are linked into the surrounding rule's tree. For example, the following rule results in a tree of the form #(A X #(B C) D Y).


a   :   X ( A^ s:(B^ C) D) Y
    ;

Reference nodes

Not implemented. A node that does nothing but refer to another node in the tree. Nice for embedding the same tree in multiple lists.

Required AST functionality and form

The data structure representing your trees can have any form or type name as long as they implement the AST interface:

package antlr.collections;

/** Minimal AST node interface used by ANTLR
 *  AST generation and tree-walker.
 */
public interface AST {
    /** Get the token type for this node */
    public int getType();

    /** Set the token type for this node */
    public void setType(int ttype);

    /** Get the token text for this node */
    public String getText();

    /** Set the token text for this node */
    public void setText(String text);

    /** Get the first child of this node;
     *  null if no children */
    public AST getFirstChild();

    /** Set the first child of a node */
    public void setFirstChild(AST c);

    /** Get the next sibling in line after this one */
    public AST getNextSibling();

    /** Set the next sibling after this one */
    public void setNextSibling(AST n);

    /** Add a (rightmost) child to this node */
    public void addChild(AST node);

// NEW FOR 2.20:

    /** Are two nodes exactly equal? */
    public boolean equals(AST t);

    /** Are two lists of nodes/subtrees exactly equal
     *  in structure and content? */
    public boolean equalsList(AST t);

    /** Are two lists of nodes/subtrees partially equal?
     *  In other words, 'this' can be bigger than 't' */
    public boolean equalsListPartial(AST t);

    /** Are two nodes/subtrees exactly equal? */
    public boolean equalsTree(AST t);

    /** Are two nodes/subtrees exactly partially equal?
     *  In other words, 'this' can be bigger than 't'.*/
    public boolean equalsTreePartial(AST t);

    /** Return an enumeration of all exact tree matches
     *  for tree within 'this'. */
    public ASTEnumeration findAll(AST tree);

    /** Return an enumeration of all partial tree
     *  matches for tree within 'this'. */
    public ASTEnumeration findAllPartial(AST subtree);

    /** Init a node with token type and text */
    public void initialize(int t, String txt);

    /** Init a node using content from 't' */
    public void initialize(AST t);

    /** Init a node using content from 't' */
    public void initialize(Token t);

    /** Convert node to printable form */
    public String toString();

    /** Treat 'this' as list (i.e., consider 'this'
     *  siblings) and convert to printable form */
    public String toStringList();

    /** Treat 'this' as tree root (i.e., don't consider
     *  'this' siblings) and convert to printable
     *  form */
    public String toStringTree();
}

This scheme does not preclude the use of heterogeneous trees versus homogeneous trees. However, you will need to write extra code to create heterogeneous trees (via a subclass of ASTFactory), whereas the homogeneous trees are free.