MAGELANG!

Home

Grammar Inheritance

Object-oriented programming languages such as C++ and Java allow you to define a new object as it differs from an existing object, which provides a number of benefits. "Programming by difference" saves development/testing time and future changes to the base or superclass are automatically propogated to the derived or subclass.

Introduction and motivation

Allowing the ANTLR programmer to define a new grammar as it differs from an existing grammar provides significant benefits. Development time goes down because the programmer only has to specify the rules that are different or that need to be added. Further, when the base grammar changes, all derived grammars will automatically reflect the change. Grammar inheritance is also an interesting way to change the behavior of an existing grammar. A rule or set of rules can be respecified with the same structure, but with different actions.

The most obvious use of grammar inheritance involves describing multiple dialects of the same language. Previous solutions would require multiple grammar versions or a single grammar that recognized all dialects at once (using semantics to constrain the input to a single dialect). With grammar inheritance, one could write a base grammar for the common components and then have a derived grammar for each dialect. Code sharing would occur at the grammar and output parser class level.

Consider a simple subset of English:

class PrimarySchoolEnglish;

sentence
    :   subject predicate
    ;
subject
    :   NOUN
    ;
predicate
    :   VERB
    ;

This grammar recognizes sentences like: Dilbert speaks.

To extend this grammar to include sentences manageable by most American college students, we might add direct objects to the definition of a sentence. Rather than copying and modifying the PrimarySchoolEnglish grammar, we can simply extend it:


class AmericanCollegeEnglish extends
        PrimarySchoolEnglish;

sentence
    :   subject predicate object
    ;
object
    :   PREPOSITION ARTICLE NOUN
    ;

This grammar describes sentences such as Dilbert speaks to a dog. While this looks trivial to implement (just add the appropriate extends clause in Java to the output parser class), it involves grammar analysis to preserve grammatical correctness. For example, to generate correct code, ANTLR needs to pull in the base grammar and modify it according to the overridden rules. To see this, consider the following grammar for a simple language:

class Simple;

stat:   expr ASSIGN expr
    |   SEMICOLON
    ;

expr:   ID
    ;

Clearly, the ID token is the lookahead set that predicts the recognition of the first alternative of stat. Now, examine a derived dialect of Simple:

class Derived extends Simple;

expr:   ID
    |   INT
    ;  

In this case, { ID, INT } predicts the first alternative of stat. Unfortunately, a derived grammar affects the recognition of rules inherited from the base grammar! ANTLR must not only override expr in Derived, but it must override stat.

Determinining which rules in the base grammar are affected is not easy, so our implementation  simply makes a copy of the base grammar and generates a whole new parser with the appropriate modifications. From the programmer's perspective, code/grammar sharing would have occurred, however, from an implementation perspective a copy of the base grammar would be made.

Functionality

Grammar Derived inherits from Grammar Base all of the rules, options, and actions of Base including formal/actual rule parameters and rule actions. Derived may override any option or rule and specify new options, rules, and member action. The subgrammar does not inherit actions outside of classes or file options. Consider rule Base defined as:

class Base extends Parser;
options {
  k = 2;
}
{
  int count = 0;
}
a : A B {an-action}
  | A C
  ;
c : C
  ;

A new grammar may be derived as follows:

class Derived extends Base;
options {
  k = 3;        // need more lookahead; override
  buildAST=true;// add an option
}
{
  int size = 0; // override; no 'count' def here
}
a : A B {an-action}
  | A C {an-extra-action}
  | Z           // add an alt to rule a
  ;
b : a
  | A B D       // requires LL(3)
  ;

ANTLR will actually interpret the subgrammar as if you had typed:

class Derived extends Parser;
options {
        k=3;
        buildAST=true;
}
{
  int size = 0; // override Base action
}
a : A B {an-action}
  | A C {an-extra-action}
  | Z           // add an alt to rule a
  ;

b : a
  | A B D       // requires LL(3)
  ;

// inherited from grammar Base
c : C
  ;

Rules may be overridden to change their signatures such as their parameters or return types:

class Base extends Parser;
a[int x] returns [int y]
  : A
  ;

class Derived extends Base;
a[float z]
  : A
  ;

ANTLR will generate a warning, however:

warning: rule Derived.a has different signature than Base.a

Because of this ability, the subgrammars do not actually inherit, in the Java-sense, from the supergrammar.  Different signatures on the generated methods would prevent the parser from compiling.

Where Are Those Supergrammars?

The set of potential "supergrammars" available to some grammar P includes any other grammar in the same file as P and any listed on the ANTLR command line with

-glib f1.g;f2.g

where the files must include path names if they are located in another directory.

How is supergrammar P found? The grammars defined in the supergrammar list are read in and an inheritance hierarchy is constructed; any repeated grammar definition in this is ignored.  The grammars in the normally specified grammar file are also included in the hierarchy.  Incomplete hierarchies results in an error message from ANTLR.   Grammars in the same file as P are given precendence to those obtained from other files.

The type of grammar (Lexer,Parser,TreeParser) is determined by the type of the highest grammar in the inheritance chain.

Error Messages

ANTLR generates a file called expandedT.g, given a grammar input file (not the -glib files) called T.g.  All error messages are relative to this as you really want to see the whole grammar when dealing with ambiguities etc...  In the future, we may have a better solution.