Compiler Architecture: GLR Parsing and Disambiguation

Today I want to talk about GLR parsing and the internals of the JS++ parser.

The Problem

In JS++, there is the potential for code to be “ambiguous”. For instance, consider the following example:

Foo<bar> baz;

There are two interpretations for the above statement:

1. A comparison operation: Foo is less than bar is greater than baz.
2. A variable declaration with type Foo<bar> (where Foo is a generic type with type argument bar)

Since JS++ is a superset of the JavaScript programming language, we would naturally expect the first case since JS++ inherited this from JavaScript. However, in order to achieve a concise syntax for generic types, we also need to consider how we can enable the second case.

The Language Design and Why It Matters

Other JavaScript supersets use an ActionScript-like syntax:

var baz : Foo<bar>;

I have mentioned in the past that JavaScript supersets have a preference for this syntax because it is easier to parse. However, I never specified why. I also have a certain disdain for this syntax because it involves more typing. var is unnecessary and redundant. You also have to unnecessarily reach for the shift key to type a colon. In C-style languages (where JavaScript inherits its syntactic style), variables are traditionally declared with their type. Compare the “colon syntax” with JS++ which uses a more traditional C-style syntax:

var foo : string = "abc"; // Other languages
string foo = "abc";         // JS++

In JS++, you type less and there’s no need to reach for the shift key to declare variables. It might seem like a trivial matter, but variable declarations will be used so many times throughout a program that it’s critical to get this detail right. It shows we cared when we designed JS++. The “colon syntax” has the benefit of not creating ambiguity; the drawback is that it degrades the developer experience.

In a nutshell: the colon syntax is easier for the compiler writers and harder for the developers, and the traditional C syntax is harder for the compiler writers and easier for the developers.

Basic Disambiguation

We can achieve basic disambiguation with the following rule: If Foo is declared as a generic class, we have a variable declaration; otherwise, we have a comparison operation.

Variable Declaration

class Foo<T> {}
Foo<bar> baz;

Comparison Expressions

Foo<bar> baz;

Notice the only difference between the two snippets is the presence of a generic class declaration. We use a symbol table (built during parse time) to determine whether Foo is a generic class. Furthermore, we use “unlimited lookahead” to determine if there is a need for disambiguation. Given these complexities, it should naturally follow that such a grammar cannot be achieved with parser generators like yacc/bison, which only support LL(1) and LALR(1) (one-token lookahead) context-free grammars.

However, this strategy has a drawback: it only works if the class was declared locally in the same file.

Imports and Cycles

JS++ is a multi-file, modular language. Therefore, a generic class can be declared in another file. It might seem like the solution would be obvious: build a symbol table for each individual file and import the symbol tables.

However, JS++ is tricky — not least because its intricate import system but because it allows cyclic imports. If a file A.jspp depends on B.jspp, and B.jspp in turn depends on A.jspp, how will we determine which to import first? Furthermore, we are faced with a troubling dilemma: if a generic class was declared in one of the dependencies, how do we parse? It means Foo<bar> baz; would be completely ambiguous.

In JS++, the user is not required to specify an import order; the compiler will figure out the order to process the input files by itself; unfortunately, this import order can only be figured out after the input files have been parsed (because JS++ imports based on identifiers and not file paths).

This was almost the straw that broke the camel’s back.

Disambiguation with GLR Parsing

To solve our problem, we introduced GLR parsing. When faced with an ambiguous parse (is it generic or is it a comparison?), we parse both… but only if it’s ambiguous.

No, we do not parse the file twice with both interpretations. We simply parse the specific expression or statement under both interpretations. Additionally, we only parse both interpretations if it’s ambiguous. It’s not always ambiguous because the symbol might be defined locally and we’ll know we have a generic rather than a comparison right away.

In a post-processing step (after all files have been parsed and all symbol tables have been built), we can disambiguate whether it’s a variable declaration or a comparison expression. For parsers that build the symbol table during parsing, it may not be possible to “complete” the symbol table until the post-processing step. This is because it’s not possible to determine if c in a<b> c should be added to the symbol table until disambiguation occurs.

Advantages and Disadvantages

To ensure we design correctly before beginning implementation, we have consultants that review our architecture and designs. The consultants we hire typically have a PhD in computer science, and the particular consultant that we discussed this with is a contributor to the gcc compiler. He described several pros and cons for GLR parsing:

Advantages

  • All valid syntax will be parsed eventually
  • No additional restrictions on the language syntax

Disadvantages

  • Performance penalties. Not only will you be required to parse both interpretations of the initial statement but also all expressions that depend on that statement. This can lead to “a whole wood of parse trees”.
  • Certain early optimizations in the parse tree require a second run because they are not decidable for both alternatives.

Fortunately, in the case of JS++, the disadvantages could not occur and we were able to take full advantage of GLR parsing.

Optimization

In JS++’s case, we were able to optimize the GLR parsing so that it would not need to be performed if it was unnecessary. By leveraging unlimited lookahead, we can determine quite a bit:

Foo < bar && baz;      // No GLR necessary
Foo < bar > baz;       // Yes
Foo < int > foo;     // No
// etc.

Furthermore, in JS++, unlimited lookahead and GLR parsing would only need to be performed until the semicolon.

Bonus Section: Type Casts, Instantiations, and Function Calls

If you are only interested in GLR parsing in general, this section may be redundant and you can safely skip it; however, if you want to better understand JS++ under the hood, this can be useful. There is another case in JS++ where GLR parsing may be required. I’m a traditionalist, and I like the C syntax. As a consequence, we use the C-style syntax for explicit type casts (rather than an as keyword and the like):

int foo = (int) bar;

Once again, this introduces ambiguities:

(foo)(bar);

In the above expression, are we calling a function foo with argument bar, or are we casting bar to type foo? Yet again, inheriting JavaScript’s syntax introduces complexity.

Fortunately, we are able to resolve the ambiguity in a similar way to variable declarations vs. comparison expressions. If foo is a type, it’s a type cast. In all other cases, it’s a function call.

This should have been satisfactory. However, we were concerned with instantiation too. Since JavaScript allows instantiation without arguments, JS++ inherited this:

new Foo; // Instantiate 'Foo' with no arguments

This can create an entire array of complexities:

new a;                // Instantiation
new (a);              // Instantiation
new a();              // Instantiation
new a(b);             // Instantiation
new a(b, c);          // Instantiation
new (a)();            // Instantiation
new (a)(b);           // Ambiguous
new (a)(b, c);        // Ambiguous
new (a)()();          // Instantiation. Valid code: function a(){ return function() {}; } new a()();
new (a)(b)();         // Ambiguous
new (a)(b, c)();      // Ambiguous
new (a)(b)(c);        // Ambiguous
new (a)(b, c)(d, e);  // Ambiguous
new (a)()(b);         // Instantiation

new ();       // Parse error
new ()a;      // Parse error
new ()(a);    // Parse error
new ()(a)(b); // Parse error

The ambiguities in the above examples were much more difficult to resolve. We couldn’t disambiguate based on whether or not a “type” was present because instantiation (a ‘new’ expression) requires a type to be specified just as much as a type cast expression would. How we parse also depends on the order of operations.

The simple cop out would be to require all instantiations to specify explicit arguments. However, I felt this was too restrictive. Since JavaScript’s “native global objects” can be instantiated, code like this is all too common:

var d = new Date;

This introduces a tricky predicament. We didn’t want to prevent developers with a JavaScript code base from using JS++ without running a converter. There are corner cases where conversion is required, but if the syntax is in popular usage, we couldn’t (and shouldn’t) make a breaking change. However, I noticed a strange quirk in JavaScript:

function a(){ }
new +a; // Uncaught SyntaxError: Unexpected token +

This happened for all the unary operators. Our lead engineer promptly looked up the ECMAScript specification, and it appears unary expressions are not allowed inside ‘new’ expressions. We were in luck. Problem solved. In JS++, as specified in the documentation, a type cast expression is a unary expression. Since we already disambiguated function calls vs. type casts, we had nothing else to worry about.

… And that’s how the type cast syntax came to JS++.

Conclusion

If done right, the GLR parsing strategy can be effective for introducing cleaner syntax. GLR parsing has its drawbacks and needs to be carefully considered with the rest of the language design and parser/compiler architecture. We hope this in-depth analysis provides greater insight to the internals of JS++ and enables you to better understand the nuances of the language.

Roger PoonRoger Poon
JS++ Designer and Project Lead. Follow me on Twitter or GitHub.