{"id":157,"date":"2016-11-16T01:01:07","date_gmt":"2016-11-16T01:01:07","guid":{"rendered":"http:\/\/www.onux.com\/jspp\/blog\/?p=157"},"modified":"2021-12-27T01:22:58","modified_gmt":"2021-12-27T01:22:58","slug":"glr-parsing-and-disambiguation","status":"publish","type":"post","link":"https:\/\/www.onux.com\/jspp\/blog\/glr-parsing-and-disambiguation\/","title":{"rendered":"Compiler Architecture: GLR Parsing and Disambiguation"},"content":{"rendered":"<p>Today I want to talk about GLR parsing and the internals of the JS++ parser.<\/p>\n<h3>The Problem<\/h3>\n<p>In JS++, there is the potential for code to be &#8220;ambiguous&#8221;. For instance, consider the following example:<\/p>\n<pre class=\"brush: jspp\">\r\nFoo&lt;bar&gt; baz;\r\n<\/pre>\n<p>There are two interpretations for the above statement:<\/p>\n<p>1. A comparison operation: <code>Foo<\/code> is less than <code>bar<\/code> is greater than <code>baz<\/code>.<br \/>\n2. A variable declaration with type <code>Foo&lt;bar&gt;<\/code> (where <code>Foo<\/code> is a generic type with type argument <code>bar<\/code>)<\/p>\n<p>Since JS++ is a superset of the JavaScript programming language, we would naturally expect the first case since JS++ inherited this from JavaScript. However, in order to achieve a concise syntax for generic types, we also need to consider how we can enable the second case.<\/p>\n<p><!--more--><\/p>\n<h3>The Language Design and Why It Matters<\/h3>\n<p>Other JavaScript supersets use an ActionScript-like syntax:<\/p>\n<pre class=\"brush: js\">\r\nvar baz : Foo&lt;bar&gt;;\r\n<\/pre>\n<p>I have <a href=\"https:\/\/sdtimes.com\/the-case-for-js-plus-plus\/\" target=\"_blank\" rel=\"noopener noreferrer\">mentioned in the past<\/a> that JavaScript supersets have a preference for this syntax because it is easier to parse. However, I never specified why. I also have a certain disdain for this syntax because it involves more typing. <code>var<\/code> is unnecessary and redundant. You also have to unnecessarily reach for the shift key to type a colon. In C-style languages (where JavaScript inherits its syntactic style), variables are traditionally declared with their type. Compare the &#8220;colon syntax&#8221; with JS++ which uses a more traditional C-style syntax:<\/p>\n<pre class=\"brush: jspp\">\r\nvar foo : string = \"abc\"; \/\/ Other languages\r\nstring foo = \"abc\";         \/\/ JS++\r\n<\/pre>\n<p>In JS++, you type less and there&#8217;s no need to reach for the shift key to declare variables. It might seem like a trivial matter, but variable declarations will be used so many times throughout a program that it&#8217;s critical to get this detail right. It shows we cared when we designed JS++. The &#8220;colon syntax&#8221; has the benefit of not creating ambiguity; the drawback is that it degrades the developer experience.<\/p>\n<p>In a nutshell: the colon syntax is easier for the compiler writers and harder for the developers, and the traditional C syntax is harder for the compiler writers and easier for the developers.<\/p>\n<h3>Basic Disambiguation<\/h3>\n<p>We can achieve basic disambiguation with the following rule: If <code>Foo<\/code> is declared as a generic class, we have a variable declaration; otherwise, we have a comparison operation.<\/p>\n<p><strong>Variable Declaration<\/strong><\/p>\n<pre class=\"brush: jspp\">\r\nclass Foo&lt;T&gt; {}\r\nFoo&lt;bar&gt; baz;\r\n<\/pre>\n<p><strong>Comparison Expressions<\/strong><\/p>\n<pre class=\"brush: jspp\">\r\nFoo&lt;bar&gt; baz;\r\n<\/pre>\n<p>Notice the only difference between the two snippets is the presence of a generic class declaration. We use a symbol table (built during parse time) to determine whether <code>Foo<\/code> is a generic class. Furthermore, we use &#8220;unlimited lookahead&#8221; to determine if there is a need for disambiguation. Given these complexities, it should naturally follow that such a grammar cannot be achieved with parser generators like yacc\/bison, which only support LL(1) and LALR(1) (one-token lookahead) <em>context-free<\/em> grammars.<\/p>\n<p>However, this strategy has a drawback: it only works if the class was declared locally in the same file.<\/p>\n<h3>Imports and Cycles<\/h3>\n<p>JS++ is a multi-file, modular language. Therefore, a generic class can be declared in another file. It might <em>seem<\/em> like the solution would be obvious: build a symbol table for each individual file and import the symbol tables.<\/p>\n<p>However, JS++ is tricky &mdash; not least because its intricate import system but because it allows <em>cyclic imports<\/em>. If a file A.jspp depends on B.jspp, and B.jspp in turn depends on A.jspp, how will we determine which to import first? Furthermore, we are faced with a troubling dilemma: if a generic class was declared in one of the dependencies, how do we parse? It means <code>Foo&lt;bar&gt; baz;<\/code> would be completely ambiguous.<\/p>\n<p>In JS++, the user is <strong>not<\/strong> required to specify an import order; the compiler will figure out the order to process the input files by itself; unfortunately, this import order can only be figured out <em>after<\/em> the input files have been parsed (because JS++ imports based on identifiers and not file paths).<\/p>\n<p>This was almost the straw that broke the camel&#8217;s back.<\/p>\n<h3>Disambiguation with GLR Parsing<\/h3>\n<p>To solve our problem, we introduced GLR parsing. When faced with an ambiguous parse (is it generic or is it a comparison?), we parse both&#8230; but only if it&#8217;s ambiguous.<\/p>\n<p>No, we do not parse the file twice with both interpretations. We simply parse the specific expression or statement under both interpretations. Additionally, we only parse both interpretations if it&#8217;s ambiguous. It&#8217;s not always ambiguous because the symbol might be defined locally and we&#8217;ll know we have a generic rather than a comparison right away.<\/p>\n<p>In a post-processing step (after all files have been parsed and all symbol tables have been built), we can disambiguate whether it&#8217;s a variable declaration or a comparison expression. For parsers that build the symbol table during parsing, it may not be possible to &#8220;complete&#8221; the symbol table until the post-processing step. This is because it&#8217;s not possible to determine if <code>c<\/code> in <code>a&lt;b&gt; c<\/code> should be added to the symbol table until disambiguation occurs.<\/p>\n<h3>Advantages and Disadvantages<\/h3>\n<p>To ensure we design correctly before beginning implementation, we have consultants that review our architecture and designs. The consultants we hire typically have a PhD in computer science, and the particular consultant that we discussed this with is a contributor to the gcc compiler. He described several pros and cons for GLR parsing:<\/p>\n<p><strong>Advantages<\/strong><\/p>\n<ul>\n<li>All valid syntax will be parsed eventually<\/li>\n<li>No additional restrictions on the language syntax<\/li>\n<\/ul>\n<p><strong>Disadvantages<\/strong><\/p>\n<ul>\n<li>Performance penalties. Not only will you be required to parse both interpretations of the initial statement but also all expressions that depend on that statement. This can lead to &#8220;a whole wood of parse trees&#8221;.<\/li>\n<li>Certain early optimizations in the parse tree require a second run because they are not decidable for both alternatives.<\/li>\n<\/ul>\n<p>Fortunately, in the case of JS++, the disadvantages could not occur and we were able to take full advantage of GLR parsing.<\/p>\n<h3>Optimization<\/h3>\n<p>In JS++&#8217;s case, we were able to optimize the GLR parsing so that it would not need to be performed if it was unnecessary. By leveraging unlimited lookahead, we can determine quite a bit:<\/p>\n<pre class=\"brush: jspp\">\r\nFoo < bar &#038;&#038; baz;      \/\/ No GLR necessary\r\nFoo < bar > baz;       \/\/ Yes\r\nFoo < int > foo;     \/\/ No\r\n\/\/ etc.\r\n<\/pre>\n<p>Furthermore, in JS++, unlimited lookahead and GLR parsing would only need to be performed until the semicolon.<\/p>\n<h3>Bonus Section: Type Casts, Instantiations, and Function Calls<\/h3>\n<p>If you are only interested in GLR parsing in general, this section may be redundant and you can safely skip it; however, if you want to better understand JS++ under the hood, this can be useful. There is another case in JS++ where GLR parsing may be required. I&#8217;m a traditionalist, and I like the C syntax. As a consequence, we use the C-style syntax for explicit type casts (rather than an <code>as<\/code> keyword and the like):<\/p>\n<pre class=\"brush: jspp\">\r\nint foo = (int) bar;\r\n<\/pre>\n<p>Once again, this introduces ambiguities:<\/p>\n<pre class=\"brush: jspp\">\r\n(foo)(bar);\r\n<\/pre>\n<p>In the above expression, are we calling a function <code>foo<\/code> with argument <code>bar<\/code>, or are we casting <code>bar<\/code> to type <code>foo<\/code>? Yet again, inheriting JavaScript&#8217;s syntax introduces complexity.<\/p>\n<p>Fortunately, we are able to resolve the ambiguity in a similar way to variable declarations vs. comparison expressions. If <code>foo<\/code> is a type, it&#8217;s a type cast. In all other cases, it&#8217;s a function call.<\/p>\n<p>This should have been satisfactory. However, we were concerned with instantiation too. Since JavaScript allows instantiation without arguments, JS++ inherited this:<\/p>\n<pre class=\"brush: jspp\">\r\nnew Foo; \/\/ Instantiate 'Foo' with no arguments\r\n<\/pre>\n<p>This can create an entire array of complexities:<\/p>\n<pre class=\"brush: jspp\">\r\nnew a;                \/\/ Instantiation\r\nnew (a);              \/\/ Instantiation\r\nnew a();              \/\/ Instantiation\r\nnew a(b);             \/\/ Instantiation\r\nnew a(b, c);          \/\/ Instantiation\r\nnew (a)();            \/\/ Instantiation\r\nnew (a)(b);           \/\/ Ambiguous\r\nnew (a)(b, c);        \/\/ Ambiguous\r\nnew (a)()();          \/\/ Instantiation. Valid code: function a(){ return function() {}; } new a()();\r\nnew (a)(b)();         \/\/ Ambiguous\r\nnew (a)(b, c)();      \/\/ Ambiguous\r\nnew (a)(b)(c);        \/\/ Ambiguous\r\nnew (a)(b, c)(d, e);  \/\/ Ambiguous\r\nnew (a)()(b);         \/\/ Instantiation\r\n\r\nnew ();       \/\/ Parse error\r\nnew ()a;      \/\/ Parse error\r\nnew ()(a);    \/\/ Parse error\r\nnew ()(a)(b); \/\/ Parse error\r\n<\/pre>\n<p>The ambiguities in the above examples were much more difficult to resolve. We couldn&#8217;t disambiguate based on whether or not a &#8220;type&#8221; was present because instantiation (a &#8216;new&#8217; expression) requires a type to be specified just as much as a type cast expression would. How we parse also depends on the <a href=\"https:\/\/docs.onux.com\/en-US\/Developers\/JavaScript-PP\/Language-Guide\/operator-precedence\" target=\"_blank\" rel=\"noopener noreferrer\">order of operations<\/a>.<\/p>\n<p>The simple cop out would be to require all instantiations to specify explicit arguments. However, I felt this was too restrictive. Since JavaScript&#8217;s &#8220;native global objects&#8221; can be instantiated, code like this is all too common:<\/p>\n<pre class=\"brush: jspp\">\r\nvar d = new Date;\r\n<\/pre>\n<p>This introduces a tricky predicament. We didn&#8217;t want to prevent developers with a JavaScript code base from using JS++ without running a converter. There are corner cases where conversion is required, but if the syntax is in <em>popular usage<\/em>, we couldn&#8217;t (and shouldn&#8217;t) make a breaking change. However, I noticed a strange quirk in JavaScript:<\/p>\n<pre class=\"brush: js\">\r\nfunction a(){ }\r\nnew +a; \/\/ Uncaught SyntaxError: Unexpected token +\r\n<\/pre>\n<p>This happened for all the unary operators. Our lead engineer promptly looked up the ECMAScript specification, and it appears unary expressions are not allowed inside &#8216;new&#8217; expressions. We were in luck. Problem solved. In JS++, as specified in the documentation, a type cast expression is a unary expression. Since we already disambiguated function calls vs. type casts, we had nothing else to worry about.<\/p>\n<p>&#8230; And that&#8217;s how the type cast syntax came to JS++.<\/p>\n<h3>Conclusion<\/h3>\n<p>If done right, the GLR parsing strategy can be effective for introducing cleaner syntax. GLR parsing has its drawbacks and needs to be carefully considered with the rest of the language design and parser\/compiler architecture. We hope this in-depth analysis provides greater insight to the internals of JS++ and enables you to better understand the nuances of the language.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Today I want to talk about GLR parsing and the internals of the JS++ parser. The Problem In JS++, there is the potential for code to be &#8220;ambiguous&#8221;. For instance, consider the following example: Foo&lt;bar&gt; baz; There are two interpretations for the above statement: 1. A comparison operation: Foo is less than bar is greater &hellip; <a href=\"https:\/\/www.onux.com\/jspp\/blog\/glr-parsing-and-disambiguation\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Compiler Architecture: GLR Parsing and Disambiguation&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[3,2,9],"tags":[6,5,8,7],"_links":{"self":[{"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/posts\/157"}],"collection":[{"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/comments?post=157"}],"version-history":[{"count":26,"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/posts\/157\/revisions"}],"predecessor-version":[{"id":987,"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/posts\/157\/revisions\/987"}],"wp:attachment":[{"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/media?parent=157"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/categories?post=157"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.onux.com\/jspp\/blog\/wp-json\/wp\/v2\/tags?post=157"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}