module ParseTree

Library functions for parse trees.

Usage

import ParseTree;

Dependencies

extend List;
extend Message;
extend Type;
import Node;
import Set;

Description

A concrete syntax tree or parse tree is an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar.

Most Rascal users will encounter parse trees in the form of concrete values. Expert users may find the detailed description here useful when writing generic functions on parse trees.

In Rascal parse trees, the interior nodes are labeled by rules of the grammar, while the leaf nodes are labeled by terminals (characters) of the grammar.

Tree is the universal parse tree data type in Rascal and can be used to represent parse trees for any language.

  • Tree is a subtype of the type node
  • All types (non-terminals) declared in syntax, lexical, layout and keyword definitions are sub-types of Tree.
  • All concrete syntax expressions produce parse trees with a type corresponding to a non-terminal.
  • Trees can be annotated in various ways. Most importantly, the \loc annotation always points to the source location of any (sub) parse tree.

Advanced users may want to create tools that analyze any parse tree, regardless of the syntax definition that generated it; such tools manipulate parse trees on the abstract level.

A parse tree is of type Tree, which uses the auxiliary types Production, Symbol, Condition, Attr, Associativity, and CharRange. Effectively, a parse tree is a nested tree structure of type Tree.

  • Most internal nodes are applications (appl) of a Production to a list of child Tree nodes. Production is the abstract representation of a rule in a syntax definition, which consists of the definition of an alternative for a Symbol by a list of Symbols.

  • The leaves of a parse tree are always characters (char), which have an integer index in the Unicode table; specifically, the integer value is the decimal codepoint. A Unicode codepoint stands for an abstract character: an atomic unit of textual data. The first 128 codepoints coincide with the ASCII table.

  • Some internal nodes encode ambiguity (amb) by pointing to a set of alternative Tree nodes.

The Production and Symbol types are an abstract notation for rules in syntax definitions, while the Tree type is the actual notation for parse trees.

Parse trees are called parse forests when they contain amb nodes.

You can analyze and manipulate parse trees in three ways:

  • Directly on the Tree level, just like any other algebraic data type (see the sketch after this list).
  • Using concrete syntax expressions and concrete syntax patterns.
  • Using disambiguation actions (parameters of the parse function)
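
As a small illustration of the first option, here is a sketch (not part of the library) of a generic function that works on any parse tree, whatever grammar produced it, by matching the Tree constructors directly; it counts the character leaves of a tree:

import ParseTree;

// Count the character leaves of any parse tree by matching the generic
// char constructor of Tree.
int countChars(Tree t) {
  int n = 0;
  visit (t) {
    case char(_): n += 1;
  }
  return n;
}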

The type of a parse tree is the symbol that its production produces, i.e. appl(prod(sort("A"),[],{}),[]) has type A. Each such non-terminal type has Tree as its immediate super-type.

Examples

rascal>import ParseTree;
ok
rascal>syntax A = "a";
ok

will make the following succeed:

rascal>t = parse(#A,"a");
A: (A) `a`
rascal>t := appl(
|1 >>>> prod(
|2 >>>> sort("A"),
|3 >>>> [lit("a")],
|4 >>>> {}),
|5 >>>> [appl(
|6 >>>> prod(
|7 >>>> lit("a"),
|8 >>>> [\char-class([range(97,97)])],
|9 >>>> {}),
|10 >>>> [char(97)])]);
bool: true

You see that the defined non-terminal A ends up as the production of the outermost node. Its only child is the tree for recognizing the literal a, which is defined to be a single a from the character class [ a ].

When we use labels in the definitions, they also end up in the trees. The following definition

rascal>import ParseTree;
ok
rascal>lexical B= myB:"b";
ok
rascal>lexical C = myC:"c" B bLabel;
ok

will make the following succeed:

rascal>t = parse(#C,"cb");
C: (C) `cb`
rascal>t := appl(
|1 >>>> prod(
|2 >>>> label(
|3 >>>> "myC",
|4 >>>> lex("C")),
|5 >>>> [
|6 >>>> lit("c"),
|7 >>>> label(
|8 >>>> "bLabel",
|9 >>>> lex("B"))
|10 >>>> ],
|11 >>> {}),
|12 >>> [appl(
|13 >>> prod(
|14 >>> lit("c"),
|15 >>> [\char-class([range(99,99)])],
|16 >>> {}),
|17 >>> [char(99)]),appl(
|18 >>> prod(
|19 >>> label(
|20 >>> "myB",
|21 >>> lex("B")),
|22 >>> [lit("b")],
|23 >>> {}),
|24 >>> [appl(
|25 >>> prod(
|26 >>> lit("b"),
|27 >>> [\char-class([range(98,98)])],
|28 >>> {}),
|29 >>> [char(98)])])]);
bool: true

Here you see that the alternative name becomes a label around the first argument of prod, while argument labels become labels in the list of children of a prod.

Benefits

  • Parse trees have all the necessary information in them for high-fidelity source code analysis and transformation
  • Parse trees contain full definitions of the grammar rules that they apply
  • Parse trees can always be "unparsed" to source text again

Pitfalls

  • For historical reasons the name of the annotation is "loc" and this interferes with the Rascal keyword loc for the type of source locations. Therefore the annotation name has to be escaped as \loc when it is declared or used.
  • We are in the process of deprecating the annotation @\loc in favor of the keyword field .src=|unknown:///|. Currently the run-time already uses .src, while the source code still uses @\loc (see the sketch below).
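
As a hedged illustration of this transition, both forms of retrieving a subtree's location are shown below; which one applies depends on the Rascal version and on how the tree was produced:

import ParseTree;

// Annotation form; note that the name must be escaped as \loc.
loc whereOld(Tree t) = t@\loc;

// Keyword-field form used by the current run-time.
loc whereNew(Tree t) = t.src;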

data Tree

The Tree data type as produced by the parser.

data Tree (loc parseError = |unknown:///|(0,0,<0,0>,<0,0>)) 
= appl(Production prod, list[Tree] args)
| cycle(Symbol symbol, int cycleLength)
| amb(set[Tree] alternatives)
| char(int character)
;

A Tree defines the trees normally found after parsing; additional constructors exist for exceptional cases:

  • ❶ Parse tree constructor when parse succeeded.
  • ❷ Cyclic parsetree.
  • ❸ Ambiguous subtree.
  • ❹ A single character.
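
For example, a minimal sketch (not part of the library) that checks whether a tree contains an ambiguity cluster can be written as a deep match on the amb constructor:

import ParseTree;

// True iff the tree contains at least one ambiguity cluster anywhere.
bool hasAmb(Tree t) = /amb(_) := t;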

data Production

Production in ParseTrees

data Production  
= prod(Symbol def, list[Symbol] symbols, set[Attr] attributes)
| regular(Symbol def)
;

The type Production is introduced in Type, see Production. Here we extend it with the symbols that can occur in a ParseTree. We also extend productions with basic combinators that allow constructing ordered and unordered compositions, and associativity groups.

  • ❶ A prod is a rule of a grammar, with a defined non-terminal, a list of terminal and/or non-terminal symbols and a possibly empty set of attributes.

  • ❷ A regular is a regular expression, i.e. a repeated construct.

  • priority means operator precedence, where the order of the list indicates the binding strength of each rule;

  • assoc means all alternatives are acceptable, but nested on the declared side;

  • reference means a reference to another production rule which should be substituted there, for extending priority chains and such.

  • error means a node produced by error recovery.

  • skipped means characters skipped during error recovery, always the last child of an appl with an error production.

data Production

data Production  
= \priority(Symbol def, list[Production] choices)
| \associativity(Symbol def, Associativity \assoc, set[Production] alternatives)
| \reference(Symbol def, str cons)
;

data Production

data Production  
= \error(Symbol def, Production prod, int dot)
| \skipped(Symbol def)
;
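
A minimal sketch (not part of the library) of a generic check for error-recovery nodes, using the \error constructor declared above:

import ParseTree;

// True iff the tree contains a node produced by error recovery.
bool containsErrors(Tree t) = /appl(\error(_, _, _), _) := t;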

data RuntimeException

A special exception that wraps errors that are (almost) certainly caused by unexpected parse errors

data RuntimeException  
= ParseErrorRecovery(RuntimeException trigger, loc src)
;

Certain operations will always succeed on regular parse trees but will fail when the parse tree is an error Tree resulting from error recovery. A typical example is t.someField, where t is an error node and someField is a field that would normally be present in the tree but comes after the dot and is therefore missing. Normally a NoSuchField("someField") exception would be thrown, but because this problem is caused by a parse error, the original exception is wrapped like this: ParseErrorRecovery(NoSuchField("someField"), t.src).

Pitfalls

It is advised to try/catch these exceptions high up in the call graph of your language processor; otherwise you will have to write try/catch in many different places.
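
A sketch of what such a catch-all might look like (processTree and t are hypothetical names, not part of the library):

import IO;

try {
  processTree(t);
}
catch ParseErrorRecovery(RuntimeException original, loc where): {
  println("skipping tree with a parse error at <where>: <original>");
}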

data Attr

Attributes in productions.

data Attr  
= \bracket()
| \assoc(Associativity \assoc)
;

An Attr (attribute) documents additional semantics of a production rule. Neither tags nor brackets are processed by the parser generator; rather, downstream processors are activated by these. Associativity is a parser generator feature, though.

data Associativity

Associativity attribute.

data Associativity  
= \left()
| \right()
| \assoc()
| \non-assoc()
;

Associativity defines the various kinds of associativity of a specific production.

data CharRange

Character ranges and character class

data CharRange  
= range(int begin, int end)
;
  • CharRange defines a range of characters.
  • A CharClass consists of a list of characters ranges.

alias CharClass

list[CharRange]

data Symbol

Symbols that can occur in a ParseTree

data Symbol  
= \start(Symbol symbol)
;

The type Symbol is introduced in Type, see Symbol, to represent the basic Rascal types, e.g., int, list, and rel. Here we extend it with the symbols that may occur in a ParseTree.

  • ❶ The start symbol wraps any symbol to indicate that it is a start symbol of the grammar and may occur at the root of a parse tree.
  • ❷ Context-free non-terminal
  • ❸ Lexical non-terminal
  • ❹ Layout symbols
  • ❺ Terminal symbols that are keywords
  • ❻ Parameterized context-free non-terminal
  • ❼ Parameterized lexical non-terminal
  • ❽ Terminal.
  • ❾ Case-insensitive terminal.
  • ❶⓿ Character class
  • ❶❶ Empty symbol
  • ❶❷ Optional symbol
  • ❶❸ List of one or more symbols without separators
  • ❶❹ List of zero or more symbols without separators
  • ❶❺ List of one or more symbols with separators
  • ❶❻ List of zero or more symbols with separators
  • ❶❼ Alternative of symbols
  • ❶❽ Sequence of symbols
  • ❶❾ Conditional occurrence of a symbol.

data Symbol

data Symbol  
= \sort(str name)
| \lex(str name)
| \layouts(str name)
| \keywords(str name)
| \parameterized-sort(str name, list[Symbol] parameters)
| \parameterized-lex(str name, list[Symbol] parameters)
;

data Symbol

data Symbol  
= \lit(str string)
| \cilit(str string)
| \char-class(list[CharRange] ranges)
;

data Symbol

data Symbol  
= \empty()
| \opt(Symbol symbol)
| \iter(Symbol symbol)
| \iter-star(Symbol symbol)
| \iter-seps(Symbol symbol, list[Symbol] separators)
| \iter-star-seps(Symbol symbol, list[Symbol] separators)
| \alt(set[Symbol] alternatives)
| \seq(list[Symbol] symbols)
;
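
As an illustration, the concrete notation {Exp ","}+ in a context-free rule is represented with these symbols roughly as follows (the layout non-terminal name "$default$" is an assumption borrowed from the loadParsers example further down):

// Separated non-empty list of Exp, with layout positions around the separator.
Symbol commaSepExps = \iter-seps(sort("Exp"),
  [layouts("$default$"), lit(","), layouts("$default$")]);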

data Symbol

data Symbol  
= \conditional(Symbol symbol, set[Condition] conditions)
;

function subtype

bool subtype(Symbol::\sort(_), Symbol::\adt("Tree", _))

data Condition

Datatype for declaring preconditions and postconditions on symbols

data Condition  
= \follow(Symbol symbol)
| \not-follow(Symbol symbol)
| \precede(Symbol symbol)
| \not-precede(Symbol symbol)
| \delete(Symbol symbol)
| \at-column(int column)
| \begin-of-line()
| \end-of-line()
| \except(str label)
;

A Condition can be attached to a symbol; it restricts the applicability of that symbol while parsing input text. For instance, follow requires that it is followed by another symbol and at-column requires that it occurs at a certain position in the current line of the input text.
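
For example, the follow restriction !>> in the (hypothetical) lexical rule below is represented as a conditional symbol carrying a \not-follow condition on the character class [a-z]:

// Longest match: an Id may not be directly followed by another lower-case letter.
lexical Id = [a-z]+ !>> [a-z];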

function priority

Nested priority is flattened.

Production priority(Symbol s, [*Production a, priority(Symbol _, list[Production] b), *Production c])

function associativity

Normalization of associativity.

Production associativity(Symbol s, Associativity as, {*Production a, choice(Symbol t, set[Production] b)})

Production associativity(Symbol rhs, Associativity a, {associativity(rhs, Associativity b, set[Production] alts), *Production rest})

Production associativity(Symbol s, Associativity as, {*Production a, priority(Symbol t, list[Production] b)})
  • The Choice constructor under associativity is flattened.
  • Nested (equal) associativity is flattened.
  • Priority under an associativity group defaults to choice.

function parse

Parse input text (from a string or a location) and return a parse tree.

&T<:Tree parse(type[&T<:Tree] begin, str input, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false, set[Tree(Tree)] filters={})

&T<:Tree parse(type[&T<:Tree] begin, str input, loc origin, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false, set[Tree(Tree)] filters={})

&T<:Tree parse(type[&T<:Tree] begin, loc input, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false, set[Tree(Tree)] filters={})
  • Parse a string and return a parse tree.
  • Parse a string and return a parse tree, origin defines the original location of the input.
  • Parse the contents of resource input and return a parse tree.

The parse either throws ParseError exceptions or returns parse trees of type Tree.

The allowAmbiguity flag dictates the behavior of the parser in case of ambiguity. When allowAmbiguity=true the parser will construct ambiguity clusters (local sets of alternative parse trees where the input string is ambiguous). If it is false, the parser will throw an Ambiguous exception instead, comparable to a ParseError exception. The latter option terminates much faster: always in cubic time, and always linear in the size of the intermediate parse graph, while constructing ambiguous parse forests may grow to O(n^(p+1)), where p is the length of the longest production rule and n is the length of the input.

The maxAmbDepth parameter is used to limit the depth of ambiguity clusters. If the depth of a cluster exceeds this value, a random alternative is chosen and the rest of the alternatives are ignored. So, for instance, when maxAmbDepth is set to 1, the first level of ambiguities is produced normally, but no ambiguities within ambiguities will be produced. This feature is primarily useful when using error recovery (see below), as error recovery tends to generate many nested ambiguities. Without limiting the depth of ambiguities, the parse forests generated by error recovery can easily become too large to process.

The allowRecovery flag can be set to true to enable error recovery. This is an experimental feature. When error recovery is enabled, the parser will attempt to recover from parse errors and continue parsing. If successful, a parse tree with error and skipped productions is returned (see the definition of Production above). The util::ErrorRecovery module contains a number of functions to analyze trees with errors, for example hasErrors, getSkipped, and getErrorText. Note that the resulting parse forest can contain many ambiguities; any code that processes error trees must be aware of this, since for instance a simple traversal of all subtrees will be too expensive in most cases. disambiguateParseErrors can be used to efficiently prune the forest and leave a tree with a single (or even zero) errors based on simple heuristics, but these heuristics are somewhat arbitrary so the usability of this function is limited.
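
A hedged sketch of using error recovery (it assumes the Exp grammar from the examples below; whether recovery actually succeeds depends on the input and the grammar):

import ParseTree;
import IO;

t = parse(#Exp, "2+", allowRecovery=true);
// If recovery succeeded, the tree contains error productions.
if (/appl(\error(_, _, _), _) := t) {
  println("recovered from a parse error");
}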

The filters set contains functions which may be called optionally after the parse algorithm has finished and just before the Tree representation is built. The set contains alternative functions; only one of them is successfully applied to each node in a tree. If such a function fails to apply, the other ones are tried. There is no fixed-point computation, so composed filters must be added to the set of filters programmatically. Post-parse filtering is best done at this stage and not later on the Tree representation, for efficiency reasons: the size of the parse graph before Tree construction is still cubic due to "binarized" sharing of intermediate nodes, while after Tree construction the forest may reach a size in O(n^(p+1)), where n is the length of the input and p is the length of the longest syntax rule. Filtering using the filters parameter, on the other hand, may very well cut the forest quickly down to even a linear size and result in an efficient overall parsing algorithm.

The hasSideEffects flag is normally set to false. When the filter functions have side-effects to remove ambiguity, this option must be set to true to ensure correct behavior. A typical side-effect of filter functions is the construction of a symbol table and the removal (see the filter statement) of syntax trees which refer to undefined symbols; in such a case hasSideEffects must be set to true for correctness' sake. If it is set to false, the algorithm assumes tree construction is context-free and it can memoize the results of shared intermediate graph nodes. The tree construction algorithm is effectively always worst-case polynomial in O(n^(p+1)) -- p being the length of the longest syntax rule -- when hasSideEffects is true, but may be linear when it is set to false. So this is quite an important flag to consider.

Examples

rascal>lexical Number = [0-9]+;
ok
rascal>syntax Exp
|1 >>>> = Number
|2 >>>> | left Exp "+" Exp
|3 >>>> ;
ok
rascal>import ParseTree;
ok

Seeing that parse returns a parse tree:

rascal>parse(#Exp, "2+3");
Exp: (Exp) `2+3`

Catching a parse error:

rascal>import IO;
ok
rascal>try {
|1 >>>> Exp e = parse(#Exp, "2@3");
|2 >>>>}
|3 >>>>catch ParseError(loc l): {
|4 >>>> println("I found a parse error at line <l.begin.line>, column <l.begin.column>");
|5 >>>>}
I found a parse error at line 1, column 1
ok

function parser

Generates a parser from an input grammar.

&T (value input, loc origin) parser(type[&T] grammar, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false, set[Tree(Tree)] filters={})

This builtin function wraps the Rascal parser generator by transforming a grammar into a parsing function.

The resulting parsing function has the following overloaded signature:

  • Tree parse(str input, loc origin);
  • Tree parse(loc input, loc origin);

So the parse function reads either directly from a str or via the contents of a loc. It also takes an origin parameter, which becomes the prefix of the src fields of the resulting tree.

The parse function behaves differently depending on the given keyword parameters:

  • allowAmbiguity: if true, no exception is thrown in case of ambiguity and a parse forest is returned. If false, the parser throws an exception during tree building and reports only the first ambiguous subtree in its message. When set to false, the parser constructs trees in linear time; when set to true, it constructs trees in polynomial time.
  • maxAmbDepth: limits the maximum depth of ambiguity clusters. For instance, when set to 1, no ambiguities within ambiguities are produced. When set to 0, all ambiguity clusters are removed. When set to -1, no ambiguity filtering takes place. Note that ambiguity pruning is done by dropping all but the first alternative (which is more or less random but repeatable).
  • allowRecovery: experimental. If true, the parser tries to recover when it encounters a parse error. If a parse error is encountered that can be recovered from, special error and skipped productions are included in the resulting parse tree. More documentation will be added here when this feature matures. Note that if allowRecovery is set to true, the resulting tree can still contain ambiguity nodes related to recovered parse errors, even if allowAmbiguity is set to false. When a regular (non-error) ambiguity is found, an exception is still thrown in this case.
  • maxRecoveryAttempts: experimental. The maximum number of recovery attempts to make when a parse error is encountered. Only relevant when allowRecovery is set to true.
  • maxRecoveryTokens: experimental. The maximum number of tokens considered on each recovery attempt. Only relevant when allowRecovery is set to true.
  • hasSideEffects: if false, the parser is a lot faster when constructing trees, since it does not execute the parse actions in an interpreted environment to make side effects (like a symbol table) and it can therefore share more intermediate results.
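
A sketch of typical usage (assuming the Exp grammar from the parse examples above): generate the parser once, then call it many times.

import ParseTree;

exprParser = parser(#Exp);
e1 = exprParser("2+3", |unknown:///|);                         // parse a string
e2 = exprParser(|home:///example.exp|, |home:///example.exp|); // parse the contents of a (hypothetical) file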

function firstAmbiguityFinder

Generates a parser function that can be used to find the left-most deepest ambiguous sub-sentence.

Tree (value input, loc origin) firstAmbiguityFinder(type[Tree] grammar, bool hasSideEffects=false, set[Tree(Tree)] filters={})

Benefits

  • Instead of trying to build a polynomially sized parse forest, this function only builds the smallest part of the tree that exhibits ambiguity. This can be done very quickly, while the whole forest could take minutes to hours to construct.
  • Use this function for ambiguity diagnostics and regression testing for ambiguity.

Pitfalls

  • The returned sub-tree usually has a different type than the parameter of the type[] symbol that was passed in. The reason is that sub-trees typically have a different non-terminal than the start non-terminal of a grammar.

function parsers

Generates parsers from a grammar (reified type), where all non-terminals in the grammar can be used as start-symbol.

&U (type[&U] nonterminal, value input, loc origin) parsers(type[&T] grammar, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false,  set[Tree(Tree)] filters={})

This parser generator behaves the same as the parser function, but it produces parser functions which have an additional nonterminal parameter. This can be used to select a specific non-terminal from the grammar to use as start-symbol for parsing.
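
A sketch of usage (assuming the Exp and Number non-terminals from the parse examples above): one generated parser family serves several start symbols.

import ParseTree;

p = parsers(#Exp);
Exp e = p(#Exp, "2+3", |unknown:///|);
Number n = p(#Number, "42", |unknown:///|);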

function firstAmbiguityFinders

Generates a parser function that can be used to find the left-most deepest ambiguous sub-sentence.

Tree (type[Tree] nonterminal, value input, loc origin) firstAmbiguityFinders(type[Tree] grammar, bool hasSideEffects=false,  set[Tree(Tree)] filters={})

Benefits

  • Instead of trying to build a polynomially sized parse forest, this function only builds the smallest part of the tree that exhibits ambiguity. This can be done very quickly, while the whole forest could take minutes to hours to construct.
  • Use this function for ambiguity diagnostics and regression testing for ambiguity.

Pitfalls

  • The returned sub-tree usually has a different type than the parameter of the type[] symbol that was passed in. The reason is that sub-trees typically have a different non-terminal than the start non-terminal of a grammar.

function firstAmbiguity

Parse the input but instead of returning the entire tree, return the trees for the first ambiguous substring.

Tree firstAmbiguity(type[Tree] begin, str input)

Tree firstAmbiguity(type[Tree] begin, loc input)

This function is similar to the Parse function in its functionality. However, in case of serious ambiguity parse could be very slow. This function is much faster, because it does not try to construct an entire forest and thus avoids the cost of constructing nested ambiguity clusters.

If the input sentence is not ambiguous after all, simply the entire tree is returned.
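
A sketch (assuming the ambiguous grammar syntax E = E "+" E | "e"; from the storeParsers example below): firstAmbiguity builds only the smallest ambiguous cluster.

import ParseTree;

// Returns an amb(...) node for the first ambiguous substring of "e+e+e",
// instead of a complete (and potentially huge) parse forest.
Tree a = firstAmbiguity(#E, "e+e+e");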

function storeParsers

Generate a parser and store it in serialized form for later reuse.

void storeParsers(type[Tree] grammar, loc saveLocation)

The stored parsers can be loaded again later using Load Parsers.

For any concrete grammar, a.k.a. reified syntax type, g it holds that:

  • after storeParsers(g, file);
  • then p = loadParsers(file);
  • and given h = parsers(g);
  • then for all valid nt, input and origin: p(nt, input, origin) == h(nt, input, origin)

In other words, a loaded parser function behaves exactly as a freshly generated parser for the same grammar, if (and only if) it was stored for the same grammar value.

Benefits

  • Storing parsers caches the work of reifying a grammar and generating a parser from that grammar.
  • Stored parsers are nice for deployment scenarios where the language is fixed and efficiency is appreciated.

Pitfalls

  • caching parsers with storeParsers is your responsibility; the cache is not cleaned automatically when the grammar changes.

function loadParsers

Load a previously serialized parser from disk for usage

&U (type[&U] nonterminal, value input, loc origin) loadParsers(loc savedParsers, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false, set[Tree(Tree)] filters={})

For any concrete grammar, a.k.a. reified syntax type, g it holds that:

  • after storeParsers(g, file);
  • then p = loadParsers(file);
  • and given h = parsers(g);
  • then for all valid nt, input and origin: p(nt, input, origin) == h(nt, input, origin)

In other words, a loaded parser function behaves exactly as a freshly generated parser for the same grammar, if (and only if) it was stored for the same grammar value.

Examples

First we store a parser:

rascal>import ParseTree;
ok
rascal>syntax E = E "+" E | "e";
ok
rascal>storeParsers(#E, |memory://test-tmp/E.parsers|)
ok

Here we show that a new shell does not even know about the grammar:

rascal>#E
|prompt:///|(1,1,<1,1>,<1,2>): Undeclared type: E
Advice: |https://www.rascal-mpl.org/docs/Rascal/Errors/CompileTimeErrors/UndeclaredType|

Then in a next run, we load the parser and use it:

rascal>import ParseTree;
ok
rascal>p = loadParsers(|memory://test-tmp/E.parsers|);
&U (type[&U], value, loc): function(|std:///ParseTree.rsc|(33071,2,<634,262>,<634,264>))
rascal>p(type(sort("E"), ()), "e+e", |src:///|);
&U<:Tree: appl(
prod(
sort("E"),
[
sort("E"),
layouts("$default$"),
lit("+"),
layouts("$default$"),
sort("E")
],
{}),
[appl(
prod(
sort("E"),
[lit("e")],
{}),
[appl(
prod(
lit("e"),
[\char-class([range(101,101)])],
{}),
[char(101)])],
src=|src:///|(0,1,<1,0>,<1,1>)),appl(
prod(
layouts("$default$"),
[],
{}),
[],
src=|src:///|(1,0,<1,1>,<1,1>)),appl(
prod(
lit("+"),
[\char-class([range(43,43)])],
{}),
[char(43)]),appl(
prod(
layouts("$default$"),
[],
{}),
[],
src=|src:///|(2,0,<1,2>,<1,2>)),appl(
prod(
sort("E"),
[lit("e")],
{}),
[appl(
prod(
lit("e"),
[\char-class([range(101,101)])],
{}),
[char(101)])],
src=|src:///|(2,1,<1,2>,<1,3>))],
src=|src:///|(0,3,<1,0>,<1,3>))

Benefits

  • Loaded parsers can be used immediately, without the need to load and execute a parser generator.

Pitfalls

  • Reifying types (use of #) triggers the loading of a parser generator anyway. To avoid that computation, write the type value directly, e.g. type(\start(sort("MySort")), ()) instead of #start[MySort].

function loadParser

Load a previously serialized parser, for a specific non-terminal, from disk for usage

&U (value input, loc origin) loadParser(type[&U] nonterminal, loc savedParsers, bool allowAmbiguity=false, int maxAmbDepth=2, bool allowRecovery=false, int maxRecoveryAttempts=30, int maxRecoveryTokens=3, bool hasSideEffects=false, set[Tree(Tree)] filters={})

This loader behaves just like Load Parsers, except that the resulting parser function is already bound to a specific non-terminal.
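
A sketch of usage (assuming the grammar E is declared locally and the E.parsers file stored in the example above):

import ParseTree;

eParser = loadParser(#E, |memory://test-tmp/E.parsers|);
t = eParser("e+e", |src:///|);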

function unparse

Yield the string of characters that form the leafs of the given parse tree.

str unparse(Tree tree)

unparse is the inverse function of Parse, i.e., for every syntactically correct string TXT of type S, the following holds:

unparse(parse(#S, TXT)) == TXT

Examples

rascal>syntax Exp = Exp "+" Exp | Number;
ok
rascal>lexical Number = [0-9]+;
ok
rascal>import ParseTree;
ok

First parse an expression, this results in a parse tree. Then unparse this parse tree:

rascal>unparse(parse(#Exp, "2+3"));
str: "2+3"

function printSymbol

str printSymbol(Symbol sym, bool withLayout)

function implode

Implode a parse tree according to a given (ADT) type.

&T<:value implode(type[&T<:value] t, Tree tree)

Given a grammar for a language, its sentences can be parsed and the result is a parse tree (or more precisely a value of type Tree). For many applications this is sufficient and the results are achieved by traversing and matching them using concrete patterns.

In other cases, the further processing of parse trees is better done in a more abstract form. The abstract syntax for a language is a data type that is used to represent programs in the language in an abstract form. Abstract syntax has the following properties:

  • It is "abstract" in the sense that it does not contain textual details such as parentheses, layout, and the like.
  • While a language has one grammar (also known as, concrete syntax) it may have several abstract syntaxes for different purposes: type analysis, code generation, etc.

The function implode bridges the gap between parse tree and abstract syntax tree. Given a parse tree and a Rascal type it traverses them simultaneously and constructs an abstract syntax tree (a value of the given type) as follows:

  • Literals, layout and empty (i.e. ()) nodes are skipped.

  • Regular */+ lists are imploded to lists or sets depending on what is expected in the ADT.

  • Ambiguities are imploded to sets.

  • If the expected type is str the tree is unparsed into a string. This happens for both lexical and context-free parse trees.

  • If a tree's production has no label and a single AST (i.e. non-layout, non-literal) argument (for instance, an injection), the tree node is skipped, and implosion continues with the lone argument. The same applies to bracket productions, even if they are labeled.

  • If a tree's production has no label, but more than one argument, the tree is imploded to a tuple (provided this conforms to the ADT).

  • Optionals are imploded to booleans if this is expected in the ADT. This also works for optional literals, as shown in the example below.

  • An optional is imploded to a list with zero or one argument, iff a list type is expected.

  • If the argument of an optional tree has a production with no label, containing a single list, then this list is spliced into the optional list.

  • For trees with (cons-)labeled productions, the corresponding constructor in the ADT corresponding to the non-terminal of the production is found in order to make the AST.

  • If the provided type is node, (cons-)labeled trees will be imploded to untyped nodes. This means that any subtrees below it will be untyped nodes (if there is a label), tuples of nodes (if a label is absent), and strings for lexicals.

  • Unlabeled lexicals are imploded to str, int, real, bool depending on the expected type in the ADT. To implode lexical into types other than str, the PDB parse functions for integers and doubles are used. Boolean lexicals should match "true" or "false". NB: lexicals are imploded this way, even if they are ambiguous.

  • If a lexical tree has a cons label, the tree is imploded to a constructor with that name and a single string-valued argument containing the tree's yield.

An IllegalArgument exception is thrown if during implosion a tree is encountered that cannot be imploded to the expected type in the ADT. As explained above, this function assumes that the ADT type names correspond to syntax non-terminal names, and constructor names correspond to production labels. Labels of production arguments do not have to match with labels in ADT constructors.

Finally, source location fields are propagated as keyword fields on constructor ASTs. To access them, the user is required to explicitly declare a keyword field on all ADTs used in implosion. In other words, for every ADT type T, add:

data T(loc location=|unknown:///|);

Examples

Here are some examples for the above rules.

  • Example for rule 5. Given the grammar
syntax IDTYPE = Id ":" Type;
syntax Decls = decls: "declare" {IDTYPE ","}* ";";

Decls will be imploded as:

data Decls = decls(list[tuple[str,Type]]);

(assuming Id is a lexical non-terminal).

  • Example for rule 6. Given the grammar
syntax Formal = formal: "VAR"? {Id ","}+ ":" Type;

The corresponding ADT could be:

data Formal = formal(bool, list[str], Type);
  • Example for rule 8. Given the grammar
syntax Tag = "[" {Modifier ","}* "]";
syntax Decl = decl: Tag? Signature Body;

In this case, a Decl is imploded into the following ADT:

data Decl = decl(list[Modifier], Signature, Body);  
  • Example for rule 9. Given the grammar
syntax Exp = left add: Exp "+" Exp;

Can be imploded into:

data Exp = add(Exp, Exp);
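
  • One more (hedged) illustration, for the rule on unlabeled lexicals: when the expected type is a basic type such as int, the lexical's yield is converted.

lexical Number = [0-9]+;

// The unlabeled lexical Number implodes to an int because that is what is expected.
int n = implode(#int, parse(#Number, "42")); // n == 42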

data TreeSearchResult

Tree search result type for treeAt.

data TreeSearchResult[&T<:Tree]  
= treeFound(&T tree)
| treeNotFound()
;

function treeAt

Select the innermost Tree of a given type which is enclosed by a given location.

TreeSearchResult[&T<:Tree] treeAt(type[&T<:Tree] t, loc l, Tree a:appl(_, _))

default TreeSearchResult[&T<:Tree] treeAt(type[&T<:Tree] t, loc l, Tree root)
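
A sketch of usage (t and cursor are hypothetical; t must have been parsed from a location so that its subtrees carry src locations):

import ParseTree;
import IO;

if (treeFound(Exp e) := treeAt(#Exp, cursor, t)) {
  println("innermost Exp at the cursor: <e>");
}
else {
  println("no Exp encloses the cursor");
}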

function sameType

bool sameType(label(_,Symbol s),Symbol t)

bool sameType(Symbol s,label(_,Symbol t))

bool sameType(Symbol s,conditional(Symbol t,_))

bool sameType(conditional(Symbol s,_), Symbol t)

bool sameType(Symbol s, s)

default bool sameType(Symbol s, Symbol t)

function isNonTerminalType

Determine if the given type is a non-terminal type.

bool isNonTerminalType(Symbol::\sort(str _))

bool isNonTerminalType(Symbol::\lex(str _))

bool isNonTerminalType(Symbol::\layouts(str _))

bool isNonTerminalType(Symbol::\keywords(str _))

bool isNonTerminalType(Symbol::\parameterized-sort(str _, list[Symbol] _))

bool isNonTerminalType(Symbol::\parameterized-lex(str _, list[Symbol] _))

bool isNonTerminalType(Symbol::\start(Symbol s))

default bool isNonTerminalType(Symbol s)

alias NewLineChar

[\n]

function reposition

Re-compute and overwrite origin locations for all sub-trees of a Tree

&T <: Tree reposition(
&T <: Tree tree,
loc file = tree@\loc.top,
bool \markStart = true,
bool \markSyntax = true,
bool \markLexical = true,
bool \markSubLexical = true,
bool \markRegular = true,
bool \markLayout = true,
bool \markSubLayout = true,
bool \markLit = false,
bool \markSubLit = false,
bool \markAmb = false,
bool \markCycle = false,
bool \markChar = false
)

This function takes a Tree and overwrites the old \loc annotations of every subtree with fresh locations. The new locations are as-if the file was parsed again from the unparsed result: the locations describe the left-to-right order of the sub-trees again exactly, and they are all from the same top-level location (read "file").

Typically, with the default options, this algorithm changes nothing in a Tree which has just been produced by the parser: it rebuilds the tree and recomputes the exact locations as they were originally. However, there are many reasons why the (location) fields in a Tree may no longer be what they were just after parsing:

  1. subtrees may have been removed
  2. subtrees may have been relocated to different parts of the tree;
  3. subtrees may have been introduced from other source files
  4. subtrees may have been introduced from concrete syntax expressions in Rascal code.
  5. other algorithms may have added more keyword fields, for example fully resolved qualified names, resolved types, error messages or future computations (closures).
  6. location fields themselves may have been lost accidentally when rewriting trees with visit
  7. etc.

Some downstream algorithms (e.g. Hi Fi Layout Diff) require source locations to be consistent with the current actual position of every source tree. Reposition provides this contract. Even if one of the above transformations has happened, after Reposition every node has an accurate position with respect to the hypothetical file contents that would be generated if the tree is unparsed (written to a string or a file).

Next to this feature, Reposition may add locations to Tree nodes which were not annotated initially by the Parser: layout nodes, literal nodes, and sub-lexical nodes. Some algorithms on parse trees (like formatting) require more detailed location information than provided by the Parser:

  • markLexical=true, ensures the sub-structure of lexicals is annotated as well.
  • markLayout=true, ensures annotating layout nodes and their sub-structure as well.
  • markLit=true, ensures literal trees and case-insensitive literal trees are annotated as well.
  • markAmb=true, ensures ambiguity nodes are annotated. NB: the sub-structure of a cluster is always annotated according to the other flags.
  • etc.; every kind of node has a "mark" flag for completeness' sake.

Finally, Reposition can be used to remove superfluous locations from Tree nodes. Every node which originally had a position will lose it unless Reposition is configured to recompute it.

By default Reposition simulates the behavior of a Parser exactly. Reparsing the yield of a tree should always produce the exact same locations as Reposition does.
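
A minimal sketch (editedTree is a hypothetical tree that still carries a top-level location): recompute all locations after a transformation; the yield itself is untouched.

import ParseTree;

Tree relocated = reposition(editedTree);
assert unparse(relocated) == unparse(editedTree);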

Benefits

  • Unlike reparsing, Reposition will maintain all other keyword parameters of Tree nodes, like resolved qualified names and type attributes.
  • Can be used to erase superfluous annotations for memory efficiency, while keeping the essential ones.
  • The default mark options simulate the behavior of Parser functions.