Robust Parsing
Synopsis
For many tasks it is important to be able to work with inputs that contain parse errors. Robust parsing can be of great help in those cases. Using a Pico examples we will show how to parse input that contains parse errors.
Examples
Most parse functions support the allowRecovery keyword parameter. By setting this parameter to true, the parser will attempt to
recover when a pare error is encountered. If this recovery succeeds, a parse forest is returned that contains error subtrees:
rascal>import demo::lang::Pico::Syntax;
ok
rascal>import ParseTree;
ok
rascal>import vis::Text;
ok
rascal>import IO;
ok
rascal>import util::ParseErrorRecovery;
ok
rascal>
rascal>Program prg = parse(#Program, "begin declare x : natural, y : natural; x ;= 1; y := x + 5 end", allowRecovery=true);
Program: (Program) `begin declare x : natural, y : natural; x ;= 1; y := x + 5 end`
rascal>
rascal>println(prettyTree(prg));
❖
├─ Program = program:
│ ├─ Declarations = "declare" {Declaration ","}* decls ";"
│ │ └─ {Declaration ","}*
│ │ ├─ Declaration = decl:
│ │ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ │ ├─ x
│ │ │ │ └─ [0-9a-z]*
│ │ │ └─ Type = natural:
│ │ └─ Declaration = decl:
│ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ ├─ y
│ │ │ └─ [0-9a-z]*
│ │ └─ Type = natural:
│ └─ {Statement ";"}*
│ ├─ ❖
│ │ ├─ {Statement ";"}*
│ │ │ └─ !error dot=2: Statement = asgStat:
│ │ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ │ ├─ x
│ │ │ │ └─ [0-9a-z]*
│ │ │ └─ skipped
│ │ │ ├─ ;
│ │ │ ├─ =
│ │ │ ├─ ⎵
│ │ │ └─ 1
│ │ └─ {Statement ";"}*
│ │ ├─ !error dot=1: Statement = asgStat:
│ │ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ │ ├─ x
│ │ │ │ └─ [0-9a-z]*
│ │ │ └─ skipped
│ │ │ └─ ⎵
│ │ └─ !error dot=0: Statement = asgStat:
│ │ └─ skipped
│ │ ├─ =
│ │ ├─ ⎵
│ │ └─ 1
│ └─ Statement = asgStat:
│ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ ├─ y
│ │ └─ [0-9a-z]*
│ └─ Expression = add:
│ ├─ Expression = id:
│ │ └─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ ├─ x
│ │ └─ [0-9a-z]*
│ └─ Expression = natCon:
│ └─ Natural = [0-9]+
│ └─ [0-9]+
│ └─ 5
├─ !error dot=4: Program = program:
│ ├─ Declarations = "declare" {Declaration ","}* decls ";"
│ │ └─ {Declaration ","}*
│ │ ├─ Declaration = decl:
│ │ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ │ ├─ x
│ │ │ │ └─ [0-9a-z]*
│ │ │ └─ Type = natural:
│ │ └─ Declaration = decl:
│ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ ├─ y
│ │ │ └─ [0-9a-z]*
│ │ └─ Type = natural:
│ └─ skipped
│ ├─ x
│ ├─ ⎵
│ ├─ ;
│ ├─ =
│ ├─ ⎵
│ ├─ 1
│ ├─ ;
│ ├─ ⎵
│ ├─ y
│ ├─ ⎵
│ ├─ :
│ ├─ =
│ ├─ ⎵
│ ├─ x
│ ├─ ⎵
│ ├─ +
│ ├─ ⎵
│ ├─ 5
│ ├─ ⎵
│ ├─ e
│ ├─ n
│ └─ d
└─ !error dot=4: Program = program:
├─ Declarations = "declare" {Declaration ","}* decls ";"
│ └─ {Declaration ","}*
│ ├─ Declaration = decl:
│ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ ├─ x
│ │ │ └─ [0-9a-z]*
│ │ └─ Type = natural:
│ └─ Declaration = decl:
│ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ ├─ y
│ │ └─ [0-9a-z]*
│ └─ Type = natural:
└─ skipped
├─ x
├─ ⎵
├─ ;
├─ =
├─ ⎵
├─ 1
├─ ;
├─ ⎵
├─ y
├─ ⎵
├─ :
├─ =
├─ ⎵
├─ x
├─ ⎵
├─ +
├─ ⎵
├─ 5
├─ ⎵
├─ e
├─ n
└─ d
ok
As you can see error recovery resulted in a tree with ambiguities: the recoverer found multiple ways to recover the error and included them all in the resulting tree as ambiguities.
The error trees themselves consist of an error node representing a production that was only partially recognized. The part that was
not recognized is included as the last child of the error node in the form of a skipped node containing the list of characters that
where skipped before parsing could continue.
Note that the definition of error and skipped productions can be found in the Parse Tree module.
For some situations, simple disambiguation of the error ambiguities can be enough. For this the module ParseErrorRecovery offers
the disambiguateParseErrors function that takes an error tree and removes all ambiguities based on some simple heuristics:
rascal>Program disambPrg = disambiguateParseErrors(prg);
Program: (Program) `begin declare x : natural, y : natural; x ;= 1; y := x + 5 end`
rascal>println(prettyTree(disambPrg));
Program = program:
├─ Declarations = "declare" {Declaration ","}* decls ";"
│ └─ {Declaration ","}*
│ ├─ Declaration = decl:
│ │ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ │ ├─ x
│ │ │ └─ [0-9a-z]*
│ │ └─ Type = natural:
│ └─ Declaration = decl:
│ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ ├─ y
│ │ └─ [0-9a-z]*
│ └─ Type = natural:
└─ {Statement ";"}*
├─ {Statement ";"}*
│ └─ !error dot=2: Statement = asgStat:
│ ├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ │ ├─ x
│ │ └─ [0-9a-z]*
│ └─ skipped
│ ├─ ;
│ ├─ =
│ ├─ ⎵
│ └─ 1
└─ Statement = asgStat:
├─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ ├─ y
│ └─ [0-9a-z]*
└─ Expression = add:
├─ Expression = id:
│ └─ Id = [a-z] [0-9a-z]* !>> [0-9a-z]
│ ├─ x
│ └─ [0-9a-z]*
└─ Expression = natCon:
└─ Natural = [0-9]+
└─ [0-9]+
└─ 5
ok
As you can see, no ambiguities remain. In this case error recovery resulted in a tree with a single error tree, only skipping 4 characters before parsing could continue normally.
Note that for more complex solutions this simple disambiguation strategy is too limited. For instance if you want to use error recovery to offer the user quick fixes in the editor, you might want to use all alternatives to find possible fixes.