RAP 2 - Types are Parsers
RAP | 2 |
---|---|
Title | Types are Parsers |
Author | Jurgen Vinju |
Status | Draft |
Type | Rascal Language |
Abstract
The goal is to remove a number of features for calling parsers of different types and unify them in a single concept. This concept is that any Rascal type represents a parser, and we should be able to call this parser using a unified syntax. We propose not to introduce new syntax at all, but to reuse the function call syntax. To parse using a Rascal type, the function name is represented by a reified type, like so:
- #int("123")
  - parses the string "123" to the integer 123.
  - based on the Rascal expression parser or the Rascal value parser
- #start[CompilationUnit](|home:///myCprogram.c|)
  - uses the parser generated for C compilation units to parse the C program in the given file.
  - based on a (data-dependent) Rascal-generated parser
Library functions like parse and readTextValueString will become deprecated after this feature has been added. The “as type” expression notation also becomes redundant after this addition.
We will have fewer library functions and fewer expression syntaxes after this, without loss of functionality.
Parsing functions, as generated by types here, have parameters. The first and only positional parameter is the input they will parse. All other parameters are keyword parameters: they provide options to the parser, such as allowAmbiguity and src, and they provide contextual data to data-dependent parsers (when these are added).
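The calling convention above (one positional input, everything else as keyword options) can be mimicked in a few lines. The sketch below uses Python only as executable pseudocode. The function reify is a hypothetical stand-in for Rascal's # operator, and Python's own int and float constructors stand in for the real value parsers. Only the option names come from this proposal.

```python
def reify(py_type):
    """Hypothetical analogue of Rascal's #Type: turn a type into a parser function."""

    def parse(input, /, *, src="unknown:///", allow_ambiguity=False):
        # `input` is positional-only; `src` and `allow_ambiguity` are keyword-only.
        # This toy parser ignores both options. A real one would use `src` for
        # error locations and `allow_ambiguity` for syntax types.
        return py_type(input)

    return parse


parse_int = reify(int)  # plays the role of #int
print(parse_int("123"))
print(parse_int("123", src="file:///numbers.txt"))
```

Passing the input by keyword, or an option by position, is rejected with a TypeError in this sketch. This mirrors the rule that the input is the only positional parameter.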
Motivation
- There are several ways of calling a parser right now in Rascal. None of them is unique in terms of features; they differ only in syntax, so the duplication is unnecessary. Examples: parse functions, read functions like readTextValueString, the [AsType] syntax, pattern matching. Is there more?
  - Each has its own syntax
  - Each has its own way of dealing with exceptions
  - Each has its own configurability
- Non-terminals from syntax definitions are parsers conceptually, and very much so in the top-down sense we use them now. When they become parametrized, this only adds to the feeling of a non-terminal “being” a parser function.
- We often need to build builtin values from string input, but there is no convenient notation to do so; today this takes a library call such as readTextValueString(...)
Specification
No new syntax is necessary for the new feature:
syntax Expression = Expression "(" {Expression ","}* ")" // callOrTree exists already
syntax Expression = "#" Type // type reification exists already
The new semantics applies to the callOrTree syntax. Statically, we now also allow these applications:
<type[&T] type> ( loc input, bool allowAmbiguity = false )
<type[&T] type> ( str input, loc src = |unknown:///|, bool allowAmbiguity = false )
- Both expressions return a value of type &T (as instantiated) when successful, or they throw a syntaxError or a validationError exception value (see below)
Semantically, the parser which will be called switches on the kind of type and the kind of input:
- [builtin, text] For builtin types such as list, int, applied to a text file or string, the Rascal text value parser will be called;
- [builtin, bin] For builtin types applied to a binary serialized file, the Rascal binary value parser will be called
- [data, text] the Rascal text value parser will be called, and the value will be validated against the abstract syntax definition defined by the given reified type. Note that this gives rise to programmatically constructed abstract grammars which are used as parsers.
- [data, bin] the Rascal binary value parser will be called and the result will be checked against the expected top-level type
- [syntax, text]
  - If the input does not start with “appl(”, a generated parser will be called to parse the text using the given non-terminal/grammar in the reified type
  - If the input is an already serialized parse tree, the code will use the values reader and check that the top-level type is indeed the expected non-terminal, or throw a validation exception
- [syntax, bin] The input is an already serialized parse tree in binary form; the code will use the values reader and check that the top-level type is indeed the expected non-terminal, or throw a validation exception
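The case analysis above amounts to a small decision table. The following sketch encodes it in Python, used only as executable pseudocode. All names (TypeKind, InputKind, select_strategy) and the label strings are hypothetical. The function only names the reader and the validation step that the list prescribes; it does not parse anything. Because the list mentions no separate validation step for builtin types, the sketch reports none.

```python
from enum import Enum


class TypeKind(Enum):
    BUILTIN = "builtin"
    DATA = "data"
    SYNTAX = "syntax"


class InputKind(Enum):
    TEXT = "text"
    BINARY = "bin"


def select_strategy(type_kind, input_kind, text=""):
    """Return (reader, validation) labels for a kind of type and a kind of input."""
    # Textual input starting with "appl(" is treated as a serialized parse tree.
    serialized_tree = text.startswith("appl(")

    if input_kind is InputKind.BINARY:
        reader = "binary value reader"
    elif type_kind is TypeKind.SYNTAX and not serialized_tree:
        reader = "generated parser"
    else:
        reader = "text value reader"

    if type_kind is TypeKind.BUILTIN or reader == "generated parser":
        validation = None  # no separate validation step is listed for these cases
    elif type_kind is TypeKind.DATA and input_kind is InputKind.TEXT:
        validation = "validate against abstract syntax definition"
    elif type_kind is TypeKind.DATA:
        validation = "check top-level type"
    else:
        validation = "check top-level non-terminal"
    return reader, validation


for tk in TypeKind:
    for ik in InputKind:
        print(tk.value, ik.value, select_strategy(tk, ik, text="1+2"))
```

Keeping the choice of reader separate from the choice of validation makes it visible that only the [syntax, text] case depends on the content of the input.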
Exceptions thrown by the new expression:
- syntaxError(loc src, str cause = "") // explains a syntactical problem with its exact location and a probable cause if possible. Note that we had parseError before and it may be good to not change this at the same time
- validationError(loc src, str cause = "") // even though parsing was successful, the resulting structure did not match the expected type
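The difference between the two exceptions is easiest to see as two phases: first read the text, then check the result against the expected type. The sketch below illustrates this in Python, used only as executable pseudocode. Python's ast.literal_eval stands in for the Rascal text value parser, and the names ParseFailure, SyntaxErrorValue, ValidationErrorValue and parse_as are hypothetical. Only the src and cause fields are taken from the proposal.

```python
import ast


class ParseFailure(Exception):
    """Shared shape of both errors: a source location and an optional cause."""

    def __init__(self, src, cause=""):
        super().__init__(f"{src}: {cause}")
        self.src = src
        self.cause = cause


class SyntaxErrorValue(ParseFailure):
    """Stand-in for syntaxError: the text could not be read at all."""


class ValidationErrorValue(ParseFailure):
    """Stand-in for validationError: reading worked, but the type is wrong."""


def parse_as(expected_type, text, *, src="unknown:///"):
    # Phase 1: read the text as a value (stand-in for the text value parser).
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError, TypeError) as exc:
        raise SyntaxErrorValue(src, cause=str(exc)) from exc
    # Phase 2: check the value against the expected type.
    if not isinstance(value, expected_type):
        cause = f"expected {expected_type.__name__}, got {type(value).__name__}"
        raise ValidationErrorValue(src, cause=cause)
    return value


print(parse_as(int, "123"))
```

In this sketch, parse_as(int, "12 +") raises SyntaxErrorValue, whereas parse_as(int, '"abc"') reads a string successfully and then raises ValidationErrorValue.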
Backwards Compatibility
- The new feature is designed to reproduce the semantics of the following existing features exactly, with a different syntax, i.e. it is semantically backward compatible:
  - AsType expressions
  - Parse functions in ParseTree.rsc and Prelude.java
  - ReadValueFrom…
- The notable exception is exception semantics. The new notation will throw only syntaxError (or parseError, if the existing name is kept; see Specification) and validationError, which differs from the previous parsing API
- The new feature is syntactically different, so the old features must be labeled “deprecated” for a while and co-exist with the new feature
- A simple refactoring or quick-fix tool can be provided to translate the old notations to the single new notation.
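To give an impression of how small such a translation can be, here is a purely textual sketch in Python, used only as executable pseudocode. The names OLD_CALL and to_new_notation are hypothetical. The sketch rewrites only two-argument calls to parse and readTextValueString whose first argument is a reified type without commas or spaces. It ignores comments, string literals, extra arguments and AsType expressions. A real quick-fix would more likely work on the Rascal parse tree than on raw text.

```python
import re

# Matches `parse(#T, ` or `readTextValueString(#T, ` up to and including the comma.
# Assumption: the reified type contains no commas or spaces, so a type such as
# map[str,int] is not handled by this sketch.
OLD_CALL = re.compile(r"\b(?:parse|readTextValueString)\(\s*(#[\w\[\]]+)\s*,\s*")


def to_new_notation(source):
    """Rewrite old two-argument parser calls to the proposed #Type(input) form."""
    return OLD_CALL.sub(r"\1(", source)


print(to_new_notation('readTextValueString(#int, "123")'))
print(to_new_notation("parse(#start[CompilationUnit], |home:///myCprogram.c|)"))
```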
Implementation
We envision one or two RascalPrimitives to cover the different kinds of types provided to the callOrTree:
- Splitting out to more specific functions at compile time (primitive, data or syntax)
- Each RascalPrimitive will have to dynamically dispatch based on the content (str, loc, bin or text)
The type checker should add specialized semantics for callOrTree expressions with a reified type as the “function”, treating them in effect as calls to, for example, the old parse function.
The compiler should translate the expressions to direct calls to the above RascalPrimitives.