module analysis::diff::edits::HiFiTreeDiff

rascal-Not specified

Infer ((TextEdit)) from the differences between two parse ((ParseTree::Tree))s

Usage

import analysis::diff::edits::HiFiTreeDiff;

Dependencies

extend analysis::diff::edits::TextEdits;
import List;
import Location;
import ParseTree;
import String;
import util::Math;

Description

This module provides an essential building block for creating high-fidelity source-to-source code transformations. It is common for industrial use cases of source-to-source transformation to extract a list of text edits programmatically using parse tree pattern matching. This way the changes are made on the textual level, with less introduction of noise and fewer removals of valuable layout (indentation) and source code comments.

The construction of such high-fidelity edit lists can be rather involved because it tangles and scatters a number of concerns:

syntax-directed pattern matching
string substitution; construction of the rewritten text
- retention of layout and in particular indentation
- retention of source code comments
- retention of specific case-insensitive keyword style
- syntactic correctness of the result; especially in relation to list separators there are many corner-cases to thing of

On the other hand, ParseTree to ParseTree rewrites are much easier to write and get correct. They are "syntax directed" via the shape of the tree that follows the grammar of the language. Some if not all of the above aspects are tackled by the rewriting mechanism with concrete patterns. Especially the corner cases w.r.t. list separators are all handled by the rewriting mechanisms. Also the rules are in "concrete syntax", on both the matching and the substition side. So they are readable for all who know the object language. The rules guarantee syntactic correctness of the rewritten source code. However, rewrite rules do quite some noisy damage to the layout, indentation and comments, of the result.

With this module we bring these two modalities of source-to-source transformations together:

The language engineer uses concrete syntax rewrite rules to derive a new ParseTree from the original;
We run Tree Diff to obtain a set of minimal text edits;
We apply the text edits to the editor contents or the file system.

Benefits

Because the derived text edits change fewer characters, the end result is more "hifi" than simply unparsing the rewritten ParseTree. More comments are retained and more indentation is kept the same. More case-insensitive keywords retain their original shape.
At the same time the rewrite rules are easier to maintain as they remain "syntax directed".
Changes to the grammar will be picked up when checking all source and target patterns.
The diff algorithm uses cross-cutting information from the parse tree (what is layout and what not, what is case-insensitive, etc.) which would otherwise have to be managed by the language engineer in every rewrite rule.
The diff algoritm understands what indentation is and brings new sub-trees to the original level of indentation (same as the sub-trees they are replacing)
Typically the algorithm's run-time is lineair in the size of the tree, or better. Same for memory usage.

Pitfalls

Tree Diff only works under the assumption that the second tree was derived from the first by applying concrete syntax rewrite rules in Rascal. If there is no origin relation between the two then its heuristics will not work. The algorithm could degenerate to substituting the entire file, or worse it could degenerate to an exponential search for commonalities in long lists.
Tree Diff's efficiency is predicated on the two trees being derived from each other in main memory of the currently running JVM. This way both trees will share pointers where they are the same, which leads to very efficient equality testing. If the trees are first independently serialized to disk and then deserialized again, and then Tree Diff is called, this optimization is not present and the algorithm will perform (very) poorly.
Substitution patterns should be formatted as best as possible. The algorithm will not infer spacing or relative indentation inside of the substituted subtree. It will only infer indentation for the entire subtree. Another way of resolving this is using a code formatter on the subsituted patterns.

function treeDiff

Detects minimal differences between parse trees and makes them explicit as ((TextEdit)) instructions.

list[TextEdit] treeDiff(Tree a, a)

list[TextEdit] treeDiff(
    Tree t:appl(prod(label(_, Symbol s), list[Symbol] syms, set[Attr] attrs), list[Tree] args), 
    Tree u)

list[TextEdit] treeDiff(
    Tree t,
    Tree u:appl(prod(label(_, Symbol s), list[Symbol] syms, set[Attr] attrs), list[Tree] args))

list[TextEdit] treeDiff(
    appl(prod(layouts(_), _, _), list[Tree] _), 
    appl(prod(layouts(_), _, _), list[Tree] _))

list[TextEdit] treeDiff(
    appl(prod(lit(str l), _, _), list[Tree] _), 
    appl(prod(lit(l)    , _, _), list[Tree] _))

list[TextEdit] treeDiff(
    appl(prod(cilit(str l), _, _), list[Tree] _), 
    appl(prod(cilit(l)    , _, _), list[Tree] _))

list[TextEdit] treeDiff(
    t:appl(prod(lex(str l), _, _), list[Tree] _), 
    r:appl(prod(lex(l)    , _, _), list[Tree] _))

default list[TextEdit] treeDiff(
    t:appl(Production p:prod(_,_,_), list[Tree] _), 
    r:appl(Production q:!p         , list[Tree] _))

list[TextEdit] treeDiff(
    Tree t:appl(Production p:regular(Symbol reg), list[Tree] aElems), 
    appl(p, list[Tree] bElems))

default list[TextEdit] treeDiff(t:appl(Production p, list[Tree] argsA), appl(p, list[Tree] argsB))

This is a "diff" algorithm of two parse trees to generate a Text Edit script that applies the differences on the textual level, with minimal collatoral damage in whitespace. This is why it is called "HiFi": minimal unnecessary noise introduction to the original file. It also tries to conserve source code comments; where still possible.

The resulting Text Edits are an intermediate representation for making changes in source code text files. They can be executed independently via Execute Text Edits, or interactively via IDEServices, or LanguageServer features.

This top-down diff algorithm takes two arguments:

an original parse tree for a text file,
and a derived parse tree that is mostly equal to the original but has pieces of it substituted or rewritten.

From the tree node differences between these two trees, Text Edits are derived such that:

when the edited source text is parsed again, the resulting tree would match the derived tree. However, the parsed tree could be different from the derived tree in terms of whitespace, indentation and case-insensitive literals (see below).
when tree nodes (grammar rules) are equal, smaller edits are searched by pair-wise comparison of the children
differences between respective layout or (case insensitve) literal nodes are always ignored
when lists have changed, careful editing of possible separators ensures syntactic correctness
when new sub-trees are inserted, the replacement will be at the same indentation level as the original.
when case-insensitive literals have been changed under a grammar rule that remained the same, no edits are produced.

The function comes in handy when we use Rascal to rewrite parse trees, and then need to communicate the effect back to the IDE (for example using IDEServices or util::LanguageServer interfaces). We use Execute Text Edits to test the effect of Text Edits while developing a source-to-source transformation.

Examples

If we rewrite parse trees, this can be done with concrete syntax matching. The following example swaps the if-branch with the else-branch in Pico:

rascal>import lang::pico::\syntax::Main;
ok
rascal>import IO;
ok
rascal>import analysis::diff::edits::ExecuteTextEdits;
ok
rascal>import analysis::diff::edits::TextEdits;
ok
rascal>import analysis::diff::edits::HiFiTreeDiff;
ok

an example Pico program:

rascal>writeFile(|tmp:///example.pico|,
|1 >>>>    "begin
|2 >>>>    '   declare
|3 >>>>    '       a : natural,
|4 >>>>    '       b : natural;
|5 >>>>    '   if a then 
|6 >>>>    '       a := b
|7 >>>>    '   else
|8 >>>>    '       b := a
|9 >>>>    '   fi
|10 >>>>    'end");
ok
rascal>import ParseTree;
ok
rascal>original = parse(#start[Program], |tmp:///example.pico|);
start[Program]: (start[Program]) `begin
   declare
       a : natural,
       b : natural;
   if a then 
       a := b
   else
       b := a
   fi
end`

match and replace all conditionals

rascal>rewritten = visit(original) {
|1 >>>>    case (Statement) `if <Expression e> then <{Statement ";"}* ifBranch> else <{Statement ";"}* elseBranch> fi`
|2 >>>>      => (Statement) `if <Expression e> then 
|3 >>>>                     '  <{Statement ";"}* elseBranch> 
|4 >>>>                     'else 
|5 >>>>                     '  <{Statement ";"}* ifBranch> 
|6 >>>>                     'fi`
|7 >>>>}
start[Program]: (start[Program]) `begin
   declare
       a : natural,
       b : natural;
   if a then 
  b := a 
else 
  a := b 
fi
end`

Check the result as a string. It worked, but we see some collatoral damage in whitespace (indentation).

rascal>"<rewritten>"
str: "begin\n   declare\n       a : natural,\n       b : natural;\n   if a then \n  b := a \nelse \n  a := b \nfi\nend"
───
begin
   declare
       a : natural,
       b : natural;
   if a then 
  b := a 
else 
  a := b 
fi
end
───

Now derive text edits from the two parse trees:

rascal>edits = treeDiff(original, rewritten);
list[TextEdit]: [
  replace(
    |tmp:///example.pico|(78,1,<6,7>,<6,8>),
    "b"),
  replace(
    |tmp:///example.pico|(83,1,<6,12>,<6,13>),
    "a"),
  replace(
    |tmp:///example.pico|(100,1,<8,7>,<8,8>),
    "a"),
  replace(
    |tmp:///example.pico|(105,1,<8,12>,<8,13>),
    "b")
]

Wrap them in a single document edit

rascal>edit = changed(original@\loc.top, edits);
FileSystemChange: changed(
  |tmp:///example.pico|,
  [
    replace(
      |tmp:///example.pico|(78,1,<6,7>,<6,8>),
      "b"),
    replace(
      |tmp:///example.pico|(83,1,<6,12>,<6,13>),
      "a"),
    replace(
      |tmp:///example.pico|(100,1,<8,7>,<8,8>),
      "a"),
    replace(
      |tmp:///example.pico|(105,1,<8,12>,<8,13>),
      "b")
  ])

Apply the document edit on disk:

rascal>executeDocumentEdits([edit]);
ok

and when we read the result back, we see the transformation succeeded, and indentation was not lost:

rascal>readFile(tmp://example.pico|);

It's also possible to directly rewrite the original string, for debugging purposes:

|1 >>>>executeTextEdits("<original>", treeDiff(original, rewritten))

Benefits

This function allows the language engineer to work in terms of abstract and concrete syntax trees while manipulating source text. The Text Edits intermediate representation bridge the gap to the minute details of IDE interaction such as "undo" and "preview" features.
Text editing is fraught with details of whitespace, comments, list separators; all of which are handled here by the exactness of syntactic and semantic knowledge of the parse trees.
Where possible the algorithm also retains the capitalization of case-insensitive literals.
The algorithm retrieves and retains indentation levels from the original tree, even if sub-trees in the derived tree have mangled indentation. This allows us to ignore the indentation concern while thinking of rewrite rules for source-to-souce transformation, and focus on the semantic effect.
The algorithm inherits source code comments from the original, wherever sub-trees of the original and the rewritten tree still line up.

Pitfalls

If the first argument is not an original parse tree, then basic assumptions of the algorithm fail and it may produce erroneous text edits.
If the second argument is not derived from the original, then the algorithm will produce a single text edit to replace the entire source text.
If the second argument was not produced from the first in the same JVM memory, it will not share many pointers to equal sub-trees and the performance of the algorithm will degenerate quickly.
If the parse tree of the original does not reflect the current state of the text in the file, then the generated text edits will do harm.
If the original tree is not annotated with source locations, the algorithm fails.
Both parse trees must be type correct, e.g. the number of symbols in a production rule, must be equal to the number of elements of the argument list of Appl.
This algorithm does not work with ambiguous (sub)trees.
When large sub-trees or sub-lists are moved to other parts of the tree, comment inheritance is not possible anymore.

function seps

decide how many separators we have

int seps(\iter-seps(_, list[Symbol] s))

int seps(\iter-star-seps(_, list[Symbol] s))

default int seps(Symbol _)

function listDiff

List diff finds minimal differences between the elements of two lists.

list[TextEdit] listDiff(loc span, int seps, list[Tree] originals, list[Tree] replacements)

This algorithm uses heuristics to avoid searching for the largest common sublist all too often. Also it minimized the sublists that largest common sublist is executed on.

Since many patches to parse tree lists typically only change a prefix or a postfix, and we can detect this quickly, we first extract patches for those instances.
It is also fast and easy to detect unchanged prefixes and postfixes, so by focusing on the changes parts in the middle we generate more instances of case 1.
Another simple and quick case is when simply all elements are different (the prefix==the list==the postfix)
What we are left with is either an empty list and we are done, or a more complex situation where we apply the "findEqualSubList" algorithm, which splits the list in three parts:
- two unequal prefixes
- two equal sublists in the middle
- two unequal postfixes
the algorithm then concatenates the diffs by recursing to step 1 on the prefixes and the diffs by recursing to step 1. on the postfixes
two empty lists terminate the recursion,

function findEqualSubList

Finds the largest sublist that occurs in both lists

list[Tree] findEqualSubList([*Tree sub], [*_, *sub, *_])

list[Tree] findEqualSubList([*_, *Tree sub, *_], [*sub])

list[Tree] findEqualSubList([*_, p, *Tree sub, q, *_], [*_, !p, *sub, !q, *_])

default list[Tree] findEqualSubList(list[Tree] _orig, list[Tree] _repl)

Using list matching and backtracking, this algorithm detects which common sublist is the largest. It assumes Trim Equal Elements has happened already, and thus there are interesting differences left, even if we remove any equal sublist.

Note that this is not a general algorithm for Largest Common Subsequence (LCS), since it uses particular properties of the relation between the original and the replacement list:

New elements are never equal to old elements (due to source locations)
Equal prefixes and postfixes may be assumed to be maximal sublists as well (see above).
Candidate equal sublists always have consecutive source locations from the origin.

function trimEqualElements

trips equal elements from the front and the back of both lists, if any.

tuple[loc, list[Tree], list[Tree]] trimEqualElements(loc span, 
    [Tree a, *Tree aPostfix], [ a, *Tree bPostfix])

tuple[loc, list[Tree], list[Tree]] trimEqualElements(loc span, 
    [*Tree aPrefix, Tree a], [*Tree bPrefix, a])

default tuple[loc, list[Tree], list[Tree]] trimEqualElements(loc span, 
    list[Tree] a, list[Tree] b)

function commonSpecialCases

tuple[list[TextEdit], list[Tree], list[Tree]] commonSpecialCases(loc span, 0, [Tree a, *Tree tail], [*tail])

tuple[list[TextEdit], list[Tree], list[Tree]] commonSpecialCases(loc span, 1, 
    [Tree a, Tree _sep, Tree tHead, *Tree tail], [tHead, *tail])

tuple[list[TextEdit], list[Tree], list[Tree]] commonSpecialCases(loc span, 3, 
    [Tree a, Tree _l1, Tree _sep, Tree _l2, Tree tHead, *Tree tail], [tHead, *tail])

tuple[list[TextEdit], list[Tree], list[Tree]] commonSpecialCases(loc span, int _, 
    [Tree a], [Tree b])

default tuple[list[TextEdit], list[Tree], list[Tree]] commonSpecialCases(loc span, int _, list[Tree] a, list[Tree] b)

function fromUntil

convenience overload for shorter code

loc fromUntil(Tree from, Tree until)

function fromUntil

Compute location span that is common between an element and a succeeding element

loc fromUntil(loc from, loc until)

The resulting loc is including the from but excluding the until. It goes right up to until.

 [from] gap [until]
 <--------->

function end

int end(loc src)

function after

loc after(loc src)

function endCover

loc endCover(loc span, [])

loc endCover(loc span, [Tree x])

default loc endCover(loc span, list[Tree] l)

function beginCover

loc beginCover(loc span, [])

loc beginCover(loc span, [Tree x])

default loc beginCover(loc span, list[Tree] l)

function cover

loc cover(list[Tree] elems:[_, *_])

function yield

yield a consecutive list of trees

str yield(list[Tree] elems)

function learnIndentation

Make sure the subtitution is at least as far indented as the original

str learnIndentation(loc span, str replacement, str original)

This algorithm ignores the first line, since the first line is always preceeded by the layout of a parent node.

Then it measures the depth of indentation of every line in the original, and takes the minimum. That minimum indentation is stripped off every line that already has that much indentation in the replacement, and then all lines are re-indented with the discovered minimum.

Usage​

Dependencies​

Description​

Benefits​

Pitfalls​

function treeDiff​

Examples​

Benefits​

Pitfalls​

function seps​

function listDiff​

function findEqualSubList​

function trimEqualElements​

function commonSpecialCases​

function fromUntil​

function fromUntil​

function end​

function after​

function endCover​

function beginCover​

function cover​

function yield​

function learnIndentation​

Usage

Dependencies

Description

Benefits

Pitfalls

function treeDiff

Examples

Benefits

Pitfalls

function seps

function listDiff

function findEqualSubList

function trimEqualElements

function commonSpecialCases

function fromUntil

function fromUntil

function end

function after

function endCover

function beginCover

function cover

function yield

function learnIndentation