vendredi 14 mai 2010

Character feedback for XSugar

In the previous post, we showed that embedding white chars inside the XML is not a viable solution for strict bidirectionality. Automaticaly pre-process the modified XML to fix it and make sure it conforms to the stylesheet seems not possible based on the stylesheet itself, because the XML tree and AST tree are both orthogonal and not related to each other.

How does it works in Augeas? Augeas is based on lenses primitives and combinators. The one we are most interested here is the del lens. In the put direction, the del lens remove the matched content, and stores in a skeleton. In the put direction, the lens restores the original content from the skeleton or provides a default value in case of a new node. The skeleton values are retreived either by sequence or by key. To understand the difference, let's see what happens in case of a subtree move. If skeleton is aligned by sequence, then deleted content by the del lens will be restored in the order in which it appears in the original document. It means that white chars are sticking in place while the content move. In the case of insert, every record following the insert will be modified and get the previous skeleton value, and the last record will get the default value. This behavior doesn't correspond to minimal edit. A better approach is to use the key alignment. Then, skeleton is matched by key, and the correct original chars are selected even in the case of move, insert or delete of records.

The idea behind this principle of operation is to create a feedback of the original document to insert previous values when possible, while preserving modifications.

It's possible to do something similar in XSugar, by working on the AST level. In the rest of this blog, I present the principle of operation, a prototype with basic algorithm, obtained results and conclusion about this solution.

The goal is to to get back original white chars inside the updated document, without embedding them in the XML document. By using the strict version of the stylesheet, where every items are labeled, the resulting AST preserves every parsed chars, and we call strict nodes those that contains chars that would have been lost otherwise. Then, we could use this strict AST to match nodes from the updated XML and the original document, and then copy only strict elements from the original to the updated document.

The matching algorithm used to test this hypothesis works as follow. For each node of the first tree, search in the second tree a node where each child labels and text matches, except for strict nodes. The algorithm works by traversing the AST in level order, and has a complexity of O(n^2).

Here is the expected behavior of this simple algorithm.
  • Round trip without updating content should restore all original values.
  • Inserted node should get default values.
  • Updated nodes should get default values, because they will not match any previous values. They are handled the same way as an insert.
  • Moved nodes should get their original values.
  • Deleted nodes should not render any previous associated value.

To test and highlight the behavior, I modified slightly the students.xsg example. Here is the format of records that the stylesheet parses :

( [0-9]+ [z]+ [0-9]+ [a]+ )*

Hence, two numbers are separated by space, and multiple records are separated by new line, but "z" represents spaces and "a" represents new line. The default number of z's and a's is one, to show the behavior, we use two and more chars to detect when a default value is inserted.

A parse tree and a match between original document and updated document in which a record has been deleted is shown in the following figure.


Here are the results of tests.

Test 1 : round trip without modifications

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a11z11a111z111a
merged:  1zz1aa11zzz11aaa111zzzz111aaaa

All characters are recovered, and the resulting document is exactly the same as the input.

Test 2: insert

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a11z11a99z88a111z111a
merged:  1zz1aa11zzz11aaa99z88aaaa111zzzz111a

The new record "99 88" gets a default "z", while the last record gets a default "a".

Test 2 : move

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a111z111a11z11a
merged:  1zz1aa111zzzz111aaa11zzz11aaaa

The moved record gets the right number of "z", but "a" chars keeps their order.

Test 4 : delete

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a111z111a
merged:  1zz1aa111zzzz111aaa

The central record is deleted, and the last record gets 3 a's in place of 4.

From this results, we can see that strict elements z's are retrieved by keys, and a's by sequence. This is caused by the fact that branch nodes that have only one strict leaf nodes are match by their level in the tree. We can't determine if this node should be matched to the parents, or are themselves a parents for other nodes, and in consequence, sequence match is the only remaining possible match. As I understand, other tree matching algorithms can't address this issue.

The algorithm can be improved to match nodes by leaf node string similarity, using Levenshtein distance. It may be possible to restore chars from updated nodes. But then, the match will depend on the number of changes made to the document, and a node may match another one more similar, may confuse the user by an apparent non-deterministic behavior and may produce non-minimal edit.

In conclusion, we showed that it's possible to use the original document to merge back otherwise lost chars. But, alignment of original content is mixed between sequence and key. The user can't control which one will apply. It may lead to non-minimal edit of the resulting file. In that respect, lenses primitives and combinators used in Augeas provides a more predictive behavior, by letting explicitly choose between key or sequence alignment by the lens developer.