Multivax: xsugar


Friday, May 14, 2010

Character feedback for XSugar

In the previous post, we showed that embedding whitespace characters inside the XML is not a viable solution for strict bidirectionality. Automatically pre-processing the modified XML so that it conforms to the stylesheet again does not seem possible from the stylesheet alone, because the XML tree and the AST are independent structures with no direct relation to each other.

How does it work in Augeas? Augeas is based on lens primitives and combinators. The one we are most interested in here is the del lens. In the get direction, the del lens removes the matched content and stores it in a skeleton. In the put direction, the lens restores the original content from the skeleton, or provides a default value in the case of a new node. The skeleton values are retrieved either by sequence or by key. To understand the difference, let's see what happens when a subtree is moved. If the skeleton is aligned by sequence, then the content deleted by the del lens is restored in the order in which it appears in the original document. This means that whitespace characters stay in place while the content moves. In the case of an insert, every record following the insertion point is modified and gets the skeleton value that belonged to the record before it, and the last record gets the default value. This behavior doesn't correspond to a minimal edit. A better approach is to use key alignment. The skeleton is then matched by key, and the correct original characters are selected even when records are moved, inserted or deleted.
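To make the difference between the two alignments concrete, here is a minimal Python sketch. It is a toy model and not Augeas code: the function names, the record keys and the list-of-pairs skeleton are all assumptions made for illustration.

```python
def put_by_sequence(records, skeleton, default):
    """Restore deleted text positionally: the i-th record gets the i-th saved value."""
    saved = [text for _key, text in skeleton]
    return [(rec, saved[i] if i < len(saved) else default)
            for i, rec in enumerate(records)]


def put_by_key(records, skeleton, default):
    """Restore deleted text by key: a record gets the value saved under its own key."""
    saved = dict(skeleton)
    return [(rec, saved.get(rec, default)) for rec in records]


# Hypothetical skeleton captured in the get direction: (record key, deleted whitespace).
skeleton = [("alpha", "  "), ("beta", "\t"), ("gamma", "   ")]
# The edited tree: a new record is inserted right after "alpha".
edited = ["alpha", "new", "beta", "gamma"]

by_sequence = put_by_sequence(edited, skeleton, default=" ")
by_key = put_by_key(edited, skeleton, default=" ")
```

With sequence alignment, "new" inherits the tab that belonged to "beta" and "gamma" ends up with the default. With key alignment, only "new" gets the default and every other record keeps its own whitespace.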

The idea behind this principle of operation is to feed the original document back into the transformation, so that previous values are reinserted where possible while modifications are preserved.

It's possible to do something similar in XSugar by working at the AST level. In the rest of this post, I present the principle of operation, a prototype with a basic algorithm, the results obtained, and a conclusion about this solution.

The goal is to get the original whitespace characters back into the updated document without embedding them in the XML document. With the strict version of the stylesheet, where every item is labeled, the resulting AST preserves every parsed character; we call strict nodes those that contain characters that would otherwise have been lost. We can then use this strict AST to match nodes between the updated XML and the original document, and copy only the strict elements from the original to the updated document.

The matching algorithm used to test this hypothesis works as follows. For each node of the first tree, search the second tree for a node whose children all match in label and text, except for strict nodes. The algorithm traverses the AST in level order and has a complexity of O(n^2).
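Here is a rough Python sketch of a matching pass of this kind. It is not the prototype itself: the Node class, the flat record shape and the rule that an original node can be matched only once are my assumptions for the example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    text: str = ""
    strict: bool = False  # True when the text would otherwise be lost
    children: list = field(default_factory=list)


def level_order(root):
    """Yield the nodes of a tree breadth-first."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def signature(node):
    """Label and text of each non-strict child: the part that has to agree."""
    return tuple((c.label, c.text) for c in node.children if not c.strict)


def merge_strict(original, updated):
    """Copy strict text from matching original nodes into the updated tree."""
    candidates = list(level_order(original))
    used = set()  # assumption: an original node is matched at most once
    for new in level_order(updated):
        if not any(c.strict for c in new.children):
            continue  # nothing to restore under this node
        for i, old in enumerate(candidates):  # nested scan, hence O(n^2)
            if i in used or old.label != new.label:
                continue
            if signature(old) != signature(new):
                continue
            used.add(i)
            old_strict = [c for c in old.children if c.strict]
            new_strict = [c for c in new.children if c.strict]
            for o, n in zip(old_strict, new_strict):
                n.text = o.text
            break


def record(first, gap, second):
    """Hypothetical record shape: number, strict gap, number."""
    return Node("record", children=[
        Node("num", first), Node("sp", gap, strict=True), Node("num", second)])


def render(node):
    """Concatenate the text of the tree depth-first."""
    return node.text + "".join(render(c) for c in node.children)


original = Node("doc", children=[record("1", "zz", "1"), record("11", "zzz", "11")])
updated = Node("doc", children=[record("11", "z", "11"), record("1", "z", "1")])
merge_strict(original, updated)
merged = render(updated)
```

In this example the two records are swapped in the updated tree, and each one still recovers its own gap because its non-strict children identify it. A node whose children are all strict has an empty signature, so it can only be paired with the first unused node of the same label.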

Here is the expected behavior of this simple algorithm.

A round trip without updating content should restore all original values.
Inserted nodes should get default values.
Updated nodes should get default values, because they will not match any previous values. They are handled the same way as an insert.
Moved nodes should get their original values.
Deleted nodes should not render any previously associated value.

To test and highlight this behavior, I slightly modified the students.xsg example. Here is the format of the records that the stylesheet parses:

( [0-9]+ [z]+ [0-9]+ [a]+ )* 

Hence, two numbers are separated by spaces, and records are separated by new lines, except that "z" stands for a space and "a" stands for a new line. The default number of z's and a's is one. To show the behavior, the input uses two or more of each, so that an inserted default value is easy to detect.

The following figure shows a parse tree and the match between the original document and an updated document in which a record has been deleted.

Here are the results of tests.

Test 1 : round trip without modifications

input:     1zz1aa11zzz11aaa111zzzz111aaaa

default: 1z1a11z11a111z111a

merged:   1zz1aa11zzz11aaa111zzzz111aaaa

All characters are recovered, and the resulting document is exactly the same as the input.

Test 2: insert

input:     1zz1aa11zzz11aaa111zzzz111aaaa

default:  1z1a11z11a99z88a111z111a

merged:  1zz1aa11zzz11aaa99z88aaaa111zzzz111a

The new record "99 88" gets a default "z", while the last record gets a default "a".

Test 3 : move

input:    1zz1aa11zzz11aaa111zzzz111aaaa

default: 1z1a111z111a11z11a

merged:   1zz1aa111zzzz111aaa11zzz11aaaa

The moved record gets the right number of z's, but the a's keep their original order.

Test 4 : delete

input:     1zz1aa11zzz11aaa111zzzz111aaaa

default: 1z1a111z111a

merged:   1zz1aa111zzzz111aaa

The central record is deleted, and the last record gets 3 a's in place of 4.

From these results, we can see that the strict z elements are retrieved by key, and the a's by sequence. This is because branch nodes that have only one strict leaf node are matched by their level in the tree. We can't determine whether such a node should be matched to its parent or is itself a parent of other nodes and, as a consequence, a sequence match is the only possible match left. As far as I understand, other tree matching algorithms can't address this issue either.
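The mixed alignment can be reproduced with a few lines of Python. This is only a toy model that hard-codes the observed policy (z's by key, a's by position) on the flat record format. It is not the AST-based prototype, and the use of the two numbers as the key is my assumption.

```python
import re

# One record: number, run of z, number, run of a.
RECORD = re.compile(r"(\d+)(z+)(\d+)(a+)")


def merge(original, updated):
    """Restore z runs by record key and a runs by position."""
    old = RECORD.findall(original)
    z_by_key = {(n1, n2): z for n1, z, n2, _a in old}
    a_by_seq = [a for _n1, _z, _n2, a in old]
    out = []
    for i, (n1, z, n2, a) in enumerate(RECORD.findall(updated)):
        z = z_by_key.get((n1, n2), z)                # key alignment
        a = a_by_seq[i] if i < len(a_by_seq) else a  # sequence alignment
        out.append(n1 + z + n2 + a)
    return "".join(out)


ORIGINAL = "1zz1aa11zzz11aaa111zzzz111aaaa"
inserted = merge(ORIGINAL, "1z1a11z11a99z88a111z111a")
moved = merge(ORIGINAL, "1z1a111z111a11z11a")
deleted = merge(ORIGINAL, "1z1a111z111a")
```

Fed with the "default" strings of the four tests, this model gives back the same "merged" strings as listed above, which is consistent with the key/sequence explanation.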

The algorithm could be improved to match nodes by the string similarity of their leaf nodes, using the Levenshtein distance. It might then be possible to restore characters for updated nodes. But the match would then depend on the number of changes made to the document: a node may match another, more similar one, which may confuse the user with apparently non-deterministic behavior and may produce a non-minimal edit.

In conclusion, we showed that it's possible to use the original document to merge back characters that would otherwise be lost. However, the alignment of the original content is a mix of sequence and key, and the user can't control which one applies. This may lead to a non-minimal edit of the resulting file. In that respect, the lens primitives and combinators used in Augeas provide more predictable behavior, because they let the lens developer explicitly choose between key and sequence alignment.

Friday, April 9, 2010

Limitations of XSugar

In previous posts, I explained how XSugar could be used to transform configuration files to XML and back. Preserving whitespace characters (space, tab, carriage return, etc.) is possible by creating special elements to hold them. In a simple round trip, when the XML document is not modified, this works well. But there are issues when the XML document is modified. For example, consider this simple XML with strict nodes, from the students example of XSugar.


<students xmlns="http://studentsRus.org/">
  <student sid="19701234">
    <name>John Doe</name>
    <email>john_doe@notmail.org</email>
  </student>
  <space1> </space1>
  <space2> </space2>
  <newline>
</newline>
  <student sid="19785678">
    <name>Jane Dow</name>
    <email>dow@bmail.org</email>
  </student>
  <space1> </space1>
  <space2> </space2>
  <newline>
</newline>
</students>

Space and newline elements follow each student record. They are used to restore the exact formatting of the original string. We can then modify an existing value and the spaces will be preserved, which corresponds to a minimal edit.

But say another record is inserted after the first student, before the space1 element. The first record will then see its spacing reset to default values, and the newly inserted element will get the spaces from the first record! This doesn't correspond to a minimal edit, and is similar to the alignment problem described by Foster et al. in Boomerang.

If a student element is removed but its space elements are left in place, it will cause a syntax error according to the XSugar stylesheet. If an element is moved, it must be inserted at the right place in the XML, otherwise a syntax error may occur.

It may be possible to fix the XML before the transformation to non-XML by adding, moving or deleting space elements. The idea of using both the original document and the modified one to produce the updated output is precisely what a lens does, and is exactly what Augeas does. And now, with the introduction of the recursive lens, Augeas is about to support context-free languages, which opens the door to handling the Apache web server configuration, and many others.

That's all for now, stay tuned!

Thursday, March 11, 2010

Static validation of strict bidirectionality

These days, I had some fun validating XSugar stylesheets for their strict bidirectional property. XSugar can convert non-XML documents to XML documents, and vice versa. In theory, you get back exactly the same file after a round trip to and from XML. But a few tricky things prevent it from being exactly the same. XSugar statically verifies the reversibility of a stylesheet by checking that both grammars are bijections. This means that there is no ambiguity, so there is always exactly one way to do the conversion. But not all stylesheets that are valid in this respect yield correct behavior when performing an actual transformation. The problem is linked to the fact that some whitespace may be lost in the transformation. To understand the problem, I will use a concrete example.

Say you have the following file and want to convert it to XML. There is a block of lines of numbers followed by lines of words. Elements are separated by spaces, and there may be spaces before and after each line.

  123  456  789  
  987  654  321  
  abc  def  ghi  
  ihg  fed  cba

We will convert this file to the following XML.

<?xml version="1.0" encoding="UTF-8"?>
<a xmlns:b="http://udes.ca/noesis">
  123 456 789
987 654 321

  <z>
    <y>abc</y>
    <y>def</y>
    <y>ghi</y>
  </z>
  <z>
    <y>ihg</y>
    <y>fed</y>
    <y>cba</y>
  </z>
</a>

The stylesheet looks like this...

xmlns:b = "http://udes.ca/noesis"

NUM = [0-9]+ (MAX)   /* numbers */
ALPHA = [a-zA-Z]+ (MAX) /* letters */
NL   = \n|\r|\r\n   /* mandatory new line */
SP = [ \t]+ (MAX)   /* mandatory white space */
OSP = [ \t]* (MAX)   /* optional white space */

/* top-level rule : define the whole document */
a : [xs x] [ys y] = <a> [xs x] [ys y] </>

/* sequence of lines of numbers separated by space, with optional leading and ending white space */
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] =  [ns n1] [NL] [xs n2]
: =

/* numbers sequence */
ns : [NUM num] [SP] [ns n] = [NUM num] [SP] [ns n]
: [NUM num] = [NUM num]

/* sequence of lines of words separated by space, with optional leading and ending white space */
ys :[OSP] [y y1] [OSP] [NL] [ys y2] = <z> [y y1] </> [ys y2] 
: = 

/* sequence of <y> elements that hold actual words */
y : [ALPHA alpha] [SP] [y y1] = <y> [ALPHA alpha] </> [y y1]
: [ALPHA alpha] = <y> [ALPHA alpha] </>

... and it validates well.

$ xsugar -b strict-test3.xsg
Transformation is guaranteed to be reversible!

But when we actually try to convert the XML back to the non-XML format, we get a parse error at character 74. With XSugar's debug mode activated, we can see the result of XML normalization, a preparation step done before parsing.

Normalized input:
<a>123 456 789
987 654 321<z><y>abc</><y>def</><y>ghi</></><z><y>ihg</><y>fed</><y>cba</></></>

We see that the spaces and carriage returns before and after the number text node are lost. This is because text interleaved with other elements is trimmed. Spaces inside the text are preserved by the trim. This doesn't occur when an element contains only text. The parse error occurs because the last mandatory new line is dropped. Let's fix the xs rule.
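The trimming seen in the debug output can be imitated with Python's xml.etree.ElementTree. The sketch below only models what the normalized input shows. The function trim_interleaved is hypothetical, and XSugar's real normalization step may differ in its details.

```python
import xml.etree.ElementTree as ET


def trim_interleaved(elem):
    """Strip text that sits next to child elements; leave text-only elements alone."""
    if len(elem):  # the element has child elements
        elem.text = (elem.text or "").strip() or None
        for child in elem:
            child.tail = (child.tail or "").strip() or None
    for child in elem:
        trim_interleaved(child)


# A shortened version of the document above, with padded text inside <y>.
SOURCE = "<a>\n  123 456 789\n987 654 321\n\n  <z><y> abc </y></z>\n</a>"
root = ET.fromstring(SOURCE)
trim_interleaved(root)
normalized = ET.tostring(root, encoding="unicode")
```

After the call, the text of the a element no longer ends with a new line, while the padding inside the text-only y element survives. That missing final new line is what the NL terminal of the xs rule then fails to find.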

/* sequence of lines of numbers separated by space, with optional leading and ending white space */
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] =  [ns n1] [NL] [xs n2]
: [OSP] [ns n1] [OSP] [NL] = [ns n1]

This makes the last carriage return optional, and parsing succeeds. Let's look at the diff between the result and the original file.

$ diff -u strict-test.txt strict-test3.txt.back
--- strict-test.txt 2010-03-05 15:33:52.043893254 -0500
+++ strict-test3.txt.back 2010-03-12 00:23:12.482905316 -0500
@@ -1,4 +1,4 @@
-  123  456  789  
-  987  654  321  
-  abc  def  ghi  
-  ihg  fed  cba  
+123 456 789
+987 654 321
+abc def ghi
+ihg fed cba

The content is there, but the diff shows that every line is modified. This is because some terminals are not labeled. When this is the case, their values are lost, and an example string is generated on the way back when there is no corresponding item. This explains why the spaces are all reset to their default values: the empty string for the optional white space, one space for the mandatory white space, and the default carriage return.

Hence, we have to look at two things to make sure stylesheets are strictly bidirectional.

First, the easy part: all items in the stylesheet must be labeled. This way, we can be sure the XML document will contain all characters from the input, and no information will be lost. We only have to traverse the stylesheet and make sure that every RegexpTerminal is labeled.
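As an illustration of the label check, here is a small Python sketch that scans the stylesheet text with regular expressions. It does not use XSugar's own classes. It assumes that terminal definitions have the form NAME = regexp on their own line and that rule items are written [NAME] or [NAME label], as in the stylesheet above.

```python
import re

# A terminal definition starts a line with "NAME =".
TERMINAL_DEF = re.compile(r"^\s*([A-Za-z_]\w*)\s*=", re.MULTILINE)
# A rule item is "[NAME]" (unlabeled) or "[NAME label]" (labeled).
ITEM = re.compile(r"\[(\w+)(?:\s+(\w+))?\]")


def unlabeled_terminals(stylesheet):
    """Return the sorted names of terminals that appear in rules without a label."""
    terminals = set(TERMINAL_DEF.findall(stylesheet))
    missing = set()
    for line in stylesheet.splitlines():
        if TERMINAL_DEF.match(line):
            continue  # a terminal definition, not a rule
        for name, label in ITEM.findall(line):
            if name in terminals and not label:
                missing.add(name)
    return sorted(missing)


SHEET = r"""
NUM = [0-9]+ (MAX)
SP = [ \t]+ (MAX)
OSP = [ \t]* (MAX)
NL = \n|\r|\r\n
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] = [ns n1] [NL] [xs n2]
   : =
ns : [NUM num] [SP] [ns n] = [NUM num] [SP] [ns n]
   : [NUM num] = [NUM num]
"""

report = unlabeled_terminals(SHEET)
```

On this excerpt the report lists NL, OSP and SP, which are exactly the terminals whose values were reset to defaults in the diff. A real check would walk the parsed stylesheet rather than its text.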

Second, interleaved text nodes must not begin or end with whitespace characters [ \t\r\n], otherwise those characters will be lost. This is harder to check. First, we must determine from the stylesheet whether an element contains text nodes, elements, or both. Determining this for StringTerminal, RegexpTerminal and Element objects is straightforward: the first two generate text, while the last yields elements. But it is non-trivial for Nonterminal objects because of their recursive nature. It is a chicken-and-egg problem: to determine the possible values of a nonterminal, we have to compute this nonterminal and substitute its value into a recursive rule, which produces a loop. One solution is to compute the regular approximation of the XML stylesheet rules. It yields, for each nonterminal, an automaton that accepts a superset of the XML content that the stylesheet allows. Here is an example for the rule y.

We can tell which kind of content this nonterminal has by looking at the transitions of the automaton. In the last example, we know for sure that the nonterminal will yield only elements. The last step is to check whether the beginning and end of interleaved text content may overlap with whitespace characters. Here again, the regular approximation of nonterminals is used, but this time on the grammar itself. The overlap operator, normally used for horizontal ambiguity detection, is applied. If such an overlap exists, such as "WHITE <-> TEXT <-> WHITE", information loss may occur.
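For the narrower question of which kind of content a nonterminal can produce, the recursion can also be broken with a plain fixed-point iteration. The Python sketch below uses that simpler technique, not the regular approximation described in this post. The encoding of the rules as lists of (kind, name) pairs is an assumption, and a_content is a made-up name for the content of the a element.

```python
def content_kinds(rules):
    """Compute, for each nonterminal, whether it can yield text, elements, or both."""
    kinds = {name: set() for name in rules}
    changed = True
    while changed:  # iterate until no set grows any more
        changed = False
        for name, alternatives in rules.items():
            for alternative in alternatives:
                for kind, ref in alternative:
                    new = kinds[ref] if kind == "nonterminal" else {kind}
                    if not new <= kinds[name]:
                        kinds[name] |= new
                        changed = True
    return kinds


# XML side of the stylesheet above, hand-encoded for this sketch.
RULES = {
    "a_content": [[("nonterminal", "xs"), ("nonterminal", "ys")]],
    "xs": [[("nonterminal", "ns"), ("text", None), ("nonterminal", "xs")], []],
    "ns": [[("text", None), ("text", None), ("nonterminal", "ns")],
           [("text", None)]],
    "ys": [[("element", "z"), ("nonterminal", "ys")], []],
    "y": [[("element", "y"), ("nonterminal", "y")], [("element", "y")]],
}

kinds = content_kinds(RULES)
```

The iteration stops because each set can only grow and has at most two members. On these rules, y comes out as elements only, while the content of a mixes text and elements, so it is a candidate for trimming. This says nothing about whether the text can actually start or end with whitespace, which is the part that needs the overlap check.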

With this new information, developers can adjust the stylesheet to label items and enclose text inside elements, and so prevent information loss.

In summary, strict bidirectionality is required to prevent information loss in a transformation round trip. Information loss can be caused by unlabeled items in the stylesheet, or by the trimming of interleaved text nodes. I propose a method to detect areas that may not be strict, based on the regular approximation of the stylesheet and the grammar and on the overlap operator. Results from different stylesheets will be given soon. Stay tuned!

Wednesday, December 23, 2009

Strict reversibility in XSugar

XSugar is a tool for bidirectional transformations between two file formats. This is particularly useful for providing a common API to configuration files under Linux. For example, here is the result of applying a stylesheet to the /etc/hosts file:

<hosts xmlns="http://usherbrooke.ca/">
<record>
<ipaddr>127.0.0.1</ipaddr>
<canonical>localhost</canonical>
</record>
</hosts>

This file can be converted back to its flat format. But, as you may notice, indentation doesn't appear in the XML file, and will be lost. Spacing is reset to a default value. The round trip between the hosts file and the XML format keeps the semantics but loses the formatting. Even without modification, if the file is written back, diff will show changes. Once the spaces have been reset, the round trip yields the identity function, i.e. strings will be exactly the same.

One solution to this problem is to add to the XML all the elements that would otherwise be lost. This can be done by labeling terminal elements and adding corresponding nodes to the XML part of the stylesheet. For example, this rule loses the optional "a" header:

A = [a]*
X = [x]+
n : [A] [X x] "z" = <x> [X x] </x>

Providing input "aaaaxxz" will give the following XML :
<x>xx</x>
Converting it back to non-XML will yield the string "xxz". Since the empty string matches "[a]*", this is the default string that is returned.

Now, let's label the terminal "A" :

A = [a]*
X = [x]+
n : [A a] [X x] "z" = <x> [X x] <a> [A a] </a></x>

Now, we get the XML
<x>
xx
<a>aaaa</a>
</x>
and converting it back to the non-XML format yields "aaaaxxz", exactly the same string as the original input.
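The effect of the label can be mimicked in Python with two small functions. They are a hand-written imitation of this one rule for the sake of illustration, not something XSugar generates, and the keep_a flag is a made-up switch between the unlabeled and the labeled variant.

```python
import re

# Non-XML side of the rule: optional a's, then x's, then a literal "z".
PATTERN = re.compile(r"(a*)(x+)z")


def to_xml(text, keep_a):
    """Forward direction; keep_a=True plays the role of the label on A."""
    a, x = PATTERN.fullmatch(text).groups()
    return f"<x>{x}<a>{a}</a></x>" if keep_a else f"<x>{x}</x>"


def from_xml(xml):
    """Backward direction; a missing <a> element falls back to the default."""
    m = re.fullmatch(r"<x>(x+)(?:<a>(a*)</a>)?</x>", xml)
    x, a = m.group(1), m.group(2) or ""  # "" is the default string for [a]*
    return a + x + "z"


lossy = from_xml(to_xml("aaaaxxz", keep_a=False))
strict = from_xml(to_xml("aaaaxxz", keep_a=True))
```

The unlabeled variant comes back as "xxz", while the labeled one returns the original "aaaaxxz".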

Preserving the semantics of the file is the simple bidirectional property. If, in addition, the stylesheet preserves the concrete representation of an input, I call this strict bidirectionality.

Strict bidirectionality can be achieved by labeling unlabeled terminals and adding the corresponding elements to the XML part. I wrote a small prototype of this algorithm that augments the stylesheet accordingly. Hence, any stylesheet can be made strictly bidirectional.

This raises the question: can we statically verify that a stylesheet is strictly bidirectional? Fortunately yes, and it's really simple. We do the basic check that the stylesheet is bidirectional, and then verify that all regular expression terminals are labeled. This way, we are sure that every variable concrete string will be represented in the XML.

Automatic strict bidirectionality for stylesheets and static validation of this property will be useful to provide the behavior a system administrator would expect from a tool that modifies configuration files under Linux. Let's go on!