jeudi 11 mars 2010

Static validation of strict bidirectionality

These days, I had some fun to validate XSugar stylesheet for their strict bidirectional property. XSugar can convert non-XML documents to XML documents, and vice-versa. In theory, you get back exactly the same file if it does a round-trip to and from XML. But, there are few tricky things that prevent it to be exactly the same. XSugar verify statically the reversibility of a stylesheet, by verifying that both grammars are bijections. It means that there is no ambiguity, and so there will be always one way to do the conversion. But, not all stylesheet that are valid in this respect will yield correct behavior when performing actual transformation. The problem is linked to the fact that some white spaces may be lost in the transformation. To understand the problem, I will use a concrete example.

Say you have the following file, and want to convert it to XML. There are a bloc of numbers lines followed by words lines. Elements are separated by space, and there may be spaces before and after each lines.

  123  456  789  
  987  654  321  
  abc  def  ghi  
  ihg  fed  cba  

We will convert this file to the following XML.

<?xml version="1.0" encoding="UTF-8"?>
<a xmlns:b="">
  123 456 789
987 654 321


The stylesheet looks like this...

xmlns:b = ""

NUM = [0-9]+ (MAX)   /* numbers */
ALPHA = [a-zA-Z]+ (MAX) /* letters */
NL   = \n|\r|\r\n   /* mandatory new line */
SP = [ \t]+ (MAX)   /* mandatory white space */
OSP = [ \t]* (MAX)   /* optional white space */

/* top-level rule : define the whole document */
a : [xs x] [ys y] = <a> [xs x] [ys y] </>

/* sequence of lines of numbers separated by space, with optional leading and endin white space  */
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] =  [ns n1] [NL] [xs n2]
: =

/* numbers sequence */
ns : [NUM num] [SP] [ns n] = [NUM num] [SP] [ns n]
: [NUM num] = [NUM num]

/* sequence of lines of words separated by space, with optional leading and ending white space */
ys :[OSP] [y y1] [OSP] [NL] [ys y2] = <z> [y y1] </> [ys y2] 
: = 

/* sequence of <y> elements that hold actual words */
y : [ALPHA alpha] [SP] [y y1] = <y> [ALPHA alpha] </> [y y1]
: [ALPHA alpha] = <y> [ALPHA alpha] </>

... and it validates well.

$ xsugar -b strict-test3.xsg
Transformation is guaranteed to be reversible!

But, when we actually try to convert the XML back to the non-XML format, we get a parse error on char 74. When activating the debug mode of XSugar, we can see the result of XML normalization, which is a preparation step done before parsing.

Normalized input:
<a>123 456 789
987 654 321<z><y>abc</><y>def</><y>ghi</></><z><y>ihg</><y>fed</><y>cba</></></>

We see that spaces and carriage return before and after the number text node are lost. This is because the interleaved text with other element is trimmed. Spaces inside the text are preserved by the trim. This doesn't occur when an element contains only text. The parse error occur because the last mandatory new line is dropped. Let's fix the xs rule.

/* sequence of lines of numbers separated by space, with optional leading and endin white space  */
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] =  [ns n1] [NL] [xs n2]
: [OSP] [ns n1] [OSP] [NL] = [ns n1]

This makes the last return carriage optional, and parsing succeed. Let's look at the diff between the result and the original file.

$ diff -u strict-test.txt strict-test3.txt.back
--- strict-test.txt 2010-03-05 15:33:52.043893254 -0500
+++ strict-test3.txt.back 2010-03-12 00:23:12.482905316 -0500
@@ -1,4 +1,4 @@
-  123  456  789  
-  987  654  321  
-  abc  def  ghi  
-  ihg  fed  cba  
+123 456 789
+987 654 321
+abc def ghi
+ihg fed cba

The content is there, but the actual diff shows that every lines are modified. This is because some terminals are not labeled. When this is the case, their values are lost, and example string is generated back when there is not corresponding item. It explains why spaces are all reset to their default values, the empty string for the optional white space, one space for the mandatory white space, and the default carriage return.

Hence, we have to look at two things to make sure stylesheets are strictly bidirectional.

First, the easy part, all items in the stylesheet must be labeled. This way, we can be sure the XML document will contain all chars from the input, and no information will be lost. We only have to traverse the stylesheet and make sure that every RegexpTerminal are labeled.

Second, interleaved text nodes must not begin or end with white characters [ \t\r\n], otherwise those chars will be lost. This is harder to do. First, we must detect if an element contains text nodes, elements, or both, from the stylesheet. Determining this for StringTerminal, RegexpTerminal and Element objects is straight forward. First two generate text, while the other yield elements. But, this is non-trivial for Nonterminal objects, because of the recursive nature of these items. The problem is a chicken and egg problem, because to determine the actual possible values for a nonterminal, we have to compute this nonterminal and substitute it's actual value in a recursive rule, which produces a loop. One solution to this problem is to compute the regular approximation of the XML stylesheet rules. It yields an automaton for each nonterminal that accepts a superset of the XML content than the stylesheet allow. Here is an example for the rule y.

We can know which kind of content this nonterminal have by looking at the transitions of the automaton. In the last example, we know for sure that the nonterminal will yield only elements. The last step is to verify that the interleaved text content beginning and end may overlap with white chars. Here again, regular approximation of nonterminal is used, but this time on the grammar itself. The overlap operator, used for horizontal ambiguity detection, is used. If such overlap exists, such as "WHITE <-> TEXT <-> WHITE", information lost may occur.

With this new information, developers can adjust the stylesheet to label items and enclose text inside elements and prevent information lost.

In summary, strict bidirectionality is required to prevent lost information in a transformation round-trip. Information lost can be caused by unlabeled items in the stylesheet, or because of trimming of interleaved text nodes. A method to detect areas that may not be strict, based on regular approximation of the stylesheet and the grammar, using the overlap operator, is proposed. Results from different stylesheets will be given soon. Stay tuned!