Thursday, December 9, 2010

CPU history from kernel trace

Everybody is familiar with the CPU usage history, for example in the GNOME System Monitor. We see the average usage over time.


The GNOME monitor polls /proc/stat at regular intervals to compute the average usage. This file is updated by the kernel; the values come from the sampling the kernel does to account for the various processes.
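As a side note, here is a minimal sketch of that sampling approach (my own illustration in Python, not the GNOME code; the field layout of /proc/stat is the standard one, with the idle jiffies in the fourth column):

import time

def read_cpu_times():
    # First line of /proc/stat: "cpu user nice system idle iowait ..."
    with open("/proc/stat") as f:
        fields = [int(x) for x in f.readline().split()[1:]]
    return fields[3], sum(fields)  # (idle, total), in cumulative jiffies

idle1, total1 = read_cpu_times()
time.sleep(1)
idle2, total2 = read_cpu_times()
usage = 1.0 - float(idle2 - idle1) / (total2 - total1)
print("average CPU usage: %.1f%%" % (usage * 100))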

How could we compute the CPU usage history from an actual kernel trace? We only have to look at sched_schedule events. Such an event informs us that, at a given point in time, the kernel switched the task running on a given CPU. If the new task has PID zero, we know it's the kernel's idle thread. By splitting running and idle time into fixed intervals, we can compute the average CPU usage, as the sketch below shows.
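Here is a rough sketch of that computation in Python. This is my own illustration, not LTTV code: it assumes the scheduling events have already been extracted from the trace as (timestamp in seconds, cpu, next_pid) tuples sorted by time.

INTERVAL = 0.100  # seconds per bucket

def cpu_usage(events, trace_end, ncpus):
    busy = {}      # bucket index -> busy seconds, summed over all CPUs
    current = {}   # cpu -> (since, pid) of the currently scheduled task

    def account(since, until, pid):
        if pid == 0:        # PID 0 is the idle thread: not busy time
            return
        t = since
        while t < until:    # split the busy slice across the buckets
            i = int(t / INTERVAL)
            end = min((i + 1) * INTERVAL, until)
            busy[i] = busy.get(i, 0.0) + (end - t)
            t = end

    for ts, cpu, next_pid in events:
        if cpu in current:
            since, pid = current[cpu]
            account(since, ts, pid)
        current[cpu] = (ts, next_pid)
    for since, pid in current.values():
        account(since, trace_end, pid)  # close the last slice of each CPU

    # fraction of the available CPU time used in each bucket
    return dict((i, b / (INTERVAL * ncpus)) for i, b in busy.items())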

Here is a small example done on my dual-core computer. It shows the execution of three cpuburn processes started one second apart. We can see the result in the control flow viewer in LTTV, and the resulting CPU usage view.



We see a peak at the beginning of the trace; it is likely related to the metadata flushing when a trace is started. The two steps that follow correspond to the first two cpuburn processes, and we see that when the third process is started, the CPU usage stays at 100%.

For testing, I added scripts to generate those traces automatically on your computer. The tool is written in Java, and you will need some extra packages to run it; see the README file. Get the code on GitHub:

https://github.com/giraldeau/flightbox

Happy hacking!

Friday, November 12, 2010

Introduction to the Linux Trace Toolkit

The Linux Trace Toolkit next generation (LTTng) is absolutely amazing. In this introduction, we will look at how to interpret the kernel trace of a small program by comparing it to strace. If you want to install the components and record a trace on your own machine, you can get packages for the patched kernel and the other tools for Ubuntu Maverick in LTTng's PPA on Launchpad. You will also need build-essential to compile the small C executable, plus strace and hexdump.

Here is the small program. It opens a file and writes an integer into it. Then it adds numbers in a for loop.
#include <stdio.h>
#define INT_MAX 2147483647
int main(volatile int argc, char **argv) {
    int i = 3;
    FILE *fd = fopen("test.out", "w");
    fwrite(&i, sizeof(int), 1, fd);
    fclose(fd);
    volatile long a = 0;
    int x;
    for (x = 0; x<INT_MAX; x++) {
        a++;
    }
    return 0;
}

Compile it, test it and run it under strace:
$ gcc -o usecase usecase.c
$ ./usecase
$ strace -o usecase.strace ./usecase
OK. Now let's run the program under tracing, but first we have to enable all the trace points in the kernel.
$ sudo ltt-armall
To get a tight window around the command, I'm using the small script trace-cmd.sh (see the end of the post). The script simply starts tracing, runs the program and stops the trace. Invoke it like this (don't forget to make the script executable):
./trace-cmd.sh ./usecase
Now we are ready to launch lttv-gui, the utility to look at the trace. To load the trace, select File->Add trace, then select the "trace-cmd" directory created in the same directory as the trace-cmd.sh script. Then scroll down to the "./usecase" process in the control flow viewer, in the middle of the window. The trace events are displayed under the graph. I will present three phases of the process: first the creation and file access, then the computation phase, and finally the process exit. The trace is rotated to make it more readable. Event names and comments are embedded in the graph; strace events are in red, while the related code is in blue.



In addition to system calls, we can look at the paging of an application and see process starvation. This information is not shown by strace. Another cool side effect is that, contrary to strace, the running program can't detect that it's being traced, so we can trace viruses that would otherwise quit before they can be analyzed. Tracing reports much more information about kernel internals, which helps a lot to understand what's going on. Event timestamps of the kernel trace are in nanoseconds, so the precision is really high, and since trace points are static (simple function calls compiled right into the kernel), the performance impact is ridiculously low, below 3% for the core trace points! I'm running LTTng in the background on my system in flight recorder mode, without noticeable impact.

I will be working on advanced trace analysis for the next three years or so for my PhD. Since LTTng has all the features I need for tracing, I will use these tools. My first goal is to create a CPU usage view. Say that something was eating your CPU a minute ago: with kernel tracing, you would be able to go back in the "flight recorder" history, see what was running at that time, and understand what was going on.

You can help me in two ways. First, by submitting your ideas about cool trace analyses we could perform and/or that you need. Second, you can write to Linus Torvalds and ask him to merge LTTng into mainline. LTTng is the right stuff we need for tracing, and it's a bit of a shame that it is not merged yet while other operating systems have had this for a decade.

Next post will be about TMF, an eclipse plugin for trace analysis! Stay tuned!

Script trace-cmd.sh:
#!/bin/sh

# Abort if no command was given
if [ $# -eq 0 ]; then
    echo "missing command argument"
    exit 1
fi

cmd="$@"

# The trace is written to ./trace-cmd in the current directory
name="cmd"
dir="$(pwd)/trace-$name"
sudo rm -rf "$dir"

# Start the trace, run the command, then stop the trace
sudo lttctl -o channel.all.bufnum=8 -C -w "$dir" "$name"
echo "executing $cmd..."
$cmd
echo "return code: $?"
sudo lttctl -D "$name"

Friday, October 22, 2010

Scary pumpkin

I left my computer for a few minutes to carve the scariest pumpkin for any programmer: the Big O factorial time pumpkin!

Terrifying, isn't it?! Hand carved into a locally grown organic pumpkin, low power lighting with LEDs, edible after use: I'm ready for a happy hacking Halloween!

Friday, October 8, 2010

Augedit: regedit for Linux

A few months ago, a prototype of a GUI for Augeas was posted on the augeas-devel mailing list. I found it awesome, so I wanted to take it further. I think this is the kind of tool that can ease configuration management on Linux, like regedit does on Windows. It still requires knowledge of what you're doing, but at least you don't need to worry about the syntax of a particular config file! Here is a screenshot of the latest version.
On the left pane, there is a GtkTreeView that shows the tree paths and the corresponding values, if any. On the right pane, there is the actual file, where the label and the value are highlighted. At this early stage, the views are read-only, but the goal is to make both sides editable and propagate changes to the other side.
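What the left pane displays is essentially the Augeas tree itself. For readers who want to peek at that tree without the GUI, here is a small example using the python-augeas bindings (the /etc/hosts paths below come from the stock Hosts lens):

import augeas

aug = augeas.Augeas()
for entry in aug.match("/files/etc/hosts/*"):
    ip = aug.get(entry + "/ipaddr")
    host = aug.get(entry + "/canonical")
    if ip and host:  # skip comment nodes
        print("%s: %s = %s" % (entry, host, ip))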

Some ideas of features this kind of tool should have: basic edit operations (add, modify, move, delete, ...), searching, filtering, bookmarks, auto-completion.

If you want to try it, you will need the "myhead" branch of augeas, the up-to-date python-augeas branch "aug_info" that adds the API function for highlighting, and the code of Augedit itself.

http://github.com/giraldeau/augeas
http://github.com/giraldeau/python-augeas
http://github.com/giraldeau/augedit

Leave your ideas about this tool in the comments! ;)

Thursday, September 2, 2010

Windows Phone 7 adventure

I went to the Ottawa Mobile Conference, and I won as a door prize a two-day training on the Windows Phone 7 (WP7) platform at the Microsoft office in Ottawa. The training was provided by Colin Melia from DevTeach, who knows the platform very well. This was the occasion for me to get an idea of what it will look like. With the iOS, Android and RIM devices, is there a place for WP7? What does it stand for? Millions of these devices, made by Samsung, LG, Dell and others, are supposed to hit the market before Thanksgiving in the US, so it has to be good.


WP7 is the Microsoft implementation of the whole mobile concept from Apple. Mobile Market is the equivalent of the App Store, Zune has the same role as iTunes to sync the device, and the basic apps are about the same, except they are based on Microsoft technology: for example, Internet Explorer replaces Safari, and Bing maps and search replace Google. A light version of Office and SharePoint may be free on the phone at release. Xbox Live will be the basis for all games on the device. Interesting fact: IE doesn't ship with the Flash plugin. It seems that Microsoft agrees with Apple on this point.

Concerning the interface, it uses the hub-and-spoke design pattern, which forces the user to focus on one task at a time. A tile view of the apps acts as the hub. There are three mandatory physical buttons on the phone itself: back, start and search. Back goes to the previous view, start goes to the tile view, and search goes to Bing. Multiple processes can't run simultaneously, just like on the iPhone before multitasking. Say you start an app, then go to the tile view and start another app: the first app instance is lost.

You can use two different libraries for an app: Silverlight for straightforward forms, or XNA for games that require heavier video display. We saw mainly how to use Silverlight to design the UI. The application code can be written in C#, using the .NET framework classes. What is really interesting is the clear separation between the logic and the display of the application. It has been thought out in such a way that the designer can work on the view independently of the programmer, who fills the view with data. Some preview data can be supplied to the view at design time, so the designer doesn't need the whole application to work. The UI is completely defined in a XAML file in a declarative manner; a factory then reads the file at runtime and creates the corresponding objects. The .NET library provides a high level of abstraction for XML handling, HTTP and SQL queries, so a lot happens in very few lines of code.

For those of you who may want to give it a try, here are some notes about the installation on a MacBook Pro 1,1. The requirements and installation instructions are accurate. You need the VMX extension for virtualization to get good performance, and it has to be enabled in the EFI boot loader; see the Parallels knowledge base for instructions. Microsoft supplies a small utility to check whether the extension is available. Maybe more critical, though: this MacBook Pro has an ATI X1600 card, and the latest (and last) driver from AMD doesn't support WDDM 1.1, which is required for decent display performance with the phone emulator. Any computer with proper Windows 7 drivers should do; it just means I will need a new computer to run the emulator properly.

In conclusion, Microsoft is jumping into the arena seriously with WP7. The SDK is still quite lean, but it's a great piece of integrated technology AFAIK. Their first step is obviously to copy the recipes of their competitors with their own technology. That's fine, because innovation often comes from inspiration from others. The iPhone is still the blue ocean device, and except for the Office app, there is nothing that WP7 seems to do better, so the price point must be attractive enough to change the market forces. The fate of WP7 is now in the hands of customers.

Tuesday, August 31, 2010

No network, no problem!

I got a brand new smartphone, and the first thing I noticed is that it's worth nothing without the network. Wi-Fi hotspot availability is uncertain; everybody has had to beg someone for access at some point. 3G data is still expensive. Roaming above all: at $20 per megabyte, it could ruin somebody watching a few YouTube videos. And even the 3G network is not always available, in subways for example, which makes your device as useful as a stone.

We already know the solution: a data cache. At least, a device with some type of cache will be useful at some level. I want to carry my own files on any of my devices. This way, we do not rely on the cloud for every lookup, and we can sync over a low-cost network connection when it becomes available. Mail is an obvious example of an asynchronous system that works well. With IMAP and a good mail client, you can get a cache of all your emails, read them when offline, and even compose emails and send them later when the network is back.

Another drastic example of a device that relies only on a cache is the WikiReader, which I bought recently. It has a cache of all 3 million Wikipedia articles on an 8 GB SD card. It can be updated with a computer, or by subscribing to an update service by old-fashioned mail. Never forget that a truck loaded with memory cards has a lot of bandwidth!

Keeping multiple disconnected data sets in sync is harder, though. Yet there are many interesting projects, like the distributed replicated database CouchDB, used in Ubuntu One for online services, which brings my contacts wherever I need them. At the file system level, I'm using Dropbox, which replicates my files on all my devices. Beyond replication, it acts as a continuous backup, so I don't worry about hard drive crashes anymore. Also, the HTML5 cache is a great addition, and can be used to save your precious 3G bandwidth, reduce server load and lower latency. These are examples of disconnection-aware cloud services, which I think are going to take off.

That said, syncing all data makes sense only if the device belongs to the user for a fair amount of time. In the case of shared workstations, for example a school lab, roaming profiles are known to be a disaster, because the data has to be synchronized from scratch multiple times, which makes the cache useless. We may think of smarter synchronization, based on the content the user accesses frequently, fetching other things on demand. But the "keep it simple and stupid" way would be to make users reuse the same computer, or at least the same set of computers, and not flush the cache. Keeping users' files on a shared computer is always tricky for privacy reasons, but encryption based on the user's credentials could be a solution. Another solution would be to let the students own their computers and bring them along. With all those Linux mobile tablets that are going to hit the market soon, I think there will be a great opportunity to provide affordable computers to each and every student. And with 16 GB of flash storage on basic models, there is enough room to cache almost everything! This way, the cache hit rate will be much higher, and users will be happier.

In summary, let's take a GPS as an example. Would you prefer a GPS that downloads all its bitmap maps just-in-time and relies heavily on the network, or a GPS that uses its embedded geo data? The former will not work without the network; the latter is not often updated, or even not at all. What I would consider the best design is a device with both: the mapping data should be embedded and rendered on the device itself, and this sort of cache should be updated each time you get network access. And even if you don't have all the data, you should be able to select areas to cache.

In conclusion, the cloud is so fantastic that we sometimes forget that without a link to it, we're screwed. I consider a device that is not designed to work without the network to be badly designed, and those devices will end up in the Computer History Museum soon.

Sunday, August 8, 2010

Firefox memory management nightmare

Can you believe that you can crash any Linux machine with a web page displaying stamp-sized images, while an iPod can display it without a glitch? Try the test page below with your browser: it displays large images resized down, and watch your memory usage.

WARNING: this page requires ~300 MB of RAM to display!

https://pages.usherbrooke.ca/fgiraldeau/crashff/

OK, what's going on? I wrote a small utility to monitor the memory usage of Firefox, and its memory usage was quite normal. So where did all this memory allocation occur? After a few runs of top, it appeared that the Xorg process was the one blowing up. Here is a graph that shows both processes while loading crashff.
Memory consumption for crashff page loading according to time

The test shows the page loading, opening other tabs and coming back to the original page. The full-scale image is transferred to the X server, which rescales it. On localhost it's fast, but when Firefox is started remotely, all this data has to be transferred over the network! After a while, the image is flushed from the X server, and displaying the tab again requires doing all this work over. And because the X server allocates the memory, if you run out of memory the Out Of Memory (OOM) killer of the kernel will be triggered, but it's likely to kill the Xorg process, which means losing the user's session. This is really critical.
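As an aside, here is a minimal sketch of such a memory monitor (my reconstruction in Python, using only the standard /proc interface; it is not the exact utility used for the graphs):

import time

def rss_kb(pid):
    # Resident set size, as reported by the kernel in /proc/<pid>/status
    with open("/proc/%d/status" % pid) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])  # value in kB
    return 0

def monitor(pids, seconds):
    for t in range(seconds):
        print(t, [rss_kb(pid) for pid in pids])
        time.sleep(1)

# Example: monitor([firefox_pid, xorg_pid], 60)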

There is one optimization that may help with this, and I wanted to know its impact. Setting MOZ_DISABLE_IMAGE_OPTIMIZE to 1 is supposed to change this behavior. Let's run the same experiment with this environment variable set.
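To try it, the variable can simply be set when launching the browser (standard shell syntax):

$ MOZ_DISABLE_IMAGE_OPTIMIZE=1 firefox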

Memory consumption for crashff page loading, with optimization, according to time


OK, so we see that the Xorg process stays at a low level of memory usage; this time, Firefox blows up. At least, if the user hits the memory limit, only the Firefox process gets killed, and the user's session will not crash.

I also did the test with Opera 10.60.6386 and Konqueror 4.4.2, and the result is mostly the same as with Firefox with the optimization. Konqueror uses less memory and doesn't deallocate the tab, so displaying it again is really fast. Here are the graphs.


So MOZ_DISABLE_IMAGE_OPTIMIZE does the trick, and IMHO this option should be the default.

Note: this post is related to this Firefox bug: https://bugzilla.mozilla.org/show_bug.cgi?id=395260

Saturday, July 3, 2010

Mouse safety

For those of you who work long hours in front of a computer, you may encounter pain in your wrist at some point. This may be caused by repetitive movements, which create inflammation of the tendon.

The inflammation can be treated with ice, rest and acetaminophen (paracetamol). But the cause, for me, was the abuse of the scroll wheel. Since I started avoiding scrolling, my hand and wrist pain has diminished progressively.

Hope this trick helps you. Happy hacking!

Wednesday, June 9, 2010

APLWACA what?

I went to the conference on Analysis and Programming Languages for Web Applications and Cloud Applications (APLWACA), held in Toronto, and I wanted to share one cool thing I learned about: Javascript security.

The keynote was about the security of untrusted Javascript that browsers may run. This actually happens because trusted websites can embed code from others, like analytics code to gather statistics and display ads. Here are two examples of those threats.

This code replaces every link in the document with a target you don't trust.
var els = document.getElementsByTagName("a");
for (var i = 0; i < els.length; i++) {
    els[i].href = "http://dangerous-site.com/";
}

Another, bigger threat is that if you have an active session with a trusted server, the embedded Javascript code can interact with your session, like this:
var x = new XMLHttpRequest();
x.open("POST", "/account");
x.send("some nasty command");

To prevent this, all untrusted Javascript code should be wrapped to prevent calls to XMLHttpRequest. Here is a snippet of code that verifies a simple lookup o[s].
var lookup = function(o, s) {
    if (s == "XMLHttpRequest") {
        return "Not allowed";
    } else {
        return o[s];
    }
};

Some implementations of this kind of wrapper exist, like Facebook Javascript, Google Caja and Yahoo ADsafe. But how can we actually prove that these wrappers are safe, and that they really don't allow executing XMLHttpRequest? For example, in our simple lookup, s may be an object: it will not be equal to the searched string, so the check passes and o[s] is evaluated. But then, the s.toString() function is evaluated for the property access, and it can return "XMLHttpRequest"!

The proposed solution is to statically typecheck Javascript. Strings that may evaluate to "XMLHttpRequest" are tracked, and then all the code is checked for unsafe calls that may be made.

This static typechecking has been performed on ADsafe itself. It has revealed one possible exploit so far. The upshot is that it may be possible to prove the wrapper secure.

In the meantime, disabling Javascript may not be a bad idea after all...!

Tuesday, June 1, 2010

Square lens for Augeas

Augeas has the ability to handle XML-like tags. Here is an example of how to do it with the key and del lenses, in a simplified version for the paragraph tag of an HTML document.

let para = [ Util.del_str("<") . key "p" . Util.del_str(">") .
                    store /[a-zA-Z0-9 \r\n\t]*/ .
                    Util.del_str("</p>") ]

That works fine, but there is a gotcha: we have to list all HTML tags as strings, not as a regexp. Let's create a new version of this lens to process arbitrary tags.

let tag = [ Util.del_str("<") . key /[a-z]+/ . Util.del_str(">") .
                 store /[a-zA-Z0-9 \r\n\t]*/ .
                 Util.del_str("</") . del /[a-z]+/ "abc" . Util.del_str(">") ]

Let's see what happens in the three cases: get, put and create.

  • The get will accept arbitrary tags and set the node label accordingly
    • <p>text</p> ---> {"p" = "text"}
  • The put direction, without updating the label, will yield the correct result
    • {"p" = "text"} ---> <p>text</p>
  • In the create operation, in the case of a new label, the close tag will get the default value, which produces a syntax error
    • <p>text</p> ---> {"p" = "text"} ---> {"b" = "text"} ---> <b>text</abc>

Also, this lens accepts malformed tags, like "<a>text</p>", and we should throw an error in this case.

We need the close tag to be linked to the key. The square lens is exactly about that. Let's rewrite the example with the square lens. The lens takes two arguments: first, a regexp that describes the tag and behaves as a key; the other is a lens that represents what's inside the tag.

let content = Util.del_str(">") . store /[a-zA-Z0-9 \r\n\t]*/ . Util.del_str("</")
let tag = [ Util.del_str("<") . square /[a-z]+/ content . Util.del_str(">") ]

The first difference is that now the open and close tags are related. In the get direction, we can test that the second tag is the same as the first, and thus detect syntax errors in the input document. The other difference concerns the create operation in the put direction: now we can copy the key of the node into the close tag, and thus get the correct behavior.

"<p>text</p>" ---> {"p" = "text"} ---> {"b" = "text"} ---> "<b>text</b>"

The first lens to benefit from it will be the httpd.conf lens. Stay tuned, the patch is cooking!

Friday, May 14, 2010

Character feedback for XSugar

In the previous post, we showed that embedding white chars inside the XML is not a viable solution for strict bidirectionality. Automatically pre-processing the modified XML to fix it and make sure it conforms to the stylesheet does not seem possible based on the stylesheet alone, because the XML tree and the AST are orthogonal and not related to each other.

How does it work in Augeas? Augeas is based on lens primitives and combinators. The one we are most interested in here is the del lens. In the get direction, the del lens removes the matched content and stores it in a skeleton. In the put direction, the lens restores the original content from the skeleton, or provides a default value in the case of a new node. The skeleton values are retrieved either by sequence or by key. To understand the difference, let's see what happens in the case of a subtree move. If the skeleton is aligned by sequence, the content deleted by the del lens will be restored in the order in which it appears in the original document; it means that the white chars stick in place while the content moves. In the case of an insert, every record following the insert will be modified and get the previous record's skeleton value, and the last record will get the default value. This behavior doesn't correspond to a minimal edit. A better approach is to use key alignment: the skeleton is matched by key, and the correct original chars are selected even in the case of a move, insert or delete of records.
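To illustrate the difference between the two alignments, here is a toy Python model (mine, not Augeas code) where each record is a (key, whitespace) pair: get splits the records into a tree of keys plus a skeleton of deleted whitespace, and put rebuilds the text using one alignment or the other.

DEFAULT = " "  # default value of the del lens

def get(records):
    tree = [k for k, ws in records]
    skel_seq = [ws for k, ws in records]  # sequence-aligned skeleton
    skel_key = dict(records)              # key-aligned skeleton
    return tree, skel_seq, skel_key

def put_seq(tree, skel):
    # the i-th record reuses the i-th stored whitespace, whatever it is
    return [(k, skel[i] if i < len(skel) else DEFAULT)
            for i, k in enumerate(tree)]

def put_key(tree, skel):
    # each record gets its own whitespace back, wherever it moved
    return [(k, skel.get(k, DEFAULT)) for k in tree]

tree, seq, key = get([("a", "  "), ("b", "    "), ("c", " ")])
moved = ["b", "a", "c"]
print(put_seq(moved, seq))  # whitespace sticks to positions
print(put_key(moved, key))  # whitespace follows the moved records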

The idea behind this principle of operation is to create a feedback of the original document, to reinsert previous values when possible while preserving modifications.

It's possible to do something similar in XSugar by working at the AST level. In the rest of this post, I present the principle of operation, a prototype with a basic algorithm, the results obtained, and conclusions about this solution.

The goal is to get back the original white chars in the updated document, without embedding them in the XML document. By using the strict version of the stylesheet, where every item is labeled, the resulting AST preserves every parsed char, and we call strict nodes those that contain chars that would otherwise have been lost. We can then use this strict AST to match nodes between the updated XML and the original document, and copy only the strict elements from the original to the updated document.

The matching algorithm used to test this hypothesis works as follows: for each node of the first tree, search the second tree for a node where every child's label and text match, except for strict nodes. The algorithm traverses the AST in level order and has a complexity of O(n^2).
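Here is a sketch of this matching, reconstructed from the description above (it is not the prototype's actual code): nodes carry a label, a text, children and a strict flag, and the signature of a node is the label and text of its non-strict children.

class Node(object):
    def __init__(self, label, text="", children=None, strict=False):
        self.label, self.text = label, text
        self.children = children or []
        self.strict = strict

def signature(node):
    return [(c.label, c.text) for c in node.children if not c.strict]

def level_order(root):
    queue = [root]
    while queue:
        node = queue.pop(0)
        yield node
        queue.extend(node.children)

def merge(updated, original):
    candidates = list(level_order(original))
    for node in level_order(updated):      # O(n) nodes...
        for cand in candidates:            # ...times O(n) candidates
            if signature(cand) == signature(node):
                # copy the strict chars of the original onto the match
                for new, old in zip(node.children, cand.children):
                    if new.strict and old.strict:
                        new.text = old.text
                candidates.remove(cand)    # match each original node once
                break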

Here is the expected behavior of this simple algorithm.
  • Round trip without updating content should restore all original values.
  • Inserted nodes should get default values.
  • Updated nodes should get default values, because they will not match any previous value. They are handled the same way as an insert.
  • Moved nodes should get their original values.
  • Deleted nodes should not render any previous associated value.

To test and highlight the behavior, I slightly modified the students.xsg example. Here is the format of the records that the stylesheet parses:

( [0-9]+ [z]+ [0-9]+ [a]+ )*

Hence, two numbers are separated by a space and multiple records are separated by a new line, except that here "z" plays the role of the space and "a" plays the role of the new line. The default number of z's and a's is one; to expose the behavior, we use two or more chars so we can detect when a default value is inserted.

A parse tree, and a match between the original document and an updated document in which a record has been deleted, are shown in the following figure.


Here are the results of tests.

Test 1: round trip without modifications

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a11z11a111z111a
merged:  1zz1aa11zzz11aaa111zzzz111aaaa

All characters are recovered, and the resulting document is exactly the same as the input.

Test 2: insert

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a11z11a99z88a111z111a
merged:  1zz1aa11zzz11aaa99z88aaaa111zzzz111a

The new record "99 88" gets a default "z", while the last record gets a default "a".

Test 3: move

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a111z111a11z11a
merged:  1zz1aa111zzzz111aaa11zzz11aaaa

The moved record gets the right number of "z", but the "a" chars keep their order.

Test 4: delete

input:   1zz1aa11zzz11aaa111zzzz111aaaa
default: 1z1a111z111a
merged:  1zz1aa111zzzz111aaa

The central record is deleted, and the last record gets 3 a's instead of 4.

From these results, we can see that the strict z elements are retrieved by key, and the a's by sequence. This is caused by the fact that branch nodes having only one strict leaf node are matched by their level in the tree. We can't determine whether such a node should be matched to its parent, or is itself a parent of other nodes; in consequence, sequence match is the only remaining possible match. As far as I understand, other tree matching algorithms can't address this issue either.

The algorithm could be improved to match nodes by leaf node string similarity, using the Levenshtein distance. It might then be possible to restore chars from updated nodes. But then the match would depend on the number of changes made to the document, and a node might match another, more similar one; this may confuse the user with an apparently non-deterministic behavior, and may produce non-minimal edits.

In conclusion, we showed that it's possible to use the original document to merge back otherwise-lost chars. But the alignment of the original content is mixed between sequence and key, and the user can't control which one applies. It may lead to non-minimal edits of the resulting file. In that respect, the lens primitives and combinators used in Augeas provide a more predictable behavior, by letting the lens developer choose explicitly between key and sequence alignment.

Friday, April 9, 2010

Limitations of XSugar

In previous posts, I explained how XSugar can be used to transform configuration files to XML and back. Preserving white characters (space, tab, carriage return, etc.) is possible by creating special elements to hold them. In a simple round trip, when the XML document is not modified, it works well. But there are issues when the XML document is modified. For example, consider this simple XML with strict nodes, from the students example of XSugar.

<students xmlns="http://studentsRus.org/">
<student sid="19701234">
<name>John Doe</name>
<email>john_doe@notmail.org</email>
</student>
<space1> </space1>
<space2> </space2>
<newline>
</newline>
<student sid="19785678">
<name>Jane Dow</name>
<email>dow@bmail.org</email>
</student>
<space1> </space1>
<space2> </space2>
<newline>
</newline>
</students>

There are space and newline elements following each student record. They are used to restore the exact formatting of the original string. We can then modify an existing value, spaces will be preserved, and this corresponds to a minimal edit.

But say another record is inserted after the first student, yet before the space1 element: then the first record will see its spacing reset to default values, and the newly inserted element will get the spaces from the first record! This doesn't correspond to a minimal edit, and is similar to the alignment problem identified by Foster et al. in Boomerang.

If a student element is removed but the space elements are left, it will cause a syntax error according to the XSugar stylesheet. If an element is moved, it must be inserted at the right place in the XML, otherwise a syntax error may occur.

It may be possible to fix the XML before the transformation to non-XML, by adding, moving or deleting space elements. But the idea of using the original document and the modified one to produce the updated output is precisely what a lens does, and is exactly what Augeas does. And now, with the introduction of the recursive lens, Augeas is about to support context-free languages, which opens the door to handling the Apache web server configuration, and many others.

That's all for now, stay tuned!

Thursday, March 11, 2010

Static validation of strict bidirectionality

These days, I had some fun validating XSugar stylesheets for their strict bidirectionality property. XSugar can convert non-XML documents to XML documents, and vice versa. In theory, you get back exactly the same file after a round trip to and from XML. But there are a few tricky things that prevent it from being exactly the same. XSugar statically verifies the reversibility of a stylesheet, by verifying that both grammars form a bijection. It means that there is no ambiguity, so there will always be exactly one way to do the conversion. But not all stylesheets that are valid in this respect will behave correctly when performing an actual transformation. The problem is linked to the fact that some white spaces may be lost in the transformation. To understand the problem, I will use a concrete example.

Say you have the following file and want to convert it to XML. There is a block of number lines followed by word lines. Elements are separated by spaces, and there may be spaces before and after each line.

  123  456  789  
  987  654  321  
  abc  def  ghi  
  ihg  fed  cba  


We will convert this file to the following XML.

<?xml version="1.0" encoding="UTF-8"?>
<a xmlns:b="http://udes.ca/noesis">
  123 456 789
987 654 321

  <z>
    <y>abc</y>
    <y>def</y>
    <y>ghi</y>
  </z>
  <z>
    <y>ihg</y>
    <y>fed</y>
    <y>cba</y>
  </z>
</a>


The stylesheet looks like this...

xmlns:b = "http://udes.ca/noesis"

NUM = [0-9]+ (MAX)   /* numbers */
ALPHA = [a-zA-Z]+ (MAX) /* letters */
NL   = \n|\r|\r\n   /* mandatory new line */
SP = [ \t]+ (MAX)   /* mandatory white space */
OSP = [ \t]* (MAX)   /* optional white space */

/* top-level rule : define the whole document */
a : [xs x] [ys y] = <a> [xs x] [ys y] </>

/* sequence of lines of numbers separated by space, with optional leading and ending white space  */
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] =  [ns n1] [NL] [xs n2]
: =

/* numbers sequence */
ns : [NUM num] [SP] [ns n] = [NUM num] [SP] [ns n]
: [NUM num] = [NUM num]

/* sequence of lines of words separated by space, with optional leading and ending white space */
ys :[OSP] [y y1] [OSP] [NL] [ys y2] = <z> [y y1] </> [ys y2] 
: = 

/* sequence of <y> elements that hold actual words */
y : [ALPHA alpha] [SP] [y y1] = <y> [ALPHA alpha] </> [y y1]
: [ALPHA alpha] = <y> [ALPHA alpha] </>


... and it validates well.

$ xsugar -b strict-test3.xsg
Transformation is guaranteed to be reversible!


But when we actually try to convert the XML back to the non-XML format, we get a parse error at char 74. When activating the debug mode of XSugar, we can see the result of the XML normalization, a preparation step done before parsing.

Normalized input:
<a>123 456 789
987 654 321<z><y>abc</><y>def</><y>ghi</></><z><y>ihg</><y>fed</><y>cba</></></>


We see that the spaces and carriage returns before and after the numbers text node are lost. This is because text interleaved with other elements is trimmed (spaces inside the text survive the trim); this doesn't happen when an element contains only text. The parse error occurs because the last mandatory new line is dropped. Let's fix the xs rule.

/* sequence of lines of numbers separated by space, with optional leading and ending white space  */
xs : [OSP] [ns n1] [OSP] [NL] [xs n2] =  [ns n1] [NL] [xs n2]
: [OSP] [ns n1] [OSP] [NL] = [ns n1]


This makes the last carriage return optional, and parsing succeeds. Let's look at the diff between the result and the original file.

$ diff -u strict-test.txt strict-test3.txt.back
--- strict-test.txt 2010-03-05 15:33:52.043893254 -0500
+++ strict-test3.txt.back 2010-03-12 00:23:12.482905316 -0500
@@ -1,4 +1,4 @@
-  123  456  789  
-  987  654  321  
-  abc  def  ghi  
-  ihg  fed  cba  
+123 456 789
+987 654 321
+abc def ghi
+ihg fed cba


The content is there, but the diff shows that every line is modified. This is because some terminals are not labeled. When this is the case, their values are lost, and an example string is generated on the way back when there is no corresponding item. It explains why the spaces are all reset to their default values: the empty string for the optional white space, one space for the mandatory white space, and the default carriage return.

Hence, we have to look at two things to make sure stylesheets are strictly bidirectional.

First, the easy part: all items in the stylesheet must be labeled. This way, we can be sure the XML document will contain all the chars from the input, and no information will be lost. We only have to traverse the stylesheet and make sure that every RegexpTerminal is labeled.
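As a sketch, this check could look like the following (hypothetical Python, while XSugar itself is written in Java and its real AST classes differ):

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RegexpTerminal:
    regexp: str
    label: Optional[str] = None

@dataclass
class Production:
    name: str
    items: List[object]

def unlabeled_terminals(productions):
    return [(p.name, it.regexp)
            for p in productions
            for it in p.items
            if isinstance(it, RegexpTerminal) and it.label is None]

# Example: in "ns : [NUM num] [SP] ...", the SP terminal carries no label.
ns = Production("ns", [RegexpTerminal("[0-9]+", "num"), RegexpTerminal("[ \\t]+")])
print(unlabeled_terminals([ns]))  # [('ns', '[ \\t]+')]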

Second, interleaved text nodes must not begin or end with white characters [ \t\r\n], otherwise those chars will be lost. This is harder to check. First, we must determine from the stylesheet whether an element contains text nodes, elements, or both. Determining this for StringTerminal, RegexpTerminal and Element objects is straightforward: the first two generate text, while the other yields elements. But this is non-trivial for Nonterminal objects, because of their recursive nature. It is a chicken-and-egg problem: to determine the actual possible values of a nonterminal, we have to compute that nonterminal and substitute its actual value into a recursive rule, which produces a loop. One solution to this problem is to compute the regular approximation of the XML stylesheet rules. It yields, for each nonterminal, an automaton that accepts a superset of the XML content that the stylesheet allows. Here is an example for the rule y.



We can tell which kind of content a nonterminal has by looking at the transitions of the automaton. In the last example, we know for sure that the nonterminal will yield only elements. The last step is to verify whether the beginning and end of the interleaved text content may overlap with white chars. Here again, the regular approximation of the nonterminal is used, but this time on the grammar itself, together with the overlap operator used for horizontal ambiguity detection. If an overlap such as "WHITE <-> TEXT <-> WHITE" exists, information loss may occur.

With this new information, developers can adjust the stylesheet to label items and enclose text inside elements, preventing information loss.

In summary, strict bidirectionality is required to prevent information loss in a transformation round trip. Information loss can be caused by unlabeled items in the stylesheet, or by the trimming of interleaved text nodes. A method to detect areas that may not be strict, based on the regular approximation of the stylesheet and the grammar, using the overlap operator, is proposed. Results for different stylesheets will be given soon. Stay tuned!

Saturday, February 6, 2010

Why iPad is Evil

Apple just launched the iPad last week, and I was a bit shaken by it. It's a piece of art in human interaction design, and it changes the relationship we can have with a computer, more ubiquitous than ever. The device itself is amazing on the technical level, and sets the bar for competitors.

But let's think about the concept as a whole. Apple provides stores for music, books and applications. In fact, this device is a personal buying device, where you can buy only content provided by Apple and its partners. It's a kind of perfect world, where the consumer is captive like never before, a garden of pure ideology, in which each one may bloom, secure, away from the pests, and... wait a minute! Doesn't it make you think of 1984?

http://www.youtube.com/watch?v=OYecfV3ubP8

What a closed world we would live in if Apple dominated the market: closed hardware, closed software, and being forced to buy content from Apple!

So, I say no to the iPad, and I will wait for a truly open platform.