Multivax: 2009

Wednesday, December 23, 2009

Strict reversibility in XSugar

XSugar is a tool for bidirectional transformations between two file formats. This is particularly useful for providing a common API to configuration files under Linux. For example, here is the result of applying a stylesheet to the /etc/hosts file:

<hosts xmlns="http://usherbrooke.ca/">
<record>
<ipaddr>127.0.0.1</ipaddr>
<canonical>localhost</canonical>
</record>
</hosts>

This file can be converted back to its flat format. But, as you may notice, indentation doesn't appear in the XML file, and will be lost: spacing is reset to a default value. The round trip between the hosts file and the XML format keeps the semantics but loses the formatting. Even without modification, if the file is written back, diff will show changes. Once spaces have been reset, the round trip becomes the identity function, i.e. strings will be exactly the same.

One way to overcome this problem is to add to the XML all the elements that would otherwise be lost. This can be done by labeling terminal elements and adding the corresponding nodes to the XML part of the stylesheet. For example, this rule loses the optional "a" header:

A = [a]*
X = [x]+
n : [A] [X x] "z" = <x> [X x] </x>

Providing the input "aaaaxxz" gives the following XML:
<x>xx</x>
Converting it back to non-XML yields the string "xxz". Since the empty string matches "[a]*", it is used as the default for the missing header.

Now, let's label the terminal "A":

A = [a]*
X = [x]+
n : [A a] [X x] "z" = <x> [X x] <a> [A a] </a></x>

This time, we get the XML:
<x>
xx
<a>aaaa</a>
</x>
and converting it back to the non-XML format yields "aaaaxxz", exactly the original input.
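
The difference between the two rules can be imitated in a few lines of Python. This is only a sketch, not XSugar code: the regular expression stands in for the grammar, and the names `parse_lossy`, `parse_strict` and `unparse` are made up for the illustration.

```python
import re

# Hypothetical stand-in for the rule n: optional "a" header, some "x", then "z".
PATTERN = re.compile(r"(?P<a>a*)(?P<x>x+)z")

def _match(text):
    m = PATTERN.fullmatch(text)
    if m is None:
        raise ValueError("input does not match the grammar")
    return m

def parse_lossy(text):
    """First rule: the [A] terminal is unlabeled, so its text is dropped."""
    return {"x": _match(text).group("x")}

def parse_strict(text):
    """Second rule: [A a] is labeled, so the header is kept in the tree."""
    m = _match(text)
    return {"x": m.group("x"), "a": m.group("a")}

def unparse(tree):
    """Rebuild the flat string; a missing header defaults to the empty string."""
    return tree.get("a", "") + tree["x"] + "z"

print(unparse(parse_lossy("aaaaxxz")))
print(unparse(parse_strict("aaaaxxz")))
```

The first call prints "xxz" and the second prints "aaaaxxz", matching the two round trips described above.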

Preserving the semantics of the file is the simple bidirectional property. If, in addition, the stylesheet preserves the concrete representation of an input, I call this strict bidirectionality.

Strict bidirectionality can be achieved by labeling every unlabeled terminal and adding the corresponding element to the XML part. I wrote a small prototype of this algorithm that augments a stylesheet accordingly. Hence, any stylesheet can be made strictly bidirectional.

This raises the question: can we statically verify that a stylesheet is strictly bidirectional? Fortunately yes, and it's really simple. We do the basic check that the stylesheet is bidirectional, then verify that all regular expression terminals are labeled. This way, we are sure that every variable concrete string is represented in the XML.
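
The second half of that check (every regular expression terminal is labeled) could look roughly like the sketch below. The `Item` data model is an assumption made for the example and has nothing to do with XSugar's internal representation; the basic bidirectionality check is not covered here.

```python
from typing import NamedTuple, Optional

class Item(NamedTuple):
    """Hypothetical model of one item on the non-XML side of a rule."""
    kind: str                    # "regex" for [A]-style terminals, "literal" for "z"
    name: str
    label: Optional[str] = None  # the label in [A a], or None for a bare [A]

def unlabeled_regex_terminals(rule):
    """Names of regex terminals whose matched text would be lost."""
    return [item.name for item in rule
            if item.kind == "regex" and item.label is None]

def is_strict(rules):
    """True when no rule contains an unlabeled regex terminal."""
    return all(not unlabeled_regex_terminals(rule) for rule in rules)

# The two versions of rule n from the examples above.
lossy = [Item("regex", "A"), Item("regex", "X", "x"), Item("literal", "z")]
strict = [Item("regex", "A", "a"), Item("regex", "X", "x"), Item("literal", "z")]

print(unlabeled_regex_terminals(lossy))
print(is_strict([strict]))
```

Literal terminals such as "z" need no label, since their text is constant and can always be regenerated.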

Automatic strict bidirectionality for stylesheets, and static validation of this property, will be useful to provide the behavior a system administrator expects from a tool that modifies configuration files under Linux. Let's go on!

Friday, December 18, 2009

Bcfg2 vs. Puppet

Bcfg2 and Puppet are two tools for system configuration management, where we want to install configuration files and packages and adjust permissions. It sounds simple put like this, but it's far from trivial when the target machines run various operating systems and versions.

Bcfg2 and Puppet, while they share the same goal, are completely different in their concepts, that is, in how the desired state is modeled. The fundamental concept here is more important than the actual implementation.

Puppet is like a programming language for system administration. It's another meta-language, with its own syntax, to indicate which files, packages, users and such should be present on a system. Its concepts are very close to Cfengine's. Well, I used Cfengine a lot, even developing a small utility to automate the management of files under Subversion (svnengine). But while using this tool, I came to one simple conclusion: what a mess for so little! I ended up doing everything in scripts, which are copied by cfengine and run from cron. Big deal.

Bcfg2 is different because, unlike with Puppet, you don't program configuration elements; they are declared in an XML document. The client, specific to a target, encapsulates the actions on that target, their sequencing, error handling, etc. This enables advanced behavior, because the model can be manipulated programmatically, which is not the case for Puppet. If you want to output a Puppet manifest, you have to output strings, which is a Middle Ages practice!

The Bcfg2 implementation itself has some glitches that must be fixed, but the concept behind it is a building block for semantic system configuration management, which represents the future. Bcfg2 belongs to the scientific field, while I think that Puppet has the exposure it has now because of a marketing wave.

But, as history reminds us, it's not always the best product or software that wins...

Saturday, October 17, 2009

Pragmatic testing

It's certainly a matter of fact that good practice in software development includes automated testing. Developing without unit tests is a Stone Age practice. I have read a few books about testing, and many of them are about writing bureaucratic plans, not code. Reading the documentation of a unit testing framework doesn't tell you much about how to design tests. Here are a few lessons from my experience.

Automated tests should be easy to write and run. If they are hard to run, or require a special manual setup, they will be run infrequently. If you need a database or a daemon, it should be set up and launched automatically on localhost by the test scripts. A default config that just works should be provided to avoid configuration mess. Document the packages, from known archives, needed to run the tests; they may differ from those needed to run or compile the software.
Watch out for AppArmor when launching a daemon from unusual locations: it may fail to start because of security restrictions. Disabling it may be necessary in some conditions.
If the program operates on files, create them in a temporary directory that is flushed at each run. A program that copies files is testable: execute the operation and verify that the files are where they should be.
Running tests should be fast. If testing takes a few minutes each time and you are interested in the result of only one test, it's not efficient. A simple solution is to provide a way to run only a subset of the tests.
Running tests should usually not require root access, except for special purposes. If you bind ports, use port 1024 or above to avoid restrictions. Don't use global settings; use relative paths to settings and files. If you get data from a system library, you can simulate it by providing objects with the same interface.
Software always operates on data. You will need test data that simulates possible inputs. If your application reads input files, those files should be kept beside the test code.
A test should not depend on a previous test succeeding; tests should be independent. In other words, the result should not depend on the order in which tests are executed. To make sure tests are not linked together, always start from a known state. That means reloading initial data for each test, recreating objects and such. This is conveniently done with the setup and teardown methods usually available in unit testing frameworks.
Test Driven Development is about creating the test and the code for some functionality together. This is the basic thing to do to make sure it at least works. Testing can go further for critical code, by trying to break it and using it in ways the programmers may not have thought of. This kind of test needs a different point of view, and should be done by someone else.
Don't depend on external services. The test environment should be self-contained, and running tests should be possible without network access. This may not be possible in all situations; for example, if the application interacts with a Google API, we can assume that the service will be available, and anyway we can't reproduce the service on localhost. But providing local servers when possible limits the dependencies needed to run tests and avoids network delays.
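
Several of these points (temporary directory, known starting state, independent tests, no root access) fit in one short example. Here is a minimal sketch with Python's standard `unittest` module; `copy_file` is a hypothetical function under test, invented for the illustration.

```python
import shutil
import tempfile
import unittest
from pathlib import Path

def copy_file(src, dst):
    """Hypothetical function under test: copy src to dst."""
    shutil.copyfile(src, dst)

class CopyFileTest(unittest.TestCase):
    def setUp(self):
        # Known state: a fresh temporary directory and input file for every test.
        self.workdir = Path(tempfile.mkdtemp())
        self.src = self.workdir / "input.txt"
        self.src.write_text("127.0.0.1 localhost\n")

    def tearDown(self):
        # Flush everything so the next test cannot depend on leftovers.
        shutil.rmtree(self.workdir)

    def test_copy_creates_destination(self):
        dst = self.workdir / "output.txt"
        copy_file(self.src, dst)
        self.assertTrue(dst.exists())
        self.assertEqual(dst.read_text(), self.src.read_text())

    def test_workdir_starts_clean(self):
        # Passes in any order, because setUp rebuilds the directory each time.
        names = sorted(p.name for p in self.workdir.iterdir())
        self.assertEqual(names, ["input.txt"])
```

Saved as test_copy.py (an arbitrary file name), the whole suite runs with `python -m unittest test_copy`, and a single test with `python -m unittest test_copy.CopyFileTest.test_copy_creates_destination`, which covers the "run only a subset" advice.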

I hope this advice will help you build better tests!

Sunday, October 4, 2009

Django, soooo cool

A long time ago, I wrote a small form in PHP to feed data into a database. I created the HTML form and the database schema, and wrote small functions to validate the form, all inside one PHP page. SQL statements, field names and everything else were hardcoded, and it was a nightmare to test and maintain. I felt like a prehistoric man trying to survive with rudimentary tools. Well, those were the old days.

Now, I had a revelation while using Django! Don't lose your time with other crappy frameworks and use Django! Django contains all the pieces needed for the gory details of web development, like session management, forms, database and templating. It's so powerful: you write your model once, and Django generates the SQL statements for the database and the HTML form to add or modify data, with data validation as a bonus, in a few lines of code. The documentation is wonderful.

There are days like this when I really feel that technology is getting better, and Django is a wise and beautiful piece of work.

Look at www.djangoproject.com

Wednesday, September 30, 2009

Java and system administration

Java is a wonderful platform for software development: it's mature, stable and feature-rich. In the past I had a bias against Java for many reasons, as did my coworkers, which led us to dismiss it a few times. Now that I'm working with Java for Noesis, I have had time to revisit this technology. Here is the bias deconstruction.

The JVM is slow: we don't know at first whether a given application will run slower or faster in C++ or in Java, and I don't want to dig into benchmarking. So many factors can produce one winner or the other. We know for sure that there is some overhead in running a Java program, but for many applications raw throughput is not a primary concern. In the scientific computing field, I have seen code that was not optimized and took a huge time to run; optimizing it led to a greater performance improvement than a technology switch would have. Also, since large libraries are available for Java, reuse reduces development time, and development may represent a large portion of the total time spent. Startup time is annoying, but it is mostly paid the first time, and can be reduced by read-ahead caching.
Java is memory hungry: Java applications use a lot of memory because they do a lot of things.
Java was proprietary: I used to exclude Java because it was not open source, so any open source Java software had a non-free dependency. Free JVMs and libraries were available, but you were on your own when using them. Since the JDK is now open, we are free to use it.
Java is complicated: there is a learning curve, but it's not so bad. Eclipse helps a lot to reduce the burden thanks to type checking. Compiling with "ant" is so much easier than autotools madness! And since CDBS has support for ant, it's easy to package a Java program.
Java is huge: installing the headless JRE on a minimal Ubuntu requires about 93 MB of disk space. I agree that if you run only one application, it's huge. But if we were mainly running Java software, this stack would be shared by all applications, and here too reuse lowers the disk requirement. For Noesis, though, it's a good question to ask: will system administrators be willing to install about a hundred MB of stuff for it? 100 MB of disk space costs about 1 cent today; with backups and management time, say 5 cents. Well, I guess it would be the best investment we can make with a nickel. Still, I will have to work hard to convince system administrators to install a JRE on a Linux virtual server that takes less than a few hundred MB, because it's a large proportion of the server install size.
What about embedded devices? It could be of practical interest to get Noesis running on a small device. There is JME, but since it's stripped down, I don't know for sure whether something would break.

It's lucky that XSugar was written in Java, because Java is definitely an important technology in the whole picture, and a new arrow in my quiver.

Monday, September 28, 2009

Brics projects packages

New packages from the Brics projects are now available for Ubuntu. This is a snapshot of the dependencies required for bidirectional configuration file parsing. The main packages are:

automaton : library for finite automata
grammar : library and utility for grammar validation and handling
xsugar : main program for bidirectional transformations

You can find them on Launchpad:

http://launchpad.net/~francis-giraldeau/+archive/noesis

Take care, names are subject to change. Have fun!

Thursday, September 10, 2009

Configuration as XML

Augeas provides an XPath-like interface and API to access and modify configuration files. But there are a few limitations:

Augeas is limited to regular grammars; it can't parse nested structured documents. I looked into whether regular approximations of context-free grammars could be used, but in this case it's not possible. A regular expression parser uses only a finite automaton, and for a general context-free grammar we need a stack to keep track of the nesting level of the document.
Augeas doesn't provide a complete XML file from a configuration file, and hence can't take advantage of all the XML processing libraries available.
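
To see why a stack is needed, here is a small Python sketch that checks the nesting of Apache-style sections. It is only an illustration of the stack idea, not an Apache parser; the `TAG` pattern and the function name `check_nesting` are made up for the example.

```python
import re

# Matches an opening "<Name ...>" or closing "</Name>" section line.
TAG = re.compile(r"<(/?)(\w+)[^>]*>")

def check_nesting(lines):
    """Return the maximum nesting depth, or raise ValueError on a mismatch."""
    stack = []
    deepest = 0
    for line in lines:
        m = TAG.match(line.strip())
        if m is None:
            continue  # ordinary directive, no effect on nesting
        closing, name = m.groups()
        if closing:
            # The stack remembers which section must be closed next.
            if not stack or stack.pop() != name:
                raise ValueError("unexpected closing tag: " + name)
        else:
            stack.append(name)
            deepest = max(deepest, len(stack))
    if stack:
        raise ValueError("unclosed section: " + stack[-1])
    return deepest

sample = [
    "<VirtualHost *:80>",
    "  DocumentRoot /var/www",
    "  <Directory /var/www>",
    "    AllowOverride None",
    "  </Directory>",
    "</VirtualHost>",
]
print(check_nesting(sample))
```

The regular expression only recognizes a single line; matching each closing tag with its opening tag at arbitrary depth is the job of the stack, which is the part a finite automaton cannot do.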

We need a parser that is able to parse a general context-free grammar. Grammars should be easy to write, and LR or LALR parsers are too hard to use, since the grammar must be written to avoid ambiguities and some types of recursion. The Earley parsing algorithm meets these requirements.

The XSugar project is exactly what I was looking for. First, it implements a tokenless Earley parser that has relatively acceptable performance on config files. XSugar is able to do bidirectional transformations between a concrete file and an XML document. There are a few issues that must be resolved.

The bidirectional relation doesn't preserve the formatting of the config file. The reversibility property is hence approximate: a round trip yields the same result except for spaces and indentation. You have to preserve formatting manually, and this can be tedious. There is no way to verify that the stylesheet captures every character of the input. Strict bidirectionality is required for config files.
Ignorable elements, like nodes that keep spaces and indentation, have to be present in the XML file, otherwise the unparsing fails. The problem is that clients that modify the XML have to add formatting nodes. One of the main benefits of using XML was to abstract formatting away, and this requirement on the XML breaks that abstraction. Ignorable elements must be optional, and when they are not provided, a default value should be used.
The order of elements matters in the XML. If nodes are not provided in the right order, the unparsing fails. The client has to know in which order to provide elements, and it would be better if the client did not have to worry about it.

Those are the main issues I see on the way to a new day for configuration management.

To test these concepts, I created a new project called Noesis. It means "insight", and I thought it would be meaningful for the current project. News soon.

Sunday, February 22, 2009

Augeas: state-of-the-art configuration management

If you are like me, you probably find that configuration management under Linux is a real mess. Each piece of software has its own configuration file syntax and semantics. Since every project has its own needs and is developed by a different team, a global and standard repository for configuration is unlikely to ever work in the open source model. We have seen some initiatives that worked, like LSB and Freedesktop, but such a shift for configuration files would require huge coordination and agreement, and is unlikely to happen. And since configuration values are usually tightly coupled to the software, the format is hard to change.

There are a few ways to address the situation. You can choose not to manage your config files at all; you can do backups, or use revision control software like Subversion. You can also parametrize your configuration files with variables and generate them. You can also do some grep'ing and sed'ing on your config files to change values. All of this tends to be cumbersome and error prone. But there is better. If it were possible to have a parser for every config file out there, get an in-memory representation, change whatever you want and save it back, you would have the Holy Grail of configuration file management. This way, you manage configuration at a high semantic level, manipulating configuration elements rather than file lines.

Guess what, that is exactly what Augeas aims for! The concept is simple. For each file type, you write a lens that describes the format of the file. The lens can then be used to parse an instance of that config file, and you get a tree in memory. You can access the tree and modify it. When you're done, the lens is used again to write the file back, including any modification you made to the tree. A lens is bidirectional: you don't have to specify separately how to read and how to write a file; a single description covers both directions.
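
To make the idea concrete, here is a toy "lens" in Python. It is not the Augeas lens language nor its API, just a sketch of the principle: one description (a list of field names) drives both the parsing and the writing. The class name `LineLens` and its methods are invented for the illustration.

```python
class LineLens:
    """Toy lens for files made of whitespace-separated fields, one record per line."""

    def __init__(self, fields, separator=" "):
        self.fields = fields        # the single description shared by get and put
        self.separator = separator  # separator used when writing back

    def get(self, text):
        """Parse the file content into a tree (here, a list of dicts)."""
        tree = []
        for line in text.splitlines():
            values = line.split()
            if len(values) != len(self.fields):
                raise ValueError("malformed line: " + line)
            tree.append(dict(zip(self.fields, values)))
        return tree

    def put(self, tree):
        """Write the tree back to file content, using the same field list."""
        return "".join(
            self.separator.join(record[name] for name in self.fields) + "\n"
            for record in tree
        )

hosts = LineLens(["ipaddr", "canonical"])
tree = hosts.get("127.0.0.1 localhost\n")
tree[0]["canonical"] = "myhost"   # modify the tree, not the file lines
print(hosts.put(tree), end="")
```

This toy version resets spacing to a single separator when writing back, and it handles only flat records.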

Last year, at the Linux Symposium, I listened to David Lutterkort's talk about the progress of the project. Augeas today has one limitation related to the type of files for which you can write a lens: we can't write a lens for config files that have recursive structures. Apache configuration, for example, has this kind of nested structure. Currently this is not possible, because the lens semantics doesn't allow defining a recursive grammar.

So, I chose to do that for my master's. When it is done, it will be possible to manage more file types with Augeas. I will post my findings, thoughts and ideas on the project here. Stay tuned!