Thursday, 12 July 2007

4: Ruby regular expressions

In the wireless engineering applications area where I work, we often have to deal with new data types and formats, and for years I relied on Perl for fast and convenient scripting for data conversion and data validation. I never mastered everything Perl offered, but I was very proficient in a number of key areas applicable to these kinds of problems, notably text file manipulation, data management and regular expressions. This knowledge was also valuable when using regular expressions in Java, initially with the OROMatcher library and later with the standard regex library in Java 1.4, but Java was never as convenient and concise as Perl, which was always a small irritation.

Needless to say I was thrilled to discover that Ruby regular expressions were the best of both worlds, with the conciseness of Perl and the Object-Orientation of Java. My first example here is a comparison of a Perl script with the Ruby equivalent:



This Perl script opens a text file and iterates over all lines, splitting each line on the ';' character and using a regular expression to extract two important bits of data out of one of the fields. The results look something like the following:



OK, so this data might not mean anything to you if you don't work in mobile network engineering, but the principle stands. The Perl script is short and simple, and the single line regular expression very useful. The Perl syntax for this was $fields[1] =~ /ID \"(\w+)\", cellRef (\d+)/, which basically compares the second field to the regular expression, and stores the two parenthesized substrings in the convenience variables $1 and $2. Concise, simple and powerful. The Ruby example is almost exactly the same:





Amazingly, this is even shorter than the Perl version, since we merged the two loops into one with Ruby's lovely File.open.each convenience methods, and we have less punctuation. Otherwise the scripts look almost identical, and the regular expression line is nearly exactly the same, with only the first '$' removed. For Perl fans, this is wonderful. But what about the Object Orientation fans out there? Perl's regular expressions are not obviously object orientated, and the use of the global variables $1 and $2 in both the Perl and Ruby examples does not please everyone.
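As a rough sketch of the Perl-like Ruby style just described, the split-and-match loop might look something like the following. This is not the original script: the sample records and the method name extract_perl_style are invented for illustration, and the text is read from a string rather than a file so the snippet stands alone.

```ruby
# Hypothetical sample records; the field layout is made up for this sketch.
sample = <<~DATA
  1;ID "CELL001", cellRef 4711;active
  2;no identifier in this field;idle
  3;ID "CELL002", cellRef 4712;active
DATA

# Returns an array of [id, cell_ref] pairs, one per line whose second field matches.
def extract_perl_style(text)
  results = []
  text.each_line do |line|
    fields = line.chomp.split(';')
    # =~ returns the match position or nil; on success the two
    # parenthesized groups are available as $1 and $2, just as in Perl.
    if fields[1] =~ /ID "(\w+)", cellRef (\d+)/
      results << [$1, $2]
    end
  end
  results
end

extract_perl_style(sample).each { |id, ref| puts "#{id} #{ref}" }
```

With a real file, the same block body would sit inside a File.open(path).each loop instead of each_line on a string.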

The good news is that this Ruby approach is actually a convenience wrapper on what is fundamentally a very object orientated regular expression library in Ruby. And of course you can write the code in a very OO way, remarkably similar to Java. Consider the following Java example:





As with the previous two scripts, the output is identical. However, the program code is much, much more verbose, to the extent that I had to shrink the font size to fit it on the page! Firstly we have the required Java class and static main method plumbing, and then the java.io plumbing to open and read the file. The String.split method introduced in Java 1.4 helps, and looks somewhat like the Perl and Ruby version, but the regular expression code is sure ugly. There are lots of steps: creating Pattern and Matcher objects, and calling so many methods: compile(), matches(), and group(). Whew! So is there anything good about this approach? Well, it is object orientated, no global variables are used, and if you work in Eclipse you can ctrl-space on any object to figure out what to do next, which helps a great deal while programming. It is relatively easy to code, but somewhat less convenient to read later.

So how does Ruby do this? Let's re-write the previous Ruby script using a purely object orientated syntax:





Wow! Amazing! The Ruby version of the program is almost exactly like the Java version, constructing a Regexp object, calling match, and then extracting the groups with the [] method. But it certainly looks a lot less verbose than the Java version, and is definitely easier to read. It is not as concise as the Perl-like Ruby script before, but it has achieved a clean object-orientated approach with no use of globals, while only being a little more verbose. Nice!
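For readers who want to try the object orientated style, here is a minimal sketch along the lines described above: build a Regexp object, call match, and read the groups from the returned MatchData with []. Again, this is not the original script; the sample records and the name extract_oo_style are assumptions made for illustration.

```ruby
# Hypothetical sample records; the field layout is made up for this sketch.
sample = <<~DATA
  1;ID "CELL001", cellRef 4711;active
  2;no identifier in this field;idle
  3;ID "CELL002", cellRef 4712;active
DATA

# Returns an array of [id, cell_ref] pairs without touching $1 or $2.
def extract_oo_style(text)
  pattern = Regexp.new('ID "(\w+)", cellRef (\d+)')
  results = []
  text.each_line do |line|
    fields = line.chomp.split(';')
    match = pattern.match(fields[1].to_s)  # a MatchData object, or nil
    results << [match[1], match[2]] if match
  end
  results
end

extract_oo_style(sample).each { |id, ref| puts "#{id} #{ref}" }

# The Perl-style form is shorthand for the same objects: after a
# successful =~, Regexp.last_match returns the MatchData behind $1 and $2.
if 'ID "CELL009", cellRef 42' =~ /ID "(\w+)", cellRef (\d+)/
  puts Regexp.last_match(1) == $1
end
```

The last few lines show why the two styles sit so close together: the $1 convenience variable and the MatchData group are the same captured text, reached by two different routes.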

For a clear comparison of the two Ruby versions, I place them here side by side with the line including the regular expression highlighted. The differences are not that great, whereas if you did the same comparison between Perl and Java you might have trouble seeing the connections. Ruby has found a great way to bridge the gap:




