The Problem: a Parser for Unified Diffs
Recently, I needed to parse a unified diff. I looked for a Java library that does this for me without luck, so I decided to roll my own since it is not that big of a problem. A unified diff is a text fragment that contains the differences in lines between two text files (or between two versions of the same text file). It looks something like this:[gist file=unified-diff.txt]7512852[/gist]
This text is to be parsed to create a Java object that contains all information.
A State Machine to the Rescue
Luckily, a unified diff follows a pattern that can be parsed. There are only a couple of different line types, each line type containing some specific information. These are the possible line types:
- HEADER: a header line that contains part of the header of the diff
- FROM_FILE: a line that contains the name of the first file that is compared
- TO_FILE: a line that contains the name of the second file that is compared
- HUNK_START: a line that marks the start of a “hunk” of differences in both files, providing the line numbers of the start line and the end line of the hunk
- NEUTRAL_LINE: a line within a hunk that is contained in both files
- FROM_LINE: a line within a hunk that is contained in the “from file” only
- TO_LINE: a line within a hunk that is contained in the “to file” only
Projecting these types to the diff example from above looks like this:[gist file=states.txt]7512852[/gist]
Now, each line type can only be followed by a certain set of other line types. Modelling the line types as states we get a state machine like this:
Coding the States
So, what’s the best way to code such a state machine in Java? Te natural feature for representing states in Java code is an enum. The plain enum would look like this:[gist file=enum.java]7512852[/gist]
Note that there are two additional states for the initial state and the end state.
Coding the Transitions
That was easy. The question how best to code the transitions between the states is a little harder. One way to do it is to simply let each enum constant implement a method called “nextState” that returns the next state depending on some context information.[gist file=enum-with-transition.java]7512852[/gist]
Here, each enum constant has to implement the abstract method nextState that takes an object containing some context information about the line of the diff that is currently being parsed and returns another enum constant representing the next state. Usually an implementation of the nextState method should not be overly complex since it simply uses a couple if/else statements to return the next state based on the current state and the information in the context object (an object of type ParseWindow in this case).
If calculating the next state becomes more complex, you should code each transition in it’s own class and simply call that class from the nextState method.
Coding the “Engine”
Having coded the states and the transitions between those states, you need some kind of “engine” that drives the state machine from the initial state through a number of other states to the final end state. While the structure of the state enum is pretty generic, the engine is very specific to the problem you are trying to solve. In the case of a text parser the engine starts by calling the INITIAL state’s nextState method providing the first line of the file and some context information on the next couple lines. Since the first line must be a HEADER line, this will be the result.
Now that we know that the current line is a HEADER line, we know how to parse it. Having parsed the first line, the engine takes the second line and passes it along with some context information to the HEADER state’s nextState method. Lather, rinse, repeat until the END state has been reached.
State machines are a natural way to solve some programming problems, parsing problems probably being the most prominent among them. However, a state machine like this can also be a nice and readable way to solve business problems like the transitions between states of an insurance contract or states of a user going through a complex registration process.
A Live Example
The code examples above are excerpts of a live github project called diffparser. Feel free to have a look at the source.
eclipse -application org.eclipse.jdt.core.JavaCodeFormatter -verbose -config <PATH_TO_CONFIG_FILE> <PATH_TO_JAVA_SOURCES>
- <PATH_TO_CONFIG_FILE>: Path to the properties file containing the configuration of the code formatter. This file is created in .settings/org.eclipse.jdt.core.prefs when you configure the formatter in eclipse for a specific project
- <PATH_TO_JAVA_SOURCES>: Path to the folder containing the java source files to format.
Ant’s default tasks include archiving tasks with which we can create zip, jar and war files in the course of our project’s build process. All these tasks offer the “update”-attribute for time-saving incremental builds. If this attribute is set to true, the target archive file will not be created from scratch if it already exists.
Rather, ant checks the contents of the existing archive file and compares them to the files which are supposed to be archived. Any files that are not already in the archive (or which have changed since the last build) are then put in the already existing archive, overwriting existing files:
<target name="jar-update"> <jar basedir="DirToJar" update="true" destfile="test.jar"/> </target>
But what happens if you delete a file in the directory you want to archive? Does ant realize that you deleted the file and will also delete this file in the target archive file?
The answer to that question is: no, ant does not delete files from an archive file if the “update” atribute is set to “true”. This can lead to nasty surprises in some rare cases. However, deleting the target archive file (for example during a “clean”) solves this problem.
I recently had to import custom ant tasks in an ant build.xml file. Custom ant tasks are usually shipped in a JAR file. Two options were available in the project’s context to make a custom ant task defined in a JAR file available in the build file. Those options are described below along the example of the antcontrib ant library.
Option 1: JAR file in %ANT_HOME%/lib
If you put the antcontrib.jar file in the lib directory of your ant installation, the library will be available by default. All you have to do is to load the ant tasks from this library:
From this point on, all ant tasks defined in antcontrib.jar are available in your ant build file.
Option 2: JAR file anywhere
In most cases you don’t want to “dirty” your ant installation by putting a custom library in the lib directory. So, you have to specify the location of the JAR file (which would be in some sub directory of your project):
<taskdef resource="net/sf/antcontrib/antlib.xml" classpath="path/to/antcontrib.jar"/>
Using namespaces in Ant
To keep your custom ant tasks distinct from default ant tasks, you can add namespaces to each custom library. Here is a toy example:
<project name="MyProject" xmlns:antcontrib="net/sf/antcontrib"/> <taskdef uri="net/sf/antcontrib" resource="net/sf/antcontrib/antlib.xml" classpath="path/to/antcontrib.jar"/> <!-- now we have to use the prefix "antcontrib:" to access the antcontrib tasks --> <antcontrib:foreach ... /> </project>