The Problem: a Parser for Unified Diffs
Recently, I needed to parse a unified diff. I looked for a Java library that does this for me without luck, so I decided to roll my own since it is not that big of a problem. A unified diff is a text fragment that contains the differences in lines between two text files (or between two versions of the same text file). It looks something like this:[gist file=unified-diff.txt]7512852[/gist]
This text is to be parsed to create a Java object that contains all information.
A State Machine to the Rescue
Luckily, a unified diff follows a pattern that can be parsed. There are only a couple of different line types, each line type containing some specific information. These are the possible line types:
- HEADER: a header line that contains part of the header of the diff
- FROM_FILE: a line that contains the name of the first file that is compared
- TO_FILE: a line that contains the name of the second file that is compared
- HUNK_START: a line that marks the start of a “hunk” of differences in both files, providing the line numbers of the start line and the end line of the hunk
- NEUTRAL_LINE: a line within a hunk that is contained in both files
- FROM_LINE: a line within a hunk that is contained in the “from file” only
- TO_LINE: a line within a hunk that is contained in the “to file” only
Projecting these types to the diff example from above looks like this:[gist file=states.txt]7512852[/gist]
Now, each line type can only be followed by a certain set of other line types. Modelling the line types as states we get a state machine like this:
Coding the States
So, what’s the best way to code such a state machine in Java? Te natural feature for representing states in Java code is an enum. The plain enum would look like this:[gist file=enum.java]7512852[/gist]
Note that there are two additional states for the initial state and the end state.
Coding the Transitions
That was easy. The question how best to code the transitions between the states is a little harder. One way to do it is to simply let each enum constant implement a method called “nextState” that returns the next state depending on some context information.[gist file=enum-with-transition.java]7512852[/gist]
Here, each enum constant has to implement the abstract method nextState that takes an object containing some context information about the line of the diff that is currently being parsed and returns another enum constant representing the next state. Usually an implementation of the nextState method should not be overly complex since it simply uses a couple if/else statements to return the next state based on the current state and the information in the context object (an object of type ParseWindow in this case).
If calculating the next state becomes more complex, you should code each transition in it’s own class and simply call that class from the nextState method.
Coding the “Engine”
Having coded the states and the transitions between those states, you need some kind of “engine” that drives the state machine from the initial state through a number of other states to the final end state. While the structure of the state enum is pretty generic, the engine is very specific to the problem you are trying to solve. In the case of a text parser the engine starts by calling the INITIAL state’s nextState method providing the first line of the file and some context information on the next couple lines. Since the first line must be a HEADER line, this will be the result.
Now that we know that the current line is a HEADER line, we know how to parse it. Having parsed the first line, the engine takes the second line and passes it along with some context information to the HEADER state’s nextState method. Lather, rinse, repeat until the END state has been reached.
State machines are a natural way to solve some programming problems, parsing problems probably being the most prominent among them. However, a state machine like this can also be a nice and readable way to solve business problems like the transitions between states of an insurance contract or states of a user going through a complex registration process.
A Live Example
The code examples above are excerpts of a live github project called diffparser. Feel free to have a look at the source.