A Grammar For Diff Output (in Normal, Unified, and Git formats)
By Pedro R. Borges
In this article, I describe a grammar for the normal and unified formats generated by the diff utility, and for the diff produced by git when comparing two files, which I’ll call git format. I describe the grammar in EBNF, as described on Wikipedia.
The first and second files being compared or diffed are usually called “original” and “new”, or “source” and “destination”. In this post, they are called left and right files.
Let’s start with, well, the start symbol: diff
.
diff = emptyDiff | normalDiff | uniDiff | gitDiff ;
The diff output can be in any of the three formats I am considering in this post, or be empty when the files being compared are identical.
In this case, the output is empty even if the diff is requested in unified or with the git command.
So, for emptyDiff
, we have:
emptyDiff = "" ;
In other notations, we would write the RHS as λ or as ε.
The normal format
The normal format consists solely of the hunks describing the differences between the files, of which there must be at least one.
A normalHunk
can report added lines (addHunk
), deleted lines (deleteHunk
), or changed lines (changeHunk
):
normalDiff = normalHunk , {normalHunk } ;
normalHunk = addHunk | deleteHunk | changeHunk ;
An addHunk
starts with a line indicating the line number in the left file after which the addition occurred, the character ‘a’, and the range of numbers of the lines in the right file.
This is followed by the added lines, and possibly a line indicating a missing newline character in the last added line.
We have, then:
addHunk =
number , 'a' , range , eol ,
addedLines ,
[ noteLine ] ;
In the first line of a deleteHunk
, we have the range of the deleted lines in the left file, the character ’d’, and the line number after which the lines are deleted;
Then the deleted lines, and possibly the missing-newline note:
deleteHunk =
range , 'd' , number , eol ,
deletedLines ,
[ noteLine ] ;
A changeHunk
starts with the numbers of the lines in the left file, the character ‘c’, and the numbers of the lines in the right file.
This is followed by the changed lines in the left file and (possibly) its noteLine
, a line with three dashes, and the right file lines and note, if present:
changeHunk =
range , 'c' , range , eol ,
deletedLines ,
[ noteLine ] ,
"---" , eol ,
addedLines ,
[ noteLine ] ;
A range of lines may be 2 line numbers separated by a ‘,’, or just one line number when the range contains just one line:
range
= number
| number , ',' , number ;
The numbers in a range are natural numbers without leading zeroes, except for zero itself:
number
= '0'
| '1' .. '9' , { digit } ;
digit = '0' .. '9' ;
The productions for added and deleted lines simply state that there must be at least one line of each kind. The rules for each type of line indicate that added lines are prefixed by “> ”, whereas deleted lines are prefixed by “< ”:
addedLines = addedLine , { addedLine } ;
addedLine = "> " , line , eol ;
deletedLines = deletedLine , { deletedLine } ;
deletedLine = "< " , line , eol ;
Note that there is a space after the ‘<’ and ‘>’ in the prefixes of the added and deleted lines, respectively.
A line
is a line taken from one of the files being compared.
This may be any string that does not contain a newline, which I’ll leave as a special sequence:
line = ? line from a diffed file without '\n' ? ;
As for eol
, it is simply the end-of-line indicator, which we could define as a special sequence, or just as:
eol = '\n' ;
The missing-newline note
If the last line of a file is contained in a hunk, and it does not end in a newline character (’\n’), diff adds the missing newline and outputs a line starting with a backslash (’'), followed by a space, and a note indicating it. The note itself depends on the system’s language. In English, for example, the note reads “No newline at end of file”, while in Spanish it reads “No hay ningún carácter de nueva línea al final del fichero”.
Leaving the note itself as a special sequence, we have, then:
noteLine = "\\ " , note , eol ;
note = ? note indicating no newline in last line of a file ? ;
Unified format
A unified diff starts with 2 lines with information about the files being compared.
The information about the left file is preceded by 3 dashes and a space, whereas the one about the right file is preceded by 3 plus signs and a space.
These lines are followed by at least one unified format hunk (uniHunk
):
uniDiff =
"--- " , fileInfo , eol ,
"+++ " , fileInfo , eol ,
uniHunk , { uniHunk } ;
The information about each file consists of its path, followed by a tab character, and its modification time (timestamp
):
fileInfo = filePath , '\t' , timestamp ;
timestamp = date , ' ' , time24 , ' ' , timeZone ;
date = fourDigits , '-' , twoDigits , '-' , twoDigits ;
time24 = twoDigits , ':' , twoDigits , ':' , twoDigits , '.' , nineDigits ;
timeZone = ('+' | '-') , fourDigits ;
I leave filePath
as a special sequence, as you can see in the final listing.
The *Digits
should be self-explanatory and are defined in the complete listing.
The nineDigits
are fractions of a second and, according to the GNU documentation, are omitted in some systems.
Unified hunk descriptor
The first line of each hunk in unified format starts with a uniHunkDescriptor
with the range of each file included in the hunk, and possibly the section of the file that contains them.
The ranges are between the strings “@@ ” and “ @@” and are separated by a space.
The left range is prefixed by a plus sign and the right one by a dash.
uniHunkDescriptor= "@@ +" , range , " -" , range , " @@" , [section] ;
The second number in the range
of a file here is the number of lines of that file in the hunk, not the number of the last line included, as in the normal hunks.
If omitted, there is only 1 line in the range.
The section, if present, is a line from the diffed files prefixed by a space:
section = ' ' , line ;
Context, left, and right lines
A hunk in unified format may contain 3 types of lines from the diffed files:
- Context lines, identical in both files, are prefixed by a space in the hunk.
- Lines from the left file (left lines), that have been deleted or changed in the right file. They are prefixed by a dash in the hunk.
- Lines from the right file (right lines), that have been added or changed on the right file. They are prefixed by a plus sign in the hunk.
These lines are described by:
contextLine = ' ' , line , eol ;
leftLine = '-' , line , eol ;
rightLine = '+' , line , eol ;
Left and right lines appear in groups corresponding to the hunks that the diff would have if in the normal format:
lrGroup
= leftLines
| rightLines
| leftLines , rightLines ;
leftLines =
leftLine ,
{ leftLine } ,
[ noteLine ] ;
rightLines =
rightLine ,
{ rightLine } ,
[ noteLine ] ;
Note that right lines come after left lines, as in a changeHunk
and that, just as in a normal hunk, the last line of each kind may have its corresponding missing-newline note.
If a unified hunk corresponds to more than one hunk in normal format,
it contains additional groups of left/right lines, but they are separated from the previous group by at least one context line.
An additional lrGroup
and its preceding context line(s) is a clrGroup
:
clrGroup = contextLines , lrGroup ;
contextLines = contextLine , { contextLine } ;
Unified hunks
We are now ready to describe a unified hunk.
It starts with its uniHunkDescriptor
as its first line, followed by the reported lines.
The reported lines may start with some context lines, followed by at least one lrGroup
,
possibly followed by additional groups (clrGroups
).
The hunk may end with some context lines, and a missing-newline note, if in order:
uniHunk =
uniHunkDescriptor , eol ,
{ contextLine } ,
lrGroup ,
{clrGroup } ,
[ contextLines , [ noteLine ] ] ;
The git format
The diff generated by git is similar to the unified format but with 2 extra lines at the beginning, and with no timestamp for the diffed files. The first line starts with “diff –git ”, followed by the paths of the diffed files. The second line contains the commit hashes of the left and right files.
gitDiff =
"diff --git " , filePath , ' ' , filePath , eol ,
"index " , commitHash , ' ' , commitHash , eol ,
"--- " , filePath , eol ,
"+++ " , filePath , eol ,
uniHunk , { uniHunk } ;
I used here the same filePath
used for uniDiff
; however, git prepends “a/” and “b/” to the left and right file names by default.
Too many notes
You may have realized that only the last hunk in a diff may contain missing-newline notes, but the grammar allows them in all hunks. To solve this, we could use separate non-terminals for the last normal hunk and the last unified hunk.
For hunks in normal format, we would have:
normalHunk = addHunk | deleteHunk | changeHunk ;
normalLastHunk = addLastHunk | deleteLastHunk | changeLastHunk ;
addHunk =
number , 'a' , range , eol ,
addedLines ;
addLastHunk =
number , 'a' , range , eol ,
addedLines ,
[ noteLine ] ;
Similarly, we would have deleteHunk
, deleteLastHunk
, changeHunk
, and changeLastHunk
.
Then, we would define normalDiff
as:
normalDiff = { normalHunk } , lastNormalHunk ;
We would have to modify uniDiff
and uniHunk
in the same way, making the grammar a lot larger and more repetitive.
If you need the grammar to be more precise, this would be the way to go, but for this post, I’ll leave the grammar as hitherto described.
Complete grammar listing
This is the entire grammar listing. It is available as a gist.
|
|
Final words
The intended use of the grammar in this post is to guide the construction of a parser for the output generated by diff in the formats described. If you need to describe more precisely those formats or validate them, or need to include other formats, this grammar should serve as a starting point for a grammar more suited to your needs.
For more information about diff in general you may visit the resources I used for this article:
- The entry for diff on Wikipedia.
- The GNU manual: Comparing and merging files.
- An article about the unified diff format by none other than Guido van Rossum.
- An article about Git diff by Matt Forsyth Jun.
In my next article, I’ll describe a parser for the formats described here. Meanwhile, I’ll appreciate any feedback about the article, and please let me know if you detect any typos or mistakes in it.
Finally, don’t forget to share it if you think it can be of interest to others.