A Grammar For Diff Output (in Normal, Unified, and Git formats)

By Pedro R. Borges

March 8, 2023

In this article, I describe a grammar for the normal and unified formats generated by the diff utility, and for the diff produced by git when comparing two files, which I’ll call git format. I describe the grammar in EBNF, as described on Wikipedia.

The first and second files being compared or diffed are usually called “original” and “new”, or “source” and “destination”. In this post, they are called left and right files.

Let’s start with, well, the start symbol: diff.

diff = emptyDiff | normalDiff | uniDiff | gitDiff ;

The diff output can be in any of the three formats I am considering in this post, or be empty when the files being compared are identical. In this case, the output is empty even if the diff is requested in unified or with the git command. So, for emptyDiff, we have:

emptyDiff = "" ;

In other notations, we would write the RHS as λ or as ε.

The normal format

The normal format consists solely of the hunks describing the differences between the files, of which there must be at least one. A normalHunk can report added lines (addHunk), deleted lines (deleteHunk), or changed lines (changeHunk):

normalDiff = normalHunk , {normalHunk } ;

normalHunk = addHunk | deleteHunk | changeHunk ;

An addHunk starts with a line indicating the line number in the left file after which the addition occurred, the character ‘a’, and the range of numbers of the lines in the right file. This is followed by the added lines, and possibly a line indicating a missing newline character in the last added line. We have, then:

addHunk =
    number , 'a' , range , eol ,
    addedLines ,
    [ noteLine ] ;

In the first line of a deleteHunk, we have the range of the deleted lines in the left file, the character ’d’, and the line number after which the lines are deleted; Then the deleted lines, and possibly the missing-newline note:

deleteHunk =
    range , 'd' , number , eol ,
    deletedLines ,
    [ noteLine ] ;

A changeHunk starts with the numbers of the lines in the left file, the character ‘c’, and the numbers of the lines in the right file. This is followed by the changed lines in the left file and (possibly) its noteLine, a line with three dashes, and the right file lines and note, if present:

changeHunk =
    range , 'c' , range , eol ,
    deletedLines ,
    [ noteLine ] ,
    "---" , eol ,
    addedLines ,
    [ noteLine ] ;

A range of lines may be 2 line numbers separated by a ‘,’, or just one line number when the range contains just one line:

range
    = number
    | number , ',' , number ;

The numbers in a range are natural numbers without leading zeroes, except for zero itself:

number
    = '0'
    | '1' .. '9' , { digit } ;

digit = '0' .. '9' ;

The productions for added and deleted lines simply state that there must be at least one line of each kind. The rules for each type of line indicate that added lines are prefixed by “> ”, whereas deleted lines are prefixed by “< ”:

addedLines = addedLine , { addedLine } ;

addedLine = "> " , line , eol ;

deletedLines = deletedLine , { deletedLine } ;

deletedLine = "< " , line , eol ;

Note that there is a space after the ‘<’ and ‘>’ in the prefixes of the added and deleted lines, respectively.

A line is a line taken from one of the files being compared. This may be any string that does not contain a newline, which I’ll leave as a special sequence:

line = ? line from a diffed file without '\n' ? ;

As for eol, it is simply the end-of-line indicator, which we could define as a special sequence, or just as:

eol = '\n' ;

The missing-newline note

If the last line of a file is contained in a hunk, and it does not end in a newline character (’\n’), diff adds the missing newline and outputs a line starting with a backslash (’'), followed by a space, and a note indicating it. The note itself depends on the system’s language. In English, for example, the note reads “No newline at end of file”, while in Spanish it reads “No hay ningún carácter de nueva línea al final del fichero”.

Leaving the note itself as a special sequence, we have, then:

noteLine = "\\ " , note , eol ;

note = ? note indicating no newline in last line of a file ? ;

Unified format

A unified diff starts with 2 lines with information about the files being compared. The information about the left file is preceded by 3 dashes and a space, whereas the one about the right file is preceded by 3 plus signs and a space. These lines are followed by at least one unified format hunk (uniHunk):

uniDiff =
    "--- " , fileInfo , eol ,
    "+++ " , fileInfo , eol ,
    uniHunk , { uniHunk } ;

The information about each file consists of its path, followed by a tab character, and its modification time (timestamp):

fileInfo = filePath , '\t' , timestamp ;

timestamp = date , ' ' , time24 , ' ' , timeZone ;

date = fourDigits , '-' , twoDigits , '-' , twoDigits ;

time24 = twoDigits , ':' , twoDigits , ':' , twoDigits , '.' , nineDigits ;

timeZone = ('+' | '-') , fourDigits ;

I leave filePath as a special sequence, as you can see in the final listing. The *Digits should be self-explanatory and are defined in the complete listing. The nineDigits are fractions of a second and, according to the GNU documentation, are omitted in some systems.

Unified hunk descriptor

The first line of each hunk in unified format starts with a uniHunkDescriptor with the range of each file included in the hunk, and possibly the section of the file that contains them. The ranges are between the strings “@@ ” and “ @@” and are separated by a space. The left range is prefixed by a plus sign and the right one by a dash.

uniHunkDescriptor= "@@ +" , range , " -" , range , " @@" , [section] ;

The second number in the range of a file here is the number of lines of that file in the hunk, not the number of the last line included, as in the normal hunks. If omitted, there is only 1 line in the range.

The section, if present, is a line from the diffed files prefixed by a space:

section = ' ' , line ;

Context, left, and right lines

A hunk in unified format may contain 3 types of lines from the diffed files:

Context lines, identical in both files, are prefixed by a space in the hunk.
Lines from the left file (left lines), that have been deleted or changed in the right file. They are prefixed by a dash in the hunk.
Lines from the right file (right lines), that have been added or changed on the right file. They are prefixed by a plus sign in the hunk.

These lines are described by:

contextLine = ' ' , line , eol ;

leftLine = '-' , line , eol ;

rightLine = '+' , line , eol ;

Left and right lines appear in groups corresponding to the hunks that the diff would have if in the normal format:

lrGroup
    = leftLines
    | rightLines
    | leftLines , rightLines ;

leftLines =
    leftLine ,
    { leftLine } ,
    [ noteLine ] ;

rightLines =
    rightLine ,
    { rightLine } ,
    [ noteLine ] ;

Note that right lines come after left lines, as in a changeHunk and that, just as in a normal hunk, the last line of each kind may have its corresponding missing-newline note.

If a unified hunk corresponds to more than one hunk in normal format, it contains additional groups of left/right lines, but they are separated from the previous group by at least one context line. An additional lrGroup and its preceding context line(s) is a clrGroup:

clrGroup = contextLines , lrGroup ;

contextLines = contextLine , { contextLine } ;

Unified hunks

We are now ready to describe a unified hunk. It starts with its uniHunkDescriptor as its first line, followed by the reported lines. The reported lines may start with some context lines, followed by at least one lrGroup, possibly followed by additional groups (clrGroups). The hunk may end with some context lines, and a missing-newline note, if in order:

uniHunk =
    uniHunkDescriptor , eol ,
    { contextLine } ,
    lrGroup ,
    {clrGroup } ,
    [ contextLines , [ noteLine ] ] ;

The git format

The diff generated by git is similar to the unified format but with 2 extra lines at the beginning, and with no timestamp for the diffed files. The first line starts with “diff –git ”, followed by the paths of the diffed files. The second line contains the commit hashes of the left and right files.

gitDiff =
    "diff --git " , filePath , ' ' , filePath , eol ,
    "index " , commitHash , ' ' , commitHash , eol ,
    "--- " , filePath , eol ,
    "+++ " , filePath , eol ,
    uniHunk , { uniHunk } ;

I used here the same filePath used for uniDiff; however, git prepends “a/” and “b/” to the left and right file names by default.

Too many notes

You may have realized that only the last hunk in a diff may contain missing-newline notes, but the grammar allows them in all hunks. To solve this, we could use separate non-terminals for the last normal hunk and the last unified hunk.

For hunks in normal format, we would have:

normalHunk = addHunk | deleteHunk | changeHunk ;

normalLastHunk = addLastHunk | deleteLastHunk | changeLastHunk ;

addHunk =
    number , 'a' , range , eol ,
    addedLines ;

addLastHunk =
    number , 'a' , range , eol ,
    addedLines ,
    [ noteLine ] ;

Similarly, we would have deleteHunk, deleteLastHunk, changeHunk, and changeLastHunk. Then, we would define normalDiff as:

normalDiff = { normalHunk } , lastNormalHunk ;

We would have to modify uniDiff and uniHunk in the same way, making the grammar a lot larger and more repetitive. If you need the grammar to be more precise, this would be the way to go, but for this post, I’ll leave the grammar as hitherto described.

Complete grammar listing

This is the entire grammar listing. It is available as a gist.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132


diff = emptyDiff | normalDiff | uniDiff | gitDiff ;

(* 1. Empty diff *)

emptyDiff = "" ;

(* 2. Normal format *)

normalDiff = normalHunk , {normalHunk } ;

normalHunk = addHunk | deleteHunk | changeHunk ;

addHunk =
    number , 'a' , range , eol ,
    addedLines ,
    [ noteLine ] ;

deleteHunk =
    range , 'd' , number , eol ,
    deletedLines ,
    [ noteLine ] ;

changeHunk =
    range , 'c' , range , eol ,
    deletedLines ,
    [ noteLine ] ,
    "---" , eol ,
    addedLines ,
    [ noteLine ] ;

addedLines = addedLine , { addedLine } ;

addedLine = "> " , line , eol ;

deletedLines = deletedLine , { deletedLine } ;

deletedLine = "< " , line , eol ;

(* 3. Unified and Git formats *)

uniDiff =
    "--- " , fileInfo , eol ,
    "+++ " , fileInfo , eol ,
    uniHunk , { uniHunk } ;

gitDiff =
    "diff --git " , filePath , ' ' , filePath , eol ,
    "index " , commitHash , ' ' , commitHash , eol ,
    "--- " , filePath , eol ,
    "+++ " , filePath , eol ,
    uniHunk , { uniHunk } ;

fileInfo = filePath , '\t' , timestamp ;

timestamp = date , ' ' , time24 , ' ' , timeZone ;

date = fourDigits , '-' , twoDigits , '-' , twoDigits ;

time24 = twoDigits , ':' , twoDigits , ':' , twoDigits , '.' , nineDigits ;

timeZone = ('+' | '-') , fourDigits ;

uniHunk =
    uniHunkDescriptor , eol ,
    { contextLine } ,
    lrGroup ,
    {clrGroup } ,
    [ contextLines , [ noteLine ] ] ;

uniHunkDescriptor= "@@ +" , range , " -" , range , " @@" , [section] ;

section = ' ' , line ;

clrGroup = contextLines , lrGroup ;

contextLines = contextLine , { contextLine } ;

contextLine = ' ' , line , eol ;

lrGroup
    = leftLines
    | rightLines
    | leftLines , rightLines ;

leftLines =
    leftLine ,
    { leftLine } ,
    [ noteLine ] ;

leftLine = '-' , line , eol ;

rightLines =
    rightLine ,
    { rightLine } ,
    [ noteLine ] ;

rightLine = '+' , line , eol ;

(* 4. Common and auxiliary rules *)

range
    = number
    | number , ',' , number ;

(* Here, "\\ " is one backslash followed by a space *)
noteLine = "\\ " , note , eol ;

number
    = '0'
    | '1' .. '9' , { digit } ;

(* Ranges and * repetition are not allowed in some EBNF variants *)

digit = '0' .. '9' ;

twoDigits = 2 * digit ;

fourDigits = 4 * digit ;

nineDigits = 9 * digit ;

eol = '\n' ;

(* 5. Special rules *)

line = ? line from a diffed file without '\n' ? ;

note = ? note indicating no newline in last line of a file ? ;

filePath = ? file path string ? ;

commitHash = ? alphanumeric commit hash ? ;

Final words

The intended use of the grammar in this post is to guide the construction of a parser for the output generated by diff in the formats described. If you need to describe more precisely those formats or validate them, or need to include other formats, this grammar should serve as a starting point for a grammar more suited to your needs.

For more information about diff in general you may visit the resources I used for this article:

The entry for diff on Wikipedia.
The GNU manual: Comparing and merging files.
An article about the unified diff format by none other than Guido van Rossum.
An article about Git diff by Matt Forsyth Jun.

In my next article, I’ll describe a parser for the formats described here. Meanwhile, I’ll appreciate any feedback about the article, and please let me know if you detect any typos or mistakes in it.

Finally, don’t forget to share it if you think it can be of interest to others.