Parsing Diff Output in Haskell

By Pedro R. Borges

May 22, 2023

Screenshot with result of parsing git diff of Main.hs

Parser combinators and the libraries available to build parsers using them are among the best features of Haskell. I describe here a parser for diff output written in Haskell using attoparsec, a monadic parser combinator library designed to be fast and efficient. The parser accepts git output in the normal or unified format, or the output produced by “git diff” when comparing two files.

I assume the reader has some familiarity with monadic parsing in Haskell, particularly with the Attoparsec library, as well as with the diff formats parsed. For the latter, I invite you to read my previous article, where I described a grammar for diffs in these formats.

Like in the grammar article, I call the diffed files left and right files. The parser takes a ByteString as input, and the lines coming from the diffed files reported in the diff are kept in that type. However, text not coming directly from the diffed files is converted to the Text type.

The code on this post is available on GitHub and is compiled with the GHC2021 set of extensions.

Entry point and the Diff type

The entry point for our parser is the function diffParse:: ByteString -> Either String Diff which, given a (ByteString) input, returns a value of type Diff or an error string. To produce its result, it supplies a Parser Diff to Attoparsec’s function parseOnly, of type:

parseOnly :: Parser a -> ByteString -> Either String a

Let’s define the Diff type and our entry point function.

The Diff type

A diff can be empty, or be in normal, unified, or git format:

An empty diff is EmptyDiff.
A normal format diff is a NormalDiff and contains a non-empty list of hunks of type NormalHunk.
A unified format diff is a UniDiff, which contains the name and modification time for each diffed file, each of type NameTime, and a non-empty list of hunks of type UniHunk.
A diff in git format is a GitDiff, which contains a NameHash for each diffed file with their respective names and hashes, in addition to a non-empty list of hunks of type UniHunk.

Hence, I define the Diff type as:

import Data.List.NonEmpty (NonEmpty)
import Data.Text (Text)

data Diff
    = EmptyDiff
    | NormalDiff (NonEmpty NormalHunk)
    | UniDiff NameTime NameTime (NonEmpty UniHunk)
    | GitDiff NameHash NameHash (NonEmpty UniHunk)

Entry point function: diffParse

I define parsers for each diff variant named, respectively, emptyParser, normalDiffParser, uniDiffParser, and gitDiffParser. Combining these parsers with the alternative operator <|> from Control.Applicative we have a parser for Diff, which we can supply to parseOnly:

import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString qualified as A
import Data.ByteString (ByteString)

diffParse :: ByteString -> Either String Diff
diffParse =
    A.parseOnly $
        emptyDiffParser
            <|> normalDiffParser
            <|> uniDiffParser
            <|> gitDiffParser

The definition of emptyParser is trivial; we only have to parse the end of the input and return EmptyDiff. The end-of-input is parsed by endOfInput from Data.Attoparsec.ByteString:

import Data.Attoparsec.ByteString (Parser)

emptyDiffParser :: Parser Diff
emptyDiffParser = A.endOfInput >> pure EmptyDiff

Parsing normal diffs

A normal diff is composed of a non-empty list of normal hunks. To parse each normal hunk I define normalHunkParser, and for the list, I use some, defined in the Control.Monad.Combinators.NonEmpty module from the parser-combinators package. This combinator is similar to the commonly used many1 and many1', except that it builds a NonEmpty list instead of a plain list.

import Control.Monad.Combinators.NonEmpty (some)

normalDiffParser :: Parser Diff
normalDiffParser = NormalDiff <$> some normalHunkParser <* A.endOfInput

Note the use of <* before A.endOfInput, since its result is discarded.

For the three variants of hunks in the normal format, I define the NormalHunk type as:

type Line = ByteString

data NormalHunk
    = AddHunk Range Range (NonEmpty Line) EndNote
    | DeleteHunk Range Range (NonEmpty Line) EndNote
    | ChangeHunk Range Range (NonEmpty Line) (NonEmpty Line) EndNote

The Range components contain the line numbers from the left and right files in each hunk. I use this type for the normal hunks as well as for the unified hunks.

A AddHunk contains a non-empty list of added lines. A DeleteHunk contains a non-empty list of deleted lines, and a ChangeHunk has both added and deleted lines. I use the type alias Line to indicate lines coming from the diffed files.

Each hunk also has an element of type EndNote, which indicates if the last line of any or both files does not end in a newline character. This type is also used for unified hunks.

Only the last hunk on a diff may contain end notes; however, while parsing each hunk, it is unknown if it is the last one. For this reason, and simplicity, I have chosen to parse each hunk as if it were the last one, and consequently, to include an EndNote in each. This applies also to unified hunks.

The hunk descriptors

In the normal format, each hunk starts with a descriptor line containing the left file range, a character indicating the hunk type, and the right file range. The character is ‘a’ for an AddHunk, ’d’ for a DeleteHunk, and ‘c’ for a ChangeHunk. I parse this line with descriptorParser, which returns a triplet with the left range, the parser for the hunk, and the right range.

The parser returned is addHunkParser, deleteHunkParser, or changeHunkParser, when the descriptor line contains ‘a’, ’d’, or ‘c’, respectively. Each of these parsers is of type Range -> Range -> Parser NormalHunk since they receive the ranges on the hunk descriptor.

import Data.Attoparsec.ByteString.Char8 qualified as AC (endOfLine)
import Data.Word8 (_a, _c, _d)

descriptorParser :: Parser (Range, Range -> Range -> Parser NormalHunk, Range)
descriptorParser =
    (,,)
        <$> normalRangeParser
        <*> hunkParserParser
        <*> normalRangeParser
        <* AC.endOfLine
    where
                hunkParserParser =
            A.choice
                [ A.word8 _a >> pure addHunkParser
                , A.word8 _d >> pure deleteHunkParser
                , A.word8 _c >> pure changeHunkParser
                ]

Here, AC.endOfLine matches either ‘\n’ or ‘\r\n’, and A.word8 parses the character given as its argument. From Data.Word8, provided by the word8 package, we take mnemonic names for the ByteString words ‘a’, ‘c’, and ’d’. The A.choice function generalizes <|> to a list of alternatives.

A hunkParserParser? Well, it is a parser, and it produces a hunk parser. So, it is consistent with the other names in the article.

The ranges

I define the Range type for the ranges in both the normal and unified hunks. In a normal format diff, the range of lines may be indicated by just one line number, or by the start and end line numbers. In the unified hunks, they can be either one line number, or the start line number and the length of the range. We have, then:

type LNumber = Int

data Range
    = OneLine LNumber
    | StartEnd LNumber LNumber
    | StartLen LNumber Int

Here we use the type alias LNumber for an Int that indicates a line number.

To parse a range then, we need to parse either a number, or a pair of numbers, and then apply the appropriate constructor. To parse the number(s), I define intOrPairParser as:

import Data.Word8 (_comma)

intOrPairParser :: Parser (Either Int (Int, Int))
intOrPairParser = do
    int1 <- AC.decimal
    A.option (Left int1) $
        Right . (int1,) <$> (A.word8 _comma *> AC.decimal)

In the first step, we parse the first number as int1. In the second step, we try to parse a second number after a comma. If successful, we paired it with int1 and returned with Right; otherwise, we return Left int1.

To parse a normal range using intOrPairParser we only have to apply the corresponding constructor to each possible result:

normalRangeParser :: Parser Range
normalRangeParser = either OneLine (uncurry StartEnd) <$> intOrPairParser

I take the opportunity to define a function to calculate the number of lines in a range, which I’ll use to parse normal hunks, and in some tests for unified hunks:

rangeLength :: Range -> Int
rangeLength (OneLine _ln) = 1
rangeLength (StartEnd start end) = end - start + 1
rangeLength (StartLen _start len) = len

The normal hunks

Let us now define the parsers for the three types of normal hunks. As mentioned above, these parsers receive the range of lines from the left and right files in the hunk. The ranges are the first two components of each hunk, and they also provide the number of lines to be parsed for the hunk.

To parse an add hunk, we supply the ranges to the constructor and apply it to the parsers for the other two components:

addHunkParser :: Range -> Range -> Parser NormalHunk
addHunkParser leftRange rightRange =
    AddHunk leftRange rightRange
        <$> addedLinesParser (rangeLength rightRange)
        <*> rightNoteParser

Here, The addedLinesParser receives the number of lines to be parsed, and rightNoteParser returns an EndNote indicating if the last line of the hunk does not have an ending newline character. Since the added lines come from the right file its count is calculated from rightRange, and the end note corresponds to that file.

The definition of deleteHunkParser follows the same pattern:

deleteHunkParser :: Range -> Range -> Parser NormalHunk
deleteHunkParser leftRange rightRange =
    DeleteHunk leftRange rightRange
        <$> deletedLinesParser (rangeLength leftRange)
        <*> leftNoteParser

Here, the number of lines to be parsed by deletedLines is provided by leftRange, and the result of leftNoteParser indicates the presence or absence of the end note for the left file.

A change hunk contains the lines and the note from the left file, as in a delete hunk, and the lines and note from the right file, as in a add hunk, separated by a (discarded) line of dashes:

changeHunkParser :: Range -> Range -> Parser NormalHunk
changeHunkParser leftRange rightRange = do
    deleted <- deletedLinesParser (rangeLength leftRange)
    leftNote <- leftNoteParser
    _dashes <- lineAfterParser "-"
    added <- addedLinesParser (rangeLength rightRange)
    rightNote <- rightNoteParser
    pure $
        ChangeHunk leftRange rightRange deleted added (leftNote <> rightNote)

The EndNote for the hunk is the combination of the left and right notes. I use the semigroup operator since I make EndNote an instance of Semigroup, as we’ll see later.

To parse the dashes we parse the line after a ‘-’ with the function lineAfterParser, which we’ll define in a moment.

The lines from the diffed files

I now define the parsers for the lines from the diffed files reported in the hunks in normal format. These lines are prefixed by “> " or “< “, that is, by ‘>’ or ‘<’, and a space.

I define a parser to match the rest of the line after a given prefix:

import Data.Word8 (_lf)

lineAfterParser :: ByteString -> Parser Line
lineAfterParser prefix =
    A.string prefix *> A.takeTill (== _lf) <* A.take 1

First, we match the given prefix with the A.string`` prefix parser and discard it, hence the *>. Then we parse the rest of the line up to, but not including the newline character (_lf). Finally, we match the newline character with A.take1 and discard it.

Note that to parse the line after the prefix I could have used:

A.takeTill AC.isEndOfLine <* AC.endOfLine

This would not include the ‘\r’ in the parser result when the line ended in ‘\r\n’. But what if the diffed files have different line endings?

Consider, for example, an empty line. In one file this line would be "\r\n" and in the other "\n", but both would be parsed as "", that is, the empty string. Hence, lines reported as different would be parsed as the same line. This could be a valid option, but I prefer to have the difference explicit. The downside is that for files with ‘\r\n’ line endings, all the parsed lines will contain an ending ‘\r’.

Using lineAfterParsing it is quite easy to define addedLines and deletedLines: we just have to invoke lineAfterParser repeatedly and collect its result in a non-empty list. I could use some as I did for the normal hunks, but since the number of lines to be parsed is known, I instead use A.count and then convert the list to NonEmpty:

import Control.Exception (assert)
import Data.List.NonEmpty (fromList)

addedLinesParser, deletedLinesParser :: Int -> Parser (NonEmpty Line)
addedLinesParser nLines =
    assert (nLines > 0) $
        fromList <$> A.count nLines (lineAfterParser "> ")
deletedLinesParser nLines =
    assert (nLines > 0) $
        fromList <$> A.count nLines (lineAfterParser "< ")

The fromList function converts a plain list into a NonEmpty and raises an exception when its input is empty. Our parser is intended to be used on diff outputs, so `fromList`` should not raise an exception here. However, I have added an assertion just to be reassured, and to have a clearer conscience?!

The end notes

To complete the definition of normalDiffParser We need to define EndNote as well as leftNoteParser and rightNoteParser. I’ll also define bothNoteParser, which we’ll need when parsing unified diffs.

The type is defined as:

data EndNote = None | LeftNote | RightNote | BothNote

The constructors indicate whether a hunk has no missing newline note, a note for its left file, one for its right file, or notes for both files.

An end note is a line that starts with a backslash (’\’) and a space, followed by a localized message, which in English reads “No newline at end of file”. To recognize an end note I just match a line prefixed with “\ “, and void the rest of the line:

import Control.Monad (void)

-- "\\" as prefix would work, since no other line starts with '\'
-- and the rest of the line is discarded
noteLineParser :: Parser ()
noteLineParser = void $ lineAfterParser "\\ "

For leftNoteParser I just return LeftNote if noteLineParser succeeds, and None otherwise. This is easily done with A.option. The definition of rightNoteParser and bothNoteParser follow the same pattern:

leftNoteParser, rightNoteParser, bothNoteParser :: Parser EndNote
leftNoteParser = A.option None (LeftNote <$ noteLineParser)
rightNoteParser = A.option None (RightNote <$ noteLineParser)
bothNoteParser = A.option None (BothNote <$ noteLineParser)

As we have seen, when we parse a changeHunk, we need to combine two endNotes parsed separately to obtain the note for the hunk. To parse unified diffs later, I’ll need to combine more than two notes. For these combinations, I make EndNote an instance of Semigroup and define the <> operator for end notes.

The operation needs to preserve any note when combined with None and produce BothNote when a left note and a right note are combined. Therefore, we define the semigroup instance by:

instance Semigroup EndNote where
    None <> other = other
    other <> None = other
    _ <> _ = BothNote

Technically, <> should be idempotent on end notes, given that, for example, two left notes should produce a left note, and similarly for right notes. The definition above, however, yields BothNote for both cases. Nevertheless, I’ll keep the definition above since the potentially wrong cases do not occur in valid diffs.

Note also that None is a neutral element, and therefore EndNote is a monoid. I make EndNote an instance of Monoid since I’ll make use of this fact when parsing unified diff.

instance Monoid EndNote where
    mempty = None

Parsing unified diffs

As we saw earlier, a unified dif contains two elements of tyepe NameTime with info about the left and right file and a non-empty list of hunks in the unified format. The info about the left file appears in a line that starts with 3 dashes and a space ("--- "), followed by the info about the right file, in a line starting with 3 plus signs and a space ("+++ ").

Then, to parse a unified diff we parse the info about the left and right files, the hunks, and the end of input, and return a UniDiff with these components:

{-# LANGUAGE OverloadedStrings #-}

uniDiffParser :: Parser Diff
uniDiffParser = do
    leftFile <- "--- " *> nameTimeParser
    rightFile <- "+++ " *> nameTimeParser
    uniHunks <- some uniHunkParser
    A.endOfInput
    pure $ UniDiff leftFile rightFile uniHunks

The OverloadedStrings extension allows us to use strings as parsers. That’s why, for example, I used "--- " instead of A.string "--- ".

Parsing NameTime

The information about each file in the diff consists of the name of the file (or its path), and its modification time:

type TimeStamp = Text

data NameTime = NameTime Text TimeStamp

We could, of course, use a proper date and time type for the time stamp, but I am leaving it as Text for this post.

The two elements of information are separated by a tab character, so we can define the parser for NameTime as:

import Data.Word8 (_tab)

nameTimeParser :: Parser NameTime
nameTimeParser =
    NameTime
        <$> takeTillAsText (_tab ==)
        <* A.take 1 -- take _tab
        <*> takeRestAsText

The A.take 1 parser matches 1 byte, in this case, the tab character. I define takeTillAsText by applying A.takeTill and converting its result to Text, and use it to define takeRestAsText, which matches the rest of the line as Text:

import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

takeRestAsText :: Parser Text
takeRestAsText = takeTillAsText AC.isEndOfLine <* AC.endOfLine

takeTillAsText :: (Word8 -> Bool) -> Parser Text
takeTillAsText p = decodeUtf8With lenientDecode <$> A.takeTill p

The unified hunks

A unified hunk is made of:

The range of lines in the hunk for the left and right file, of the already described type Range.
An optional heading or section that the hunk is part of, which I represent as a Maybe Line.
A non-empty list of groups of lines, each of type LinesGroup. These groups indicate whether the lines come from the left file (deleted lines), the right file (added lines), or are context lines (in both files).
An EndNote, like in the normal hunks.

This gives the following definition for UniHunk and LinesGroup:

type Section = Maybe Line

data UniHunk = UniHunk Range Range Section (NonEmpty LinesGroup) EndNote

data LinesGroup
    = Added (NonEmpty Line)
    | Deleted (NonEmpty Line)
    | Context (NonEmpty Line)

The first line of each unified hunk contains the line ranges and the optional section. Let’s define the parsers for these components before the parser for the hunk lines.

Recall that a range in the unified format is either one line number or the start line number and the length of the range, for which we have the constructors OneLine and StartLen, respectively. Thus, we define UniRangeParser just like normalRangeParser but with StartLen instead of StartEnd:

uniRangeParser :: Parser Range
uniRangeParser = either OneLine (uncurry StartLen) <$> intOrPairParser

The section heading, if present, is preceded by a space and extends to the end of the line, so that we can parse it by:

import Data.Word8 (_space)

sectionParser :: Parser Line
sectionParser = A.word8 _space *> A.takeTill AC.isEndOfLine

To parse a UniHunk I use the previous parsers for the first line and then parse the hunk lines. The ranges are between the strings “@@ ” and “ @@” and are separated by a space. The left range is prefixed by a plus sign and the right one by a dash. We have, then:

{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative (optional)

uniHunkParser :: Parser UniHunk
uniHunkParser = do
  leftRange <- "@@ -" *> uniRangeParser
  rightRange <- " +" *> uniRangeParser <* " @@"
  section <- optional sectionParser
  AC.endOfLine
  (groups, note) <- combineNotes <$> some groupParser
  pure $ UniHunk leftRange rightRange section groups note

The last line of each group of lines could be the last line of its file and therefore could be followed by an end note. Hence, each LinesGroup is parsed paired with an EndNote. These notes are then combined, as we’ll see later, to get the end note for the hunk.

The hunk lines

A parser for LinesGroup and its end note just has to match any of its variants:

groupParser :: Parser (LinesGroup, EndNote)
groupParser = leftGroupParser <|> rightGroupParser <|> contextGroupParser

For each group type, I use lineAfterParser with the appropriate prefix and collect the lines with som, and parse the end note for the group. The grouped lines and the note are paired, and wrapped with the constructor for the group:

leftGroupParser :: Parser (LinesGroup, EndNote)
leftGroupParser =
  (,) . Deleted
    <$> some (lineAfterParser "-")
    <*> leftNoteParser

rightGroupParser :: Parser (LinesGroup, EndNote)
rightGroupParser =
  (,) . Added
    <$> some (lineAfterParser "+")
    <*> rightNoteParser

contextGroupParser :: Parser (LinesGroup, EndNote)
contextGroupParser =
  (,) . Context
    <$> some (lineAfterParser " ")
    <*> bothNoteParser

Note that context lines appear in both the left and right files and therefore an end note after them is parsed with BothNoteParser, which we did not use for normal hunks.

Note also that added, deleted, and context lines are prefixed by the characters ‘+’, ‘-’ and ’ ‘’, respectively, which are passed to lineAfterParser as strings of length 1, rather than defining a similar parser for a single-character prefix.

Combining notes

In uniHunkParser, I used some groupParser to get the groups of lines in the hunk paired with their respective end notes. This parser has the type:

some groupParser :: Parser (NonEmpty (LinesGroup, EndNote))

From the non-empty list produced by this parser, we need to collect the first components into a list and combine the second components into the end note for the hunk. This gives the following type for combineNotes:

combineNotes :: NonEmpty (LinesGroup, EndNote) -> (NonEmpty LinesGroup, EndNote)

We could unzip the input list and then use mconcat over the list of notes; however, I’ll use the Control.Foldl module provided by the foldl package. This module allows us to define two independent folds which are then executed in a single pass over the input list: collect and combine.

For collect, the module provides a predefined list precisely to collect into a list, but we must previously get the first components of the pairs, which is done with premap:

import Control.Foldl qualified as Fold

collect = Fold.premap fst Fold.list

To combine, I use the Monoid instance of EndNote and use the mconcat fold provided by the Foldl module, “pre-mapping” with snd:

combine = Fold.premap snd Fold.mconcat

Using these two folds we can define combineNotes as:

import Data.Bifunctor (first)
import Data.List.NonEmpty (fromList)

combineNotes :: NonEmpty (LinesGroup, EndNote) -> (NonEmpty LinesGroup, EndNote)
combineNotes =
  first fromList . Fold.fold ((,) <$> collect <*> combine)
  where
    collect = Fold.premap fst Fold.list
    combine = Fold.premap snd Fold.mconcat

Since Fold.list produces a normal list, I use first from the Bifunctor module to convert the first component of the result into a non-empty list.

Some comments about alternative implementations for combineNotes:

The foldl package provides a Control.Foldl.NonEmpty which defines folding over NonEmpty. However, the module does not offer premap, which allows simple definitions for the folds I used.
We could define combine as Fold.foldMap snd id. I do not have any strong preference here, I used the version above because it is similar to the definition of collect.
You may have realized that only the last two groups of lines may have end notes other than None. We could, then, use Fold.lastN to get the last two notes and combine them with mconcat. However, after having a look at lastN implementation, I don’t think it would be more efficient except for very long lists. It could be worth exploring, though, and some may find this definition clearer:

import Data.Bifunctor (bimap)

combineNotes =
  bimap fromList mconcat . Fold.fold ((,) <$> collect <*> combine)
  where
    collect = Fold.premap fst Fold.list
    combine = Fold.premap snd (Fold.lastN 2)

Parsing git diffs

Recall from the definition of the Diff type that a gitDiff contains the manes and hashes of the diffed files and a non-empty list of unified hunks. I group the names and hashes in a NameHash, defined as:

type Hash = Text

data NameHash = NameHash Text Hash

In the git format, there are five lines before the hunks, from which we get the file names and hashes. Let’s see how to parse them.

The diff starts with a header line, which starts with “diff –git” and is followed by the file names. I parse this line with lineAfterParser "diff --git" but ignore its content, since the file names appear later in lines 4 and 5.

The second line starts with the string “index ”, followed by the two hashes, which I parse with hashesParser:

"index " *> hashesParser

Lines 4 and 5 contain the name of the left and right files, preceded by “--- ” and “+++ ” respectively.
I parse line 4 with `" *> takeRestAsText` and for line 5 I just change the prefix.
Like in the unified format, the left and right file names are actually their paths, and in this case, they are prefixed with “a/” and “b/” respectivley.

We already know how to parse the unified hunks, so we define `gitDiffParser` as:


gitDiffParser :: Parser Diff
gitDiffParser = do
    _gitHeader <- lineAfterParser "diff --git"
    (leftHash, rightHash) <- "index " *> hashesParser
    leftName <- "--- " *> takeRestAsText
    rightName <- "+++ " *> takeRestAsText
    uniHunks <- some uniHunkParser
    A.endOfInput
    let leftNameHash = NameHash leftName leftHash
    let rightNameHash = NameHash rightName rightHash
    pure $ GitDiff leftNameHash rightNameHash uniHunks

The remaining parser to define is hashesParser, which is quite simple: we just have to match the text before the space and the one after it and pair them.

hashesParser :: Parser (Hash, Hash)
hashesParser =
    (,)
        <$> takeTillAsText (_space ==)
        <* A.take 1 -- take _space
        <*> takeRestAsText

The real code

The seemingly unstructured code I have described so far is organized in five modules available in GitHub as a stack project:

DiffParse, which exports the diffParse function and the parser types.
Types, with the types described above, along with their Show instance declarations. This module makes use of the StrictData language extension so that all field types are strict.
NormalHunk and UniHunk, with the code for normalHunkParser and uniHunkParser.
Common, with some functions used in the other modules.

The project contains a Main module so that it can build a simple demo app to run and test the parser. The app reads from the standard input if no argument is given, or reads from the given file. It then parses its input and prints the result:

module Main (main) where

import Control.Exception (IOException, catch)
import Data.ByteString qualified as B
import DiffParse
import System.Environment (getArgs)
import System.Exit (die)
import System.IO (IOMode (ReadMode), stdin, withBinaryFile)

main :: IO ()
main =
    do
        args <- getArgs
        diff <- case args of
            [] -> diffParse <$> B.hGetContents stdin
            [fp] ->
                withBinaryFile fp ReadMode (fmap diffParse . B.hGetContents)
            _more_than_2_args ->
                die
                    "Please provide one file to read from,\
                    \ or none to read from standard input."
        print diff
        `catch` ( \(e :: IOException) -> do
                    die ("Problem reading input: " <> show e)
                )

The repository includes some sample .diff files which can be used as input to try the parser, and some tests. I use the hspec and hspec-golden packages to make three kinds of tests:

A simple test that checks if the samples parse to a Right diff, that is, that they do not return a parser error.
A golden test for each sample.
For the samples in unified and git formats, a test that checks that the number of lines parsed in each hunk is correct according to the ranges of the hunk.

Originally, I was going to describe the tests here, but this article is quite long as it is. I invite you to check the repo if interested.

The end

Thank you for reading all the way through this post! This should serve as an intermediate example of monadic/applicative parsing in Haskell, particularly using the Attoparsec library.

To use as a diff parser there are, of course, several implementation and design alternatives, some of which I have already described. One worth mentioning is matching end notes but ignoring them. They are, in fact, ignored in some diff output descriptions, and discarding them simplifies both the parser and the Diff type.

I hope you found this post helpful. Please consider sharing it on your social networks, and any feedback will be appreciated!

Table Of Contents