Parsing Diff Output in Haskell
By Pedro R. Borges
I assume the reader has some familiarity with monadic parsing in Haskell, particularly with the Attoparsec library, as well as with the diff formats parsed. For the latter, I invite you to read my previous article, where I described a grammar for diffs in these formats.
Like in the grammar article, I call the diffed files left and right files. The parser takes a ByteString as input, and the lines coming from the diffed files reported in the diff are kept in that type. However, text not coming directly from the diffed files is converted to the Text type.
The code on this post is available on GitHub and is compiled with the GHC2021 set of extensions.
Table Of Contents
Entry point and the Diff type
The entry point for our parser is the function diffParse:: ByteString -> Either String Diff
which, given a (ByteString) input, returns a value of type Diff
or an error string.
To produce its result, it supplies a Parser Diff
to Attoparsec’s function parseOnly
, of type:
parseOnly :: Parser a -> ByteString -> Either String a
Let’s define the Diff
type and our entry point function.
The Diff type
A diff can be empty, or be in normal, unified, or git format:
- An empty diff is
EmptyDiff
. - A normal format diff is a
NormalDiff
and contains a non-empty list of hunks of typeNormalHunk
. - A unified format diff is a
UniDiff
, which contains the name and modification time for each diffed file, each of typeNameTime
, and a non-empty list of hunks of typeUniHunk
. - A diff in git format is a
GitDiff
, which contains aNameHash
for each diffed file with their respective names and hashes, in addition to a non-empty list of hunks of typeUniHunk
.
Hence, I define the Diff
type as:
import Data.List.NonEmpty (NonEmpty)
import Data.Text (Text)
data Diff
= EmptyDiff
| NormalDiff (NonEmpty NormalHunk)
| UniDiff NameTime NameTime (NonEmpty UniHunk)
| GitDiff NameHash NameHash (NonEmpty UniHunk)
Entry point function: diffParse
I define parsers for each diff variant named, respectively, emptyParser
, normalDiffParser
, uniDiffParser
, and gitDiffParser
.
Combining these parsers with the alternative operator <|>
from Control.Applicative
we have a parser for Diff
, which we can supply to parseOnly
:
import Control.Applicative ((<|>))
import Data.Attoparsec.ByteString qualified as A
import Data.ByteString (ByteString)
diffParse :: ByteString -> Either String Diff
diffParse =
A.parseOnly $
emptyDiffParser
<|> normalDiffParser
<|> uniDiffParser
<|> gitDiffParser
The definition of emptyParser
is trivial; we only have to parse the end of the input and return EmptyDiff
.
The end-of-input is parsed by endOfInput
from Data.Attoparsec.ByteString
:
import Data.Attoparsec.ByteString (Parser)
emptyDiffParser :: Parser Diff
emptyDiffParser = A.endOfInput >> pure EmptyDiff
Parsing normal diffs
A normal diff is composed of a non-empty list of normal hunks.
To parse each normal hunk I define normalHunkParser
,
and for the list, I use some
, defined in the Control.Monad.Combinators.NonEmpty
module from the parser-combinators package.
This combinator is similar to the commonly used many1
and many1'
, except that it builds a NonEmpty
list instead of a plain list.
import Control.Monad.Combinators.NonEmpty (some)
normalDiffParser :: Parser Diff
normalDiffParser = NormalDiff <$> some normalHunkParser <* A.endOfInput
Note the use of <*
before A.endOfInput
, since its result is discarded.
For the three variants of hunks in the normal format, I define the NormalHunk
type as:
type Line = ByteString
data NormalHunk
= AddHunk Range Range (NonEmpty Line) EndNote
| DeleteHunk Range Range (NonEmpty Line) EndNote
| ChangeHunk Range Range (NonEmpty Line) (NonEmpty Line) EndNote
The Range
components contain the line numbers from the left and right files in each hunk.
I use this type for the normal hunks as well as for the unified hunks.
A AddHunk
contains a non-empty list of added lines.
A DeleteHunk
contains a non-empty list of deleted lines, and a ChangeHunk
has both added and deleted lines.
I use the type alias Line
to indicate lines coming from the diffed files.
Each hunk also has an element of type EndNote
, which indicates if the last line of any or both files does not end in a newline character.
This type is also used for unified hunks.
Only the last hunk on a diff may contain end notes;
however, while parsing each hunk, it is unknown if it is the last one.
For this reason, and simplicity, I have chosen to parse each hunk as if it were the last one, and consequently, to include an EndNote
in each.
This applies also to unified hunks.
The hunk descriptors
In the normal format, each hunk starts with a descriptor line containing the left file range, a character indicating the hunk type, and the right file range.
The character is ‘a’ for an AddHunk
, ’d’ for a DeleteHunk
, and ‘c’ for a ChangeHunk
.
I parse this line with descriptorParser
, which returns a triplet with the left range, the parser for the hunk, and the right range.
The parser returned is addHunkParser
, deleteHunkParser
, or changeHunkParser
, when the descriptor line contains ‘a’, ’d’, or ‘c’, respectively.
Each of these parsers is of type Range -> Range -> Parser NormalHunk
since they receive the ranges on the hunk descriptor.
import Data.Attoparsec.ByteString.Char8 qualified as AC (endOfLine)
import Data.Word8 (_a, _c, _d)
descriptorParser :: Parser (Range, Range -> Range -> Parser NormalHunk, Range)
descriptorParser =
(,,)
<$> normalRangeParser
<*> hunkParserParser
<*> normalRangeParser
<* AC.endOfLine
where
hunkParserParser =
A.choice
[ A.word8 _a >> pure addHunkParser
, A.word8 _d >> pure deleteHunkParser
, A.word8 _c >> pure changeHunkParser
]
Here, AC.endOfLine
matches either ‘\n’ or ‘\r\n’, and A.word8
parses the character given as its argument.
From Data.Word8
, provided by the word8 package, we take mnemonic names for the ByteString
words ‘a’, ‘c’, and ’d’.
The A.choice
function generalizes <|>
to a list of alternatives.
A hunkParserParser
? Well, it is a parser, and it produces a hunk parser.
So, it is consistent with the other names in the article.
The ranges
I define the Range
type for the ranges in both the normal and unified hunks.
In a normal format diff, the range of lines may be indicated by just one line number, or by the start and end line numbers. In the unified hunks, they can be either one line number, or the start line number and the length of the range.
We have, then:
type LNumber = Int
data Range
= OneLine LNumber
| StartEnd LNumber LNumber
| StartLen LNumber Int
Here we use the type alias LNumber
for an Int
that indicates a line number.
To parse a range then, we need to parse either a number, or a pair of numbers, and then apply the appropriate constructor.
To parse the number(s), I define intOrPairParser
as:
import Data.Word8 (_comma)
intOrPairParser :: Parser (Either Int (Int, Int))
intOrPairParser = do
int1 <- AC.decimal
A.option (Left int1) $
Right . (int1,) <$> (A.word8 _comma *> AC.decimal)
In the first step, we parse the first number as int1
.
In the second step, we try to parse a second number after a comma.
If successful, we paired it with int1
and returned with Right
; otherwise, we return Left int1
.
To parse a normal range using intOrPairParser
we only have to apply the corresponding constructor to each possible result:
normalRangeParser :: Parser Range
normalRangeParser = either OneLine (uncurry StartEnd) <$> intOrPairParser
I take the opportunity to define a function to calculate the number of lines in a range, which I’ll use to parse normal hunks, and in some tests for unified hunks:
rangeLength :: Range -> Int
rangeLength (OneLine _ln) = 1
rangeLength (StartEnd start end) = end - start + 1
rangeLength (StartLen _start len) = len
The normal hunks
Let us now define the parsers for the three types of normal hunks. As mentioned above, these parsers receive the range of lines from the left and right files in the hunk. The ranges are the first two components of each hunk, and they also provide the number of lines to be parsed for the hunk.
To parse an add hunk, we supply the ranges to the constructor and apply it to the parsers for the other two components:
addHunkParser :: Range -> Range -> Parser NormalHunk
addHunkParser leftRange rightRange =
AddHunk leftRange rightRange
<$> addedLinesParser (rangeLength rightRange)
<*> rightNoteParser
Here, The addedLinesParser
receives the number of lines to be parsed, and rightNoteParser
returns an EndNote
indicating if the last line of the hunk does not have an ending newline character.
Since the added lines come from the right file its count is calculated from rightRange
, and the end note corresponds to that file.
The definition of deleteHunkParser
follows the same pattern:
deleteHunkParser :: Range -> Range -> Parser NormalHunk
deleteHunkParser leftRange rightRange =
DeleteHunk leftRange rightRange
<$> deletedLinesParser (rangeLength leftRange)
<*> leftNoteParser
Here, the number of lines to be parsed by deletedLines
is provided by leftRange
, and the result of leftNoteParser
indicates the presence or absence of the end note for the left file.
A change hunk contains the lines and the note from the left file, as in a delete hunk, and the lines and note from the right file, as in a add hunk, separated by a (discarded) line of dashes:
changeHunkParser :: Range -> Range -> Parser NormalHunk
changeHunkParser leftRange rightRange = do
deleted <- deletedLinesParser (rangeLength leftRange)
leftNote <- leftNoteParser
_dashes <- lineAfterParser "-"
added <- addedLinesParser (rangeLength rightRange)
rightNote <- rightNoteParser
pure $
ChangeHunk leftRange rightRange deleted added (leftNote <> rightNote)
The EndNote
for the hunk is the combination of the left and right notes.
I use the semigroup operator since I make EndNote
an instance of Semigroup
, as we’ll see later.
To parse the dashes we parse the line after a ‘-’ with the function lineAfterParser
, which we’ll define in a moment.
The lines from the diffed files
I now define the parsers for the lines from the diffed files reported in the hunks in normal format. These lines are prefixed by “> " or “< “, that is, by ‘>’ or ‘<’, and a space.
I define a parser to match the rest of the line after a given prefix:
import Data.Word8 (_lf)
lineAfterParser :: ByteString -> Parser Line
lineAfterParser prefix =
A.string prefix *> A.takeTill (== _lf) <* A.take 1
First, we match the given prefix with the A.string`` prefix
parser and discard it, hence the *>
.
Then we parse the rest of the line up to, but not including the newline character (_lf
).
Finally, we match the newline character with A.take1
and discard it.
Note that to parse the line after the prefix I could have used:
A.takeTill AC.isEndOfLine <* AC.endOfLine
This would not include the ‘\r’ in the parser result when the line ended in ‘\r\n’. But what if the diffed files have different line endings?
Consider, for example, an empty line.
In one file this line would be "\r\n"
and in the other "\n"
, but both would be parsed as ""
, that is, the empty string.
Hence, lines reported as different would be parsed as the same line.
This could be a valid option, but I prefer to have the difference explicit.
The downside is that for files with ‘\r\n’ line endings, all the parsed lines will contain an ending ‘\r’.
Using lineAfterParsing
it is quite easy to define addedLines
and deletedLines
: we just have to invoke lineAfterParser
repeatedly and collect its result in a non-empty list.
I could use some
as I did for the normal hunks, but since the number of lines to be parsed is known, I instead use A.count
and then convert the list to NonEmpty
:
import Control.Exception (assert)
import Data.List.NonEmpty (fromList)
addedLinesParser, deletedLinesParser :: Int -> Parser (NonEmpty Line)
addedLinesParser nLines =
assert (nLines > 0) $
fromList <$> A.count nLines (lineAfterParser "> ")
deletedLinesParser nLines =
assert (nLines > 0) $
fromList <$> A.count nLines (lineAfterParser "< ")
The fromList
function converts a plain list into a NonEmpty
and raises an exception when its input is empty.
Our parser is intended to be used on diff outputs, so `fromList`` should not raise an exception here.
However, I have added an assertion just to be reassured, and to have a clearer conscience?!
The end notes
To complete the definition of normalDiffParser
We need to define EndNote
as well as leftNoteParser
and rightNoteParser
.
I’ll also define bothNoteParser
, which we’ll need when parsing unified diffs.
The type is defined as:
data EndNote = None | LeftNote | RightNote | BothNote
The constructors indicate whether a hunk has no missing newline note, a note for its left file, one for its right file, or notes for both files.
An end note is a line that starts with a backslash (’\’) and a space, followed by a localized message, which in English reads “No newline at end of file”. To recognize an end note I just match a line prefixed with “\ “, and void the rest of the line:
import Control.Monad (void)
-- "\\" as prefix would work, since no other line starts with '\'
-- and the rest of the line is discarded
noteLineParser :: Parser ()
noteLineParser = void $ lineAfterParser "\\ "
For leftNoteParser
I just return LeftNote
if noteLineParser
succeeds, and None
otherwise.
This is easily done with A.option
.
The definition of rightNoteParser
and bothNoteParser
follow the same pattern:
leftNoteParser, rightNoteParser, bothNoteParser :: Parser EndNote
leftNoteParser = A.option None (LeftNote <$ noteLineParser)
rightNoteParser = A.option None (RightNote <$ noteLineParser)
bothNoteParser = A.option None (BothNote <$ noteLineParser)
As we have seen, when we parse a changeHunk
, we need to combine two endNotes
parsed separately to obtain the note for the hunk.
To parse unified diffs later, I’ll need to combine more than two notes.
For these combinations, I make EndNote
an instance of Semigroup
and define the <>
operator for end notes.
The operation needs to preserve any note when combined with None
and produce BothNote
when a left note and a right note are combined.
Therefore, we define the semigroup instance by:
instance Semigroup EndNote where
None <> other = other
other <> None = other
_ <> _ = BothNote
Technically, <>
should be idempotent on end notes, given that, for example, two left notes should produce a left note, and similarly for right notes.
The definition above, however, yields BothNote
for both cases.
Nevertheless, I’ll keep the definition above since the potentially wrong cases do not occur in valid diffs.
Note also that None
is a neutral element, and therefore EndNote
is a monoid.
I make EndNote
an instance of Monoid
since I’ll make use of this fact when parsing unified diff.
instance Monoid EndNote where
mempty = None
Parsing unified diffs
As we saw earlier, a unified dif contains two elements of tyepe NameTime
with info about the left and right file and a non-empty list of hunks in the unified format.
The info about the left file appears in a line that starts with 3 dashes and a space ("--- "
), followed by the info about the right file, in a line starting with 3 plus signs and a space ("+++ "
).
Then, to parse a unified diff we parse the info about the left and right files, the hunks, and the end of input, and return a UniDiff
with these components:
{-# LANGUAGE OverloadedStrings #-}
uniDiffParser :: Parser Diff
uniDiffParser = do
leftFile <- "--- " *> nameTimeParser
rightFile <- "+++ " *> nameTimeParser
uniHunks <- some uniHunkParser
A.endOfInput
pure $ UniDiff leftFile rightFile uniHunks
The OverloadedStrings extension allows us to use strings as parsers.
That’s why, for example, I used "--- "
instead of A.string "--- "
.
Parsing NameTime
The information about each file in the diff consists of the name of the file (or its path), and its modification time:
type TimeStamp = Text
data NameTime = NameTime Text TimeStamp
We could, of course, use a proper date and time type for the time stamp, but I am leaving it as Text
for this post.
The two elements of information are separated by a tab character, so we can define the parser for NameTime
as:
import Data.Word8 (_tab)
nameTimeParser :: Parser NameTime
nameTimeParser =
NameTime
<$> takeTillAsText (_tab ==)
<* A.take 1 -- take _tab
<*> takeRestAsText
The A.take 1
parser matches 1 byte, in this case, the tab character.
I define takeTillAsText
by applying A.takeTill
and converting its result to Text
, and use it to define takeRestAsText
, which matches the rest of the line as Text
:
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)
takeRestAsText :: Parser Text
takeRestAsText = takeTillAsText AC.isEndOfLine <* AC.endOfLine
takeTillAsText :: (Word8 -> Bool) -> Parser Text
takeTillAsText p = decodeUtf8With lenientDecode <$> A.takeTill p
The unified hunks
A unified hunk is made of:
- The range of lines in the hunk for the left and right file, of the already described type
Range
. - An optional heading or section that the hunk is part of, which I represent as a
Maybe Line
. - A non-empty list of groups of lines, each of type
LinesGroup
. These groups indicate whether the lines come from the left file (deleted lines), the right file (added lines), or are context lines (in both files). - An
EndNote
, like in the normal hunks.
This gives the following definition for UniHunk
and LinesGroup
:
type Section = Maybe Line
data UniHunk = UniHunk Range Range Section (NonEmpty LinesGroup) EndNote
data LinesGroup
= Added (NonEmpty Line)
| Deleted (NonEmpty Line)
| Context (NonEmpty Line)
The first line of each unified hunk contains the line ranges and the optional section. Let’s define the parsers for these components before the parser for the hunk lines.
Recall that a range in the unified format is either one line number or the start line number and the length of the range, for which we have the constructors OneLine
and StartLen
, respectively.
Thus, we define UniRangeParser
just like normalRangeParser
but with StartLen
instead of StartEnd
:
uniRangeParser :: Parser Range
uniRangeParser = either OneLine (uncurry StartLen) <$> intOrPairParser
The section heading, if present, is preceded by a space and extends to the end of the line, so that we can parse it by:
import Data.Word8 (_space)
sectionParser :: Parser Line
sectionParser = A.word8 _space *> A.takeTill AC.isEndOfLine
To parse a UniHunk
I use the previous parsers for the first line and then parse the hunk lines.
The ranges are between the strings “@@ ” and “ @@” and are separated by a space.
The left range is prefixed by a plus sign and the right one by a dash.
We have, then:
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (optional)
uniHunkParser :: Parser UniHunk
uniHunkParser = do
leftRange <- "@@ -" *> uniRangeParser
rightRange <- " +" *> uniRangeParser <* " @@"
section <- optional sectionParser
AC.endOfLine
(groups, note) <- combineNotes <$> some groupParser
pure $ UniHunk leftRange rightRange section groups note
The last line of each group of lines could be the last line of its file and therefore could be followed by an end note.
Hence, each LinesGroup
is parsed paired with an EndNote
.
These notes are then combined, as we’ll see later, to get the end note for the hunk.
The hunk lines
A parser for LinesGroup
and its end note just has to match any of its variants:
groupParser :: Parser (LinesGroup, EndNote)
groupParser = leftGroupParser <|> rightGroupParser <|> contextGroupParser
For each group type, I use lineAfterParser
with the appropriate prefix and collect the lines with som
, and parse the end note for the group.
The grouped lines and the note are paired, and wrapped with the constructor for the group:
leftGroupParser :: Parser (LinesGroup, EndNote)
leftGroupParser =
(,) . Deleted
<$> some (lineAfterParser "-")
<*> leftNoteParser
rightGroupParser :: Parser (LinesGroup, EndNote)
rightGroupParser =
(,) . Added
<$> some (lineAfterParser "+")
<*> rightNoteParser
contextGroupParser :: Parser (LinesGroup, EndNote)
contextGroupParser =
(,) . Context
<$> some (lineAfterParser " ")
<*> bothNoteParser
Note that context lines appear in both the left and right files and therefore an end note after them is parsed with BothNoteParser
, which we did not use for normal hunks.
Note also that added, deleted, and context lines are prefixed by the characters ‘+’, ‘-’ and ’ ‘’, respectively, which are passed to lineAfterParser
as strings of length 1, rather than defining a similar parser for a single-character prefix.
Combining notes
In uniHunkParser
, I used some groupParser
to get the groups of lines in the hunk paired with their respective end notes.
This parser has the type:
some groupParser :: Parser (NonEmpty (LinesGroup, EndNote))
From the non-empty list produced by this parser, we need to collect the first components into a list and combine the second components into the end note for the hunk.
This gives the following type for combineNotes
:
combineNotes :: NonEmpty (LinesGroup, EndNote) -> (NonEmpty LinesGroup, EndNote)
We could unzip the input list and then use mconcat
over the list of notes;
however, I’ll use the Control.Foldl
module provided by the foldl package.
This module allows us to define two independent folds which are then executed in a single pass over the input list: collect
and combine
.
For collect
, the module provides a predefined list
precisely to collect into a list, but we must previously get the first components of the pairs, which is done with premap
:
import Control.Foldl qualified as Fold
collect = Fold.premap fst Fold.list
To combine, I use the Monoid
instance of EndNote
and use the mconcat
fold provided by the Foldl
module, “pre-mapping” with snd
:
combine = Fold.premap snd Fold.mconcat
Using these two folds we can define combineNotes
as:
import Data.Bifunctor (first)
import Data.List.NonEmpty (fromList)
combineNotes :: NonEmpty (LinesGroup, EndNote) -> (NonEmpty LinesGroup, EndNote)
combineNotes =
first fromList . Fold.fold ((,) <$> collect <*> combine)
where
collect = Fold.premap fst Fold.list
combine = Fold.premap snd Fold.mconcat
Since Fold.list
produces a normal list, I use first
from the Bifunctor
module to convert the first component of the result into a non-empty list.
Some comments about alternative implementations for combineNotes
:
- The foldl package provides a
Control.Foldl.NonEmpty
which defines folding overNonEmpty
. However, the module does not offerpremap
, which allows simple definitions for the folds I used. - We could define
combine
asFold.foldMap snd id
. I do not have any strong preference here, I used the version above because it is similar to the definition ofcollect
. - You may have realized that only the last two groups of lines may have end notes other than
None
. We could, then, useFold.lastN
to get the last two notes and combine them withmconcat
. However, after having a look atlastN
implementation, I don’t think it would be more efficient except for very long lists. It could be worth exploring, though, and some may find this definition clearer:
import Data.Bifunctor (bimap)
combineNotes =
bimap fromList mconcat . Fold.fold ((,) <$> collect <*> combine)
where
collect = Fold.premap fst Fold.list
combine = Fold.premap snd (Fold.lastN 2)
Parsing git diffs
Recall from the definition of the Diff
type that a gitDiff
contains the manes and hashes of the diffed files and a non-empty list of unified hunks.
I group the names and hashes in a NameHash
, defined as:
type Hash = Text
data NameHash = NameHash Text Hash
In the git format, there are five lines before the hunks, from which we get the file names and hashes. Let’s see how to parse them.
The diff starts with a header line, which starts with “diff –git” and is followed by the file names.
I parse this line with lineAfterParser "diff --git"
but ignore its content, since the file names appear later in lines 4 and 5.
The second line starts with the string “index ”, followed by the two hashes, which I parse with hashesParser
:
"index " *> hashesParser
Lines 4 and 5 contain the name of the left and right files, preceded by “--- ” and “+++ ” respectively.
I parse line 4 with `" *> takeRestAsText` and for line 5 I just change the prefix.
Like in the unified format, the left and right file names are actually their paths, and in this case, they are prefixed with “a/” and “b/” respectivley.
We already know how to parse the unified hunks, so we define `gitDiffParser` as:
gitDiffParser :: Parser Diff
gitDiffParser = do
_gitHeader <- lineAfterParser "diff --git"
(leftHash, rightHash) <- "index " *> hashesParser
leftName <- "--- " *> takeRestAsText
rightName <- "+++ " *> takeRestAsText
uniHunks <- some uniHunkParser
A.endOfInput
let leftNameHash = NameHash leftName leftHash
let rightNameHash = NameHash rightName rightHash
pure $ GitDiff leftNameHash rightNameHash uniHunks
The remaining parser to define is hashesParser
, which is quite simple: we just have to match the text before the space and the one after it and pair them.
hashesParser :: Parser (Hash, Hash)
hashesParser =
(,)
<$> takeTillAsText (_space ==)
<* A.take 1 -- take _space
<*> takeRestAsText
The real code
The seemingly unstructured code I have described so far is organized in five modules available in GitHub as a stack project:
- DiffParse, which exports the
diffParse
function and the parser types. - Types, with the types described above, along with their Show instance declarations. This module makes use of the
StrictData
language extension so that all field types are strict. - NormalHunk and UniHunk, with the code for
normalHunkParser
anduniHunkParser
. - Common, with some functions used in the other modules.
The project contains a Main module so that it can build a simple demo app to run and test the parser. The app reads from the standard input if no argument is given, or reads from the given file. It then parses its input and prints the result:
module Main (main) where
import Control.Exception (IOException, catch)
import Data.ByteString qualified as B
import DiffParse
import System.Environment (getArgs)
import System.Exit (die)
import System.IO (IOMode (ReadMode), stdin, withBinaryFile)
main :: IO ()
main =
do
args <- getArgs
diff <- case args of
[] -> diffParse <$> B.hGetContents stdin
[fp] ->
withBinaryFile fp ReadMode (fmap diffParse . B.hGetContents)
_more_than_2_args ->
die
"Please provide one file to read from,\
\ or none to read from standard input."
print diff
`catch` ( \(e :: IOException) -> do
die ("Problem reading input: " <> show e)
)
The repository includes some sample .diff files which can be used as input to try the parser, and some tests. I use the hspec and hspec-golden packages to make three kinds of tests:
- A simple test that checks if the samples parse to a
Right
diff, that is, that they do not return a parser error. - A golden test for each sample.
- For the samples in unified and git formats, a test that checks that the number of lines parsed in each hunk is correct according to the ranges of the hunk.
Originally, I was going to describe the tests here, but this article is quite long as it is. I invite you to check the repo if interested.
The end
Thank you for reading all the way through this post! This should serve as an intermediate example of monadic/applicative parsing in Haskell, particularly using the Attoparsec library.
To use as a diff parser there are, of course, several implementation and design alternatives, some of which I have already described.
One worth mentioning is matching end notes but ignoring them.
They are, in fact, ignored in some diff output descriptions, and discarding them simplifies both the parser and the Diff type
.
I hope you found this post helpful. Please consider sharing it on your social networks, and any feedback will be appreciated!