TL;DR: Haskell can process flat files as though they were hierarchical, with the help of lazy evaluation and one ridiculously clever one-line function.
When working on Day 7 of the Advent of Code, I asked an innocuous question on r/haskell that led me down an extremely deep rabbit hole. The first part of the problem gives you a textual specification of a tree as an unordered series of nodes and asks you to compute the depth. It doesn’t take a genius to figure out that the second part will also involve computing some property of the graph.
I’d originally planned to solve the problem like this:
- parse the lines
- assemble the lines into a tree structure
- use recursion-schemes to evaluate the answer
`recursion-schemes` is a powerful but rather impenetrable library (rather like `lens`) that provides generalisations of `foldr` and `unfoldr` that work for recursive structures other than `List`. Those familiar with the library will see that my final solution looks pretty similar to a solution using `cata`.
Unfortunately, my plan somewhat resembled the following:
- steal underpants
- ???
- profit!
in that I had nice, elegant ways of dealing with 1 and 3, but 2 was going to be just plain annoying to write. So I asked whether there was a better way to do it. There was, courtesy of a function called `loeb`. `loeb` is a way of using Haskell's pervasive laziness to circumvent a lot of busy work, by the neat trick of constructing the final object as if it already existed. This makes it a generalisation of those Stupid Dwarf Tricks like the Fibonacci ZipList implementation we trot out when showing how powerful laziness is.
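To make this concrete, here is `loeb` as defined in Luposchainsky's article, together with a tiny "spreadsheet" example of my own (the cells are purely illustrative, not puzzle data):

loeb :: Functor f => f (f a -> a) -> f a
loeb x = go where go = fmap ($ go) x

-- each element is a function from the *final* list to its own value:
-- cell 0 is the length of the finished list, cell 2 depends on cells 0 and 1
cells :: [[Int] -> Int]
cells = [length, const 2, \xs -> (xs !! 0) + (xs !! 1)]

-- loeb cells evaluates to [3, 2, 5]

Laziness resolves the dependencies for us; only a genuinely circular definition (a cell that needs its own value) would fail to terminate.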
This gave me a fairly neat way of processing the tree structure by creating a map from node names to assembled nodes, but something still bothered me: I had to write one function that created the map in the first place, and another that looked up values in the map. This offended my sense of symmetry, but I finally came up with what I believe to be a very nice way of solving it that uses the generalisation `moeb`, which is described in more detail in the David Luposchainsky article linked above.
Preamble
I’ve written this up as literate Haskell, so next we have the inevitable headers:
{-# LANGUAGE ScopedTypeVariables #-}
Although not used in the final code, the ability to constrain the types of identifiers in `where` clauses is vital for me when I try to figure out why my code isn't typechecking.
Some of the following declarations are used in part two.
module Day7blog2 where
import qualified Text.Megaparsec as MP
import qualified Text.Megaparsec.String as S
import qualified Text.Megaparsec.Char as C
import qualified Text.Megaparsec.Lexer as L
Megaparsec is a fork of the Parsec library with a very similar API that aims to be faster and produce better error messages.
import qualified Data.Map.Lazy as M
import qualified Control.Monad as CM
import qualified Control.Monad.Trans.Either as E
import qualified Control.Monad.State.Lazy as St
import Data.Function (on)
import Data.List (maximumBy)
import Data.Monoid ((<>))
Parsing the input file
We start by defining a type that corresponds to a line in the file.
data Node = Node {
  name :: String,
  weight :: Integer,
  children :: [String]
  } deriving (Show)
This is the first surprise of the code: if I were using recursion-schemes, I'd have created `Node a` where `children :: [a]` and derived `Functor`. If I had actually wanted to construct the tree as I had originally planned, I'd still have gone with that design, but as we'll see, it's possible to never actually construct the tree.
Next, we need a parser to map the lines of the file to our `Node` representation. We'll recall that the format looks like this:
ktlj (57)
fwft (72) -> ktlj, cntj, xhth
nodeParser :: S.Parser Node
nodeParser = do
  n <- identifier
  _ <- C.char ' '
  w <- C.char '(' *> L.integer <* C.char ')'
  c <- MP.try parseChildren MP.<|> pure []
  _ <- MP.option '\n' C.newline -- the final line of the file doesn't have a newline
  pure Node { name = n, weight = w, children = c }
  where comma = C.char ',' <* C.char ' '
        parseChildren = C.string " -> " *> MP.sepBy identifier comma
        identifier = MP.many MP.letterChar
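A quick sanity check in GHCi, using the sample line from above:

λ> MP.parseTest nodeParser "fwft (72) -> ktlj, cntj, xhth\n"
Node {name = "fwft", weight = 72, children = ["ktlj","cntj","xhth"]}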
I'm quickly growing very fond of parser combinators. I had trouble getting Megaparsec to work when I used its whitespace functions, but the file is very regular so I just threw them away.
day7text :: IO String
day7text = readFile "C:\\Users\\me\\advent\\day7.txt"
Yes, I’m using Windows. As long as you don’t try to use a C library it’s not that bad.
nodes :: IO (Either (MP.ParseError Char MP.Dec) [Node])
nodes = MP.parse (MP.many nodeParser) "" <$> day7text
This type is pretty ugly, but I don’t know what I could do to fix it.
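One small, purely cosmetic tidy-up (my own suggestion) would be a type alias:

type Parsed a = Either (MP.ParseError Char MP.Dec) a

-- which would let the signature above read: nodes :: IO (Parsed [Node])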
Computing the depth of the tree
Now we've got a list of `Node`s, we need to figure out how to process them as, or into, a tree. We currently have no idea which `Node` relates to which, and the data isn't organised in a way that makes this easy. Indeed, if we had already constructed a tree of `Node`s, computing the depth would be pretty simple. The standard way I'd handle this in C# would involve multiple passes, building up partial results until we'd finally taken everything off the "todo" pile. It works, but it's not an approach I much like.
Wouldn't it be great if we could process the nodes without worrying about the order, and let Haskell's laziness resolve everything as appropriate? It turns out we can, using `moeb`.
moeb :: (((result -> intermediate) -> intermediate) -> input -> result) -> input -> result
moeb f x = go where go = f ($ go) x
This type signature is, frankly, too complex for someone at my level to work out, but the article by David Luposchainsky mentions that `((r -> m) -> m) -> i -> r` is satisfied by `fmap`, `foldMap` and `traverse`. (As an aside, `moeb foldr` attempts to construct an infinite type, so maybe don't use that.) I don't know if there are any other interesting functions that satisfy it.
So, here’s what those applications give us:
moeb fmap :: Functor f => f (f b -> b) -> f b
moeb foldMap :: (Monoid r, Foldable t) => t (r -> r) -> r
moeb traverse :: (Applicative f, Traversable t) => t (f (t b) -> f b) -> f (t b)
As established in the article, `moeb fmap` is `loeb`. Using `f = Map String` would give you `Map String (Map String b -> b) -> Map String b`. `moeb traverse` gives you a monadic version of `loeb`, which doesn't seem to buy us much over `loeb`.
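For instance, with `f = Map String`, `moeb fmap` behaves like a tiny spreadsheet (a toy example of my own, not puzzle data):

sheet :: M.Map String (M.Map String Int -> Int)
sheet = M.fromList
  [ ("a", const 1)
  , ("b", \m -> m M.! "a" + 1)
  , ("c", \m -> m M.! "a" + m M.! "b")
  ]

-- moeb fmap sheet == M.fromList [("a",1),("b",2),("c",3)]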
Originally I tried implementing this code using `loeb`. It would have worked, but as mentioned earlier I still had some reservations. It was then that it occurred to me that, as well as being a `Functor`, `Map String b` is a `Monoid` (under left-biased union).
So, a specialized version of `moeb foldMap` would be `[Map String b -> Map String b] -> Map String b`, where the `String`s are the node names and the `b`s are whatever values we want to compute. We need a list of `Map String b -> Map String b` functions, and obviously they need to be generated from the list of `Node`s, or we haven't got a result that depends on the inputs. (There's probably some Yoneda-adjacent insight to be had here, but I'm not there yet.)
So that means what we actually need is a function like this: `Node -> Map String b -> Map String b`. The first parameter is the `Node`, the second parameter the lazily evaluated final result, and the result is the singleton map entry for the original node. This can use the "final result" map to look up the child nodes and compute the current node's value.
Then `moeb foldMap` constructs these singleton maps and smushes them together to get the final result, i.e. final result -> intermediate results -> final result. As I've said before, none of this would work without a serious amount of lazy evaluation.
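Spelled out (my own gloss, not from the article), the knot-tying looks like this:

-- moeb foldMap fs = go where go = foldMap ($ go) fs
--
-- every function in fs is handed the combined final result go, even though
-- go is itself assembled from those very applications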
For such a long explanation, the resultant code is extremely short.
getDepth1 :: Node -> M.Map String (Maybe Integer) -> M.Map String (Maybe Integer)
getDepth1 n m = M.singleton (name n) (((1+) . foldr max 0) <$> z)
  where z = traverse CM.join $ (`M.lookup` m) <$> children n
We're using `Maybe Integer` rather than `Integer` to capture the possibility that there's a broken reference in the file. So this function:
- takes the node names of the children,
- looks up their depths in `m` (adding another `Maybe` to the type in the process),
- takes the `[Maybe (Maybe Integer)]` and turns it into `Maybe [Integer]` courtesy of `traverse CM.join`,
- finds the largest value (`0` if the list is empty),
- adds one,
- makes a singleton map from the node name to that new value.
That's the hard work done; now we just follow the types to get this solution:
day7a = do
  Right n <- nodes
  (pure . maximumBy (on compare snd) . M.assocs . moeb foldMap) $ getDepth1 <$> n
Print that out in GHCi and it gives us the depth and the name of the root node. This is good, but still a bit ugly and also not very general.
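If you want to convince yourself without the full puzzle input, a two-node toy does the job (the names are invented):

testNodes :: [Node]
testNodes = [Node "root" 20 ["leaf"], Node "leaf" 10 []]

-- moeb foldMap (getDepth1 <$> testNodes)
--   == M.fromList [("leaf", Just 1), ("root", Just 2)]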
Taking it further
Let's imagine that, instead, we wanted to sum the weights. At this point, we should come up with something more reusable. After all, all we really want to write is a catamorphism function that takes the current node and the results of its children, and gives us the result for the current node. The type for this would be `Node -> [a] -> a`, and we want a function that takes that and gives us something that looks like `getDepth1`.
So cleaning up the code we’ve already written gives us this:
nodeReduce :: (Node -> [a] -> a) -> Node -> M.Map String (Maybe a) -> M.Map String (Maybe a)
nodeReduce f n m = M.singleton (name n) (f n <$> z)
  where z = traverse CM.join $ (`M.lookup` m) <$> children n
This is pretty similar to the code in `getDepth1`. Now we can write a general routine for processing `Node`s. It's also obvious how to generalize this even further; I'll sketch that just after `process`.
process :: (Node -> [a] -> a) -> [Node] -> M.Map String (Maybe a)
process f l = moeb foldMap $ nodeReduce f <$> l
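On that further generalisation: nothing in `nodeReduce` really depends on `Node` beyond how we extract a key and the child keys, so something like the following (names invented, a sketch rather than battle-tested code) would work for any flat, keyed structure:

reduce :: Ord k => (n -> k) -> (n -> [k]) -> (n -> [a] -> a) -> [n] -> M.Map k (Maybe a)
reduce key kids f l = moeb foldMap $ step <$> l
  where step n m = M.singleton (key n) (f n <$> traverse CM.join ((`M.lookup` m) <$> kids n))

-- process f == reduce name children f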
getDepth2 :: Node -> [Integer] -> Integer
getDepth2 _ = (1+) . foldr max 0
day7a2 = do
  Right n <- nodes
  (pure . maximumBy (on compare snd) . M.assocs . process getDepth2) n
So, now that we have a cleaner way of getting the same result as last time, let's extend it to sum the weights.
data NodeWeight = NodeWeight {
  node :: Node,
  totalWeight :: Integer
  } deriving (Show)
We could just compute the `Integer` and do without this type, but when you're debugging it's rather useful to have all the relevant information around.
sumNodes :: Node -> [NodeWeight] -> NodeWeight
sumNodes n l = NodeWeight {
  node = n,
  totalWeight = weight n + sum (totalWeight <$> l)
  }
So, by dropping `sumNodes` in instead of `getDepth2`, we can solve a different problem.
day7sumNodes :: IO (Either (MP.ParseError Char MP.Dec) (M.Map String (Maybe NodeWeight)))
day7sumNodes = (fmap . fmap) (process sumNodes) nodes

-- note: this shadows Prelude.lookup
lookup :: String -> IO (Either (MP.ParseError Char MP.Dec) (Maybe NodeWeight))
lookup x = (fmap . fmap) (M.! x) day7sumNodes
Of course, no-one was actually asking us for this answer, and the second part is significantly harder. However, this has hopefully shown the power of the technique, which allows you to process flat files in an order imposed by their logical structure. In the second part we'll show how this can be used to solve the much harder part b of Day 7: finding the incorrect value in the file.
Many thanks must go to Kris Jenkins for reviewing this post and providing some really valuable feedback, and to those who helped me out and showed me interesting things on r/haskell.