Text sorting is a classic problem. It turns out to be incredibly useful, as a command line tool, and a poor implementation could be unusable, even if it performed to its specification.
K&R broke this chapter into two implementations: a simpler, in-memory sort program, and a out-of-core sort that uses temporary files to reduce its memory footprint and improve its performance (successfully, even in Haskell).
Following along, here is the initial attempt:
PROGRAM simplesort - simple in-memory line sorter USAGE sort FUNCTION simplesort reads lines from the standard input, sorts them into ascending lexicographic order, then writes them back to its standard output.
The original read the incoming text into a buffer, arranging an array containing offsets of the beginning of each line in the text. This array was then sorted and each line printed in order. K&P used a quicksort to sort the array. (By the way, I have a peculiar question about the original: The rquick procedure has:
if (i - lo < hi - i) then begin rquick(lo, i-1); rquick(i+1, hi) end else begin rquick(i+1, hi); rquick(lo, i-1) end
as the recursive phase. What the heck is with the reversible order of the recursive operations? It doesn't seem to be described in the text.)
simplesort in Haskell is much, much simpler:
sortLines :: String -> String sortLines = unlines . sort . lines main = interact sortLines
The Haskell Prelude includes, in addition to the lines (String -> [String], breaking the incoming text into lines) and unlines ([String] -> String, reassembling the lines into a single block of text) functions, a List sorting function, sort. GHC currently uses a merge sort, which is tolerably fast even in this context.
simplesort, however, has one major problem: it requires the input to fit into memory. Given the overhead of Haskell's the garbage collector, that restriction is almost as problematic on modern hardware as it would be in Pascal on K&P's hardware.
Sorting big files
PROGRAM sort - sort text lines USAGE sort FUNCTION sort sorts its input into ascending lexicographic order. Two lines are in order if they are identical or if the leftmost character position in which they differ contains characters which are in order, using the internal numeric representation of characters. If a line is a proper prefix of another line, it precedes that line in sort order. sort writes intermediate data to files named stemp#, where # is a small decimal digit string; these filenames should be avoided. EXAMPLE To print the sorted output of a program: program | sort | print
The Haskell approach to I/O makes an external sort significantly different from the previous basic internal sort. The main function is built from three IO actions:
main :: IO () main = do n <- doRuns n' <- mergeRuns n showRun n'
The first step reads the input, breaking it into bite-sized chunks. These runs are sorted and written to temporary files. The first step returns an integer, where temporary files numbered from 1 to n-1 were created. The second step merges the n files, producing a file identified by n'. This step deletes the temporary files after they have been merged. The final step prints the contents of the final temporary file to standard out and deletes it.
doRuns uses another standard library function, getContents. This lazily reads stdin, producing a String:
doRuns :: IO Int doRuns = do text <- getContents doRuns' 0 text
The contents of stdin are broken into approximately 64k chunks, continuing until the completion of the final line. Each run is sorted, using the same function as simplesort, and then written to a temporary file.
runSize = 2^16 doRuns' :: Int -> String -> IO Int doRuns' i  = return i doRuns' i text = do putTemp i $ sortLines this doRuns' (i+1) rest where (this,rest) = splitAtNextLine runSize text splitAtNextLine sz text = (pre ++ eol, rest') where (pre,suf) = splitAt sz text (eol,rest) = span (/= '\n') suf rest' = if (null rest) then rest else tail rest
Temporary file handling is relatively simple, with the file named "stemp" followed by the identifying number.
tmpname i = "stemp" ++ (show i) putTemp :: Int -> String -> IO () putTemp i text = do hnd <- mustopen (tmpname i) IO.WriteMode hPutStr hnd text hClose hnd
In order to fix the memory used to merge the temporary files, each temporary file is read one line at a time, and a limited number of files are merged at once. Each incoming line (and the remainder of the file) is held in a priority queue until it is written to a new temporary.
Another fancy data structure
This is a leftist heap, courtesy of Chris Okasaki, Purely Functional Data Structures. The heap is used to provide the priority queue to manage the temporary files during the merge phase, by allowing easy access to the next line to merge.
data Heap a = Ord a => Empty | Node (a,Int,Heap a,Heap a) rankH :: Heap a -> Int rankH Empty = 0 rankH (Node (_,r,_,_)) = r makeNodeH :: Ord a => a -> Heap a -> Heap a -> Heap a makeNodeH x a b = let ra = rankH a rb = rankH b in if ra >= rb then Node (x, rb + 1, a, b) else Node (x, ra + 1, b, a) emptyH :: Ord a => Heap a emptyH = Empty isEmptyH :: Ord a => Heap a -> Bool isEmptyH Empty = True isEmptyH _ = False mergeH :: Ord a => Heap a -> Heap a -> Heap a mergeH h Empty = h mergeH Empty h = h mergeH h1@(Node (x,_,a1,b1)) h2@(Node (y,_,a2,b2)) = case compare x y of GT -> makeNodeH y a2 (mergeH h1 b2) _ -> makeNodeH x a1 (mergeH b1 h2) insertH :: Ord a => a -> Heap a -> Heap a insertH x h = mergeH (Node (x,1,Empty,Empty)) h findMinH Empty = error "Heap: findMinH Empty" findMinH (Node (x,_,_,_)) = x deleteMinH Empty = error "Heap: deleteMinH Empty" deleteMinH (Node (_,_,a,b)) = mergeH a b
Since the contents of each temporary file is read and converted to lines lazily, the first remaining line is separated and kept exposed, while the rest of the list is unevaluated until it is needed. These two elements make up a SF structure (String-File, maybe?), which is made an instance of Ord in order to be stored in the heap.
Insertions into the heap require some slight special support: an empty list of lines is ignored, returning the heap unchanged, while a non-empty list has the first line read to build the SF.
data SF = SF (String, [String]) instance Eq SF where (SF l) == (SF r) = (fst l) == (fst r) instance Ord SF where compare (SF l) (SF r) = compare (fst l) (fst r) insertSFHeap :: [String] -> Heap SF -> Heap SF insertSFHeap  heap = heap insertSFHeap (h:t) heap = insertH (SF (h,t)) heap
mergeRuns goes over the 0..(i-1) temporary files, in groups of 8, writing the merged output to file i. When the 8 files are merged, they are removed, the new temporary is appended to the list of temporaries to be processed, and mergeRuns' recurses. This process completes when only one temporary is left.
The function merge itself inserts the texts from the temporary files into the priority queue and then pulls each line out in turn.
mergeorder = 8 mergeRuns :: Int -> IO Int mergeRuns i = mergeRuns' [0..(i-1)] i mergeRuns' :: [Int] -> Int -> IO Int mergeRuns'  _ = error "no runs to merge" mergeRuns' [i] _ = return i mergeRuns' files i = do fds <- mapM (\i -> mustopen IO.ReadMode $ tmpname i) files' texts <- mapM hGetContents fds putTemp i $ merge texts mapM_ hClose fds mapM_ (removeFile . tmpname) files' mergeRuns' (files'' ++ [i]) (i+1) where (files',files'') = splitAt mergeorder files merge :: [String] -> String merge texts = unlines $ merge' texts'' where texts' = map lines texts texts'' = foldr insertSFHeap emptyH texts' merge' :: Heap SF -> [String] merge' h | isEmptyH h =  | otherwise = min : (merge' rest) where SF (min,minS) = findMinH h rest = insertSFHeap minS $ deleteMinH h
The final step is to copy the contents of the final temporary file to standard output and remove it:
showRun :: Int -> IO () showRun n = do fd <- mustopen IO.ReadMode fn hGetContents fd >>= putStr hClose fd removeFile fn where fn = tmpname n
Throughout sort, I have used a pair of problematic functions: getContents and hGetContents. These two functions return Strings which can be evaluated a character at a time to read the contents of the file. The problem is the relation of this lazy input to the strict file manipulations. In particular, if any input not read before the file is closed will not be available. For example, the following two functions produce different output:
f1 = do fd <- openFile "/etc/motd" IO.ReadMode f <- hGetContents fd putStrLn f hClose fd f2 = do fd <- openFile "/etc/motd" IO.ReadMode f <- hGetContents fd hClose fd putStrLn f
The first prints the contents of the file; the second prints an empty line. (Things get more complicated if reading and writing are mixed.)
sort requires lazy I/O to limit its memory usage, and avoids the difficulties with laziness. For example, when merging the IO action which consumes the text output is invoked before the IO action which closes the input file. As a result, the ordering of the IO monad ensures that the operations take place in the correct order. The key is that I have taken care to ensure that in no case does an expression involving the input text "leak" past the operation closing the input file.
Separation of function: unique
K&R use the unique program to make a point about separation of function: sorting and unique-ifying are commonly used together, and
Why should there be two separate programs when a single slightly more complicated one will do?
The answer is that on the one hand, the unique program is useful on its own, and on the other, the two programs implement different functionality, and
In its early stages, at least, a program should implement a single function.
PROGRAM unique - delete adjacent duplicate lines USAGE unique FUNCTION unique writes to its output only the first line from each group of adjacent identical input lines. It is most useful for text that has been sorted to bring identical lines together; in this case it passes through only unique instances of input lines. EXAMPLE To eliminate duplicate lines in the output of a program: program | sort | unique
The unique program itself is fairly simple, since it only works on adjacent lines.
unique :: String -> String unique text = unlines $ unique' $ lines text where unique' :: [String] -> [String] unique' (l:l':ls) | l == l' = unique' (l':ls) unique' (l:ls) = l : unique' ls unique'  =  main = interact $ unique
One use for a text sorting program, particularly for someone writing a book, is to produce an index. A particularly fancy form of index is a "keyword-in-context" or permuted index, which presents the index entries in the textural context that they appear. The kwic.hs and unrotate.hs programs produce a permuted index of every word of the input, sorted and rearranged so that the keywords line up.
PROGRAM kwic - produce lines for KWIC index USAGE kwic FUNCTION kwic writes one or more "folded" versions of each line to its output. A line is "folded" at the beginning of each nonwhitespace string within the line by writing from that string through the end of the line, followed by the fold character, #, followed by the beginning of the line. kwic is used with sort and unrotate to produce a Key Word In Context, or KWIC, index. EXAMPLE $ echo "This is a test." | runhaskell kwic.hs | runhaskell simplesort.hs This is a test.# a test.#This is is a test.#This test.#This is a
The program reads input lines, producing all possible rotations of the words on the line. The end of the original line is marked in each output line by a '#'. To do that, it uses two functions:
- doLines breaks the input into lines, maps the rotate across each line and concatenates the result into a list of new lines, then uses unlines to produce the final String.
- rotate breaks a line into words and finds all possible breaks of the line into a prefix and suffix of the words. Ignoring the possibility of the empty suffix, it joins the suffix followed by the "#" character followed by the prefix, producing all of the relevant rotations of the line.
rotate :: String -> [String] rotate line = map splice $ init $ zip (inits ws) (tails ws) where splice (pre,suf) = unwords suf ++ "#" ++ unwords pre ws = words line doLines :: String -> String doLines = unlines . concatMap rotate . lines main = interact $ doLines
Following the production of the rotated lines and sorting, an unrotate program is used to produce the final output.
PROGRAM unrotate - format lines for a KWIC index USAGE unrotate FUNCTION unrotate reads its input a line at a time and writes an "unfolded" version to its output.. A line is "folded" if it contains within it an instance of the fold character #; "unfolding" involves writing from the end of the line down to but not including the fold character, starting in column 39 of the output line, wrapping characters that would thus appear before column 1 around to the end of the line, then writing the remainder of the line starting at column 41 and wrapping around at column 80 if necessary. unrotate is used with kwic and sort to produce a Key Word In Context, or KWIC, index. EXAMPLE $ echo "This is a test." | runhaskell kwic.hs | runhaskell simplesort.hs \ | runhaskell unrotate.hs This is a test. This is a test. This is a test. This is a test. BUGS This does not currently handle line wrapping correctly.
unrotate uses one function, which re-formats each incoming line:
middle = 40 unrotate line = replicate (middle - lenp) ' ' ++ pre ++ " " ++ suf where (suf,('#':pre)) = span (/= '#') line lenp = length pre main = interact $ unlines . map unrotate . lines
This program will fail if an incoming line does not contain a '#' character, since the pattern matching used with span in unrotate would fail.