CHAPTER FOUR
Sorting text
Text sorting is a classic problem. It turns out to be incredibly useful, as a command line tool, and a poor implementation could be unusable, even if it performed to its specification.
K&R broke this chapter into two implementations: a simpler, in-memory sort program, and a out-of-core sort that uses temporary files to reduce its memory footprint and improve its performance (successfully, even in Haskell).
Following along, here is the initial attempt:
PROGRAM simplesort - simple in-memory line sorter USAGE sort FUNCTION simplesort reads lines from the standard input, sorts them into ascending lexicographic order, then writes them back to its standard output.
The original read the incoming text into a buffer, arranging an array containing offsets of the beginning of each line in the text. This array was then sorted and each line printed in order. K&P used a quicksort to sort the array. (By the way, I have a peculiar question about the original: The rquick procedure has:
if (i - lo < hi - i) then begin
rquick(lo, i-1);
rquick(i+1, hi)
end
else begin
rquick(i+1, hi);
rquick(lo, i-1)
end
as the recursive phase. What the heck is with the reversible order of the recursive operations? It doesn't seem to be described in the text.)
simplesort in Haskell is much, much simpler:
sortLines :: String -> String sortLines = unlines . sort . lines main = interact sortLines
The Haskell Prelude includes, in addition to the lines (String -> [String], breaking the incoming text into lines) and unlines ([String] -> String, reassembling the lines into a single block of text) functions, a List sorting function, sort. GHC currently uses a merge sort, which is tolerably fast even in this context.
simplesort, however, has one major problem: it requires the input to fit into memory. Given the overhead of Haskell's the garbage collector, that restriction is almost as problematic on modern hardware as it would be in Pascal on K&P's hardware.
Sorting big files
PROGRAM
sort - sort text lines
USAGE
sort
FUNCTION
sort sorts its input into ascending lexicographic order. Two lines
are in order if they are identical or if the leftmost character
position in which they differ contains characters which are in
order, using the internal numeric representation of characters. If
a line is a proper prefix of another line, it precedes that line in
sort order.
sort writes intermediate data to files named stemp#, where # is a
small decimal digit string; these filenames should be avoided.
EXAMPLE
To print the sorted output of a program:
program | sort | print
The Haskell approach to I/O makes an external sort significantly different from the previous basic internal sort. The main function is built from three IO actions:
main :: IO ()
main = do
n <- doRuns
n' <- mergeRuns n
showRun n'
The first step reads the input, breaking it into bite-sized chunks. These runs are sorted and written to temporary files. The first step returns an integer, where temporary files numbered from 1 to n-1 were created. The second step merges the n files, producing a file identified by n'. This step deletes the temporary files after they have been merged. The final step prints the contents of the final temporary file to standard out and deletes it.
Chunking input
doRuns uses another standard library function, getContents. This lazily reads stdin, producing a String:
doRuns :: IO Int
doRuns = do
text <- getContents
doRuns' 0 text
The contents of stdin are broken into approximately 64k chunks, continuing until the completion of the final line. Each run is sorted, using the same function as simplesort, and then written to a temporary file.
runSize = 2^16
doRuns' :: Int -> String -> IO Int
doRuns' i [] = return i
doRuns' i text = do
putTemp i $ sortLines this
doRuns' (i+1) rest
where
(this,rest) = splitAtNextLine runSize text
splitAtNextLine sz text = (pre ++ eol, rest')
where
(pre,suf) = splitAt sz text
(eol,rest) = span (/= '\n') suf
rest' = if (null rest) then rest else tail rest
Temporary files
Temporary file handling is relatively simple, with the file named "stemp" followed by the identifying number.
tmpname i = "stemp" ++ (show i)
putTemp :: Int -> String -> IO ()
putTemp i text = do
hnd <- mustopen (tmpname i) IO.WriteMode
hPutStr hnd text
hClose hnd
Merging
In order to fix the memory used to merge the temporary files, each temporary file is read one line at a time, and a limited number of files are merged at once. Each incoming line (and the remainder of the file) is held in a priority queue until it is written to a new temporary.
Another fancy data structure
This is a leftist heap, courtesy of Chris Okasaki, Purely Functional Data Structures. The heap is used to provide the priority queue to manage the temporary files during the merge phase, by allowing easy access to the next line to merge.
data Heap a = Ord a => Empty | Node (a,Int,Heap a,Heap a)
rankH :: Heap a -> Int
rankH Empty = 0
rankH (Node (_,r,_,_)) = r
makeNodeH :: Ord a => a -> Heap a -> Heap a -> Heap a
makeNodeH x a b = let ra = rankH a
rb = rankH b
in if ra >= rb
then Node (x, rb + 1, a, b)
else Node (x, ra + 1, b, a)
emptyH :: Ord a => Heap a
emptyH = Empty
isEmptyH :: Ord a => Heap a -> Bool
isEmptyH Empty = True
isEmptyH _ = False
mergeH :: Ord a => Heap a -> Heap a -> Heap a
mergeH h Empty = h
mergeH Empty h = h
mergeH h1@(Node (x,_,a1,b1)) h2@(Node (y,_,a2,b2))
= case compare x y of GT -> makeNodeH y a2 (mergeH h1 b2)
_ -> makeNodeH x a1 (mergeH b1 h2)
insertH :: Ord a => a -> Heap a -> Heap a
insertH x h = mergeH (Node (x,1,Empty,Empty)) h
findMinH Empty = error "Heap: findMinH Empty"
findMinH (Node (x,_,_,_)) = x
deleteMinH Empty = error "Heap: deleteMinH Empty"
deleteMinH (Node (_,_,a,b)) = mergeH a b
Since the contents of each temporary file is read and converted to lines lazily, the first remaining line is separated and kept exposed, while the rest of the list is unevaluated until it is needed. These two elements make up a SF structure (String-File, maybe?), which is made an instance of Ord in order to be stored in the heap.
Insertions into the heap require some slight special support: an empty list of lines is ignored, returning the heap unchanged, while a non-empty list has the first line read to build the SF.
data SF = SF (String, [String])
instance Eq SF where
(SF l) == (SF r) = (fst l) == (fst r)
instance Ord SF where
compare (SF l) (SF r) = compare (fst l) (fst r)
insertSFHeap :: [String] -> Heap SF -> Heap SF
insertSFHeap [] heap = heap
insertSFHeap (h:t) heap = insertH (SF (h,t)) heap
mergeRuns goes over the 0..(i-1) temporary files, in groups of 8, writing the merged output to file i. When the 8 files are merged, they are removed, the new temporary is appended to the list of temporaries to be processed, and mergeRuns' recurses. This process completes when only one temporary is left.
The function merge itself inserts the texts from the temporary files into the priority queue and then pulls each line out in turn.
mergeorder = 8
mergeRuns :: Int -> IO Int
mergeRuns i = mergeRuns' [0..(i-1)] i
mergeRuns' :: [Int] -> Int -> IO Int
mergeRuns' [] _ = error "no runs to merge"
mergeRuns' [i] _ = return i
mergeRuns' files i = do
fds <- mapM (\i -> mustopen IO.ReadMode $ tmpname i) files'
texts <- mapM hGetContents fds
putTemp i $ merge texts
mapM_ hClose fds
mapM_ (removeFile . tmpname) files'
mergeRuns' (files'' ++ [i]) (i+1)
where
(files',files'') = splitAt mergeorder files
merge :: [String] -> String
merge texts = unlines $ merge' texts''
where
texts' = map lines texts
texts'' = foldr insertSFHeap emptyH texts'
merge' :: Heap SF -> [String]
merge' h | isEmptyH h = []
| otherwise = min : (merge' rest)
where
SF (min,minS) = findMinH h
rest = insertSFHeap minS $ deleteMinH h
Output
The final step is to copy the contents of the final temporary file to standard output and remove it:
showRun :: Int -> IO ()
showRun n = do
fd <- mustopen IO.ReadMode fn
hGetContents fd >>= putStr
hClose fd
removeFile fn
where
fn = tmpname n
Laziness
Throughout sort, I have used a pair of problematic functions: getContents and hGetContents. These two functions return Strings which can be evaluated a character at a time to read the contents of the file. The problem is the relation of this lazy input to the strict file manipulations. In particular, if any input not read before the file is closed will not be available. For example, the following two functions produce different output:
f1 = do fd <- openFile "/etc/motd" IO.ReadMode f <- hGetContents fd putStrLn f hClose fd f2 = do fd <- openFile "/etc/motd" IO.ReadMode f <- hGetContents fd hClose fd putStrLn f
The first prints the contents of the file; the second prints an empty line. (Things get more complicated if reading and writing are mixed.)
sort requires lazy I/O to limit its memory usage, and avoids the difficulties with laziness. For example, when merging the IO action which consumes the text output is invoked before the IO action which closes the input file. As a result, the ordering of the IO monad ensures that the operations take place in the correct order. The key is that I have taken care to ensure that in no case does an expression involving the input text "leak" past the operation closing the input file.
Separation of function: unique
K&R use the unique program to make a point about separation of function: sorting and unique-ifying are commonly used together, and
Why should there be two separate programs when a single slightly more complicated one will do?
The answer is that on the one hand, the unique program is useful on its own, and on the other, the two programs implement different functionality, and
In its early stages, at least, a program should implement a single function.
PROGRAM
unique - delete adjacent duplicate lines
USAGE
unique
FUNCTION
unique writes to its output only the first line from each group of
adjacent identical input lines. It is most useful for text that has
been sorted to bring identical lines together; in this case it
passes through only unique instances of input lines.
EXAMPLE
To eliminate duplicate lines in the output of a program:
program | sort | unique
The unique program itself is fairly simple, since it only works on adjacent lines.
unique :: String -> String
unique text = unlines $ unique' $ lines text
where
unique' :: [String] -> [String]
unique' (l:l':ls) | l == l' = unique' (l':ls)
unique' (l:ls) = l : unique' ls
unique' [] = []
main = interact $ unique
Permuted index
One use for a text sorting program, particularly for someone writing a book, is to produce an index. A particularly fancy form of index is a "keyword-in-context" or permuted index, which presents the index entries in the textural context that they appear. The kwic.hs and unrotate.hs programs produce a permuted index of every word of the input, sorted and rearranged so that the keywords line up.
KWIC
PROGRAM
kwic - produce lines for KWIC index
USAGE
kwic
FUNCTION
kwic writes one or more "folded" versions of each line to its
output. A line is "folded" at the beginning of each nonwhitespace
string within the line by writing from that string through the end
of the line, followed by the fold character, #, followed by the
beginning of the line.
kwic is used with sort and unrotate to produce a Key Word In
Context, or KWIC, index.
EXAMPLE
$ echo "This is a test." | runhaskell kwic.hs | runhaskell simplesort.hs
This is a test.#
a test.#This is
is a test.#This
test.#This is a
The program reads input lines, producing all possible rotations of the words on the line. The end of the original line is marked in each output line by a '#'. To do that, it uses two functions:
- doLines breaks the input into lines, maps the rotate across each line and concatenates the result into a list of new lines, then uses unlines to produce the final String.
- rotate breaks a line into words and finds all possible breaks of the line into a prefix and suffix of the words. Ignoring the possibility of the empty suffix, it joins the suffix followed by the "#" character followed by the prefix, producing all of the relevant rotations of the line.
rotate :: String -> [String]
rotate line = map splice $ init $ zip (inits ws) (tails ws)
where
splice (pre,suf) = unwords suf ++ "#" ++ unwords pre
ws = words line
doLines :: String -> String
doLines = unlines . concatMap rotate . lines
main = interact $ doLines
unrotate
Following the production of the rotated lines and sorting, an unrotate program is used to produce the final output.
PROGRAM
unrotate - format lines for a KWIC index
USAGE
unrotate
FUNCTION
unrotate reads its input a line at a time and writes an "unfolded"
version to its output.. A line is "folded" if it contains within it
an instance of the fold character #; "unfolding" involves writing
from the end of the line down to but not including the fold
character, starting in column 39 of the output line, wrapping
characters that would thus appear before column 1 around to the end
of the line, then writing the remainder of the line starting at
column 41 and wrapping around at column 80 if necessary.
unrotate is used with kwic and sort to produce a Key Word In
Context, or KWIC, index.
EXAMPLE
$ echo "This is a test." | runhaskell kwic.hs | runhaskell simplesort.hs \
| runhaskell unrotate.hs
This is a test.
This is a test.
This is a test.
This is a test.
BUGS
This does not currently handle line wrapping correctly.
unrotate uses one function, which re-formats each incoming line:
middle = 40
unrotate line = replicate (middle - lenp) ' ' ++ pre ++ " " ++ suf
where
(suf,('#':pre)) = span (/= '#') line
lenp = length pre
main = interact $ unlines . map unrotate . lines
This program will fail if an incoming line does not contain a '#' character, since the pattern matching used with span in unrotate would fail.
