Basic character I/O
The basic requirement for most of these programs is to read input, process it, and write output. Most of the programs are designed to act as filters, to read from standard input and write to standard output. Haskell provides several approaches to reading standard input. First, I will discuss the low-level (but slightly more general) interface reading a single character at a time. Then, I present (following excellent suggestions on the haskell-cafe mailing list) a higher-level, more Haskell-ish interface.
getc reads a character from stdin. The original used a flag value, ENDFILE or -1, to indicate the end of input, while I find this both bad practice and unnecessary. Instead, I am using Haskell's Maybe type, with Nothing indicating no further input: the end of the file, and a Just value indicating a newly read character.
This is the first monadic code to be seen here. getc builds an IO action out of three parts:
- It checks for the end of the input.
- If the end has been reached, it returns Nothing, elevating it via "return".
- If the end has not been reached, it reads the next character and returns Just that character.
The return value of getc is not a character or any other simple value; it is not even a Maybe Char. Instead, it is an IO action, and a complex one that involves a decision. This is hidden by the do notation, including the "<-" pseudo-assignment operator. As a result, I wrote this code as if I were writing in a simple, albeit strange, imperative language.
Note: in most cases, at least until later chapters, the type annotations were added after the functions were written. The annotations make good (sort of) documentation, but one of the problems I've had getting a grip on Haskell has been the IO monad and types, so I am writing this without specifying (or thinking too much about) the types.
Haskell does make a pretty good imperative language, although I find its lack of control structures disturbing.
(Yes, I know I can write my own damn control structures, more or less. It's just not the same. Besides, that would ruin the joke.)
getc :: IO (Maybe Char) getc = do eof <- isEOF if eof then return Nothing else do c <- getChar return (Just c)
putc writes a character to stdout. Just another name for putChar, really.
putc :: Char -> IO () putc = putChar
Copying stdin to stdout
Copy is the basic filter: copy the standard input to the standard output.
This uses a recursive function call rather than the loop of the original. Right offhand, I do not know whether Haskell's standard library provides anything like a "while" loop, but using functional programming's native control flow is completely workable.
Again, this function is based on the do notation for monadic IO. However, and here's the weird part, the IO "action" returned by this function involves a loop and is potentially arbitrarily large or long lasting. "Actions" are functions.
For a very good view of the interaction between functions, actions-as-data-structures, and actions-as-code-that-does-stuff, see "Tackling the Awkward Squad: monadic input/output, concurrency, exceptions, and foreign-language calls in Haskell," by Simon Peyton Jones.
copy :: IO () copy = do ch <- getc case ch of Nothing -> return () Just c -> do putc c copy
The copy program
The copy program satisfies the following documentation:
PROGRAM copyprog copy input to output USAGE copyprog FUNCTION copyprog copies its input to its output unchanged. It is useful for copying from a terminal to a file, from file to file, or even from terminal to terminal. It may be used for displaying the contents of a file, without interpretation or formatting, by copying from a file to terminal. EXAMPLE To echo lines typed at your terminal: $ runhaskell copyprog.hs hello there, are you listening hello there, are you listening yes, I am. yes, I am. ^Z
Note that this program simply calls the SoftwareTools.copy function.
import SoftwareTools main = copy
Counting characters in stdin
The charcount program count characters from the standard input.
PROGRAM charcount count characters in input USAGE charcount FUNCTION charcount counts the characters in its input and writes the total as a single line of text to the output. Since each line of text is internally delimited by a NEWLINE character, the total count is the number of lines plus the number of characters within each line. EXAMPLE $ runhaskell charcount.hs A single line of input. ^Z 24 (I apologize for the Windows-ism.)
Once again with the recursive functions. (I'm not sure why I feel the need to point that out; probably because the original uses loops and I feel slightly bad for not doing a direct translation rather than a more idiomatic one.) In this case, a helper does the actual looping. This helper returns the character count. Note that Haskell's stdio support (correctly) converts Windows' CRNL pairs into NL, so the character count is off by the number of lines. (I'm currently developing this on Windows. Bleah.)
The $! operator in the recursive call is need to force the addition, to make it strict. In this case that prevents a stack overflow on large inputs. See wc.hs below for further details. Here, it is syntactically similar to the $ parentheses-elimination operator.
charcount :: IO () charcount = do nc <- charcount' 0 putStrLn (show nc) where charcount' nc = do ch <- getc case ch of Nothing -> return nc Just c -> charcount' $! nc + 1
linecount counts lines from standard input.
PROGRAM linecount count lines in input USAGE linecount FUNCTION linecount counts the lines in its input and writes the total as a line of text to the output. EXAMPLE $ runhaskell linecount.hs A single line of input. ^Z 1
This case is slightly different from the preceding, in that it deals with a specific character, the newline. The original introduces a NEWLINE constant, and the equivalent would be a new data type which is either a Newline or some RandomCharacter c. Since I'm an old-school Unix/C guy, I'm comfortable with fuddling around with character constants and am just using 'n' (and this is probably decent Haskelly style, since excessive data typery is excessive anywhere). On the other hand, the case expression has sprouted an extra guard, to handle difference between the newline and J. Random Characters. Pattern matching is fun. The "_" in the second Just case is a pseudovariable that indicates, "don't care".
linecount :: IO () linecount = do lc <- linecount' 0 putStrLn (show lc) where linecount' lc = do ch <- getc case ch of Nothing -> return lc Just '\n' -> linecount' $! lc + 1 Just _ -> linecount' lc
wordcount counts words from standard input.
PROGRAM wordcount count words in input USAGE wordcount FUNCTION wordcount counts the words in its input and writes the total as a line of text to the output. A "word" is a maximal sequence of characters not containing a blank or tab or newline. EXAMPLE $ runhaskell wordcount.hs A single line of input ^Z 5 BUGS The definition of "word" is simplistic.
In the wordcount function, the code was getting wider than I wanted, so I broke out another helper function, handlechar, that is mutually recursive with wordcount'. The extra parameter of wordcount' is a flag, indicating whether the current position in the input is inside of a word or not. The word/nonword decision, and the count, is made when the first nonwhitespace character is read following whitespace.
The handlechar function is defined piecewise. The first case is when the character is whitespace, the second is when it is not whitespace but inword is false, indicating a non-word to word transition. The final case handles a nonwhitespace character while the position is already in a word. This code counts the non-word to word transitions in the input by incrementing the count in the second case.
wordcount :: IO () wordcount = do wc <- wordcount' False 0 putStrLn (show wc) where wordcount' inword wc = do ch <- getc case ch of Nothing -> return wc Just c -> handlechar c inword wc handlechar c _ wc | isSpace c = wordcount' False wc handlechar _ False wc = wordcount' True $! wc + 1 handlechar _ True wc = wordcount' True wc isSpace c = (c == ' ' || c == '\n' || c == '\t')
There are alternatives to character-by-character I/O in Haskell. One of the nicest is based on the function interact from the standard prelude. interact has the type (String -> String) -> IO (); in other words, it takes a function from a String to a String and produces an IO action. Internally, interact pulls characters from the result of the function and prints them to the standard output, while at the same time reading the standard input lazily, one character at a time. Here is an example, a replacement for copy:
copy' = interact id
id is the identity function, which returns its input unchanged; interact uses this to bind the input to the output.
To get fancier, a replacement for charcount would be:
charcount' = interact $ show1Ln . length where show1Ln v = (show v) ++ "\n"
The function length returns the number of elements in a list, and a String is a list of characters. show1Ln takes that length (as a number), converts it to a string and adds a newline; the result of combining the two is a function that takes a list and produces a newline-terminated string containing the number of elements in the list. interact uses that function to count the characters in the input and write that count to the output.
This function uses the "pointfree" style; it is identical to the function:
charcount'' = interact (\input -> show1Ln (length input))
To go further, an IO action that counts the number of words in the input (a replacement for wordcount) and another that counts the number of lines would be:
wordcount' = interact $ show1Ln . length . words linecount' = interact $ show1Ln . length . lines
(Using the same show1Ln helper function and the Haskell Prelude's words (which breaks a String into a list of Strings containing "words") and lines (which breaks a String into a list of Strings containing lines) functions.)
interact has a very nice property: it completely hides the monadic, imperative IO from the purely functional processing of the function. In all of the examples above, the logic of the program is dealing solely with a (admittedly, lazily read) String, not any IO actions.
interact is appropriately named; it can be used for interactive tasks, as seen in the following, peculiar, program:
main = interact f where f l = "I'm starting\n" ++ (g l) g ('I':chs) = "Saw an I" ++ (g chs) g (ch:chs) = ch : (g chs) g  = "I'm done\n"
This program prints "I'm starting", then copies its input to its output, noting when it sees the character 'I' and completing with the message, "I'm done". Running the program interactively results in the initial line, one line of user input followed by an output copy of the line (with I's noted), the second line of user input, and so on.
Functions similar to interact can be defined to deal with named files (which will need to be managed later), allowing this technique to deal with almost all of the programs from Software Tools. While the code in this translation was originally written using an imperative, getc-ish approach, I am replacing that code with interact-ish definitions because they have a number of advantages:
- interact's use of a string to string function typically results in clearer and shorter code,
- the code is much better Haskell style, and
- the results are typically significantly faster than the imperative versions. (The reverse of an "abstraction penalty".)
On the other hand, interact (and similar techniques) appear not to cover all of the needs of IO (possibly including exceptions and multi-file manipulations, for an example of which see the sort program later).
[This section was added following many suggestions from the haskell-cafe mailing list, including Gwern Branwen, apfelmus, and many others.]
wc - Haskell wc
wc here is not part of Software Tools, but is a natural extension of the individual programs above. I could not resist an implementation.
PROGRAM wc Count lines, words, and characters USAGE wc FUNCTION Counts the lines, words, and characters in its input and writes the total of each as a line of text to the output. Lines are distinguished by the newline character. Words are a maximal sequence of characters not containing a blank, tab, or newline. EXAMPLE $ runhaskell wc.hs A single line input. ^Z 1 4 21 BUGS The definition of "word" is simplistic.
One complication is that a naive version produces the following error:
$ ./wc < /usr/share/emacs/21.2/etc/DOC-X-21.2.1 Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize' to increase it.
The problem is that, even though the function is written to be tail recursive, the additions performing the counting in handlechar are evaluated lazily, meaning that the thunks computing (c+1), for example, pile up until they are finally forced by the (show c):
handlechar ch w l c -> wc' w l (c+1) False -> handlechar ch' w l (c+1) -> wc' w l ((c+1)+1) False -> handlechar ch'' w l ((c+1)+1) -> wc' w l (((c+1)+1)+1) False -> handlechar ch''' w l (((c+1)+1)+1) -> wc' w l ((((c+1)+1)+1)+1) False ... -> putStrLn $ (show l) ++ ... ++ (show ((...((c+1)+1)...)+1))
In general, it is possible that laziness would be a performance win, if the (+) operation were expensive, the number of them were small enough not to overflow the stack, and the result were ultimately not evaluated by the overall context. In this case, addition is cheap, the number of additions can grow large, and show will always have to evaluate the expression to determine the number.
There are two functions which force expressions to be evaluated strictly: $! and seq. The first acts like the $ operator. The second takes two arguments, forces evaluation of the first, and returns the second. Given that c, for example, is forced by seq, the further uses in the recursive call of handlechar will be of the integer value.
The code below, hints for which I found on the Haskell wiki and elsewhere on the web, forces the parameters to wc' to be evaluated strictly, preventing the thunks from accumulating. See the first line of wc'.
This first version uses getc and the imperative style.
wc = do (w,l,c) <- wc' 0 0 0 False putStrLn $ (show l) ++ " " ++ (show w) ++ " " ++ (show c) where wc' w l c inword = w `seq` l `seq` c `seq` do ch <- getc case ch of Nothing -> return (w,l,c) Just char -> handlechar char w l c inword handlechar ' ' w l c _ = wc' w l (c+1) False handlechar '\t' w l c _ = wc' w l (c+1) False handlechar '\n' w l c _ = wc' w (l+1) (c+1) False handlechar _ w l c False = wc' (w+1) l (c+1) True handlechar _ w l c True = wc' w l (c+1) True
There are a couple of ways of approaching wc using interact. A simple one is:
wc' = interact $ show3Ln . (\text -> (length $ lines text, length $ words text, length text)) where show3Ln (x,y,z) = (show x) ++ " " ++ (show y) ++ " " ++ (show z) ++ "\n"
This is indeed simple, but it has the weakness that the entire input must be read and stored: the input variable allows the program to use the incoming string three different times.
On the other hand, this definition effectively exploints laziness; it does not need extra strictness analysis to avoid failing on medium to large inputs.
To avoid forcing the storage of the entire input, the program can be "deforested"; the three separate uses combined:
wc'' = interact $ show3Ln . (cnt (0,0,0) False) where cnt a _  = a cnt a _ (ch:cs) | isSpc ch = force a `seq` cnt (l,w,c+1) False cs where (l,w,c) = a cnt a _ (ch:cs) | isNl ch = force a `seq` cnt (l+1,w,c+1) False cs where (l,w,c) = a cnt a False (_:cs) = force a `seq` cnt (l,w+1,c+1) True cs where (l,w,c) = a cnt a True (_:cs) = force a `seq` cnt (l,w,c+1) True cs where (l,w,c) = a force (l,w,c) = l `seq` w `seq` c isSpc ch = ch == ' ' || ch == '\t' isLn ch = ch == '\n'
This version uses the same logic as the original wc to process each character from the input string one at a time, and uses an accumulator (the triple) to allow it to be tail-recursive. (As a reminder, the Boolean flag is used to indicate whether or not the processing is currently in a word. Further, note that this version is not correctly strict and will fail on large inputs.)
PROGRAM detab convert tabs to blanks USAGE detab FUNCTION detab copies its input to its output, expanding horizontal tabs to blanks along the way, so that the output is visually the same as the input, but contains no tab characters. Tab stops are assumed to be set every four columns (i.e., 1, 5, 9,...), so that each tab character is replaced by from one to four blanks. EXAMPLE Using -> as a visible tab: $ runhaskell detab.hs ->col 1->2->34->rest col 1 2 34 rest ^Z BUGS detab is naive about backspaces, vertical motions, and non-printing characters.
The default tabstops are set every four characters; tabstops is provided as an explicit array based on that to allow extensions (for example, by providing weird, arbitrary tabstops):
tabspace = 4 tabstops :: [Bool] tabstops = map (\col -> col `mod` tabspace == 1) [ 1 .. ]
tabspaces is a function that converts the list of boolean tabstops to the number of spaces that a tab in a given column would need to produce:
tabspaces :: [Bool] -> [Int] tabspaces ts = spacesToTabstop tablessPrefix ++ tabspaces rest' where (tablessPrefix, (_:rest)) = span (\stop -> not stop) ts rest' = False:rest spacesToTabstop prefix = reverse [1 .. (length prefix)]
tabspaces walks down the tapstops list, breaking off spans that do not contain a tabstop and using that span to compute the number of spaces to the next tabstop. (A tab occurring in a column with a tabstop advances to the next tabstop, a behavior accomodated by replacing the True of the tabstop in the suffix provided by span with a False, as seen in rest'.)
detab is implemented using interact and an algorithm inspired by Tillmann Rendel and apfelmus:
detab' :: [Bool] -> String -> String detab' tabstops text = detab'' ts text where detab'' _  =  detab'' _ ('\n':text') = '\n' : detab'' ts text' detab'' (s:r) ('\t':text') = (toSpc s) ++ detab'' (adv s r) text' detab'' (_:r) (char:text') = char : detab'' r text' ts = tabspaces tabstops -- convert tabstops to column spaces toSpc n = replicate n ' ' -- produce n spaces adv n l = drop (n - 1) l -- skip n-1 elements of l detab :: IO () detab = interact $ detab' tabstops
detab' uses detab'' to work its way down the text string one character at a time, ensuring that it also knows how many spaces would need to be inserted when it sees a tab character. The key part of the latter is the third line of the detab'' definition: after inserting spaces to the next tabstop, it advances a similar distance into the tabspaces for the line. As a result the function ensures that the tabstops occur at the fixed intervals.
A piece of wisdom from K&P:
Most real programs are subjected to a steady flow of changes and improvements over their lifetimes, and many programmers spend most of their time maintaining and modifying existing programs. This is an expensive process, so one of the most important design considerations for a program is that it be easy to change.