CHAPTER TWO
Basic processing
Putting tabs back
entab is very nearly the inverse of detab:
PROGRAM
entab -- convert runs of blanks to tabs
USAGE
entab
FUNCTION
entab copies its input to its output, replacing strings of blanks by
tabs so that the output is visually the same as the input, but
contains fewer characters. Tab stops are assumed to be set every
four columns (i.e., 1, 5, 9,...), so that each sequence of one to
four blanks ending on a tab stop is replaced by one tab character.
EXAMPLE
Using > as a visible tab:
$ runhaskell entab.hs
col 1 2 34 rest
>col>1>2>34>rest
^Z
$ runhaskell detab.hs | runhaskell entab.hs
>col 1>2>34>rest
^Z
>col>1>2>34>rest
BUGS
entab is naive about backspaces, vertical motions, and non-printing
characters. entab will convert a single blank to a tab if it occurs
at a tab stop. Thus entab is not an exact inverse of detab.
The code for entab uses (once again) interact and a String -> String function. The function, entab', is further simplified by using perLine to map across the lines of the input:
perLine f = unlines . map f . lines
In this case, entab' breaks each input line into tabstop-separation blocks, and replaces any trailing spaces in each block with a tab (unless it is the last block of the input).
entab' :: [Bool] -> String -> String
entab' ts = perLine $ entab'' (tabspaces ts)
where
entab'' :: [Int] -> String -> String
entab'' _ [] = []
entab'' (col:rest) txt = doBlock col pre ++ entab'' (drop (col-1) rest) suf
where
(pre, suf) = splitAt col txt
doBlock len blk | length blk < len = blk
doBlock _ blk = case spcs
of "" -> blk -- in case the final blk ends in spaces
_ -> (reverse rest) ++ "\t"
where
(spcs, rest) = break (/= ' ') $ reverse blk
entab = interact $ entab' tabstops
As a side note, this is why tabspaces produces a list of integers rather than a list of space-strings.
Overstrikes
Back in the day, people used teletypewriters to interact with computers. These ttys typed user input onto a roll of paper while passing it to the computer and then printed the computer's output on the paper. One nifty trick was to emulate bold text by doubling a character: print the character, backspace, then print the character again. The duplication would produce a larger, darker glyph. The same trick would allow multiple overlapping glyphs, to produce underlining for example. When CRT terminals entered the scene, they typically followed the conventions of ttys (called "glass ttys", they were), including allowing the nifty backspace trick.
Printers (including, apparently, those available to the Bell Labs folks), however, did not follow this convention. Instead, the input had to be built of a special control character at the beginning of each line. An initial space allowed the line to be printed normally, but an initial '+' caused the print head to return to the beginning of the line without advancing the paper. (A "carriage return" without the matching "newline".) Then, the line output would be printed over the previous line, allowing bold, underlines, and whatever other special effects that could be done with overlapping glyphs.
The overstrike program converts between the first convention and the second.
To preserve what remains of my sanity, I used the '@' character rather than a backspace. If I remember correctly, some systems did use it so, since their terminals did not have backspace keys, so maybe I am not entirely off in the weeds.
PROGRAM
overstrike - replace overstrikes by multiple lines
USAGE
overstrike
FUNCTION
overstrike copies its input to its output, replacing lines
containing backspaces by multiple lines that overstrike to print the
same as the input, but contain no backspaces. It is assumed that
the output is to be printed on a device that takes the first
character of each line as a carriage control; a blank carriage
control causes normal space before print, while a plus sign '+'
suppresses space before print and hence causes the remainder of the
line to overstrike the previous line.
EXAMPLE
$ runhaskell overstrike.hs
abc@@@___
abc
+___
^Z
BUGS
overstrike is naive about vertical motions and non-printing
characters. It produces one overstruck line for each sequence of
backspaces.
overstrike based on interact starts by breaking the incoming text into lines then breaking each line into a list of segments, where each segment is a run of backspaces (i.e. '@') or regular text. mapAccumL is used to process each segment, passing an accumulating parameter through the segments. An '@' segment is replaced by a newline plus a continuation character followed by enough spaces to skip the non-overstricken prefix. The accumulator collects the indentation needed to space over the prefix.
When the segments are combined with concat, the result is that a line has been converted to one or more lines where any new lines begin with the continuation character and represent the necessary overstrike.
overstrike = interact $ overstrike1'
where
overstrike1' txt = " " ++ body txt ++ "\n"
body = concat $ intersperse "\n " . map doLine . lines
doLine ln = concat $ snd $ mapAccumL doSeg 0 $ segment ln
doSeg i l@('@':_) = (n, "\n+" ++ replicate n ' ') where n = i - (length l)
doSeg i l = (n, l) where n = i + (length l)
segment = groupBy (\l r -> (l == '@' || r /= '@') && (l /= '@' || r == '@'))
Text compression
This program provides compression using simple run-length encoding. The escape character is ~, the run length is encoded as A-Z where A=1, B=2, etc. One or more ~'s in the input should be encoded as ~l~ where l is A, B, .... Normally, the threshold for compression is 4, so ~D will be the lowest escape and run length value.
PROGRAM
compress - compress input by encoding repeated characters
USAGE
compress
FUNCTION
compress copies its input to its output, replacing strings of
four or more identical characters by a code sequence so that
the output generally contains fewer characters than the input.
A run of x's is encoded as ~nx, where the count n is a
character: 'A' calls for a repetition of one x, 'B' a
repetition of two x's, and so on. Runs longer than 26 are
broken into several shorter ones. Runs of ~'s of any length
are encoded.
EXAMPLE
$ runhaskell compress.hs >out
Item Name Value
1 car ~$7,000.00
^Z
$ cat out
Item~D Name~I Value
1~G car~J ~A~$7,000.00
BUGS
The implementation assumes 26 legal uppercase letters beginning
with A.
This is another case where using interact and higher-level functional programming simplifies the solution greatly. My original version, using character-by-character input, was significantly longer and much less comprehensible (which also describes the original).
In this implementation, compress' uses group to break the incoming string into runs of single characters. Those runs are transformed into (length, character) pairs, which are then transformed back into strings following the rle (run-length-encoding) rules.
- A length greater than 26 (since the length is encoded by 'A'-'Z') is replaced by an encoding of the first 26 followed by an encoding of the remainder.
- Any number of tildes is encoded, to prevent it from being misunderstood on decompression.
- A length less than 4 is simply used as-is, since "~Ab" is longer than "b".
- Anything else is encoded.
compress :: IO ()
compress = interact compress'
compress' :: String -> String
compress' = concat . map rle . map toPair . group
where
toPair :: String -> (Int, Char)
toPair str = (length str, head str)
rle :: (Int,Char) -> String
rle (n,ch) | n > max = rle (max, ch) ++ rle (n - max, ch)
rle (n,ch) | n < min && ch /= '~' = replicate n ch
rle (n,ch) = ['~', toChr n, ch]
min = 4
max = 26
toChr i = chr (ordA + i - 1)
ordA = ord 'A'
Text expansion
Text expansion, from run-length encoding, is pretty straightforward.
PROGRAM
expand - expand compressed input
USAGE
expand
FUNCTION
expand copies its input, which has presumably been encoded by
compress, to its output, replacing code sequences ~nc by the
repeated characters they stand for so that the text output exactly
matches that which was originally encoded. The occurrence of the
warning character ~ in the input means that the next character is a
repetition count; 'A' calls for one instance of the following
character, 'B' calls for two, and so on up to 'Z'.
EXAMPLE
$ runhaskell expand.hs
Item~D Name~I Value
Item Name Value
1~G car~J ~A~$7,000.00
1 car ~$7,000.00
^Z
expand is very simple; it merely needs to handle three cases: the end of the text, text matching the tilde escape string, and everything else.
There seems to be an error in the original; the way I read it, if the file terminates with a ~, the function will call getc to get the EOF, fall through and print the ~, then go to the top of the loop to call getc again, which is undefined at that point. Likewise if the file ends with ~A.
expand = interact expand'
expand' [] = []
expand' ('~':n:ch:rest) = replicate count ch ++ expand' rest where count = (ord n) - (ord 'A') + 1)
expand' (ch:rest) = ch : expand' rest
Arrows
As an aside, Don Stewart has a discussion of run-length encoding using Arrows. It's neat, but....
Command arguments
echo! Whoohoo!
PROGRAM
echo - echo arguments to output
USAGE
echo [ argument ... ]
FUNCTION
echo copies its command line arguments to its output as a line of
text with one space between each argument. If there are no
arguments, no output is produced.
EXAMPLE
$ runhaskell echo.hs Hello World!
Hello World!
Gets the command line arguments from System.getArgs, interleaves a space between the elements, joins all the elements into a single string and then prints it. Extra complexity is needed to avoid printing a newline if no arguments were provided.
Note the extravagant use of mapM_ to call putc on all of the characters of the string.
echo = do
args <- getArgs
mapM_ putc $ concat $ intersperse " " args
if (length args) > 0 then putc '\n' else return ()
Character transliteration
translit is very similar to the Unix program tr.
PROGRAM
translit - transliterate characters
USAGE
translit [^]src [dest]
FUNCTION
translit maps its input, on a character by character basis, and writes
the translated version to its output. In the simplest case, each
character in the argument src is translated to the corresponding
character in the argument dest; all other characters are copied as is.
Both src and dest may contain substrings of the form "c1-c2" as a
shorthand for all of the characters in the range c1..c2. c1 and c2
must both be digits, or both be letters of the same case.
If dest is absent, all characters represented by src are deleted.
Otherwise, if dest is shorter than src, all characters in src that
would map to or beyond the last character of dest are mapped to the
last character in dest; moreover adjacent instances of such characters
in the input are represented in the output by a single instance of the
last character in dest. Thus
translit 0-9 9
converts each string of digits to the single digit 9.
Finally, if src is preceded by a ^, then *all but* the characters
represented by src are taken as the source string; i.e., they are all
deleted if dest is absent, or they are all collapsed if the last
character in dest is present.
EXAMPLE
To convert upper case to lower:
translit A-Z a-z
to discard punctuation and isolate words by spaces on each line:
translit ^a-zA-Z@n " "
Oh, and tabs can be specified as @t and newlines as @n in the command line arguments....
Minor functions
index returns Just the index of the first occurrance a character in a list (or string), or Nothing. Is strict in the counter argument to index', n. See the wordcount program, wc.hs, for a more in-depth look at strictness and laziness. (There is a library function for this already, List.elemIndex.)
index :: (Num n, Eq a) => [a] -> a -> Maybe n
index str c = index' 0 str
where
index' n [] = Nothing
index' n (ch:_) | c == ch = Just n
index' n (_:rest) = let n' = (n+1) in n' `seq` index' n' rest
addchar adds c to a set s. It is written to be polymorphic over list elements, but it is intended to be used on strings (hence the name).
addchar :: a -> [a] -> [a] addchar c s = c:s
addstrs appends the characters between l and u to s.
addstr :: Char -> Char -> String -> String addstr l u s = s `union` [ l .. u ]
isAlphaNum is true if c is alphanumeric.
isAlphaNum :: Char -> Bool isAlphaNum c = (isAlpha c) || (isDigit c)
esc converts escaped characters to their normal representation: 'n' becomes newline, etc.
esc :: Char -> Char esc 'n' = '\n' esc 't' = '\t' esc c = c
translit
dodash expands runs of characters described as 'a-z', for example, to 'abc...xyz'.
K&P claim to have written dodash in a general fashion with the arbitrary ending delimeter to use "in later dealings with sets of characters." I have left it so for the moment, although it would be better to remove the generalization. (The delimeter, an argument of chr 0, is unused in translit.)
dodash :: Char -> String -> String -> String
dodash delim src dst = dodash' src dst
where
dodash' [] dst = dst
dodash' (c:src) dst | c == delim = dst
dodash' ('@':c:src) dst = dodash delim src (addchar (esc c) dst)
dodash' (l:'-':r:src) dst
| (isAlphaNum l) && (isAlphaNum r) && l < r =
dodash delim src (addstr l r dst)
dodash' (c:src) dst = dodash delim src (addchar c dst)
makeset expands a string inset to a full character set using dodash.
makeset inset = dodash (chr 0) inset []
The first problem for translit is to parse the command line arguments, producing the negation flag and the from and to sets.
To keep things simple, parseargs works in stages. The function itself first generally parses the arguments, which can be either one string or two. parseargs' takes over from there.
The task of parseargs' is to check for the negation of the from string, the first argument. If it is specified, it sets "allbut" in the output, passing along the results of parseargs''.
parseargs'' does the (relatively) heavy lifting, making sets from the two arguments and doing some input error checking.
parseArgs is an IO action because it potentially throws an error, and it seemed cleaner to keep that out of pure code.
parseargs :: IO (Bool,String,String)
parseargs = do
args <- getArgs
case args of
(from:to:[]) -> parseargs' from to
(from:[]) -> parseargs' from ""
_ -> error "usage: translit from [ to ]"
where
-- parseargs' picks off the negation operator if necessary
parseargs' :: String -> String -> IO (Bool,String,String)
parseargs' ('^':from) to = do
(fromset,toset) <- parseargs'' from to
return (True,fromset,toset)
parseargs' from to = do
(fromset,toset) <- parseargs'' from to
return (False,fromset,toset)
-- parseargs'' provides the base parsing, after negation
parseargs'' :: String -> String -> IO (String,String)
parseargs'' from to =
let fromset = makeset from
toset = makeset to
in
if length(fromset) < length(toset)
then error "translit: \"from\" shorter than \"to\""
else return (fromset,toset)
xindex is an interface for index. Given a list (or string), a character, a flag indicating whether the result should be inverted, and an integer representing a squash length (...), produce a value which may be an index into the fromset, may be a flag indicating no such index exists (-1), or may be some special key value, that squash length + 1.
In the original, an alternative of xindex is presented that uses more complex boolean logic to compute the same result. However, that version assumes that index will handle the Maybe Char produced by getc to indicate end of file. This translation does not, to keep index more general.
This function is largely the brains of translit. It is used to identify the transliteration characters in the fromset, including handling the special cases of the fromset being negated, and fromset being longer than toset.
This nonsense is nasty, and represents one of the finest examples of a need for data abstraction. I have largely left it in the same form as the original, partially to illustrate the advances made in the last 30 years of programming, and partially because I cannot figure out how it works.
xindex :: Eq a => [a] -> a -> Bool -> Int -> Int
xindex inset c allbut lastto =
case (index inset c) of
Nothing -> if allbut then lastto + 1 else -1
Just i -> if allbut then -1 else i
This is the actual transliteration. translit' is the function from a string to a string, also acting as a placeholder, to provide a scope with the values allbut, fromset, toset, lastto, and squash---these are computed from the command line parameters but are constant in the loop.
The code is fairly hairy. One particual problem is the use of special integer values from xindex. Unfortunately, I could not decide how to clean that up.
The function itself (along with xindex above) was originally more-or-less directly transcribed from the original, then refactored to its present state. The refactoring has made the loop structure of translit'' much clearer.
This code uses the "skip" function to "do nothing" as a monadic statement. The alternative, either using "return ()" or repeating code, would be more confusing. The two skips are important: the first is what doesn't print anything while squashing runs of the last character of a toset, while the second is what doesn't print anything when the toset has been omitted.
The squashing parameter to translit'' is responsible for the requirement that adjacent squashed characters all be replaced by a single squash character. This code would otherwise replace all of the squashed characters by a copy of the terminal character of the toset---no squashing would occur.
translit' :: Bool -> String -> String -> Int -> Bool -> String -> String
translit' allbut fromset toset lastto squash = translit'' False
where
translit'' squashing [] = []
translit'' squashing (c:rest) =
let i = xindex fromset c allbut lastto in
if squash && (i >= lastto) && (lastto >= 0)
then
if not squashing
then (toset !! lastto) : translit'' True rest
else translit'' True rest
else
if (i >= 0) && (lastto >= 0)
then (toset !! i) : translit'' False rest
else if (i < 0)
then c : translit'' False rest
else translit'' False rest
The final IO action is fairly simple, although it needs to integrate the assorted components:
translit = do (allbut,fromset,toset) <- parseargs
let lastto = (length toset) - 1
squash = (length fromset) > lastto || allbut
interact $ translit' allbut fromset toset lastto squash
I theorize that allbut is present because the strings used by K&P cannot hold all possible characters, so negated sets cannot be directly represented. I have kept it here, where its sole purpose is to tell xindex to reverse (more or less) the result of the index test.
Further:
To answer the obvious question raised by this experiment, we did not use sets for fromset and toset because sets that are large enough are not always available. The mapping array suggested in one of the following exercises should make translit run faster, but it does not mesh well with some of the other uses we have planned for dodash in Chapters 5 and 6.
Numbers
Software Tools provides a small number of numeric functions, used to convert numbers to and from strings. These functions are used by the subsequent code. Most of these duplicate elements of the Prelude or library, but their definition is presented here for completeness.
itoc itself is pretty uninteresting. It replicates a standard library function, but illustrates one type-related issue: The Int type is a requirement; no other type would satisfy this because the result of ord is an Int.
Also, beware the difference between quotRem and divMod for negative numbers:
> (-13) `quotRem` 10 (-1,-3) > (-13) `divMod` 10 (-2,7)
The recursive call is not in tail position; this function will not execute as a loop in constant space. However, the number of digits in an Int is not that large. On the other hand converting it to tail-recursive form would not be difficult.
Note that this function does not correctly handle twos-complement numbers; there is one more negative number than can be represented as a positive number, so the first line will do something funny with the minimal negative Int. Like overflow.
itoc :: Int -> String
itoc n | n < 0 = '-' : itoc (-n)
itoc n = let (q,r) = n `quotRem` 10
d = chr ((ord '0') + r)
in if q == 0
then [ d ]
else (itoc q) ++ [d]
putdec calls itoc to do the conversion from Int to String, then uses two mapM_ calls to format the output. The first prints spaces to pad the field width from nd, the length of the string, to w. The second actually prints the characters of the string.
putdec :: Int -> Int -> IO ()
putdec n w = do
mapM_ (\n -> putc ' ') [ nd .. (w - 1) ]
mapM_ putc s
where
s = itoc n
nd = length s
ctoi is written as a case expression rather than as a sequence of pattern-matched definitions because the ctoi' helper is called from several branches of ctoi. (A 'where' definition like ctoi' would only be available to the immediately previous branch.)
The original uses a var i argument to index into the string being parsed. This argument is updated by ctoi, and presumably will be further used when parsing the string following the number. (They have used this and similar approaches in several other functions including getc, as well.) This code replaces that with a pair value: the first element is the number and the second is the remainder of the string with the number removed.
ctoi :: String -> (Int,String)
ctoi s = case s of
(c:s') | c == ' ' ||
c == '\t' -> ctoi s'
('-':s') -> let (n,s'') = (ctoi' 0 s') in (-1*n,s'')
('+':s') -> ctoi' 0 s'
(c:s') | isDigit c -> ctoi' 0 (c:s')
s' -> (0,s')
where
ctoi' n (c:s) | isDigit c = let d = (ord c) - (ord '0') in
ctoi' (10*n + d) s
ctoi' n s = (n,s)
