CHAPTER FIVE

Text patterns

Regular expressions are a staple of software tools, an incredibly useful tool for describing and manipulating text. Software Tools presents a regular expression engine and a number of tools based on it. The first of those tools is find, a tool similar to grep.

PROGRAM

  find - find patterns in text

USAGE

  find pattern

FUNCTION

  find reads its input a line at a time and writes to its output those
  lines that match the specified text pattern.  A text pattern is a
  concatenation of the following elements:

    c          literal character c
    ?          any character except newline
    ^          beginning of line
    $          end of line (null string before newline)
    [ ... ]    character class (any one of these characters)
    [^ ... ]   negated character class (all but these characters)
    *          closure (zero or more occurrences of previous pattern)
    @c         escaped character (e.g., @^, @[, @*)

  Special meaning of characters in a text pattern is lost when
  escaped, inside [...] (except @]), or for:

    ^   not at beginning
    $   not at end
    *   at beginning

  A character class consists of zero or more of the following
  elements, surrounded by [ and ]:

    c      literal character c, including [
    c1-c2  range of characters (digits, lower or upper case letters)
    ^      negated character class if at beginning
    @c     escaped character (e.g., @^ @- @@ @])

  Special meaning of characters in a character class is lost when
  escaped or for:

    ^  not at beginning
    -  at beginning or end

  An escape sequence consists of the character @ followed by a single
  character:

    @n   newline
    @t   tab
    @c   c (including @@)

EXAMPLE

  To print lines ending ina Pascal keyword or identifier:

Writing a regular expression engine is not terribly difficult (writing a fast, featureiferous engine, however....), and Haskell makes it possible to write an elegant one.

The original has the observation:

Often it is possible not only to write code without knowing entirely where you're going, but also to test it. That's what we did with find. By using dummy versions of getpat and match, we were able to verify that lines of text are properly read and written....

I used a dummy version of getpat to experiment with representations for regular expressions. It definitely worked well: I was able to develop the internal pattern representation without dealing with the external representation.

Originally, I used a Pattern type:

--  data Pattern = Seq Pattern Pattern
--               | KStar Pattern
--               | Lit String
--               | Any
--               ...

The code depended on an amatch function, which interpreted the Pattern values:

--   ...
--   amatch (Lit s) str = if s `isPrefixOf` str
--                        then (True, drop (length s) str)
--                        else (False, str)
--   ...

In this code, each case looks at the current string and returns either True or False, depending on whether the pattern matches, and the suffix of the string following what the pattern accepted. This suffix was not used if the function returned False.

Two things come to mind once this pattern (ahem) becomes clear:

  • A value which is either False and empty or True and carrying an internal value sounds an awful lot like Maybe.
  • Threading the state of the string yet to be examined through the matching interpreter is an awful lot like what monads do.

I knew, of course, that Maybe was a monad. Proving that I can, in fact, put 2 and 2 together when they're bouncing off my head like little rocks, I ditched the code above and looked at a monadic regular expression matcher.

The type of a Pattern is a function from a String (the input to be examined) to a Maybe String (Nothing, if the Pattern does not match, or Just and the remaining string if it does.)

Except for kStarP (the Kleene star operator), each of the operators is written in a straightforward monadic way:

  • return <value> replaces Just <value>.
  • fail "ignored string" replaces Nothing. (fail in the Maybe monad ignores the string and becomes Nothing.)

Note: none of the code in the regular expression system involves the IO monad; it is all functional and referentially transparent. Wahoo!

type Pattern = String -> Maybe String

Matching

doneP is an end-of-pattern case, accepting any input.

doneP :: Pattern
doneP str = return str

anyP is only slightly more restrictive: it fails only when there is no more input.

anyP :: Pattern
anyP []  = fail "no match"
anyP str = return $ drop 1 str

The end-of-line Pattern matches either the newline at the end of the line, or the end of the string (for final lines that are not newline terminated).

eolP :: Pattern
eolP []       = return []
eolP ('\n':t) = return t
eolP _        = fail "no match"

litP accepts a string representing the literal text to match. It returns the suffix of the string if the text is a prefix of the string, and it fails otherwise.

Note that if str is empty, s (which is not empty) cannot be a prefix.

litP :: String -> Pattern
litP s str | s `isPrefixOf` str = return $ drop (length s) str
           | otherwise          = fail "no match"

classP (character classes) are similar to litP, except for the semantics of the test. nclassP (negated character classes) reverses the element test.

classP :: [Char] -> Pattern
classP cls (h:t) | h `elem` cls = return t
                 | otherwise    = fail "no match"
classP _   []                   = fail "no match"

nclassP :: [Char] -> Pattern
nclassP _   []                   = fail "no match"
nclassP cls (h:t) | h `elem` cls = fail "no match"
                  | otherwise    = return t

seqP represents the concatenation of regular expression operators; it sequences between two patterns.

seqP :: Pattern -> Pattern -> Pattern
seqP p1 p2 str = p1 str >>= p2

One regular expression operator which is not present in K&P's version is "or"; combining two patterns such that if the first fails, the second is tried at the same location. This particular version is greedy, since if the first pattern succeeds but subsequent patterns fail it does not backtrack through to try the second pattern, but it is useful in implementing the star regex operator.

orP :: Pattern -> Pattern -> Pattern
orP p1 p2 str = (p1 str) `mplus` (p2 str)

kStarP is a special Pattern. Like seqP, it provides sequencing between two patterns, because it has special failure semantics: It first attempts to match as much text as possible, repeating the first pattern. When it can no longer match anything, it calls on the second pattern with the remaining text.

Unfortunately, if the second pattern fails, the closure must backtrack one character at a time, repeatedly trying the second pattern. This process is recursive, and because it modifies the way the Maybe monad is used in the Patterns, it

  • has to be a sequencing operator connecting two patterns (like seqP), and
  • has to break the encapsulation of the Maybe monad used by the previous operators by invoking pat and interpreting the results.
kStarP :: Pattern -> Pattern -> Pattern
kStarP p1 p2 = (p1 `seqP` (kStarP p1 p2)) `orP` p2

The recursive use of kStarP in the sequence implements the initial maximum length matching, while orP implements the backtracking to the subsequent patterns.

With Patterns represented by functions, doesMatch, the function which tests the pattern on a given string, could hardly be simpler. It calls the pattern function with the string, then tests the result: Nothing indicates failure and a False value; Just indicates success and a True value.

doesMatch :: Pattern -> String -> Bool
doesMatch pat str = isJust (pat str)

And end-of-line anchor is a Pattern, but a beginning-of-line (BOL) anchor is not; instead it is represented by a boolean flag passed to the match function. The reason is simultaneously simplicity and efficiency. Normal pattern matching (for a simple matcher like this) requires checking the pattern beginning at every position on the line: Does the pattern match beginning at the first character? No? Does it match beginning at the second? A pattern anchored at the beginning of the line, however, only needs to be checked starting at the first character of the line. Thus, a special case handling a BOL anchor is well worth some extra code.

On the other hand, representing the BOL anchor as a Pattern would require unknown changes to the Patterns or how they are handled, since two patterns differing only in the presence of a BOL anchor have vastly different behavior. Thus, the special case actually simplifies the code.

So, here is match. If the pattern came with a BOL anchor, the pattern is applied to the beginning of the line; it succeeds if the one call to doesMatch succeeds; it fails if the one call fails. If there is no anchor, the standard library function tails is used to sequentially (and lazily) generate all of the successive suffixes of the line, starting with the whole line. doesMatch is then mapped over the suffixes by any, which returns True if any of the calls to doesMatch returns True.

To simplify things, the BOL anchor flag and the Pattern are joined into an AnchoredPattern.

type AnchoredPattern = (Bool, Pattern)

match :: AnchoredPattern -> String -> Bool
match (anchored, pat) ln | anchored  = doesMatch pat ln
                         | otherwise = any (doesMatch pat) (tails ln)

Patterns

Parsing the command line argument to build the Pattern used by match is more gruesome than complex. This code is rather ugly, fragile, and really ought to be replaced with a better parser. (Keep going for an alternative.)

getpat creates a combination of a pattern and a BOL flag. The actual pattern is built by getpat'.

getpat :: String -> AnchoredPattern
getpat ('^':t) = (True, getpat' t)
getpat      t  = (False, getpat' t)

getpat' is quite hairy. The first two cases handle the end of the pattern, and the end-of-line marker, at the end of the pattern (it is not special in any other position).

getpat' []    = doneP
getpat' ['$'] = eolP

An escape character followed by any other character, including the assorted metacharacters, is a literal pattern. (Some characters, such as 'n' and 't' are special when escaped: newline and tap, respectively.)

getpat' ('@':ch:t) = (litP [esc ch]) `seqP` (getpat' t)

There are two cases handling the wildcard character, because the Kleene star operator becomes a sequencing operation and needs to be found at the same time as the pattern it is acting on. The first case handles "?*", the second just "?".

getpat' ('?':'*':t) = anyP `kStarP` (getpat' t)
getpat' ('?'    :t) = anyP `seqP`   (getpat' t)

Character classes are significantly complex, because not only can they be modified by the Kleene star, the contents of the class also need special parsing by getccl.

getpat' ('[':t) = case t' of ('*':t'') -> cls `kStarP` (getpat' t'')
                             _         -> cls `seqP`   (getpat' t')
    where
      (len, cls) = getccl t
      t'         = drop len t

A special pattern handles a literal character followed by the Kleene star.

getpat' (ch:'*':t) = (litP [ch]) `kStarP` (getpat' t)

Finally, the case for literal text patterns. As much text as possible is accumulated into the literal by getlit.

getpat' (ch:t) = (litP lit) `seqP` (getpat' t')
    where
      lit = ch:(getlit t)
      t'  = drop ((length lit) - 1) t

In a literal string, any metacharacter can terminate the literal pattern (although some, such as '$' in the middle of a pattern, actually need not do so). getlit collects as much text as possible, then lets getpat' sort out the meaning of any special characters.

isMeta ch = ch == '?' || ch == '[' || ch == '$' || ch == '*' || ch == '@'

getlit :: String -> String
getlit (ch:t) | not (isMeta ch) = case t of ('*':_) -> [];
                                                  _ -> ch:(getlit t)
getlit _ = []

Collecting a character class requires a two-phase process. The outermost phase checks for the presence of '^' as the first character of the class, indicating a negated class. It is also responsible for converting the class string to the correct Pattern.

Since the length of a class expression is not necessarily related to the size of the resulting class, the second phase, getccl', counts the number of characters it eats, allowing getpat' to ignore the class expression.

It would be possible for getccl to specially handle a class containing ']', by identifying "[]...]" or "[^]...]". However, the specs in the original do not call for it. ('@' would escape ']' in the class, anyway.)

getccl :: String -> (Int, Pattern)
getccl ('^':t) = let (len,cls) = getccl' t in (len+1, nclassP cls)
getccl t = let (len,cls) = getccl' t in (len, classP cls)

getccl' is moderately complex, but not severely so. The major complication is accumulating the count of characters from the original expression; I was lazy and did not even try to make this tail-recursive. There's probably a snazzy way to do it with fold, too.

getccl' :: String -> (Int, String)
getccl' ('@':ch:t) = let (len,cls) = getccl' t
                     in (len+2, (esc ch):cls)
getccl' (']':t) = (1,[])
getccl' (ch1:'-':ch2:t) | ch2 /= ']' = let (len,cls) = getccl' t
                                       in (len+3, [ch1 .. ch2] ++ cls)
getccl' (ch:t) = let (len,cls) = getccl' t in (len+1, ch:cls)
getccl' [] = error "incomplete character class"

Find

find, ultimately, is incredibly simple; it only prints output lines if the pattern matched in the line.

main =
    do
      [arg] <- getArgs
      interact $ find arg
          where
            find arg = unlines . filter (match $ getpat arg) . lines

Changing text

The quintessential software tool is sed, the stream editor. Software Tools presents a simple version of a stream editor, change.

PROGRAM

  change - change patterns in text

USAGE

  change pattern [ newstuff ]

FUNCTION

  change copies its input to its output except that each
  non-overlapping string that matches pattern is replaced by the
  string newstuff.  A non-existent newstuff implies deletion of the
  matched string.  The patterns accepted by change are the same as
  those used by find.

  The replacement string newstuff consists of zero or more of the
  following elements:

    c    literal character
    &    "ditto", i.e., whatever was matched
    @c   escaped character c (e.g., @&)

EXAMPLE

  To parenthesize all sums and differences of identifiers:

    change '[a-zA-Z][a-zA-Z0-9]*[ ]*[+-][ ]*[a-zA-Z][a-zA-Z0-9]*' '(&)'

To implement change, pattern matching remains the same, but more information is needed than just whether or not a match exists. getmatch possibly returns a match, suffix pair. The match element is the portion of the string that matched; the suffix is the remainder.

getmatch :: Pattern -> String -> Maybe (String,String)
getmatch pat str =
    do
      suf <- pat str
      let mlen = (length str) - (length suf)
      return (take mlen str, suf)

An alternative, non-monadic implementation of the same algorithm would be:

getmatch pat str
    = case pat str of Nothing -> Nothing
                      Just suf -> let mlen = (length str) - (length suf)
                                  in Just (take mlen str, suf)

getallmatches is a bit fancy. It returns a list of (prefix, match, suffix) tuples, showing all of the possible matches of the (anchored) pattern in the string. (If the pattern is anchored at the beginning, only one match is possible, at the beginning of the string.)

The phrase

zip (inits str) (tails str)

generates a list consisting of pairs, where the first element of each pair is a prefix of the string and the second is the corresponding suffix. getallmatches' applies getmatch to the suffix component, returning a (match, subsuffix) pair; this is mated with the prefix and any failures are filtered out.

Unfortunately, that list contains duplicate suffixes: matches that ended at the same location in the string and thus cover the same ground. nubBy and suffixeq are used to filter out the duplicates.

On this point, the original says,

The main problem is what to do with null string matches, because unless one is careful, there can be unexpected null strings.... This ensures that the pattern a* matches the line xy at three points---before x, between x and y, and after y. We are also careful that a* matches xay at exactly three places as well; this is proper behavior even though it may not be immediately obvious.

Some unneeded work is done, generating and filtering the duplicates. I must be getting tired.

getallmatches :: AnchoredPattern -> String -> [(String, String, String)]
getallmatches (anchor,pat) str | anchor    = anchored
                               | otherwise = floating
    where
    anchored =                  allmatches [("",str)]
    floating = nubBy suffixeq $ allmatches allbreaks

    allbreaks = zip (inits str) (tails str)

    findmatch (prefix,rest) =
        do
          (match,suffix) <- getmatch pat rest
          return (prefix,match,suffix)

    allmatches = mapMaybe findmatch

    suffixeq (_,_,l) (_,_,r) = l == r

subline takes an anchored pattern and a change function, and applies them to a string to generate a new string. The work is done by the recursive function subline'.

  • The pre parameter represents the length of current prefix of the input string. This is used to ignore the first part of the match prefix p, in order to add the unchanged parts of the string. (Note that (drop pre p) will be [] if pre > (length p).)
  • The most recently seen suffix is carred through the recursion, to allow the terminal portion of the line to be generated at the end of the recursion. Note that the initial argument for this parameter is the string itself, which will be the return value if the pattern does not match.
subline :: AnchoredPattern -> (String -> String) -> String -> String
subline pat chng str = subline' 0 str (getallmatches pat str)
    where
    subline' _ suf [] = suf
    subline' pre _ ((p,m,s):rest)
        = let pre' = (length p) + (length m)
          in (drop pre p) ++ (chng m) ++ (subline' pre' s rest)

subline1 is an alternative to subline that will be used in the editor in the next chapter. Where subline changes all the occurrences of the matched pattern in the string, subline1 only changes the first (i.e., the leftmost) match. subline1 was based on subline, for the simple reason that subline was written first and in the most general fashion.

subline1 :: AnchoredPattern -> (String -> String) -> String -> String
subline1 pat chng str = subline1' str (getallmatches pat str)
    where
    subline1' suf [] = suf                                 -- no match
    subline1' _ ((p,m,s):_) = p ++ (chng m) ++ s   -- match and change

makesub is a function which returns a change function, for use with subline. Given a substitution string, the resulting change function either:

  • Changes the input string into the constant substitution string, if the substitution does not contain '&', or
  • Replaces the '&' with the match that satisfied the pattern.
makesub :: String -> String -> String
makesub subst
    = case elemIndex '&' subst
      of Nothing -> (\_ -> subst)
         Just n -> case splitAt (n-1) subst
                   of (pre, '@':'&':suf) -> (\_ -> (pre ++ '&':suf))
                      (pre, '&':suf) -> (\s -> pre ++ s ++ suf)

The change program itself is fairly simple.

main = do
       args <- getArgs
       case args
            of [pat,subst] -> dochange (getpat pat) (makesub subst)
               [pat]       -> dochange (getpat pat) (makesub "")
           where
             dochange p c = interact $ unlines . map (subline p c) . lines

gloria i ad inferni
faciamus opus

Return to Top | About this site...
Last edited Sat Aug 8 03:29:10 2009.
Copyright © 2005-2012 Tommy M. McGuire