Archive | Regular Expressions RSS for this section

An Intro To Parsec

Taking a break from the Photo Booth, I’ve been working on a re-write of dmp_helper. If you’ve been following me for a while, you may remember that as a quick, dirty hack I did in perl one day to avoid manually writing out HTML snippets. While it mostly works, it’s objectively terrible. I guess if my goal is to never get a job doing Perl (probably not the worst goal to have) then it might serve me well.

But I digress. DmpHelper, the successor to dmp_helper, is going to be written in Haskell, and make extensive use of the Parsec parser library. The old dmp_helper makes extensive use of regular expressions to convert my custom markup to HTML, but I’d prefer to avoid them here. I find regular expressions to be hard to use and hard to read.

So I set off to work. And not 3 lines into main :: IO (), I came across a need to parse something! This is something that has been parsed many times before, but it’s good practice so it’s a wheel I’ll be re-inventing. I’m talking of course about parsing the ArgV!

Some Choices To Make

But before I get to that, I need to make some choices about how my arguments will look. I’ve decided to use the “double dash” method. Therefore, all flags will take the form of --X, where X is the flag. Any flag can take an optional argument, such as --i foo, all text following a flag is considered to belong the the preceeding flag until another flag is encountered. Some flags are optional, and some are mandatory. The order flags appear is not important.

So far, I have 3 flags implemented: --i [INPUT_FILE], the file to be converted, --o [OUTPUT_FILE], the location to save the resulting HTML, and --v, which toggles on verbose mode. Other flags may be added in as development progresses.

Data ArgV

Next, I’ll create a type for my ArgV.

data ArgV = ArgV {inputFile :: FilePath, outputFile :: FilePath, isVerbose :: Bool} deriving (Show)

As you can see, my type has three fields, which line up with my requirements above.

Introducing Parsec

Parsec is a fairly straight-forward library to use. It has a lot of operators and functions, whcih require some thought. However they all amount to basic functions that do a specific parsing task. Your job is to tell these functions what to do. So let’s get right down to it.

parseArgV :: [String] -> Either ParseError ArgV parseArgV i = parse pArgV "" $ foldArgV i foldArgV :: [String] -> String foldArgV i = foldl' (++) [] i

Here we have two functions to get us started. The function foldArgV collapses the list of strings into a single string, to be parsed. Because of the way that arguments are parsed by the operating system, the argument string --i foo --o bar --v will be collapsed to --ifoo--obar--v. This is good for us because this means we don’t have to worry about parsing white space.

The second function, parseArgV is our entry point into Parsec. All of Parsec’s combinator functions return a Parser monad. The function parse parses the data in the parser monad and returns either an error, or a whatever. In our case, it’s parsing to an ArgV.

Parse takes 3 arguments: a parser, a “user state”, and a string to parse. The first and third are pretty self explanatory, but the second is a mystery. Unfortunately I don’t have an answer for you as to it’s nature. I spent a good hour or two researching it yesterday, and the best I could come up with is “don’t worry about it”. None of the tutorials use it, and the documentation basically just says “this is state. Feel free to pass an empty string, it’ll be fine”. I’ve decided I won’t worry my little head about it for the time being.

pArgV :: CharParser st ArgV pArgV = do permute $ ArgV <$$> pInputFile <||> pOutputFile <|?> (False, pIsVerbose)

This function is the true heart of my parser. I’d like to draw your attention to the invocation of permute. This function implements the requirement that the order of arguments shouldn’t matter. Ordinarily, parsec will linearly go through some input and test it. But what do you do if you need to parse something that doesn’t have a set order? The library Text.Parsec.Perm solves this problem for us.

The function permute has a difficult type signature to follow, so a plain-english explanation is in order. Permute is used in a manner that resembles an applicative functor. Permute takes a function, and then using it’s operators, it takes functions that return a parser for the type of each field in the function in a pseudo-applicative style. These functions don’t need to come in the order that they’re parsed in, but in the order that the initial function expects them to be. Permute will magically parse them in the correct order and return a parser for your type. Let’s break this down line-by-line:

... permute $ ArgV <$$> pInputFile ...

Here we call permute on the constructor for ArgV. We feed it the first function, pInputFile using the <$$> operator. This operator is superficially, and logically similar to the applicative functor’s <$> operator. It creates a new permutation parser for the type on the left using the function on the right.

... <||> pOutputFile ...

Here we feed the second parser function, pOutputFile to the permutation parser using the <||> operator. This operator is superficially, and logically similar to applicative’s <*> operator.

... <|?> (False, pIsVerbose) ...

Finally, we feed it a parser for isVerbose. The operator <|?> works the same as the <||> operator, except that this operator is allowed to fail. If <||> fails, the whole parser fails. If <|?> fails, then the default value (False, in this case) is used. This allows us to have optional parameters.

Moving along from here, here are some extra parsing functions we need.

pInputFile :: CharParser st FilePath pInputFile = do try $ string "--i" manyTill anyChar pEndOfArg

This parser parses the --i parameter. The first line of the do expression parses the input to see if it is --i. Because it is wrapped in a call to try, it will only consume input if it succeeds. Since we don’t actually need this text, we don’t bind it to a name. If the parser had failed, it would exit the function and not evaluated the second line. On the second line, we take all the text until pEndOfArg succeeds. This text is what is returned from the function.

pEndOfArg :: CharParser st () pEndOfArg = nextArg <|> eof where nextArg = do lookAhead $ try $ string "--" return ()

This function detects if the parser is at the end of an argument. Notice the <|> operator. This is the choice operator. Basically, if the parser on the left fails, it runs the parser on the right. You can think of it like a boolean OR.

The parser eof is provided by parsec, and succeeds if there is no more input. The function lookAhead is the inverse of try. It only consumes input if the parser fails. By wrapping a try inside of a lookAhead, we create a parser that never consumes input.

The parser functions pOutputFile and pIsVerbose are almost identical to pInputFile, so I’m not going to bother typeing them out here.

Additional Reading

That’s about all there is to my ArgV parser. You can find some additional reading on the subject in Real World Haskell, whcih has an entire chapter dedicated to this subject. Also, check out the Parsec haddoc page on hackage.

I Used a Regex, Now I Have Two Problems…

So, I’m 16 posts in on this blog now. Every post, I’m getting fancy with code blocks and such. You know,

like this.

To make one of those, it takes a div, span, code, and pre tag, along with some custom CSS. You know how irritating that gets typing it over and over again? Did you guess “very”? You’d be right.

Luckily, I’m a programmer. I know regular expressions! I decided I’d try to remember how to do Perl. I wrote a script to handle regex find and replacing these tags. It can be found here.

I’ll not go over each line of that script, but here are some of the highlights:

On line 19:

if ($current =~ /^#/) { #comment found }

Checks to see if a line begins with a #. If it does, that line is a comment.

On line 23:

elsif ($current =~ /=/) { my @working = split(/=/, $current, 2); chomp($working[0]); chomp($working[1]); $return_value{$working[0]} = $working[1]; }

Checks to see if the current line contains an equals sign. If it does, it splits the current line, using the first equals sign as the delimeter. Thereby allowing replacement text to contain an equals sign. The two tokens then have leading and trailing whitespace trimmed before being added to the return_value hash.

On line 48:

for my $pair (keys(%config)) { $current =~ s/$pair/$config{$pair}/; }

Iterates through the token/replacement pair, replaces tokens in the current line.

The script takes a text file as an argument, reads a dmp_helper.rc file located in the same directory as the script, and prints the expanded text to STDOUT.

Given an input file containing:

I'm %j% a %p% %b%, nobody loves me... He's %j% a %p% %b%, %f% a %p% family! spare him his life %f% %t% monstrosity!

…and a dmp_helper.rc containing:

%j%=just %p%=poor %f%=from %t%=this %b%=boy

…this output is produced:

I'm just a poor boy, nobody loves me... He's just a poor boy, from a poor family! Spare him his life from this monstrosity!