I will be referring back to this paper and encourage my readers to at

*least*skim it as well. It may provide more understanding in how it works.

It describes a simple, yet hard to comprehend functional parser. I have attempted to translate this to C++, but first I wrote a more intuitive, less functional one. Just a simple calculator, somewhat close to what's described in the pdf, and it worked like this:

The first stage took the input string and created a list of token type-lexeme pairs. Given the input

*"1+2*2"*, it would spit back

*{(INT,"1"),(OP,"+"),...}*. The next stage uses two mutually recursive functions to represent two parse states: that of sums and subtractions and that of numbers, multiplications and divisions. Since addition has the lowest precedence in this mini-language, it's safe to parenthesize the rest of the expression, meaning that if it parsed

*"1+"*, it will expect the next number to be a number, and it might want to add one to it, eagerly, but not if fallowed by a

*"2*2"*since multiplication has higher precedence.

These functions convert the vector of tokens to an expression tree that evaluates the arguments.

*"2*2"*would parse to a tree with a multiplier at its root and two numbers as its leaves. So, after the first state (sums) had read

*"1+"*, it would eagerly construct an addition object with the left argument being

*1*, and the right argument being the result of the rest of the parse! So it reads

*"2*2"*and builds a tree that becomes the right-side argument for our

*"1+"*tree.

This solution is theoretically consistent with building a language using flex and bison. The source can be found here (gist). But it isn't functional. That's when I stumbled back onto this research paper and decided to give a real go at it.

#### Functional Parsing.

The parser described in this paper does not act in the same way as my parser does. Rather than lexing, tokenizing, and building an expression tree, it cuts straight to the point. A*Parser<int>*will be a function that accepts a

*string*and produces

*int*s. But, the job may not be done; there may be more to parse. For every

*int*it parses, it will also store the suffix of the parse. So, if

*p*is some parser of type

*Parse<int>*,

*p("1abc")*will return a vector holding one match: the value,

*1*, and the suffix,

*"abc"*.

What does this class look like?

/* A Parser is a function taking a string and returning a vector of matches. */ template< class X > struct Parser { // The value is the target of the parse. (For example "1" may parse to int(1). using value_type = X; // A match consists of a value and the rest of the input to process. using parse_state = std::pair<X,std::string>; // A parse results in a list of matches. using result_type = std::vector< std::pair<X,std::string> >; // A parser is a function that produces matches. using function_type = std::function< result_type( const std::string& ) >; function_type f; Parser( function_type f ) : f(std::move(f)) { } Parser( const Parser<X>& p ) : f(p.f) { } Parser() { } result_type operator () ( const std::string& s ) const { return f( s ); } }; template< class X, class F > Parser<X> parser( F f ) { using G = typename Parser<X>::function_type; return Parser<X>( G(std::move(f)) ); }

I have mentioned in previous articles that one should avoid

*std::function*for efficiency reasons, but it vastly simplifies things here. As you can see,

*Parser*merely wraps itself around

*std::function*. I would encourage the reader to think of it as a type alias--not deserving of being called a new type, but more than a typedef.

This is consistent with how the paper defines this type:

*newtype Parser a = Parser (String -> [(a,String)])*

If nothing can be parsed, and empty list will be returned. If many things are parsed, a list of each match will be returned.

#### The Parser Monad.

As a reminder, the basic monadic operations are these:*a >> b = b -- Do a, then b.*

*a >>= f = b -- For each x from a, construct b with f(x).*

*mreturn x = a -- Construct a monad from a value.*

How does this relate to parsing? A parser is basically just a function, so if

*p*and

*q*are both parsers,

*p >> q*must return a new parser, a new function. The simple explanation (do

*p*, then

*q*) is correct. First,

*p*parses and let's say it returns a match,

*(x,rest)*.

*rest*is sent to

*q*for parsing and the

*x*is thrown away. It may sound odd to just throw away a value, but it will become more obvious soon.

If

*p*had failed to parse a value, then

*q*would not have been run.

The bind operation,

*p >>= f*, extracts the parsed value,

*x*, from

*p*and creates a new parser from

*f(x)*.

*mreturn x*creates a new parser that returns

*x*as its value. It accepts any string, even an empty one. Ideally,

*x*came from the output of another parser.

*p >> q -- Run parser p, then q.*

*p >>= f -- Construct a parser with p's matches.*

*mreturn x -- Construct a parser that returns x.*

We can define it like so:

template< class X > struct Monad< Parser<X> > { using Pair = typename Parser<X>::parse_state; using R = typename Parser<X>::result_type; /* * mreturn x = parser(s){ vector( pair(x,s) ) } * Return a parsed value. Forwards rest of input to the next parser. */ template< class M > static M mreturn( X x ) { return parser<typename M::value_type> ( [x]( const std::string& s ) { return R{ Pair(std::move(x),s) }; } ); } /* a >> b = b */ template< class Y, class Z > template< class Y, class Z > static Parser<Y> mdo( Parser<Z> a, Parser<Y> b ) { return a >>= [b]( const Z& z ) { return b; }; } /* Continue parsing from p into f. */ template< class F, class Y = typename std::result_of<F(X)>::type > static Y mbind( F f, Parser<X> p ) { using Z = typename Y::value_type; return parser<Z> ( [f,p]( const std::string& s ) { // First, extract p's matches. return p(s) >>= [&]( Pair p ) { // Then construct the new parser from the p's output. // Continue parsing with the remaining input with the new parser. return f( std::move(p.first) )( std::move(p.second) ); }; } ); } };

Do not worry if this source is difficult to understand. It is more important to understand how it is used (which is perhaps common with monads). Note that

*mdo*is defined such that for every successful parse of

*a*,

*b*is parsed. If

*a*fails to parse anything,

*b*fails, too.

These monadic operations are the core building blocks from which we can build more complex system, but the paper also discusses

*MonadZero*and

*MonadPlus*. They are both type classes, like

*Monad*, but extend it to do a few interesting things. In C++, one can concatenate two string by using simple addition:

*s1 + s2 = s12*.

*MonadPlus*is generalization of this.

*MonadZero*completes this generalization by supplying the additive identity. For example, we know that

*zero + x = x*. Thus,

*"" + s = s*.

In parsing terms,

*zero*would refer to a parser that matches nothing and adding two parsers,

*p+q*, will produce a third parser that accepts either

*p*'s or

*q*'s. For example,

*itentifier+number*would create a function that parses either identifiers or numbers.

We can define

*MonadPlus*and

*MonadZero*in the same way we would define

*Monad*.

template< class ... > struct MonadZero; template< class ... > struct MonadPlus; template< class M, class Mo = MonadZero< Cat<M> > > auto mzero() -> decltype( Mo::template mzero<M>() ) { return Mo::template mzero<M>(); } template< class A, class B, class Mo = MonadPlus<Cat<A>> > auto mplus( A&& a, B&& b ) -> decltype( Mo::mplus(std::declval<A>(),std::declval<B>()) ) { return Mo::mplus( std::forward<A>(a), std::forward<B>(b) ); } template< class X, class Y > auto operator + ( X&& x, Y&& y ) -> decltype( mplus(std::declval<X>(),std::declval<Y>()) ) { return mplus( std::forward<X>(x), std::forward<Y>(y) ); }

First, we define these for sequences.

template<> struct MonadZero< sequence_tag > { /* An empty sequence. */ template< class S > S mzero() { return S{}; } }; /* mplus( xs, ys ) = "append xs with ys" */ template<> struct MonadPlus< sequence_tag > { template< class A, class B > static A mplus( A a, const B& b ) { std::copy( b.begin(), b.end(), std::back_inserter(a) ); return a; } };

And then for Parsers.

/* mzero: a parser that matches nothing, no matter the input. */ template< class X > struct MonadZero< Parser<X> > { template< class P > static P mzero() { return parser<X> ( []( const std::string& s ){ return std::vector<std::pair<X,std::string>>(); } ); } }; /* mplus( pa, pb ) = "append the results of pa with the results of pb" */ template< class X > struct MonadPlus< Parser<X> > { using P = Parser<X>; static P mplus( P a, P b ) { return parser<X> ( [=]( const std::string& s ) { return a(s) + b(s); } ); } };

Since we usually only want the first successful parse, the paper define an operator,

*+++*, that does this.

template< class X > Parser<X> mplus_first( const Parser<X>& a, const Parser<X>& b ) { return parser<X> ( [=]( const std::string s ) { using V = std::vector< std::pair< X, std::string > >; V c = (a + b)( s ); return c.size() ? V{ c[0] } : V{}; } ); }

This completes the required building blocks. A parser of significant complexity could be made using only the above functions and types. The paper describes building a notably simple parser, so let's do that instead!

#### Basic Monadic Parsers.

The simplest parser is*item*, which accepts any char.

std::string tail( const std::string& s ) { return std::string( std::next(s.begin()), s.end() ); } /* * Unconditionally match a char if the string is not empty. * Ex: item("abc") = {('a',"bc")} */ auto item = parser<char> ( []( const std::string& s ) { using Pair = Parser<char>::parse_state; using R = Parser<char>::result_type; return s.size() ? R{ Pair( s[0], tail(s) ) } : R(); } );

To demonstrate its usage, the paper defines a simple parser that takes a string of length three or more and returns the first and third values.

auto p = item >>= []( char c ) { return item >> item >>= [c]( char d ) { return mreturn<Parser>( std::make_pair(c,d) ); }; };

*p*first runs

*item*to extract

*c*, then runs

*item*again but throws away the value. It runs

*item*a third time to extract

*d*and finally returns as its value

*(c,d)*.

*p("abcd")*would return

*{(('a','c'),"d")}*.

The next function creates a parser that is a little more helpful:

/* sat( pred ) = "item, if pred" */ template< class P > Parser<char> sat( P p ) { return item >>= [p]( char c ) { return p(c) ? mreturn<Parser>(c) : mzero<Parser<char>>(); }; }

Given some function that operates on chars, this extracts an

*item*, but then checks the condition without consuming any additional input. If

*p(c)*returns true, then it returns a parser with the value

*c*, otherwise

*zero*, a failed parse. Using this, we can define a parser to accept only a specific char.

Parser<char> pchar( char c ) { return sat( [=](char d){ return c == d; } ); }

And then, a parser that accepts only a specific string.

Parser<std::string> string( const std::string& s ) { if( s.size() == 0 ) return mreturn<Parser>( s ); Parser<char> p = pchar( s[0] ); for( auto it=std::next(s.begin()); it != s.end(); it++ ) p = p >> pchar( *it ); return p >> mreturn<Parser>(s); }

*Note: There is no name conflict with std::string because the source does not contain "using namespace std;".*

This function does something very odd. For every

*char*in

*s*, it chains together char parsers. For example,

*string("abc")*would return a parser equivalent to

*pchar('a') >> pchar('b') >> pchar('c')*>>

*mreturn<Parser>("abc")*. If any of the char parsers fail down the line,

*mreturn<Parser>(s)*will fail. Since we already know the values of the successful parses, their values are thrown away.

Though faithful to the paper, this may be less efficient than desirable. One could implement

*string*in this way, too:

Parser<std::string> string( const std::string& str ) { return parser<std::string> ( [str]( const std::string& s ) { using R = typename Parser<std::string>::result_type; if( std::equal(str.begin(),str.end(),s.begin()) ) { return R { { str, s.substr( str.size() ) } }; } else { return R(); } } ); }

It can at times be simpler to write out these functions instead of composing them, however, that can be thought of as an optimization.

The next function creates a new parser from a parser,

*p*, that accepts one or zero

*p*'s.

template< class X > Parser<std::vector<X>> some( Parser<X> p ) { using V = std::vector<X>; using Pair = std::pair<V,std::string>; using R = std::vector< Pair >; using P = Parser<V>; return mplus_first( parser<V> ( [=]( const std::string& s ) { auto matches = p(s); return matches.size() == 0 ? R{} : R{ Pair( V{std::move(matches[0].first)}, std::move( matches[0].second ) ) }; } ), mreturn<Parser>( V{} ) ); }

It was unfortunately difficult to translate, but does work. (The gist also contains a version called

*many*, which accepts zero or many

*p*'s.)

*some*converts a parser of type

*X*to one that produces a

*vector*or

*X'*s. It always succeeds, even if it does not successfully parse anything.

To create a parser that consumes whitespace is now trivial.

template< class X > using ManyParser = Parser< std::vector<X> >; ManyParser<char> space = some( sat([](char c){return std::isspace(c);}) );

We require one more function to parse alternating sequences. Like what? A sum, like "1+2+3" is a sort of alternating sequence of numbers and +'s. A product is an alternating sequence of numbers and *'s.

Here's the weird part: what does a parser that accepts a "+" return? What about a parser that accepts a "-"? The value of such a parser, as it turns out, is the binary function that it represents! In the case of the implementation below, this is a function pointer of type

*int(*)(int,int)*.

/* * chain(p,op): Parse with p infinitely (until there is no match) folding with op. * * p and op are both parsers, but op returns a binary function, given some * input, and p returns the inputs to that function. For example, if: * input: "4" * p returns: 4 * No operator is read, no operation is performed. But: * input: "4+4" * p returns: 4 * op is then parsed with the function, rest: * input: "+4" * op returns: do_add * input: "4" * p returns: 4 * rest returns: 8 * rest applies the operation parsed by op. It alternates between parsing p and * op until there are no more matches. */ constexpr struct Chainl1 { template< class X, class F > static Parser<X> rest( const Parser<X>& p, const Parser<F>& op, const X& a ) { // Alternate between op and p until input is consumed or a parse fails. auto r = op >>= [=]( const F& f ) { return p >>= [&]( const X& b ) { return rest( p, op, f(a,b) ); }; }; // Return the first successful parse, or a if none. return mplus_first( r, mreturn<Parser>(a) ); } template< class X, class F > Parser<X> operator () ( Parser<X> p, Parser<F> op ) const { return p >>= closet( rest<X,F>, std::move(p), std::move(op) ); } } chainl1{};

#### Adding the final touches.

The paper describes a few generally useful functions for constructing parsers based off the above function.*space*(defined above) consumes whitespace.

*token*consumes any trailing whitespace after parsing

*p*.

constexpr struct Token { template< class X > Parser<X> operator () ( Parser<X> p ) const { return p >>= []( X x ) { return space >> mreturn<Parser>(std::move(x)); }; } } token{};

*symb*converts a string to a token.

auto symb = compose( token, string ); // symb(s) = token( string(s) )

*apply*consumes any leading whitespace.

constexpr struct Apply { template< class X > Parser<X> operator () ( Parser<X> p ) const { return space >> std::move(p); } } apply{};

The big idea is that we never want to manually write a function that returns a list of successful parses. It's hard! It's much easier to compose such functions from smaller, more comprehensible ones and use those to build reasonable complex, but more simple to reason about, parsers.

#### The parser itself.

Now, using all of the tools provided, we can create a parser much more trivially than otherwise--though that is perhaps true of any time one has a new set of tools. First, we define a parser that accepts digits.constexpr bool is_num( char c ) { return c >= '0' and c <= '9'; } /* Parse one digit. */ Parser<int> digit = token( sat(is_num) ) >>= []( char i ) { return mreturn<Parser>(i-'0'); };

Now,

*digit("2")*returns

*2*, but

*(digit >> digit)("22")*returns

*2*as well! Why? Because the first run of

*digit*extracts the first

*2*, but throws that value away and the second run of

*digit*extracts the second

*2*. To parse a two-digit number, we need something like this:

Parser<int> twoDigit = digit >>= []( int x ) { return digit >>= [x]( int y ) { return mreturn<Parser>( x*10 + y ); }; };

It extracts the first then the second digit and returns the original number, converted from a string to an int! To parse arbitrarily long numbers (diverging from the paper's version), we can define a chain operation!

int consDigit( int accum, int digit ) { return accum*10 + digit; } Parser<int> num = chainl1( digit, mreturn<Parser>(consDigit) );

For every two digits parse,

*num*calls

*consDigit*to fold the values together. As mentioned earlier,

*chainl1*works by alternating between its two parsers. Since the second argument is a parser which consumes no input,

*num*only accepts

*digit*s.

Next, we can define the parsers for binary operations.

// Binary operations of type int(*)(int,int). int do_add( int x, int y ) { return x + y; } int do_sub( int x, int y ) { return x - y; } int do_mult( int x, int y ) { return x * y; } int do_div( int x, int y ) { return x / y; } auto addop = mplus ( pchar('+') >> mreturn<Parser>(do_add), pchar('-') >> mreturn<Parser>(do_sub) ); auto mulop = mplus ( pchar('*') >> mreturn<Parser>(do_mult), pchar('/') >> mreturn<Parser>(do_div) );

*addop*parses either a "+" or "-" and returns either

*do_add*or

*do_sub*, respectively. Because the parsers must return functions of the same types,

*std::plus*and

*std::minus*could not be used.

With this, we can define a

*term*as an alternating sequence of numbers and multiplications and divisions; and an

*expr*(ession) as an alternating sequence of

*term*s, +'s and -'s.

/* * Parse terms: series of numbers, multiplications and divisions. * Ex: "1*3*2" -> (3,"*2") -> 6 */ Parser<int> term = chainl1( num, mulop ); /* * Parse expressions: series of terms, additions and subtractions. * Ex: "1+7*9-1" -> (1."+7*9-1") -> (63,"-1") -> 62 */ Parser<int> expr = chainl1( term, addop );

And we're done! We have just built a calculator! It can evaluate any expression of additions, subtractions, multiplications, divisions, and it is whitespace agnostic. While implementing

*Parser*itself took a considerable amount of work, using it does not.

int main() { using std::cin; using std::cout; using std::endl; cout << "Welcome to the calculator!\n" << "Press ctrl+d or ctrl+c to exit.\n" << "Type in an equation and press enter to solve it!\n" << endl; std::string input; while( cout << "Solve : " and std::getline(std::cin,input) ) { auto ans = apply(expr)(input); if( ans.size() and ans[0].second.size() == 0 ) cout << " = " << ans[0].first; else cout << "No answer."; cout << endl; } }

I highly encourage anyone reading this to attempt to compile and modify the source code.

**See the gist at github for the source in full:**https://gist.github.com/4112114

**And for the original parser I wrote:**https://gist.github.com/4112114#file_trivial_parser.cpp

This comment has been removed by the author.

ReplyDeleteYou said you "have mentioned in previous articles that one should avoid std::function for efficiency reasons", but I can't find that anywhere (on this site or using Google). Can you give a reference? I'd be interested in reading what you wrote.

ReplyDeleteLooking back, I guess I never fully explained this position though it comes up in the Monadic IO article, where I implement it both with and without std::function. Basically, since std::function is a type eraser; the compiler loses all the useful information for optimization. It may resort to a pointer-to-implementation pattern, meaning heap allocation and indirect calls.

DeleteStill, without it, this would have been extremely challenging to write. (Again, demonstrated in the IO article.)

I see. I'll have to re-read the Monadic IO article again. Thanks.

DeleteCan I put your article on my web resource (with a direct link to your blog)?

ReplyDeleteI would be honoured. All of my work is public domain so feel free to use it in whatever way you feel is best.

DeleteThank you.

Delete