Tuesday, February 17, 2015

Common algorithm patterns.

Of the STL, <algorithm> may be the most well-used non-container library, but often requires a level of verbosity that requires just as much typing as the hand-written loop, making it not always feel so convenient. It benefits code that uses it with increased clarity and mathematical soundness, so reducing the syntactic overhead should be a goal of those using it. Today I will talk about these problems and demonstrate ways of making it more terse and sound.

Usually, I try to put all the code in my articles in one gist, but this time I will base it off my most recent work--a library called FU (Functional Utilities)--for sake of simplicity. Still, it applies as well to fit, FTL, and to some degree, range-v3, if you use any of those.

Projection


We often store large data structures for use in many areas, but a specific algorithm may only need one or two members. Say I have a list of people and I want to sort them in order of their year_of_birth. I'm sure anyone with much experience with C++ has seen code like this:
std::sort(std::begin(people), std::end(people),
          [](const Person& a, const Person& b) {
            return a.year_of_birth < b.year_of_birth;
          });
This pattern occurs so frequently than range-v3 accepts "projection" functions for algorithms such as this.
range::v3::sort(people, std::less<>{}, &Person::year_of_birth);
The projection converts its argument to make it suitable for the comparison function, std::less in this example. Internally, it ends up calculating the same thing as the lambda, extracting year_of_birth from both arguments and calling less on them.

One might ask, "but &Person::year_of_birth is not a function, so how can we pass it as one?" Internally, range-v3 and FU use a more generic concept of "invoking", which does not limit them to regular functions. See n3727, or std::invoke (C++17) for more.

This issue is not limited to <algorithm> or std::sort, it is a generic problem, so I wrote fu::proj for use with the current STL and elsewhere.

std::sort(std::begin(people), std::end(people),
          fu::proj(std::less<>{}, &Person::year_of_birth));
But since requiring a projection over less is so very common,  fu::proj_less can be more convenient.
std::sort(std::begin(people), std::end(people),
          fu::proj_less(&Person::year_of_birth));
Fit also provides a similar utility, by, although the syntax is a bit different.

std::sort(std::begin(people), std::end(people),
          fit::by(std::mem_fn(&Person::year_of_birth), _ < _));
I argue that each of the four versions above would be more clear and less error prone than the first, using a lambda. Rather than specifying exactly how to apply the arguments, we need only specify what to apply.

So why not use std::bind for examples like this?
using namespace std::placeholders;
std::sort(std::begin(people), std::end(people),
          std::bind(std::less<>{},
                    std::bind(&Person::year_of_birth, _1),
                    std::bind(&Person::year_of_birth, _2)));
With examples like that, it's no wonder people prefer lambdas! It actually becomes less readable with the overly-verbose and overly-general std::bind

Projection works well enough for comparisons, but then you have a function like std::accumulate and you only want to project the right-hand argument. Consider calculating the average age of the list of people.
auto acc = [](int accum, const Person& p) { return accum + p.age(); }
int sum = 
  std::accumulate(std::begin(people), std::end(people), 0, acc);
float ave = float(sum) / people.size();
Again, range-v3 allows projection here:
int sum = range::v3::accumulate(people, std::plus<>{}, &Person::age);
float ave = float(sum) / people.size();
One can use fu::rproj (right-projection) with the existing STL.
int sum = 
  std::accumulate(std::begin(people), std::end(people),
                  0, fu::rproj(std::plus<>{}, &Person::age));
float ave = float(sum) / people.size();
While proj and rproj will be sufficient for most <algorithm> functions, we do have one special case: std::transform or std::equal. Here, the right-hand and left-hand arguments of our function may be different types, but we still might want to use something like std::plus or std::equal_to to combine them.

For lack of a better example, consider that my struct, Person, contains a member, year_of_death. I want a sanity check that for any person, year_of_birth < year_of_death.
bool ok = std::equal(std::begin(people), std::end(people),
                     std::begin(people),
                     [](const Person& a, const Person& b) {
                       return a.time_of_birth < b.time_of_death;
                     });
With range-v3:
bool ok = range::v3::equal(people, people, std::less<>{},
                           &Person::year_of_birth, &Person::year_of_death)
For fu, I decided to call this function where the right- and left-hand arguments require different projections join instead of binary_proj or bproj.
auto pred = fu::join(std::less<>{}, &Person::year_of_birth,
                                    &Person::year_of_death);
bool ok = std::equal(std::begin(people), std::end(people),
                     std::begin(people), pred);

Although, a more obvious solution might be to use std::all_of, using fu::split(f, left, right), which produces a function, s, such that s(x) == f(left(x), right(x)).
bool ok = std::all_of(std::begin(people), std::end(people),
                      fu::split(std::less<>{}, &Person::year_of_birth,
                                               &Person::year_of_death));
So, in this instance, the lambda might actually win out on terseness, but projection does one nice thing for us: it asserts mathematical soundness. It might be very tempting to write an equality operator between a Person and an std::string to compare the person's name,  especially for use with <algorithm>s, but it makes no real sense. I also store the person's birthplace as a string, so should person == "place" be valid, too? No, it makes much more sense to define a predicate based on a function with well-known mathematical properties, like std::equal_to, and use projection to obtain the items for which equality has already been defined.

Haskell programmers like to say "if it compiles, it works!" Some of that has to do with the idea of "referential transparency", but another large part of it relates to using mathematically sound operations to assert correctness. If each component, or function, is obviously correct, and the algorithm is made of obviously correct operations, then most likely the algorithm is obviously correct.

Let's use this concept to prove the correctness of our first example, sorting by the projection of std::less on the projection of &Person::year_of_birth. It may seem unnecessary, but the rigors of mathematics require that we can formally prove these things, not just intuit or know them. 
  • We can assume the correctness of std::sort by definition, if the comparison function implements a strict weak order. Strict weak order for any comparison, comp, means that it must be irreflexive (!comp(x,x)), weakly ordered (!comp(a, b) && !comp(b, a) implies a == b), and transitive (comp(a,b) && comp(b,c) => comp(a,c)). 
  • We know by definition that std::less meets this requirement thus proj_less(f) meets this requirement if f(a) < f(b) preserves this property (also by definition). 
  • Finally, we know that f(a) == a.year_of_birth, stored as an integer, and that a set of integers sorted by < are strict-weak ordered. 
Thus, we can call that example "obviously correct." In fact, we can further state that any ordering over proj_less(f) will be correct with very few, if any, exceptions.

Mathematical soundness and obvious correctness make it easy to reason about code, but projection alone does not give us the required tools to accomplish it. We need other fundamental tools, as I will show in the next section.

The building blocks.


Say I want to construct a list of people born in Germany. Like always, we can start with the generic lambda.
std::vector<Person> germans;
std::copy_if(std::begin(people), std::end(people),
             std::back_inserter(germans),
             [](const Person& p) { return p.birth_place == "Germany"; });
Range-v3 does allow a projection for copy_if, but unfortunately, that won't much help us.
range::v3::copy_if(people, std::back_inserter(germans),
                   [](const std::string& place) { return place == "Germany"; },
                   &Person::birth_place);
It would have been less typing without the projection, although we at least know that the lambda and projection both are obviously correct and thus is the expression. To really get the desired level of terseness, we need a partial application of equality.

FU supplies a function, part, so that we could write "auto pred = fu::part(std::equal_to<>{}, "Germany")", and then one could write "pred("Hungary")", which would return false, or "pred("Germany")", true, but it's just not terse. Since one often needs to pass binary operators or partially-applied operators to functions like copy_if, FU also supplies a number of such functions. For this example, we need fu::eq, which acts similarly to std::equal_to except that fu::eq(x) returns the partial application. (But fu::eq(x,y) does compute x == y, like one would expect.) So with that, we can return to our example:
range::v3::copy_if(people, std::back_inserter(germans),
                   fu::eq("Germany"), &Person::birth_place);
Fit's placeholders utilities also recognize the importance of partial applications like this.
range::v3::copy_if(people, std::back_inserter(germans),
                   _ == "Germany", &Person::birth_place);
And with the STL using FU,
std::copy_if(std::begin(people), std::end(people),
             std::back_inserter(germans),
             fu::proj(fu::eq("Germany"), &Person::birth_place));
(Note: fu::ucompose, short for "unary composition", might be more appropriate here, but proj(f,g)(x) is roughly equivalent to ucompose(f,g)(x).)

Let's say I need all the names stored in people in a contiguous string, separated by commas. I could just call std::accumulate, starting with an empty std::string and append each one using rproj(std::plus<>{}, &Person::name), but I want to minimize the number of required allocations so I have to sum up their sizes first. It'll look something like this, with range-v3:
auto name_len = [](const Person& p) { return p.name.size(); };
size_t size = range::v3::accumulate(people, 0, std::plus<>{}, name_len);
                                                                         
std::string names;
names.reserve(size + people.size() * 2);
                                                                         
auto append = [&](const std::string& str) { names.append(str);
                                          names.append(", "); };
range::v3::for_each(people, append, &Person::name);
(For simplicity, I'm ignoring that this algorithm will append an extra ", " at the end.)

Again, we could not represent the mini-functions name_len and append as simple projections and had to resort to writing out the lambdas. However, can we represent them using fu::proj? Well, for name_len, yes.
auto name_len = fu::proj(&std::string::size, &Person::name);
Invoking &Person::name returns an std::string, and &std::string::size gives us the desired result--it's obviously correct. append, however... no, and for two reasons. Firstly, std::string::append is an overloaded function so we can't just take its address; we'd need to wrap it in a lambda, anyway. Secondly, we also need to append ", ", and while one could write a series of generic functional utilities and define append as accordingly, we'd most likely lose our "obvious correctness" with no real gains. For completeness, here is the version for the STL:
auto name_len = fu::proj(&std::string::size, &Person::name);
size_t size = std::accumulate(std::begin(people), std::end(people),
                              0, fu::rproj(std::plus<>{}, name_len));
                                                                    
std::string names;
names.reserve(size + people.size() * 2);
                                                                    
auto append = [&](const std::string& str) { names.append(str);
                                          names.append(", "); };
std::for_each(std::begin(people), std::end(people),
              fu::proj(append, &Person::name));

Conclusions


One might wonder why, in the last example, I prefer to make append operator on just an std::string and insist on projecting over &Person::name. I can trivially prove the correctness of append(str); it appends str followed by ", " to names, and it might even by useful in another part of the program. proj(append, &Person::name) is obviously correct for appending a person's name. Each unit, append and &Person::name operates on one level of abstraction; proj links them together. If the way a Person holds its name changes, I only need to change &Person::name and append remains valid. If I need to change append, the projection still remains valid.

Keeping our functions small and on one layer of abstraction makes the program more robust, aside from maintaining soundness of the algorithm. FU's proj, rproj, part, and split can be used to construct most simple predicates, relational functions, and accumulators, and can also be combined in simple, granular expressions to make more complex ones.

Links

fu::proj_less: proj_less has a very simple definition: proj(std::less<>{}), although it actually uses a helper so that it may be constexpr. Most fu functions are objects that can be implicitly partially applied. In fact, the definition of proj is actually closer to [](auto f, auto pf, auto x, auto y) { return f(pf(x), pf(y)); }, but perfect forwarding. Thus, proj(std::less<>{}, &Person::name) is also a partial application.

fu::rproj(std::plus<>): I considered adding a function to FU called accumulator = rproj(std::plus<>{}), but it didn't seem very intuitive. Please comment below if you have an opinion on that.

range::v3::copy_if: It may be preferable to use range::v3::view::remove_if(rng, std::not_fn(pred)) for the example given. One might want to use range::v3::view::filter, but it has been deprecated.

fit::by: We require std::mem_fn for Fit because it does not use a generalized invoke function.

1 comment:

Please feel free to post any comments, concerns, questions, thoughts and ideas. I am perhaps more interested in what you have to say than you are in what I do.