Fake: Generating Realistic Test Data in Haskell

On a number of occasions over the years I've found myself wanting to generate realistic-looking values for Haskell data structures.  Perhaps I'm writing a UI and want to fill it in with example data during development so I can see how the UI behaves with large lists.  In this situation you don't want to generate a bunch of completely random Unicode characters.  You want things that look plausible so you can see how it will likely look to the user with realistic word wrapping, etc.  Later, when you build the backend, you actually want to populate the database with this data.  Passing around DB dumps to other members of the team so they can test is a pain, so you want this stuff to be auto-generated.  This also saves time for your QA people, who would otherwise have to create the data manually.  Even later you get to performance testing and you find yourself wanting to generate several orders of magnitude more data so you can load test the database, but you still want to use the same distribution so it continues to look reasonable in the UI and you can test UI performance at even bigger scale.

Almost every time I've been in this situation I thought about using QuickCheck's Arbitrary type class.  But that never seemed quite right to me for a couple of reasons.  First, Arbitrary requires that you specify functions for shrinking a value to simpler values.  This was never something I needed for these purposes, so it seemed overkill to have to specify that infrastructure.  EDIT: I was mistaken about this.  QuickCheck gives a default implementation for shrink.  Second, using Arbitrary meant that I had to depend on QuickCheck.  This always seemed too heavy to me because I didn't need any of QuickCheck's property testing infrastructure.  I just wanted to generate a few values and be done.  For a long time these issues were never enough to overcome the activation energy needed to justify releasing a new package.

More recently I realized that the biggest reason QuickCheck wasn't appropriate is that I wanted a different probability distribution than the one that QuickCheck uses.  This isn't about subtle differences between, say, a normal versus an exponential distribution.  It's about the bigger picture of what the probability distributions are accomplishing.  QuickCheck is significantly about fuzz testing and finding corner cases where your code doesn't behave quite as expected.  You want it to generate strings with things like different kinds of quotes to verify that your code escapes things properly, weird Unicode characters to check encoding issues, etc.  What I wanted was something that could generate random data that looked realistic for whatever kind of realism my domain needed.  These two things are complementary.  You don't just want one or the other.  Sometimes you need both of them at the same time.  Since you can only have one instance of the Arbitrary type class for each data type, riding on top of QuickCheck wouldn't be enough.  This needed a separate library.  Enter the fake package.

The fake package provides a type class called Fake, which is a stripped-down version of QuickCheck's Arbitrary type class intended for generating realistic data.  Alongside it we include a random value generator called FGen, which eliminates confusion with QuickCheck's Gen and helps to minimize dependencies.  The package does not provide predefined Fake instances for Prelude data types because it's up to your application to define what values are realistic.  For example, an Int representing age probably only needs to generate values in the interval (0,120].
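To make the idea concrete, here is a minimal, self-contained sketch of a "realistic value" type class using only base.  It deliberately does not use the fake package's own definitions: the names G, Realistic, intBetween, elementOf, and sample are hypothetical stand-ins invented for this illustration, and the generator is a simple linear congruential generator rather than whatever FGen uses internally.  It only shows the shape of the approach: the application wraps Int in its own Age type and decides that ages fall in (0,120].

```haskell
module Main where

import Data.Bits (shiftR)
import Data.Word (Word64)

-- Hypothetical stand-in for a generator type (NOT fake's FGen):
-- a function from a seed to a value and the next seed.
newtype G a = G { runG :: Word64 -> (a, Word64) }

instance Functor G where
  fmap f (G g) = G $ \s -> let (a, s') = g s in (f a, s')

instance Applicative G where
  pure a = G $ \s -> (a, s)
  G f <*> G g = G $ \s ->
    let (h, s1) = f s
        (a, s2) = g s1
    in (h a, s2)

-- One step of a linear congruential generator; the high bits are returned
-- because the low bits of an LCG are of poor quality.
nextWord :: G Word64
nextWord = G $ \s ->
  let s' = s * 6364136223846793005 + 1442695040888963407
  in (s' `shiftR` 33, s')

-- An Int in the inclusive range [lo, hi]; assumes lo <= hi.
intBetween :: Int -> Int -> G Int
intBetween lo hi = fmap pick nextWord
  where
    width = fromIntegral (hi - lo + 1) :: Word64
    pick w = lo + fromIntegral (w `mod` width)

-- One element of a non-empty list.
elementOf :: [a] -> G a
elementOf xs = fmap (xs !!) (intBetween 0 (length xs - 1))

-- Hypothetical analogue of a Fake-style class: one generator per type.
class Realistic a where
  realistic :: G a

-- The application, not the library, decides what a realistic age is.
newtype Age = Age Int deriving (Show, Eq)

instance Realistic Age where
  realistic = Age <$> intBetween 1 120

data User = User { userName :: String, userAge :: Age } deriving Show

instance Realistic User where
  realistic = User <$> elementOf ["Alice", "Bob", "Carol"] <*> realistic

-- Draw n values, threading the seed from one draw to the next.
sample :: Int -> Word64 -> G a -> [a]
sample n seed g = take n (go seed)
  where go s = let (a, s') = runG g s in a : go s'

main :: IO ()
main = mapM_ print (sample 5 42 (realistic :: G User))
```

Because the class here is separate from Arbitrary, the same Age type could also carry an Arbitrary instance tuned for corner cases without the two conflicting.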

It also gives you a number of "providers" that generate various real-world things in a realistic way.  Need to generate plausible user agent strings?  We've got you covered.  Want to generate US addresses with cities and zip codes that are actually valid for the chosen state?  Just import the Fake.Provider.Address.EN_US module.  But that's not all.  Fake ships with a number of other providers as well.  See the full list here.

I tried to focus on providers that I thought would be broadly useful to a wide audience.  If you are interested in a provider for something that isn't there yet, I invite more contributions!  Similar packages exist in a number of other languages, some of which are credited in fake's README.  If you are planning on writing a new provider for something with complex structure, you might want to look at some of those to see if something already exists that can serve as inspiration.

One area of future exploration where I would love to see activity is something building on top of fake that allows you to generate entire fake databases matching a certain schema and ensuring that foreign keys are handled properly.  This problem might be able to make use of fake's full constructor coverage concept (described in more detail here) to help ensure that all the important combinations of various foreign keys are generated.
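As a rough illustration of the foreign key part of that problem, the sketch below generates parent rows first and then lets every child row pick its key from the parents that already exist.  Everything in it is hypothetical and invented for this example (Customer, Order, pickIndex, fakeOrders); it uses only base, does not touch the fake package, and does not attempt the full constructor coverage idea.

```haskell
module Main where

import Data.Bits (shiftR)
import Data.List (mapAccumL)
import Data.Word (Word64)

newtype CustomerId = CustomerId Int deriving (Show, Eq)

data Customer = Customer
  { customerId   :: CustomerId
  , customerName :: String
  } deriving Show

data Order = Order
  { orderId       :: Int
  , orderCustomer :: CustomerId  -- foreign key into the customers table
  , orderQuantity :: Int
  } deriving Show

-- One step of a linear congruential generator, returning the new seed and
-- an index in [0, n).  Assumes n > 0.
pickIndex :: Int -> Word64 -> (Word64, Int)
pickIndex n s = (s', fromIntegral ((s' `shiftR` 33) `mod` fromIntegral n))
  where s' = s * 6364136223846793005 + 1442695040888963407

-- Parent rows are created first so that their keys are known.
customers :: [Customer]
customers = zipWith Customer (map CustomerId [1 ..]) ["Alice", "Bob", "Carol"]

-- Child rows only ever reference keys taken from the given parents, so
-- referential integrity holds by construction.  Assumes cs is non-empty.
fakeOrders :: Int -> Word64 -> [Customer] -> [Order]
fakeOrders n seed cs = snd (mapAccumL step seed [1 .. n])
  where
    step s oid =
      let (s1, i) = pickIndex (length cs) s
          (s2, q) = pickIndex 5 s1
      in (s2, Order oid (customerId (cs !! i)) (q + 1))

main :: IO ()
main = mapM_ print (fakeOrders 5 42 customers)
```

A real tool would need to derive this ordering from the schema itself and cope with cyclic references, which is where most of the difficulty lies.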
