Fake: Generating Realistic Test Data in Haskell
On a number of occasions over the years I've found myself wanting to generate realistic-looking values for Haskell data structures. Perhaps I'm writing a UI and want to fill it in with example data during development so I can see how the UI behaves with large lists. In this situation you don't want to generate a bunch of completely random Unicode characters. You want things that look plausible so you can see how it will likely look to the user with realistic word wrapping, etc. Later, when you build the backend, you actually want to populate the database with this data. Passing around DB dumps to other members of the team so they can test is a pain, so you want this stuff to be auto-generated. This saves time for your QA people, who would otherwise have to create it manually. Even later you get to performance testing and you find yourself wanting to generate several orders of magnitude more data so you can load test the database, but you still want to use the same distribution so it continues to look reasonable in the UI and you can test UI performance at even bigger scale.
Almost every time I've been in this situation I thought about using QuickCheck's `Arbitrary` type class. But that never seemed quite right to me for a couple of reasons. First, `Arbitrary` requires that you specify functions for shrinking a value to simpler values. This was never something I needed for these purposes, so it seemed overkill to have to specify that infrastructure. (EDIT: I was mistaken about this. QuickCheck gives a default implementation for `shrink`.) Second, using `Arbitrary` meant that I had to depend on QuickCheck. This always seemed too heavy to me because I didn't need any of QuickCheck's property testing infrastructure. I just wanted to generate a few values and be done. For a long time these issues were never enough to overcome the activation energy needed to justify releasing a new package.
More recently I realized that the biggest reason QuickCheck wasn't appropriate is that I wanted a different probability distribution than the one that QuickCheck uses. This isn't about subtle differences between, say, a normal versus an exponential distribution. It's about the bigger picture of what the probability distributions are accomplishing. QuickCheck is significantly about fuzz testing and finding corner cases where your code doesn't behave quite as expected. You want it to generate strings with things like different kinds of quotes to verify that your code escapes things properly, weird Unicode characters to check encoding issues, etc. What I wanted was something that could generate random data that looked realistic for whatever kind of realism my domain needed. These two things are complementary. You don't just want one or the other. Sometimes you need both of them at the same time. Since you can only have one instance of the `Arbitrary` type class for each data type, riding on top of QuickCheck wouldn't be enough. This needed a separate library. Enter the `fake` package.
The `fake` package provides a type class called `Fake`, which is a stripped-down version of QuickCheck's `Arbitrary` type class intended for generating realistic data. With this we also include a random value generator called `FGen`, which eliminates confusion with QuickCheck's `Gen` and helps to minimize dependencies. The package does not provide predefined `Fake` instances for Prelude data types because it's up to your application to define what values are realistic. For example, an `Int` representing age probably only needs to generate values in the interval (0,120].
It also gives you a number of "providers" that generate various real-world things in a realistic way. Need to generate plausible user agent strings? We've got you covered. Want to generate US addresses with cities and zip codes that are actually valid for the chosen state? Just import the Fake.Provider.Address.EN_US module. But that's not all. Fake ships with providers that include:
- Correctly formatted US social security numbers
- Various English parts of speech
- English names, including gender-appropriate first names
- US phone numbers with valid area codes and prefixes
See the full list here.
I tried to focus on providers that I thought would be broadly useful to a wide audience. If you are interested in a provider for something that isn't there yet, I invite more contributions! Similar packages exist in a number of other languages, some of which are credited in fake's README. If you are planning on writing a new provider for something with complex structure, you might want to look at some of those to see if something already exists that can serve as inspiration.
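To make the type class design described above a little more concrete, here is a minimal, self-contained sketch of the general idea using only base. It is not the fake package's actual API: `MiniGen`, `MiniFake`, `miniFake`, `inRange`, `oneOf`, `sample`, `Age`, and `Person` are all hypothetical names invented for this illustration, and the toy linear congruential generator merely stands in for a real random source. What it tries to show is the shape of the approach: a small generator type, a class with one method, and application-defined instances that encode what "realistic" means (here, an age in the interval (0,120]).

```haskell
module Main where

-- Hypothetical stand-in for a generator type: a function from a seed to a
-- value plus the next seed. This is NOT the fake package's FGen.
newtype MiniGen a = MiniGen { runMiniGen :: Int -> (a, Int) }

instance Functor MiniGen where
  fmap f (MiniGen g) = MiniGen $ \s -> let (a, s') = g s in (f a, s')

instance Applicative MiniGen where
  pure a = MiniGen $ \s -> (a, s)
  MiniGen gf <*> MiniGen ga = MiniGen $ \s ->
    let (f, s1) = gf s
        (a, s2) = ga s1
    in (f a, s2)

-- One step of a toy linear congruential generator (illustrative only).
nextSeed :: Int -> Int
nextSeed s = (s * 1103515245 + 12345) `mod` 2147483648

-- Pick an Int from the inclusive range [lo, hi]; assumes lo <= hi.
inRange :: (Int, Int) -> MiniGen Int
inRange (lo, hi) = MiniGen $ \s ->
  let s' = nextSeed s
  in (lo + (s' `div` 65536) `mod` (hi - lo + 1), s')

-- Pick one element of a non-empty list.
oneOf :: [a] -> MiniGen a
oneOf xs = (xs !!) <$> inRange (0, length xs - 1)

-- Hypothetical analogue of a "realistic value" type class.
class MiniFake a where
  miniFake :: MiniGen a

newtype Age = Age Int deriving (Show, Eq)

-- The application decides what is realistic: ages fall in (0,120].
instance MiniFake Age where
  miniFake = Age <$> inRange (1, 120)

data Person = Person { personName :: String, personAge :: Age }
  deriving (Show, Eq)

instance MiniFake Person where
  miniFake = Person <$> oneOf ["Alice", "Bob", "Carol"] <*> miniFake

-- Produce n values from a starting seed, threading the seed through.
sample :: Int -> Int -> MiniGen a -> [a]
sample n seed g = take n (go seed)
  where go s = let (a, s') = runMiniGen g s in a : go s'

main :: IO ()
main = mapM_ print (sample 5 42 (miniFake :: MiniGen Person))
```

Wrapping `Int` in a newtype such as `Age` is one way to attach a domain-specific notion of realism to a type, which fits with the package not shipping instances for Prelude types.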
One area of future exploration where I would love to see activity is something building on top of `fake` that allows you to generate entire fake databases matching a certain schema and ensuring that foreign keys are handled properly. This problem might be able to make use of `fake`'s full constructor coverage concept (described in more detail here) to help ensure that all the important combinations of various foreign keys are generated.
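As a rough illustration of what constructor coverage can mean, the sketch below assumes one reading of the idea: produce a small set of values in which every constructor of every field appears at least once, without enumerating the full cross product. This is an interpretation for illustration, not the package's implementation, and `Plan`, `Status`, `Account`, `coverEnum`, `coverPair`, and `coverAccounts` are hypothetical names.

```haskell
module Main where

-- Hypothetical enumeration-like columns of a made-up schema.
data Plan = Free | Pro | Enterprise
  deriving (Show, Eq, Enum, Bounded)

data Status = Active | Suspended
  deriving (Show, Eq, Enum, Bounded)

data Account = Account { accountPlan :: Plan, accountStatus :: Status }
  deriving (Show, Eq)

-- Every constructor of an enumeration-like type, once each.
coverEnum :: (Enum a, Bounded a) => [a]
coverEnum = [minBound .. maxBound]

-- Pair two covering lists so each element of both appears at least once.
-- The result has max (length xs) (length ys) rows, not the full product.
coverPair :: [a] -> [b] -> [(a, b)]
coverPair [] _ = []
coverPair _ [] = []
coverPair xs ys = take n (zip (cycle xs) (cycle ys))
  where n = max (length xs) (length ys)

-- Accounts that exercise every Plan and every Status constructor.
coverAccounts :: [Account]
coverAccounts = [Account p s | (p, s) <- coverPair coverEnum coverEnum]

main :: IO ()
main = mapM_ print coverAccounts
```

With three plans and two statuses this yields three accounts rather than the six of the full product. A schema-level generator would presumably need to apply something similar across tables linked by foreign keys.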