Creating synthetic data the traditional way, with known inputs and known expected outputs, is a daunting task when the number of possible permutations is overwhelming and the volume of data is measured in millions of new records a day.
We came up with the concept of Consistency over Accuracy and the idea is pretty simple.
Once small, focused data validation tests have confirmed that our aggregations and calculations are correct, the remaining problem of making sure that the numbers stay consistent over time and reproducible with every commit becomes easier to wrap our heads around.
While we use smaller data sets to test data accuracy, the number of aggregations is overwhelming. Creating a dedicated data set that mimics a very high volume of data flowing from multiple sources 24/7, and validating each element in our document, is an enormous task.
We decided to create a seed data set of ~20 million records. Each record represents an event sent to Totango and holds the activity that the user performed in the tracked application. Every day we push the same record set into our AWS S3 storage, changing only the timestamp.
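The daily replay can be sketched in a few lines of Python. This is only an illustration of the idea, not our production code: the field names (account_id, user_id, activity, timestamp) and the helper replay_for_day are hypothetical, and the upload to S3 is left out.

```python
from datetime import date, datetime, timedelta, timezone


def replay_for_day(seed_events, seed_day, target_day):
    """Return copies of the seed events with timestamps moved to target_day.

    Only the timestamp changes; every other field is copied as-is.
    """
    shift = timedelta(days=(target_day - seed_day).days)
    return [{**event, "timestamp": event["timestamp"] + shift}
            for event in seed_events]


# Hypothetical two-record seed; the real seed holds ~20 million records.
seed_day = date(2020, 1, 1)
seed_events = [
    {"account_id": "acme", "user_id": "u1", "activity": "login",
     "timestamp": datetime(2020, 1, 1, 9, 30, tzinfo=timezone.utc)},
    {"account_id": "acme", "user_id": "u2", "activity": "export",
     "timestamp": datetime(2020, 1, 1, 17, 5, tzinfo=timezone.utc)},
]

day_2 = replay_for_day(seed_events, seed_day, date(2020, 1, 2))
```

Shifting by whole days keeps the time of day of every event intact, so any aggregation that depends on the hour sees the same distribution on every run.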
This creates a constant flow of the same data every day, and at the end of each daily batch calculation a new document containing all the aggregations is born.
Because the data is identical every day, the aggregations should have the same values every day.
So, comparing results of day 1 with day 2 calculation for some aggregation value might yield the following table
The calculated document is quite big, around 500 MB of data, and the comparer code was injected as the last step of the Spark job that runs the daily calculation. Using Spark's distributed computing capabilities to run the comparer also ensures that it runs very fast. Fast execution means fast feedback, which aligns perfectly with our Scrum methodology.
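The comparison itself is simple. The sketch below shows the logic in plain Python rather than Spark, assuming each day's document can be flattened into a mapping from an aggregation key to its value. The function compare_documents, the key names, and the numbers are all made up for illustration.

```python
import math


def compare_documents(previous, current, rel_tol=1e-9):
    """Return (key, previous_value, current_value) for every mismatch.

    A key that is missing on one side is reported with None on that side.
    """
    mismatches = []
    for key in sorted(previous.keys() | current.keys()):
        old = previous.get(key)
        new = current.get(key)
        if isinstance(old, (int, float)) and isinstance(new, (int, float)):
            # Tolerate floating-point noise in numeric aggregations.
            same = math.isclose(old, new, rel_tol=rel_tol)
        else:
            same = old == new
        if not same:
            mismatches.append((key, old, new))
    return mismatches


# Illustrative values only, not real results.
day_1 = {"acme.active_users": 120, "acme.logins": 431, "globex.active_users": 57}
day_2 = {"acme.active_users": 120, "acme.logins": 430, "globex.active_users": 57}

mismatches = compare_documents(day_1, day_2)
```

An empty list means the commit did not change any aggregation. Here the list holds a single entry for acme.logins, which is exactly the kind of regression the daily run is meant to surface.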
Difficulties along the way
Without a doubt, the most challenging part was implementing the code responsible for generating the initial seed data. The data had to be as random as possible while covering all the possibilities and complying with the data engineering team's specifications.
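One way to balance randomness with full coverage is to emit every combination of the categorical fields once and then fill the rest of the data set with random picks. The sketch below assumes this approach; it is not our actual generator, and the field names, the value lists, and generate_seed are hypothetical.

```python
import itertools
import random

# Hypothetical categorical fields; the real ones come from the specification.
ACTIVITIES = ["login", "export", "invite"]
MODULES = ["dashboard", "reports"]


def generate_seed(total, rng_seed=42):
    """Generate `total` random events that cover every activity/module pair."""
    combos = list(itertools.product(ACTIVITIES, MODULES))
    if total < len(combos):
        raise ValueError("total is too small to cover every combination")

    # A seeded generator makes the "random" data reproducible between runs.
    rng = random.Random(rng_seed)

    # Every combination appears at least once; the remainder is random.
    chosen = combos + [rng.choice(combos) for _ in range(total - len(combos))]
    rng.shuffle(chosen)

    return [
        {"user_id": f"u{rng.randrange(1000)}", "activity": activity, "module": module}
        for activity, module in chosen
    ]


seed = generate_seed(100)
```

Fixing the random seed matters for this kind of test: the seed data set can be regenerated at any time and still produce the same aggregations.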