One of the challenges we’ve been focusing on lately at Totango is improving testing infrastructure for our data processing pipeline. Our goal is to be able to deploy new code into our core data processing platform with confidence and velocity.
While this sounds like a pretty straightforward need, it turns out to be quite difficult for data processing code, particularly when it deals with large, heterogeneous data-sets. We've spoken to colleagues in the industry and discovered this is a challenge many data companies face, so we wanted to share some concepts and lessons learned that have been effective for us.
In this post, we’ll focus on how we do Testing In Production and why it has been a great help for us in increasing system quality.
A bit of background - The Totango data processing pipeline
At its core, Totango processes customer event data and produces a variety of metrics that give our users (customer success staff members) important insights about their accounts. For example, Totango computes a metric called “Health Score” based on an account’s activity-levels, lifecycle stage, billing events and so forth. Our data pipeline is arranged in tasks, each one performing calculations of a set of metrics.
Tasks are implemented in Spark. Dependencies between tasks are modeled in a dependency graph which we execute using Spotify Luigi technology. Interim results produced by the compute tasks are stored on a shared file system in a predefined directory structure and are picked up by “merge tasks” along the pipeline.
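To make the dependency-graph idea concrete, here is a minimal stdlib-only sketch in the spirit of Luigi: compute tasks write interim results to a shared directory structure, and a merge task picks them up. All class and metric names here are illustrative, not Totango's actual code, and a tiny recursive runner stands in for Luigi's scheduler.

```python
import json
import tempfile
from pathlib import Path

class Task:
    """A node in the dependency graph: declares requirements and an output path."""
    def __init__(self, out_dir: Path):
        self.out_dir = out_dir

    def requires(self):
        return []

    def output(self) -> Path:
        # Interim results land in a predefined, per-task location.
        return self.out_dir / f"{type(self).__name__}.json"

    def complete(self) -> bool:
        return self.output().exists()

    def run(self):
        raise NotImplementedError

def execute(task: Task):
    """Run dependencies first, then the task itself -- a tiny Luigi-style scheduler."""
    for dep in task.requires():
        execute(dep)
    if not task.complete():
        task.run()

class ActivityMetrics(Task):
    def run(self):
        self.output().write_text(json.dumps({"acct-1": {"activity": 0.8}}))

class BillingMetrics(Task):
    def run(self):
        self.output().write_text(json.dumps({"acct-1": {"billing": 0.6}}))

class MergeHealthScore(Task):
    """Merge task: picks up interim results and combines them into one score."""
    def requires(self):
        return [ActivityMetrics(self.out_dir), BillingMetrics(self.out_dir)]

    def run(self):
        merged = {}
        for dep in self.requires():
            for acct, metrics in json.loads(dep.output().read_text()).items():
                merged.setdefault(acct, {}).update(metrics)
        scores = {a: sum(m.values()) / len(m) for a, m in merged.items()}
        self.output().write_text(json.dumps(scores))
```

Running `execute(MergeHealthScore(Path(tempfile.mkdtemp())))` builds the two metric files and then the merged health score, re-running only tasks whose output is missing.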
(more background on how our data platform works is available here)
Our customers increasingly rely on Totango to automate their customer-success process, so it's really important that metrics are accurately and consistently calculated. One thing we need to make sure of is that we don't introduce new bugs or regressions as we change our code.
How we started...
We started off with a fairly traditional approach. Each new code version went through the following testing phases:
- Development: We build robust sets of unit tests for every task in the computation pipeline and run them against every new build/version to check for regressions.
- Test/Staging Environment: When a version is deemed ready, we deploy it to our testing environment and run the entire pipeline, including the latest production versions of all other tasks, on test data-sets. The test data is a combination of synthetic data-sets and data-sets copied from our production environment.
- Production: When the version passes, it gets deployed to production.
... And what's lacking
This approach didn't work well, and in reality many of our most significant bugs and regressions were discovered in production. Sometimes, and most unfortunately, by our users.
We faced a few significant limitations:
Test data does not reflect real-life permutations. Try as we might, we could never create test data-sets that mimic the complexities of real customer data. Copying real data from production to test was both technically impractical and problematic from a security standpoint. More importantly, the results were not easily verifiable, because we didn't know how to generate expected results from a stream of copied data.
Because the test environment runs in a different network configuration, it did not help us weed out setup problems that only exist in production. For example, if a new version makes a call to an external database, it may work fine in test but fail in production because of firewall settings. Ideally, the test network would be identical to production, but in reality this was not the case, and many team-member hours were needlessly wasted on these issues.
The test environment has far fewer compute resources than production. As a result, we often found ourselves dealing with operational problems in the test environment that do not exist in production. This was not only a massive waste of time, but also masked actual performance issues until we were deep into production rollout.
Introducing Testing in Production
Because it was really hard for us to get the test environment to properly mimic real-life production complexities, we added a key new step to our testing flow.
If real data will not come to the test, the test will go to the real data
Once a version passes unit and integration tests in the test environment, it is deployed to the production network and subjected to real production datasets. However, the new version is executed in Shadow mode.
In Shadow mode, a task does not propagate results down the pipeline; instead, they are simply stored in a designated location. A compare task then picks up those results, along with the result-set of the live version, and compares them for correctness.
The results of the compare go through a few iterations and eventually produce a CSV formatted file listing differences in results between the two versions.
How does the Compare Task compare?
Obviously, the compare task cannot validate results simply by making sure they are the same. The new version may very well produce different results (otherwise -- why did we build it in the first place?)
So, as part of the work on the new version, specific compare-validation code is developed that takes into account the nature of the change. For example:
- The new version fixed a bug that only affects a certain part of the account-base
- The new version's values should only deviate by X% from the older values

If the compare task sees a change in a customer that is not supposed to be affected, or a change beyond a certain threshold, it treats that as an exception while letting other differences pass.
In the first versions of this approach, our team spent a fair amount of time developing this compare code. Eventually we streamlined the process and created a simple compare framework with customizable comparators. This way, we only need to write the compare logic itself, without all the operational code around it. In some ways, this is akin to writing unit-tests for new code, which makes the process straightforward.
This approach has proven very effective. It immediately pinpoints the corner cases we would never find in a sterile test environment. For example, it allowed us to identify cases where Unicode character support was inconsistent, or a situation where the new implementation failed only in certain time-zones. In retrospect these are cases that make a lot of sense, but in all honesty we would never have thought of them ourselves - more likely they would have been production bugs, doomed to be discovered by a customer at some point.
The nice thing about this approach is that we don't have to think of these cases ourselves or create tests for them in advance. In fact, testing starts before we write even a single test-case. As we discover errors, we typically go back and develop new unit-tests for them, so similar problems are caught much earlier in the cycle in the future.
The full cycle
The final step we take in rolling out to production is flipping between the versions and running the OLD version in Shadow mode. At this stage the new code has taken control of the pipeline, but the old one is still running in the background. Our implementation uses feature-flags so that we can 'flip the switch' selectively for a subset of our customers.
This allows us to gradually roll out the new version to a subset of customers while still validating and tuning for the remaining customers. We can also easily roll back the version (by flipping the switch back) if we ever need to revert for all customers. We stay in this setup until all risks and issues are resolved, typically a couple of days.
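The flip itself can be modeled as a per-customer feature flag that decides which version drives the live pipeline and which runs in shadow. This is a hypothetical sketch of that decision, not Totango's actual flag implementation:

```python
def select_versions(customer_id: str, rollout: set):
    """Return (live_version, shadow_version) for a customer.

    Customers in the rollout set have the switch flipped: the new code
    controls the pipeline while the old code keeps running in shadow.
    Rolling back for everyone is just emptying the rollout set.
    """
    if customer_id in rollout:
        return ("new", "old")
    return ("old", "new")
```

Because both versions keep running either way, flipping the flag changes only which result-set feeds the pipeline, so rollback is instant and loses no data.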
Building infrastructure for validating new code on real production data-sets has been a good investment for us at Totango.
It means we no longer worry and fret over production roll outs. We know there will be no surprises. As a result, we can take bigger and more ambitious engineering steps and deliver at higher quality.
We'd love to hear feedback and what approaches others have taken in this area.
Discuss on Hacker News.