Our data-processing platform ingests millions of events per hour on behalf of hundreds of customers.
Handling errors gracefully is essential for service continuity.
That is why we created our Tao for processing, to help our dev teams make the right choices when dealing with exceptions and validation.
Thou shalt not stop processing for any input
We sometimes get invalid data. It may be too long to handle, or it may contain bad characters or characters that can't be stored in our system. The invalid data may have been put there intentionally by a hacker, or it may arrive as part of the normal flow in our system. Whatever the reason, the show must go on. We never stop processing for any of our clients because we got bad data. The system should be robust: skip the bad data and move on.
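As a minimal sketch of this rule, the loop below isolates each event so that a failure in one never stops the rest. The names `handle_event`, `process_stream` and `MAX_LEN`, and the specific checks, are assumptions made up for illustration; they are not our actual pipeline code.

```python
import logging

logger = logging.getLogger("ingest")

MAX_LEN = 1024  # hypothetical size limit, for illustration only


def handle_event(event: str) -> str:
    """Hypothetical per-event handler; raises ValueError on invalid input."""
    if len(event) > MAX_LEN:
        raise ValueError("event too long")
    if "\x00" in event:
        raise ValueError("event contains a character we cannot store")
    return event.strip().lower()


def process_stream(events):
    """Process every event; log and skip the ones that fail."""
    results = []
    skipped = 0
    for event in events:
        try:
            results.append(handle_event(event))
        except Exception:
            # One bad event must never stop processing of the others.
            logger.exception("skipping invalid event")
            skipped += 1
    return results, skipped
```

Catching `Exception` inside the loop is deliberate in this sketch: the failure is logged and counted, so the bad input stays visible to us without interrupting anyone's processing.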
Thou shalt reject bad data
It's better to stop bad data early than to let it propagate through the whole system, which would require every component to understand which data is valid and to ignore the data that is not. Reject bad data at the beginning of the chain, and better yet, at the gateway to the system.
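One way to picture rejection at the gateway is a single validation step that turns raw input into a typed value, so downstream components only ever see data that has already been checked. This is a sketch under assumptions: the `Event` fields, the `MAX_NAME_LEN` limit and the validation rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

MAX_NAME_LEN = 64  # hypothetical limit, for illustration only


@dataclass(frozen=True)
class Event:
    customer_id: int
    name: str


def parse_event(raw) -> Optional[Event]:
    """Return a validated Event, or None if the input is rejected."""
    if not isinstance(raw, dict):
        return None
    customer_id = raw.get("customer_id")
    name = raw.get("name")
    if type(customer_id) is not int or customer_id <= 0:
        return None
    if not isinstance(name, str) or not (0 < len(name) <= MAX_NAME_LEN):
        return None
    if not name.isprintable():
        return None
    return Event(customer_id, name)


def gateway(raw_events):
    """Split raw input into accepted events and a count of rejections."""
    accepted = []
    rejected = 0
    for raw in raw_events:
        event = parse_event(raw)
        if event is None:
            rejected += 1
        else:
            accepted.append(event)
    return accepted, rejected
```

Because everything past `gateway` receives `Event` objects rather than raw dictionaries, the validity rules live in one place instead of being repeated in every component.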
Thou shalt make sure handling bad data has no impact on good data
Sometimes bad data slips through the cracks. You may have missed a case and didn't reject it, or it may be too costly to extend the rejection mechanism you already have. So there comes a time when you just have to make your system resilient to bad data. When you do, make sure that you limit the effect of the fix. For example, you may have a batch of updates in which one row out of the whole batch isn't valid. Better to exclude that row from the batch and try again than to lose the whole batch.
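The batch example can be sketched roughly as follows. It assumes a hypothetical all-or-nothing `write_batch` that reports the index of the first invalid row through a made-up `RowError`; a real store may report failures differently, so treat this as an illustration of the idea rather than a recipe.

```python
class RowError(Exception):
    """Raised by the hypothetical store when one row in a batch is invalid."""

    def __init__(self, index):
        super().__init__(f"row {index} is invalid")
        self.index = index


def write_batch(rows, store):
    """Hypothetical all-or-nothing write: fails on the first invalid row."""
    for i, row in enumerate(rows):
        if row.get("value") is None:
            raise RowError(i)
    store.extend(rows)


def write_resilient(rows, store):
    """Write as many rows as possible; return the rows that were excluded."""
    pending = list(rows)
    excluded = []
    while pending:
        try:
            write_batch(pending, store)
            break
        except RowError as err:
            # Drop only the offending row, then retry the rest of the batch.
            excluded.append(pending.pop(err.index))
    return excluded
```

Each retry removes exactly one row, so the loop always terminates, and the excluded rows are returned to the caller so they can be logged or inspected instead of disappearing silently.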
P.S. we're hiring