Totango Engineering

Standalone Spark Deployment for Stability and Performance

Main tips for using Spark Standalone. See full details in the presentation below. To fill up a worker's cores before using other free workers, use spark.deploy.spreadOut=false. To make workers clean up the directories of completed applications, use spark.worker.cleanup.enabled=true. To use External Shuffle Service, use

Chronicles of a Distributed Data Pipeline (part 2)

Quick recap. So... in part 1 we had these daily data pipelines running on a bunch of servers. Jenkins schedules the daily runs and Luigi manages the logical flow of each pipeline. The system works, but it's starting to show strain as our data grows. There's a new requirement to

Chronicles of a Distributed Data Pipeline (part 1)

In the beginning. Here at Totango we crunch loads of data. Where possible we try to do this in real time; inevitably, however, most of our meaningful analytics are processed in batch pipelines on a daily or hourly basis. When Totango was in its infancy these pipelines were basically

How Totango uses Apache Spark

Earlier this week we presented at the Big Data & Data Science - Israel Meetup. Many thanks to all who attended; around 150 big data engineers joined us. It was great meeting so many smart people using (or looking to use) Spark for their data processing projects!

AWS re:Invent 2015 - Summary for the busy software engineer

Totango proudly runs on Amazon Web Services. Our technology stack relies heavily on what AWS has to offer, from basic EC2 instances to services such as RDS, DynamoDB and Kinesis that provide higher-level abstractions. While it is not without flaws, AWS is a key ingredient in our ability to