Totango Engineering

Standalone Spark Deployment for Stability and Performance

Main tips for using Spark Standalone (full details in the presentation below):

- To make cores fill up workers before using other free workers, use spark.deploy.spreadOut=false
- To make workers clean up the directories of completed applications, use spark.worker.cleanup.enabled=true
- To use the External Shuffle Service, use …
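These are daemon-side properties rather than per-application settings, so one natural place to put them is conf/spark-env.sh on the standalone cluster. A minimal sketch, assuming the truncated third tip refers to spark.shuffle.service.enabled, Spark's property for enabling the external shuffle service:

    # conf/spark-env.sh (illustrative; values match the tips above)
    # Master: pack an application's cores onto as few workers as possible
    SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
    # Workers: delete finished applications' work directories, and run the
    # external shuffle service alongside each worker (assumed third tip)
    SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.shuffle.service.enabled=true"

After editing spark-env.sh, the master and workers need to be restarted for the options to take effect.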

Reading a file with colon (:) from S3 in Spark

At Totango, we've decided to standardize all our batch data processing on Apache Spark, replacing an older investment in a Hadoop cluster that did the same job. The goal is to use Spark's flexibility and superior performance to extract more insights about customers with ease. …

Luigi and Spark in production at Totango

Totango transforms customer usage data into actionable insights and analytics. Pipeline execution is orchestrated by Luigi, Spotify's open-source workflow engine, and the data transformation is done in Apache Spark. Our pipeline runner makes sure the right compute jobs are scheduled and processed so our users get accurate and …
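To illustrate how the two fit together, here is a minimal sketch of a Luigi task that submits a Spark job through Luigi's SparkSubmitTask contrib class; the job name, master URL, jar path, entry class, and S3 paths are all hypothetical, not Totango's actual pipeline:

    import luigi
    from luigi.contrib.s3 import S3Target
    from luigi.contrib.spark import SparkSubmitTask

    class AccountMetrics(SparkSubmitTask):
        # Hypothetical daily Spark job orchestrated as a Luigi task.
        date = luigi.DateParameter()

        master = 'spark://spark-master:7077'          # standalone master (illustrative)
        app = 's3://example-bucket/jars/metrics.jar'  # application jar (illustrative)
        entry_class = 'com.example.AccountMetrics'    # main class (illustrative)

        def app_options(self):
            # Arguments passed to the Spark application's main()
            return [self.date.isoformat(), self.output().path]

        def output(self):
            # Luigi checks whether this target exists to decide if the task
            # already ran, which keeps the pipeline idempotent
            return S3Target('s3://example-bucket/metrics/%s/_SUCCESS'
                            % self.date.isoformat())

Any downstream task that lists AccountMetrics in its requires() will only run once this output exists, which is how Luigi guarantees the right jobs run in the right order.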