Totango Engineering

Reading a file with colon (:) from S3 in Spark

At Totango, we've decided to standardize all our batch data processing on Apache Spark, replacing an older investment in a Hadoop cluster that did the same job. The goal is to use Spark's flexibility and superior performance to extract more insights about our customers with ease.

As with any such migration project, there is legacy to deal with. Sure, it's fun to write everything from scratch on a clean sheet, but having an already designed, working system means the use case has been proven and some of the initial challenges have already been solved.

However, sometimes you need to work with what you have. And sometimes what you have are file names with special characters inside.

And sometimes those special characters are ":", and you need to read them in Apache Spark.

Then you get this nasty exception:

Caused by: java.net.URISyntaxException: Relative path in absolute URI: 15-08-20T00:00:00.csv
    at java.net.URI.checkPath(URI.java:1804)
    at java.net.URI.<init>(URI.java:752)
    at org.apache.hadoop.fs.Path.initialize(Path.java:203)

From searching online, I realized I'm not the only one with this issue, but it's by design, so an official fix is not expected: Hadoop simply assumes a colon will never appear in a filename.

The problem is that when you give Hadoop a path to read, it tries to expand any wildcards in it using org.apache.hadoop.fs.FileSystem.globStatus().
You can check out org.apache.hadoop.fs.Globber.glob() to see exactly what it does.

Long story short: it takes your path, splits it into a directory and a "filename", and then creates a URI object from the filename alone.
When the filename contains ":" but no scheme (the scheme stayed with the directory part of the path), the URI constructor mistakes everything before the colon for a scheme, and it fails.
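
You can reproduce this without Spark at all; constructing a bare Hadoop Path is enough. A minimal sketch, assuming a Hadoop 2.x client on the classpath:

import org.apache.hadoop.fs.Path

// Building a Path from the bare filename, the way the Globber does for
// each path component, makes the URI constructor take "15-08-20T00" for
// a scheme and reject the rest as a relative path:
val p = new Path("15-08-20T00:00:00.csv")
// java.lang.IllegalArgumentException: java.net.URISyntaxException:
//   Relative path in absolute URI: 15-08-20T00:00:00.csv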

Usually, however, you don't need the Globber at all. If you only read files from a specific path, you just need to list the files there, with no wildcard parsing involved.
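
For example, listing the directory directly works even when files in it have colons in their names. A sketch from a Spark Scala shell; the bucket and directory names are made up:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = sc.hadoopConfiguration  // sc is your SparkContext
val fs = FileSystem.get(new URI("s3n://my-bucket/"), conf)
// listStatus() never re-parses child filenames as URIs,
// so a ":" inside them is harmless here:
val files = fs.listStatus(new Path("s3n://my-bucket/events/"))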

To resolve the issue, when reading these specific files, I overrode the filesystem implementation with one whose globStatus() uses listStatus() internally, and therefore never parses the filenames as paths.
Then, instead of accessing the file with an s3n:// path, I use custom://, and everything works!
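
Here is a minimal sketch of that idea; the class name, package, and wiring are illustrative, not the exact code we run. It extends the s3n filesystem (org.apache.hadoop.fs.s3native.NativeS3FileSystem in Hadoop 2.x) and replaces glob parsing with a plain directory listing:

import org.apache.hadoop.fs.{FileStatus, Path, PathFilter}
import org.apache.hadoop.fs.s3native.NativeS3FileSystem

// Hypothetical class, registered under the "custom" scheme below.
// It assumes the paths it receives contain no real wildcards.
class ColonSafeS3FileSystem extends NativeS3FileSystem {

  // List the parent directory and pick the entry by name, instead of
  // letting the Globber re-parse the filename as a URI.
  override def globStatus(pathPattern: Path, filter: PathFilter): Array[FileStatus] =
    listStatus(pathPattern.getParent)
      .filter(_.getPath.getName == pathPattern.getName)
      .filter(s => filter == null || filter.accept(s.getPath))

  override def globStatus(pathPattern: Path): Array[FileStatus] =
    globStatus(pathPattern, null)
}

To register it under custom://, point the scheme's implementation key at the class (fs.SCHEME.impl is the standard Hadoop convention; the credential keys may or may not be needed, since some Hadoop versions resolve S3 credentials per scheme):

sc.hadoopConfiguration.set("fs.custom.impl", "com.example.ColonSafeS3FileSystem")
sc.hadoopConfiguration.set("fs.custom.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.custom.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("custom://my-bucket/events/15-08-20T00:00:00.csv")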

P.S. We're hiring.


Romi Kuntsman