Zephyr began as a port of a proprietary, real-time, single-node ETL process to Storm. Through work on the previous system, the generic steps of a common ETL process were clearly defined:

  1. Initial Data enters system
  2. Pre-process
  3. Parse
  4. Normalize, Validate, and Canonicalize
  5. Enrich
  6. Output

With the exception of output, all of these steps are BigData platform-agnostic.
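The steps above can be sketched as a chain of transformations over a single record. The names and record shapes here are illustrative only, not Zephyr's actual API; steps 1 and 6 sit at the edges, just as they belong to the processing platform in a real deployment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class EtlSketch {
    // One record flowing through steps 2-5 of the generic ETL pipeline.
    public static String run(String raw) {
        Function<String, String> preProcess = String::trim;            // 2. pre-process
        Function<String, String[]> parse = s -> s.split(",");          // 3. parse
        Function<String[], String[]> normalize = fields -> {           // 4. normalize/validate/canonicalize
            List<String> out = new ArrayList<>();
            for (String f : fields) out.add(f.toLowerCase());
            return out.toArray(new String[0]);
        };
        Function<String[], String> enrich = fields ->                  // 5. enrich
                String.join("|", fields) + "|len=" + fields.length;
        return preProcess.andThen(parse).andThen(normalize).andThen(enrich).apply(raw);
    }

    public static void main(String[] args) {
        System.out.println(run("  Alice,SEATTLE,42  "));               // 1. input -> 6. output
    }
}
```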

Who can use Zephyr?

You essentially need to be a developer to use Zephyr: it requires knowledge of ETL, Java, MapReduce, and distributed systems.

I am a developer! Why should I use it?

Have you ever had a data set that required tuning before it could be operated on? Zephyr distributes your data and transforms it into a format of your choosing. The zephyr-core and zephyr-mapreduce projects do the heavy lifting; you only need to write the schema. We already have a few parsers written that may fit your data, and if our parsers and pre-processors don't, you can write your own: we provide interfaces you can extend to create custom parsers, normalizers, pre-processors, enrichers, and validators.
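A custom extension might look like the following sketch. The `RecordParser` interface and its signature are hypothetical stand-ins for the extension points in zephyr-core; check the actual library for the real interface names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical shape of a pluggable parser; Zephyr's actual interface
// names and signatures may differ -- consult zephyr-core for the real ones.
interface RecordParser {
    List<String> parse(String rawRecord);
}

// A custom parser for tab-delimited input, illustrating the kind of
// extension you would drop into the Zephyr lib directory as a jar.
class TabDelimitedParser implements RecordParser {
    @Override
    public List<String> parse(String rawRecord) {
        // split with limit -1 keeps empty trailing fields
        return Arrays.asList(rawRecord.split("\t", -1));
    }
}

public class ParserDemo {
    public static void main(String[] args) {
        RecordParser parser = new TabDelimitedParser();
        System.out.println(parser.parse("a\tb\t\t"));
    }
}
```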

Design Goals

Zephyr was conceived as a way to formalize these steps in a Java API for use in any of the most prominent (or promising) BigData processing platforms. Steps 2-5 can be implemented for a given feed irrespective of the platform they run on, while steps 1 and 6 are determined by the chosen processing platform. Above all else, the vast majority of the work done to your data should be handled by Zephyr's core library (or your extensions to it). By following this pattern, steps 1 and 6 should be solved for you right out of the box by the Zephyr MapReduce or Zephyr Storm (or Spark Streaming) implementations.

Spring configuration

Spring is used to configure a Zephyr Schema - the normalization, validation, and canonicalization phase. It is also the final glue for a Zephyr MapReduce job: it pulls together steps 1 through 6 listed above and makes your job easily configurable at runtime, provided no new parsers, enrichers, or outputters are required. Even if they are, you need only build your own jar, drop it into the Zephyr lib directory, and refer to the appropriate classes in the Spring configuration file. See the Zephyr Configuration page for more information.
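A configuration along these lines wires a custom class into the job. The bean ids, class names, and property names below are purely illustrative; the Zephyr Configuration page documents the real ones.

```xml
<!-- Illustrative Spring wiring only: ids, classes, and properties here
     are hypothetical, not Zephyr's actual configuration schema. -->
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd">

  <!-- Swap in your own parser by dropping a jar into the Zephyr lib
       directory and changing this class attribute. -->
  <bean id="parser" class="com.example.zephyr.TabDelimitedParser"/>

  <bean id="schema" class="com.example.zephyr.Schema">
    <property name="parser" ref="parser"/>
  </bean>
</beans>
```

Because the wiring lives in this file rather than in code, the job can be reconfigured at runtime without rebuilding the job jar.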


After Zephyr's core library was completed, focus shifted from a rigid, hardcoded Storm topology to a flexible, configurable MapReduce job. By chaining the Zephyr ETL process inside a map task, we achieve the same result as we did with Storm. As business needs shifted, primary development of Zephyr focused on MapReduce, with the Storm branch falling by the wayside.
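Chaining the ETL steps inside a map task can be modeled as below, without Hadoop dependencies for brevity. In the real job the same chain would run inside a Mapper's `map()` call; the method name and record shapes here are illustrative assumptions.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

public class MapTaskChain {
    // Emulates one map() invocation: raw line in, (key, value) pair out.
    // Steps 2-5 run in sequence; the platform supplies input and output.
    public static Map.Entry<String, String> map(String rawLine) {
        String cleaned = rawLine.trim();              // pre-process
        String[] fields = cleaned.split(",");         // parse
        String key = fields[0].toLowerCase();         // normalize the key
        String value = String.join("|", fields);      // enrich/format the value
        return new SimpleEntry<>(key, value);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> out = map("  ID42,Alice,Seattle ");
        System.out.println(out.getKey() + " -> " + out.getValue());
    }
}
```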

While real-time development did not continue at the same pace as batch, the core library continues to support either - or even a standalone process.


Please feel free to contact Dwayne Pryce with any questions you might have about Zephyr.