Tag Archives: Cloudera

Rosetta Stone: MySQL, Pig and Spark (Basics)

In a world where new data processing languages appear every day, it can be helpful to have tutorials explaining language characteristics in detail from the ground up.  This blog post is not such a tutorial.   It also isn’t a tutorial on getting started with MySQL or Hadoop, nor is it a list of best practices for the various languages I’ll reference here – there are bound to be better ways to accomplish certain tasks, and where a choice was required, I’ve emphasized clarity and readability over performance.  Finally, this isn’t meant to be a quickstart for SQL experts to access Hadoop – there are a number of SQL interfaces to Hadoop such as Impala or Hive that make Hadoop incredibly accessible to those with existing SQL skills.

Instead, this post is a pale equivalent of the Rosetta Stone – examples of identical concepts expressed in three different languages:  SQL (for MySQL), Pig and Spark.  These are the exercises I’ve worked through in order to help think in Pig and Spark as fluently as I think in SQL, and I’m recording this experience in a blog post for my own benefit.  I expect to reference it periodically in my own future work in Pig and Spark, and if it benefits anybody else, great. Continue reading Rosetta Stone: MySQL, Pig and Spark (Basics)