Liveblogging: Leveraging Hadoop to Augment MySQL Deployments by Sarah Sproehnle, Cloudera, Hadoop Instructor

Sarah Sproehnle is an excellent speaker, so I was excited to see her speak about Hadoop, which I know very little about...so here's the liveblogging:

Hadoop is an "open source framework for storing and processing data on a cluster of computers."  Processing is key -- analysis of data.

Built-in HA and scalability.  Designed for batch processing, optimized for streaming reads.  Not meant for queries that return in milliseconds; it's meant for queries that run in minutes -- but in those minutes it processes terabytes of data.

Why consider Hadoop?  LOTS of data (multi-terabytes) - Facebook has roughly 15 petabytes in their Hadoop cluster.  SANs can hold a lot of data, but not process it all.

Traditional ETL separates storage from processing -- it moves data in order to process it, because processing is a bottleneck.

Yahoo Mail does spam detection using Hadoop.

Facebook and LinkedIn suggest friends/connections using Hadoop.  eHarmony uses Hadoop to help match people.

How is Hadoop different from MySQL?  It's not a substitute.

Data capacity: Hadoop petabytes, MySQL terabytes

Data per query: Hadoop petabytes, MySQL gigabytes

Read/write: Hadoop sequential scans and append-only writes, MySQL random reads/writes

Query language: Hadoop Java MapReduce, scripting languages, and HiveQL; MySQL SQL and UDFs

(Sarah mentioned that Java MapReduce is not the easiest way to process data, hence HiveQL)

Transactions: Hadoop no, MySQL yes

Indexes: Hadoop no (data is always scanned in parallel), MySQL yes

Latency: Hadoop minutes to hours (if that's too long, add nodes), MySQL sub-second (hopefully)

Data structure: Hadoop structured or unstructured, MySQL structured

How Hadoop works:

Spreads data onto tens, hundreds, or thousands of machines using the Hadoop Distributed File System (HDFS).  You just put a file into Hadoop and it decides where to store it.  Built-in redundancy (replication) for fault tolerance -- data is replicated 3x by default.

MapReduce is based on "map" and "reduce" operations -- basically divide and conquer: split a task into pieces and run them in parallel.  Hadoop is an open-source implementation based on Google's MapReduce whitepaper.  Google's implementation is proprietary -- it's not open source.

Problems with MapReduce:

The developer has to worry about job setup, InputFormats and OutputFormats, and may have to write custom key/value classes -- not just the analysis and processing logic.

The data being schema-less makes it hard to process; the developer needs to know what the data looks like.

Often, several MapReduce passes are needed to produce a final result.

Facebook as an example:

Terabytes of data per day, coming from all over -- MySQL, Oracle, Apache web logs.  Using Hadoop for log processing, text mining, document indexing, and BI/analytics.  They wanted a command-line interface, a schema for the data, and an easier development environment that supported ad hoc queries (not Java classes for everything).

Out of Facebook's desires came Hive, often called "a data warehouse for Hadoop".  Hive provides HiveQL, an SQL-like query language, and a metastore that holds schema information (typically on a MySQL server).  Hive translates HiveQL into MapReduce code.  As of about a year ago, Facebook was running over 7,500 Hive queries per day and scanning more than 100 TB per day.
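To get a feel for HiveQL's SQL-like flavor, here's a minimal sketch (the page_views table and its columns are my own hypothetical example, not from the talk):

-- Hypothetical table: count page views per URL for one day
SELECT url, COUNT(*) AS views
FROM page_views
WHERE dt = '2011-04-10'
GROUP BY url;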

Differences between HiveQL and SQL:

RDBMS:

SQL-92 standard (maybe)

INSERT, UPDATE, DELETE

Transactions, indexes

HiveQL:

Subset of SQL-92 plus Hive-specific extensions

INSERT only (no UPDATE or DELETE)

No transactions, no indexes
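Since Hive has no UPDATE or DELETE, the idiom is to rewrite a whole table (or partition) in one shot with INSERT OVERWRITE.  A minimal sketch, assuming hypothetical raw_logs and clean_logs tables:

-- Rebuild clean_logs from scratch instead of deleting bad rows in place
INSERT OVERWRITE TABLE clean_logs
SELECT * FROM raw_logs
WHERE status IS NOT NULL;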


Get started now, it's easy!

1) Set up Hadoop (NameNode, DataNodes, JobTracker, TaskTrackers).  Or go to Cloudera's website and download a free VM to play with.

2) Create Hive tables -- similar to MySQL's CREATE TABLE, except it uses STRING instead of TEXT.  (A full sketch follows this list.)

3) Load data in:

LOAD DATA INPATH '/path-in-hdfs' INTO TABLE t;

or

LOAD DATA LOCAL INPATH '/local-path' INTO TABLE t;
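Putting steps 2 and 3 together, a minimal sketch (the column layout is hypothetical; note STRING where MySQL would use TEXT):

-- Tab-delimited table, matching the LOAD DATA examples above
CREATE TABLE t (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/path-in-hdfs' INTO TABLE t;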

Under the hood, each part of a query gets converted to a map part and a reduce part. 

WHERE -- a filter becomes a map task; the map scans the data

GROUP BY/ORDER BY -- a reduce task

JOIN -- map or reduce, depending on the optimizer

EXPLAIN shows the MapReduce plan -- very different from MySQL's EXPLAIN output.
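For example, in a query like the following (hypothetical access_logs table), the WHERE filter runs in the map phase and the GROUP BY aggregation in the reduce phase; prefix it with EXPLAIN to see the MapReduce stages:

-- Map phase: scan and filter; reduce phase: group and count
SELECT url, COUNT(*) AS hits
FROM access_logs
WHERE status = 404
GROUP BY url;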

Extra features:

Partitioning (horizontal; see the sketches after this list)

UDFs/UDAFs (written in Java)

Support for sampling -- instead of scanning all the data, pick out a random sample, say 10 MB

JDBC and ODBC interfaces, so BI vendor support can follow soon

Integration with HBase (the Hadoop NoSQL solution)
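Minimal sketches of the partitioning and sampling features, with hypothetical table and column names:

-- Horizontal partitioning: each day's data lives in its own directory
CREATE TABLE logs (ip STRING, url STRING)
PARTITIONED BY (dt STRING);

-- Bucket sampling: look at roughly a tenth of the rows instead of all of them
SELECT * FROM logs TABLESAMPLE(BUCKET 1 OUT OF 10 ON rand()) s;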


It gets even easier with Sqoop (pronounced "scoop"), which is just SQL-to-Hadoop.  This is an open-source product from Cloudera.  It does parallel import of data from many databases into Hadoop, and parallel export of data from Hadoop back to databases.

example:

sqoop import --connect jdbc:mysql://hostname/dbname --table employees --hive-import --fields-terminated-by '\t' --lines-terminated-by '\n'

You can choose rows with --where, and you can also choose which columns to import.  Incremental loads can key off a TIMESTAMP or AUTO_INCREMENT field, etc.
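After a --hive-import, the imported table is queryable from Hive right away.  A quick sanity check, assuming the employees import from the example above:

SELECT COUNT(*) FROM employees;
SELECT * FROM employees LIMIT 10;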

Comments

Thanks for the summary.  Very helpful for me to figure out where to start.

-- Anonymous, Tue, 03/05/2013 - 21:50

Thank you so much for this nice summary of Hadoop. I am just evaluating this stuff for a project and it clarifies some questions I had.

-- Daniel, Fri, 04/29/2011 - 08:39
