Liveblogging: Leveraging Hadoop to Augment MySQL Deployments by Sarah Sproehnle, Cloudera, Hadoop Instructor
Hadoop is an "open source framework for storing and processing data on a cluster of computers." Processing is key -- analysis of data.
Built-in HA, scalability. Designed for batch processing, Optimized for streaming reads. Not meant for queries returned in milliseconds, it's meant for queries that run in minutes -- but in those minutes it processes terabytes of data.
Why consider Hadoop? LOTS of data (multi-terabytes) - Facebook has roughly 15 petabytes in their Hadoop cluster. SANs can hold a lot of data, but not process it all.
Traditional ETL separates storage from processing -- it moves data in order to process it, because processing is a bottleneck.
Yahoo Mail does spam detection using Hadoop
Facebook and LinkedIn suggest friends/connections using Hadoop. eHarmony uses Hadoop to help match people.
How is Hadoop different from MySQL? It's not a substitute.
Data capacity: Hadoop petabytes, mysql terabytes
Data per query: Hadoop petabytes, mysql gigabytes
read/write: Hadoop sequential scans, append-only. MySQL random read/write
query language: Hadoop Java MapReduce, scripting langs, HiveQL. MySQL is SQL and UDFs.
(Sarah mentioned that Java MapReduce is not the easiest way to process data, hence getting HiveQL)
Transactions: Hadoop No, MySQL Yes
Indexes: Hadoop No (always scanned in parallel), MySQL Yes
Latency: Hadoop minutes to hours (if that's too long, add nodes), MySQL sub-second (hopefully)
Data structure: Hadoop can be structured or unstructured, MySQL is structured.
How Hadoop works:
spreads data onto tens, hundreds or thousands of machines usin gthe Hadoop Distributed File System (HDFS). You just put a file into Hadoop and it decides where to put it. Built-in redundancy (replication) for fault-tolerance. Replicates data 3x.
MapReduce is based on "map" and "reduce" -- basically divide and conquer -- split tasks and run in parallel. Hadoop is an open-source implementation using Google's whitepaper on map reduce. Google's implementation costs $$, it's not open source.
Problem with MapReduce:
The developer has to worry about job setup, InputFormat and OutputFormat, may have to write custom key/value classes, not just analysis and processing logic.
Data being schema-less makes it hard to process, the developer needs to know what the data looks like.
Often requires several MapReduce passes to accomplish a final result.
Facebook as an example:
TB of data per day, coming from all over - MySQL, Oracle, Apache web logs. Using Hadoop for log processing, text mining, doc indexing, BI/analytics. They wanted a commandline interface, a schema for the data, and an easier dev environment that supported ad hoc queries (not Java classes for everything).
Out of Facebook's desires came Hive, often called "A data warehouse for Hadoop". HiveQL is an SQL-like language for queries, and a metastore which stores schema information (typically on a mysql server). Hive translates the HiveQL to MapReduce code. About a year ago Facebook runs over 7500 Hive queries per day and scans more than 100Tb per day.
Differences between HiveQL and SQL:
SQL-92 standard, maybe
INSERT, UPDATE, DELETE
Subset of SQL-92 plus Hive-specific extensions
No transactions, no indexes
Get started now, it's easy!
1) setup Hadoop (NameNode, DataNodes, JobTracker, TaskTrackers). Or go to Cloudera's website and download a free VM to play with it.
2) create hive tables -- similar to MySQL create table, except uses STRING instead of TEXT.
3) load data in. LOAD DATA INPATH '/path-in-hdfs' INTO TABLE t;
LOAD DATA LOCAL INPATH '/local-path' INTO TABLE t;
Under the hood, each part of a query gets converted to a map part and a reduce part.
WHERE -- as a filter, this becomes a map task. Map scans data
GROUP/ORDER BY -- reduce task
JOIN -- map or reduce depending on optimizer
EXPLAIN shows the MapReduce plans. Very different from MySQL EXPLAIN plan.
UDF's/UDAF's (written in Java)
support for sampling -- instead of going through all data, pick out a random, say, 10 Mbs.
JDBC and ODBC interfaces, so that BI vendor support can come out very soon.
Integration with HBase (Hadoop NoSQL solution)
It gets even easier with Sqoop (pronounced "scoop"), which is just SQL-to-Hadoop. This is an open source product from Cloudera. Parallel import of data from many databases to Hadoop, and Parallel export of data from Hadoop to databases.
sqoop import --connect jdbc:mysql://hostname/dbname --table employees --hive-import --fields-terminated by '\t' --lines-terminated-by '\n'
can choose rows with --where, can also choose which columns to import. Can do incremental loads by TIMESTAMP or AUTO_INCREMENT field, etc.