From MySQL to Hive: Data Migration Made Simple
Using Sqoop greatly simplifies the process of migrating data from MySQL into Hive and makes it much easier to run analysis jobs on Hadoop.
Prerequisites: a Hadoop environment with Sqoop and Hive installed and running. To speed things up, we will use the Cloudera Quickstart VM (which requires at least 4 GB of RAM), although you could also use the Hortonworks Data Platform (which requires at least 8 GB of RAM). Since my laptop has only 8 GB of RAM, I am using the Cloudera VM image here.
If you run the Cloudera/HDP VM with VirtualBox, you get easy access to a wide range of pre-installed Hadoop ecosystem packages (including MySQL, Oozie, Hadoop, Hive, Zookeeper, Storm, Kafka, Spark, and more).
Creating a Table in MySQL
In the Cloudera VM, open a terminal and make sure MySQL is installed.
- shell> mysql --version
 - mysql Ver 14.14 Distrib 5.1.66, for redhat-linux-gnu (x86_64) using readline 5.
 
This example naturally needs its own database, so create one in MySQL with the following command:
- mysql> create database sqoop;
 
Next:
- mysql> use sqoop;
 - mysql> create table customer(id varchar(3), name varchar(20), age varchar(3), salary integer(10));
 - Query OK, 0 rows affected (0.09 sec)
 - mysql> desc customer;
 - +--------+-------------+------+-----+---------+-------+
 - | Field | Type | Null | Key | Default | Extra |
 - +--------+-------------+------+-----+---------+-------+
 - | id | varchar(3) | YES | | NULL | |
 - | name | varchar(20) | YES | | NULL | |
 - | age | varchar(3) | YES | | NULL | |
 - | salary | int(10) | YES | | NULL | |
 - +--------+-------------+------+-----+---------+-------+
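The article does not show the statements used to populate the table; a minimal sketch of how the sample rows might be inserted (values taken from the select output that follows):
- mysql> insert into customer values ('1', 'John', '30', 80000);
 - mysql> insert into customer values ('2', 'Kevin', '33', 84000);
 - mysql> insert into customer values ('3', 'Mark', '28', 90000);
 - -- ...and so on for the remaining rows shown below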
 
- mysql> select * from customer;
 - +------+--------+------+--------+
 - | id | name | age | salary |
 - +------+--------+------+--------+
 - | 1 | John | 30 | 80000 |
 - | 2 | Kevin | 33 | 84000 |
 - | 3 | Mark | 28 | 90000 |
 - | 4 | Jenna | 34 | 93000 |
 - | 5 | Robert | 32 | 100000 |
 - | 6 | Zoya | 40 | 60000 |
 - | 7 | Sam | 37 | 75000 |
 - | 8 | George | 31 | 67000 |
 - | 9 | Peter | 23 | 70000 |
 - | 19 | Alex | 26 | 74000 |
 - +------+--------+------+--------+
 
Getting Started With Sqoop
As you can see, the customer table does not have a primary key, and I have only added a few records to it. By default, Sqoop identifies a table's primary key column (if there is one) and uses it as the splitting column. The low and high values of the splitting column are retrieved from the database, and the map tasks operate on evenly sized slices of the total range.
If the primary key is not uniformly distributed across its range, the tasks end up unbalanced. In that case, you should explicitly choose a different column with the --split-by argument, for example --split-by id.
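Concretely, Sqoop first issues a boundary query for the splitting column (the same query shows up later in the job log) and then hands each mapper one slice of the resulting range. A minimal sketch of what that looks like, assuming the default of four mappers:
- -- boundary query issued by Sqoop against MySQL (see BoundingValsQuery in the job log below)
 - SELECT MIN(`id`), MAX(`id`) FROM `customer`;
 - -- the [min, max] range is then cut into one roughly equal interval per mapper, and each
 - -- map task imports only the rows whose split-column value falls inside its interval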
Since we want to import this table directly into Hive, we add --hive-import to the Sqoop command:
- sqoop import --connect jdbc:mysql://localhost:3306/sqoop
 - --username root
 - -P
 - --split-by id
 - --columns id,name
 - --table customer
 - --target-dir /user/cloudera/ingest/raw/customers
 - --fields-terminated-by ","
 - --hive-import
 - --create-hive-table
 - --hive-table sqoop_workspace.customers
 
Here is what each option in the Sqoop command does:
- connect – supplies the JDBC connection string.
 - username – the database username.
 - -P – prompts for the password on the console. You can use --password instead, but that is not recommended because the value shows up in job execution logs and can cause problems. One workaround is to store the database password in a file on HDFS and hand it to the job at runtime (a sketch follows this list).
 - table – tells Sqoop which MySQL table to import. Here, the table name is customer.
 - split-by – specifies the splitting column. Here we specify the id column.
 - target-dir – the HDFS target directory.
 - fields-terminated-by – I have specified a comma as the field delimiter (by default, data imported into HDFS is comma-delimited anyway).
 - hive-import – imports the table into Hive (Hive's default delimiters are used if nothing else is set).
 - create-hive-table – if set, the job fails when a Hive table with the target name already exists.
 - hive-table – specifies <database>.<table>. In this example it is sqoop_workspace.customers, where sqoop_workspace is the database name and customers is the table name.
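A minimal sketch of that file-based approach, assuming a Sqoop version that supports --password-file (the VM's Sqoop 1.4.6, per the log below, does) and an HDFS path chosen purely for illustration:
- # write the password to a local file, upload it to HDFS, and lock down its permissions
 - echo -n "yourpassword" > .mysql.password
 - hdfs dfs -put .mysql.password /user/cloudera/.mysql.password
 - hdfs dfs -chmod 400 /user/cloudera/.mysql.password
 - # then replace -P / --password with --password-file in the import command
 - sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root --password-file /user/cloudera/.mysql.password --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers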
 
As you can see below, Sqoop runs as a map-reduce job. Note that I am using -P for the password option here. Besides this approach, the password can also be parameterized with --password and read from a file.
- sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P --split-by id --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers --fields-terminated-by "," --hive-import --create-hive-table --hive-table sqoop_workspace.customers
 - Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
 - Please set $ACCUMULO_HOME to the root of your Accumulo installation.
 - 16/03/01 12:59:44 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6-cdh5.5.0
 - Enter password:
 - 16/03/01 12:59:54 INFO manager.MySQLManager: Preparing to use a MySQL streaming resultset.
 - 16/03/01 12:59:54 INFO tool.CodeGenTool: Beginning code generation
 - 16/03/01 12:59:55 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
 - 16/03/01 12:59:56 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
 - 16/03/01 12:59:56 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/lib/hadoop-mapreduce
 - Note: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.java uses or overrides a deprecated API.
 - Note: Recompile with -Xlint:deprecation for details.
 - 16/03/01 13:00:01 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-cloudera/compile/6471c43b5c867834458d3bf5a67eade2/customer.jar
 - 16/03/01 13:00:01 WARN manager.MySQLManager: It looks like you are importing from mysql.
 - 16/03/01 13:00:01 WARN manager.MySQLManager: This transfer can be faster! Use the --direct
 - 16/03/01 13:00:01 WARN manager.MySQLManager: option to exercise a MySQL-specific fast path.
 - 16/03/01 13:00:01 INFO manager.MySQLManager: Setting zero DATETIME behavior to convertToNull (mysql)
 - 16/03/01 13:00:01 INFO mapreduce.ImportJobBase: Beginning import of customer
 - 16/03/01 13:00:01 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
 - 16/03/01 13:00:02 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
 - 16/03/01 13:00:04 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
 - 16/03/01 13:00:05 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
 - 16/03/01 13:00:11 INFO db.DBInputFormat: Using read commited transaction isolation
 - 16/03/01 13:00:11 INFO db.DataDrivenDBInputFormat: BoundingValsQuery: SELECT MIN(`id`), MAX(`id`) FROM `customer`
 - 16/03/01 13:00:11 WARN db.TextSplitter: Generating splits for a textual index column.
 - 16/03/01 13:00:11 WARN db.TextSplitter: If your database sorts in a case-insensitive order, this may result in a partial import or duplicate records.
 - 16/03/01 13:00:11 WARN db.TextSplitter: You are strongly encouraged to choose an integral split column.
 - 16/03/01 13:00:11 INFO mapreduce.JobSubmitter: number of splits:4
 - 16/03/01 13:00:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1456782715090_0004
 - 16/03/01 13:00:13 INFO impl.YarnClientImpl: Submitted application application_1456782715090_0004
 - 16/03/01 13:00:13 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1456782715090_0004/
 - 16/03/01 13:00:13 INFO mapreduce.Job: Running job: job_1456782715090_0004
 - 16/03/01 13:00:47 INFO mapreduce.Job: Job job_1456782715090_0004 running in uber mode : false
 - 16/03/01 13:00:48 INFO mapreduce.Job: map 0% reduce 0%
 - 16/03/01 13:01:43 INFO mapreduce.Job: map 25% reduce 0%
 - 16/03/01 13:01:46 INFO mapreduce.Job: map 50% reduce 0%
 - 16/03/01 13:01:48 INFO mapreduce.Job: map 100% reduce 0%
 - 16/03/01 13:01:48 INFO mapreduce.Job: Job job_1456782715090_0004 completed successfully
 - 16/03/01 13:01:48 INFO mapreduce.Job: Counters: 30
 - File System Counters
 - FILE: Number of bytes read=0
 - FILE: Number of bytes written=548096
 - FILE: Number of read operations=0
 - FILE: Number of large read operations=0
 - FILE: Number of write operations=0
 - HDFS: Number of bytes read=409
 - HDFS: Number of bytes written=77
 - HDFS: Number of read operations=16
 - HDFS: Number of large read operations=0
 - HDFS: Number of write operations=8
 - Job Counters
 - Launched map tasks=4
 - Other local map tasks=5
 - Total time spent by all maps in occupied slots (ms)=216810
 - Total time spent by all reduces in occupied slots (ms)=0
 - Total time spent by all map tasks (ms)=216810
 - Total vcore-seconds taken by all map tasks=216810
 - Total megabyte-seconds taken by all map tasks=222013440
 - Map-Reduce Framework
 - Map input records=10
 - Map output records=10
 - Input split bytes=409
 - Spilled Records=0
 - Failed Shuffles=0
 - Merged Map outputs=0
 - GC time elapsed (ms)=2400
 - CPU time spent (ms)=5200
 - Physical memory (bytes) snapshot=418557952
 - Virtual memory (bytes) snapshot=6027804672
 - Total committed heap usage (bytes)=243007488
 - File Input Format Counters
 - Bytes Read=0
 - File Output Format Counters
 - Bytes Written=77
 - 16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Transferred 77 bytes in 104.1093 seconds (0.7396 bytes/sec)
 - 16/03/01 13:01:48 INFO mapreduce.ImportJobBase: Retrieved 10 records.
 - 16/03/01 13:01:49 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `customer` AS t LIMIT 1
 - 16/03/01 13:01:49 INFO hive.HiveImport: Loading uploaded data into Hive
 - Logging initialized using configuration in jar:file:/usr/jars/hive-common-1.1.0-cdh5.5.0.jar!/hive-log4j.properties
 - OK
 - Time taken: 2.163 seconds
 - Loading data to table sqoop_workspace.customers
 - chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00000': User does not belong to supergroup
 - chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00001': User does not belong to supergroup
 - chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00002': User does not belong to supergroup
 - chgrp: changing ownership of 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers/part-m-00003': User does not belong to supergroup
 - Table sqoop_workspace.customers stats: [numFiles=4, totalSize=77]
 - OK
 - Time taken: 1.399 seconds
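One thing worth noting from the log above: the TextSplitter warnings appear because id is a varchar rather than an integral column. The article keeps the text-based split, but for a table this small one alternative (not shown in the original article) is to skip splitting entirely with a single mapper, in which case --split-by is not needed; the target directory and --hive-overwrite below are illustrative choices to avoid colliding with the first import:
- sqoop import --connect jdbc:mysql://localhost:3306/sqoop --username root -P -m 1 --columns id,name --table customer --target-dir /user/cloudera/ingest/raw/customers_m1 --fields-terminated-by "," --hive-import --hive-overwrite --hive-table sqoop_workspace.customers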
 
Finally, let's verify the output in Hive:
- hive> show databases;
 - OK
 - default
 - sqoop_workspace
 - Time taken: 0.034 seconds, Fetched: 2 row(s)
 - hive> use sqoop_workspace;
 - OK
 - Time taken: 0.063 seconds
 - hive> show tables;
 - OK
 - customers
 - Time taken: 0.036 seconds, Fetched: 1 row(s)
 - hive> show create table customers;
 - OK
 - CREATE TABLE `customers`(
 - `id` string,
 - `name` string)
 - COMMENT 'Imported by sqoop on 2016/03/01 13:01:49'
 - ROW FORMAT DELIMITED
 - FIELDS TERMINATED BY ','
 - LINES TERMINATED BY '\n'
 - STORED AS INPUTFORMAT
 - 'org.apache.hadoop.mapred.TextInputFormat'
 - OUTPUTFORMAT
 - 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
 - LOCATION
 - 'hdfs://quickstart.cloudera:8020/user/hive/warehouse/sqoop_workspace.db/customers'
 - TBLPROPERTIES (
 - 'COLUMN_STATS_ACCURATE'='true',
 - 'numFiles'='4',
 - 'totalSize'='77',
 - 'transient_lastDdlTime'='1456866115')
 - Time taken: 0.26 seconds, Fetched: 18 row(s)
 
hive> select * from customers;
OK
1 John
2 Kevin
19 Alex
3 Mark
4 Jenna
5 Robert
6 Zoya
7 Sam
8 George
9 Peter
Time taken: 1.123 seconds, Fetched: 10 row(s).
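With the data in Hive, ordinary HiveQL works on it straight away; for example, a quick sanity check (only the id and name columns were imported, per --columns id,name):
hive> select count(*) from customers;
This should report 10, matching the "Retrieved 10 records." line in the Sqoop log.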
That's it! Migrating data from MySQL to Hive really is that simple.