HiveSQL查詢的18種方式
前言
????????相信大家一定對(duì) Hive 不陌生!Hive 是基于Hadoop 的一個(gè)數(shù)據(jù)倉庫工具,可以將結(jié)構(gòu)化的數(shù)據(jù)文件映射為一張數(shù)據(jù)庫表,并提供類SQL查詢功能(HQL)。Hive的優(yōu)點(diǎn)是學(xué)習(xí)成本低,可以通過類似SQL語句實(shí)現(xiàn)快速M(fèi)apReduce統(tǒng)計(jì),使MapReduce變得更加簡(jiǎn)單,而不必開發(fā)專門的MapReduce應(yīng)用程序。因此,hive十分適合對(duì)數(shù)據(jù)倉庫進(jìn)行統(tǒng)計(jì)分析。
????????今天呢,我們就來探討一下,關(guān)于Hive數(shù)據(jù)查詢的18種方式!
準(zhǔn)備
????????我們本期內(nèi)容大部分HQL操作都需要依賴如下兩張表,具體的數(shù)據(jù)內(nèi)容如下:
course
student
1、SELECT查詢語句
??????? SELECT 查詢語句比較簡(jiǎn)單,后面跟要查詢的字段,如下所示:
hive?(hypers)>?select?name?from?student;
OK
name
Rose
Jack
Jimmy
Tom
Jerry
????????可以為查詢語句中的列和表加上別名,如下所示:
hive?(hypers)>?select?t.name?from?student?t;
OK
t.name
Rose
Jack
Jimmy
Tom
Jerry
????????可以使用如下語句進(jìn)行嵌套查詢:
hive?(hypers)>?select?a.name,?b.coursename
?????????????>?from?(select?stuid,?name?from?student)?a
?????????????>??????????join?(select?stuid,?coursename?from?course)?b?on?a.stuid?=?b.stuid;
OK
a.name??b.coursename
Rose????C語言
Jack????Java
Jimmy???高等數(shù)學(xué)
Tom?????離散數(shù)學(xué)
Jerry???C++
????????可以使用正則表達(dá)式指定查詢的列,如下所示:
hive?(hypers)>?select?t.*?from?student?t;
OK
t.stuid?t.name??t.sex???t.age
15317408????????Rose????1???????21
15317412????????Jack????0???????20
15317432????????Jimmy???1???????21
15317423????????Tom?????1???????20
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
????????可以使用 LIMIT 限制查詢的結(jié)果條數(shù),如下所示:
hive?(hypers)>?select?*?from?student?limit?1;
OK
student.stuid???student.name????student.sex?????student.age
15317408????????Rose????1???????21
????????可以使用ORDER BY語句對(duì)結(jié)果進(jìn)行排序,升序我們可以不在排序的字段后加上ASC(默認(rèn)),但是倒序需要指定DESC,如下所示:
hive?(hypers)>?select?*?from?student?order?by?age?desc;
OK
student.stuid???student.name????student.sex?????student.age
15317432????????Jimmy???1???????21
15317408????????Rose????1???????21
15317467????????Alice???0???????20
15317423????????Tom?????1???????20
15317412????????Jack????0???????20
15317478????????Jerry???0???????19
Time?taken:?10.631?seconds,?Fetched:?5?row(s)
hive?(hypers)>?select?*?from?student?order?by?age;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
15317423????????Tom?????1???????20
15317412????????Jack????0???????20
15317432????????Jimmy???1???????21
15317408????????Rose????1???????21
????????我們還可以使用CASE...WHEN...THEN語句對(duì)某一列的值進(jìn)行處理,如下所示:
hive?(hypers)>?SELECT?stuid,
?????????????>????????name,
?????????????>????????age,
?????????????>????????sex,
?????????????>????????CASE
?????????????>????????????WHEN?sex?=?'1'?THEN?'男'
?????????????>????????????WHEN?sex?=?'0'?THEN?'女'
?????????????>????????????ELSE?'未知'
?????????????>????????????END
?????????????>????????FROM?student;
OK
stuid???name????age?????sex?????_c4
15317408????????Rose????21??????1???????男
15317412????????Jack????20??????0???????女
15317432????????Jimmy???21??????1???????男
15317423????????Tom?????20??????1???????男
15317478????????Jerry???19??????0???????女
15317478????????Alice???20??????0???????女
2、WHERE 條件語句
??????? WHERE 條件語句主要是對(duì)查詢進(jìn)行條件限制,如下所示:
hive?(hypers)>?select?*?from?student?where?age?=?21;
OK
student.stuid???student.name????student.sex?????student.age
15317408????????Rose????1???????21
15317432????????Jimmy???1???????21
????????WHERE 條件語句常用的操作符如該表所示
| 操作符 | 支持的數(shù)據(jù)類型 | 說明 |
|---|---|---|
| A=B | 基本數(shù)據(jù)類型 | 如果A等于B,則返回true,否則返回false |
| A<=>B | 基本數(shù)據(jù)類型 | 如果A和B都為NULL,則返回true,其他情況和 A=B 相同 |
| A<>B,A != B | 基本數(shù)據(jù)類型 | 如果A或者B為NULL,則返回NULL;如果A不等于B返回 true,否則返回 false |
| A | 基本數(shù)據(jù)類型 | 如果A或者B為NULL,則返回NULL;如果A小于B返回 true,否則返回 false |
| A<=B | 基本數(shù)據(jù)類型 | 如果A或者B為NULL,則返回NULL;如果A小于或等于B返回 true,否則返回 false |
| A>B | 基本數(shù)據(jù)類型 | 如果A 或者B為NULL,則返回NULL;如果A大于B返回true,否則返回 false |
| A>=B | 基本數(shù)據(jù)類型 | 如果A 或者B為NULL,則返回NULL;如果A大于或者等于B返回true,否則返回 false |
| A IS NULL | 所有數(shù)據(jù)類型 | 如果A為NULL返回true,否則返回 false |
| A IS NOT NULL | 所有數(shù)據(jù)類型 | 如果A不為NULL返回true,否則返回 false |
| A BETWEEN B AND C | 基本數(shù)據(jù)類型 | 如果A、B、C任一為NULL,則返回NULL;如果A大于或者等于B并且A小于或等于C,則返回true,否則返回false |
| A NOT BETWEEN B AND C | 基本數(shù)據(jù)類型 | 如果A、B、C任一為NULL,則返回NULL;如果A小于B或者A大于C,則返回true,否則返回false |
| A LIKE B | STRING類型 | 如果A模糊匹配B,則返回true,否則返回false |
| A NOT LIKE B | STRING類型 | 如果A不模糊匹配B,則返回true,否則返回false |
| A RLIKE B,A REGEXP B | STRING類型 | B是一個(gè)正則表達(dá)式,如果A匹配正則表達(dá)式,則返回true,否則返回false |
3、GROUP BY 語句
????????GROUP BY語句主要是對(duì)查詢的數(shù)據(jù)進(jìn)行分組,通常會(huì)和聚合函數(shù)一起使用,如下所示:
hive?(hypers)>?select?sex,avg(age)?from?student?group?by?sex;
OK
sex?????_c1
0???????19.666666666666668
1???????20.666666666666668
4、HAVING語句
????????HAVING語句主要用來對(duì)GROUP BY語句的結(jié)果進(jìn)行條件限制,如下所示:
hive?(hypers)>?select?sex,avg(age)?from?student?group?by?sex?having?avg(age)?>?20;
OK
sex?????_c1
1???????20.666666666666668
5、INNER JOIN語句
????????在 INNER JOIN 語句中,只有進(jìn)行連接的兩個(gè)表中都存在與連接條件相匹配的數(shù)據(jù)時(shí)才會(huì)被顯示在結(jié)果數(shù)據(jù)中,如下所示:
hive?(hypers)>?select?t1.name,t2.coursename?from?student?t1?join?course?t2?on?t1.stuid?=?t2.stuid;
OK
t1.name?t2.coursename
Rose????C語言
Jack????Java
Jimmy???高等數(shù)學(xué)
Tom?????離散數(shù)學(xué)
Jerry???C++
6、 LEFT OUTER JOIN語句
????????LEFT OUTER JOIN語句表示左外連接,左外連接查詢數(shù)據(jù)會(huì)包含左表中的全部記錄,而右表中不符合條件的結(jié)果將以NULL的形式出現(xiàn),如下所示:
hive?(hypers)>?select?t1.name,t2.coursename?from?student?t1?left?outer?join?course?t2?on?t1.stuid?=?t2.stuid;
OK
t1.name?t2.coursename
Rose????C語言
Jack????Java
Jimmy???高等數(shù)學(xué)
Tom?????離散數(shù)學(xué)
Jerry???C++
Alice???NULL
7、RIGHT OUTER JOIN語句
????????RIGHT OUTER JOIN表示右外連接,右外連接查詢數(shù)據(jù)會(huì)包含右表中的全部記錄,而左表中不符合條件的結(jié)果將以 NULL 的形式出現(xiàn),如下所示:
hive?(hypers)>?select?t1.name,t2.coursename?from?student?t1?right?outer?join?course?t2?on?t1.stuid?=?t2.stuid;
OK
t1.name?t2.coursename
Rose????C語言
Jack????Java
Jimmy???高等數(shù)學(xué)
Tom?????離散數(shù)學(xué)
Jerry???C++
NULL????大數(shù)據(jù)應(yīng)用開發(fā)
8、FULL OUTER JOIN語句
????????FULL OUTER JOIN語句表示全外連接,結(jié)果數(shù)據(jù)會(huì)包含左表和右表的全部數(shù)據(jù),不符合條件的用NULL表示,如下所示:
hive?(hypers)>?select?t1.name,t2.coursename?from?student?t1?FULL??outer?join?course?t2?on?t1.stuid?=?t2.stuid;
OK
t1.name?t2.coursename
Rose????C語言
Jack????Java
Tom?????離散數(shù)學(xué)
Jimmy???高等數(shù)學(xué)
NULL????大數(shù)據(jù)應(yīng)用開發(fā)
Alice???NULL
Jerry???C++
9、 LEFT SEMI JOIN語句
????????LEFT SEMI JOIN語句表示左半連接,其結(jié)果數(shù)據(jù)對(duì)應(yīng)右表滿足 ON 語句中的條件,如下所示:
hive?(hypers)>?select?t1.name?from?student?t1?LEFT?SEMI?JOIN?course?t2?on?t1.stuid?=?t2.stuid;
OK
t1.name
Rose
Jack
Jimmy
Tom
Jerry
注意:| 在 LEFT SEMI JOIN 語句中,SELECT 和 WHERE 子句中不能引用右表中的字段。|
10、笛卡爾積 JOIN 語句
????????笛卡爾積 JOIN 語句 表示左表的行數(shù)乘以右表的行數(shù)等于結(jié)果集的大小,如下所示:
hive?(hypers)>?select?*?from?student?join?course;
OK
student.stuid???student.name????student.sex?????student.age?????course.stuid????course.coursename?????course.score
15317408????????Rose????1???????21??????15317408????????C語言???50
15317412????????Jack????0???????20??????15317408????????C語言???50
15317432????????Jimmy???1???????21??????15317408????????C語言???50
15317423????????Tom?????1???????20??????15317408????????C語言???50
15317478????????Jerry???0???????19??????15317408????????C語言???50
15317467????????Alice???0???????20??????15317408????????C語言???50
15317408????????Rose????1???????21??????15317412????????Java????60
15317412????????Jack????0???????20??????15317412????????Java????60
15317432????????Jimmy???1???????21??????15317412????????Java????60
15317423????????Tom?????1???????20??????15317412????????Java????60
15317478????????Jerry???0???????19??????15317412????????Java????60
15317467????????Alice???0???????20??????15317412????????Java????60
15317408????????Rose????1???????21??????15317432????????高等數(shù)學(xué)????????70
15317412????????Jack????0???????20??????15317432????????高等數(shù)學(xué)????????70
15317432????????Jimmy???1???????21??????15317432????????高等數(shù)學(xué)????????70
15317423????????Tom?????1???????20??????15317432????????高等數(shù)學(xué)????????70
15317478????????Jerry???0???????19??????15317432????????高等數(shù)學(xué)????????70
15317467????????Alice???0???????20??????15317432????????高等數(shù)學(xué)????????70
15317408????????Rose????1???????21??????15317423????????離散數(shù)學(xué)????????80
15317412????????Jack????0???????20??????15317423????????離散數(shù)學(xué)????????80
15317432????????Jimmy???1???????21??????15317423????????離散數(shù)學(xué)????????80
15317423????????Tom?????1???????20??????15317423????????離散數(shù)學(xué)????????80
15317478????????Jerry???0???????19??????15317423????????離散數(shù)學(xué)????????80
15317467????????Alice???0???????20??????15317423????????離散數(shù)學(xué)????????80
15317408????????Rose????1???????21??????15317478????????C++?????90
15317412????????Jack????0???????20??????15317478????????C++?????90
15317432????????Jimmy???1???????21??????15317478????????C++?????90
15317423????????Tom?????1???????20??????15317478????????C++?????90
15317478????????Jerry???0???????19??????15317478????????C++?????90
15317467????????Alice???0???????20??????15317478????????C++?????90
15317408????????Rose????1???????21??????15317463????????大數(shù)據(jù)應(yīng)用開發(fā)??100
15317412????????Jack????0???????20??????15317463????????大數(shù)據(jù)應(yīng)用開發(fā)??100
15317432????????Jimmy???1???????21??????15317463????????大數(shù)據(jù)應(yīng)用開發(fā)??100
15317423????????Tom?????1???????20??????15317463????????大數(shù)據(jù)應(yīng)用開發(fā)??100
15317478????????Jerry???0???????19??????15317463????????大數(shù)據(jù)應(yīng)用開發(fā)??100
15317467????????Alice???0???????20??????15317463????????大數(shù)據(jù)應(yīng)用開發(fā)??100
注意:| 如果將 Hive 的屬性hive.mapred.mode 設(shè)置為 strict,則會(huì)阻止執(zhí)行笛卡爾積查詢。|
11、map-side JOIN語句
??????? map-site JOIN語句會(huì)在Map階段將小表讀到內(nèi)存,直接在 Map 端 進(jìn)行JOIN,這種連接需要在查詢語句中顯式申明,如下所示:
SELECT?/*?+?MapJOIN(t1)?*/?s1.stuid,s2.stuid??from?student?s1?JOIN?student?s2?ON?s1.stuid?=?s2.stuid;
????????可以通過設(shè)置Hive的屬性 hive.auto.convert.join=true自動(dòng)開啟 map-side JOIN;也可以設(shè)置 Hive 的屬性 hive.mapjoin.smalltable.filesize定義表的大小,默認(rèn)為 25 000 000 B。
12、多表JOIN語句
??????? Hive支持多張表進(jìn)行連接,語句如下所示:
hive?(hypers)>?SELECT?*
???????????????FROM?test1?t1
???????????????JOIN?test2?t2?ON?t1.id?=?t2.id
???????????????JOIN?test3?t3?ON?t2.id?=?t3.id
????????每個(gè) JOIN 都會(huì)啟動(dòng)一個(gè) MapReduce 作業(yè)。第一個(gè)MapReduce作業(yè)連接 test1 表和 test2 表,第二個(gè)MapReduce作業(yè)連接第一個(gè)MapReduce作業(yè)的輸出結(jié)果和 test3 表。
13、ORDER BY 和 SORT BY 語句
????????Hive中的 ORDER BY語句和SQL語句一樣,可以實(shí)現(xiàn)對(duì)結(jié)果集的排序,如下所示:
hive?(hypers)>?select?*?from?student?order?by?age?asc,stuId?desc;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
15317423????????Tom?????1???????20
15317412????????Jack????0???????20
15317432????????Jimmy???1???????21
15317408????????Rose????1???????21
Time?taken:?11.929?seconds,?Fetched:?6?row(s)
????????上述語句表示按照age字段升序,stuId字段降序排序。
????????如果Hive表中的數(shù)據(jù)非常多,使用 ORDER BY排序可能會(huì)導(dǎo)致執(zhí)行的時(shí)間過長,此時(shí)可以設(shè)置Hive的屬性 hive.mapred.mode為strict,則排序語句后面必須加上 LIMIT限制查詢的結(jié)果條數(shù),以避免數(shù)據(jù)量太多造成的執(zhí)行時(shí)間過長的問題,如下所示:
hive?(hypers)>?SET?hive.mapred.mode?=?strict;
hive?(hypers)>?select?*?from?student?order?by?age?asc,stuId?desc?limit?100;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
15317423????????Tom?????1???????20
15317412????????Jack????0???????20
15317432????????Jimmy???1???????21
15317408????????Rose????1???????21
Time?taken:?9.378?seconds,?Fetched:?6?row(s)
????????SORT BY語句會(huì)在每個(gè)Reduce中對(duì)數(shù)據(jù)進(jìn)行排序,可以保證每個(gè)Reduce輸出的數(shù)據(jù)是有序的(全局不一定有序),并可以提高全局排序的性能,如下所示:
hive?(hypers)>?select?*?from?student?sort?by?age?asc,stuId?desc?limit?100;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
15317423????????Tom?????1???????20
15317412????????Jack????0???????20
15317432????????Jimmy???1???????21
15317408????????Rose????1???????21
????????上述語句會(huì)在每個(gè)Reduce中對(duì)age字段進(jìn)行升序排序,同時(shí)對(duì)create_time字段進(jìn)行降序排序。如果Reduce個(gè)數(shù)為1,則ORDER BY和SORT BY語句的查詢結(jié)果相同;如果Reduce個(gè)數(shù)大于1,則SORT BY輸出的結(jié)果為局部有序。
14、 DISTRIBUTE BY 和 SORT BY語句
????????DISTRIBUTE語句結(jié)合SORT BY語句可以實(shí)現(xiàn)在第一列數(shù)據(jù)相同時(shí),能夠按照第二列數(shù)據(jù)進(jìn)行排序,如下所示:
hive?(hypers)>?select?*?from?student?distribute?by?sex?sort?by?age,stuId;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317412????????Jack????0???????20
15317423????????Tom?????1???????20
15317467????????Alice???0???????20
15317408????????Rose????1???????21
15317432????????Jimmy???1???????21
????????DISTRIBUTE BY語句能夠保證sex相同的數(shù)據(jù)進(jìn)入同一個(gè) ?Reduce 函數(shù),在 Reduce中再按照 age 和 stuId 排序即可實(shí)現(xiàn)在第一列數(shù)據(jù)相同時(shí),按照第二列數(shù)據(jù)排序。
15、CLUSTER BY語句
????????如果 DISTRIBUTE BY 和 SORT BY 語句中的列完全相同,并且都是按照升序排序,則可以使用CLUSTER BY語句代替DISTRIBUTE BY和SORT BY語句,如下所示:
select?*?from?student?distribute?by?age?sort?by?age;
????????上面的語句等價(jià)于:
hive?(hypers)>?select?*?from?student?cluster?by?age;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
15317423????????Tom?????1???????20
15317412????????Jack????0???????20
15317432????????Jimmy???1???????21
15317408????????Rose????1???????21
16、類型轉(zhuǎn)換
????????類型轉(zhuǎn)換可以使用 cast(value As TYPE)語法,如下所示:
hive?(hypers)>?select?*?from?student?where?cast(stuId?AS?INT)?>=?15317450;
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
15317467????????Alice???0???????20
????????上述語句表示將 stuId 轉(zhuǎn)化為 INT 類型。
17、分桶抽樣
??????? Hive支持分桶抽樣查詢,如下所示:
hive?(hypers)>?SELECT?*?FROM?student?TABLESAMPLE?(BUCKET?2?OUT?OF?6?ON?stuid);
OK
student.stuid???student.name????student.sex?????student.age
15317467????????Alice???0???????20
????????上述語句表示查詢時(shí)分6個(gè)桶,取第2個(gè)桶,分桶的依據(jù)是將id值的哈希值除以桶數(shù)6的余數(shù)。也可以采用隨機(jī)抽樣的方式,如下所示:
hive?(hypers)>?SELECT?*?FROM?student?TABLESAMPLE?(BUCKET?2?OUT?OF?6?ON?RAND());
OK
student.stuid???student.name????student.sex?????student.age
15317478????????Jerry???0???????19
Time?taken:?0.04?seconds,?Fetched:?1?row(s)
????????可以在創(chuàng)建表時(shí)指定分桶,需要提前將 Hive 的 hive.enforce.bucketing屬性設(shè)置為 true。該屬性可以在 hive-site.xml文件中配置,如下所示:
????????hive.enforce.bucketing
????????true
????????也可以在Hive命令行設(shè)置,如下所示:
hive?(default)>?SET?hive.enforce.bucketing?=?true;
????????創(chuàng)建表時(shí)指定分桶,并插入 student 表中的 id列數(shù)據(jù),如下所示:
hive?(hypers)>?CREATE?TABLE?test_bucket(id?INT)?CLUSTERED?BY?(id)?INTO?3?BUCKETS?;
OK
Time?taken:?0.086?seconds
hive?(hypers)>?INSERT?OVERWRITE?TABLE?test_bucket?SELECT?stuid?FROM?student;
OK
stuid
Time?taken:?24.261?seconds
????????上述語句首先創(chuàng)建一個(gè) test_bucket表,并將 ?test_bucket 表劃分為3個(gè)桶,然后將 student 表中的 id 列數(shù)據(jù)插入 test_bucket表中。插入的數(shù)據(jù)會(huì)被保存在3個(gè)文件中,每個(gè)桶一個(gè)文件,保存在 test_bucket表路徑下。
18、 UNION ALL 語句
????????Hive 支持 UNION ALL查詢,其主要用于多表數(shù)據(jù)合并的場(chǎng)景。使用 UNION ALL語句要求各表查詢出的字段類型必須完全匹配,如下所示:
SELECT?t.id,t.name
FROM?(
SELECT?t1.id,t1.name?FROM?test1?t1
UNION?ALL
SELECT?t2.id,t2.name?FROM?test2?t2
UNION?ALL
SELECT?t3.id,t3.name?FROM?test3?t3
)?t
注意:| 在Hive中使用
UNION ALL語句,必須使用嵌套查詢 。|

推薦閱讀
歡迎長按掃碼關(guān)注「數(shù)據(jù)管道」
