奇思妙想的SQL|兼顧性能的數(shù)據(jù)傾斜處理新姿勢

阿里妹導(dǎo)讀
文章導(dǎo)讀
一、場景描述

二、常見的優(yōu)化方法
2.1.Mapjoin
SELECT /*+MAPJOIN(dim)*/ *FROM (SELECT * FROM dwd_tbl) baseLEFT OUTER JOIN (SELECT * FROM dim_tbl) dimON base.dim_key = dim.dim_key
2.2.特殊值/空值打散
-
特殊值/空值場景也比較普遍,比如主表中有個屬性字段在某些場景下為空或為一些無業(yè)務(wù)含義的特殊字符串(如DEFAULT),然后此屬性字段本身對應(yīng)了一張數(shù)據(jù)量較大的維表,需要關(guān)聯(lián)打?qū)捬a全。此時做數(shù)據(jù)關(guān)聯(lián),由于兩張表需要按照關(guān)聯(lián)key進行shuffle,就會導(dǎo)致主表中該字段為空/相同特殊字符串的數(shù)據(jù)記錄shuffle到同一節(jié)點上,從而導(dǎo)致數(shù)據(jù)傾斜。 -
此類場景也好解決,對特殊值/空值在關(guān)聯(lián)時轉(zhuǎn)為隨機值就行。
SELECT *FROM (SELECT * FROM dwd_tbl) baseLEFT OUTER JOIN (SELECT * FROM dim_tbl) dimON IF(COALESCE(base.dim_key,'')='',CONCAT('HIVE_',RAND()),base.dim_key) = dim.dim_key
2.3.熱點值打散,副表呈倍數(shù)擴散
-
此類方法使用較少,核心在于對于主表附加一個隨機值(比如1~10)字段,記為ext_a字段,然后對應(yīng)被關(guān)聯(lián)維表數(shù)據(jù)按照對應(yīng)倍數(shù)進行復(fù)制膨脹,并依次賦予1~10的編號,記為ext_b字段,然后在關(guān)聯(lián)兩張表時把ext_a、ext_b兩個字段也作為關(guān)聯(lián)字段之一。 -
此方法適用于被關(guān)聯(lián)表遠比主表小,但又因數(shù)據(jù)大小超過內(nèi)存容量而無法使用Mapjoin,且主表的數(shù)據(jù)傾斜程度不大(即極值對應(yīng)的數(shù)據(jù)行數(shù)相較于值平均對應(yīng)行數(shù),倍數(shù)差距不太大)的情況下可以使用,但整體上此方案只能對數(shù)據(jù)熱點成倍數(shù)的削弱一些。
SELECT *FROM (SELECT *,CAST(RAND()*10 AS BIGINT) AS ext_aFROM dwd_tbl) baseLEFT OUTER JOIN (SELECT *FROM dim_tblLATERAL VIEW EXPLODE(SPLIT('0;1;2;3;4;5;6;7;8;9',';')) tt AS ext_b-- 或者Join一個用于倍數(shù)膨脹的小表) dimON base.dim_key = dim.dim_keyAND base.ext_a = dim.ext_b
2.4.熱點數(shù)據(jù)單獨處理/SkewJoin
使用此方法通常也意味著被關(guān)聯(lián)的維表數(shù)據(jù)大小較大,無法使用Mapjoin,只能走普通shuffle模式的join方案。此類場景最典型的案例就是雙十一淘系交易大表關(guān)聯(lián)商家維表,此時的商家維表因記錄數(shù)和數(shù)據(jù)大小都較大而無法放入內(nèi)存,再加上部分商家的交易單量遠超大盤平均,此時的數(shù)據(jù)傾斜就得使用熱點數(shù)據(jù)單獨處理的方案了。 熱點數(shù)據(jù)單獨處理的方案的核心點在于將熱點數(shù)據(jù)提取出來單獨處理,熱點數(shù)據(jù)可以用Mapjoin的方式完成關(guān)聯(lián)維表熱點記錄行,非熱點則使用普通的shuffle模式的join方案完成關(guān)聯(lián)。-
具體操作主要分三個部分:基于主表統(tǒng)計獲得Top熱點的屬性值;用熱點屬性值將被關(guān)聯(lián)維表拆成熱點小表和非熱點表,同時也將主表拆成熱點主表和非熱點主表;熱點小表通過Mapjoin與熱點主表join,非熱點表與非熱點主表join,最終兩部分再Union到一起,完成數(shù)據(jù)關(guān)聯(lián)。
-- Step01:熱點數(shù)據(jù)記錄提取INSERT OVERWRITE TABLE tmp_hot_list PARTITION (dt = '${bizdate}')SELECT dim_shop_id AS hot_idFROM main_tableWHERE dt = '${bizdate}'GROUP BY dim_shop_idHAVING COUNT(1) > 10000;INSERT OVERWRITE TABLE final_result_table PARTITION (dt = '${bizdate}')-- Step02:熱點數(shù)據(jù)處理,使用MapJoin完成處理SELECT /*+MAPJOIN(a2,a3)*/a1.trade_no AS trade_no,a1.dim_shop_id AS shop_id,a3.shop_name AS shop_name,a3.shop_type AS shop_typeFROM (SELECT * FROM main_table WHERE dt = '${bizdate}') a1-- Step02-1:主表用JOIN關(guān)聯(lián)熱點表進行熱點記錄篩選JOIN (SELECT * FROM tmp_hot_list WHERE dt = '${bizdate}') a2 -- 熱點數(shù)據(jù)清單ON a1.dim_shop_id = a2.dim_shop_id-- Step02-2:熱點維度數(shù)據(jù)處理LEFT OUTER JOIN (SELECT /*+MAPJOIN(b2)*/ b1.*FROM (SELECT * FROM dim_table_info WHERE dt = '${bizdate}') b1JOIN (SELECT * FROM tmp_hot_list WHERE dt = '${bizdate}') b2 -- 熱點數(shù)據(jù)清單ON b1.dim_shop_id = b2.dim_shop_id) a3ON a1.dim_shop_id = a3.dim_shop_idUNION ALL-- Step03:非熱點數(shù)據(jù)處理,使用普通Join完成處理,兩張表均需要進行ShuffleSELECT /*+MAPJOIN(a12)*/a11.trade_no AS trade_no,a11.dim_shop_id AS shop_id,a13.shop_name AS shop_name,a13.shop_type AS shop_typeFROM (SELECT * FROM main_table WHERE dt = '${bizdate}') a11-- Step03-1:主表用ANTI JOIN關(guān)聯(lián)熱點表進行剔除LEFT ANTI JOIN (SELECT * FROM tmp_hot_list WHERE dt = '${bizdate}') a12ON a11.dim_shop_id = a12.dim_shop_id-- Step03-2:非熱點維度數(shù)據(jù)處理LEFT OUTER JOIN (SELECT /*+MAPJOIN(b12)*/ b11.*FROM (SELECT * FROM dim_table_info WHERE dt = '${bizdate}') b11LEFT ANTI JOIN (SELECT * FROM tmp_hot_list WHERE dt = '${bizdate}') b12ON b11.dim_shop_id = b12.dim_shop_id) a13ON a11.dim_shop_id = a13.dim_shop_id;
-
整個步驟稍有些復(fù)雜,這里也可以直接用平臺的skewjoin參數(shù)完成傾斜處理,skew的核心思路就是上面提到的熱點數(shù)據(jù)單獨處理,只是做了平臺級別的集成,方便用戶一鍵解決數(shù)據(jù)傾斜問題。詳細用法和詳細原理可參考 《阿里云-SKEWJOIN HINT》[1] 。
INSERT OVERWRITE TABLE final_result_table PARTITION (dt = '${bizdate}')SELECT /*+SKEWJOIN(a1)*/a1.trade_no AS trade_no,a1.dim_shop_id AS shop_id,a2.shop_name AS shop_name,a2.shop_type AS shop_typeFROM (SELECT * FROM main_table WHERE dt = '${bizdate}') a1LEFT JOIN (SELECT * FROM dim_table_info WHERE dt = '${bizdate}') a2ON a1.dim_shop_id = a2.dim_shop_id;

2.5.方案總結(jié)
-
不難發(fā)現(xiàn),上面幾種方案核心都是在圍繞解決數(shù)據(jù)重分發(fā)(即shuffle)導(dǎo)致的熱點問題,一種是想方設(shè)法采用Mapjoin的方式避免熱點數(shù)據(jù)重分發(fā),一種是讓數(shù)據(jù)重分發(fā)過程中盡可能得均勻。 -
不管是哪種思路,問題核心都還是在解決shuffle導(dǎo)致的數(shù)據(jù)分布不均勻的問題。所以,一切的“罪魁禍首”就是 shuffle 、 shuffle 、 shuffle ~
三、一種新的思路 WithDistmapjoin~
3.1.核心思路
3.2.代碼實現(xiàn)
WITH-- STEP01:熱點Key采集tmp_hot_pid AS (SELECT dim_shop_id,'Y' AS is_hotFROM main_table_detailWHERE dt = '${bizdate}'GROUP BY dim_shop_idHAVING COUNT(1) > 100000)-- STEP02:維表熱點數(shù)據(jù)打標,tmp_dim_tbl AS (SELECT /*+MAPJOIN(hot)*/dim.*,COALESCE(hot.is_hot,'N') AS is_hotFROM (SELECT *FROM dim_table_infoWHERE dt = '${bizdate}') dimLEFT OUTER JOIN tmp_hot_pid hotON dim.dim_shop_id = hot.dim_shop_id)-- STEP03:明細熱點數(shù)據(jù)打標,tmp_dwd_tbl AS (SELECT /*+MAPJOIN(hot)*/base.*,COALESCE(hot.is_hot,'N') AS is_hotFROM (SELECT *FROM main_table_detailWHERE dt = '${bizdate}') baseLEFT OUTER JOIN tmp_hot_pid hotON base.dim_shop_id = hot.dim_shop_id)-- STEP04:數(shù)據(jù)合并處理,熱點數(shù)據(jù)用Mapjoin,非熱點數(shù)據(jù)用DISTMAPJOININSERT OVERWRITE TABLE final_result_table PARTITION (dt = '${bizdate}')SELECT *FROM (-- STEP04-1:非熱點數(shù)據(jù)用DISTMAPJOINSELECT /*+ DISTMAPJOIN(dim(shard_count=77)) */dwd_tbl.trade_no AS trade_no,dwd_tbl.trade_date AS trade_date,dwd_tbl.shop_id AS shop_id,dim.shop_name AS shop_name,dim.shop_type AS shop_typeFROM (SELECT * FROM tmp_dwd_tbl WHERE is_hot = 'N') dwd_tblLEFT OUTER JOIN (SELECT * FROM tmp_dim_tbl WHERE is_hot = 'N') dimON dwd_tbl.partner_id = dim.partner_idUNION ALL-- STEP04-1:熱點數(shù)據(jù)用MapjoinSELECT /*+MAPJOIN(dim)*/dwd_tbl.trade_no AS trade_no,dwd_tbl.trade_date AS trade_date,dwd_tbl.shop_id AS shop_id,dim.shop_name AS shop_name,dim.shop_type AS shop_typeFROM (SELECT *FROM tmp_dwd_tbl WHERE is_hot = 'Y') dwd_tblLEFT OUTER JOIN (SELECT *FROM tmp_dim_tbl WHERE is_hot = 'Y') dimON dwd_tbl.partner_id = dim.partner_id) base;

3.3.真實效果
四、方案總結(jié)

