使用 SQL 窗口函數(shù)進(jìn)行增長(zhǎng)數(shù)據(jù)分析
點(diǎn)擊上方“數(shù)據(jù)管道”,選擇“置頂星標(biāo)”公眾號(hào)
干貨福利,第一時(shí)間送達(dá)

本文會(huì)從一個(gè)商務(wù)分析案例入手,說明SQL窗口函數(shù)的使用方式。通過本文的5個(gè)需求分析,可以看出SQL窗口函數(shù)的功能十分強(qiáng)大,不僅能夠使我們編寫的SQL邏輯更加清晰,而且在某種程度上可以簡(jiǎn)化需求開發(fā)。
數(shù)據(jù)準(zhǔn)備
本文主要分析只涉及一張訂單表orders,操作過程在Hive中完成,具體數(shù)據(jù)如下:
--?建表
CREATE?TABLE?orders(
????order_id?int,
????customer_id?string,
????city?string,
????add_time?string,
????amount?decimal(10,2));
--?準(zhǔn)備數(shù)據(jù)??????????????????????????????
INSERT?INTO?orders?VALUES
(1,"A","上海","2020-01-01?00:00:00.000000",200),
(2,"B","上海","2020-01-05?00:00:00.000000",250),
(3,"C","北京","2020-01-12?00:00:00.000000",200),
(4,"A","上海","2020-02-04?00:00:00.000000",400),
(5,"D","上海","2020-02-05?00:00:00.000000",250),
(5,"D","上海","2020-02-05?12:00:00.000000",300),
(6,"C","北京","2020-02-19?00:00:00.000000",300),
(7,"A","上海","2020-03-01?00:00:00.000000",150),
(8,"E","北京","2020-03-05?00:00:00.000000",500),
(9,"F","上海","2020-03-09?00:00:00.000000",250),
(10,"B","上海","2020-03-21?00:00:00.000000",600);
需求1:收入增長(zhǎng)
在業(yè)務(wù)方面,第m1個(gè)月的收入增長(zhǎng)計(jì)算如下:100 *(m1-m0)/ m0
其中,m1是給定月份的收入,m0是上個(gè)月的收入。因此,從技術(shù)上講,我們需要找到每個(gè)月的收入,然后以某種方式將每個(gè)月的收入與上一個(gè)收入相關(guān)聯(lián),以便進(jìn)行上述計(jì)算。計(jì)算當(dāng)時(shí)如下:
WITH
monthly_revenue?as?(
????SELECT
????trunc(add_time,'MM')?as?month,
????sum(amount)?as?revenue
????FROM?orders
????GROUP?BY?1
)
,prev_month_revenue?as?(
????SELECT?
????month,
????revenue,
????lag(revenue)?over?(order?by?month)?as?prev_month_revenue?--?上一月收入
????FROM?monthly_revenue
)
SELECT?
??month,
??revenue,
??prev_month_revenue,
??round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
FROM?prev_month_revenue
ORDER?BY?1
結(jié)果輸出
| month | revenue | prev_month_revenue | revenue_growth |
|---|---|---|---|
| 2020-01-01 | 650 | NULL | NULL |
| 2020-02-01 | 1250 | 650 | 92.3 |
| 2020-03-01 | 1500 | 1250 | 20 |
我們還可以按照按城市分組進(jìn)行統(tǒng)計(jì),查看某個(gè)城市某個(gè)月份的收入增長(zhǎng)情況
WITH
monthly_revenue?as?(
????SELECT
?????trunc(add_time,'MM')?as?month,
????city,
????sum(amount)?as?revenue
????FROM?orders
????GROUP?BY?1,2
)
,prev_month_revenue?as?(
????SELECT?
????month,
????city,
????revenue,
????lag(revenue)?over?(partition?by?city?order?by?month)?as?prev_month_revenue
????FROM?monthly_revenue
)
SELECT?
month,
city,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
FROM?prev_month_revenue
ORDER?BY?2,1
結(jié)果輸出
| month | city | revenue | revenue_growth |
|---|---|---|---|
| 2020-01-01 | 上海 | 450 | NULL |
| 2020-02-01 | 上海 | 950 | 111.1 |
| 2020-03-01 | 上海 | 1000 | 5.3 |
| 2020-01-01 | 北京 | 200 | NULL |
| 2020-02-01 | 北京 | 300 | 50 |
| 2020-03-01 | 北京 | 500 | 66.7 |
需求2:累計(jì)求和
累計(jì)匯總,即當(dāng)前元素和所有先前元素的總和,如下面的SQL:
WITH
monthly_revenue?as?(
????SELECT
????trunc(add_time,'MM')?as?month,
????sum(amount)?as?revenue
????FROM?orders
????GROUP?BY?1
)
SELECT?
month,
revenue,
sum(revenue)?over?(order?by?month?rows?between?unbounded?preceding?and?current?row)?as?running_total
FROM?monthly_revenue
ORDER?BY?1
結(jié)果輸出
| month | revenue | running_total |
|---|---|---|
| 2020-01-01 | 650 | 650 |
| 2020-02-01 | 1250 | 1900 |
| 2020-03-01 | 1500 | 3400 |
我們還可以使用下面的組合方式進(jìn)行分析,SQL如下:
SELECT
???order_id,
???customer_id,
???city,
???add_time,
???amount,
???sum(amount)?over?()?as?amount_total,?--?所有數(shù)據(jù)求和
???sum(amount)?over?(order?by?order_id?rows?between?unbounded?preceding?and?current?row)?as?running_sum,?--?累計(jì)求和
???sum(amount)?over?(partition?by?customer_id?order?by?add_time?rows?between?unbounded????preceding?and?current?row)?as?running_sum_by_customer,?
???avg(amount)?over?(order?by?add_time?rows?between?5?preceding?and?current?row)?as??trailing_avg?--?滾動(dòng)求平均
FROM?orders
ORDER?BY?1
結(jié)果輸出:
| order_id | customer_id | city | add_time | amount | amount_total | running_sum | running_sum_by_customer | trailing_avg |
|---|---|---|---|---|---|---|---|---|
| 1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 3400 | 200 | 200 | 200 |
| 2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 3400 | 450 | 250 | 225 |
| 3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 3400 | 650 | 200 | 216.666667 |
| 4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 3400 | 1050 | 600 | 262.5 |
| 5 | D | 上海 | 2020-02-05 00:00:00.000000 | 250 | 3400 | 1300 | 250 | 260 |
| 5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 3400 | 1600 | 550 | 266.666667 |
| 6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 3400 | 1900 | 500 | 283.333333 |
| 7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 3400 | 2050 | 750 | 266.666667 |
| 8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 3400 | 2550 | 500 | 316.666667 |
| 9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 3400 | 2800 | 250 | 291.666667 |
| 10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 3400 | 3400 | 850 |
需求3:處理重復(fù)數(shù)據(jù)
從上面的數(shù)據(jù)可以看出,存在兩條重復(fù)的數(shù)據(jù)**(5,"D","上海","2020-02-05 00:00:00.000000",250), (5,"D","上海","2020-02-05 12:00:00.000000",300),**顯然需要對(duì)其進(jìn)行清洗去重,保留最新的一條數(shù)據(jù),SQL如下:
我們先進(jìn)行分組排名,然后保留最新的那條數(shù)據(jù)即可:
SELECT?*
FROM?(
????SELECT?*,
????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
????FROM?orders
)?t
WHERE?rank=1
結(jié)果輸出:
| t.order_id | t.customer_id | t.city | t.add_time | t.amount | t.rank |
|---|---|---|---|---|---|
| 1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 1 |
| 2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 1 |
| 3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 1 |
| 4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 1 |
| 5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 1 |
| 6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 1 |
| 7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 1 |
| 8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 1 |
| 9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 | 1 |
| 10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 1 |
經(jīng)過上面的清洗過程,對(duì)數(shù)據(jù)進(jìn)行了去重。重新計(jì)算上面的需求1,正確SQL腳本為:
WITH
orders_cleaned?as?(
????SELECT?*
????FROM?(
????????SELECT?*,
????????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
????????FROM?orders
????)t
????WHERE?rank=1
)
,monthly_revenue?as?(
????SELECT
????trunc(add_time,'MM')?as?month,
????sum(amount)?as?revenue
????FROM?orders_cleaned
????GROUP?BY?1
)
,prev_month_revenue?as?(
????SELECT?
????month,
????revenue,
????lag(revenue)?over?(order?by?month)?as?prev_month_revenue
????FROM?monthly_revenue
)
SELECT?
month,
revenue,
round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
FROM?prev_month_revenue
ORDER?BY?1
結(jié)果輸出:
| month | revenue | revenue_growth |
|---|---|---|
| 2020-01-01 | 650 | NULL |
| 2020-02-01 | 1000 | 53.8 |
| 2020-03-01 | 1500 | 50 |
將清洗后的數(shù)據(jù)創(chuàng)建成視圖,方便以后使用
CREATE?VIEW?orders_cleaned?AS
SELECT
????order_id,?
????customer_id,?
????city,?
????add_time,?
????amount
FROM?(
????SELECT?*,
????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
????FROM?orders
)t
WHERE?rank=1
需求4:分組取TopN
分組取topN是最長(zhǎng)見的SQL窗口函數(shù)使用場(chǎng)景,下面的SQL是計(jì)算每個(gè)月份的top2訂單金額,如下:
WITH?orders_ranked?as?(
????SELECT
????trunc(add_time,'MM')?as?month,
????*,
????row_number()?over?(partition?by?trunc(add_time,'MM')?order?by?amount?desc,?add_time)?as?rank
????FROM?orders_cleaned
)
SELECT?
????month,
????order_id,
????customer_id,
????city,
????add_time,
????amount
FROM?orders_ranked
WHERE?rank?<=2
ORDER?BY?1
需求5:重復(fù)購買行為
下面的SQL計(jì)算重復(fù)購買率:重復(fù)購買的人數(shù)/總?cè)藬?shù)*100%以及第一筆訂單金額與第二筆訂單金額之間的典型差額:avg(第二筆訂單金額/第一筆訂單金額)
WITH?customer_orders?as?(
????SELECT?*,
????row_number()?over?(partition?by?customer_id?order?by?add_time)?as?customer_order_n,
????lag(amount)?over?(partition?by?customer_id?order?by?add_time)?as?prev_order_amount
????FROM?orders_cleaned
)
SELECT
round(100.0*sum(case?when?customer_order_n=2?then?1?end)/count(distinct?customer_id),1)?as?repeat_purchases,--?重復(fù)購買率
avg(case?when?customer_order_n=2?then?1.0*amount/prev_order_amount?end)?as?revenue_expansion?--?重復(fù)購買較上次購買差異,第一筆訂單金額與第二筆訂單金額之間的典型差額
FROM?customer_orders
結(jié)果輸出:
WITH結(jié)果輸出:
| orders_cleaned.order_id | orders_cleaned.customer_id | orders_cleaned.city | orders_cleaned.add_time | orders_cleaned.amount | customer_order_n | prev_order_amount |
|---|---|---|---|---|---|---|
| 1 | A | 上海 | 2020-01-01 00:00:00.000000 | 200 | 1 | NULL |
| 4 | A | 上海 | 2020-02-04 00:00:00.000000 | 400 | 2 | 200 |
| 7 | A | 上海 | 2020-03-01 00:00:00.000000 | 150 | 3 | 400 |
| 2 | B | 上海 | 2020-01-05 00:00:00.000000 | 250 | 1 | NULL |
| 10 | B | 上海 | 2020-03-21 00:00:00.000000 | 600 | 2 | 250 |
| 3 | C | 北京 | 2020-01-12 00:00:00.000000 | 200 | 1 | NULL |
| 6 | C | 北京 | 2020-02-19 00:00:00.000000 | 300 | 2 | 200 |
| 5 | D | 上海 | 2020-02-05 12:00:00.000000 | 300 | 1 | NULL |
| 8 | E | 北京 | 2020-03-05 00:00:00.000000 | 500 | 1 | NULL |
| 9 | F | 上海 | 2020-03-09 00:00:00.000000 | 250 |
最終結(jié)果輸出:
| repeat_purchases | revenue_expansion |
|---|---|
| 50 | 1.9666666666666668 |
總結(jié)
本文主要分享了SQL窗口函數(shù)的基本使用方式以及使用場(chǎng)景,并結(jié)合了具體的分析案例。通過本文的分析案例,可以加深對(duì)SQL窗口函數(shù)的理解。
