<kbd id="afajh"><form id="afajh"></form></kbd>
<strong id="afajh"><dl id="afajh"></dl></strong>
    <del id="afajh"><form id="afajh"></form></del>
        1. <th id="afajh"><progress id="afajh"></progress></th>
          <b id="afajh"><abbr id="afajh"></abbr></b>
          <th id="afajh"><progress id="afajh"></progress></th>

          使用 SQL 窗口函數(shù)進(jìn)行增長(zhǎng)數(shù)據(jù)分析

          共 5897字,需瀏覽 12分鐘

           ·

          2020-09-05 22:25

          點(diǎn)擊上方數(shù)據(jù)管道”,選擇“置頂星標(biāo)”公眾號(hào)

          干貨福利,第一時(shí)間送達(dá)


          本文轉(zhuǎn)自:大數(shù)據(jù)技術(shù)與數(shù)倉

          本文會(huì)從一個(gè)商務(wù)分析案例入手,說明SQL窗口函數(shù)的使用方式。通過本文的5個(gè)需求分析,可以看出SQL窗口函數(shù)的功能十分強(qiáng)大,不僅能夠使我們編寫的SQL邏輯更加清晰,而且在某種程度上可以簡(jiǎn)化需求開發(fā)。

          數(shù)據(jù)準(zhǔn)備

          本文主要分析只涉及一張訂單表orders,操作過程在Hive中完成,具體數(shù)據(jù)如下:

          --?建表
          CREATE?TABLE?orders(
          ????order_id?int,
          ????customer_id?string,
          ????city?string,
          ????add_time?string,
          ????amount?decimal(10,2));

          --?準(zhǔn)備數(shù)據(jù)??????????????????????????????
          INSERT?INTO?orders?VALUES
          (1,"A","上海","2020-01-01?00:00:00.000000",200),
          (2,"B","上海","2020-01-05?00:00:00.000000",250),
          (3,"C","北京","2020-01-12?00:00:00.000000",200),
          (4,"A","上海","2020-02-04?00:00:00.000000",400),
          (5,"D","上海","2020-02-05?00:00:00.000000",250),
          (5,"D","上海","2020-02-05?12:00:00.000000",300),
          (6,"C","北京","2020-02-19?00:00:00.000000",300),
          (7,"A","上海","2020-03-01?00:00:00.000000",150),
          (8,"E","北京","2020-03-05?00:00:00.000000",500),
          (9,"F","上海","2020-03-09?00:00:00.000000",250),
          (10,"B","上海","2020-03-21?00:00:00.000000",600);

          需求1:收入增長(zhǎng)

          在業(yè)務(wù)方面,第m1個(gè)月的收入增長(zhǎng)計(jì)算如下:100 *(m1-m0)/ m0

          其中,m1是給定月份的收入,m0是上個(gè)月的收入。因此,從技術(shù)上講,我們需要找到每個(gè)月的收入,然后以某種方式將每個(gè)月的收入與上一個(gè)收入相關(guān)聯(lián),以便進(jìn)行上述計(jì)算。計(jì)算當(dāng)時(shí)如下:

          WITH
          monthly_revenue?as?(
          ????SELECT
          ????trunc(add_time,'MM')?as?month,
          ????sum(amount)?as?revenue
          ????FROM?orders
          ????GROUP?BY?1
          )
          ,prev_month_revenue?as?(
          ????SELECT?
          ????month,
          ????revenue,
          ????lag(revenue)?over?(order?by?month)?as?prev_month_revenue?--?上一月收入
          ????FROM?monthly_revenue
          )
          SELECT?
          ??month,
          ??revenue,
          ??prev_month_revenue,
          ??round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
          FROM?prev_month_revenue
          ORDER?BY?1

          結(jié)果輸出

          monthrevenueprev_month_revenuerevenue_growth
          2020-01-01650NULLNULL
          2020-02-01125065092.3
          2020-03-011500125020

          我們還可以按照按城市分組進(jìn)行統(tǒng)計(jì),查看某個(gè)城市某個(gè)月份的收入增長(zhǎng)情況

          WITH
          monthly_revenue?as?(
          ????SELECT
          ?????trunc(add_time,'MM')?as?month,
          ????city,
          ????sum(amount)?as?revenue
          ????FROM?orders
          ????GROUP?BY?1,2
          )
          ,prev_month_revenue?as?(
          ????SELECT?
          ????month,
          ????city,
          ????revenue,
          ????lag(revenue)?over?(partition?by?city?order?by?month)?as?prev_month_revenue
          ????FROM?monthly_revenue
          )
          SELECT?
          month,
          city,
          revenue,
          round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
          FROM?prev_month_revenue
          ORDER?BY?2,1

          結(jié)果輸出

          monthcityrevenuerevenue_growth
          2020-01-01上海450NULL
          2020-02-01上海950111.1
          2020-03-01上海10005.3
          2020-01-01北京200NULL
          2020-02-01北京30050
          2020-03-01北京50066.7

          需求2:累計(jì)求和

          累計(jì)匯總,即當(dāng)前元素和所有先前元素的總和,如下面的SQL:

          WITH
          monthly_revenue?as?(
          ????SELECT
          ????trunc(add_time,'MM')?as?month,
          ????sum(amount)?as?revenue
          ????FROM?orders
          ????GROUP?BY?1
          )
          SELECT?
          month,
          revenue,
          sum(revenue)?over?(order?by?month?rows?between?unbounded?preceding?and?current?row)?as?running_total
          FROM?monthly_revenue
          ORDER?BY?1

          結(jié)果輸出

          monthrevenuerunning_total
          2020-01-01650650
          2020-02-0112501900
          2020-03-0115003400

          我們還可以使用下面的組合方式進(jìn)行分析,SQL如下:

          SELECT
          ???order_id,
          ???customer_id,
          ???city,
          ???add_time,
          ???amount,
          ???sum(amount)?over?()?as?amount_total,?--?所有數(shù)據(jù)求和
          ???sum(amount)?over?(order?by?order_id?rows?between?unbounded?preceding?and?current?row)?as?running_sum,?--?累計(jì)求和
          ???sum(amount)?over?(partition?by?customer_id?order?by?add_time?rows?between?unbounded????preceding?and?current?row)?as?running_sum_by_customer,?
          ???avg(amount)?over?(order?by?add_time?rows?between?5?preceding?and?current?row)?as??trailing_avg?--?滾動(dòng)求平均
          FROM?orders
          ORDER?BY?1

          結(jié)果輸出

          order_idcustomer_idcityadd_timeamountamount_totalrunning_sumrunning_sum_by_customertrailing_avg
          1A上海2020-01-01 00:00:00.0000002003400200200200
          2B上海2020-01-05 00:00:00.0000002503400450250225
          3C北京2020-01-12 00:00:00.0000002003400650200216.666667
          4A上海2020-02-04 00:00:00.00000040034001050600262.5
          5D上海2020-02-05 00:00:00.00000025034001300250260
          5D上海2020-02-05 12:00:00.00000030034001600550266.666667
          6C北京2020-02-19 00:00:00.00000030034001900500283.333333
          7A上海2020-03-01 00:00:00.00000015034002050750266.666667
          8E北京2020-03-05 00:00:00.00000050034002550500316.666667
          9F上海2020-03-09 00:00:00.00000025034002800250291.666667
          10B上海2020-03-21 00:00:00.00000060034003400850

          需求3:處理重復(fù)數(shù)據(jù)

          從上面的數(shù)據(jù)可以看出,存在兩條重復(fù)的數(shù)據(jù)**(5,"D","上海","2020-02-05 00:00:00.000000",250), (5,"D","上海","2020-02-05 12:00:00.000000",300),**顯然需要對(duì)其進(jìn)行清洗去重,保留最新的一條數(shù)據(jù),SQL如下:

          我們先進(jìn)行分組排名,然后保留最新的那條數(shù)據(jù)即可:

          SELECT?*
          FROM?(
          ????SELECT?*,
          ????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
          ????FROM?orders
          )?t
          WHERE?rank=1

          結(jié)果輸出

          t.order_idt.customer_idt.cityt.add_timet.amountt.rank
          1A上海2020-01-01 00:00:00.0000002001
          2B上海2020-01-05 00:00:00.0000002501
          3C北京2020-01-12 00:00:00.0000002001
          4A上海2020-02-04 00:00:00.0000004001
          5D上海2020-02-05 12:00:00.0000003001
          6C北京2020-02-19 00:00:00.0000003001
          7A上海2020-03-01 00:00:00.0000001501
          8E北京2020-03-05 00:00:00.0000005001
          9F上海2020-03-09 00:00:00.0000002501
          10B上海2020-03-21 00:00:00.0000006001

          經(jīng)過上面的清洗過程,對(duì)數(shù)據(jù)進(jìn)行了去重。重新計(jì)算上面的需求1,正確SQL腳本為:

          WITH
          orders_cleaned?as?(
          ????SELECT?*
          ????FROM?(
          ????????SELECT?*,
          ????????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
          ????????FROM?orders
          ????)t
          ????WHERE?rank=1
          )
          ,monthly_revenue?as?(
          ????SELECT
          ????trunc(add_time,'MM')?as?month,
          ????sum(amount)?as?revenue
          ????FROM?orders_cleaned
          ????GROUP?BY?1
          )
          ,prev_month_revenue?as?(
          ????SELECT?
          ????month,
          ????revenue,
          ????lag(revenue)?over?(order?by?month)?as?prev_month_revenue
          ????FROM?monthly_revenue
          )
          SELECT?
          month,
          revenue,
          round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
          FROM?prev_month_revenue
          ORDER?BY?1

          結(jié)果輸出

          monthrevenuerevenue_growth
          2020-01-01650NULL
          2020-02-01100053.8
          2020-03-01150050

          將清洗后的數(shù)據(jù)創(chuàng)建成視圖,方便以后使用

          CREATE?VIEW?orders_cleaned?AS
          SELECT
          ????order_id,?
          ????customer_id,?
          ????city,?
          ????add_time,?
          ????amount
          FROM?(
          ????SELECT?*,
          ????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
          ????FROM?orders
          )t
          WHERE?rank=1

          需求4:分組取TopN

          分組取topN是最長(zhǎng)見的SQL窗口函數(shù)使用場(chǎng)景,下面的SQL是計(jì)算每個(gè)月份的top2訂單金額,如下:

          WITH?orders_ranked?as?(
          ????SELECT
          ????trunc(add_time,'MM')?as?month,
          ????*,
          ????row_number()?over?(partition?by?trunc(add_time,'MM')?order?by?amount?desc,?add_time)?as?rank
          ????FROM?orders_cleaned
          )
          SELECT?
          ????month,
          ????order_id,
          ????customer_id,
          ????city,
          ????add_time,
          ????amount
          FROM?orders_ranked
          WHERE?rank?<=2
          ORDER?BY?1

          需求5:重復(fù)購買行為

          下面的SQL計(jì)算重復(fù)購買率:重復(fù)購買的人數(shù)/總?cè)藬?shù)*100%以及第一筆訂單金額與第二筆訂單金額之間的典型差額:avg(第二筆訂單金額/第一筆訂單金額)

          WITH?customer_orders?as?(
          ????SELECT?*,
          ????row_number()?over?(partition?by?customer_id?order?by?add_time)?as?customer_order_n,
          ????lag(amount)?over?(partition?by?customer_id?order?by?add_time)?as?prev_order_amount
          ????FROM?orders_cleaned
          )
          SELECT
          round(100.0*sum(case?when?customer_order_n=2?then?1?end)/count(distinct?customer_id),1)?as?repeat_purchases,--?重復(fù)購買率
          avg(case?when?customer_order_n=2?then?1.0*amount/prev_order_amount?end)?as?revenue_expansion?--?重復(fù)購買較上次購買差異,第一筆訂單金額與第二筆訂單金額之間的典型差額
          FROM?customer_orders

          結(jié)果輸出

          WITH結(jié)果輸出:

          orders_cleaned.order_idorders_cleaned.customer_idorders_cleaned.cityorders_cleaned.add_timeorders_cleaned.amountcustomer_order_nprev_order_amount
          1A上海2020-01-01 00:00:00.0000002001NULL
          4A上海2020-02-04 00:00:00.0000004002200
          7A上海2020-03-01 00:00:00.0000001503400
          2B上海2020-01-05 00:00:00.0000002501NULL
          10B上海2020-03-21 00:00:00.0000006002250
          3C北京2020-01-12 00:00:00.0000002001NULL
          6C北京2020-02-19 00:00:00.0000003002200
          5D上海2020-02-05 12:00:00.0000003001NULL
          8E北京2020-03-05 00:00:00.0000005001NULL
          9F上海2020-03-09 00:00:00.000000250

          最終結(jié)果輸出:

          repeat_purchasesrevenue_expansion
          501.9666666666666668

          總結(jié)

          本文主要分享了SQL窗口函數(shù)的基本使用方式以及使用場(chǎng)景,并結(jié)合了具體的分析案例。通過本文的分析案例,可以加深對(duì)SQL窗口函數(shù)的理解。

          瀏覽 76
          點(diǎn)贊
          評(píng)論
          收藏
          分享

          手機(jī)掃一掃分享

          分享
          舉報(bào)
          評(píng)論
          圖片
          表情
          推薦
          點(diǎn)贊
          評(píng)論
          收藏
          分享

          手機(jī)掃一掃分享

          分享
          舉報(bào)
          <kbd id="afajh"><form id="afajh"></form></kbd>
          <strong id="afajh"><dl id="afajh"></dl></strong>
            <del id="afajh"><form id="afajh"></form></del>
                1. <th id="afajh"><progress id="afajh"></progress></th>
                  <b id="afajh"><abbr id="afajh"></abbr></b>
                  <th id="afajh"><progress id="afajh"></progress></th>
                  日日成人网| 美女草逼 | 国产女人高潮视频 | 福利一区在线 | 亚洲五月丁香影院 |