成人性爱免费,人人摸在线观看视频,亚洲青青草,亚洲AV无码精品自拍,精品黄色片,中文亚洲字幕,中文字幕第256页,人成网站在线观看

在數(shù)據(jù)分析中，窗口函數(shù)是我們經(jīng)常用到的函數(shù)，今天的文章我們總結(jié)了常用的各類窗口函數(shù)并給出實例。

一.創(chuàng)建數(shù)據(jù)集

from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName('Windowfunction').enableHiveSupport().getOrCreate()
import pyspark.sql.functions 
# 原始數(shù)據(jù) 
test = spark.createDataFrame([('001','1',100,87,67,83,98), ('002','2',87,81,90,83,83), ('003','3',86,91,83,89,63),
                            ('004','2',65,87,94,73,88), ('005','1',76,62,89,81,98), ('006','3',84,82,85,73,99),
                            ('007','3',56,76,63,72,87), ('008','1',55,62,46,78,71), ('009','2',63,72,87,98,64)],
                             ['number','class','language','math','english','physic','chemical'])

#查看原始數(shù)據(jù)
test.show()
#將原始數(shù)據(jù)存入中間表
test.createOrReplaceTempView('test_temp_table')

number|class|language|math|english|physic|chemical|
+------+-----+--------+----+-------+------+--------+
|   001|    1|     100|  87|     67|    83|      98|
|   002|    2|      87|  81|     90|    83|      83|
|   003|    3|      86|  91|     83|    89|      63|
|   004|    2|      65|  87|     94|    73|      88|
|   005|    1|      76|  62|     89|    81|      98|
|   006|    3|      84|  82|     85|    73|      99|
|   007|    3|      56|  76|     63|    72|      87|
|   008|    1|      55|  62|     46|    78|      71|
|   009|    2|      63|  72|     87|    98|      64|
+------+-----+--------+----+-------+------+--------+

#將數(shù)據(jù)轉(zhuǎn)換為長數(shù)據(jù)
# 逆透視Unpivot
test_long =test.selectExpr("`number`","`class`",
                        "stack(5, 'language', `language`,'math', `math`, 'english', `english`, 'physic', `physic`,'chemical', `chemical`) as (`subject`,`grade`)").orderBy(["`class`", "`number`"]) 

test_long.show()
test_long.createOrReplaceTempView('test_long_temp_table')

+------+-----+--------+-----+
|number|class| subject|grade|
+------+-----+--------+-----+
|   001|    1| english|   67|
|   001|    1|language|  100|
|   001|    1|  physic|   83|
|   001|    1|    math|   87|
|   001|    1|chemical|   98|
|   005|    1|chemical|   98|
|   005|    1| english|   89|
|   005|    1|  physic|   81|
|   005|    1|    math|   62|
|   005|    1|language|   76|
|   008|    1|  physic|   78|
|   008|    1|    math|   62|
|   008|    1|chemical|   71|
|   008|    1|language|   55|
|   008|    1| english|   46|
|   002|    2|language|   87|
|   002|    2|    math|   81|
|   002|    2|  physic|   83|
|   002|    2|chemical|   83|
|   002|    2| english|   90|
+------+-----+--------+-----+
only showing top 20 rows

二.聚合函數(shù)

聚合函數(shù)也可用于窗口函數(shù)當(dāng)中，用法和專用窗口函數(shù)相同。

聚合函數(shù)sum、avg、count、max、min都是針對自身記錄以及自身記錄以上的所有數(shù)據(jù)進行計算的。

聚合函數(shù)作為窗口函數(shù)，可以在每一行的數(shù)據(jù)里直觀看到截止到本行數(shù)據(jù)，統(tǒng)計數(shù)據(jù)是多少，比如：按照時間的順序，計算各時期的銷售總額就需要用到這種累計的統(tǒng)計方法。同時也可以看出每一行數(shù)據(jù)對整體數(shù)據(jù)的影響。聚合函數(shù)的開窗和專用的窗口函數(shù)是一致的，其形式為：

?窗口函數(shù)? over (partition by ?用于分組的列名? order by ?用于排序的列名?)

聚合函數(shù)的窗口函數(shù)中，加不加order by,order by的列名是否是用于分組的列名，這些情況都會影響到最終的結(jié)果，下面我們分別來討論各種不同的情況。

2.1 窗口函數(shù)有無order by的區(qū)別

2.1.1 有order by且order by的字段不是用于分組的字段

這種情況下得到的結(jié)果是每個partition的累加的結(jié)果

https://zhuanlan.zhihu.com/p/104402435

test_sum=spark.sql("""
select *,sum(grade)over(partition by class,subject order by number) total_grade,
avg(grade)over(partition by subject,class  order by number) avg_grade,
count(grade)over(partition by subject,class  order by number) total_classmate,
max(grade)over(partition by subject,class  order by number) max_grade,
min(grade)over(partition by subject,class  order by number) min_grade
from test_long_temp_table
""")
test_sum.show()

number|class| subject|grade|total_grade|        avg_grade|total_classmate|max_grade|min_grade|
+------+-----+--------+-----+-----------+-----------------+---------------+---------+---------+
|   001|    1| english|   67|         67|             67.0|              1|       67|       67|
|   005|    1| english|   89|        156|             78.0|              2|       89|       67|
|   008|    1| english|   46|        202|67.33333333333333|              3|       89|       46|
|   002|    2| english|   90|         90|             90.0|              1|       90|       90|
|   004|    2| english|   94|        184|             92.0|              2|       94|       90|
|   009|    2| english|   87|        271|90.33333333333333|              3|       94|       87|
|   003|    3|chemical|   63|         63|             63.0|              1|       63|       63|
|   006|    3|chemical|   99|        162|             81.0|              2|       99|       63|
|   007|    3|chemical|   87|        249|             83.0|              3|       99|       63|
|   003|    3|    math|   91|         91|             91.0|              1|       91|       91|
|   006|    3|    math|   82|        173|             86.5|              2|       91|       82|
|   007|    3|    math|   76|        249|             83.0|              3|       91|       76|
|   001|    1|    math|   87|         87|             87.0|              1|       87|       87|
|   005|    1|    math|   62|        149|             74.5|              2|       87|       62|
|   008|    1|    math|   62|        211|70.33333333333333|              3|       87|       62|
|   002|    2|    math|   81|         81|             81.0|              1|       81|       81|
|   004|    2|    math|   87|        168|             84.0|              2|       87|       81|
|   009|    2|    math|   72|        240|             80.0|              3|       87|       72|
|   003|    3|  physic|   89|         89|             89.0|              1|       89|       89|
|   006|    3|  physic|   73|        162|             81.0|              2|       89|       73|
+------+-----+--------+-----+-----------+-----------------+---------------+---------+---------+
only showing top 20 rows

2.1.2 有order by且order by的字段是用于分組的字段

該情況下得到的數(shù)據(jù)是每個partition的總和而不是累加

test_sum_1=spark.sql("""
select *,
sum(grade)over(partition by subject,class  order by class) total_grade,
avg(grade)over(partition by subject,class  order by class) avg_grade,
count(grade)over(partition by subject,class  order by class) total_classmate,
max(grade)over(partition by subject,class  order by class) max_grade,
min(grade)over(partition by subject,class  order by class) min_grade
from test_long_temp_table
""")
test_sum_1.show()

number|class| subject|grade|total_grade|        avg_grade|total_classmate|max_grade|min_grade|
+------+-----+--------+-----+-----------+-----------------+---------------+---------+---------+
|   003|    3|  physic|   89|        234|             78.0|              3|       89|       72|
|   006|    3|  physic|   73|        234|             78.0|              3|       89|       72|
|   007|    3|  physic|   72|        234|             78.0|              3|       89|       72|
|   002|    2|  physic|   83|        254|84.66666666666667|              3|       98|       73|
|   004|    2|  physic|   73|        254|84.66666666666667|              3|       98|       73|
|   009|    2|  physic|   98|        254|84.66666666666667|              3|       98|       73|
|   003|    3|chemical|   63|        249|             83.0|              3|       99|       63|
|   006|    3|chemical|   99|        249|             83.0|              3|       99|       63|
|   007|    3|chemical|   87|        249|             83.0|              3|       99|       63|
|   003|    3|    math|   91|        249|             83.0|              3|       91|       76|
|   006|    3|    math|   82|        249|             83.0|              3|       91|       76|
|   007|    3|    math|   76|        249|             83.0|              3|       91|       76|
|   001|    1| english|   67|        202|67.33333333333333|              3|       89|       46|
|   005|    1| english|   89|        202|67.33333333333333|              3|       89|       46|
|   008|    1| english|   46|        202|67.33333333333333|              3|       89|       46|
|   002|    2|    math|   81|        240|             80.0|              3|       87|       72|
|   004|    2|    math|   87|        240|             80.0|              3|       87|       72|
|   009|    2|    math|   72|        240|             80.0|              3|       87|       72|
|   002|    2|language|   87|        215|71.66666666666667|              3|       87|       63|
|   004|    2|language|   65|        215|71.66666666666667|              3|       87|       63|
+------+-----+--------+-----+-----------+-----------------+---------------+---------+---------+
only showing top 20 rows

2.1.3 有partition by無order by

該情況下，sum()over()得到的數(shù)據(jù)是每個partition的總和而不是累加，和第二種情況是一致的。

test_sum_2=spark.sql("""
select *,
sum(grade)over(partition by subject,class) total_grade,
avg(grade)over(partition by subject,class) avg_grade,
count(grade)over(partition by subject,class) total_classmate,
max(grade)over(partition by subject,class ) max_grade,
min(grade)over(partition by subject,class) min_grade
from test_long_temp_table
""")
test_sum_2.show()

number|class| subject|grade|total_grade|        avg_grade|total_classmate|max_grade|min_grade|
+------+-----+--------+-----+-----------+-----------------+---------------+---------+---------+
|   003|    3|  physic|   89|        234|             78.0|              3|       89|       72|
|   006|    3|  physic|   73|        234|             78.0|              3|       89|       72|
|   007|    3|  physic|   72|        234|             78.0|              3|       89|       72|
|   002|    2|  physic|   83|        254|84.66666666666667|              3|       98|       73|
|   004|    2|  physic|   73|        254|84.66666666666667|              3|       98|       73|
|   009|    2|  physic|   98|        254|84.66666666666667|              3|       98|       73|
|   003|    3|chemical|   63|        249|             83.0|              3|       99|       63|
|   006|    3|chemical|   99|        249|             83.0|              3|       99|       63|
|   007|    3|chemical|   87|        249|             83.0|              3|       99|       63|
|   003|    3|    math|   91|        249|             83.0|              3|       91|       76|
|   006|    3|    math|   82|        249|             83.0|              3|       91|       76|
|   007|    3|    math|   76|        249|             83.0|              3|       91|       76|
|   001|    1| english|   67|        202|67.33333333333333|              3|       89|       46|
|   005|    1| english|   89|        202|67.33333333333333|              3|       89|       46|
|   008|    1| english|   46|        202|67.33333333333333|              3|       89|       46|
|   002|    2|    math|   81|        240|             80.0|              3|       87|       72|
|   004|    2|    math|   87|        240|             80.0|              3|       87|       72|
|   009|    2|    math|   72|        240|             80.0|              3|       87|       72|
|   002|    2|language|   87|        215|71.66666666666667|              3|       87|       63|
|   004|    2|language|   65|        215|71.66666666666667|              3|       87|       63|
+------+-----+--------+-----+-----------+-----------------+---------------+---------+---------+
only showing top 20 rows

2.2 窗口函數(shù)的平均移動

select *, avg(成績) over (order by 學(xué)號 rows 2 preceding) as current_avg from 班級表;

rows和preceding這兩個關(guān)鍵字，是“之前~行”的意思，上面的句子中，是之前2行。也就是得到的結(jié)果是自身記錄及前2行的平均（相對應(yīng)的preceding是following）

test_preceding=spark.sql("""
select *,
sum(grade)over(partition by subject order by number rows 2 preceding) total_preceding_grade
from test_long_temp_table
""")
test_preceding.show()

number|class| subject|grade|total_preceding_grade|
+------+-----+--------+-----+---------------------+
|   001|    1|  physic|   83|                   83|
|   002|    2|  physic|   83|                  166|
|   003|    3|  physic|   89|                  255|
|   004|    2|  physic|   73|                  245|
|   005|    1|  physic|   81|                  243|
|   006|    3|  physic|   73|                  227|
|   007|    3|  physic|   72|                  226|
|   008|    1|  physic|   78|                  223|
|   009|    2|  physic|   98|                  248|
|   001|    1|chemical|   98|                   98|
|   002|    2|chemical|   83|                  181|
|   003|    3|chemical|   63|                  244|
|   004|    2|chemical|   88|                  234|
|   005|    1|chemical|   98|                  249|
|   006|    3|chemical|   99|                  285|
|   007|    3|chemical|   87|                  284|
|   008|    1|chemical|   71|                  257|
|   009|    2|chemical|   64|                  222|
|   001|    1|language|  100|                  100|
|   002|    2|language|   87|                  187|
+------+-----+--------+-----+---------------------+
only showing top 20 rows

二.專用窗口函數(shù)

專用窗口函數(shù)包括rank() over,dense_rank() over,row_number() over()

1.rank() over

查出指定條件后的進行排名。特點是，加入是對學(xué)生排名，使用這個函數(shù)，成績相同的兩名是并列，下一位同學(xué)空出所占的名次。

test_rank=spark.sql("""
select *,
rank()over(partition by subject order by grade desc) rank
from test_long_temp_table
""")

test_rank.show()

number|class| subject|grade|rank|
+------+-----+--------+-----+----+
|   009|    2|  physic|   98|   1|
|   003|    3|  physic|   89|   2|
|   001|    1|  physic|   83|   3|
|   002|    2|  physic|   83|   3|
|   005|    1|  physic|   81|   5|
|   008|    1|  physic|   78|   6|
|   004|    2|  physic|   73|   7|
|   006|    3|  physic|   73|   7|
|   007|    3|  physic|   72|   9|
|   006|    3|chemical|   99|   1|
|   001|    1|chemical|   98|   2|
|   005|    1|chemical|   98|   2|
|   004|    2|chemical|   88|   4|
|   007|    3|chemical|   87|   5|
|   002|    2|chemical|   83|   6|
|   008|    1|chemical|   71|   7|
|   009|    2|chemical|   64|   8|
|   003|    3|chemical|   63|   9|
|   001|    1|language|  100|   1|
|   002|    2|language|   87|   2|
+------+-----+--------+-----+----+
only showing top 20 rows

2.dense_rank() over

與rank() over的區(qū)別是，兩名學(xué)生的成績并列以后，下一位同學(xué)并不空出所占的名次。

test_dense_rank=spark.sql("""
select *,
dense_rank()over(partition by subject order by grade desc) rank
from test_long_temp_table
""")

test_dense_rank.show()

number|class| subject|grade|rank|
+------+-----+--------+-----+----+
|   009|    2|  physic|   98|   1|
|   003|    3|  physic|   89|   2|
|   001|    1|  physic|   83|   3|
|   002|    2|  physic|   83|   3|
|   005|    1|  physic|   81|   4|
|   008|    1|  physic|   78|   5|
|   004|    2|  physic|   73|   6|
|   006|    3|  physic|   73|   6|
|   007|    3|  physic|   72|   7|
|   006|    3|chemical|   99|   1|
|   001|    1|chemical|   98|   2|
|   005|    1|chemical|   98|   2|
|   004|    2|chemical|   88|   3|
|   007|    3|chemical|   87|   4|
|   002|    2|chemical|   83|   5|
|   008|    1|chemical|   71|   6|
|   009|    2|chemical|   64|   7|
|   003|    3|chemical|   63|   8|
|   001|    1|language|  100|   1|
|   002|    2|language|   87|   2|
+------+-----+--------+-----+----+
only showing top 20 rows

3.row_number() over

這個函數(shù)不需要考慮是否并列，哪怕根據(jù)條件查詢出來的數(shù)值相同也會進行連續(xù)排名!

test_row_number=spark.sql("""
select *,
row_number()over(partition by subject order by grade desc) rank
from test_long_temp_table
""")

test_row_number.show()

number|class| subject|grade|rank|
+------+-----+--------+-----+----+
|   009|    2|  physic|   98|   1|
|   003|    3|  physic|   89|   2|
|   001|    1|  physic|   83|   3|
|   002|    2|  physic|   83|   4|
|   005|    1|  physic|   81|   5|
|   008|    1|  physic|   78|   6|
|   006|    3|  physic|   73|   7|
|   004|    2|  physic|   73|   8|
|   007|    3|  physic|   72|   9|
|   006|    3|chemical|   99|   1|
|   005|    1|chemical|   98|   2|
|   001|    1|chemical|   98|   3|
|   004|    2|chemical|   88|   4|
|   007|    3|chemical|   87|   5|
|   002|    2|chemical|   83|   6|
|   008|    1|chemical|   71|   7|
|   009|    2|chemical|   64|   8|
|   003|    3|chemical|   63|   9|
|   001|    1|language|  100|   1|
|   002|    2|language|   87|   2|
+------+-----+--------+-----+----+
only showing top 20 rows

三.Lead和Lag函數(shù)

lag和lead函數(shù)可以在同一次查詢中取出同一字段的前N行數(shù)據(jù)（lag）和后N行數(shù)據(jù)（lead）。

lead和lag函數(shù)應(yīng)用場景較為廣泛，在計算前一天、前一個月以及后一天、后一個月等時間差時，我們通常會使用自連接來求差值，但是自連接有時候會出現(xiàn)重

復(fù)需要額外處理，而通過lag和lead函數(shù)正好能夠?qū)崿F(xiàn)這一功能。

3.1 Lead-后一行

語法：LEAD ( scalar_expression [ ,offset ] , [ default ] ) OVER ( [ partition_by_clause ] order_by_clause )

scalar_expression，要返回的值基于指定的偏移量。這是一個返回單個（標量）值的任何類型的表達式。scalar_expression 不能為分析函數(shù)。簡單地

來說就是，要取的列。

offset默認值為1， offset 可以是列、子查詢或其他求值為正整數(shù)的表達式，或者可隱式轉(zhuǎn)換為bigint。offset 不能是負數(shù)值或分析函數(shù)。簡單地來說就

是，取偏移后的第幾行數(shù)據(jù)

default默認值為NULL， offset 可以是列、子查詢或其他求值為正整數(shù)的表達式，或者可隱式轉(zhuǎn)換為bigint。offset不能是負數(shù)值或分析函數(shù)。簡單地來

說就是，沒有符合條件的默認值

3.2 Lag-前一行

語法：Lag ( scalar_expression [ ,offset ] , [ default ] ) OVER ( [ partition_by_clause ] order_by_clause )

test_lead_lag=spark.sql("""
select *,
lead(grade)over(partition by number order by grade desc) lead_grade,
lag(grade)over(partition by number order by grade desc) lag_grade
from test_long_temp_table
""")

test_lead_lag.show()

number|class| subject|grade|lead_grade|lag_grade|
+------+-----+--------+-----+----------+---------+
|   009|    2|  physic|   98|        87|     null|
|   009|    2| english|   87|        72|       98|
|   009|    2|    math|   72|        64|       87|
|   009|    2|chemical|   64|        63|       72|
|   009|    2|language|   63|      null|       64|
|   006|    3|chemical|   99|        85|     null|
|   006|    3| english|   85|        84|       99|
|   006|    3|language|   84|        82|       85|
|   006|    3|    math|   82|        73|       84|
|   006|    3|  physic|   73|      null|       82|
|   003|    3|    math|   91|        89|     null|
|   003|    3|  physic|   89|        86|       91|
|   003|    3|language|   86|        83|       89|
|   003|    3| english|   83|        63|       86|
|   003|    3|chemical|   63|      null|       83|
|   005|    1|chemical|   98|        89|     null|
|   005|    1| english|   89|        81|       98|
|   005|    1|  physic|   81|        76|       89|
|   005|    1|language|   76|        62|       81|
|   005|    1|    math|   62|      null|       76|
+------+-----+--------+-----+----------+---------+
only showing top 20 rows

參考鏈接

https://zhuanlan.zhihu.com/p/104402435

https://zhuanlan.zhihu.com/p/137111389

https://blog.csdn.net/sinat_28984567/article/details/80262604

https://zhidao.baidu.com/question/1667694010390213187.html

https://www.cnblogs.com/scwbky/p/9558203.html

相關(guān)閱讀：

寫在1024：一名數(shù)據(jù)分析師的修煉之路
數(shù)據(jù)科學(xué)系列：sklearn庫主要模塊功能簡介
數(shù)據(jù)科學(xué)系列：seaborn入門詳細教程
數(shù)據(jù)科學(xué)系列：pandas入門詳細教程
數(shù)據(jù)科學(xué)系列：matplotlib入門詳細教程
數(shù)據(jù)科學(xué)系列：numpy入門詳細教程

被問到窗口函數(shù)不知所措？一文教會你數(shù)據(jù)分析師常用的窗口函數(shù)！