面試官發(fā)難,如何用 SQL 來查詢 Elasticsearch 中的數(shù)據(jù)?
點(diǎn)擊“開發(fā)者技術(shù)前線”,選擇“星標(biāo)??”
讓一部分開發(fā)者看到未來

回復(fù)“666”,獲取一份技術(shù)人專屬大禮包
參考:https://elasticstack.blog.csdn.net/article/details/105199768
在今天的文章里,我們將簡單介紹一下如何使用 Elasticsearch SQL來對(duì)我們的數(shù)據(jù)進(jìn)行查詢。在之前的一篇文章“Kibana:Canvas入門”里也有 Elasticsearch SQL 的具體用例介紹。
安裝
對(duì)于還沒安裝好自己的 Elasticsearch 的開發(fā)者來說,你可以參閱我之前的文章“Elastic:菜鳥上手指南”來進(jìn)行安裝自己的 Elasticsearch 及 Kibana。在這里我就不累述了。
準(zhǔn)備數(shù)據(jù)
我們首先打開 Kibana:

點(diǎn)擊上面的“Load a data set and a Kibana dashboard”:

點(diǎn)擊上面的 Add data,這樣我們就可以完成實(shí)驗(yàn)數(shù)據(jù)的導(dǎo)入了。在 Elasticsearch 中,我們會(huì)找到一個(gè)叫 kibana_sample_data_flights 的索引。
SQL 實(shí)操
檢索 Elasticsearch schema 信息:DSL vs SQL
首先,我們確定表/索引的 schema 以及可供我們使用的字段。我們將通過 REST 界面執(zhí)行此操作:
POST /_sql{"query": """DESCRIBE kibana_sample_data_flights"""}上面命令的結(jié)果:{"columns" : [{"name" : "column","type" : "keyword"},{"name" : "type","type" : "keyword"},{"name" : "mapping","type" : "keyword"}],"rows" : [["AvgTicketPrice","REAL","float"],["Cancelled","BOOLEAN","boolean"],["Carrier","VARCHAR","keyword"],["Dest","VARCHAR","keyword"],["DestAirportID","VARCHAR","keyword"],["DestCityName","VARCHAR","keyword"],["DestCountry","VARCHAR","keyword"],["DestLocation","GEOMETRY","geo_point"],["DestRegion","VARCHAR","keyword"],["DestWeather","VARCHAR","keyword"],["DistanceKilometers","REAL","float"],["DistanceMiles","REAL","float"],["FlightDelay","BOOLEAN","boolean"],["FlightDelayMin","INTEGER","integer"],["FlightDelayType","VARCHAR","keyword"],["FlightNum","VARCHAR","keyword"],["FlightTimeHour","VARCHAR","keyword"],["FlightTimeMin","REAL","float"],["Origin","VARCHAR","keyword"],["OriginAirportID","VARCHAR","keyword"],["OriginCityName","VARCHAR","keyword"],["OriginCountry","VARCHAR","keyword"],["OriginLocation","GEOMETRY","geo_point"],["OriginRegion","VARCHAR","keyword"],["OriginWeather","VARCHAR","keyword"],["dayOfWeek","INTEGER","integer"],["timestamp","TIMESTAMP","datetime"]]}
也可以通過 url 參數(shù) format = txt 以表格形式格式化以上響應(yīng)。例如:
POST /_sql?format=txt{"query": "DESCRIBE kibana_sample_data_flights"}
上面命令查詢的結(jié)果是:
column | type | mapping------------------+---------------+---------------AvgTicketPrice |REAL |floatCancelled |BOOLEAN |booleanCarrier |VARCHAR |keywordDest |VARCHAR |keywordDestAirportID |VARCHAR |keywordDestCityName |VARCHAR |keywordDestCountry |VARCHAR |keywordDestLocation |GEOMETRY |geo_pointDestRegion |VARCHAR |keywordDestWeather |VARCHAR |keywordDistanceKilometers|REAL |floatDistanceMiles |REAL |floatFlightDelay |BOOLEAN |booleanFlightDelayMin |INTEGER |integerFlightDelayType |VARCHAR |keywordFlightNum |VARCHAR |keywordFlightTimeHour |VARCHAR |keywordFlightTimeMin |REAL |floatOrigin |VARCHAR |keywordOriginAirportID |VARCHAR |keywordOriginCityName |VARCHAR |keywordOriginCountry |VARCHAR |keywordOriginLocation |GEOMETRY |geo_pointOriginRegion |VARCHAR |keywordOriginWeather |VARCHAR |keyworddayOfWeek |INTEGER |integertimestamp |TIMESTAMP |datetime
是不是感覺回到 SQL 時(shí)代啊:)
向前邁進(jìn),只要提供來自 REST api 的示例響應(yīng),我們就會(huì)使用上面顯示的表格響應(yīng)結(jié)構(gòu)。要通過控制臺(tái)實(shí)現(xiàn)相同的查詢,需要使用以下命令登錄:
./bin/elasticsearch-sql-cli http://localhost:9200我們可在屏幕上看到如下的畫面:

太神奇了。我們直接看到 SQL 的命令提示符了。在上面的命令行中,我們打入如下的命令:
DESCRIBE kibana_sample_data_flights;
這個(gè)結(jié)果和我們?cè)贙ibana中得到的結(jié)果是一樣的。
上面的schema也會(huì)隨對(duì)在 SELECT 子句中顯示的字段的任何查詢一起返回,從而為任何潛在的驅(qū)動(dòng)程序提供格式化或?qū)Y(jié)果進(jìn)行操作所需的必要類型信息。例如,考慮帶有 LIMIT 子句的簡單 SELECT,以使響應(yīng)簡短。默認(rèn)情況下,我們返回1000行。
我們發(fā)現(xiàn)索引的名字 kibana_sample_data_flights 比較長,為了方便,我們來創(chuàng)建一個(gè)alias:
PUT /kibana_sample_data_flights/_alias/flights這樣在以后的操作中,當(dāng)我們使用flights的時(shí)候,其實(shí)也就是對(duì)索引kibana_sample_data_flights 進(jìn)行操作。
我們執(zhí)行如下的命令:
POST /_sql?format=txt{"query": "SELECT FlightNum FROM flights LIMIT 1"}
顯示結(jié)果:
FlightNum---------------9HY9SWR
相同的REST請(qǐng)求/響應(yīng)由JDBC驅(qū)動(dòng)程序和控制臺(tái)使用:
SELECT OriginCountry, OriginCityName FROM flights LIMIT 1;OriginCountry | OriginCityName---------------+-----------------DE |Frankfurt am Main
請(qǐng)注意,如果在任何時(shí)候請(qǐng)求的字段都不存在(區(qū)分大小寫),則表格式和強(qiáng)類型存儲(chǔ)區(qū)的語義意味著將返回錯(cuò)誤-這與 Elasticsearch 行為不同,在該行為中,根本不會(huì)返回該字段。例如,將上面的內(nèi)容修改為使用字段“OrigincityName”而不是“OriginCityName”會(huì)產(chǎn)生有用的錯(cuò)誤消息:
SELECT OriginCountry, OrigincityName FROM flights LIMIT 1;Bad request [Found 1 problem(s)line 1:23: Unknown column [OrigincityName], did you mean any of [OriginCityName, DestCityName]?]
同樣,如果我們嘗試在不兼容的字段上使用函數(shù)或表達(dá)式,則會(huì)出現(xiàn)相應(yīng)的錯(cuò)誤。通常,分析器在驗(yàn)證 AST 時(shí)會(huì)較早失敗。為了實(shí)現(xiàn)這一點(diǎn),Elasticsearch 必須了解每個(gè)字段的索引映射和功能。因此,任何具有安全性訪問 SQL 接口的客戶端都需要適當(dāng)?shù)臋?quán)限。
如果我們繼續(xù)提供每一個(gè)請(qǐng)求和相應(yīng)的回復(fù),我們將最終獲得一篇冗長的博客文章!為了簡潔起見,以下是一些帶有感興趣的注釋的日益復(fù)雜的查詢。
使用 WHERE 及 ORDER BY 來 SELECT
“找到飛行時(shí)間超過5小時(shí)的美國最長10班航班。”
POST /_sql?format=txt{"query": """SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry='US' ORDER BY FlightTimeHour DESC LIMIT 10"""}
顯示結(jié)果是:
OriginCityName | DestCityName---------------+-------------------Chicago |OsloCleveland |SeoulDenver |Chitose / TomakomaiNashville |VeronaMinneapolis |TokyoPortland |TrevisoSpokane |ViennaKansas City |ZurichKansas City |ShanghaiLos Angeles |Zurich
限制行數(shù)的運(yùn)算符因 SQL 實(shí)現(xiàn)而異。對(duì)于 Elasticsearch SQL,我們?cè)趯?shí)現(xiàn)LIMIT運(yùn)算符時(shí)與 Postgresql/Mysql 保持一致。
Math
只是一些隨機(jī)數(shù)字...
sql> SELECT ((1 + 3) * 1.5 / (7 - 6)) * 2 AS random;random---------------12.0
這代表服務(wù)器端對(duì)功能執(zhí)行某些后處理的示例。沒有等效的Elasticsearch DSL查詢。
Functions & Expressions
“在2月份之后查找所有航班,該航班的飛行時(shí)間大于5小時(shí),并且按照時(shí)間最長來排序。”
POST /_sql?format=txt{"query": """SELECT MONTH_OF_YEAR(timestamp), OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 1 AND MONTH_OF_YEAR(timestamp) > 2 ORDER BY FlightTimeHour DESC LIMIT 10"""}
顯示結(jié)果是:
MONTH_OF_YEAR(timestamp)|OriginCityName | DestCityName------------------------+---------------+---------------4 |Chicago |Oslo4 |Osaka |Spokane4 |Quito |Tucson4 |Shanghai |Stockholm5 |Tokyo |Venice5 |Tokyo |Venice5 |Tokyo |Venice5 |Buenos Aires |Treviso5 |Amsterdam |Birmingham5 |Edmonton |Milan
這些功能通常需要在 Elasticsearch 中運(yùn)用 Painless 變形才能達(dá)到等效的效果,而 SQL 的功能聲明避免任何腳本編寫。還要注意我們?nèi)绾卧赪HERE和SELECT子句中使用該函數(shù)。WHERE 子句組件被下推到 Elasticsearch,因?yàn)樗绊懡Y(jié)果計(jì)數(shù)。SELECT 函數(shù)由演示中的服務(wù)器端插件處理。
請(qǐng)注意,可用功能列表可通過“SHOW FUNCTIONS”檢索
SHOW FUNCTIONS;name | type-----------------+---------------AVG |AGGREGATECOUNT |AGGREGATEFIRST |AGGREGATEFIRST_VALUE |AGGREGATELAST |AGGREGATELAST_VALUE |AGGREGATEMAX |AGGREGATE...
將其與我們之前的數(shù)學(xué)能力相結(jié)合,我們可以開始制定查詢,對(duì)于大多數(shù)DSL用戶來說,查詢將非常復(fù)雜。
“找出最快的2個(gè)航班(速度)的距離和平均速度,這些航班在星期一,星期二或星期三上午9點(diǎn)至11點(diǎn)之間離開,并且距離超過500公里。將距離和速度四舍五入到最接近的整數(shù)。如果速度相等,請(qǐng)先顯示最長的時(shí)間。”
首先我們?cè)谏厦娴?DESCRIBE kibana_sample_data_flights 命令的輸出中,我們可以看到FlightTimeHour 是一個(gè) keyword。這個(gè)顯然是不對(duì)的,因?yàn)樗且粋€(gè)數(shù)值。也許在最初的設(shè)計(jì)時(shí)這么想的。我們需要把這個(gè)字段改為 float 類型的數(shù)據(jù)。
PUT flight1{"mappings": {"properties": {"AvgTicketPrice": {"type": "float"},"Cancelled": {"type": "boolean"},"Carrier": {"type": "keyword"},"Dest": {"type": "keyword"},"DestAirportID": {"type": "keyword"},"DestCityName": {"type": "keyword"},"DestCountry": {"type": "keyword"},"DestLocation": {"type": "geo_point"},"DestRegion": {"type": "keyword"},"DestWeather": {"type": "keyword"},"DistanceKilometers": {"type": "float"},"DistanceMiles": {"type": "float"},"FlightDelay": {"type": "boolean"},"FlightDelayMin": {"type": "integer"},"FlightDelayType": {"type": "keyword"},"FlightNum": {"type": "keyword"},"FlightTimeHour": {"type": "float"},"FlightTimeMin": {"type": "float"},"Origin": {"type": "keyword"},"OriginAirportID": {"type": "keyword"},"OriginCityName": {"type": "keyword"},"OriginCountry": {"type": "keyword"},"OriginLocation": {"type": "geo_point"},"OriginRegion": {"type": "keyword"},"OriginWeather": {"type": "keyword"},"dayOfWeek": {"type": "integer"},"timestamp": {"type": "date"}}}}
我們需要 reindex 這個(gè)索引。
POST _reindex{"source": {"index": "flights"},"dest": {"index": "flight1"}}
那么現(xiàn)在 flight1 的數(shù)據(jù)中,F(xiàn)lightTimeHour 字段將會(huì)是一個(gè) float 的類型。我們?cè)俅沃匦略O(shè)置 alias 為 flights:
POST _aliases{"actions": [{"add": {"index": "flight1","alias": "flights"}},{"remove": {"index": "kibana_sample_data_flights","alias": "flights"}}]}
那么現(xiàn)在 flights 將是指向 flight1 的一個(gè) alias。
我們使用如下的 SQL 語句來查詢:
sql> SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND HOUR_OF_DAY(timestamp) >=9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2;timestamp | FlightNum |OriginCityName | DestCityName | distance | speed | day_of_week------------------------+---------------+---------------+---------------+---------------+---------------+---------------2020-05-17T10:53:52.000Z|LAJSKLT |Guangzhou |Lima |11398.0 |783.0 |12020-04-27T09:30:39.000Z|VLUDO2H |Buenos Aires |Moscow |8377.0 |783.0 |2
一個(gè)相當(dāng)復(fù)雜且奇怪的問題,但希望您能明白這一點(diǎn)。還要注意我們?nèi)绾蝿?chuàng)建字段別名并在ORDER BY 子句中引用它們。
還要注意,不需要在 SELECT 子句中指定 WHERE 和 ORDER BY 中使用的所有字段。這可能與您過去使用的 SQL 實(shí)現(xiàn)不同。例如,以下內(nèi)容完全正確:
POST /_sql{"query":"SELECT timestamp, FlightNum FROM flights WHERE AvgTicketPrice > 500 ORDER BY AvgTicketPrice"}
它顯示:
{"columns" : [{"name" : "timestamp","type" : "datetime"},{"name" : "FlightNum","type" : "text"}],"rows" : [["2020-04-26T09:04:20.000Z","QG5DXD3"],["2020-05-02T23:18:27.000Z","NXA71BT"],["2020-04-17T01:55:18.000Z","VU8K9DM"],["2020-04-24T08:46:45.000Z","UM8IKF8"],...]
我們都曾嘗試過要在 Elasticsearch DSL 中表達(dá)的 SQL 查詢,或者想知道它是否是最佳的。新 SQL 接口的引人注目的功能之一是它能夠協(xié)助 Elasticsearch 的新采用者解決此類問題。使用 REST 接口,我們只需將/translate附加到“sql”端點(diǎn),即可獲取驅(qū)動(dòng)程序?qū)l(fā)出的Elasticsearch 查詢。
讓我們考慮一下以前的一些查詢:
POST /_sql/translate{"query": "SELECT OriginCityName, DestCityName FROM flights WHERE FlightTimeHour > 5 AND OriginCountry='US' ORDER BY FlightTimeHour DESC LIMIT 10"}
對(duì)于任何有經(jīng)驗(yàn)的 Elasticsearch 用戶,等效的 DSL 都應(yīng)該是顯而易見的:
{"size" : 10,"query" : {"bool" : {"must" : [{"range" : {"FlightTimeHour" : {"from" : 5,"to" : null,"include_lower" : false,"include_upper" : false,"boost" : 1.0}}},{"term" : {"OriginCountry.keyword" : {"value" : "US","boost" : 1.0}}}],"adjust_pure_negative" : true,"boost" : 1.0}},"_source" : {"includes" : ["OriginCityName","DestCityName"],"excludes" : [ ]},"sort" : [{"FlightTimeHour" : {"order" : "desc","missing" : "_first","unmapped_type" : "float"}}]}
WHERE 子句將按您期望的那樣轉(zhuǎn)換為 range 和 term 查詢。請(qǐng)注意,子字段的OriginCountry.keyword變體如何用于與父代 OriginCountry(文本類型)的精確匹配。不需要用戶知道基礎(chǔ)映射的行為差異-正確的字段類型將會(huì)被自動(dòng)選擇。有趣的是,該接口嘗試通過在 _source 上使用 docvalue_fields 來優(yōu)化檢索性能,例如適用于啟用了 doc 值的確切類型(數(shù)字,日期,關(guān)鍵字)。我們可以依靠 Elasticsearch SQL 為指定的查詢生成最佳的 DSL。
現(xiàn)在考慮我們上次使用的最復(fù)雜的查詢:
POST /_sql/translate{"query": """SELECT timestamp, FlightNum, OriginCityName, DestCityName, ROUND(DistanceMiles) AS distance, ROUND(DistanceMiles/FlightTimeHour) AS speed, DAY_OF_WEEK(timestamp) AS day_of_week FROM flights WHERE DAY_OF_WEEK(timestamp) >= 0 AND DAY_OF_WEEK(timestamp) <= 2 AND HOUR_OF_DAY(timestamp) >=9 AND HOUR_OF_DAY(timestamp) <= 10 ORDER BY speed DESC, distance DESC LIMIT 2"""}
上面的響應(yīng)為:
{"size" : 2,"query" : {"bool" : {"must" : [{"script" : {"script" : {"source" : "InternalSqlScriptUtils.nullSafeFilter(InternalSqlScriptUtils.and(InternalSqlScriptUtils.gte(InternalSqlScriptUtils.dateTimeChrono(InternalSqlScriptUtils.docValue(doc,params.v0), params.v1, params.v2), params.v3), InternalSqlScriptUtils.lte(InternalSqlScriptUtils.dateTimeChrono(InternalSqlScriptUtils.docValue(doc,params.v4), params.v5, params.v6), params.v7)))","lang" : "painless","params" : {"v0" : "timestamp","v1" : "Z","v2" : "HOUR_OF_DAY","v3" : 9,"v4" : "timestamp","v5" : "Z","v6" : "HOUR_OF_DAY","v7" : 10}},"boost" : 1.0}},{"script" : {"script" : {"source" : "InternalSqlScriptUtils.nullSafeFilter(InternalSqlScriptUtils.and(InternalSqlScriptUtils.gte(InternalSqlScriptUtils.dayOfWeek(InternalSqlScriptUtils.docValue(doc,params.v0), params.v1), params.v2), InternalSqlScriptUtils.lte(InternalSqlScriptUtils.dayOfWeek(InternalSqlScriptUtils.docValue(doc,params.v3), params.v4), params.v5)))","lang" : "painless","params" : {"v0" : "timestamp","v1" : "Z","v2" : 0,"v3" : "timestamp","v4" : "Z","v5" : 2}},"boost" : 1.0}}],"adjust_pure_negative" : true,"boost" : 1.0}},"_source" : {"includes" : ["FlightNum","OriginCityName","DestCityName","DistanceMiles","FlightTimeHour"],"excludes" : [ ]},"docvalue_fields" : [{"field" : "timestamp","format" : "epoch_millis"}],"sort" : [{"_script" : {"script" : {"source" : "InternalSqlScriptUtils.nullSafeSortNumeric(InternalSqlScriptUtils.round(InternalSqlScriptUtils.div(InternalSqlScriptUtils.docValue(doc,params.v0),InternalSqlScriptUtils.docValue(doc,params.v1)),params.v2))","lang" : "painless","params" : {"v0" : "DistanceMiles","v1" : "FlightTimeHour","v2" : null}},"type" : "number","order" : "desc"}},{"_script" : {"script" : {"source" : "InternalSqlScriptUtils.nullSafeSortNumeric(InternalSqlScriptUtils.round(InternalSqlScriptUtils.docValue(doc,params.v0),params.v1))","lang" : "painless","params" : {"v0" : "DistanceMiles","v1" : null}},"type" : "number","order" : "desc"}}]}
是不是覺得非常復(fù)雜啊?
我們的 WHERE 和 ORDER BY 子句已轉(zhuǎn)換為 painless 腳本,并在 Elasticsearch 提供的排序和腳本查詢中使用。這些腳本甚至被參數(shù)化以避免編譯并利用腳本緩存。
附帶說明一下,盡管以上內(nèi)容代表了 SQL 語句的最佳翻譯,但并不代表解決更廣泛問題的最佳解決方案。實(shí)際上,我們希望在索引時(shí)間對(duì)文檔中的星期幾,一天中的小時(shí)和速度進(jìn)行編碼,因此可以只使用簡單的范圍查詢。這可能比使用painless 腳本解決此特定問題的性能更高。實(shí)際上,由于這些原因,其中的某些字段實(shí)際上甚至已經(jīng)存在于文檔中。這是用戶應(yīng)注意的常見主題:盡管我們可以依靠 Elasticsearch SQL 實(shí)現(xiàn)為我們提供最佳翻譯,但它只能利用查詢中指定的字段,因此不一定能為更大的問題查詢提供最佳解決方案。為了實(shí)現(xiàn)最佳方法,需要考慮基礎(chǔ)平臺(tái)的優(yōu)勢(shì),而 _translate API 可能是此過程的第一步。
END
前線推出學(xué)習(xí)交流群,加群一定要備注:研究/工作方向+地點(diǎn)+學(xué)校/公司+昵稱(如大數(shù)據(jù)+上海+攜程+可可),根據(jù)格式備注,可更快被通過且邀請(qǐng)進(jìn)群,領(lǐng)取一份專屬學(xué)習(xí)禮包 掃碼家助理微信進(jìn)群,內(nèi)推和技術(shù)交流,大佬們零距離
