
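1. Background

I received a requirement to sync Hive data into ClickHouse. I assumed it would be trivial: our data platform already integrates DataX, and the latest DataX release supports a ClickHouse writer.

Unexpectedly, the sync turned out to be slow, at roughly 4 million rows per hour. With the amount of data in the table, there was no telling when it would finish. So began a long investigation, and in the end I chose Waterdrop.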

2. About Waterdrop

Waterdrop is a compute engine for massive data volumes in production, covering streaming, offline (batch), ETL, and aggregation workloads. It is the core project of InterestingLab, an open-source team whose central goal is to simplify and popularize big data processing for users. Waterdrop is a configuration-driven tool for large-scale streaming and offline processing, built on Spark and Flink, that requires no custom development. It is currently used in production by companies across several industries, including 360, Didi, Huawei, Weibo, Sina, Yidian Zixun, Yonghui Group, and Shuidichou.

Project: https://github.com/InterestingLab/waterdrop

Documentation: https://interestinglab.github.io/waterdrop-docs/

Quick start: https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/quick-start

Industry case studies: https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/case_study/

Plugin development: https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/developing-plugin

Design and implementation of Waterdrop: https://mp.weixin.qq.com/s/lYECVCYdKsfcL64xhWEqPg

3. Waterdrop architecture

3.1 input

3.2 filter

3.3 output
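The three stages named above form a pipeline: an input plugin reads data, zero or more filter plugins transform it, and an output plugin writes it out. The following is only a conceptual sketch of that flow in plain Python. It is not Waterdrop's code (Waterdrop runs these stages on Spark or Flink), and every name in it (`run_pipeline`, `read_rows`, `drop_missing_id`, `sink`) is hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Rows = List[Row]


def run_pipeline(
    read_input: Callable[[], Rows],
    filters: List[Callable[[Rows], Rows]],
    write_output: Callable[[Rows], None],
) -> int:
    """Run input -> filters -> output once and return the number of rows written."""
    rows = read_input()
    for apply_filter in filters:
        rows = apply_filter(rows)
    write_output(rows)
    return len(rows)


# Hypothetical stand-ins for an input, a filter, and an output plugin.
def read_rows() -> Rows:
    return [{"id": 1, "name": "a"}, {"id": None, "name": "b"}, {"id": 3, "name": "c"}]


def drop_missing_id(rows: Rows) -> Rows:
    # Keep only the rows that have an id.
    return [row for row in rows if row["id"] is not None]


sink: Rows = []  # stands in for the target table
written = run_pipeline(read_rows, [drop_missing_id], sink.extend)
print(written, sink)
```

An empty filter list passes the rows through unchanged, which mirrors a configuration whose filter block is left empty.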

4. Installation and usage

4.1 Download

https://github.com/InterestingLab/waterdrop/releases

4.2 Extract

The release is a zip archive, so extract it with unzip rather than tar:

unzip waterdrop-1.4.2-with-spark.zip

4.3 Edit the configuration files (Hive --> ClickHouse)

waterdrop-env.sh

#!/usr/bin/env bash
# Home directory of spark distribution.
SPARK_HOME=/usr/local/spark-current/

test_df.conf

spark {
  spark.app.name = "hive-ck"
  spark.executor.instances = 8
  spark.executor.cores = 2
  spark.executor.memory = "2g"
  # Needed so that Spark can read Hive tables
  spark.sql.catalogImplementation = "hive"
  spark.yarn.queue = "root.test"
}

input {
  hive {
    # Query that selects the Hive data to sync
    pre_sql = "select * from wedw_tmp.test_df"
    table_name = "test_df"
  }
}

filter {
}

output {
  clickhouse {
    host = "10.20.xxx.xxx:8123"
    database = "ck"
    clickhouse.socket_timeout = 600000
    table = "test_df"
    username = "root"
    password = "123456"
    # Rows per insert batch and number of retries on failure
    bulk_size = 50000
    retry = 3
  }
}
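The spark settings in this configuration decide how much of the cluster the job uses. As a rough worked example, assuming each executor core runs one task at a time and each task writes its own batches to ClickHouse (an assumption about how the job behaves, not something measured in this article), the values above work out as follows. The function names are made up for the illustration.

```python
def concurrent_tasks(executor_instances: int, executor_cores: int) -> int:
    """Upper bound on tasks running at once, assuming one task per core."""
    return executor_instances * executor_cores


def total_executor_memory_gb(executor_instances: int, memory_gb_per_executor: int) -> int:
    """Total executor memory requested from YARN, ignoring overhead."""
    return executor_instances * memory_gb_per_executor


# Values from the spark block: 8 executors, 2 cores each, 2g each.
print(concurrent_tasks(8, 2))          # up to 16 tasks at a time
print(total_executor_memory_gb(8, 2))  # 16 GB of executor memory

# Halving spark.executor.instances to 4 halves the number of concurrent tasks.
print(concurrent_tasks(4, 2))
```

Under that assumption, spark.executor.instances is the most direct knob for the number of simultaneous writers hitting ClickHouse.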

4.4 Start Waterdrop to sync the data

/home/pgxl/liuzc/waterdrop-1.4.2/bin/start-waterdrop.sh --master yarn --deploy-mode client --config /home/pgxl/liuzc/waterdrop-1.4.2/config/test_df.conf

4.5 Speed

About 200 million rows in roughly one hour.
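To put that figure next to the DataX number from the background section, here is a small worked calculation. It uses only the two approximate throughputs quoted in this article; the constant and function names are made up for the illustration.

```python
# Approximate throughputs quoted in this article.
DATAX_ROWS_PER_HOUR = 4_000_000          # DataX: about 4 million rows per hour
WATERDROP_ROWS_PER_HOUR = 200_000_000    # Waterdrop: about 200 million rows per hour


def hours_needed(total_rows: int, rows_per_hour: int) -> float:
    """Hours needed to sync total_rows at a constant rate."""
    return total_rows / rows_per_hour


def rows_per_second(rows_per_hour: int) -> float:
    return rows_per_hour / 3600


speedup = WATERDROP_ROWS_PER_HOUR / DATAX_ROWS_PER_HOUR
datax_hours = hours_needed(200_000_000, DATAX_ROWS_PER_HOUR)

print(f"speedup: {speedup:.0f}x")
print(f"DataX would need about {datax_hours:.0f} hours for 200 million rows")
print(f"DataX: {rows_per_second(DATAX_ROWS_PER_HOUR):.0f} rows/s")
print(f"Waterdrop: {rows_per_second(WATERDROP_ROWS_PER_HOUR):.0f} rows/s")
```

At the DataX rate the same 200 million rows would take about 50 hours, so the switch is roughly a 50x improvement for this table.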

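5. Problems you may run into

5.1 Too many parts (304). Merges are processing significantly slower than inserts

ClickHouse's background merges cannot keep up with the insert rate. One possible cause: the data being written spans multiple partitions, so each insert touches several partitions and puts heavy pressure on merging. Reducing the write concurrency helps, for example:

spark.executor.instances = 4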
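Why partition-spanning writes hurt: in MergeTree tables, each insert creates at least one new data part for every partition it touches, and those parts then have to be merged in the background. The sketch below is a deliberately simplified model of that rule, not ClickHouse code. The `day` column, the batch size of 4, and the helper names are all made up for the illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def chunk(rows: List[Row], bulk_size: int) -> List[List[Row]]:
    """Split rows into insert batches of at most bulk_size rows."""
    return [rows[i:i + bulk_size] for i in range(0, len(rows), bulk_size)]


def parts_created(batches: List[List[Row]], partition_of: Callable[[Row], str]) -> int:
    """Simplified model: one new part per distinct partition in each batch."""
    return sum(len({partition_of(row) for row in batch}) for batch in batches)


def by_day(row: Row) -> str:
    return row["day"]


days = ["2020-01-01", "2020-01-02", "2020-01-03"]
# 12 rows whose partition key changes on every row.
interleaved = [{"day": day} for _ in range(4) for day in days]
# The same 12 rows, ordered so that each batch stays inside one partition.
ordered = sorted(interleaved, key=by_day)

print(parts_created(chunk(interleaved, 4), by_day))
print(parts_created(chunk(ordered, 4), by_day))
```

In this toy model the interleaved rows produce 9 parts and the ordered rows produce 3 for the same data. Every concurrent writer adds its own parts on top, which is why lowering the concurrency relieves merge pressure.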

5.2 read time out

This is a timeout problem. Increase the timeout as appropriate:

clickhouse.socket_timeout=600000

5.3 Class not found

Check the Spark configuration.

--end--
