[NLP] Is training a large model harder than climbing to the sky? LiBai, a pretraining model library that is easy to use and remarkably efficient, has arrived!
Synced Editorial Team
The LiBai (李白) model library combines the strengths of all the mainstream Transformer libraries, including Hugging Face, Megatron-LM, DeepSpeed, and FairSeq, bringing large-model training within everyone's reach.

Scale single-GPU code to distributed training smoothly. The models built into LiBai follow the same style as PyTorch, which greatly lowers the cost of learning and use, and with only a little configuration they can be extended to parallelism at any scale. This means you can add new features and debug the model on a single GPU, and once the code runs, migrate it seamlessly to distributed training. If you do not want to configure distributed training at all, or find manual configuration too slow, you can try the managed distributed feature: install the auto-parallel package (https://libai.readthedocs.io/en/latest/tutorials/basics/Auto_Parallel.html) and add a single line, graph.auto_parallel=True, to your LiBai config. You can then focus on the model itself and still get fast training speed without worrying about distribution, as in the sketch below.
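A minimal sketch of where that single line might live in a config file. Only graph.auto_parallel = True comes from the text above; the get_config call mirrors the config examples later in this article, and the "common/models/graph.py" path is an assumption for illustration.

# your config.py -- sketch only; the graph config path is assumed
from libai.config import get_config

train = get_config("common/train.py").train
graph = get_config("common/models/graph.py").graph  # assumed path to the default graph config
graph.auto_parallel = True  # the one line mentioned above: let LiBai search a parallel plan automatically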
Compatible with Hugging Face. OneFlow is highly compatible with PyTorch at the API level, so a Hugging Face model can be imported with only minor code changes: simply write import oneflow as torch, and then train a large model using LiBai's data parallelism, automatic mixed precision, activation checkpointing, ZeRO, and other mechanisms. If you further replace a few individual layers of the model with LiBai's built-in layers, you can train a large model with 3D parallelism.
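A minimal sketch of the import trick described above. The toy MlpBlock is only a hypothetical stand-in for a real Hugging Face layer; the point is that only the import lines change.

# sketch: inside a Hugging Face model file, swap the torch imports for OneFlow
import oneflow as torch          # instead of `import torch`
import oneflow.nn as nn          # instead of `import torch.nn as nn`


class MlpBlock(nn.Module):
    # toy module standing in for a Hugging Face layer; the body is ordinary PyTorch-style code
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, 4 * hidden_size)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))


block = MlpBlock(hidden_size=64)
out = block(torch.randn(2, 64))  # `torch` here is actually OneFlow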
Modular design. LiBai not only provides reusable basic computation modules for building models, but also abstracts and modularizes data loading, training logic, metric computation, and so on, making it easy for users to rewrite these pieces for their own needs and plug them back into LiBai's training system.
Out of the box. Large-model training usually relies on a set of supporting techniques. LiBai provides mixed-precision training, gradient recomputation, gradient accumulation, ZeRO, and other features, which can easily be combined with data parallelism, model parallelism, and pipeline parallelism; a hedged config sketch follows.
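A minimal sketch of turning these features on alongside parallelism. The dist field matches the config examples later in this article; the amp, activation_checkpoint, and zero_optimization field names are assumptions about LiBai's common train config and should be verified against the documentation.

# your config.py -- sketch only; feature field names are assumed, see lead-in
from libai.config import get_config

train = get_config("common/train.py").train

train.dist.data_parallel_size = 2                      # combine freely with the parallel settings shown below
train.amp = dict(enabled=True)                         # mixed-precision training (fp16)
train.activation_checkpoint = dict(enabled=True)       # gradient recomputation
train.zero_optimization = dict(enabled=True, stage=1)  # ZeRO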
Reproduce experiments quickly. The OneFlow team built LiBai's configuration system with reference to Detectron2's LazyConfig (https://github.com/facebookresearch/detectron2/blob/main/docs/tutorials/lazyconfigs.md). Compared with the traditional argparse and yacs-based approaches, LiBai's configuration system is more flexible: the entire configuration is written in plain Python, so adding new parameters and modules is easy, requiring only an import of the corresponding module. The training configuration can also be serialized to a yaml file, which makes it convenient to locate configuration items by keyword search directly in the file; to reproduce a previous experiment, a user simply passes the saved config.yaml as the training configuration. Keeping a pile of near-duplicate scripts, by contrast, makes it hard to see which changes actually matter and makes it easy to mix up experiment configurations during reproduction.
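A minimal sketch of what a Python-based config in this style can look like. get_config appears in the examples later in this article; LazyCall is assumed to be exposed by libai.config in the same way Detectron2 exposes it, and the optimizer and train_iter lines are purely illustrative.

# your config.py -- LazyConfig-style sketch; LazyCall and the field values are assumptions
import oneflow as flow
from libai.config import get_config, LazyCall  # LazyCall assumed, mirroring Detectron2

train = get_config("common/train.py").train

# adding a new module is just an import plus a lazily evaluated call
optim = LazyCall(flow.optim.AdamW)(lr=1e-4, weight_decay=0.01)

# tweak any field with ordinary Python
train.train_iter = 10000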
High performance. Through strict kernel-level alignment with Megatron-LM, LiBai implements a variety of kernel-fusion operations; combined with the design of OneFlow's static graph, LiBai outperforms NVIDIA's deeply optimized Megatron-LM and Microsoft's DeepSpeed, both in single-GPU performance and in the efficiency of every parallelism combination.
Megatron-LM pinned commit: https://github.com/NVIDIA/Megatron-LM/commit/e156d2fea7fc5c98e645f7742eb86b643956d840
LiBai commit: https://github.com/Oneflow-Inc/libai/commit/9fc504c457da4fd1e92d854c60b7271c89a55222
OneFlow commit: https://github.com/Oneflow-Inc/oneflow/commit/55b822e4d3c88757d11077d7546981309125c73f
Note: the parameters for each group mean the following:
DP = data parallelism, MP = model parallelism, PP = pipeline parallelism, 2D = 2D parallelism, 3D = 3D parallelism
fp16: mixed-precision training (amp) enabled
nl: num layers (when the pipeline-parallel size is 8, num layers is raised from 24 to 48 so that each stage has a comparable number of layers to compute)
ac: activation checkpointing enabled
mb: micro-batch size per GPU
gb: global batch size
Total GPU count = d x m x p, where d = data-parallel size (data-parallel-size), m = model-parallel size (tensor-model-parallel-size), p = pipeline-parallel size (pipeline-model-parallel-size)
1n1g denotes a single machine with a single GPU; 1n8g a single machine with 8 GPUs; 2n8g two machines with 8 GPUs each (16 GPUs in total); 4n8g four machines (32 GPUs in total)
grad_acc_num_step = global_batch_size / (micro_batch_size * data_parallel_size)
The reported metric is throughput.
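As a worked example of the formula above: with a global batch size of 1024, a micro-batch size of 128, and a data-parallel size of 1, grad_acc_num_step = 1024 / (128 * 1) = 8 gradient-accumulation steps.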

(Note: for this group, num layers = 24, amp enabled; 1n1g micro-batch size = 24, micro-batch size = 16 for the other configurations)

(Note: for this group, num layers = 24, amp enabled; 1n1g micro-batch size = 6, micro-batch size = 4 for the other configurations)

(Note: for this group, num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 128, global batch size = 1024, grad acc step = 8)

(Note: for this group, num layers = 24, amp enabled)

(Note: for the first two groups, num layers = 24 and grad acc step = 8; for the last group, num layers = 48 and grad acc step = 16; amp and activation checkpointing enabled for all)

(Note: for the first two groups, num layers = 24 and grad acc step = 8; for the last group, num layers = 48 and grad acc step = 16; amp and activation checkpointing enabled for all)

(Note: for all of this group, num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 128, grad acc step = 8)

(Note: for all of this group, num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 32, grad acc step = 8)

(Note: for all of this group, num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 128, grad acc step = 8)

(Note: for all of this group, num layers = 24, amp enabled, activation checkpointing enabled, micro-batch size = 32, grad acc step = 8)
(Note: for all of this group, num layers = 24, amp enabled, activation checkpointing enabled, grad acc step = 8)

(Note: for all of this group, num layers = 24, amp enabled, activation checkpointing enabled, grad acc step = 8)

Compatibility. LiBai works smoothly with current SOTA models implemented in PyTorch, making it easy for users to migrate their models.
Efficiency. Whether on a single GPU or many, LiBai improves training efficiency.
Ease of use. LiBai is highly extensible: models can be modified and new features added as needed, so functional prototypes can be developed faster. It lowers the barrier to distributed deep learning training almost imperceptibly and with essentially zero learning cost: when developing new models and features with LiBai, a user who can program a single GPU can scale automatically to a large GPU cluster, without rewriting any code for distributed training, which improves development efficiency.
# configs/common/train.py
# Distributed arguments
dist = dict(
    data_parallel_size=1,
    tensor_parallel_size=1,
    pipeline_parallel_size=1,
)
Pure data parallelism
# your config.pyfrom libai.config import get_configtrain = get_config("common/train.py").traintrain.dist.data_parallel_size = 8
Pure model parallelism
# your config.pyfrom libai.config import get_configtrain = get_config("common/train.py").traintrain.dist.tensor_parallel_size = 8
# your config.pyfrom libai.config import get_configtrain = get_config("common/train.py").traintrain.dist.data_parallel_size = 2train.dist.tensor_parallel_size = 4
from libai.layers import Linear

self.head = Linear(hidden_size, num_classes)
from libai.layers import Linear
import libai.utils.distributed as dist

self.head = Linear(hidden_size, num_classes).to_global(placement=dist.get_layer_placement(-1))
from libai.layers import Linear

self.head = Linear(hidden_size, num_classes, layer_idx=-1)
class MyModule(nn.Module):
    def __init__(self, ... *, layer_idx):
        ...
        self.layer_idx = layer_idx
        ...

    def forward(self, input_data):
        input_data = input_data.to_global(placement=dist.get_layer_placement(self.layer_idx))
        ...
# set the number of pipeline stages to be 2
train.dist.pipeline_parallel_size = 2
# set model layers for pipeline
train.dist.pipeline_num_layers = hidden_layers
# your config.pyfrom libai.config import get_configtrain = get_config("common/train.py").traintrain.dist.data_parallel_size = 2train.dist.tensor_parallel_size = 2train.dist.pipeline_parallel_size = 2hidden_layers = 8 #網(wǎng)絡(luò)的層數(shù)train.dist.pipeline_num_layers = hidden_layers
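With this configuration, the total number of GPUs is data_parallel_size x tensor_parallel_size x pipeline_parallel_size = 2 x 2 x 2 = 8, and the 8 network layers are divided across the 2 pipeline stages (4 layers per stage, assuming an even split).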
[
    X00 gpu0 | X01 gpu1
    --------------------
    X10 gpu2 | X11 gpu3
]
LiBai wraps dist.get_nd_sbp() to stay compatible with 1D parallelism, and dist.get_layer_placement() to make configuring pipeline parallelism easier. In most cases, users can simply follow the code below:
# test.py
import oneflow as flow
from omegaconf import DictConfig
from oneflow import nn

from libai.utils import distributed as dist

cfg = DictConfig(
    dict(data_parallel_size=2, tensor_parallel_size=2, pipeline_parallel_size=1)
)
dist.setup_dist_util(cfg)


class Noise(nn.Module):
    def __init__(self):
        super().__init__()
        self.noise_tensor = flow.randn(
            16, 8,
            sbp=dist.get_nd_sbp([flow.sbp.split(0), flow.sbp.split(1)]),
            placement=dist.get_layer_placement(layer_idx=0)
        )
        # The following is an equivalent way to write it:
        # self.noise_tensor = flow.randn(
        #     16, 8,
        #     sbp=(flow.sbp.split(0), flow.sbp.split(1)),
        #     placement=flow.placement("cuda", ranks=[[0, 1], [2, 3]])
        # )

    def forward(self, x):
        return x + self.noise_tensor


noise = Noise()

x = flow.zeros(
    16, 8,
    sbp=(flow.sbp.split(0), flow.sbp.split(1)),
    placement=flow.placement("cuda", ranks=[[0, 1], [2, 3]])
)
y = noise(x)

print(f"rank: {flow.env.get_rank()}, global tensor: shape {y.shape} sbp {y.sbp} placement {y.placement}, local tensor shape: {y.to_local().shape}")
python3 -m oneflow.distributed.launch --nproc_per_node 4 test.py
rank: 2, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])
rank: 3, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])
rank: 1, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])
rank: 0, global tensor: shape oneflow.Size([16, 8]) sbp (oneflow.sbp.split(axis=0), oneflow.sbp.split(axis=1)) placement oneflow.placement(type="cuda", ranks=[[0, 1], [2, 3]]), local tensor shape: oneflow.Size([8, 4])
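The local shapes follow directly from the sbp signature: split(0) divides the 16 rows across the first dimension of the 2x2 rank mesh, and split(1) divides the 8 columns across the second dimension, so each of the 4 GPUs holds a local [8, 4] slice of the global [16, 8] tensor.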

LiBai repository: https://github.com/Oneflow-Inc/libai
LiBai documentation: https://libai.readthedocs.io/en/latest
OneFlow repository: https://github.com/Oneflow-Inc/oneflow