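AutoGPTQ: A Quantization Toolkit for Large Language Models
AutoGPTQ is an easy-to-use quantization toolkit for large language models, based on the GPTQ algorithm and offering user-friendly APIs.
Performance Comparison
Inference Speed
The results below were generated with the project's benchmark script, using a text input batch size of 1 and beam search decoding, with the model forced to generate 512 tokens. Speed is measured in tokens/s (higher is better).
Quantized models are loaded in the way that maximizes inference speed.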
| model | GPU | num_beams | fp16 | gptq-int4 |
|---|---|---|---|---|
| llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 6.83 | 6.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| gpt-j 6b | 1xRTX3060-12G | 1 | OOM | 29.55 |
| gpt-j 6b | 1xRTX3060-12G | 4 | OOM | 47.36 |
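To make the comparison easier to read, the ratio of gptq-int4 to fp16 throughput can be computed from the rows above. The snippet below is a small illustrative sketch and not part of AutoGPTQ; the `rows` list and the `speedup` helper are names introduced here, and the numbers are copied from the table.

```python
# Rows copied from the table above: (model, GPU, num_beams, fp16, gptq-int4).
# None marks a run that went out of memory (OOM).
rows = [
    ("llama-7b", "1xA100-40G", 1, 18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4, 68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 1, 12.48, 15.25),
    ("moss-moon 16b", "1xA100-40G", 4, None, 42.67),
    ("moss-moon 16b", "2xA100-40G", 1, 6.83, 6.78),
    ("moss-moon 16b", "2xA100-40G", 4, 13.10, 10.80),
    ("gpt-j 6b", "1xRTX3060-12G", 1, None, 29.55),
    ("gpt-j 6b", "1xRTX3060-12G", 4, None, 47.36),
]


def speedup(fp16, int4):
    """Return the int4 / fp16 throughput ratio, or None if either run hit OOM."""
    if fp16 is None or int4 is None:
        return None
    return int4 / fp16


# Map (model, GPU, num_beams) to the throughput ratio.
speedups = {(m, g, b): speedup(f, q) for m, g, b, f, q in rows}

for key, ratio in speedups.items():
    label = "n/a (OOM)" if ratio is None else f"{ratio:.2f}x"
    print(*key, label)
```

In these measurements the quantized model is faster in every single-GPU row where fp16 fits in memory, while in the 2xA100 rows the ratio falls below 1.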
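Perplexity (PPL)
Quick Start
Quantization and Inference
Warning: this only demonstrates the basic interfaces of AutoGPTQ. It quantizes a very small model using a single text sample, so the quantized model may perform noticeably worse than you would expect from quantizing a large model.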
The following shows the simplest way to quantize a model and run inference with auto_gptq:

    from transformers import AutoTokenizer, TextGenerationPipeline
    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

    pretrained_model_dir = "facebook/opt-125m"
    quantized_model_dir = "opt-125m-4bit"

    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
    examples = [
        tokenizer(
            "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
        )
    ]

    quantize_config = BaseQuantizeConfig(
        bits=4,  # quantize the model to 4-bit
        group_size=128,  # 128 is the generally recommended value
        desc_act=False,  # False significantly speeds up inference, but perplexity may become slightly worse
    )

    # load the un-quantized model; by default the model is always loaded into CPU memory
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

    # quantize the model; examples should be a List[Dict] whose keys are exactly input_ids and attention_mask
    model.quantize(examples)

    # save the quantized model
    model.save_quantized(quantized_model_dir)

    # save the quantized model using safetensors
    model.save_quantized(quantized_model_dir, use_safetensors=True)

    # push the quantized model directly to the Hugging Face Hub
    # when using use_auth_token=True, make sure you have logged in first with huggingface-cli login,
    # or pass an explicit token with use_auth_token="hf_xxxxxxx"
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_dir}"
    # commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

    # alternatively, save the quantized model locally and push it to the Hugging Face Hub in one step
    # (uncomment the following three lines to enable this feature)
    # repo_id = f"YourUserName/{quantized_model_dir}"
    # commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
    # model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

    # load the quantized model onto the first visible GPU
    model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

    # download the quantized model from the Hugging Face Hub and load it onto the first visible GPU
    # model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

    # run inference with model.generate
    print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

    # or use TextGenerationPipeline
    pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
    print(pipeline("auto-gptq is")[0]["generated_text"])

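The quantization call expects its examples to be a list of dicts whose keys are exactly `input_ids` and `attention_mask`. A minimal sketch of a pre-flight check for that shape is given below. `check_examples` is a hypothetical helper written for this illustration and is not part of auto_gptq; the non-empty and equal-length checks are extra assumptions, and plain lists of ints stand in for real tokenizer output.

```python
from collections.abc import Mapping

REQUIRED_KEYS = {"input_ids", "attention_mask"}


def check_examples(examples):
    """Raise ValueError unless examples is a list of mappings with exactly the required keys."""
    if not isinstance(examples, list) or not examples:
        raise ValueError("examples must be a non-empty list")
    for i, ex in enumerate(examples):
        # Mapping (rather than dict) also accepts dict-like tokenizer outputs.
        if not isinstance(ex, Mapping):
            raise ValueError(f"example {i} is not a mapping")
        keys = set(ex.keys())
        if keys != REQUIRED_KEYS:
            raise ValueError(
                f"example {i} has keys {sorted(keys)}, expected {sorted(REQUIRED_KEYS)}"
            )
        if len(ex["input_ids"]) != len(ex["attention_mask"]):
            raise ValueError(f"example {i}: input_ids and attention_mask differ in length")
    return True


good = [{"input_ids": [2, 17, 42], "attention_mask": [1, 1, 1]}]
bad = [{"input_ids": [2, 17, 42]}]  # attention_mask is missing

print(check_examples(good))
try:
    check_examples(bad)
except ValueError as err:
    print("rejected:", err)
```

Running a check like this before quantizing can surface a malformed calibration set early, before the much slower quantization pass starts.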
See the example script for more advanced usage.
Custom Models
The following shows how to extend `auto_gptq` to support the `OPT` model. As you can see, it is very simple:

    from auto_gptq.modeling import BaseGPTQForCausalLM


    class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
        # chained attribute name of the transformer layer block
        layers_block_name = "model.decoder.layers"
        # chained attribute names of other nn modules at the same level as the transformer layer block
        outside_layer_modules = [
            "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
            "model.decoder.project_in", "model.decoder.final_layer_norm"
        ]

        # chained attribute names of the linear layers in a transformer layer module
        # normally there are four sub-lists; the modules in each one can be seen as one operation,
        # and the order should be the order in which they are actually executed; in this case (and in most cases)
        # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
        inside_layer_modules = [
            ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
            ["self_attn.out_proj"],
            ["fc1"],
            ["fc2"]
        ]

After that, you can use `OPTGPTQForCausalLM.from_pretrained` and the other methods as shown in the basic usage section.
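The "chained attribute names" used in the class definition are dotted paths into the model object. As a rough illustration of how such a path can be resolved, the sketch below walks a dotted name with `getattr`. `resolve_chained_attr` and the `SimpleNamespace` stand-in model are hypothetical names used only here, and auto_gptq's own lookup may work differently.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_chained_attr(root, dotted_name):
    """Follow a dotted attribute path such as 'model.decoder.layers' starting at root."""
    return reduce(getattr, dotted_name.split("."), root)


# Toy stand-in for an OPT-like module tree (not a real model).
layers = ["layer0", "layer1"]
toy = SimpleNamespace(
    model=SimpleNamespace(
        decoder=SimpleNamespace(layers=layers, embed_tokens="embed")
    )
)

print(resolve_chained_attr(toy, "model.decoder.layers"))
print(resolve_chained_attr(toy, "model.decoder.embed_tokens"))
```

When adding support for a new architecture, a lookup like this can help confirm that every name listed in the class actually exists on the loaded model; a misspelled path raises `AttributeError`.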
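Evaluation on Downstream Tasks
You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.
The predefined tasks support all causal language models implemented in transformers and in this project.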
The following example evaluates the `EleutherAI/gpt-j-6b` model on a sequence classification (text classification) task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:

    from functools import partial

    import datasets
    from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

    from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
    from auto_gptq.eval_tasks import SequenceClassificationTask


    MODEL = "EleutherAI/gpt-j-6b"
    DATASET = "cardiffnlp/tweet_sentiment_multilingual"
    TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
    ID2LABEL = {
        0: "negative",
        1: "neutral",
        2: "positive"
    }
    LABELS = list(ID2LABEL.values())


    def ds_refactor_fn(samples):
        text_data = samples["text"]
        label_data = samples["label"]

        new_samples = {"prompt": [], "label": []}
        for text, label in zip(text_data, label_data):
            prompt = TEMPLATE.format(labels=LABELS, text=text)
            new_samples["prompt"].append(prompt)
            new_samples["label"].append(ID2LABEL[label])

        return new_samples


    # model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
    model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
    tokenizer = AutoTokenizer.from_pretrained(MODEL)

    task = SequenceClassificationTask(
        model=model,
        tokenizer=tokenizer,
        classes=LABELS,
        data_name_or_path=DATASET,
        prompt_col_name="prompt",
        label_col_name="label",
        **{
            "num_samples": 1000,  # number of samples drawn for evaluation
            "sample_max_len": 1024,  # max tokens for each sample
            "block_max_len": 2048,  # max tokens for each data block
            # function to load the dataset; it must accept only data_name_or_path as input
            # and return a datasets.Dataset
            "load_fn": partial(datasets.load_dataset, name="english"),
            # function to preprocess the dataset, used with datasets.Dataset.map;
            # it must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
            "preprocess_fn": ds_refactor_fn,
            # truncate the label when the sample's length exceeds sample_max_len
            "truncate_prompt": False
        }
    )

    # note that max_new_tokens is set automatically based on the given classes
    print(task.run())

    # self-consistency
    print(
        task.run(
            generation_config=GenerationConfig(
                num_beams=3,
                num_return_sequences=3,
                do_sample=True
            )
        )
    )

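The preprocessing function in this example is written to receive a batch as a dict of columns and to return a dict of lists with exactly the prompt and label columns. The sketch below runs the same `ds_refactor_fn` on a hand-written two-row batch, without loading any dataset or model, to show the shape of its output. The sample texts are made up for illustration.

```python
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    """Turn a batch of raw text/label columns into prompt/label columns."""
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(samples["text"], samples["label"]):
        new_samples["prompt"].append(TEMPLATE.format(labels=LABELS, text=text))
        new_samples["label"].append(ID2LABEL[label])
    return new_samples


# Hand-written batch in the column-oriented layout used for batched mapping.
batch = {"text": ["I love this!", "This is awful."], "label": [2, 0]}
out = ds_refactor_fn(batch)

print(out["label"])
print(out["prompt"][0])
```

Note that formatting the `LABELS` list directly places its Python representation, `['negative', 'neutral', 'positive']`, into the prompt text.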
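Learn More
Tutorials provide step-by-step guidance and best practices for integrating auto_gptq into your own project.
Examples provide a large collection of example scripts for using auto_gptq in different domains.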
