
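Preface

In the era of AI commercialization, large-model training and inference will be used ever more widely. Viewed soberly, a trained large model ends up in one of two ways. Either it turns out to be useless and there is no follow-up, or it proves useful and is adopted worldwide, at which point most of the usage comes from inference. Whether it is OpenAI or Midjourney, users pay for each inference call. Over time, the split between training and inference usage will settle at around 30/70, or even 20/80. Inference will be the main battleground going forward.

Large-model inference is a huge challenge in terms of cost, performance, and efficiency. Cost matters most: models keep growing and consume more and more resources, and the GPUs they run on are scarce and expensive, so the cost of each inference keeps rising. End users pay for value, not for inference cost, so reducing the unit cost of inference is the infrastructure team's first priority.

On that basis, performance is the core competitive advantage. Especially for consumer-facing (ToC) large models, faster inference and better inference results are key to user stickiness.

Commercializing large models is a highly uncertain field. Cost and performance keep you at the table; efficiency is what lets you win at it.

That leaves efficiency. Models need continual updating, which raises two questions: how often can a model be updated, and how long does each update take? Whoever has the higher engineering efficiency has the better chance of iterating toward a more valuable model.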


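In recent years, containers and Kubernetes have become the preferred runtime environment and platform for more and more AI applications. On the one hand, Kubernetes helps users standardize heterogeneous resources and runtime environments and simplify operations. On the other hand, GPU-heavy AI workloads can use the elasticity of K8s to save resource costs. In the current AIGC/large-model wave, running AI applications on Kubernetes will become a de facto standard.

--- Excerpted from "Engineering Challenges and Responses for AIGC Model Serving in Cloud-Native Scenarios"

Large-Model Training and Inference

Large-model training and inference are important applications of machine learning and deep learning, but enterprises and individuals often face complex GPU management, low resource utilization, and low engineering efficiency across the full lifecycle of AI jobs. This solution creates a Kubernetes cluster and deploys an inference service with KServe + vLLM. It applies to the following machine learning and deep learning scenarios:

• Model training: Fine-tune open-source models on a Kubernetes cluster. This hides the complexity of the underlying resources and environment, so you can quickly configure training data, submit training jobs, and have the results run and saved automatically.
• Model inference: Deploy inference services on a Kubernetes cluster. This hides the complexity of the underlying resources and environment, so you can quickly deploy a fine-tuned model as an inference service and apply it to real business scenarios.
• GPU-shared inference: GPU sharing scheduling and GPU memory isolation let multiple inference services run on the same GPU card, raising GPU utilization while keeping the inference services stable.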

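Introduction to vLLM

Even on high-end GPUs, serving LLMs can be surprisingly slow. vLLM[1] is a fast and easy-to-use LLM inference engine. It can achieve 10x to 20x higher throughput than Hugging Face Transformers. It supports continuous batching[2] to improve throughput and GPU utilization, and PagedAttention[3] to address the memory bottleneck: during autoregressive decoding, all attention key and value tensors (the KV cache) are kept in GPU memory to generate the next token.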
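To make the paged KV-cache idea concrete, here is a toy Python sketch. It is not vLLM's implementation: the class name `BlockAllocator`, the block size of 16 tokens, and the pool of 4 blocks are illustrative assumptions. It only shows the bookkeeping: each sequence owns a list of fixed-size blocks, and a new block is taken from the free pool only when the last one is full.

```python
class BlockAllocator:
    """Toy paged KV-cache bookkeeping (illustrative, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size                 # tokens per block (assumed)
        self.free_blocks = list(range(num_blocks))   # ids of unused blocks
        self.block_tables = {}                       # seq_id -> list of block ids
        self.seq_lens = {}                           # seq_id -> tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve KV-cache space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # A new block is needed when no block exists yet or the last one is full.
        if length == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):                      # 17 tokens need two 16-token blocks
    allocator.append_token("request-a")
blocks_used = len(allocator.block_tables["request-a"])
allocator.free("request-a")
blocks_free_after = len(allocator.free_blocks)
```

Because blocks are handed out on demand, a short sequence never reserves memory for the maximum possible length, which is the waste that paging is meant to avoid.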

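vLLM is a fast and easy-to-use library for LLM inference and serving.

vLLM is fast:

• State-of-the-art serving throughput
• Efficient management of attention key and value memory with PagedAttention
• Continuous batching of incoming requests
• Fast model execution with CUDA/HIP graphs
• Quantization: GPTQ[4], AWQ[5], SqueezeLLM[6], FP8 KV cache
• Optimized CUDA kernels

vLLM is flexible and easy to use:

• Seamless integration with popular Hugging Face models
• High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
• Tensor parallelism support for distributed inference
• Streaming output
• OpenAI-compatible API server
• Support for NVIDIA GPUs and AMD GPUs
• (Experimental) Prefix caching support
• (Experimental) Multi-LoRA support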
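The benefit of continuous batching of incoming requests can be illustrated with a small simulation. This is a simplified model, not vLLM's scheduler: it assumes every request generates a known number of tokens, one token per request per decode step, and a fixed number of batch slots. The function names and the request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when a finished request's slot is refilled at the next step."""
    waiting = deque(lengths)
    running = []              # tokens still to generate for each in-flight request
    steps = 0
    while waiting or running:
        # Fill any free slots from the waiting queue before the step runs.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 2, 8]        # hypothetical tokens to generate per request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
```

With these made-up lengths, the static schedule idles a slot while each short request waits for its long neighbor, so it needs more decode steps than the continuous schedule for the same work.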

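Introduction to KServe

KServe is a Kubernetes custom resource for serving machine learning (ML) models on arbitrary frameworks. It aims to solve production model-serving use cases by providing performant, standardized inference protocols across common ML frameworks such as TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.

KServe provides a simple Kubernetes CRD for deploying one or more trained models onto model serving runtimes such as TFServing, TorchServe, and Triton Inference Server.

KServe encapsulates the complexity of autoscaling, networking, health checking, and server configuration, bringing serving features such as GPU autoscaling, scale-to-zero, and canary rollouts to ML deployments. It makes production ML serving simple and pluggable, and covers the complete story: prediction, pre-processing, post-processing, and explainability.

ModelMesh in KServe is designed for high-scale, high-density, and frequently changing model use cases. It loads and unloads AI models intelligently to strike a balance between responsiveness to users and compute footprint.

KServe also provides basic API primitives for building custom model serving runtimes. You can also use other tools, such as BentoML, to build your own custom model serving image.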

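• Background: After Kubeflow Summit 2019, KF Serving was split out of Kubeflow and eventually evolved into KServe. In 2022, NVIDIA contributed the V2 standardized inference protocol, which drew wide attention in the industry.
• Features and deployment: KServe is a highly scalable, Kubernetes-based serverless model inference platform. It can be deployed in the cloud or on premises, and offers server autoscaling, multiple model serving runtimes, and other features.
• Current status: About 60% of Kubeflow users use KServe; the KServe Docker images have been downloaded about 100,000 times; the project has 20 core contributors and 172 contributors.
• KServe provides basic and advanced features such as scaling, request handling, security, traffic management, and distributed tracing, which make machine learning workflows more convenient.
• KServe is built on Kubernetes and integrates with Knative and Istio, providing a scalable and efficient model serving architecture.
• Features: support for multiple model serving runtimes such as Triton and TF Serving; the V2 inference protocol, which standardizes the interface across inference runtimes; and inference graphs, batching, and more.
• Model Mesh integration: Model Mesh is a technology that IBM contributed to the KServe project. It manages the static Pods behind a large number of model services, provides high-density model loading and request routing, and optimizes resource utilization automatically. It targets large-scale, high-density inference deployments.
• KServe also supports features such as Canary Rollout, providing traffic control and version management for a range of production use cases.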

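Why KServe?

• KServe is a cloud-agnostic, standard model inference platform built for highly scalable use cases.
• It provides performant, standardized inference protocols across machine learning frameworks.
• It supports modern serverless inference workloads with request-based autoscaling on CPU and GPU, including scale-to-zero.
• It provides high scalability, density packing, and intelligent routing through ModelMesh.
• Simple and pluggable production serving for inference, pre/post-processing, monitoring, and explainability.
• Advanced deployments: canary rollouts, Pipeline, and InferenceGraph.

In a Kubernetes environment, you can use the KServe InferenceService API spec to deploy a model with a prebuilt vLLM inference server container image.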

Setting Up the KServe Environment

You can use the KServe quick install script to deploy KServe locally:

    curl -s "https://raw.githubusercontent.com/kserve/kserve/release-0.11/hack/quick_install.sh" | bash

Check the result

    $ kubectl get pod -A
    NAMESPACE            NAME                                                 READY   STATUS    RESTARTS   AGE
    cert-manager         cert-manager-76b7c557d5-xzd8b                        1/1     Running   1          30m
    cert-manager         cert-manager-cainjector-655d695d74-q2tjv             1/1     Running   6          30m
    cert-manager         cert-manager-webhook-7955b9bb97-h4bm6                1/1     Running   3          30m
    istio-system         istio-egressgateway-5547fcc8fc-z7sz9                 1/1     Running   1          32m
    istio-system         istio-ingressgateway-8f568d595-f4f5z                 1/1     Running   1          32m
    istio-system         istiod-568d797f55-k6476                              1/1     Running   1          32m
    knative-serving      activator-68b7698d74-tld8d                           1/1     Running   1          31m
    knative-serving      autoscaler-6c8884d6ff-dd6fn                          1/1     Running   1          31m
    knative-serving      controller-76cf997d95-fk6hv                          1/1     Running   1          31m
    knative-serving      domain-mapping-57fdbf97b-mg8r2                       1/1     Running   1          31m
    knative-serving      domainmapping-webhook-66c5f7d596-qxhtw               1/1     Running   5          31m
    knative-serving      net-istio-controller-544874485d-4dlvr                1/1     Running   1          31m
    knative-serving      net-istio-webhook-695d588d65-crfkf                   1/1     Running   1          31m
    knative-serving      webhook-7df8fd847b-dqx8s                             1/1     Running   4          31m
    kserve               kserve-controller-manager-0                          2/2     Running   2          22m
    kube-system          coredns-558bd4d5db-gkspn                             1/1     Running   1          54m
    kube-system          coredns-558bd4d5db-tsh95                             1/1     Running   1          54m
    kube-system          etcd-test-cluster-control-plane                      1/1     Running   1          54m
    kube-system          kindnet-49hl9                                        1/1     Running   1          54m
    kube-system          kube-apiserver-test-cluster-control-plane            1/1     Running   1          54m
    kube-system          kube-controller-manager-test-cluster-control-plane   1/1     Running   7          54m
    kube-system          kube-proxy-mqvtb                                     1/1     Running   1          54m
    kube-system          kube-scheduler-test-cluster-control-plane            1/1     Running   7          54m
    local-path-storage   local-path-provisioner-778f7d66bf-4fzcg              1/1     Running   1          54m

KServe and vLLM: Creating a Large-Model Inference Service on K8s

Create the inference service `InferenceService`

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      namespace: kserve-test
      name: bloom
    spec:
      predictor:
        containers:
        - args:
          - --port
          - "8080"
          - --model
          - "/mnt/models"
          command:
          - python3
          - -m
          - vllm.entrypoints.api_server
          env:
          - name: STORAGE_URI
            value: pvc://task-pv-claim/bloom-560m
          image: docker.io/kserve/vllmserver:latest
          imagePullPolicy: IfNotPresent
          name: kserve-container
          resources:
            limits:
              cpu: "5"
              memory: 20Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "5"
              memory: 20Gi
              nvidia.com/gpu: "1"
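If you deploy several models, it can help to generate this manifest instead of copying it by hand. The following is a minimal sketch under that assumption; the helper name `build_inference_service` and its parameters are hypothetical, and it simply reproduces the fields of the manifest above as a Python dictionary serialized to JSON.

```python
import json


def build_inference_service(name, namespace, storage_uri, image,
                            gpus=1, cpu="5", memory="20Gi"):
    """Return an InferenceService manifest shaped like the YAML above."""
    resources = {"cpu": cpu, "memory": memory, "nvidia.com/gpu": str(gpus)}
    container = {
        "name": "kserve-container",
        "image": image,
        "imagePullPolicy": "IfNotPresent",
        "command": ["python3", "-m", "vllm.entrypoints.api_server"],
        "args": ["--port", "8080", "--model", "/mnt/models"],
        # STORAGE_URI points at the model files, as in the manifest above.
        "env": [{"name": "STORAGE_URI", "value": storage_uri}],
        "resources": {"limits": dict(resources), "requests": dict(resources)},
    }
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"predictor": {"containers": [container]}},
    }


manifest = build_inference_service(
    name="bloom",
    namespace="kserve-test",
    storage_uri="pvc://task-pv-claim/bloom-560m",
    image="docker.io/kserve/vllmserver:latest",
)
manifest_json = json.dumps(manifest, indent=2)
```

Since `kubectl apply -f` accepts JSON as well as YAML, the contents of `manifest_json` can be written to a file and applied in the same way as the YAML version.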

Startup log

    $ kubectl -n kserve-test logs -f --tail=200 bloom-predictor-00001-deployment-66649d69bd-nw96r
    Defaulted container "kserve-container" out of: kserve-container, queue-proxy, storage-initializer (init)
    INFO 02-20 06:04:40 llm_engine.py:70] Initializing an LLM engine with config: model='/mnt/models', tokenizer='/mnt/models', tokenizer_mode=auto, trust_remote_code=False, dtype=torch.float16, use_dummy_weights=False, download_dir=None, use_np_weights=False, tensor_parallel_size=1, seed=0)
    INFO 02-20 06:04:54 llm_engine.py:196] # GPU blocks: 6600, # CPU blocks: 2730
    INFO:     Started server process [1]
    INFO:     Waiting for application startup.
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://localhost:8080 (Press CTRL+C to quit)
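The `# GPU blocks` line in this log reports how many KV-cache blocks the engine was able to allocate. As a rough sketch, the block counts can be turned into a token capacity. The block size of 16 tokens is an assumption here (it is controlled by the `--block-size` setting, so check the value used by your deployment), and the helper name `parse_block_counts` is hypothetical.

```python
import re

LOG_LINE = "INFO 02-20 06:04:54 llm_engine.py:196] # GPU blocks: 6600, # CPU blocks: 2730"
BLOCK_SIZE = 16  # tokens per block; an assumption, check your --block-size setting


def parse_block_counts(line):
    """Extract the GPU and CPU block counts from the engine log line."""
    match = re.search(r"# GPU blocks: (\d+), # CPU blocks: (\d+)", line)
    if match is None:
        raise ValueError("no block counts found in log line")
    return int(match.group(1)), int(match.group(2))


gpu_blocks, cpu_blocks = parse_block_counts(LOG_LINE)
gpu_token_capacity = gpu_blocks * BLOCK_SIZE   # tokens that fit in the GPU KV cache
cpu_token_capacity = cpu_blocks * BLOCK_SIZE   # tokens that fit in the CPU blocks
```

Under that assumption, 6600 GPU blocks correspond to 105,600 cached tokens shared by all in-flight requests, which gives a feel for how many concurrent sequences the server can hold.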

Forward the port with `kubectl port-forward`

    INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
    kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80

Run LLM inference

From the ingress gateway with a HOST header

If you do not have DNS, you can still send requests to the ingress gateway's external IP by passing the HOST header.

    # start another terminal
    export INGRESS_HOST=localhost
    export INGRESS_PORT=8080
    MODEL_NAME=bloom-560m
    SERVICE_HOSTNAME=$(kubectl --namespace kserve-test get inferenceservice bloom -o jsonpath='{.status.url}' | cut -d "/" -f 3)

    # If the server returns {"error":"TypeError : Type is not JSON serializable: bytes"},
    # add -H "Content-Type: application/json" to the curl request so that the server
    # parses the request body as JSON.
    curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
        http://${INGRESS_HOST}:${INGRESS_PORT}/generate \
        -d '{"prompt": "San Francisco is a" }'

    # The /v1/completions route belongs to the OpenAI-compatible server (see the
    # vllm-openai deployment later in this article); the vllm.entrypoints.api_server
    # entrypoint used above may not expose it.
    curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
        http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
        -d '{
            "model": "/mnt/models",
            "prompt": "San Francisco is a",
            "max_tokens": 70,
            "temperature": 0
        }'

Expected output

    * About to connect() to localhost port 8080 (#0)
    *   Trying ::1...
    * Connected to localhost (::1) port 8080 (#0)
    > POST /generate HTTP/1.1
    > User-Agent: curl/7.29.0
    > Accept: */*
    > Host: bloom.kserve-test.example.com
    > Content-Type: application/json
    > Content-Length: 33
    >
    * upload completely sent off: 33 out of 33 bytes
    < HTTP/1.1 200 OK
    < content-length: 128
    < content-type: application/json
    < date: Tue, 20 Feb 2024 07:34:23 GMT
    < server: istio-envoy
    < x-envoy-upstream-service-time: 287
    <
    * Connection #0 to host localhost left intact
{"text":["San Francisco is a medium-sized family donating site with nonprofits, churches, Catholic organizations and business"]}
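The same `/generate` call can be made from Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_generate_request` and `extract_texts` are hypothetical, and the host, port, and service hostname are the values used in this walkthrough. The request is only built here; sending it requires the port-forward to be running.

```python
import json
import urllib.request


def build_generate_request(ingress_host, ingress_port, service_hostname, prompt):
    """Build the same POST /generate request as the curl command."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"http://{ingress_host}:{ingress_port}/generate",
        data=body,
        method="POST",
        # The Host header routes the request to the InferenceService.
        headers={"Host": service_hostname, "Content-Type": "application/json"},
    )


def extract_texts(response_body):
    """Return the list of generated strings from a /generate response."""
    return json.loads(response_body)["text"]


request = build_generate_request(
    "localhost", 8080, "bloom.kserve-test.example.com", "San Francisco is a")

# To actually send it (needs the port-forward from the previous step):
# with urllib.request.urlopen(request) as response:
#     print(extract_texts(response.read()))

# Shortened copy of the response body shown above, used to exercise the parser.
sample = '{"text":["San Francisco is a medium-sized family donating site"]}'
texts = extract_texts(sample)
```

Setting `Content-Type` explicitly mirrors the curl command, which avoids the JSON serialization error mentioned earlier.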

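LLM Inference: OpenAI-Compatible API

vLLM can be deployed as a server that implements the OpenAI API protocol. This lets vLLM serve as a drop-in replacement for applications that use the OpenAI API. By default, it starts the server at http://localhost:8000. You can specify the address with the --host and --port arguments.

The server currently hosts one model at a time (in this walkthrough, the bloom-560m model mounted at /mnt/models) and implements the list models[7], query the model with the OpenAI Chat API[8], and query the model with an input prompt[9] endpoints. Support for more endpoints is being actively added by the vLLM project.

Deploy the inference service with `vllm-openai`

vLLM provides an official Docker image for deployment. The image can be used to run the OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai[10].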

    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      namespace: kserve-test
      name: bloom
    spec:
      predictor:
        containers:
        - args:
          - --port
          - "8080"
          - --model
          - "/mnt/models"
          env:
          - name: STORAGE_URI
            value: pvc://task-pv-claim/bloom-560m
          image: docker.io/vllm/vllm-openai:latest
          imagePullPolicy: IfNotPresent
          name: kserve-container
          resources:
            limits:
              cpu: "5"
              memory: 20Gi
              nvidia.com/gpu: "1"
            requests:
              cpu: "5"
              memory: 20Gi
              nvidia.com/gpu: "1"

Forward the port with `kubectl port-forward`

    INGRESS_GATEWAY_SERVICE=$(kubectl get svc --namespace istio-system --selector="app=istio-ingressgateway" --output jsonpath='{.items[0].metadata.name}')
    kubectl port-forward --namespace istio-system svc/${INGRESS_GATEWAY_SERVICE} 8080:80

If you do not have DNS, you can still send requests to the ingress gateway's external IP by passing the HOST header.

    # start another terminal
    export INGRESS_HOST=localhost
    export INGRESS_PORT=8080
    MODEL_NAME=bloom-560m
    SERVICE_HOSTNAME=$(kubectl --namespace kserve-test get inferenceservice bloom -o jsonpath='{.status.url}' | cut -d "/" -f 3)

Using the OpenAI Completions API with vLLM

Query the model with an input prompt:

    # If the server returns {"error":"TypeError : Type is not JSON serializable: bytes"},
    # add -H "Content-Type: application/json" to the curl request so that the server
    # parses the request body as JSON.
    curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
        http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
        -d '{
            "model": "/mnt/models",
            "prompt": "San Francisco is a",
            "max_tokens": 70,
            "temperature": 0
        }'

Expected output

    $ curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
    >     http://${INGRESS_HOST}:${INGRESS_PORT}/v1/completions \
    >     -d '{
    >         "model": "/mnt/models",
    >         "prompt": "San Francisco is a",
    >         "max_tokens": 70,
    >         "temperature": 0
    >     }'
    * About to connect() to localhost port 8080 (#0)
    *   Trying ::1...
    * Connected to localhost (::1) port 8080 (#0)
    > POST /v1/completions HTTP/1.1
    > User-Agent: curl/7.29.0
    > Accept: */*
    > Host: bloom.kserve-test.example.com
    > Content-Type: application/json
    > Content-Length: 130
    >
    * upload completely sent off: 130 out of 130 bytes
    < HTTP/1.1 200 OK
    < content-length: 641
    < content-type: application/json
    < date: Tue, 20 Feb 2024 09:37:47 GMT
    < server: istio-envoy
    < x-envoy-upstream-service-time: 589
    <
    * Connection #0 to host localhost left intact
{"id":"cmpl-45a94f7aecb84de08de42d4e51fad49f","object":"text_completion","created":3793968,"model":"/mnt/models","choices":[{"index":0,"text":" great place to visit. The city is home to a number of museums, including the National Museum of Natural History, the National Museum of Natural History, the National Museum of Natural History, the National Museum of Natural History, the National Museum of Natural History, the National Museum of Natural History, the National Museum of Natural History, the National Museum of Natural","logprobs":null,"finish_reason":"length"}],"usage":{"prompt_tokens":4,"total_tokens":74,"completion_tokens":70}}
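The fields of this response are easier to read once pulled apart. Here is a small sketch that summarizes a completion response; the helper name `summarize_completion` is hypothetical, and the sample body is a shortened copy of the response above with the `text` field abbreviated.

```python
import json


def summarize_completion(response_body):
    """Pull the generated text, stop reason, and token usage out of a response."""
    data = json.loads(response_body)
    choice = data["choices"][0]
    usage = data["usage"]
    return {
        "text": choice["text"],
        # finish_reason "length" means generation stopped at max_tokens.
        "truncated": choice["finish_reason"] == "length",
        "prompt_tokens": usage["prompt_tokens"],
        "completion_tokens": usage["completion_tokens"],
        "total_tokens": usage["total_tokens"],
    }


# Shortened copy of the response shown above (the text field is abbreviated).
sample = json.dumps({
    "object": "text_completion",
    "model": "/mnt/models",
    "choices": [{"index": 0, "text": " great place to visit.",
                 "logprobs": None, "finish_reason": "length"}],
    "usage": {"prompt_tokens": 4, "total_tokens": 74, "completion_tokens": 70},
})
summary = summarize_completion(sample)
```

In the response above, `finish_reason` is `length` and `completion_tokens` equals the requested `max_tokens` of 70, which indicates that the output was cut off at the token limit rather than ending on its own.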

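Using the OpenAI Chat API with vLLM

The vLLM server is designed to support the OpenAI Chat API, which lets you hold a dynamic conversation with the model. The chat interface is a more interactive way to communicate with the model: it allows back-and-forth exchanges that can be stored in the chat history. This is useful for tasks that need context or more detailed explanations.

Query the model with the OpenAI Chat API:

You can use the create chat completion[11] endpoint to communicate with the model in a chat-like interface: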

    curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
        http://${INGRESS_HOST}:${INGRESS_PORT}/v1/chat/completions \
        -d '{
            "model": "/mnt/models",
            "messages": [
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": "Who won the world series in 2020?"}
            ]
        }'

Expected output

    $ curl -v -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
    >     http://${INGRESS_HOST}:${INGRESS_PORT}/v1/chat/completions \
    >     -d '{
    >         "model": "/mnt/models",
    >         "messages": [
    >             {"role": "system", "content": "You are a helpful assistant."},
    >             {"role": "user", "content": "Who won the world series in 2020?"}
    >         ]
    >     }'
    * About to connect() to localhost port 8080 (#0)
    *   Trying ::1...
    * Connected to localhost (::1) port 8080 (#0)
    > POST /v1/chat/completions HTTP/1.1
    > User-Agent: curl/7.29.0
    > Accept: */*
    > Host: bloom.kserve-test.example.com
    > Content-Type: application/json
    > Content-Length: 223
    >
    * upload completely sent off: 223 out of 223 bytes
    < HTTP/1.1 200 OK
    < content-length: 654
    < content-type: application/json
    < date: Tue, 20 Feb 2024 09:56:34 GMT
    < server: istio-envoy
    < x-envoy-upstream-service-time: 814
    <
    * Connection #0 to host localhost left intact
{"id":"cmpl-af28c2d32b1f4a2aa24ad30f8fbbfa2b","object":"chat.completion","created":3795095,"model":"/mnt/models","choices":[{"index":0,"message":{"role":"assistant","content":"I have been on the road all year and was really happy to see the cheers in The Guardian site this year after the World Cup — for the first time in its history, a newspaper taking the time to cover the current World Cup (although not the one that was a fan favourite of mine and my wife). I am quite pleased with how well it went and I hope it will continue as much as possible."},"finish_reason":"stop"}],"usage":{"prompt_tokens":16,"total_tokens":100,"completion_tokens":84}}
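The chat endpoint is stateless, so a multi-turn conversation works by resending the accumulated message list on every call. The sketch below shows one way to keep that history. The helper names `build_chat_payload` and `record_reply`, the follow-up question, and the placeholder reply are all hypothetical; a real reply would come from `/v1/chat/completions`.

```python
import json


def build_chat_payload(history, user_message, model="/mnt/models"):
    """Append the user's turn and return (request payload, updated history)."""
    messages = history + [{"role": "user", "content": user_message}]
    return {"model": model, "messages": messages}, messages


def record_reply(messages, response_body):
    """Append the assistant message from a chat completion response."""
    reply = json.loads(response_body)["choices"][0]["message"]
    return messages + [{"role": reply["role"], "content": reply["content"]}]


history = [{"role": "system", "content": "You are a helpful assistant."}]
payload, history = build_chat_payload(history, "Who won the world series in 2020?")

# Stand-in for a server response; only the fields read by record_reply are included.
fake_response = json.dumps({
    "choices": [{"index": 0,
                 "message": {"role": "assistant", "content": "(model reply)"},
                 "finish_reason": "stop"}],
})
history = record_reply(history, fake_response)

# The next request carries the whole conversation so far.
next_payload, history = build_chat_payload(history, "Where was it played?")
roles = [m["role"] for m in next_payload["messages"]]
```

Each payload can be sent with the same headers as the curl command above. Because the history grows with every turn, the prompt token count reported in `usage` grows as well.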

For more in-depth examples and advanced features of the chat API, see the OpenAI official documentation[12].

Reference links

[1] vLLM: https://github.com/vllm-project/vllm
[2] Continuous batching: https://www.anyscale.com/blog/continuous-batching-llm-inference
[3] PagedAttention in vLLM: https://vllm.ai/
[4] GPTQ: https://arxiv.org/abs/2210.17323
[5] AWQ: https://arxiv.org/abs/2306.00978
[6] SqueezeLLM: https://arxiv.org/abs/2306.07629
[7] List models: https://platform.openai.com/docs/api-reference/models/list
[8] Query the model with the OpenAI Chat API: https://platform.openai.com/docs/api-reference/chat/completions/create
[9] Query the model with an input prompt: https://platform.openai.com/docs/api-reference/completions/create
[10] vllm/vllm-openai: https://hub.docker.com/r/vllm/vllm-openai/tags
[11] create chat completion: https://platform.openai.com/docs/api-reference/chat/completions/create
[12] OpenAI official documentation: https://platform.openai.com/docs/api-reference