
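The model deployment tutorial series continues! After the previous installments you should have a fairly complete picture of ONNX as an intermediate representation. In a real production environment, however, an ONNX model usually has to be converted into a format that a specific inference backend can consume. In this tutorial we get to know the well-known inference backend TensorRT.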

Contents

1. Introduction to TensorRT

2. Installing TensorRT

3. Building a model

4. Model inference

1. Introduction to TensorRT

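TensorRT is a deep learning inference framework released by NVIDIA for running inference on NVIDIA hardware. It supports models produced by quantization-aware training as well as offline (post-training) quantization, and users can choose between two reduced-precision optimization modes, INT8 and FP16. This lets deep learning models be deployed in production for a range of tasks such as video streaming, speech recognition, recommendation, fraud detection, text generation, and natural language processing. TensorRT is highly optimized for NVIDIA GPUs and is probably the fastest inference engine currently available for running models on them. More details can be found on the TensorRT website.

TensorRT website: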

https://developer.nvidia.com/tensorrt

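2. Installing TensorRT

Windows

We assume a machine with an NVIDIA GPU on which CUDA and cuDNN are already installed. Log in to the NVIDIA website and download the TensorRT archive that matches the CUDA version on the host.

Taking CUDA 10.2 as an example, choose the zip package built for CUDA 10.2. Once the download finishes, users with a conda virtual environment may want to switch into it first, then run commands like the following in PowerShell to install and test: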

cd \the\path\of\tensorrt\zip\file
Expand-Archive TensorRT-8.2.5.1.Windows10.x86_64.cuda-10.2.cudnn8.2.zip .
$env:TENSORRT_DIR = "$pwd\TensorRT-8.2.5.1"
$env:path = "$env:TENSORRT_DIR\lib;" + $env:path
pip install $env:TENSORRT_DIR\python\tensorrt-8.2.5.1-cp36-none-win_amd64.whl
python -c "import tensorrt;print(tensorrt.__version__)"

The last command checks the TensorRT version after installation. If it prints 8.2.5.1, the Python package was installed successfully.

zip package link:

https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/8.2.5.1/zip/tensorrt-8.2.5.1.windows10.x86_64.cuda-10.2.cudnn8.2.zip

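Linux

As on Windows, we assume a machine with an NVIDIA GPU on which CUDA and cuDNN are already installed. Log in to the NVIDIA website and download the TensorRT archive that matches the CUDA version on the host.

Taking CUDA 10.2 as an example, choose the tar package built for CUDA 10.2, then run commands like the following to install and test: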

cd /the/path/of/tensorrt/tar/gz/file
tar -zxvf TensorRT-8.2.5.1.linux.x86_64-gnu.cuda-10.2.cudnn8.2.tar.gz
export TENSORRT_DIR=$(pwd)/TensorRT-8.2.5.1
export LD_LIBRARY_PATH=$TENSORRT_DIR/lib:$LD_LIBRARY_PATH
pip install TensorRT-8.2.5.1/python/tensorrt-8.2.5.1-cp37-none-linux_x86_64.whl
python -c "import tensorrt;print(tensorrt.__version__)"

If the printed result is 8.2.5.1, the Python package was installed successfully.

tar package link:

https://developer.nvidia.com/compute/machine-learning/tensorrt/secure/8.2.5.1/tars/tensorrt-8.2.5.1.linux.x86_64-gnu.cuda-10.2.cudnn8.2.tar.gz
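
The one-line version check used in both installation recipes can be wrapped in a small script that also fails gracefully when the package or one of its libraries is missing. This is only a minimal sketch: the helper names parse_version and check_tensorrt and the minimum version (8, 2) are assumptions made for illustration and are not part of TensorRT.

```python
import importlib


def parse_version(version):
    """Turn a dotted version string such as '8.2.5.1' into a tuple of ints."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        # Pieces without leading digits count as 0.
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_tensorrt(minimum=(8, 2)):
    """Return (ok, message) describing whether tensorrt can be imported.

    `minimum` is an illustrative threshold, not an official requirement.
    """
    try:
        trt = importlib.import_module("tensorrt")
    except (ImportError, OSError) as exc:
        # OSError covers a missing shared library or DLL at import time.
        return False, f"tensorrt is not importable: {exc}"
    found = parse_version(trt.__version__)
    if found < tuple(minimum):
        return False, f"tensorrt {trt.__version__} is older than {minimum}"
    return True, f"tensorrt {trt.__version__} is available"


if __name__ == "__main__":
    ok, message = check_tensorrt()
    print(message)
```

Comparing tuples rather than strings avoids ordering mistakes such as "8.10" sorting before "8.2".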

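3. Building a model

There are two main ways to generate a model with TensorRT:

Build the network layer by layer directly through the TensorRT API;
Convert a model in an intermediate representation into a TensorRT model, for example an ONNX model.

Next, we use both Python and C++ to build TensorRT models in each of these two ways, and then run inference with the generated models.

Building directly

Building a network layer by layer with the TensorRT API resembles building one in a general training framework such as PyTorch or TensorFlow. Note that for layers that carry weights, such as convolution or normalization layers, the weight values have to be assigned into the TensorRT network. We do not demonstrate that here and only build a simple network that applies pooling to its input.

Building with the Python API

First we build the TensorRT network directly with the Python API. This approach mainly relies on create_builder_config and create_network of tensorrt.Builder, which create the config and the network respectively. The config holds settings such as the maximum workspace size; the network is the model body, to which layers are added one by one.

In addition, the input and output names must be defined, and the built network is serialized and saved to a local file. Note that if the network should accept inputs and outputs of different resolutions, you need to create a profile with create_optimization_profile of tensorrt.Builder, set the minimum, optimal, and maximum sizes on it, and register it on the config with add_optimization_profile.

The implementation is as follows: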

import tensorrt as trt

verbose = True
IN_NAME = 'input'
OUT_NAME = 'output'
IN_H = 224
IN_W = 224
BATCH_SIZE = 1

EXPLICIT_BATCH = 1 << (int)(
    trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if verbose else trt.Logger()
with trt.Builder(TRT_LOGGER) as builder, builder.create_builder_config(
) as config, builder.create_network(EXPLICIT_BATCH) as network:
    # define network: a single 2x2 max pooling layer with stride 2
    input_tensor = network.add_input(
        name=IN_NAME, dtype=trt.float32, shape=(BATCH_SIZE, 3, IN_H, IN_W))
    pool = network.add_pooling(
        input=input_tensor, type=trt.PoolingType.MAX, window_size=(2, 2))
    pool.stride = (2, 2)
    pool.get_output(0).name = OUT_NAME
    network.mark_output(pool.get_output(0))

    # optimization profile: min, opt and max shapes are all identical here
    profile = builder.create_optimization_profile()
    # set_shape (not set_shape_input, which is for shape tensors)
    profile.set_shape(IN_NAME, *[[BATCH_SIZE, 3, IN_H, IN_W]] * 3)
    # the profile only takes effect once it is added to the config
    config.add_optimization_profile(profile)

    # serialize the model to engine file
    builder.max_batch_size = 1
    config.max_workspace_size = 1 << 30
    engine = builder.build_engine(network, config)
    with open('model_python_trt.engine', mode='wb') as f:
        f.write(bytearray(engine.serialize()))
        print("generating file done!")

Building with the C++ API

For those who want to build the network directly in C++, the overall flow is very similar to the Python procedure above. The main points to note are:

nvinfer1::createInferBuilder corresponds to tensorrt.Builder in Python. It takes an instance of ILogger, but ILogger is an abstract class, so users have to derive from it and implement its virtual functions. Here, however, we simply use the Logger subclass implemented in ../samples/common/logger.h in the samples folder of the extracted TensorRT package.
Setting the input size of a TensorRT model requires several calls to the setDimensions method of IOptimizationProfile, which is slightly more verbose than in Python. An IOptimizationProfile is created with createOptimizationProfile, which corresponds to create_optimization_profile in Python.

The implementation is as follows:

#include <cassert>
#include <fstream>
#include <iostream>

#include <NvInfer.h>
#include <../samples/common/logger.h>

using namespace nvinfer1;
using namespace sample;

const char* IN_NAME = "input";
const char* OUT_NAME = "output";
static const int IN_H = 224;
static const int IN_W = 224;
static const int BATCH_SIZE = 1;
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);

int main(int argc, char** argv)
{
    // Create builder
    Logger m_logger;
    IBuilder* builder = createInferBuilder(m_logger);
    IBuilderConfig* config = builder->createBuilderConfig();

    // Create model to populate the network
    INetworkDefinition* network = builder->createNetworkV2(EXPLICIT_BATCH);
    ITensor* input_tensor = network->addInput(IN_NAME, DataType::kFLOAT, Dims4{ BATCH_SIZE, 3, IN_H, IN_W });
    IPoolingLayer* pool = network->addPoolingNd(*input_tensor, PoolingType::kMAX, DimsHW{ 2, 2 });
    pool->setStrideNd(DimsHW{ 2, 2 });
    pool->getOutput(0)->setName(OUT_NAME);
    network->markOutput(*pool->getOutput(0));

    // Build engine
    IOptimizationProfile* profile = builder->createOptimizationProfile();
    profile->setDimensions(IN_NAME, OptProfileSelector::kMIN, Dims4(BATCH_SIZE, 3, IN_H, IN_W));
    profile->setDimensions(IN_NAME, OptProfileSelector::kOPT, Dims4(BATCH_SIZE, 3, IN_H, IN_W));
    profile->setDimensions(IN_NAME, OptProfileSelector::kMAX, Dims4(BATCH_SIZE, 3, IN_H, IN_W));
    // The profile only takes effect once it is added to the config
    config->addOptimizationProfile(profile);
    config->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Serialize the model to engine file
    IHostMemory* modelStream{ nullptr };
    assert(engine != nullptr);
    modelStream = engine->serialize();

    std::ofstream p("model.engine", std::ios::binary);
    if (!p) {
        std::cerr << "could not open output file to save model" << std::endl;
        return -1;
    }
    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
    std::cout << "generating file done!" << std::endl;

    // Release resources
    modelStream->destroy();
    network->destroy();
    engine->destroy();
    builder->destroy();
    config->destroy();
    return 0;
}

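Converting from an IR

Besides building the network layer by layer through the TensorRT API and serializing it, TensorRT can also convert a model in an intermediate representation (such as ONNX) into a TensorRT model.

Converting with the Python API

We first implement in PyTorch the same model as above, one that applies a single pooling operation to its input and returns the result. We then convert the PyTorch model to ONNX, and finally convert the ONNX model to a TensorRT model.

The key piece here is TensorRT's OnnxParser, which parses an ONNX model into a TensorRT network. The result is again a TensorRT model that behaves the same as the one built directly above.

The implementation is as follows: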

import torch
import onnx
import tensorrt as trt

onnx_model = 'model.onnx'


class NaiveModel(torch.nn.Module):

    def __init__(self):
        super().__init__()
        self.pool = torch.nn.MaxPool2d(2, 2)

    def forward(self, x):
        return self.pool(x)


device = torch.device('cuda:0')

# generate ONNX model
torch.onnx.export(
    NaiveModel(),
    torch.randn(1, 3, 224, 224),
    onnx_model,
    input_names=['input'],
    output_names=['output'],
    opset_version=11)
onnx_model = onnx.load(onnx_model)

# create builder and network
logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
EXPLICIT_BATCH = 1 << (int)(
    trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(EXPLICIT_BATCH)

# parse onnx
parser = trt.OnnxParser(network, logger)

if not parser.parse(onnx_model.SerializeToString()):
    error_msgs = ''
    for error in range(parser.num_errors):
        error_msgs += f'{parser.get_error(error)}\n'
    raise RuntimeError(f'Failed to parse onnx, {error_msgs}')

config = builder.create_builder_config()
config.max_workspace_size = 1 << 20
profile = builder.create_optimization_profile()

profile.set_shape('input', [1, 3, 224, 224], [1, 3, 224, 224],
                  [1, 3, 224, 224])
config.add_optimization_profile(profile)

# create engine
with torch.cuda.device(device):
    engine = builder.build_engine(network, config)

with open('model.engine', mode='wb') as f:
    f.write(bytearray(engine.serialize()))
    print("generating file done!")

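When converting from an IR, requirements such as multiple batch sizes, multiple inputs, or dynamic shapes can all be handled by calling set_shape once per input. The arguments of set_shape are, in order: the input node name, the minimum accepted input size, the optimal input size, and the maximum accepted input size. These three sizes are generally required to be non-decreasing in every dimension (min <= opt <= max).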
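
To make the min/opt/max constraint concrete, the sketch below checks a shape triple in plain Python before it would be handed to set_shape. validate_profile and shape_in_profile are hypothetical helpers, not TensorRT functions, and the example sizes are made up; the code only mirrors the element-wise ordering rule described above.

```python
def validate_profile(min_shape, opt_shape, max_shape):
    """Raise ValueError unless min <= opt <= max holds on every axis."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        raise ValueError("min, opt and max shapes must have the same rank")
    for axis, (lo, opt, hi) in enumerate(zip(min_shape, opt_shape, max_shape)):
        if not (lo <= opt <= hi):
            raise ValueError(
                f"axis {axis}: expected min <= opt <= max, got {lo}, {opt}, {hi}")


def shape_in_profile(shape, min_shape, max_shape):
    """Return True if `shape` lies inside [min_shape, max_shape] on every axis."""
    if len(shape) != len(min_shape) or len(shape) != len(max_shape):
        return False
    return all(lo <= s <= hi for lo, s, hi in zip(min_shape, shape, max_shape))


# A dynamic-batch, dynamic-resolution range (values chosen for illustration).
MIN_SHAPE = (1, 3, 224, 224)
OPT_SHAPE = (4, 3, 512, 512)
MAX_SHAPE = (8, 3, 1024, 1024)

validate_profile(MIN_SHAPE, OPT_SHAPE, MAX_SHAPE)
print(shape_in_profile((2, 3, 640, 480), MIN_SHAPE, MAX_SHAPE))   # True
print(shape_in_profile((16, 3, 224, 224), MIN_SHAPE, MAX_SHAPE))  # False
```

With real TensorRT objects, a validated triple would then be passed on as profile.set_shape('input', MIN_SHAPE, OPT_SHAPE, MAX_SHAPE), in the same way as the fixed-size call in the conversion script.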

Converting with the C++ API

Having shown how to convert an ONNX model to a TensorRT model in Python, we now do the same in C++. With NvOnnxParser, the ONNX file produced in the previous subsection can be parsed directly into the network.

The implementation is as follows:

#include <cassert>
#include <fstream>
#include <iostream>

#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <../samples/common/logger.h>

using namespace nvinfer1;
using namespace nvonnxparser;
using namespace sample;

int main(int argc, char** argv)
{
    // Create builder
    Logger m_logger;
    IBuilder* builder = createInferBuilder(m_logger);
    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    IBuilderConfig* config = builder->createBuilderConfig();

    // Create model to populate the network
    INetworkDefinition* network = builder->createNetworkV2(explicitBatch);

    // Parse ONNX file and stop early if parsing fails
    IParser* parser = nvonnxparser::createParser(*network, m_logger);
    bool parser_status = parser->parseFromFile("model.onnx", static_cast<int>(ILogger::Severity::kWARNING));
    if (!parser_status) {
        std::cerr << "failed to parse model.onnx" << std::endl;
        return -1;
    }

    // Get the name of network input
    Dims dim = network->getInput(0)->getDimensions();
    if (dim.d[0] == -1)  // -1 means it is a dynamic model
    {
        const char* name = network->getInput(0)->getName();
        IOptimizationProfile* profile = builder->createOptimizationProfile();
        profile->setDimensions(name, OptProfileSelector::kMIN, Dims4(1, dim.d[1], dim.d[2], dim.d[3]));
        profile->setDimensions(name, OptProfileSelector::kOPT, Dims4(1, dim.d[1], dim.d[2], dim.d[3]));
        profile->setDimensions(name, OptProfileSelector::kMAX, Dims4(1, dim.d[1], dim.d[2], dim.d[3]));
        config->addOptimizationProfile(profile);
    }

    // Build engine
    config->setMaxWorkspaceSize(1 << 20);
    ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

    // Serialize the model to engine file
    IHostMemory* modelStream{ nullptr };
    assert(engine != nullptr);
    modelStream = engine->serialize();

    std::ofstream p("model.engine", std::ios::binary);
    if (!p) {
        std::cerr << "could not open output file to save model" << std::endl;
        return -1;
    }
    p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
    std::cout << "generate file success!" << std::endl;

    // Release resources
    modelStream->destroy();
    parser->destroy();
    network->destroy();
    engine->destroy();
    builder->destroy();
    config->destroy();
    return 0;
}

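4. Model inference

Above, we used two ways of building a TensorRT model and two languages, Python and C++, producing four TensorRT models in total. In principle these four models are functionally identical.

Next, we run inference on the generated TensorRT models with both Python and C++.

Inference with the Python API

First we run inference on the TensorRT model with the Python API; part of the code is adapted from MMDeploy. Running the code below shows that feeding a 1x3x224x224 tensor yields a 1x3x112x112 tensor, exactly what we expect from pooling the input.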

from typing import Union, Optional, Sequence, Dict, Any

import torch
import tensorrt as trt


class TRTWrapper(torch.nn.Module):

    def __init__(self, engine: Union[str, trt.ICudaEngine],
                 output_names: Optional[Sequence[str]] = None) -> None:
        super().__init__()
        self.engine = engine
        if isinstance(self.engine, str):
            with trt.Logger() as logger, trt.Runtime(logger) as runtime:
                with open(self.engine, mode='rb') as f:
                    engine_bytes = f.read()
                self.engine = runtime.deserialize_cuda_engine(engine_bytes)
        self.context = self.engine.create_execution_context()
        names = [_ for _ in self.engine]
        input_names = list(filter(self.engine.binding_is_input, names))
        self._input_names = input_names
        self._output_names = output_names

        if self._output_names is None:
            output_names = list(set(names) - set(input_names))
            self._output_names = output_names

    def forward(self, inputs: Dict[str, torch.Tensor]):
        assert self._input_names is not None
        assert self._output_names is not None
        bindings = [None] * (len(self._input_names) + len(self._output_names))
        profile_id = 0
        for input_name, input_tensor in inputs.items():
            # check if input shape is valid
            profile = self.engine.get_profile_shape(profile_id, input_name)
            assert input_tensor.dim() == len(
                profile[0]), 'Input dim is different from engine profile.'
            for s_min, s_input, s_max in zip(profile[0], input_tensor.shape,
                                             profile[2]):
                assert s_min <= s_input <= s_max, \
                    'Input shape should be between ' \
                    + f'{profile[0]} and {profile[2]}' \
                    + f' but get {tuple(input_tensor.shape)}.'
            idx = self.engine.get_binding_index(input_name)

            # All input tensors must be gpu variables
            assert 'cuda' in input_tensor.device.type
            input_tensor = input_tensor.contiguous()
            if input_tensor.dtype == torch.long:
                input_tensor = input_tensor.int()
            self.context.set_binding_shape(idx, tuple(input_tensor.shape))
            bindings[idx] = input_tensor.contiguous().data_ptr()

        # create output tensors
        outputs = {}
        for output_name in self._output_names:
            idx = self.engine.get_binding_index(output_name)
            dtype = torch.float32
            shape = tuple(self.context.get_binding_shape(idx))

            device = torch.device('cuda')
            output = torch.empty(size=shape, dtype=dtype, device=device)
            outputs[output_name] = output
            bindings[idx] = output.data_ptr()
        self.context.execute_async_v2(bindings,
                                      torch.cuda.current_stream().cuda_stream)
        return outputs


model = TRTWrapper('model.engine', ['output'])
output = model(dict(input=torch.randn(1, 3, 224, 224).cuda()))
print(output)

MMDeploy link:

https://github.com/open-mmlab/mmdeploy

(Feel free to try it out, and give it a star if you find it useful.)
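
The 1x3x112x112 output shape follows from the usual pooling size formula, out = floor((in + 2 * padding - kernel) / stride) + 1. The sketch below evaluates it for the 2x2, stride-2 max pooling used throughout this article and applies the same pooling to a tiny 4x4 plane. pool_output_shape and max_pool_2x2 are hypothetical helpers written for illustration; they assume NCHW layout, no dilation, and floor rounding.

```python
def pool_output_shape(in_shape, kernel=(2, 2), stride=(2, 2), padding=(0, 0)):
    """Output shape of 2-D pooling on an NCHW input (floor rounding)."""
    n, c, h, w = in_shape
    out_h = (h + 2 * padding[0] - kernel[0]) // stride[0] + 1
    out_w = (w + 2 * padding[1] - kernel[1]) // stride[1] + 1
    return (n, c, out_h, out_w)


def max_pool_2x2(plane):
    """2x2, stride-2 max pooling over one 2-D plane given as nested lists."""
    rows, cols = len(plane), len(plane[0])
    return [[max(plane[r][c], plane[r][c + 1],
                 plane[r + 1][c], plane[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]


print(pool_output_shape((1, 3, 224, 224)))  # (1, 3, 112, 112)

plane = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
print(max_pool_2x2(plane))  # [[6, 8], [14, 16]]
```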

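Inference with the C++ API

Finally, in many real production environments the actual work is done in C++ for the sake of runtime efficiency, and TensorRT users generally care most about using it from C++. We therefore implement model inference in C++ as well, which also allows a comparison with inference through the Python API.

The implementation is as follows: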

#include <cassert>
#include <fstream>
#include <iostream>

#include <NvInfer.h>
#include <../samples/common/logger.h>

#define CHECK(status) \
    do\
    {\
        auto ret = (status);\
        if (ret != 0)\
        {\
            std::cerr << "Cuda failure: " << ret << std::endl;\
            abort();\
        }\
    } while (0)

using namespace nvinfer1;
using namespace sample;

const char* IN_NAME = "input";
const char* OUT_NAME = "output";
static const int IN_H = 224;
static const int IN_W = 224;
static const int BATCH_SIZE = 1;
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);


void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
    const ICudaEngine& engine = context.getEngine();

    // Pointers to input and output device buffers to pass to engine.
    // Engine requires exactly IEngine::getNbBindings() number of buffers.
    assert(engine.getNbBindings() == 2);
    void* buffers[2];

    // In order to bind the buffers, we need to know the names of the input and output tensors.
    // Note that indices are guaranteed to be less than IEngine::getNbBindings()
    const int inputIndex = engine.getBindingIndex(IN_NAME);
    const int outputIndex = engine.getBindingIndex(OUT_NAME);

    // Create GPU buffers on device (2x2 pooling with stride 2 quarters the element count)
    CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * IN_H * IN_W * sizeof(float)));
    CHECK(cudaMalloc(&buffers[outputIndex], batchSize * 3 * IN_H * IN_W / 4 * sizeof(float)));

    // Create stream
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
    CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * IN_H * IN_W * sizeof(float), cudaMemcpyHostToDevice, stream));
    // The engines in this article are built with an explicit batch dimension, so use enqueueV2
    context.enqueueV2(buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * 3 * IN_H * IN_W / 4 * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);

    // Release stream and buffers
    cudaStreamDestroy(stream);
    CHECK(cudaFree(buffers[inputIndex]));
    CHECK(cudaFree(buffers[outputIndex]));
}

int main(int argc, char** argv)
{
    // Read the serialized engine from disk into a host buffer
    char *trtModelStream{ nullptr };
    size_t size{ 0 };

    std::ifstream file("model.engine", std::ios::binary);
    if (file.good()) {
        file.seekg(0, file.end);
        size = file.tellg();
        file.seekg(0, file.beg);
        trtModelStream = new char[size];
        assert(trtModelStream);
        file.read(trtModelStream, size);
        file.close();
    }

    Logger m_logger;
    IRuntime* runtime = createInferRuntime(m_logger);
    assert(runtime != nullptr);
    ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size, nullptr);
    assert(engine != nullptr);
    IExecutionContext* context = engine->createExecutionContext();
    assert(context != nullptr);

    // generate input data
    float data[BATCH_SIZE * 3 * IN_H * IN_W];
    for (int i = 0; i < BATCH_SIZE * 3 * IN_H * IN_W; i++)
        data[i] = 1;

    // Run inference
    float prob[BATCH_SIZE * 3 * IN_H * IN_W / 4];
    doInference(*context, data, prob, BATCH_SIZE);

    // Destroy the engine and free the host copy of the serialized model
    context->destroy();
    engine->destroy();
    runtime->destroy();
    delete[] trtModelStream;
    return 0;
}
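
The buffer sizes passed to cudaMalloc and cudaMemcpyAsync in doInference are plain element counts multiplied by sizeof(float). As a quick sanity check, the arithmetic can be reproduced in Python. tensor_nbytes is a hypothetical helper, and the 4-byte element size assumes FP32 tensors, as in the code above.

```python
from functools import reduce
from operator import mul


def tensor_nbytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of `shape`; itemsize=4 assumes FP32."""
    return reduce(mul, shape, 1) * itemsize


input_bytes = tensor_nbytes((1, 3, 224, 224))
output_bytes = tensor_nbytes((1, 3, 112, 112))

print(input_bytes)                  # 602112
print(output_bytes)                 # 150528
print(input_bytes // output_bytes)  # 4, matching the "/ 4" in the C++ code

# The workspace limits used in this article, expressed in MiB.
print((1 << 20) // 2**20)  # 1    (setMaxWorkspaceSize(1 << 20))
print((1 << 30) // 2**20)  # 1024 (max_workspace_size = 1 << 30)
```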

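Summary

In this article we covered two ways of building a TensorRT model: building the network layer by layer directly through the TensorRT API, and converting a model in an intermediate representation into a TensorRT model. We also built and ran TensorRT models in both C++ and Python. We hope you found it useful! In the next article we will look at how to add custom operators to TensorRT, so stay tuned.

FAQ

Q: Running the code fails with: Could not find: cudnn64_8.dll. Is it on your PATH?
A: First check whether your environment variables include the directory that contains cudnn64_8.dll. If cuDNN lives in C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v10.2\\bin but that directory only contains cudnn64_7.dll, download the cuDNN zip package from the NVIDIA website, extract it, and copy its cudnn64_8.dll into the bin directory of the CUDA Toolkit. Alternatively, making a copy of cudnn64_7.dll and renaming the copy to cudnn64_8.dll can also make this error go away.

References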
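
Checking whether a given DLL is reachable through PATH can be scripted. The sketch below walks the PATH entries and reports every directory that contains the file. find_on_path is a hypothetical helper written for illustration, and cudnn64_8.dll is simply the file name taken from the error message above.

```python
import os


def find_on_path(filename, path=None):
    """Return every directory listed in PATH that contains `filename`."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        # Windows PATH entries are sometimes wrapped in double quotes.
        directory = directory.strip('"')
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    locations = find_on_path("cudnn64_8.dll")
    if locations:
        print("found in:", locations)
    else:
        print("cudnn64_8.dll is not on PATH")
```

An empty result means the loader will not find the library through PATH, which is the situation the error message describes.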