<kbd id="afajh"><form id="afajh"></form></kbd>
<strong id="afajh"><dl id="afajh"></dl></strong>
    <del id="afajh"><form id="afajh"></form></del>
        1. <th id="afajh"><progress id="afajh"></progress></th>
          <b id="afajh"><abbr id="afajh"></abbr></b>
          <th id="afajh"><progress id="afajh"></progress></th>

CUDA Notes: Using Streams


          2022-10-27 21:20


Author: 伊凡 @知乎 (republished with permission)
Source: https://zhuanlan.zhihu.com/p/51402722
Editor: 極市平臺

Overview

Let's look at how to use streams to overlap data transfer with computation, so that data processing stays one step ahead.

Out of personal interest and for work, I recently got into GPU programming, so I wanted to write an article (or a series of articles) summarizing what I've learned, to keep myself from forgetting it later.

This short article introduces the concept of streams in CUDA. CUDA programs typically have to process huge amounts of data, so memory bandwidth often becomes the main bottleneck. With the help of streams, a CUDA program can effectively overlap memory transfers with computation and thereby increase throughput.

This article uses a very naive image-processing example: pixel color-space conversion, turning a 7680x4320 8-bit BGRA image into an 8-bit YUV image of the same size. The computation is very simple; there is just a lot of data. The conversion formulas are taken straight from Wikipedia (https://en.wikipedia.org/wiki/YUV#Conversion_to/from_RGB).
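For reference, this is the fixed-point approximation of those formulas that the code below implements (>>8 is a divide by 256):

          Y = (( 66*R + 129*G +  25*B) >> 8) + 16
          U = ((-38*R -  74*G + 112*B) >> 8) + 128
          V = ((112*R -  94*G -  18*B) >> 8) + 128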

Since the GPU and CPU cannot directly read each other's memory, a CUDA program generally consists of the following three steps: 1) copy the input data from CPU memory to GPU memory, 2) run the computation on the GPU and keep the result in GPU memory, 3) copy the result from GPU memory back to CPU memory.

Without any special handling, CUDA uses just one stream by default (the Default Stream). In that case the three steps above are awkwardly daisy-chained: each step has to finish before the next one can start. Annoying, isn't it? (The complete code is attached at the end of this article.)

Computation cannot start until the data copy has completely finished:
          uint8_t* bgraBuffer;
          uint8_t* yuvBuffer;
          uint8_t* deviceBgraBuffer;
          uint8_t* deviceYuvBuffer;

          const int dataSizeBgra = 7680 * 4320 * 4;
          const int dataSizeYuv = 7680 * 4320 * 3;

          cudaMallocHost(&bgraBuffer, dataSizeBgra);
          cudaMallocHost(&yuvBuffer, dataSizeYuv);
          cudaMalloc(&deviceBgraBuffer, dataSizeBgra);
          cudaMalloc(&deviceYuvBuffer, dataSizeYuv);

          //Generate a random 8K BGRA image
          GenerateBgra8K(bgraBuffer, dataSizeBgra);

          //Copy the image to GPU memory
          cudaMemcpy(deviceBgraBuffer, bgraBuffer, dataSizeBgra, cudaMemcpyHostToDevice);

          //CUDA kernel: convert BGRA to YUV
          convertPixelFormat<<<4096, 1024>>>(deviceBgraBuffer, deviceYuvBuffer, 7680*4320);

          //Wait for the computation to finish
          cudaDeviceSynchronize();

          //Copy the converted image back to CPU memory
          cudaMemcpy(yuvBuffer, deviceYuvBuffer, dataSizeYuv, cudaMemcpyDeviceToHost);

          cudaFreeHost(bgraBuffer);
          cudaFreeHost(yuvBuffer);
          cudaFree(deviceBgraBuffer);
          cudaFree(deviceYuvBuffer);

NVIDIA GPUs have the following rather nice capabilities (I don't know whether they are unique to NVIDIA):

1. Data copies and computation can run at the same time.
2. Copies in both directions (GPU to CPU and CPU to GPU) can run at the same time, like traffic on a two-way expressway.

At the same time, this parallelism between transfers and computation comes with a perfectly logical restriction: a kernel that is computing must not read or write data that is currently being copied.
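Whether a particular card supports this kind of overlap can be checked at runtime. A minimal sketch (asyncEngineCount is a field of cudaDeviceProp in the CUDA runtime API; 1 means copies can overlap kernel execution, 2 means the two copy directions can additionally overlap each other):

          //Query the number of asynchronous copy engines on device 0
          cudaDeviceProp prop;
          cudaGetDeviceProperties(&prop, 0);
          std::cout << "Copy engines: " << prop.asyncEngineCount << std::endl;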

Streams are exactly the tool that lets us achieve both kinds of parallelism. The basic idea is:

1. Split the data into many chunks, and hand each chunk to a stream.
2. Each stream performs the three steps: 1) copy that stream's data from CPU memory to GPU memory, 2) run the computation on the GPU and keep the result in GPU memory, 3) copy that stream's result from GPU memory back to CPU memory.
3. All the streams are launched together, and the GPU's scheduler decides how to run them in parallel.

With this trick, suppose we split the data into two chunks, A and B, each handled by its own stream. A's computation can overlap with B's transfer, and A's and B's transfers can also overlap each other. Because the first stream only touches chunk A and the second only touches chunk B, the restriction that "a computing kernel must not read or write data that is being copied" is never violated. The effect looks like this:

[Figure: timeline comparison of two streams (top) versus the default stream only (bottom)]

In the NSight profiler the actual timeline looks like this (8 streams were used here): [Figure: NSight profiler timeline with 8 overlapping streams]
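If you want to record a timeline like this yourself, one option, assuming a recent CUDA toolkit with Nsight Systems installed, is the nsys command line (./stream_demo stands for whatever you named the compiled binary):

          nsys profile -t cuda -o report ./stream_demo

This traces the CUDA API calls, memory transfers, and kernels, and the resulting report can be opened in the Nsight Systems GUI to inspect the per-stream timeline.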

The code (abridged):

          uint8_t* bgraBuffer;
          uint8_t* yuvBuffer;
          uint8_t* deviceBgraBuffer;
          uint8_t* deviceYuvBuffer;

          const int dataSizeBgra = 7680 * 4320 * 4;
          const int dataSizeYuv = 7680 * 4320 * 3;

          cudaMallocHost(&bgraBuffer, dataSizeBgra);
          cudaMallocHost(&yuvBuffer, dataSizeYuv);
          cudaMalloc(&deviceBgraBuffer, dataSizeBgra);
          cudaMalloc(&deviceYuvBuffer, dataSizeYuv);

          //Generate a random 8K BGRA image
          GenerateBgra8K(bgraBuffer, dataSizeBgra);

          //Number of streams; 8 are used here
          const int nStreams = 8;

          //Initialize the streams
          cudaStream_t streams[nStreams];
          for (int i = 0; i < nStreams; i++) {
            cudaStreamCreate(&streams[i]);
          }

          //Work out how much data each stream handles; here the data is simply split into 8 equal parts.
          //The sizes happen to divide evenly here, but watch out for remainders in practice.
          int brgaOffset = 0;
          int yuvOffset = 0;
          const int brgaChunkSize = dataSizeBgra / nStreams;
          const int yuvChunkSize = dataSizeYuv / nStreams;

          //This loop launches the nStreams streams one after another
          for(int i=0; i<nStreams; i++)
          {
            
            brgaOffset = brgaChunkSize*i;
            yuvOffset = yuvChunkSize*i;

            //CPU-to-GPU copy of the input data, stream i
            cudaMemcpyAsync(  deviceBgraBuffer+brgaOffset,
                              bgraBuffer+brgaOffset,
                              brgaChunkSize,
                              cudaMemcpyHostToDevice,
                              streams[i] );

            //Computation, stream i
            convertPixelFormat<<<4096, 1024, 0, streams[i]>>>(
                                           deviceBgraBuffer+brgaOffset, 
                                           deviceYuvBuffer+yuvOffset, 
                                           brgaChunkSize/4 );

            //GPU-to-CPU copy of the results, stream i
            cudaMemcpyAsync(  yuvBuffer+yuvOffset,
                              deviceYuvBuffer+yuvOffset,
                              yuvChunkSize,
                              cudaMemcpyDeviceToHost,
                              streams[i] );
          }

          //Wait for all operations to finish
          cudaDeviceSynchronize();

          cudaFreeHost(bgraBuffer);
          cudaFreeHost(yuvBuffer);
          cudaFree(deviceBgraBuffer);
          cudaFree(deviceYuvBuffer);
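One detail worth pointing out: the host buffers are allocated with cudaMallocHost rather than plain malloc or new. This matters, because cudaMemcpyAsync can only truly overlap with kernels when the host memory is page-locked (pinned); with ordinary pageable memory the transfer is staged through an internal buffer and loses most or all of the overlap. A minimal illustrative sketch of the two allocation styles (not from the original article):

          uint8_t* pageable = (uint8_t*)malloc(dataSizeBgra); //pageable: async copies degrade
          uint8_t* pinned;
          cudaMallocHost(&pinned, dataSizeBgra);              //pinned: the DMA engine can copy it while kernels run

          //Only the pinned buffer allows a genuinely asynchronous transfer
          cudaMemcpyAsync(deviceBgraBuffer, pinned, dataSizeBgra, cudaMemcpyHostToDevice, streams[0]);

          free(pageable);
          cudaFreeHost(pinned);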

Performance comparison measured on my machine (GPU: Quadro M2200):

• CPU: 300 ms

• GPU, default stream only: 34.6 ms

• GPU, 8 streams: 20.2 ms

• GPU, 18 streams: 19.3 ms

Summary

Using multiple streams to overlap data transfer with computation can add a good deal of throughput compared with using only the default stream. When you have to process massive amounts of data, streams are an essential tool.
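A small refinement the code below does not use: cudaDeviceSynchronize() blocks until every stream has finished. If the CPU can start consuming results chunk by chunk, you can instead wait on one stream at a time with cudaStreamSynchronize (consumeChunk is a hypothetical placeholder for whatever the CPU does with each finished chunk):

          for (int i = 0; i < nStreams; i++) {
            cudaStreamSynchronize(streams[i]);                      //wait only for stream i
            consumeChunk(yuvBuffer + yuvChunkSize*i, yuvChunkSize); //hypothetical consumer
          }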

The complete code is below (requires an NVIDIA GPU; the tests in this article used CUDA 10.0).
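A plausible way to build and run it (the file name stream_demo.cu is my choice, not from the original):

          nvcc -O2 -o stream_demo stream_demo.cu
          ./stream_demo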

          #include <cstdio>
          #include <cstdint>
          #include <vector>
          #include <random>
          #include <iostream>

          #include <cuda.h>
          #include <cuda_runtime.h>

          #ifdef DEBUG
          #define CUDA_CALL(F)  if( (F) != cudaSuccess ) \
            {printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
             __FILE__,__LINE__); exit(-1);}

          #define CUDA_CHECK()  if( (cudaPeekAtLastError()) != cudaSuccess ) \
            {printf("Error %s at %s:%d\n", cudaGetErrorString(cudaGetLastError()), \
             __FILE__,__LINE__-1); exit(-1);}

          #else
          #define CUDA_CALL(F) (F)
          #define CUDA_CHECK()
          #endif

          void PrintDeviceInfo();
          void GenerateBgra8K(uint8_t* buffer, int dataSize);
          void convertPixelFormatCpu(uint8_t* inputBgra, uint8_t* outputYuv, int numPixels);
          __global__ void convertPixelFormat(uint8_t* inputBgra, uint8_t* outputYuv, int numPixels);

          int main()
          {
            PrintDeviceInfo();

            uint8_t* bgraBuffer;
            uint8_t* yuvBuffer;
            uint8_t* deviceBgraBuffer;
            uint8_t* deviceYuvBuffer;

            const int dataSizeBgra = 7680 * 4320 * 4;
            const int dataSizeYuv = 7680 * 4320 * 3;
            CUDA_CALL(cudaMallocHost(&bgraBuffer, dataSizeBgra));
            CUDA_CALL(cudaMallocHost(&yuvBuffer, dataSizeYuv));
            CUDA_CALL(cudaMalloc(&deviceBgraBuffer, dataSizeBgra));
            CUDA_CALL(cudaMalloc(&deviceYuvBuffer, dataSizeYuv));

            std::vector<uint8_t> yuvCpuBuffer(dataSizeYuv);

            cudaEvent_t start, stop;
            float elapsedTime;
            float elapsedTimeTotal;
            float dataRate;
            CUDA_CALL(cudaEventCreate(&start));
            CUDA_CALL(cudaEventCreate(&stop));

            std::cout << " " << std::endl;
            std::cout << "Generating 7680 x 4320 BRGA8888 image, data size: " << dataSizeBgra << std::endl;
            GenerateBgra8K(bgraBuffer, dataSizeBgra);

            std::cout << " " << std::endl;
            std::cout << "Computing results using CPU." << std::endl;
            std::cout << " " << std::endl;
            CUDA_CALL(cudaEventRecord(start, 0));
            convertPixelFormatCpu(bgraBuffer, yuvCpuBuffer.data(), 7680*4320);
            CUDA_CALL(cudaEventRecord(stop, 0));
            CUDA_CALL(cudaEventSynchronize(stop));
            CUDA_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));
            std::cout << "    Whole process took " << elapsedTime << "ms." << std::endl;

            std::cout << " " << std::endl;
            std::cout << "Computing results using GPU, default stream." << std::endl;
            std::cout << " " << std::endl;

            std::cout << "    Move data to GPU." << std::endl;
            CUDA_CALL(cudaEventRecord(start, 0));
            CUDA_CALL(cudaMemcpy(deviceBgraBuffer, bgraBuffer, dataSizeBgra, cudaMemcpyHostToDevice));
            CUDA_CALL(cudaEventRecord(stop, 0));
            CUDA_CALL(cudaEventSynchronize(stop));
            CUDA_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));
            dataRate = dataSizeBgra/(elapsedTime/1000.0)/1.0e9;
            elapsedTimeTotal = elapsedTime;
            std::cout << "        Data transfer took " << elapsedTime << "ms." << std::endl;
            std::cout << "        Performance is " << dataRate << "GB/s." << std::endl;

            std::cout << "    Convert 8-bit BGRA to 8-bit YUV." << std::endl;
            CUDA_CALL(cudaEventRecord(start, 0));
            convertPixelFormat<<<32400, 1024>>>(deviceBgraBuffer, deviceYuvBuffer, 7680*4320);
            CUDA_CHECK();
            CUDA_CALL(cudaDeviceSynchronize());
            CUDA_CALL(cudaEventRecord(stop, 0));
            CUDA_CALL(cudaEventSynchronize(stop));
            CUDA_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));
            dataRate = dataSizeBgra/(elapsedTime/1000.0)/1.0e9;
            elapsedTimeTotal += elapsedTime;
            std::cout << "        Processing of 8K image took " << elapsedTime << "ms." << std::endl;
            std::cout << "        Performance is " << dataRate << "GB/s." << std::endl;

            std::cout << "    Move data to CPU." << std::endl;
            CUDA_CALL(cudaEventRecord(start, 0));
            CUDA_CALL(cudaMemcpy(yuvBuffer, deviceYuvBuffer, dataSizeYuv, cudaMemcpyDeviceToHost));
            CUDA_CALL(cudaEventRecord(stop, 0));
            CUDA_CALL(cudaEventSynchronize(stop));
            CUDA_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));
            dataRate = dataSizeYuv/(elapsedTime/1000.0)/1.0e9;
            elapsedTimeTotal += elapsedTime;
            std::cout << "        Data transfer took " << elapsedTime << "ms." << std::endl;
            std::cout << "        Performance is " << dataRate << "GB/s." << std::endl;

            std::cout << "    Whole process took " << elapsedTimeTotal << "ms." <<std::endl;

            std::cout << "    Compare CPU and GPU results ..." << std::endl;
            bool foundMistake = false;
            for(int i=0; i<dataSizeYuv; i++){
              if(yuvCpuBuffer[i]!=yuvBuffer[i]){
                foundMistake = true;
                break;
              }
            }

            if(foundMistake){
              std::cout << "        Results are NOT the same." << std::endl;
            } else {
              std::cout << "        Results are the same." << std::endl;
            }

            const int nStreams = 16;

            std::cout << " " << std::endl;
            std::cout << "Computing results using GPU, using "<< nStreams <<" streams." << std::endl;
            std::cout << " " << std::endl;

            cudaStream_t streams[nStreams];
            std::cout << "    Creating " << nStreams << " CUDA streams." << std::endl;
            for (int i = 0; i < nStreams; i++) {
              CUDA_CALL(cudaStreamCreate(&streams[i]));
            }

            int brgaOffset = 0;
            int yuvOffset = 0;
            const int brgaChunkSize = dataSizeBgra / nStreams;
            const int yuvChunkSize = dataSizeYuv / nStreams;

            CUDA_CALL(cudaEventRecord(start, 0));
            for(int i=0; i<nStreams; i++)
            {
              std::cout << "        Launching stream " << i << "." << std::endl;
              brgaOffset = brgaChunkSize*i;
              yuvOffset = yuvChunkSize*i;
              CUDA_CALL(cudaMemcpyAsync(  deviceBgraBuffer+brgaOffset,
                                          bgraBuffer+brgaOffset,
                                          brgaChunkSize,
                                          cudaMemcpyHostToDevice,
                                          streams[i] ));

              convertPixelFormat<<<4096, 1024, 0, streams[i]>>>(deviceBgraBuffer+brgaOffset, deviceYuvBuffer+yuvOffset, brgaChunkSize/4);

              CUDA_CALL(cudaMemcpyAsync(  yuvBuffer+yuvOffset,
                                          deviceYuvBuffer+yuvOffset,
                                          yuvChunkSize,
                                          cudaMemcpyDeviceToHost,
                                          streams[i] ));
            }

            CUDA_CHECK();
            CUDA_CALL(cudaDeviceSynchronize());

            CUDA_CALL(cudaEventRecord(stop, 0));
            CUDA_CALL(cudaEventSynchronize(stop));
            CUDA_CALL(cudaEventElapsedTime(&elapsedTime, start, stop));
            std::cout << "    Whole process took " << elapsedTime << "ms." << std::endl;

            std::cout << "    Compare CPU and GPU results ..." << std::endl;
            for(int i=0; i<dataSizeYuv; i++){
              if(yuvCpuBuffer[i]!=yuvBuffer[i]){
                foundMistake = true;
                break;
              }
            }

            if(foundMistake){
              std::cout << "        Results are NOT the same." << std::endl;
            } else {
              std::cout << "        Results are the same." << std::endl;
            }

            //Destroy streams and timing events before freeing the buffers
            for (int i = 0; i < nStreams; i++) {
              CUDA_CALL(cudaStreamDestroy(streams[i]));
            }
            CUDA_CALL(cudaEventDestroy(start));
            CUDA_CALL(cudaEventDestroy(stop));

            CUDA_CALL(cudaFreeHost(bgraBuffer));
            CUDA_CALL(cudaFreeHost(yuvBuffer));
            CUDA_CALL(cudaFree(deviceBgraBuffer));
            CUDA_CALL(cudaFree(deviceYuvBuffer));

            return 0;
          }

          void PrintDeviceInfo(){
            int deviceCount = 0;
            cudaGetDeviceCount(&deviceCount);
            std::cout << "Number of device(s): " << deviceCount << std::endl;
            if (deviceCount == 0) {
                std::cout << "There is no device supporting CUDA" << std::endl;
                return;
            }

            cudaDeviceProp info;
            for(int i=0; i<deviceCount; i++){
              cudaGetDeviceProperties(&info, i);
              std::cout << "Device " << i << std::endl;
              std::cout << "    Name:                    " << std::string(info.name) << std::endl;
              std::cout << "    Glocbal memory:          " << info.totalGlobalMem/1024.0/1024.0 << " MB"<< std::endl;
              std::cout << "    Shared memory per block: " << info.sharedMemPerBlock/1024.0 << " KB"<< std::endl;
              std::cout << "    Warp size:               " << info.warpSize<< std::endl;
              std::cout << "    Max thread per block:    " << info.maxThreadsPerBlock<< std::endl;
              std::cout << "    Thread dimension limits: " << info.maxThreadsDim[0]<< " x "
                                                           << info.maxThreadsDim[1]<< " x "
                                                           << info.maxThreadsDim[2]<< std::endl;
              std::cout << "    Max grid size:           " << info.maxGridSize[0]<< " x "
                                                           << info.maxGridSize[1]<< " x "
                                                           << info.maxGridSize[2]<< std::endl;
              std::cout << "    Compute capability:      " << info.major << "." << info.minor << std::endl;
            }
          }

          void GenerateBgra8K(uint8_t* buffer, int dataSize){

            std::random_device rd;
            std::mt19937 gen(rd());
            std::uniform_int_distribution<> sampler(0, 255);

            for(int i=0; i<dataSize/4; i++){
              buffer[i*4] = sampler(gen);
              buffer[i*4+1] = sampler(gen);
              buffer[i*4+2] = sampler(gen);
              buffer[i*4+3] = 255;
            }
          }

          void convertPixelFormatCpu(uint8_t* inputBgra, uint8_t* outputYuv, int numPixels){
            short3 yuv16;
            char3 yuv8;
            for(int idx=0; idx<numPixels; idx++){
              yuv16.x = 66*inputBgra[idx*4+2] + 129*inputBgra[idx*4+1] + 25*inputBgra[idx*4];
              yuv16.y = -38*inputBgra[idx*4+2] + -74*inputBgra[idx*4+1] + 112*inputBgra[idx*4];
              yuv16.z = 112*inputBgra[idx*4+2] + -94*inputBgra[idx*4+1] + -18*inputBgra[idx*4];

              yuv8.x = (yuv16.x>>8)+16;
              yuv8.y = (yuv16.y>>8)+128;
              yuv8.z = (yuv16.z>>8)+128;

              *(reinterpret_cast<char3*>(&outputYuv[idx*3])) = yuv8;
            }
          }

          __global__ void convertPixelFormat(uint8_t* inputBgra, uint8_t* outputYuv, int numPixels){
            int stride = gridDim.x * blockDim.x;
            int idx = threadIdx.x + blockIdx.x * blockDim.x;
            short3 yuv16;
            char3 yuv8;

            //Grid-stride loop: each thread processes every stride-th pixel
            while(idx < numPixels){
              yuv16.x = 66*inputBgra[idx*4+2] + 129*inputBgra[idx*4+1] + 25*inputBgra[idx*4];
              yuv16.y = -38*inputBgra[idx*4+2] + -74*inputBgra[idx*4+1] + 112*inputBgra[idx*4];
              yuv16.z = 112*inputBgra[idx*4+2] + -94*inputBgra[idx*4+1] + -18*inputBgra[idx*4];

              yuv8.x = (yuv16.x>>8)+16;
              yuv8.y = (yuv16.y>>8)+128;
              yuv8.z = (yuv16.z>>8)+128;

              *(reinterpret_cast<char3*>(&outputYuv[idx*3])) = yuv8;

              idx += stride;
            }
          }
