Migrating a Model to Jetson

Johann Li | April 13, 2022

These notes summarize how to deploy a trained deep learning model on the Jetson platform.

A PyTorch model is used as the example, and the inference code uses TensorRT with Python.

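Preparing the Model

Before migrating, get the model ready, whether it was trained by someone else or by yourself. Two things are needed: the model definition and the model parameters.

Files exported by some deep learning frameworks, PyTorch for example, typically contain only the parameters, while other frameworks export the definition and the parameters together.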

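Exporting to ONNX

ONNX serves here as the bridge between the PyTorch model and the TensorRT model file. Other routes exist, but this post only covers the workflow that uses ONNX as the "middleman".

Which middleman format to use, or whether to use one at all, depends on what the TensorRT documentation supports. NVIDIA's official documentation covers model parsers for ONNX, Caffe, and UFF, so any of these three formats can act as the middleman.

To export a PyTorch model to an ONNX file, see the official PyTorch documentation, which has instructions, examples, and API references for ONNX conversion. Following it is enough to convert a PyTorch model to ONNX.

There are also other ways to convert PyTorch models to ONNX.

After creating the PyTorch model object, be sure to load the model parameters with load_state_dict before exporting.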
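
As a rough illustration of the export step, the sketch below loads the parameters into a model and writes an ONNX file with torch.onnx.export. The function name export_to_onnx, the tensor names "input" and "output", the opset version, and the 1x3x512x512 input shape are assumptions made for this example, not something this post prescribes; adapt them to your model.

```python
def export_to_onnx(model, weights_path, onnx_path, input_shape=(1, 3, 512, 512)):
    """Load parameters into `model` and export it to ONNX (illustrative sketch)."""
    # Imported here so this helper can be defined on machines without PyTorch.
    import torch

    # The weights file is assumed to hold a state_dict (parameters only).
    state_dict = torch.load(weights_path, map_location="cpu")
    model.load_state_dict(state_dict)
    model.eval()  # put layers such as dropout and batch norm in inference mode

    # A dummy input fixes the input shape that is traced into the ONNX graph.
    dummy_input = torch.randn(*input_shape)
    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        input_names=["input"],    # assumed tensor names
        output_names=["output"],
        opset_version=11,         # assumed; pick one your TensorRT version can parse
    )
    return onnx_path
```

A call would look like export_to_onnx(MyNet(), "weights.pth", "model.onnx"), where MyNet and both file names are placeholders.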

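Converting the ONNX Model to a TensorRT Model

TensorRT ships an executable (trtexec) that converts model files in formats such as ONNX and Caffe into TensorRT model files. If the model needs to be quantized, however, you have to write code that calls the TensorRT API to do the quantization.

The resulting TensorRT engine file is tied to the device it was built on, so convert the ONNX model on the target device itself; the engine is almost never usable on other machines.

The official documentation has the relevant API reference. The conversion process is roughly as follows:

0. Import the required modules and classes: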

import tensorrt as trt
from calibrator import CustomEntropyCalibrator

Here tensorrt is the TensorRT Python binding module, and CustomEntropyCalibrator is a data "loader" that is explained later.

1. Create a `Logger`

LOGGER = trt.Logger(trt.Logger.INFO)

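This logger object is used in the later stages.

2. Create the conversion engine

The second step uses a with statement to create the objects needed to build the quantized model:

EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)

with trt.Builder(LOGGER) as builder, \
        builder.create_network(EXPLICIT_BATCH) as network, \
        builder.create_builder_config() as config, \
        trt.OnnxParser(network, LOGGER) as parser:
    ...  # steps 3 and 4 below run inside this block

This creates a TensorRT builder, a network definition, a builder configuration object, and an ONNX model parser.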

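3. Configure the conversion engine

Next, configure builder and config. See NVIDIA's documentation for details; a complete example is given later in this post.

Note that, because NVIDIA's API has changed over time, some settings live on config and some on builder, and the way to set them may differ between versions.

The CustomEntropyCalibrator mentioned above is a class that loads the data used for quantization. An instance of it is passed to config to set up quantization of the model.

During configuration, pay attention to enabling or disabling features such as FP16, TF32, INT8, and NVDLA.

Also set max_workspace_size (an attribute of builder or config depending on the TensorRT version; the full script in this post sets it on config), and avoid making it too small or too large.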
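
The workspace size is given in bytes, so it helps to keep the unit conversion in one place. The sketch below does that and also copes with the API difference: newer TensorRT releases (8.4 and later, as far as I know) expose config.set_memory_pool_limit, while older ones use config.max_workspace_size. The names mib_to_bytes, set_workspace, and _FakeOldConfig are made up for this example.

```python
def mib_to_bytes(mib):
    """Convert a size in MiB to bytes (1 MiB = 2**20 bytes)."""
    return int(mib) << 20


def set_workspace(config, nbytes):
    """Set the builder workspace limit on a TensorRT builder config."""
    if hasattr(config, "set_memory_pool_limit"):
        # Newer API; tensorrt is imported lazily, only when this branch runs.
        import tensorrt as trt
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, nbytes)
    else:
        # Older API, as used in the script in this post.
        config.max_workspace_size = nbytes


class _FakeOldConfig:
    """Stand-in for an old-style config, only used to demonstrate the fallback."""


demo = _FakeOldConfig()
set_workspace(demo, mib_to_bytes(10))
print(demo.max_workspace_size)  # 10485760
```

On the device you would pass the real config object created by builder.create_builder_config() instead of the stand-in.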

4. Convert and save

Now the model can be converted.

First, use the parser to load the ONNX model and convert it into a network that the TensorRT builder understands:

parser.parse_from_file(onnx_file_path)

Second, build the engine:

engine = builder.build_engine(network, config)

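Third, serialize the engine and save it:

with open("model.engine", "wb") as f:
    f.write(engine.serialize())

At inference time, the saved file can be loaded directly and run.

A fairly complete conversion script:

This is the conversion script I currently use: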

import tensorrt as trt
from calibrator import CustomEntropyCalibrator  # the class described in the next section

onnx_file_path = '/path/to/model.onnx'
batch_size = 1
calib_src = '/path/to/data'
calib_count = 10 # number of images used for calibration
workspace_size = 10 # in MiB
mode = 'int8'
nvdla = 1

LOGGER = trt.Logger(trt.Logger.INFO)
EXPLICIT_BATCH = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)


with trt.Builder(LOGGER) as builder, \
        builder.create_network(EXPLICIT_BATCH) as network, \
        builder.create_builder_config() as config, \
        trt.OnnxParser(network, LOGGER) as parser:

    builder.max_batch_size = batch_size
    config.max_workspace_size = workspace_size * (1 << 20)
    if mode.lower() == 'int8':
        assert (builder.platform_has_fast_int8 == True), "not support int8"
        config.set_flag(trt.BuilderFlag.INT8)
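        # Pass batch_size explicitly: with the calibrator's default of 64,
        # 10 images never fill a batch and no calibration data is fed.
        calib = CustomEntropyCalibrator(calib_src, count = calib_count, batch_size = batch_size)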
        config.int8_calibrator = calib
    elif mode.lower() == 'fp16':
        assert (builder.platform_has_fast_fp16 == True), "not support fp16"
        config.set_flag(trt.BuilderFlag.FP16)

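    # nvdla is the index of the DLA core to use (0 or None disables DLA here);
    # it must be smaller than the number of DLA cores on the device.
    if nvdla is not None and 0 < nvdla < builder.num_DLA_cores:
        config.set_flag(trt.BuilderFlag.GPU_FALLBACK)
        config.default_device_type = trt.DeviceType.DLA
        config.DLA_core = nvdla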

    print(f'Loading ONNX file from path {onnx_file_path}')

    if not parser.parse_from_file(onnx_file_path):
        for e in range(parser.num_errors):
            print(parser.get_error(e))
        raise TypeError("Parser parse failed.")

    print('Completed parsing of ONNX file')

    print(f'Building an engine from file {onnx_file_path}; this may take a while...')
    engine = builder.build_engine(network, config)
    if engine is None:
        raise Exception('Failed to create engine, return is None')
    else:
        print("Created engine success! ")

    engine_file_path = onnx_file_path + '.engine'

    # Save the plan (engine) file
    print(f'Saving TRT engine file to path {engine_file_path}...')
    with open(engine_file_path, "wb") as f:
        f.write(engine.serialize())
    print(f'Engine file has already saved to {engine_file_path}!')

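About `CustomEntropyCalibrator`

NVIDIA's documentation describes this in some detail. In short, it is something like a PyTorch Dataset that loads the data used for quantization. Below is a version heavily adapted from the official sample:

import glob
import os
import random

import numpy as np
import pycuda.autoinit  # creates the CUDA context that mem_alloc needs
import pycuda.driver as cuda
import tensorrt as trt
from PIL import Image

class CustomEntropyCalibrator(trt.IInt8EntropyCalibrator2):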
    def __init__(self, folder, count = 1024, cache_file = None, size = 512, batch_size = 64):
        # Whenever you specify a custom constructor for a TensorRT class,
        # you MUST call the constructor of the parent explicitly.
        super(CustomEntropyCalibrator, self).__init__()

        if isinstance(size, (int, float)):
            size = int(size)
            self.size = (size, size)
        else:
            self.size = tuple([int(s) for s in size[0:2]])

        self.cache_file = cache_file if cache_file else 'coco-yolo.cache'

        image_list = glob.glob(os.path.join(folder, '*'))
        image_list = random.sample(image_list, count)

        images = []

        for i, p in enumerate(image_list):
            # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter.
            img = np.array(Image.open(p).resize(self.size, Image.LANCZOS)) / 255
            if img.ndim == 3:
                img = img[:,:,0:3].transpose(2, 0, 1).astype('float32')
            else:
                img = np.stack([img, img, img], axis = 2)
                img = img[:,:,0:3].transpose(2, 0, 1).astype('float32')
            images.append(img)
            print('process', i, '/', count)
        images = np.ascontiguousarray(np.stack(images, axis = 0))


        self.batch_size = batch_size
        self.data = images
        self.current_index = 0
        self.device_input = cuda.mem_alloc(self.data[0].nbytes * self.batch_size)


    def get_batch_size(self):
        return self.batch_size

    # TensorRT passes along the names of the engine bindings to the get_batch function.
    # You don't necessarily have to use them, but they can be useful to understand the order of
    # the inputs. The bindings list is expected to have the same ordering as 'names'.
    def get_batch(self, names):
        if self.current_index + self.batch_size > self.data.shape[0]:
            return None

        current_batch = int(self.current_index / self.batch_size)
        if current_batch % 10 == 0:
            print("Calibrating batch {:}, containing {:} images".format(current_batch, self.batch_size))

        batch = self.data[self.current_index : self.current_index + self.batch_size].ravel()
        cuda.memcpy_htod(self.device_input, batch)
        self.current_index += self.batch_size
        return [self.device_input]

    def read_calibration_cache(self):
        # If there is a cache, use it instead of calibrating again. Otherwise, implicitly return None.
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()

    def write_calibration_cache(self, cache):
        with open(self.cache_file, "wb") as f:
            f.write(cache)
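
One detail of get_batch is easy to miss: it only returns full batches and stops as soon as fewer than batch_size images remain. The small sketch below mirrors that index arithmetic in plain Python (the function name full_batches is hypothetical) so you can check how many calibration batches a given count and batch_size actually produce.

```python
def full_batches(num_images, batch_size):
    """Return the (start, end) slices that get_batch would hand to TensorRT."""
    slices = []
    index = 0
    # Same stop condition as get_batch: a partial last batch is dropped.
    while index + batch_size <= num_images:
        slices.append((index, index + batch_size))
        index += batch_size
    return slices


print(full_batches(10, 4))        # [(0, 4), (4, 8)]
print(len(full_batches(10, 64)))  # 0
```

With count = 10 and the default batch_size = 64, no batch is produced at all, so choose a batch_size that does not exceed count.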

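Running Inference

Once the engine file has been generated, it has to be called from a program. Roughly:

1. Set up the environment

Setting up the environment means creating (or allocating) the logger, memory, CUDA stream, execution context, and so on that are needed to run the TensorRT engine.

Import the modules.

import numpy as np
import pycuda.autoinit  # creates the CUDA context
import pycuda.driver as cuda
import tensorrt as trt

Create a Logger, which is required to create the runtime.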
```
trt_logger = trt.Logger(trt.Logger.INFO)
```

Load the model.

with open('model.engine', 'rb') as f, trt.Runtime(trt_logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

Create the execution context.

context = engine.create_execution_context()

Allocate memory.

h_input = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(0)), dtype=np.float32)
h_output = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(1)), dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)

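One thing to note is that on Jetson the CPU and GPU share the same physical memory, so in principle host and device buffers do not need to be distinguished here.

Create a CUDA stream.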
```
stream = cuda.Stream()
```
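Copy the data. The input first has to be copied from wherever it lives, for example a NumPy array, into h_input above, or directly into d_input; the host buffer is then transferred to the device: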
```
cuda.memcpy_htod_async(d_input, h_input, stream)
```
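
To make the copy step concrete, here is a sketch of preparing an image and placing it in the host buffer. It assumes the model expects the same layout that the calibrator above produces (float32, channels first, values scaled to 0..1). A plain NumPy array stands in for the page-locked h_input so the snippet runs without a GPU; the function name to_chw_float32 and the 4x4 dummy image are illustrative only.

```python
import numpy as np


def to_chw_float32(img):
    """Convert an HxW or HxWxC uint8 image to a 3xHxW float32 array in [0, 1]."""
    img = np.asarray(img, dtype=np.float32) / 255.0
    if img.ndim == 2:
        # Grayscale: repeat the single channel three times.
        img = np.stack([img, img, img], axis=2)
    # Keep the first three channels (drops alpha) and move channels first.
    return np.ascontiguousarray(img[:, :, 0:3].transpose(2, 0, 1))


# Stand-in for the page-locked buffer: flat float32, one 3x4x4 input.
h_input = np.empty(3 * 4 * 4, dtype=np.float32)

image = np.full((4, 4, 4), 255, dtype=np.uint8)  # dummy RGBA image
chw = to_chw_float32(image)

# Copy into the existing buffer instead of rebinding the name, so the
# memory that memcpy_htod_async reads from actually holds the new data.
np.copyto(h_input, chw.ravel())
print(chw.shape, h_input[:3])
```

The important part is np.copyto: writing h_input = chw.ravel() would only rebind the Python name and leave the page-locked buffer untouched.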

Run inference:

context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)

Fetch the inference result.

cuda.memcpy_dtoh_async(h_output, d_output, stream)

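Synchronize with stream.synchronize(). The key point is that the calls above are asynchronous, so you have to wait for them to finish, in other words synchronize, before reading h_output.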
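
Putting the steps of this section together, a minimal sketch could look like the function below. It uses the same TensorRT and PyCUDA calls as the snippets above and assumes an engine with exactly one input (binding 0) and one output (binding 1), both float32. The function name infer and its arguments are illustrative; newer TensorRT releases have changed parts of this API, so treat it as a sketch for the version used in this post.

```python
def infer(engine_path, input_array):
    """Run one inference and return the flat float32 output (sketch)."""
    # Imported lazily: these packages are usually only present on the Jetson.
    import numpy as np
    import pycuda.autoinit  # noqa: F401  (creates the CUDA context)
    import pycuda.driver as cuda
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.INFO)
    with open(engine_path, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()

    # Page-locked host buffers and matching device buffers.
    h_input = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(0)), dtype=np.float32)
    h_output = cuda.pagelocked_empty(trt.volume(context.get_binding_shape(1)), dtype=np.float32)
    d_input = cuda.mem_alloc(h_input.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    stream = cuda.Stream()

    # Fill the host buffer, then queue copy -> execute -> copy on one stream.
    np.copyto(h_input, np.asarray(input_array, dtype=np.float32).ravel())
    cuda.memcpy_htod_async(d_input, h_input, stream)
    context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(h_output, d_output, stream)

    # Everything above is asynchronous; wait before touching h_output.
    stream.synchronize()
    return np.array(h_output)
```

In a real program you would load the engine and allocate the buffers once and reuse them for every frame, rather than repeating the setup on each call as this sketch does.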

Finally, the output data can be processed, for example visualized.
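
As a small example of post-processing, the flat output buffer usually has to be reshaped to the output tensor's shape first. The sketch below assumes a hypothetical classifier output of shape (1, 5); the values are made up, and a plain NumPy array stands in for h_output.

```python
import numpy as np

# Stand-in for h_output: a flat float32 buffer, as filled by the dtoh copy.
h_output = np.array([0.1, 2.0, -1.0, 0.5, 0.0], dtype=np.float32)

# Assumed output shape; on the device it would come from the engine binding.
output_shape = (1, 5)
logits = h_output.reshape(output_shape)

# Softmax over the class axis, shifted by the maximum for numerical stability.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

top1 = int(probs.argmax(axis=1)[0])
print(top1)  # 1
```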