Mobile AI Compute Engine Documentation

Welcome to the Mobile AI Compute Engine documentation.

The main documentation is organized into the following sections:

Introduction

Mobile AI Compute Engine (MACE) is a deep learning inference framework optimized for mobile heterogeneous computing platforms. The following figure shows the overall architecture.

[Figure: MACE overall architecture (mace-arch.png)]

Model format

MACE defines a customized model format similar to Caffe2's. A MACE model can be converted from models exported by TensorFlow or Caffe. A YAML file is used to describe the model deployment details. The next chapter contains a detailed guide on how to create this YAML file.

Model conversion

Currently, we provide model converters for TensorFlow and Caffe; more frameworks will be supported in the future.

Model loading

The MACE model format contains two parts: the model graph definition and the model parameter tensors. The graph part is serialized with Protocol Buffers. All the model parameter tensors are concatenated into one contiguous byte array, which we call the tensor data in the following paragraphs. The model graph records the offsets and lengths of the tensors within the tensor data.

The models can be loaded in 3 ways:

  1. Both the model graph and the tensor data are loaded dynamically from external sources (by default from the file system, but users are free to choose their own implementations, for example with compression or encryption). This approach provides the most flexibility but the weakest model protection. A minimal sketch of this approach follows the list.
  2. Both the model graph and the tensor data are converted into C++ code and loaded by executing the compiled code. This approach provides the strongest model protection and the simplest deployment.
  3. The model graph is converted into C++ code and constructed as in the second approach, while the tensor data is loaded externally as in the first approach.
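
For illustration, here is a minimal sketch of the first approach, where both parts are read from the file system before the engine is created. The file names and the ReadBinaryFile helper are examples only, not part of the MACE API; see section 5 for the engine factory functions.

#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Illustrative helper: read a whole binary file into memory.
static std::vector<unsigned char> ReadBinaryFile(const std::string &path) {
  std::ifstream in(path, std::ios::binary);
  return std::vector<unsigned char>((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
}

// The serialized model graph is read into memory; the tensor data can either
// be read the same way or be passed to MACE as a file path.
std::vector<unsigned char> model_pb_data = ReadBinaryFile("mobilenet_v2.pb");
std::string model_data_file = "mobilenet_v2.data";
// Both are later handed to CreateMaceEngineFromProto (see section 5).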

Create a model deployment file

The first step to deploy your models is to create a YAML model deployment file.

One deployment file describes one case of model deployment. Each file will generate one static library (if more than one ABI is specified, there will be one static library for each). A deployment file can contain one or more models; for example, a smart camera application may contain face recognition, object recognition, and voice recognition models, which can all be defined in one deployment file.

Example

Here is an example deployment file used by an Android demo application.

# The name of library
library_name: mobilenet
target_abis: [arm64-v8a]
embed_model_data: 1
# The build mode for model(s).
# 'code' stands for converting model(s) into C++ code, 'proto' for keeping model(s) in protobuf file(s).
build_type: code
linkshared: 0
# One YAML config file can contain configurations for multiple models.
models:
  mobilenet_v1: # model tag, which will be used in model loading and must be unique.
    platform: tensorflow
    # support local path, http:// and https://
    model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v1/mobilenet-v1-1.0.pb
    model_sha256_checksum: 71b10f540ece33c49a7b51f5d4095fc9bd78ce46ebf0300487b2ee23d71294e6
    subgraphs:
      - input_tensors: input
        input_shapes: 1,224,224,3
        output_tensors: MobilenetV1/Predictions/Reshape_1
        output_shapes: 1,1001
    runtime: cpu+gpu
    limit_opencl_kernel_time: 0
    nnlib_graph_mode: 0
    obfuscate: 0
    winograd: 0
  mobilenet_v2:
    platform: tensorflow
    model_file_path: https://cnbj1.fds.api.xiaomi.com/mace/miai-models/mobilenet-v2/mobilenet-v2-1.0.pb
    model_sha256_checksum: 369f9a5f38f3c15b4311c1c84c032ce868da9f371b5f78c13d3ea3c537389bb4
    subgraphs:
      - input_tensors: input
        input_shapes: 1,224,224,3
        output_tensors: MobilenetV2/Predictions/Reshape_1
        output_shapes: 1,1001
    runtime: cpu+gpu
    limit_opencl_kernel_time: 0
    nnlib_graph_mode: 0
    obfuscate: 0
    winograd: 0

Configurations

library_name Library name.
target_abis The target ABI(s) to build for, one or more of 'host', 'armeabi-v7a' or 'arm64-v8a'.
target_socs [optional] Build for the specified SoCs only, if you just want to use the model on those SoCs.
embed_model_data Whether to embed the model weights in the code; defaults to 0.
build_type Model build type, one of ['proto', 'code']. 'proto' converts the model to a ProtoBuf file and 'code' converts the model to C++ code.
linkshared [optional] Use dynamic linking for the libmace library when set to 1, or static linking when set to 0; defaults to 0.
model_name Model name; should be unique if there are multiple models. LIMIT: if build_type is code, model_name will be used in the C++ code, so it must be a valid C++ identifier.
platform The source framework, one of [tensorflow, caffe].
model_file_path The path of the model file, which can be local or remote.
model_sha256_checksum The SHA256 checksum of the model file.
weight_file_path [optional] The path of the model weights file, used by Caffe models.
weight_sha256_checksum [optional] The SHA256 checksum of the weight file, used by Caffe models.
subgraphs subgraphs key. ** DO NOT EDIT **
input_tensors The input tensor names (TensorFlow) or the top names of the input layers (Caffe); one or more strings.
output_tensors The output tensor names (TensorFlow) or the top names of the output layers (Caffe); one or more strings.
input_shapes The shapes of the input tensors, in NHWC order.
output_shapes The shapes of the output tensors, in NHWC order.
input_ranges The numerical range of the input tensors; defaults to [-1, 1]. It is only used for testing.
validation_inputs_data [optional] Specify Numpy validation inputs. When not provided, random values in [-1, 1] will be used.
runtime The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains both CPU and GPU model definitions so you can run the model on both CPU and GPU.
data_type [optional] The data type used for the specified runtime: [fp16_fp32, fp32_fp32] for gpu (default fp16_fp32), [fp32] for cpu, [uint8] for dsp.
limit_opencl_kernel_time [optional] Whether to split OpenCL kernels so that each piece runs within 1 ms, to keep the UI responsive; defaults to 0.
nnlib_graph_mode [optional] Control the DSP precision and performance; the default 0 usually works for most cases.
obfuscate [optional] Whether to obfuscate the model operator names; defaults to 0.
winograd [optional] Whether to enable Winograd convolution, which will increase memory consumption.

How to build

Supported Platforms

Platform Version requirement
TensorFlow >= 1.6.0
Caffe >= 1.0

Environment Requirement

MACE requires the following dependencies:

software version install command
bazel >= 0.13.0 bazel installation guide
android-ndk r15c/r16b NDK installation guide or refer to the Dockerfile
adb >= 1.0.32 apt-get install android-tools-adb
tensorflow >= 1.6.0 pip install -I tensorflow==1.6.0 (if you use tensorflow model)
numpy >= 1.14.0 pip install -I numpy==1.14.0
scipy >= 1.0.0 pip install -I scipy==1.0.0
jinja2 >= 2.10 pip install -I jinja2==2.10
PyYaml >= 3.12.0 pip install -I pyyaml==3.12
sh >= 1.12.14 pip install -I sh==1.12.14
filelock >= 3.0.0 pip install -I filelock==3.0.0
docker (for caffe) >= 17.09.0-ce docker installation guide

Note

Run export ANDROID_NDK_HOME=/path/to/ndk to specify ANDROID_NDK_HOME.

MACE provides a Dockerfile with these dependencies installed. You can build the image from the Dockerfile,

cd docker
docker build -t xiaomimace/mace-dev .

or pull the pre-built image from Docker Hub,

docker pull xiaomimace/mace-dev

and then run the container with the following command.

# Create container
# Set 'host' network to use ADB
docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host \
           -v /local/path:/container/path xiaomimace/mace-dev /bin/bash

Usage

1. Pull MACE source code

git clone https://github.com/XiaoMi/mace.git
git fetch --all --tags --prune

# Checkout the latest tag (i.e. release version)
tag_name=`git describe --abbrev=0 --tags`
git checkout tags/${tag_name}

Note

It's highly recommended to use a release version instead of the master branch.

2. Model Preprocessing

  • TensorFlow

TensorFlow provides the Graph Transform Tool to improve inference efficiency by applying various optimizations such as operator folding and redundant node removal. It's strongly recommended to apply these optimizations before the graph conversion step.

The following commands show the suggested graph transformations and optimizations for different runtimes,

# CPU/GPU:
./transform_graph \
    --in_graph=tf_model.pb \
    --out_graph=tf_model_opt.pb \
    --inputs='input' \
    --outputs='output' \
    --transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
        strip_unused_nodes(type=float, shape="1,64,64,3")
        remove_nodes(op=Identity, op=CheckNumerics)
        fold_constants(ignore_errors=true)
        flatten_atrous_conv
        fold_batch_norms
        fold_old_batch_norms
        strip_unused_nodes
        sort_by_execution_order'
# DSP:
./transform_graph \
    --in_graph=tf_model.pb \
    --out_graph=tf_model_opt.pb \
    --inputs='input' \
    --outputs='output' \
    --transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
        strip_unused_nodes(type=float, shape="1,64,64,3")
        remove_nodes(op=Identity, op=CheckNumerics)
        fold_constants(ignore_errors=true)
        fold_batch_norms
        fold_old_batch_norms
        backport_concatv2
        quantize_weights(minimum_size=2)
        quantize_nodes
        strip_unused_nodes
        sort_by_execution_order'
  • Caffe

The MACE converter only supports Caffe 1.0+; you need to upgrade your models with Caffe's built-in tools when necessary,

# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt

# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel

3. Build static/shared library

3.1 Overview

MACE can build either a static or a shared library (which is specified by linkshared in the YAML model deployment file). The following are two use cases.

  • Build well tuned library for specific SoCs

    When target_socs is specified in YAML model deployment file, the build tool will enable automatic tuning for GPU kernels. This usually takes some time to finish depending on the complexity of your model.

    Note

    You should plug in device(s) with the corresponding SoC(s).

  • Build generic library for all SoCs

    When target_socs is not specified, the generated library is compatible with general devices.

    Note

    There will be a performance drop of around 1 ~ 10% for the GPU runtime compared to the well-tuned library.

MACE provides a command line tool (tools/converter.py) for model conversion, compiling, test running, benchmarking and correctness validation.

Note

  1. tools/converter.py should be run at the root directory of this project.
  2. When linkshared is set to 1, build_type should be proto. Currently only Android devices are supported.

3.2 tools/converter.py usage

Commands

  • build

    build library and test tools.

# Build library
python tools/converter.py build --config=models/config.yaml
  • run

    run the model(s).

# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100

# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=models/config.yaml --validate

# Check the memory usage of the model (keep only one model in the configuration file)
python tools/converter.py run --config=models/config.yaml --round=10000 &
sleep 5
adb shell dumpsys meminfo | grep mace_run
kill %1

Warning

run relies on the build command; you should run it after build.

  • benchmark

    benchmark and profile the model.

# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=models/config.yaml

Warning

benchmark relies on the build command; you should benchmark after build.

Common arguments

option type default commands explanation
--omp_num_threads int -1 run/benchmark number of threads
--cpu_affinity_policy int 1 run/benchmark 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
--gpu_perf_hint int 3 run/benchmark 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
--gpu_priority_hint int 3 run/benchmark 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH

Use -h to get detailed help.

python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h

4. Deployment

The build command will generate the static/shared library, model files and header files, packaged as build/${library_name}/libmace_${library_name}.tar.gz.

  • The generated static libraries are organized as follows,
build/
└── mobilenet-v2-gpu
    ├── include
    │   └── mace
    │       └── public
    │           ├── mace.h
    │           └── mace_runtime.h
    ├── libmace_mobilenet-v2-gpu.tar.gz
    ├── lib
    │   ├── arm64-v8a
    │   │   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
    │   └── armeabi-v7a
    │       └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
    ├── model
    │   ├── mobilenet_v2.data
    │   └── mobilenet_v2.pb
    └── opencl
        ├── arm64-v8a
        │   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
        └── armeabi-v7a
            └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
  • The generated shared libraries are organized as follows,
build
└── mobilenet-v2-gpu
    ├── include
    │   └── mace
    │       └── public
    │           ├── mace.h
    │           └── mace_runtime.h
    ├── lib
    │   ├── arm64-v8a
    │   │   ├── libgnustl_shared.so
    │   │   └── libmace.so
    │   └── armeabi-v7a
    │       ├── libgnustl_shared.so
    │       └── libmace.so
    ├── model
    │   ├── mobilenet_v2.data
    │   └── mobilenet_v2.pb
    └── opencl
        ├── arm64-v8a
        │   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
        └── armeabi-v7a
            └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin

Note

  1. DSP runtime depends on libhexagon_controller.so.
  2. ${MODEL_TAG}.pb file will be generated only when build_type is proto.
  3. ${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin will be generated only when target_socs and gpu runtime are specified.
  4. Generated shared library depends on libgnustl_shared.so.

Warning

${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin depends on the OpenCL version of the device; you should maintain compatibility or configure a compiled kernel cache store with ConfigKVStorageFactory.

5. How to use the library in your project

Please refer to mace/examples/example.cc for full usage. The following lists the key steps.

// Include the headers
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// If the build_type is code
#include "mace/public/mace_engine_factory.h"

// 0. Set pre-compiled OpenCL binary program file paths when available
if (device_type == DeviceType::GPU) {
  mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}

// 1. Set the compiled OpenCL kernel cache. This is used to reduce the
// initialization time, since kernel compilation is slow. It's suggested
// to set this even when a pre-compiled OpenCL program file is provided,
// because an OpenCL version upgrade may also lead to kernel
// recompilations.
const std::string file_path = "path/to/opencl_cache_file";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);

// 2. Declare the device type (must be the same as the 'runtime' in the configuration file)
DeviceType device_type = DeviceType::GPU;

// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};

// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from compiled code
create_engine_status =
    CreateMaceEngineFromCode(model_name.c_str(),
                             nullptr,
                             input_names,
                             output_names,
                             device_type,
                             &engine);
// Create Engine from model file
create_engine_status =
    CreateMaceEngineFromProto(model_pb_data,
                              model_data_file.c_str(),
                              input_names,
                              output_names,
                              device_type,
                              &engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
  // Report error
}

// 5. Create Input and Output tensor buffers
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
  // Allocate input and output
  int64_t input_size =
      std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
                      std::multiplies<int64_t>());
  auto buffer_in = std::shared_ptr<float>(new float[input_size],
                                          std::default_delete<float[]>());
  // Load input here
  // ...

  inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}

for (size_t i = 0; i < output_count; ++i) {
  int64_t output_size =
      std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
                      std::multiplies<int64_t>());
  auto buffer_out = std::shared_ptr<float>(new float[output_size],
                                           std::default_delete<float[]>());
  outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}

// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);
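
After a successful run, the results can be read back from the output buffers attached above. A minimal sketch (the data() accessor on MaceTensor is assumed here; you can equally keep your own pointers to the buffers allocated in step 5):

// 7. Read back the results (illustrative)
if (status == MaceStatus::MACE_SUCCESS) {
  // The buffers attached in step 5 now hold the inference results.
  const float *output_data = outputs[output_names[0]].data().get();
  // ... consume output_data, whose length matches output_shapes[0] ...
}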

Operator lists

Operator Android NN Supported Remark
AVERAGE_POOL_2D Y Y  
BATCH_NORM   Y Fusion with activation is supported
BATCH_TO_SPACE_ND Y Y  
BIAS_ADD   Y  
CHANNEL_SHUFFLE   Y  
CONCATENATION Y Y Only support channel axis concatenation
CONV_2D Y Y Fusion with BN and activation layer is supported
DECONV_2D N Y Only tensorflow model is supported
DEPTHWISE_CONV_2D Y Y Only multiplier = 1 is supported; Fusion is supported
DEPTH_TO_SPACE Y Y  
DEQUANTIZE Y Y Model quantization will be supported later
ELEMENT_WISE Y Y ADD/MUL/DIV/MIN/MAX/NEG/ABS/SQR_DIFF/POW
EMBEDDING_LOOKUP Y    
FLOOR Y    
FULLY_CONNECTED Y Y  
GROUP_CONV_2D     Caffe model with group count = channel count is supported
HASHTABLE_LOOKUP Y    
L2_NORMALIZATION Y    
L2_POOL_2D Y    
LOCAL_RESPONSE_NORMALIZATION Y Y  
LOGISTIC Y Y  
LSH_PROJECTION Y    
LSTM Y    
MATMUL   Y  
MAX_POOL_2D Y Y  
PAD N Y  
PSROI_ALIGN   Y  
PRELU   Y Only caffe model is supported
RELU Y Y  
RELU1 Y Y  
RELU6 Y Y  
RELUX   Y  
RESHAPE Y Y Limited support: only internal use of reshape in composed operations is supported
RESIZE_BILINEAR Y Y  
RNN Y    
RPN_PROPOSAL_LAYER   Y  
SLICE N Y Only support channel axis slice
SOFTMAX Y Y  
SPACE_TO_BATCH_ND Y Y  
SPACE_TO_DEPTH Y Y  
SVDF Y    
TANH Y Y  

Contributing guide

License

Each source file should contain a license header; see the existing files for an example.

Python coding style

Changes to Python code should conform to PEP8 Style Guide for Python Code.

You can use pycodestyle to check the style.

C++ coding style

Changes to C++ code should conform to Google C++ Style Guide.

You can use cpplint to check the style and use clang-format to format the code:

clang-format -style="{BasedOnStyle: google,            \
                      DerivePointerAlignment: false,   \
                      PointerAlignment: Right,         \
                      BinPackParameters: false}" $file

C++ logging guideline

VLOG is used for verbose logging, which is controlled by the environment variable MACE_CPP_MIN_VLOG_LEVEL. The guideline for VLOG levels is as follows:

0. Ad hoc debug logging, should only be added in test or temporary ad hoc
   debugging
1. Important network level Debug/Latency trace log (Op run should never
   generate level 1 vlog)
2. Important op level Latency trace log
3. Unimportant Debug/Latency trace log
4. Verbose Debug/Latency trace log
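
For example, a minimal sketch of the convention above (the logging header path is assumed to follow the mace/utils layout of the codebase):

#include "mace/utils/logging.h"

void RunMyOp() {
  VLOG(2) << "conv2d kernel time: " << 1.23 << " ms";  // op-level latency trace
  VLOG(4) << "tile size candidate: " << 8;             // verbose debug detail
}

// Enable at runtime, e.g. on the device shell:
//   export MACE_CPP_MIN_VLOG_LEVEL=2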

C++ macros

C++ macros should start with MACE_, except for most common ones like LOG and VLOG.

Adding a new Op

You can create a custom op if it is not supported yet.

To add a custom op, you need to follow these steps:

Define the Op class

Define the new Op class in mace/ops/my_custom_op.h.

#ifndef MACE_OPS_MY_CUSTOM_OP_H_
#define MACE_OPS_MY_CUSTOM_OP_H_

#include "mace/core/operator.h"
#include "mace/kernels/my_custom_op.h"

namespace mace {
namespace ops {

template <DeviceType D, typename T>
class MyCustomOp : public Operator<D, T> {
 public:
  MyCustomOp(const OperatorDef &op_def, Workspace *ws)
      : Operator<D, T>(op_def, ws),
        functor_() {}

  bool Run(StatsFuture *future) override {
    const Tensor *input = this->Input(INPUT);
    Tensor *output = this->Output(OUTPUT);
   
    functor_(input, output, future);
    return true;
  }

 protected:
  OP_INPUT_TAGS(INPUT);
  OP_OUTPUT_TAGS(OUTPUT);

 private:
  kernels::MyCustomOpFunctor<D, T> functor_;
};

}  // namespace ops
}  // namespace mace

#endif  // MACE_OPS_MY_CUSTOM_OP_H_

Register the new Op

Define the Ops registering function in mace/ops/my_custom_op.cc.

#include "mace/ops/my_custom_op.h"

namespace mace {
namespace ops {

void Register_My_Custom_Op(OperatorRegistry *op_registry) {
  REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
                                     .Device(DeviceType::CPU)
                                     .TypeConstraint<float>("T")
                                     .Build(),
                    MyCustomOp<DeviceType::CPU, float>);

  REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
                                     .Device(DeviceType::OPENCL)
                                     .TypeConstraint<float>("T")
                                     .Build(),
                    MyCustomOp<DeviceType::OPENCL, float>);

  REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
                                     .Device(DeviceType::OPENCL)
                                     .TypeConstraint<half>("T")
                                     .Build(),
                    MyCustomOp<DeviceType::OPENCL, half>);
}

}  // namespace ops
}  // namespace mace

And then register the new Op in mace/core/operator.cc.
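
A hedged sketch of that step (the exact surrounding code in mace/core/operator.cc differs between MACE versions; the built-in Ops follow this pattern of being registered in the OperatorRegistry constructor):

namespace mace {
namespace ops {
// Declaration of the registering function defined in mace/ops/my_custom_op.cc.
extern void Register_My_Custom_Op(OperatorRegistry *op_registry);
}  // namespace ops

OperatorRegistry::OperatorRegistry() {
  // ... existing Register_* calls for the built-in Ops ...
  ops::Register_My_Custom_Op(this);
}
}  // namespace mace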

Implement the Op kernel code

You need to implement the CPU kernel in mace/kernels/my_custom_op.h and, optionally, the OpenCL kernel in mace/kernels/kernels/my_custom_op_opencl.cc and mace/kernels/kernels/cl/my_custom_op.cl. You can also optimize the CPU kernel with NEON. A minimal sketch of the CPU functor is shown below.
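
The sketch below assumes the functor interface matches the functor_(input, output, future) call in Run() above; the tensor helper calls follow the pattern of MACE's built-in kernels and may differ slightly between versions.

#ifndef MACE_KERNELS_MY_CUSTOM_OP_H_
#define MACE_KERNELS_MY_CUSTOM_OP_H_

#include "mace/core/future.h"
#include "mace/core/tensor.h"

namespace mace {
namespace kernels {

template <DeviceType D, typename T>
struct MyCustomOpFunctor {
  void operator()(const Tensor *input, Tensor *output, StatsFuture *future) {
    // future is used by asynchronous (e.g. OpenCL) kernels; unused on CPU.
    output->ResizeLike(input);
    const T *in = input->data<T>();
    T *out = output->mutable_data<T>();
    // Placeholder element-wise computation; replace with the real kernel.
    for (index_t i = 0; i < input->size(); ++i) {
      out[i] = in[i];
    }
  }
};

}  // namespace kernels
}  // namespace mace

#endif  // MACE_KERNELS_MY_CUSTOM_OP_H_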

Add test and benchmark

It's strongly recommended to add unit tests and micro benchmarks for your new Op. If you wish to contribute back, it's required.

Document the new Op

Finally, add an entry to the operator table in this document.

Memory layout

CPU runtime memory layout

The CPU tensor buffer is organized in the following order:

Tensor type Buffer
Intermediate input/output NCHW
Convolution Filter OIHW
Depthwise Convolution Filter MIHW
1-D Argument, length = W W
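
Written as index arithmetic, the element offset in the flat buffer for the two main layouts above is as follows (illustrative helpers, not part of the MACE API):

// NCHW: w is the fastest-changing dimension.
inline int NchwOffset(int n, int c, int h, int w, int C, int H, int W) {
  return ((n * C + c) * H + h) * W + w;
}

// OIHW convolution filter layout.
inline int OihwOffset(int o, int i, int h, int w, int I, int H, int W) {
  return ((o * I + i) * H + h) * W + w;
}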

GPU runtime memory layout

The GPU runtime implementation is based on OpenCL, which uses 2D images with the CL_RGBA channel order as tensor storage. This requires OpenCL 1.2 or above.

The way the tensor data is mapped to the OpenCL 2D image (RGBA) is critical for kernel performance.

In the CL_RGBA channel order, each 2D image pixel contains 4 data items. The following tables describe the mapping from different types of tensors to the 2D RGBA image.

Input/Output Tensor

The Input/Output Tensor is stored in NHWC format:

Tensor type Buffer Image size [width, height] Explanation
Channel-Major Input/Output NHWC [W * (C+3)/4, N * H] Default Input/Output format
Height-Major Input/Output NHWC [W * C, N * (H+3)/4] Winograd Convolution format
Width-Major Input/Output NHWC [(W+3)/4 * C, N * H] Winograd Convolution format

Each pixel of the image contains 4 elements. The table below lists the coordinate relation between the image and the buffer.

Tensor type Pixel coordinate relationship Explanation
Channel-Major Input/Output P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=i%W, c=[i/W * 4 + k])} k=[0, 4)
Height-Major Input/Output P[i, j] = {E[n, h, w, c] | (n=j%N, h=[j/H*4 + k], w=i%W, c=i/W)} k=[0, 4)
Width-Major Input/Output P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=[i%W*4 + k], c=i/W)} k=[0, 4)
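
As a worked example, the channel-major relation in the first row can be written as index arithmetic; the helper below is purely illustrative and not part of the MACE API.

struct ImageCoord { int i; int j; int k; };  // image column, image row, RGBA channel

// Maps buffer element E[n, h, w, c] of an NHWC tensor to its pixel position
// in the channel-major 2D image of size [W * (C+3)/4, N * H].
ImageCoord ChannelMajorCoord(int n, int h, int w, int c, int H, int W) {
  ImageCoord p;
  p.j = n * H + h;        // so n = j / H and h = j % H
  p.i = (c / 4) * W + w;  // so w = i % W and c = (i / W) * 4 + k
  p.k = c % 4;            // position inside the RGBA pixel
  return p;
}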

Filter Tensor

Tensor Buffer Image size [width, height] Explanation
Convolution Filter OIHW [I, (O+3)/4 * W * H] Convolution filter format. There is no difference compared to [H*W*I, (O+3)/4]
Depthwise Convolution Filter MIHW [H * W * M, (I+3)/4] Depthwise convolution filter format

Each pixel of the image contains 4 elements. The table below lists the coordinate relation between the image and the buffer.

Tensor type Pixel coordinate relationship Explanation
Convolution Filter P[m, n] = {E[o, i, h, w] | (o=[n/HW*4+k], i=m, h=T/W, w=T%W)} HW= H * W, T=n%HW, k=[0, 4)
Depthwise Convolution Filter P[m, n] = {E[0, i, h, w] | (i=[n*4+k], h=m/W, w=m%W)} only supports multiplier == 1, k=[0, 4)

1-D Argument Tensor

Tensor type Buffer Image size [width, height] Explanation
1-D Argument W [(W+3)/4, 1] 1D argument format, e.g. Bias

Each pixel of the image contains 4 elements. The table below lists the coordinate relation between the image and the buffer.

Tensor type Pixel coordinate relationship Explanation
1-D Argument P[i, 0] = {E[w] | w=i*4+k} k=[0, 4)

Frequently asked questions

Does the tensor data consume extra memory when compiled into C++ code?

When compiled into C++ code, the tensor data will be mmapped by the system loader. For the CPU runtime, the tensor data is used without memory copy. For the GPU and DSP runtimes, the tensor data is used once during model initialization. The operating system is free to swap the pages out; however, this still consumes virtual address space. So generally speaking, it takes no extra physical memory. If you are short of virtual memory space (this should be very rare), you can use the option to load the tensor data from a data file (which can be manually unmapped after initialization) instead of from the compiled code.

Why is the generated static library file size so huge?

The static library is simply an archive of a set of object files, which are intermediate artifacts and contain a lot of extra information; please check whether the final binary file size is as expected.

Why is the generated binary file (including shared library) size so huge?

When compiling the model into C++ code, the final binary may contain extra debug symbols, which usually take a lot of space. Try to strip the shared library or binary, and make sure you follow the best practices for reducing the size of an ELF binary, including disabling C++ exceptions, disabling RTTI, avoiding C++ iostream, hiding internal functions, etc. In most cases, the expected overhead should be less than {model weights size in float32}/2 + 3MB.

OpenCL allocator failed with CL_OUT_OF_RESOURCES

The OpenCL runtime usually requires contiguous virtual memory for its image buffers; the error occurs when the OpenCL driver can't find enough contiguous address space due to high memory usage or fragmentation. Several solutions can be tried:

  • Change the model by reducing its memory usage
  • Split the Op with the biggest single memory buffer
  • Change from armeabi-v7a to arm64-v8a to expand the virtual address space
  • Reduce the memory consumption of other modules of the same process

Why is the performance worse than the official result for the same model?

The power options may not be set properly; see mace/public/mace_runtime.h for details.

Why does the UI become less responsive when running a model with the GPU runtime?

Try setting limit_opencl_kernel_time to 1. If the problem is still not resolved, try modifying the source code to use even smaller time intervals, or change to the CPU or DSP runtime.