MiAI Compute Engine Documentation¶
Welcome to the MiAI Compute Engine documentation.
The main documentation is organized into the following sections:
Introduction¶
MiAI Compute Engine is a deep learning inference framework optimized for mobile heterogeneous computing platforms. The following figure shows the overall architecture.

Model format¶
MiAI Compute Engine defines a customized model format which is similar to Caffe2. A MiAI model can be converted from models exported by TensorFlow and Caffe. We define a YAML schema to describe the model deployment; the next chapter gives a detailed guide on how to create this YAML file.
Model conversion¶
Currently, we provide model converters for TensorFlow and Caffe; more frameworks will be supported in the future.
Model loading¶
The MiAI model format contains two parts: the model graph definition and the model parameter tensors. The graph part utilizes Protocol Buffers for serialization. All the model parameter tensors are concatenated together into a contiguous array, and we call this array tensor data in the following paragraphs. The tensor data offsets and lengths are recorded in the model graph.
The models can be loaded in 3 ways (a sketch of the corresponding API calls follows this list):
- Both the model graph and the tensor data are dynamically loaded externally (by default from the file system, but users are free to choose their own implementations, for example with compression or encryption). This approach provides the most flexibility but the weakest model protection.
- Both the model graph and the tensor data are converted into C++ code and loaded by executing the compiled code. This approach provides the strongest model protection and the simplest deployment.
- The model graph is converted into C++ code and constructed as in the second approach, while the tensor data is loaded externally as in the first approach.
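For illustration, these loading approaches correspond to the two engine factory functions shown in the usage section later in this document. The snippet below is only a sketch: model_name, model_pb_data, model_data_file, the tensor names and device_type are placeholders taken from that section.
#include "mace/public/mace.h"
#include "mace/public/mace_engine_factory.h"  // generated when the model is converted to code

std::shared_ptr<mace::MaceEngine> engine;

// Approach 2: the model graph (and, with embed_model_data, the tensor data)
// has been converted to C++ code.
MaceStatus status = CreateMaceEngineFromCode(model_name.c_str(),
                                             nullptr,
                                             input_names,
                                             output_names,
                                             device_type,
                                             &engine);

// Approach 1: the serialized graph (model_pb_data) and the tensor data file
// are both loaded externally.
status = CreateMaceEngineFromProto(model_pb_data,
                                   model_data_file.c_str(),
                                   input_names,
                                   output_names,
                                   device_type,
                                   &engine);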
Create a model deployment file¶
The first step to deploy your models is to create a YAML model deployment file.
One deployment file describes one deployment case; each file generates one static library (if more than one ABI is specified, one static library is built per ABI). A deployment file can contain one or more models; for example, a smart camera application may contain face recognition, object recognition, and voice recognition models, all of which can be defined in one deployment file.
Example¶
Here is a deployment file example used by the Android demo application.
# Name of the library
library_name: library_name
# The library name will be used to name the generated library: libmace-${library_name}.a
target_abis: [armeabi-v7a, arm64-v8a]
# The SoC of the target device; it can be obtained with
# `adb shell getprop | grep ro.board.platform | cut -d [ -f3 | cut -d ] -f1`
target_socs: [msm8998]
embed_model_data: 1
build_type: code # Model build type: 'code' converts the model to C++ code, 'proto' converts the model to a protobuf file
models: # One configuration file can contain multiple models; the generated library will contain all of them
  model_name: # The tag of the model; it is used when invoking the model and must be unique
    platform: tensorflow
    model_file_path: path/to/model64.pb # also support http:// and https://
    model_sha256_checksum: 7f7462333406e7dea87222737590ebb7d94490194d2f21a7d72bafa87e64e9f9
    subgraphs:
      - input_tensors: input_node
        input_shapes: 1,64,64,3
        output_tensors: output_node
        output_shapes: 1,64,64,2
    runtime: gpu
    data_type: fp16_fp32
    limit_opencl_kernel_time: 0
    nnlib_graph_mode: 0
    obfuscate: 1
    winograd: 0
    input_files:
      - path/to/input_files # support http://
  second_net:
    platform: caffe
    model_file_path: path/to/model.prototxt
    weight_file_path: path/to/weight.caffemodel
    model_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
    weight_sha256_checksum: 05d92625809dc9edd6484882335c48c043397aed450a168d75eb8b538e86881a
    subgraphs:
      - input_tensors:
          - input_node0
          - input_node1
        input_shapes:
          - 1,256,256,3
          - 1,128,128,3
        output_tensors:
          - output_node0
          - output_node1
        output_shapes:
          - 1,256,256,2
          - 1,1,1,2
    runtime: cpu
    limit_opencl_kernel_time: 1
    nnlib_graph_mode: 0
    obfuscate: 1
    winograd: 0
    input_files:
      - path/to/input_files # support http://
Configurations¶
Configuration | Description |
---|---|
library_name | library name |
target_abis | The target ABIs to build, can be one or more of 'host', 'armeabi-v7a' or 'arm64-v8a' |
target_socs | Build for the specified SoCs if you only want to use the model on those SoCs |
embed_model_data | Whether to embed the model weights in the code, defaults to 0 |
build_type | Model build type, one of ['proto', 'code']: 'proto' converts the model to a protobuf file, 'code' converts the model to C++ code |
model_name | Model name, should be unique if there are multiple models. LIMIT: if build_type is code, model_name will be used in the C++ code, so it must be a valid C++ identifier |
platform | The source framework, one of [tensorflow, caffe] |
model_file_path | The path of the model file, can be local or remote |
model_sha256_checksum | The SHA256 checksum of the model file |
weight_file_path | The path of the model weights file, used by Caffe models |
weight_sha256_checksum | The SHA256 checksum of the weight file, used by Caffe models |
subgraphs | subgraphs key. ** DO NOT EDIT ** |
input_tensors | The input tensor names (TensorFlow) or the top names of the input layers (Caffe), one or more strings |
output_tensors | The output tensor names (TensorFlow) or the top names of the output layers (Caffe), one or more strings |
input_shapes | The shapes of the input tensors, in NHWC order |
output_shapes | The shapes of the output tensors, in NHWC order |
runtime | The running device, one of [cpu, gpu, dsp, cpu_gpu]. cpu_gpu contains both the CPU and GPU model definitions so you can run the model on both |
data_type | [optional] The data type used for the specified runtime: [fp16_fp32, fp32_fp32] for gpu (default fp16_fp32), [fp32] for cpu, [uint8] for dsp |
limit_opencl_kernel_time | [optional] Whether to split OpenCL kernels into pieces that run within 1 ms to keep the UI responsive, defaults to 0 |
nnlib_graph_mode | [optional] Controls the DSP precision and performance; the default of 0 usually works for most cases |
obfuscate | [optional] Whether to obfuscate the model operator names, defaults to 0 |
winograd | [optional] Whether to enable Winograd convolution, which will increase memory consumption |
input_files | [optional] Specify NumPy validation inputs; when not provided, random values in [-1, 1] will be used |
How to build¶
Supported Platforms¶
Platform | Explanation |
---|---|
TensorFlow | >= 1.6.0 (first choice; convenient for Android NN API support in the future) |
Caffe | >= 1.0 |
Environment Requirement¶
MACE supplies a docker image which contains all the required environment; the Dockerfile is under the ./docker directory.
The following are the start commands:
sudo docker pull cr.d.xiaomi.net/mace/mace-dev
sudo docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host -v /local/path:/container/path cr.d.xiaomi.net/mace/mace-dev /bin/bash
If you want to run on your local computer, you have to install the following software.
software | version | install command |
---|---|---|
bazel | >= 0.13.0 | bazel installation |
android-ndk | r15c/r16b | reference the docker file |
adb | >= 1.0.32 | apt-get install android-tools-adb |
tensorflow | >= 1.6.0 | pip install -I tensorflow==1.6.0 (if you use tensorflow model) |
numpy | >= 1.14.0 | pip install -I numpy==1.14.0 |
scipy | >= 1.0.0 | pip install -I scipy==1.0.0 |
jinja2 | >= 2.10 | pip install -I jinja2==2.10 |
PyYaml | >= 3.12.0 | pip install -I pyyaml==3.12 |
sh | >= 1.12.14 | pip install -I sh==1.12.14 |
filelock | >= 3.0.0 | pip install -I filelock==3.0.0 |
docker (for caffe) | >= 17.09.0-ce | install doc |
Docker Images¶
- Log in to the Xiaomi Docker Registry
docker login cr.d.xiaomi.net
- Build with Dockerfile
docker build -t cr.d.xiaomi.net/mace/mace-dev ./docker
- Pull image from docker registry
docker pull cr.d.xiaomi.net/mace/mace-dev
- Create container
# Set 'host' network to use ADB
docker run -it --rm -v /local/path:/container/path --net=host cr.d.xiaomi.net/mace/mace-dev /bin/bash
Usage¶
1. Pull the code with the latest tag¶
Warning
Please do not use the master branch for deployment.
git clone git@v9.git.n.xiaomi.com:deep-computing/mace.git
# update
git fetch --all --tags --prune
# get latest tag version
tag_name=`git describe --abbrev=0 --tags`
# checkout to latest tag branch
git checkout -b ${tag_name} tags/${tag_name}
2. Model Optimization¶
- TensorFlow
TensorFlow supplies a model optimization tool to speed up inference; the docker image contains this tool. Alternatively, you can download transform_graph or compile it from the TensorFlow source code.
The following commands show the optimizations for CPU/GPU and DSP.
# CPU/GPU:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
flatten_atrous_conv
fold_batch_norms
fold_old_batch_norms
strip_unused_nodes
sort_by_execution_order'
# DSP:
./transform_graph \
--in_graph=tf_model.pb \
--out_graph=tf_model_opt.pb \
--inputs='input' \
--outputs='output' \
--transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
strip_unused_nodes(type=float, shape="1,64,64,3")
remove_nodes(op=Identity, op=CheckNumerics)
fold_constants(ignore_errors=true)
fold_batch_norms
fold_old_batch_norms
backport_concatv2
quantize_weights(minimum_size=2)
quantize_nodes
strip_unused_nodes
sort_by_execution_order'
- Caffe
Only Caffe 1.0 and later is supported; please use the tools Caffe supplies to upgrade older models.
# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt
# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel
3. Build static library¶
3.1 Overview¶
MACE only builds static libraries. The following are two use cases.
- Build for a specific SoC
  You must set target_socs in the YAML configuration file. If you want to use the GPU of that SoC, MACE will automatically tune the parameters for better performance.
  Warning
  You should plug in a phone with that SoC.
- Build for all SoCs
  When no target_socs is specified, the library is suitable for all SoCs.
  Warning
  The performance will be a little worse than in the first case.
We supply a Python script tools/converter.py
to build the library and run the model from the command line.
Warning
You must run the script from the root directory of the MACE source tree.
3.2 tools/converter.py explanation¶
Commands
- build
  Note
  Build the static library and test tools.
  - --config (type=str, default="", required): the path of the model YAML configuration file.
  - --tuning (default=false, optional): whether to tune the parameters for the GPU of the specified SoC.
  - --enable_openmp (default=true, optional): whether to use OpenMP.
- run
  Note
  Run the models from the command line.
  - --config (type=str, default="", required): the path of the model YAML configuration file.
  - --round (type=int, default=1, optional): number of times to run the model.
  - --validate (default=false, optional): whether to verify that the results of MACE are consistent with those of the source framework.
  - --caffe_env (type=local/docker, default=docker, optional): the Caffe environment used for validation, either the local environment or the Caffe docker image.
  - --restart_round (type=int, default=1, optional): number of restart rounds between runs.
  - --check_gpu_out_of_memory (default=false, optional): whether to check for GPU out-of-memory errors.
  - --vlog_level (type=int[0-5], default=0, optional): verbose log level for debugging.
  Warning
  run relies on the build command; you should run after build.
- benchmark
  - --config (type=str, default="", required): the path of the model YAML configuration file.
  Warning
  benchmark relies on the build command; you should benchmark after build.
common arguments
argument(key) | argument(value) | default | required | commands | explanation |
---|---|---|---|---|---|
--omp_num_threads | int | -1 | N | run/benchmark | number of threads |
--cpu_affinity_policy | int | 1 | N | run/benchmark | 0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY |
--gpu_perf_hint | int | 3 | N | run/benchmark | 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH |
--gpu_priority_hint | int | 3 | N | run/benchmark | 0:DEFAULT/1:LOW/2:NORMAL/3:HIGH |
3.3 tools/converter.py usage examples¶
# print help message
python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h
# Build the static library
python tools/converter.py build --config=models/config.yaml
# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100
# Compare the results of MACE and the source platform; the **cosine distance** is used to represent similarity.
python tools/converter.py run --config=models/config.yaml --validate
# Benchmark Model: check the execution time of each Op.
python tools/converter.py benchmark --config=models/config.yaml
# Check the memory usage of the model (**keep only one model in the configuration file**)
python tools/converter.py run --config=models/config.yaml --round=10000 &
adb shell dumpsys meminfo | grep mace_run
sleep 10
kill %1
4. Deployment¶
The build command generates a package which contains the static library, model files and header files.
The package is located at ./build/${library_name}/libmace_${library_name}.tar.gz.
The following lists its contents in detail.
- header files
  include/mace/public/*.h
- static libraries
  library/${target_abi}/*.a
- dynamic libraries
  library/libhexagon_controller.so
  Note
  Only used for the DSP runtime.
- model files
  model/${MODEL_TAG}.pb
  model/${MODEL_TAG}.data
  Note
  The .pb file will be generated only when build_type is proto.
- OpenCL compiled kernel binary file
  opencl/compiled_kernel.bin
  Note
  This file will be generated only when target_soc is specified and the runtime is gpu.
  Warning
  This file depends on the OpenCL driver on the phone; you should update the file when the OpenCL driver changes.
5. How to use¶
Please refer to mace/examples/example.cc
for the full usage. The following lists the key steps.
// include the header files
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
#include "mace/public/mace_engine_factory.h"
// 0. Set the internal storage factory (call once)
const std::string file_path = "/path/to/store/internal/files";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);
// 1. Declare the device type (must match the ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;
// 2. Set the precompiled OpenCL binary file paths if you use the GPU of a specific SoC.
//    Note that the binary depends on the OpenCL driver of the SoC; if the OpenCL
//    driver changes, you should recompile the binary file.
if (device_type == DeviceType::GPU) {
  mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}
// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};
// 4. Create MaceEngine object
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from code
create_engine_status =
CreateMaceEngineFromCode(model_name.c_str(),
nullptr,
input_names,
output_names,
device_type,
&engine);
// Create Engine from proto file
create_engine_status =
CreateMaceEngineFromProto(model_pb_data,
model_data_file.c_str(),
input_names,
output_names,
device_type,
&engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
// do something
}
// 5. Create Input and Output objects
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
// Allocate input and output
int64_t input_size =
std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_in = std::shared_ptr<float>(new float[input_size],
std::default_delete<float[]>());
// load input
...
inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}
for (size_t i = 0; i < output_count; ++i) {
int64_t output_size =
std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
std::multiplies<int64_t>());
auto buffer_out = std::shared_ptr<float>(new float[output_size],
std::default_delete<float[]>());
outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}
// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);
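After Run() returns, the results are in the caller-owned output buffers created in step 5. The snippet below is a minimal sketch of reading them back; it assumes mace::MaceTensor exposes its float buffer through data() (check mace/public/mace.h for the exact accessor).
// 7. Read the results (sketch; assumes MaceTensor::data() returns the
//    shared float buffer passed in step 5).
if (status == MaceStatus::MACE_SUCCESS) {
  for (size_t i = 0; i < output_count; ++i) {
    const float *output_data = outputs[output_names[i]].data().get();
    int64_t output_size =
        std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
                        std::multiplies<int64_t>());
    // Post-process output_data[0 .. output_size - 1] here.
  }
}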
Operator lists¶
Operator | Android NN | Supported | Remark |
---|---|---|---|
AVERAGE_POOL_2D | Y | Y | |
BATCH_NORM | | Y | Fusion with activation is supported |
BATCH_TO_SPACE_ND | Y | Y | |
BIAS_ADD | | Y | |
CHANNEL_SHUFFLE | | Y | |
CONCATENATION | Y | Y | Only support channel axis concatenation |
CONV_2D | Y | Y | Fusion with BN and activation layer is supported |
DECONV_2D | N | Y | Only tensorflow model is supported |
DEPTHWISE_CONV_2D | Y | Y | Only multiplier = 1 is supported; Fusion is supported |
DEPTH_TO_SPACE | Y | Y | |
DEQUANTIZE | Y | Y | Model quantization will be supported later |
ELEMENT_WISE | Y | Y | ADD/MUL/DIV/MIN/MAX/NEG/ABS/SQR_DIFF/POW |
EMBEDDING_LOOKUP | Y | | |
FLOOR | Y | | |
FULLY_CONNECTED | Y | Y | |
GROUP_CONV_2D | | | Caffe model with group count = channel count is supported |
HASHTABLE_LOOKUP | Y | | |
L2_NORMALIZATION | Y | | |
L2_POOL_2D | Y | | |
LOCAL_RESPONSE_NORMALIZATION | Y | Y | |
LOGISTIC | Y | Y | |
LSH_PROJECTION | Y | | |
LSTM | Y | | |
MATMUL | | Y | |
MAX_POOL_2D | Y | Y | |
PAD | N | Y | |
PSROI_ALIGN | | Y | |
PRELU | | Y | Only caffe model is supported |
RELU | Y | Y | |
RELU1 | Y | Y | |
RELU6 | Y | Y | |
RELUX | | Y | |
RESHAPE | Y | Y | Limited support: only internal use of reshape in composed operations is supported |
RESIZE_BILINEAR | Y | Y | |
RNN | Y | | |
RPN_PROPOSAL_LAYER | | Y | |
SLICE | N | Y | Only support channel axis slice |
SOFTMAX | Y | Y | |
SPACE_TO_BATCH_ND | Y | Y | |
SPACE_TO_DEPTH | Y | Y | |
SVDF | Y | | |
TANH | Y | Y | |
Contributing guide¶
License¶
Every source file should contain a license header. See the existing files as an example.
Python coding style¶
Changes to Python code should conform to PEP8 Style Guide for Python Code.
You can use pycodestyle to check the style.
C++ coding style¶
Changes to C++ code should conform to Google C++ Style Guide.
You can use cpplint to check the style and use clang-format to format the code:
clang-format -style="{BasedOnStyle: google, \
DerivePointerAlignment: false, \
PointerAlignment: Right, \
BinPackParameters: false}" $file
C++ logging guideline¶
VLOG is used for verbose logging, which is configured by the environment variable MACE_CPP_MIN_VLOG_LEVEL.
The guideline for VLOG levels is as follows:
0. Ad hoc debug logging; should only be added in tests or temporary ad hoc debugging
1. Important network-level Debug/Latency trace log (Op runs should never generate level 1 vlogs)
2. Important op-level Latency trace log
3. Unimportant Debug/Latency trace log
4. Verbose Debug/Latency trace log
C++ macro¶
C++ macros should start with MACE_, except for the most common ones like LOG and VLOG.
Adding a new Op¶
You can create a custom op if it is not supported yet.
To add a custom op, you need to finish the following steps:
Define the Op class¶
Define the new Op class in mace/ops/my_custom_op.h.
#ifndef MACE_OPS_MY_CUSTOM_OP_H_
#define MACE_OPS_MY_CUSTOM_OP_H_
#include "mace/core/operator.h"
#include "mace/kernels/my_custom_op.h"
namespace mace {
namespace ops {
template <DeviceType D, typename T>
class MyCustomOp : public Operator<D, T> {
 public:
  MyCustomOp(const OperatorDef &op_def, Workspace *ws)
      : Operator<D, T>(op_def, ws),
        functor_() {}
  bool Run(StatsFuture *future) override {
    const Tensor *input = this->Input(INPUT);
    Tensor *output = this->Output(OUTPUT);
    functor_(input, output, future);
    return true;
  }
 protected:
  OP_INPUT_TAGS(INPUT);
  OP_OUTPUT_TAGS(OUTPUT);
 private:
  kernels::MyCustomOpFunctor<D, T> functor_;
};
}  // namespace ops
}  // namespace mace
#endif  // MACE_OPS_MY_CUSTOM_OP_H_
Register the new Op¶
Define the Op registering function in mace/ops/my_custom_op.cc.
#include "mace/ops/my_custom_op.h"
namespace mace {
namespace ops {
void Register_My_Custom_Op(OperatorRegistry *op_registry) {
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::CPU)
.TypeConstraint<float>("T")
.Build(),
Custom_Op<DeviceType::CPU, float>);
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::OPENCL)
.TypeConstraint<float>("T")
.Build(),
Custom_Op<DeviceType::OPENCL, float>);
REGISTER_OPERATOR(op_registry, OpKeyBuilder("my_custom_op")
.Device(DeviceType::OPENCL)
.TypeConstraint<half>("T")
.Build(),
Custom_Op<DeviceType::OPENCL, half>);
}
} // namespace ops
} // namespace mace
Then register the new Op in mace/core/operator.cc, as sketched below.
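A minimal sketch of that registration step follows; the exact surrounding code in mace/core/operator.cc may differ, so follow the existing Register_* calls there.
// mace/core/operator.cc (sketch): call the registering function together with
// the other built-in ops when the operator registry is constructed.
namespace mace {
namespace ops {
extern void Register_My_Custom_Op(OperatorRegistry *op_registry);
}  // namespace ops

OperatorRegistry::OperatorRegistry() {
  // ... existing Register_* calls for the built-in ops ...
  ops::Register_My_Custom_Op(this);
}
}  // namespace mace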
Implement the Op kernel code¶
You need to implement the CPU kernel in mace/kernels/my_custom_op.h and, optionally, the OpenCL kernel in mace/kernels/kernels/my_custom_op_opencl.cc and mace/kernels/kernels/cl/my_custom_op.cl. You can also optimize the CPU kernel with NEON.
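For reference, here is a minimal sketch of the CPU functor in mace/kernels/my_custom_op.h, matching the functor_(input, output, future) call made by the Op class above. The Tensor helpers used here (ResizeLike, MappingGuard, data<T>() and mutable_data<T>()) are assumptions based on how other kernels are written; adapt them to your actual computation.
// mace/kernels/my_custom_op.h (sketch only)
#ifndef MACE_KERNELS_MY_CUSTOM_OP_H_
#define MACE_KERNELS_MY_CUSTOM_OP_H_
#include "mace/core/future.h"
#include "mace/core/tensor.h"
namespace mace {
namespace kernels {
template <DeviceType D, typename T>
struct MyCustomOpFunctor {
  void operator()(const Tensor *input, Tensor *output, StatsFuture *future) {
    // Allocate the output with the same shape as the input (placeholder choice).
    output->ResizeLike(input);
    Tensor::MappingGuard input_guard(input);
    Tensor::MappingGuard output_guard(output);
    const T *input_data = input->data<T>();
    T *output_data = output->mutable_data<T>();
    // Placeholder element-wise computation: copy the input to the output.
    for (index_t i = 0; i < input->size(); ++i) {
      output_data[i] = input_data[i];
    }
  }
};
}  // namespace kernels
}  // namespace mace
#endif  // MACE_KERNELS_MY_CUSTOM_OP_H_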
Add test and benchmark¶
It's strongly recommended to add a unit test and a micro benchmark for your new Op. If you wish to contribute back, it's required.
Document the new Op¶
Finally, add an entry to the operator table in this document.
Memory layout¶
CPU runtime memory layout¶
The CPU tensor buffer is organized in the following order:
Tensor type | Buffer |
---|---|
Intermediate input/output | NCHW |
Convolution Filter | OIHW |
Depthwise Convolution Filter | MIHW |
1-D Argument, length = W | W |
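As an illustration of these layouts, the flat buffer offset of an element can be computed as follows; this is a generic sketch, not a MACE API.
#include <cstdint>

// NCHW intermediate input/output: offset = ((n * C + c) * H + h) * W + w
inline int64_t OffsetNCHW(int64_t n, int64_t c, int64_t h, int64_t w,
                          int64_t C, int64_t H, int64_t W) {
  return ((n * C + c) * H + h) * W + w;
}

// OIHW convolution filter: offset = ((o * I + i) * H + h) * W + w
inline int64_t OffsetOIHW(int64_t o, int64_t i, int64_t h, int64_t w,
                          int64_t I, int64_t H, int64_t W) {
  return ((o * I + i) * H + h) * W + w;
}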
GPU runtime memory layout¶
The GPU runtime implementation is based on OpenCL, which uses 2D images with the CL_RGBA channel order as tensor storage; this requires OpenCL 1.2 or above.
The way the tensor data is mapped to the OpenCL 2D image (RGBA) is critical for kernel performance.
In the CL_RGBA channel order, each 2D image pixel contains 4 data items. The following tables describe the mapping from different types of tensors to the 2D RGBA image.
Input/Output Tensor¶
The Input/Output Tensor is stored in NHWC format:
Tensor type | Buffer | Image size [width, height] | Explanation |
---|---|---|---|
Channel-Major Input/Output | NHWC | [W * (C+3)/4, N * H] | Default Input/Output format |
Height-Major Input/Output | NHWC | [W * C, N * (H+3)/4] | Winograd Convolution format |
Width-Major Input/Output | NHWC | [(W+3)/4 * C, N * H] | Winograd Convolution format |
Each pixel of the image contains 4 elements. The table below lists the coordinate relationship between the image and the buffer.
Tensor type | Pixel coordinate relationship | Explanation |
---|---|---|
Channel-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=i%W, c=[i/W * 4 + k])} | k=[0, 4) |
Height-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j%N, h=[j/H*4 + k], w=i%W, c=i/W)} | k=[0, 4) |
Width-Major Input/Output | P[i, j] = {E[n, h, w, c] | (n=j/H, h=j%H, w=[i%W*4 + k], c=i/W)} | k=[0, 4) |
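To make the mapping concrete, the sketch below computes the image coordinate (i, j) and the RGBA channel k for a tensor element E[n, h, w, c] in the Channel-Major layout; it simply inverts the formula in the table above and is illustrative only, not a MACE API.
#include <cstdint>

struct ImageCoord {
  int64_t i;  // image x coordinate
  int64_t j;  // image y coordinate
  int64_t k;  // RGBA channel index, in [0, 4)
};

// Channel-Major layout: n = j / H, h = j % H, w = i % W, c = (i / W) * 4 + k,
// therefore i = (c / 4) * W + w, j = n * H + h, k = c % 4.
inline ImageCoord ChannelMajorCoord(int64_t n, int64_t h, int64_t w, int64_t c,
                                    int64_t H, int64_t W) {
  return ImageCoord{(c / 4) * W + w, n * H + h, c % 4};
}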
Filter Tensor¶
Tensor | Buffer | Image size [width, height] | Explanation |
---|---|---|---|
Convolution Filter | OIHW | [I, (O+3)/4 * W * H] | Convolution filter format; there is no difference compared to [H*W*I, (O+3)/4] |
Depthwise Convolution Filter | MIHW | [H * W * M, (I+3)/4] | Depthwise-Convolution filter format |
Each pixel of the image contains 4 elements. The table below lists the coordinate relationship between the image and the buffer.
Tensor type | Pixel coordinate relationship | Explanation |
---|---|---|
Convolution Filter | P[m, n] = {E[o, i, h, w] | (o=[n/HW*4+k], i=m, h=T/W, w=T%W)} | HW= H * W, T=n%HW, k=[0, 4) |
Depthwise Convolution Filter | P[m, n] = {E[0, i, h, w] | (i=[n*4+k], h=m/W, w=m%W)} | only support multiplier == 1, k=[0, 4) |
1-D Argument Tensor¶
Tensor type | Buffer | Image size [width, height] | Explanation |
---|---|---|---|
1-D Argument | W | [(W+3)/4, 1] | 1D argument format, e.g. Bias |
Each pixel of the image contains 4 elements. The table below lists the coordinate relationship between the image and the buffer.
Tensor type | Pixel coordinate relationship | Explanation |
---|---|---|
1-D Argument | P[i, 0] = {E[w] | w=i*4+k} | k=[0, 4) |
Frequently asked questions¶
Does the tensor data consume extra memory when compiled into C++ code?¶
When compiled into C++ code, the data will be mmapped by the system loader. For the CPU runtime, the tensor data is used without any memory copy. For the GPU and DSP runtimes, the tensor data is used once during model initialization. The operating system is free to swap the pages out; however, it still consumes virtual memory space. So, generally speaking, it takes no extra physical memory. If you are short of virtual memory space (this should be very rare), you can choose to load the tensor data from a file, which can be unmapped after initialization.
Why is the generated static library file size so huge?¶
The static library is simply an archive of a set of object files, which are intermediate files and contain a lot of extra information; please check whether the final binary file size is as expected.
OpenCL allocator failed with CL_OUT_OF_RESOURCES¶
The OpenCL runtime usually requires contiguous virtual memory for its image buffers; this error occurs when the OpenCL driver can't find enough contiguous space due to high memory usage or fragmentation. Several solutions can be tried:
- Change the model by reducing its memory usage
- Split the Op with the biggest single memory buffer
- Change from armeabi-v7a to arm64-v8a to expand the virtual address space
- Reduce the memory consumption of other modules of the same process
Why is the performance worse than the official result for the same model?¶
The power options may not be set properly; see mace/public/mace_runtime.h for details.
Why does the UI become less responsive when running a model with the GPU runtime?¶
Try setting limit_opencl_kernel_time to 1. If the problem persists, try modifying the source code to use even smaller time intervals, or switch to the CPU or DSP runtime.
How to include more than one deployment file in one application (process)?¶
This case may happen when an application is developed by multiple teams as submodules. If all the submodules are linked into a single shared library, then using the same version of MiAI Compute Engine will resolve this issue. Otherwise, if different deployment models are contained in different shared libraries, it is not required to use the same MiAI version, but you should control the symbols exported from each shared library. This is actually a best practice for all shared libraries; please read about the GNU loader version script for more details.