How to build

Supported Platforms

Platform     Explanation
TensorFlow   >= 1.6.0 (first choice; convenient for the Android NN API in the future)
Caffe        >= 1.0

Environment Requirement

MACE provides a docker image which contains all the required environments. The Dockerfile is under the ./docker directory. The following are the commands to start it:

sudo docker pull cr.d.xiaomi.net/mace/mace-dev
sudo docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host -v /local/path:/container/path cr.d.xiaomi.net/mace/mace-dev /bin/bash
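
Once inside the container, you can verify that the USB passthrough and host networking work by checking that the attached phone is visible (a quick sanity check, not part of the official steps):

# run inside the container
adb devices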

If you want to run on your local computer instead, you have to install the following software.

software            version         install command
bazel               >= 0.13.0       bazel installation
android-ndk         r15c/r16b       reference the Dockerfile
adb                 >= 1.0.32       apt-get install android-tools-adb
tensorflow          >= 1.6.0        pip install -I tensorflow==1.6.0 (if you use a TensorFlow model)
numpy               >= 1.14.0       pip install -I numpy==1.14.0
scipy               >= 1.0.0        pip install -I scipy==1.0.0
jinja2              >= 2.10         pip install -I jinja2==2.10
PyYaml              >= 3.12.0       pip install -I pyyaml==3.12
sh                  >= 1.12.14      pip install -I sh==1.12.14
filelock            >= 3.0.0        pip install -I filelock==3.0.0
docker (for caffe)  >= 17.09.0-ce   install doc
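
To quickly check that a local environment satisfies these requirements, commands like the following can be used (a convenience sketch, not part of the official tooling):

bazel version
adb version
python -c "import tensorflow; print(tensorflow.__version__)"
python -c "import numpy, scipy, jinja2, yaml, sh, filelock"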

Docker Images

  • Log in to the docker registry
docker login cr.d.xiaomi.net
  • Build with the Dockerfile (located under ./docker)
docker build -t cr.d.xiaomi.net/mace/mace-dev ./docker
  • Pull the image from the docker registry
docker pull cr.d.xiaomi.net/mace/mace-dev
  • Create container
# Set 'host' network to use ADB
docker run -it --rm -v /local/path:/container/path --net=host cr.d.xiaomi.net/mace/mace-dev /bin/bash

Usage

1. Pull code with latest tag

Warning

Please do not use the master branch for deployment.

git clone git@v9.git.n.xiaomi.com:deep-computing/mace.git

# update
git fetch --all --tags --prune

# get latest tag version
tag_name=`git describe --abbrev=0 --tags`

# checkout to latest tag branch
git checkout -b ${tag_name} tags/${tag_name}

2. Model Optimization

  • Tensorflow

TensorFlow supplies a model optimization tool to speed up inference. The docker image contains the tool; alternatively, you can download the transform_graph binary or compile it from the TensorFlow source code.
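
If you prefer to compile the tool yourself, it can be built from a TensorFlow source checkout with Bazel (the standard TensorFlow build target, listed here for convenience):

# inside a TensorFlow source checkout
bazel build tensorflow/tools/graph_transforms:transform_graph
# the binary is generated at bazel-bin/tensorflow/tools/graph_transforms/transform_graph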

The following commands show the optimizations for CPU/GPU and for DSP.

# CPU/GPU:
./transform_graph \
    --in_graph=tf_model.pb \
    --out_graph=tf_model_opt.pb \
    --inputs='input' \
    --outputs='output' \
    --transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
        strip_unused_nodes(type=float, shape="1,64,64,3")
        remove_nodes(op=Identity, op=CheckNumerics)
        fold_constants(ignore_errors=true)
        flatten_atrous_conv
        fold_batch_norms
        fold_old_batch_norms
        strip_unused_nodes
        sort_by_execution_order'

# DSP:
./transform_graph \
    --in_graph=tf_model.pb \
    --out_graph=tf_model_opt.pb \
    --inputs='input' \
    --outputs='output' \
    --transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
        strip_unused_nodes(type=float, shape="1,64,64,3")
        remove_nodes(op=Identity, op=CheckNumerics)
        fold_constants(ignore_errors=true)
        fold_batch_norms
        fold_old_batch_norms
        backport_concatv2
        quantize_weights(minimum_size=2)
        quantize_nodes
        strip_unused_nodes
        sort_by_execution_order'
  • Caffe

Only Caffe versions 1.0 and later are supported; please use the tools supplied by Caffe to upgrade older models.

# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt

# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel

3. Build static library

3.1 Overview

MACE only builds static libraries. The following are the two use cases.

  • Build for a specific SoC

    You must set target_socs in the yaml configuration file. If you want to use the GPU on that SoC, MACE will automatically tune the parameters for better performance.

    Warning

    You should plug in a phone with that SoC.

  • Build for all SoCs

    When no target_socs is specified, the library is suitable for all SoCs.

    Warning

    The performance will be slightly worse than in the first case.

We supply a Python script, tools/converter.py, to build the library and run the model from the command line.

Warning

The script must be run from the root directory of the MACE source tree.

3.2 tools/converter.py explanation

Commands

  • build

    Note

    Builds the static library and test tools.

    • --config (type=str, default="", required): the path of the model yaml configuration file.
    • --tuning (default=false, optional): whether to tune the parameters for the GPU of the specified SoC.
    • --enable_openmp (default=true, optional): whether to use OpenMP.
  • run

    Note

    Runs the models from the command line.

    • --config (type=str, default="", required): the path of the model yaml configuration file.
    • --round (type=int, default=1, optional): number of times to run.
    • --validate (default=false, optional): whether to verify that the results of MACE are consistent with those of the original framework.
    • --caffe_env (type=local/docker, default=docker, optional): the Caffe environment used for validation, either the local environment or the Caffe docker image.
    • --restart_round (type=int, default=1, optional): number of restart rounds between runs.
    • --check_gpu_out_of_memory (default=false, optional): whether to check for GPU out-of-memory errors.
    • --vlog_level (type=int[0-5], default=0, optional): verbose log level for debugging.

    Warning

    run depends on the build command; you should run only after build.

  • benchmark
    • --config (type=str, default="", required): the path of the model yaml configuration file.

    Warning

    benchmark depends on the build command; you should benchmark only after build.

Common arguments

argument(key)            type  default  required  commands       explanation
--omp_num_threads        int   -1       N         run/benchmark  number of threads
--cpu_affinity_policy    int   1        N         run/benchmark  0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
--gpu_perf_hint          int   3        N         run/benchmark  0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
--gpu_priority_hint      int   3        N         run/benchmark  0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
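
For example, to run a model with 2 threads pinned to the big cores (the flags are taken from the table above; the config path is illustrative):

python tools/converter.py run --config=models/config.yaml --round=100 --omp_num_threads=2 --cpu_affinity_policy=1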

3.3 tools/converter.py usage examples

# print help message
python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h

# Build the static library
python tools/converter.py build --config=models/config.yaml

# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100

# Compare the results of MACE and the original framework. The **cosine distance** is used to represent similarity.
python tools/converter.py run --config=models/config.yaml --validate

# Benchmark Model: check the execution time of each Op.
python tools/converter.py benchmark --config=models/config.yaml

# Check the memory usage of the model (**keep only one model in the configuration file**)
python tools/converter.py run --config=models/config.yaml --round=10000 &
sleep 10
adb shell dumpsys meminfo | grep mace_run
kill %1

4. Deployment

The build command generates a package containing the static library, model files and header files at ./build/${library_name}/libmace_${library_name}.tar.gz. The contents are listed below; an example of unpacking the package is given at the end of this section.

header files
  • include/mace/public/*.h
static libraries
  • library/${target_abi}/*.a
dynamic libraries
  • library/libhexagon_controller.so

Note

Only used for DSP.

model files
  • model/${MODEL_TAG}.pb
  • model/${MODEL_TAG}.data

Note

The .pb file will be generated only when build_type is proto.

OpenCL compiled kernel binary file
  • opencl/compiled_kernel.bin

Note

This file will be generated only when target_socs is specified and the runtime is gpu.

Warning

This file depends on the OpenCL driver on the phone; you should update the file when the OpenCL driver changes.
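
To unpack the generated package and inspect its layout (assuming a hypothetical library_name of mobilenet set in the yaml configuration):

mkdir -p /tmp/mace_deploy
tar -xzvf build/mobilenet/libmace_mobilenet.tar.gz -C /tmp/mace_deploy
ls -R /tmp/mace_deploy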

5. How to use

Please refer to mace/examples/example.cc for full usage. The following lists the key steps.

// include the header files
#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
#include "mace/public/mace_engine_factory.h"

// 0. Set the internal storage factory (**call once**)
const std::string file_path = "/path/to/store/internal/files";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);

// 1. Set the paths of the precompiled OpenCL binaries if you use the GPU of a specific SoC.
//    Note that the binaries depend on the OpenCL driver of that SoC;
//    if the OpenCL driver changes, you should recompile the binary files.
if (device_type == DeviceType::GPU) {
  mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}

// 2. Declare the device type (must be the same as ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;

// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};

// 4. Create MaceEngine object
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
// Create Engine from code
create_engine_status =
    CreateMaceEngineFromCode(model_name.c_str(),
                             nullptr,
                             input_names,
                             output_names,
                             device_type,
                             &engine);
// Create Engine from proto file
create_engine_status =
    CreateMaceEngineFromProto(model_pb_data,
                              model_data_file.c_str(),
                              input_names,
                              output_names,
                              device_type,
                              &engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
  // do something
}

// 5. Create Input and Output objects
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
  // Allocate input and output
  int64_t input_size =
      std::accumulate(input_shapes[i].begin(), input_shapes[i].end(), 1,
                      std::multiplies<int64_t>());
  auto buffer_in = std::shared_ptr<float>(new float[input_size],
                                          std::default_delete<float[]>());
  // load input
  ...

  inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}

for (size_t i = 0; i < output_count; ++i) {
  int64_t output_size =
      std::accumulate(output_shapes[i].begin(), output_shapes[i].end(), 1,
                      std::multiplies<int64_t>());
  auto buffer_out = std::shared_ptr<float>(new float[output_size],
                                           std::default_delete<float[]>());
  outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}

// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);
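
// 7. After Run() returns MACE_SUCCESS, the output data has been written into
//    the buffers registered above (buffer_out for each output tensor), so the
//    results can be read directly from those buffers.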