2021-12-20: Machine Learning on Mobile and Embedded Devices

What are TensorFlow and TensorFlow Lite? 

 

TensorFlow is an open-source machine learning library widely adopted by deep learning practitioners and researchers. The library provides functionality for developing, training, and testing machine learning models on various devices through different Application Programming Interfaces (APIs), among them the Python API and the JavaScript API. While these APIs cover desktop, cloud, and web machine learning requirements, they do not support edge computing devices such as mobile phones and embedded systems. To address this gap, TensorFlow introduced TensorFlow Lite (TFLite), a lightweight solution for mobile and embedded systems. 

 


 


This blog post briefly discusses how TFLite fits into a development workflow and introduces a few key TF Lite concepts. We then apply some of these ideas to a simple model to observe their impact. Finally, we discuss how we (NIRDSLab) use TF Lite and these techniques, along with key takeaways.


TF Lite in Practice

Where does TensorFlow Lite fit in a typical machine learning workflow? For instance, consider a scenario where we need to build a deep learning-based mobile application. Irrespective of how we deploy it, we can break the workflow into three steps: 1. Model creation and training, 2. Model serialization and conversion, 3. Deployment.

 

Components of a Machine Learning Application Workflow

 


We can start our workflow by creating and training a model. Here, we can form a custom model, fine-tune a pre-trained model, or directly use a pre-trained model without any training. There is little to no distinction between the workflows of models targeting different platforms at this stage. In general, we perform any training required at this stage with the aid of high-performance computers. 


Once we have the trained model, we can move to the next step of model conversion and serialization. Here, we perform deployment-specific tasks to convert the model to an appropriate format for the target runtime/platform. For applications that run in the cloud or on-premise, we do not require any conversion. However, for applications running in a browser or on edge devices, we need to perform the conversion using the appropriate converter: the TensorFlow.js converter for the browser and Node.js, and the TFLite converter for edge devices. The conversion process applies a series of optimization steps during this stage, which makes the model efficient for the corresponding platform.


Once we have completed the conversion process, we can use the serialized converted model to perform inference (make predictions) on the target platform. At the time of writing, TF Lite supports Android, iOS, embedded Linux platforms, and a selected set of microcontrollers, including Arduino and Adafruit boards. An important consideration is that there are limitations on the supported devices and operations. As a result, we need to be cautious when deploying TF Lite models on edge devices, as we might not encounter these limitations until the very end of our workflow.  
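
For example, on platforms where a Python runtime is available, a converted model can be exercised with the TF Lite interpreter. The snippet below is a minimal sketch; it assumes a converted model file named model.tflite with a single float32 input, and feeds it a random input of the correct shape.

    import numpy as np
    import tensorflow as tf

    # Load the converted model and allocate its input/output tensors.
    interpreter = tf.lite.Interpreter(model_path="model.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # Dummy input shaped to match the model's first input tensor (float32 assumed).
    input_data = np.random.rand(*input_details[0]["shape"]).astype(np.float32)

    interpreter.set_tensor(input_details[0]["index"], input_data)
    interpreter.invoke()
    prediction = interpreter.get_tensor(output_details[0]["index"])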

 

TFLite Conversion

TF Lite Converter

As highlighted earlier, the TFLite conversion process is vital for turning the model we developed into an efficient format. The converter takes a TensorFlow model and generates the corresponding TFLite model. Here, we can use either the Python API or the command-line tool. However, the recommended option is the Python API, since it integrates easily with the model development workflow, whereas the command-line tool offers only limited functionality. 


When using the Python API, we have three options for providing the model (a short sketch follows the list).

  1. Saved model: A model saved in the SavedModel format using the low-level TensorFlow APIs (i.e., tf.* APIs). 
  2. Keras model: A model instance built with the TensorFlow Keras high-level APIs (i.e., tf.keras.* APIs). 
  3. Concrete functions: A model defined as TensorFlow functions (i.e., using tf.function).
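
The following is a minimal sketch of these three entry points; the paths and model objects are placeholders for illustration.

    import tensorflow as tf

    # 1. From a SavedModel directory on disk
    converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")

    # 2. From an in-memory Keras model instance
    # converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)

    # 3. From concrete functions traced with tf.function
    # converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_func])

    tflite_model = converter.convert()  # serialized model as bytes
    with open("model.tflite", "wb") as f:
        f.write(tflite_model)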


The TF Lite file generated by the converter uses the FlatBuffers format to store the model information, similar to the Protocol Buffers (Protobuf) format used by TensorFlow. The main advantage of FlatBuffers over Protobuf is that FlatBuffers do not need to be parsed or unpacked before the serialized data can be accessed. FlatBuffers are also more efficient in terms of memory usage and data access. 


Optimizations

In addition to the file-level changes, we can apply optimizations during the conversion process to further tailor the model to our edge device. These optimizations focus on:

  1. Model size - smaller storage and memory usage,
  2. Inference latency - reducing the time the model takes to process inputs, and
  3. Edge device accelerators - making better use of hardware accelerators available on the target device.

At the time of writing, TF Lite supports optimizations through quantization, pruning, and weight clustering.


Quantization

Quantization reduces the size of the model by modifying the representation of the model parameters. For instance, consider a model that uses 32-bit floating-point numbers for all model parameters. If we instead represent the parameters with 16-bit floating-point numbers, we can reduce the model size by up to 50%. Depending on the edge device, we may also achieve faster inference. However, such a conversion reduces the precision of the model parameters, which can lead to lower accuracy in the quantized model. 
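
As an illustration, the snippet below sketches float16 post-training quantization; it assumes model is a trained Keras model instance.

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)  # `model`: trained Keras model (assumed)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.target_spec.supported_types = [tf.float16]  # store weights as 16-bit floats
    tflite_fp16_model = converter.convert()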


There are two main types of quantization in TF Lite: post-training quantization and quantization-aware training. As the name suggests, we perform post-training quantization after the model has been trained, during the serialization and conversion step of the workflow. As a result, it is simpler to implement and does not require any changes to the model creation and training step. Quantization-aware training, on the other hand, emulates inference-time quantization during training and therefore requires modifications to the model creation and training step of our workflow. Since it mimics quantization during training, it often provides better accuracy than post-training quantization. 


Choosing a quantization scheme depends on several model characteristics that are beyond the scope of this blog; the TF Lite article on model optimization provides a decision tree to guide the process. Regardless of the scheme chosen, it is essential to consider what the target edge device supports. 


Pruning

Pruning works by eliminating redundant components or components that have only a minor impact on the model, resulting in a sparse model. We prune during the training phase of the workflow by eliminating the smallest weights (i.e., setting them to zero). Even though this technique does not change the model's architecture, the resulting sparsity allows the model to be compressed efficiently. However, a pruned model has the same runtime memory usage and latency as the original. 


Compressing the model plays a vital role in applications that require downloading models on demand. Depending on the application, pruning can cause a loss of accuracy, though it is often insignificant. 
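
A minimal sketch of magnitude-based pruning with the TensorFlow Model Optimization toolkit is shown below; it assumes a compiled Keras model named model and training data x_train, y_train, and the sparsity schedule values are illustrative.

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Gradually prune 50% of the weights over the first 1000 training steps.
    pruning_schedule = tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=1000)

    pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=pruning_schedule)
    pruned_model.compile(optimizer="adam",
                         loss="sparse_categorical_crossentropy",
                         metrics=["accuracy"])

    # UpdatePruningStep keeps the pruning schedule in sync with training.
    pruned_model.fit(x_train, y_train, epochs=2,
                     callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

    # Remove the pruning wrappers so only the sparse weights remain.
    final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)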

 

Clustering

Clustering (or weight clustering) optimizes models by reducing the number of unique weight values in a model. To do so, we first group the weights in each layer into a predefined number of clusters and then replace each weight with its cluster's centroid value. Similar to pruning, this technique improves how well the model compresses. 


A Simple Experiment

Now let's apply these concepts to a simple model that classifies handwritten digits. We will use the MNIST dataset, which contains images of handwritten digits along with their labels. For simplicity, we will use a feed-forward neural network with three dense layers (Dense 128, Dense 32, Dense 10) and experiment with quantization and clustering. As a benchmark for all the experiments, we first train the model without any optimizations and convert it to TF Lite; a rough sketch of this baseline appears below. The non-optimized model yielded a size of 412 KB and an accuracy of 97.4%. All the code and generated files from the experiment are available in the GitHub repository: https://github.com/mahanama94/TF-lite-blog/

 

Architecture of the experimentation model
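
The sketch below approximates this baseline setup (the exact code is in the linked repository); the training hyperparameters here are illustrative.

    import tensorflow as tf

    # Load and normalize the MNIST dataset.
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0

    # Simple feed-forward network: Dense 128 -> Dense 32 -> Dense 10.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))

    # Baseline TF Lite conversion without any optimizations.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    with open("baseline.tflite", "wb") as f:
        f.write(converter.convert())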

 


Quantization

Post-training Quantization

Now, for comparison, let us optimize the model by enabling the default optimization strategy in the converter. As a result of the optimization, the model's size shrank to 106 KB, with a slight drop in accuracy to 97.38%.
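
A sketch of this step, reusing the trained baseline model from above:

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)  # `model`: trained baseline (assumed)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # default post-training quantization
    with open("quantized.tflite", "wb") as f:
        f.write(converter.convert())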


Quantization Aware Training

As identified earlier, quantization-aware training requires modifications to the training step of our workflow. Here, instead of training the model right after creation, we first convert it to a model that emulates quantization during training. For this task, we can use the TensorFlow Model Optimization API (the tensorflow_model_optimization.quantization.keras.quantize_model call). The call returns a quantization-aware model instance built from the model we defined, which we then train as usual. Before conversion to TF Lite, the quantization-aware model achieved a test accuracy of 97.29%, which dropped to 93.36% upon conversion, with a converted model size of 106 KB. 
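
The sketch below outlines these steps, assuming model is the freshly created (untrained) Keras model and the MNIST data from the baseline sketch.

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Wrap the untrained model so that training emulates inference-time quantization.
    q_aware_model = tfmot.quantization.keras.quantize_model(model)
    q_aware_model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
    q_aware_model.fit(x_train, y_train, epochs=5)

    # Convert the quantization-aware model to TF Lite.
    converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    qat_tflite_model = converter.convert()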


Our experiments suggest that quantization shrinks the model size significantly with a slight accuracy trade-off. 


Weight Clustering

Performing weight clustering is somewhat similar to quantization-aware training. Here, instead of starting with an untrained model, we take a pre-trained model and apply weight clustering as a fine-tuning step. For that, we can use the TensorFlow Model Optimization API and wrap the model with the clustering functionality (which clusters the model weights during training). We can vary the number of clusters and the way the cluster centroids are initialized. We then fine-tune the model, usually with a relatively low learning rate and for fewer steps. 
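
The following is a rough sketch, assuming model is the pre-trained Keras baseline; the number of clusters, centroid initialization, and learning rate are illustrative choices.

    import tensorflow as tf
    import tensorflow_model_optimization as tfmot

    # Cluster the weights of each layer into 16 clusters, initialized linearly.
    clustered_model = tfmot.clustering.keras.cluster_weights(
        model,
        number_of_clusters=16,
        cluster_centroids_init=tfmot.clustering.keras.CentroidInitialization.LINEAR)

    # Fine-tune briefly with a low learning rate.
    clustered_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    clustered_model.fit(x_train, y_train, epochs=1)

    # Remove the clustering wrappers before converting to TF Lite.
    final_model = tfmot.clustering.keras.strip_clustering(clustered_model)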


We can then convert the clustered model to TF Lite, compress it, and compare it with the compressed TF Lite base model. In our experiment, the zip archive of the base model was 384 KB, compared to 56 KB for the zip archive of the clustered model. 
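
To reproduce the comparison, we can simply zip each converted file and compare sizes; a small sketch (file names are placeholders):

    import os
    import zipfile

    def zipped_size(path):
        # Write the file into a zip archive and return the archive size in bytes.
        archive = path + ".zip"
        with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            zf.write(path)
        return os.path.getsize(archive)

    print("baseline :", zipped_size("baseline.tflite"), "bytes")
    print("clustered:", zipped_size("clustered.tflite"), "bytes")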


How do we use it? 

At NIRDSLab, we use some of these concepts to improve the performance of our machine learning applications in eye tracking and computer vision [1]. Using machine learning on edge devices has several benefits for our applications:

  1. It eliminates potential network-related issues, such as network latency and connectivity failures.
  2. Some of the optimization techniques discussed in the blog can reduce the inference latency, therefore improving the throughput of our applications.
  3. Since there's no transmission over a network, we also get additional privacy benefits. 


However, there is a major technical constraint when using TF Lite: not all operations are supported on all devices, and as a result, not every TensorFlow model can be converted to a TF Lite model. Therefore, it is essential to check for compatibility early in the machine learning workflow.


References


[1] B. Mahanama, Y. Jayawardana, and S. Jayarathna, “Gaze-Net: appearance-based gaze estimation using capsule networks”, in Proceedings of the 11th Augmented Human International Conference, 2020, pp. 1–4.


-- Bhanuka (@mahanama94)



 

