Fast Inference: TFLite GPU Delegate!!

Running inference on edge devices, especially mobile devices, is very demanding. When you have a really big machine learning model, running inference with limited resources is a challenging task.

Many edge devices, especially mobile devices, have hardware accelerators such as a GPU. A TensorFlow Lite delegate lets us run a trained model on that hardware and leverage the benefits of hardware acceleration.

What is a TensorFlow Lite Delegate?

To delegate, in general, means to hand your work over to someone else. TensorFlow Lite supports several hardware accelerators that work can be handed over to.

A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.

Why should you use delegates?

Running inference with compute-heavy deep learning models on edge devices is resource-demanding because of the devices' limited processing power, memory, and battery. Instead of relying only on the CPU, some devices have hardware accelerators, such as a GPU or a DSP (Digital Signal Processor), that allow for better performance and higher energy efficiency.

How do TFLite delegates work?

Figure: How TFLite delegates work. Source: tensorflow.org

Let us consider the graph on the left side. It has an input node that receives the input for inference. The input passes through a Conv2D (convolution) operation and a Mean operation, and the outputs of these two operations are used to compute the SquaredDifference.

Let us assume we have a hardware accelerator that can perform the Conv2D and Mean operations very quickly and efficiently.

In this case, we delegate these two operations, Conv2D and Mean, to the specialized hardware accelerator using a TFLite delegate, and the graph changes accordingly: the two nodes are replaced by a single delegate node.

The TFLite GPU delegate hands supported operations over to the GPU, if one is available.

TFLite allows us to provide delegates for specific operations, in which case the graph is split into multiple subgraphs, where each subgraph is handled by a delegate. Every subgraph that is handled by a delegate is replaced with a node that evaluates the subgraph when it is invoked. Depending on the model, the final graph can end up with one node or many nodes: either the whole graph was delegated, or several delegate nodes each handle one subgraph. In general, you don't want to have multiple subgraphs handled by the delegate, since each time you switch from the delegate to the main graph, there is an overhead for passing the results from the subgraph to the main graph.

This overhead cannot always be avoided by sharing memory between the two, because it's not always safe to share memory.
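To make the splitting more concrete, here is a minimal sketch in plain Java. It is a toy model, not how TensorFlow Lite partitions a graph internally: it assumes the ops run in a simple linear order, and the class and method names (DelegatePartitionSketch, partition, delegateNodes) are made up for illustration. It groups consecutive ops by whether a hypothetical delegate supports them, so you can see how an unsupported op in the middle of the model produces more delegate nodes and therefore more hand-offs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of delegate partitioning. All names are hypothetical;
// this is not how TensorFlow Lite is implemented internally.
class DelegatePartitionSketch {

    // Splits ops (in execution order) into runs that are either fully
    // supported by the delegate or fully left to the CPU.
    static List<List<String>> partition(List<String> ops, Set<String> supported) {
        List<List<String>> partitions = new ArrayList<>();
        List<String> current = new ArrayList<>();
        boolean currentDelegated = false;
        for (String op : ops) {
            boolean delegated = supported.contains(op);
            // A change of executor closes the current run.
            if (!current.isEmpty() && delegated != currentDelegated) {
                partitions.add(current);
                current = new ArrayList<>();
            }
            currentDelegated = delegated;
            current.add(op);
        }
        if (!current.isEmpty()) {
            partitions.add(current);
        }
        return partitions;
    }

    // Counts the runs that would each be replaced by one delegate node.
    static int delegateNodes(List<List<String>> partitions, Set<String> supported) {
        int count = 0;
        for (List<String> run : partitions) {
            if (supported.contains(run.get(0))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<String> supported = Set.of("CONV_2D", "MEAN");
        List<String> friendly = List.of("CONV_2D", "MEAN", "SQUARED_DIFFERENCE");
        List<String> interleaved =
                List.of("CONV_2D", "SQUARED_DIFFERENCE", "MEAN", "SQUARED_DIFFERENCE");
        for (List<String> ops : List.of(friendly, interleaved)) {
            List<List<String>> parts = partition(ops, supported);
            System.out.println(parts + " delegate nodes: " + delegateNodes(parts, supported));
        }
    }
}
```

In the first op list the two supported ops sit next to each other and end up in one delegate node. In the second, the same two ops are separated by an unsupported op, so they land in two delegate nodes and the results have to cross between the delegate and the main graph more often.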

How to add a delegate?

  1. Define a kernel node that is responsible for evaluating the delegate subgraph.
  2. Create an instance of TfLiteDelegate, which will register the kernel and claim the nodes that the delegate can execute.
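
The two steps can be pictured with a small toy model in plain Java. This is only a sketch under loud assumptions: the real TfLiteDelegate API looks different, every name below (ToyDelegateSketch, DelegateKernel, ToyDelegate, claim, invoke) is invented for illustration, and the "ops" are simple arithmetic on a single number rather than tensor operations.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.DoubleUnaryOperator;

// Toy model of the two steps. Every class and method name here is hypothetical;
// the real TfLiteDelegate API has a different shape.
class ToyDelegateSketch {

    // Step 1: a kernel node that evaluates the whole claimed subgraph when invoked.
    static final class DelegateKernel {
        final List<String> claimedOps = new ArrayList<>();
        private final List<DoubleUnaryOperator> steps = new ArrayList<>();

        double invoke(double input) {
            double value = input;
            for (DoubleUnaryOperator step : steps) {
                value = step.applyAsDouble(value);
            }
            return value;
        }
    }

    // Step 2: the delegate knows which ops it can execute and claims them.
    static final class ToyDelegate {
        private final Map<String, DoubleUnaryOperator> supported;

        ToyDelegate(Map<String, DoubleUnaryOperator> supported) {
            this.supported = supported;
        }

        // Claims the leading run of supported ops and registers them with one kernel.
        DelegateKernel claim(List<String> ops) {
            DelegateKernel kernel = new DelegateKernel();
            for (String op : ops) {
                DoubleUnaryOperator impl = supported.get(op);
                if (impl == null) {
                    break; // the first unsupported op stays with the main graph
                }
                kernel.claimedOps.add(op);
                kernel.steps.add(impl);
            }
            return kernel;
        }
    }

    public static void main(String[] args) {
        Map<String, DoubleUnaryOperator> impls =
                Map.of("DOUBLE", x -> x * 2.0, "ADD_ONE", x -> x + 1.0);
        ToyDelegate delegate = new ToyDelegate(impls);
        DelegateKernel kernel = delegate.claim(List.of("DOUBLE", "ADD_ONE", "SQRT"));
        System.out.println(kernel.claimedOps + " -> " + kernel.invoke(3.0));
    }
}
```

Here the delegate claims DOUBLE and ADD_ONE, leaves the unsupported SQRT to the main graph, and the single kernel evaluates both claimed ops in one invoke call.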

Android:

TensorFlow provides a demo app for Android.

In your application, add the TensorFlow Lite GPU delegate AAR as a dependency, import the org.tensorflow.lite.gpu.GpuDelegate class, and use the addDelegate function to register the GPU delegate with the interpreter:

import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Initialize interpreter with GPU delegate
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
Interpreter interpreter = new Interpreter(model, options);

// Run inference (repeat this block for each new input)
writeToInput(input);
interpreter.run(input, output);
readFromOutput(output);

// Clean up: close the interpreter first, then the delegate
interpreter.close();
delegate.close();

iOS:

Include the GPU delegate header and call the Interpreter::ModifyGraphWithDelegate function to register the GPU delegate to the interpreter:

#import "tensorflow/lite/delegates/gpu/metal_delegate.h"

// Initialize interpreter with GPU delegate
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, resolver)(&interpreter);
auto* delegate = NewGpuDelegate(nullptr);  // default config
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference
while (true) {
  WriteToInputTensor(interpreter->typed_input_tensor<float>(0));
  if (interpreter->Invoke() != kTfLiteOk) return false;
  ReadFromOutputTensor(interpreter->typed_output_tensor<float>(0));
}

// Clean up
interpreter = nullptr;
DeleteGpuDelegate(delegate);

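Note:

Some operations that are trivial on the CPU may have a high cost for the GPU. Reshape-style operations such as BATCH_TO_SPACE, SPACE_TO_BATCH, and SPACE_TO_DEPTH are one such class.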

Reference Link:

https://www.tensorflow.org/lite/performance/gpu
