Introduction to the Intel OpenVINO Toolkit!!!

I received the Udacity Intel AI Edge scholarship, and the introductory material covered Intel OpenVINO. In this article, we are going to explore the basics of OpenVINO.

The OpenVINO toolkit can boost your inference applications across multiple deep neural networks with high throughput and efficiency.

OpenVINO stands for Open Visual Inference and Neural Network Optimization.


What is OpenVINO?

OpenVINO stands for Open Visual Inference and Neural Network Optimization. OpenVINO is a toolkit provided by Intel to facilitate faster inference of deep learning computer vision models. This toolkit helps developers to create cost-effective and robust computer vision applications.

Learn more about OpenVINO here.

It enables deep learning inference at the edge and supports heterogeneous execution across computer vision accelerators — CPU, GPU, Intel® Movidius™ Neural Compute Stick, and FPGA.

OpenVINO provides a number of pre-trained models here.

Download & Install OpenVINO:

There is very good official documentation of OpenVINO from Intel, which is descriptive and easy to understand. Please use the links below.

  1. To download

Choose & Download | Intel® Distribution of OpenVINO™ Toolkit

2. Get started

Get Started | Intel® Distribution of OpenVINO™ Toolkit

Overview of OpenVINO:

The execution process is as follows:

  • We feed a pre-trained model to the Model Optimizer. It optimizes the model and converts it into its intermediate representation (.xml and .bin files).
  • The Inference Engine handles the execution of the model on many different devices. It manages the libraries required to run the code properly on different platforms.

The two main components of the OpenVINO toolkit are the Model Optimizer and the Inference Engine. So, we will deep dive into their details to get a better understanding of what happens under the hood.

Model Optimizer:

The model optimizer is a cross-platform CLI tool that facilitates the transition between the training and deployment environment. It adjusts the deep learning models for optimal execution on end-point target devices. If you want to understand the optimization technique for TensorFlow you can check out this article.

If you check the diagram carefully, the Model Optimizer involves three steps:

  1. Converting
  2. Optimizing
  3. Preparing for inference.

OpenVINO is a toolkit, not a deep learning library that helps you train a model. It helps you optimize and serve the model on different devices.

There is detailed documentation on how this works under the hood; I won't go into those details here.

Inference Engine:

Now our model is ready for inference. The Model Optimizer CLI has converted and optimized the model, producing its intermediate representation. This IR is the input the Inference Engine uses to run inference over the input data.

The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices.

The best thing about the OpenVINO Inference Engine is the heterogeneous execution of the model: it uses different plug-ins for different devices.
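
To make this concrete, below is a minimal sketch of loading an IR and running inference with the Inference Engine's Python API (the openvino.inference_engine module from the 2020-era releases; exact attribute names vary slightly between OpenVINO versions, and the model paths and input array here are placeholders):

from openvino.inference_engine import IECore
import numpy as np

# Paths to the IR produced by the Model Optimizer (placeholders).
model_xml = "model.xml"
model_bin = "model.bin"

ie = IECore()
net = ie.read_network(model=model_xml, weights=model_bin)

# Pick a device plug-in: "CPU", "GPU", "MYRIAD", etc.
exec_net = ie.load_network(network=net, device_name="CPU")

input_blob = next(iter(net.input_info))
output_blob = next(iter(net.outputs))

# Dummy input matching the network's expected NCHW shape.
n, c, h, w = net.input_info[input_blob].input_data.shape
image = np.random.rand(n, c, h, w).astype(np.float32)

result = exec_net.infer(inputs={input_blob: image})
print(result[output_blob].shape)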

Code sample Example:

We will take the store-aisle-monitor-python sample. This code sample has been provided by Intel.

We will go through some of its code snippets with brief descriptions.

# Initialize the class
infer_network = Network()

# Load the network to IE plugin to get shape of input layer

n, c, h, w = infer_network.load_model(args.model, args.device, 1, 1, 2, args.cpu_extension)[1]

The above code is self-explanatory: we are just initializing the Network class and loading the model using the load_model function.
The load_model function returns the plugin along with the input shape.
We only need the input shape, which is why we index with [1] after the function call.
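
Before submitting a frame for inference, it typically has to be resized and laid out to match that (n, c, h, w) input shape. Here is a rough sketch of the kind of preprocessing such samples usually do (variable names are illustrative, not copied from the repo; frame is the current video frame read from the input stream):

import cv2

# Resize the BGR frame to the network's expected width/height,
# move channels first (HWC -> CHW), and add the batch dimension.
in_frame = cv2.resize(frame, (w, h))
in_frame = in_frame.transpose((2, 0, 1))
in_frame = in_frame.reshape((n, c, h, w))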

# The exec_net function will start an asynchronous inference request.
infer_network.exec_net(next_request_id, in_frame)

We need to pass the request ID and the input frame for inference.

res = infer_network.get_output(cur_request_id)
for obj in res[0][0]:
    if obj[2] > args.prob_threshold:
        xmin = int(obj[3] * initial_w)
        ymin = int(obj[4] * initial_h)
        xmax = int(obj[5] * initial_w)
        ymax = int(obj[6] * initial_h)
        class_id = int(obj[1])

The get_output function gives us the model's result.

You can clone the git repo and start getting your hands dirty. Happy coding!


This reference implementation is also available in C++; it counts the number of people…

Some cool projects, just FYI:

Intel® IoT Developer Kit: IoT libraries & code samples from Intel.

Introduction to Intel® Distribution of OpenVINO™ toolkit for Computer Vision Applications (Intel® Developer Zone).

TensorFlow Lite model inferencing, fast and lean!!

This article talks about how TFLite achieves inference across all the different types of edge devices in a fast and lean way.

We have a diverse set of edge devices: IoT devices, mobile devices, embedded devices, etc. How does TFLite run inference on them in a seamless and elegant way? To understand this, let us jump into it.

What is an interpreter?

As we know, TFLite is a set of tools, and it has two core components:

  1. Converter
  2. Interpreter

The converter helps us convert deep learning models into the TFLite format, and the interpreter makes our life easier while inferencing.

The TensorFlow Lite interpreter runs specially optimized models on many different hardware types, including mobile phones, embedded Linux devices, and microcontrollers.

People often refer to the TFLite interpreter and inferencing interchangeably. The term inference refers to the process of executing a TensorFlow Lite model on edge devices in order to make predictions based on input data. To perform inference with a TensorFlow Lite model, you must run it through the interpreter.

The TFLite interpreter is designed to be lean and fast. To achieve this, it uses a static graph ordering and a custom memory allocator to ensure minimal load, initialization, and execution latency.

Steps of inferencing:

TensorFlow inference APIs are provided for most common mobile/embedded platforms such as Android, iOS, & Linux, in multiple programming languages. Across all libraries, the TensorFlow Lite API enables you to load models, feed inputs, and retrieve inference outputs.

The TFLite interpreter generally follows the steps below:

  1. Loading a model:- 

 The first and most essential step is to load the .tflite model into memory; it contains the model's execution graph.

2. Transforming data:- 

 The model doesn't understand raw input data. To make the raw data compatible with the format the model understands, you need to transform it. For example, for a computer vision model you need to resize the input image before providing it to the model.

3. Running inference:- 

Now that the model is in memory and the data is in the required format, let us run the inference. It involves a few steps, such as building the interpreter and allocating tensors.

4. Interpreting output:-

After the third step, we get some output from inference, but the end user won't understand it directly. Model results are most of the time probabilities or approximate values. We have to interpret this result into meaningful output.


Let us do model inferencing using Python:

import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Test the model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)

# Run the inference.
interpreter.invoke()

# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])

Here is an example in C++; even though the language or the underlying platform changes, the steps are the same:

// Load the model
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile("converted_model.tflite");

// Build the interpreter
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);

// Resize input tensors, if desired.
interpreter->AllocateTensors();

float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.

interpreter->Invoke();

// Output data
float* output = interpreter->typed_output_tensor<float>(0);


In this article, we explored the TFLite interpreter, the steps involved in TFLite inferencing, and how to do it.


Fast Inference: TFLite GPU Delegate!!

Running inference on edge devices, especially on mobile devices, is very demanding. When you have a really big machine learning model, running inference with limited resources is a crucial task.

Many edge devices, especially mobile devices, have hardware accelerators such as a GPU. The TensorFlow Lite Delegate lets us optimize our trained model and leverage the benefits of hardware acceleration.

What is Tensorflow Lite Delegate?

A delegate's job, in general, is to delegate or transfer your work to someone else. TensorFlow Lite supports several hardware accelerators.

A TensorFlow Lite delegate is a way to delegate part or all of graph execution to another executor.

Why should you use delegates?

Running inference on compute-heavy deep learning models on edge devices is resource-demanding due to mobile devices' limited processing, memory, and power. Instead of relying on the device CPU, some devices have hardware accelerators, such as a GPU or DSP (digital signal processor), that allow for better performance and higher energy efficiency.

How does a TFLite Delegate work?

How a TFLite Delegate works.

Let us consider the graph on the left side. It has an input node where we get the input for inference. The input goes through a convolution operation and then a mean operation, and the graph uses the output of these two operations to compute SquareDifference.

Let us assume we have a hardware accelerator that can perform the Conv2D and mean operations very fast and efficiently; then the above graph will look like this:

In this case, we delegate the Conv2D and mean operations to the specialized hardware accelerator using a TFLite delegate.

The TFLite GPU delegate will delegate these operations to the GPU if one is available.
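
On Android you register the GPU delegate through the Java API (shown later in this article); from Python, the general mechanism for attaching a delegate to the interpreter is tf.lite.experimental.load_delegate. A hedged sketch, where the delegate library name is only a placeholder for whatever delegate binary your platform provides:

import tensorflow as tf

# Load a delegate shared library (placeholder name; platform-specific).
delegate = tf.lite.experimental.load_delegate("libtensorflowlite_gpu_delegate.so")

# Attach the delegate when building the interpreter; supported ops are
# then executed by the delegate instead of the CPU kernels.
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate])
interpreter.allocate_tensors()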

TFLite allows us to provide delegates for specific operations, in which case the graph is split into multiple subgraphs, where each subgraph is handled by a delegate. Every subgraph that is handled by a delegate is replaced with a node that evaluates that subgraph when it is invoked. Depending on the model, the final graph can end up with one node, meaning the whole graph was delegated, or with many nodes, each handling one subgraph. In general, you don't want multiple subgraphs handled by the delegate, because each time you switch between the delegate and the main graph there is an overhead for passing the results from the subgraph to the main graph.

It’s not always safe to share memory.

How to add a delegate?

  1. Define a kernel node that is responsible for evaluating the delegate subgraph.
  2. Create an instance of TfLiteDelegate, which will register the kernel and claim the nodes that the delegate can execute.


TensorFlow has provided a demo app for Android:

In your application, add the AAR as above, import the org.tensorflow.lite.gpu.GpuDelegate module, and use the addDelegate function to register the GPU delegate with the interpreter:

import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Initialize interpreter with GPU delegate
GpuDelegate delegate = new GpuDelegate();
Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);
Interpreter interpreter = new Interpreter(model, options);

// Run inference
while (true) {
  writeToInput(input);
  interpreter.run(input, output);
  readFromOutput(output);
}

// Clean up
delegate.close();


For iOS (C++), include the GPU delegate header and call the Interpreter::ModifyGraphWithDelegate function to register the GPU delegate with the interpreter:

#import "tensorflow/lite/delegates/gpu/metal_delegate.h"

// Initialize interpreter with GPU delegate
std::unique_ptr<Interpreter> interpreter;
InterpreterBuilder(*model, resolver)(&interpreter);
auto* delegate = NewGpuDelegate(nullptr);  // default config
if (interpreter->ModifyGraphWithDelegate(delegate) != kTfLiteOk) return false;

// Run inference
while (true) {
  if (interpreter->Invoke() != kTfLiteOk) return false;
}

// Clean up
interpreter = nullptr;
DeleteGpuDelegate(delegate);


Some operations that are trivial on the CPU may have a high cost for the GPU.


Optimization techniques – TFLite!!

One of the most popular optimization techniques is called quantization.

Running machine learning models and making inferences on mobile or embedded devices comes with certain challenges, such as a limited amount of memory, power, and data storage, so it is crucial to deploy optimized ML models on edge devices.

It's critical to deploy optimized machine learning models on mobile and embedded devices so that they run efficiently. There are several optimization techniques, and one of them is quantization. In the last article, we saw how to use the TFLite Converter to optimize the model for edge devices without any modification to weights and activation types.

What is Quantization?

Quantization is generally used in mathematics and digital signal processing. Below is the wiki definition.

Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes.

Quantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the dominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. In quantization, FP32 weights and output activations are converted into the nearest 8-bit integer, sometimes 4/2/1-bit as well.
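
As a toy illustration of the idea (this is the generic affine-quantization formula, not TFLite's exact kernel), mapping float32 values into uint8 looks roughly like this:

import numpy as np

weights = np.array([-0.7, 0.0, 0.31, 1.2], dtype=np.float32)

# Affine quantization: map [min, max] onto the 0..255 integer range.
w_min, w_max = weights.min(), weights.max()
scale = (w_max - w_min) / 255.0
zero_point = int(np.round(-w_min / scale))

quantized = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
dequantized = (quantized.astype(np.float32) - zero_point) * scale

print(quantized)     # 8-bit representation
print(dequantized)   # approximate reconstruction of the original floats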

Quantization optimizes the model by quantizing the weights and activation types. TFLite uses quantization to speed up inference on edge devices. The TFLite converter is the answer to whether we can manage a deep learning model with lower precision. Now that you know what quantization is, let us deep dive:

Quantization dramatically reduces both the memory requirement and computational cost of using neural networks.

Quantizing a deep learning model uses techniques that allow for reduced-precision representations of weights and, optionally, activations for both storage and computation.

TFLite provides several levels of support for quantization:

  1. Post-training quantization
  2. Quantization aware training.

Below is a table that shows the benefits of model quantization for some CNN models. 

Benefits of model quantization for select CNN models.

Post-training quantization:

As the name implies, this is a post-training technique: it is applied after your model is trained. Post-training quantization quantizes the weights and activation types. It can reduce the model size and also improve CPU and hardware-accelerator latency. There are different optimization options, such as weight-only or full integer quantization; we choose based on our requirements.

The TensorFlow documentation provides a decision tree that can help us in making this decision.

Weight Quantization:

The simplest post-training quantization quantizes only the weights from floating point to 8-bit precision. This option is available in the TFLite converter. At inference, weights are converted from 8-bit precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency. To improve latency further, hybrid operators are used.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()

At the time of conversion, set the optimizations flag to optimize for model size.

This optimization provides latencies close to fully fixed-point inference; however, the outputs are still stored using floating point.

Full integer quantization:

We can get further latency improvements, reductions in peak memory usage, and access to integer-only hardware accelerators by making sure all model math is quantized. For full integer quantization, you need to measure the dynamic range of activations and inputs by supplying a representative dataset; you can create one using an input data generator.

import tensorflow as tf

def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    yield [input]

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset_gen
tflite_quant_model = converter.convert()

The result of full integer quantization should be fully quantized; any ops that don't have a quantized implementation are left in FP. Fully integer-only execution gives a model with even lower latency, a smaller size, and compatibility with integer-only accelerators.

You can enforce full integer quantization for all ops and use integer input and output by adding the following lines before you convert.

The converter throws an error if it encounters an operation it cannot currently quantize.

converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

Float 16 Quantization example:

float16 is the IEEE standard for 16-bit floating-point numbers. We can reduce the size of a floating-point model by quantizing the weights to float16. This technique reduces the model size by half with minimal loss of accuracy compared to other techniques. With this technique, the model will "dequantize" the weight values to float32 when running on the CPU.

import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.lite.constants.FLOAT16]
tflite_quant_model = converter.convert()

We have seen different techniques in post-training quantization. Float16 quantization may not be a good choice if you need maximum performance; full integer quantization to fixed-point math would be better in that case. Weight quantization is a very basic quantization. Since weights are quantized post-training, there could be an accuracy loss, particularly for smaller networks.

Tensorflow Lite model accuracy

Quantization aware Training:

There could be an accuracy loss with post-training model quantization; to avoid this, and if you don't want to compromise model accuracy, do quantization-aware training. As we have learned, post-training quantization is applied after the model has been trained. To overcome its drawbacks, we have quantization-aware training. This technique ensures that the forward pass matches precision for both training and inference. TensorFlow created a flow wherein, while constructing the graph, you can insert fake-quantization nodes in each layer, to simulate the effect of quantization in the forward and backward passes and to learn the quantization ranges during training, for each layer separately.

There are two aspects of this technique

  • Operator fusion at inference time is accurately modeled at training time.
  • Quantization effects at inference are modeled at training time.
tf.quantization.quantize(
    input,
    min_range,
    max_range,
    T,
    mode='MIN_COMBINED',
    round_mode='HALF_AWAY_FROM_ZERO',
    name=None)

out[i] = (in[i] - min_range) * range(T) / (max_range - min_range)
if T == qint8: out[i] -= (range(T) + 1) / 2.0

num_discrete_values = 1 << (# of bits in T)
range_adjust = num_discrete_values / (num_discrete_values - 1)
range = (range_max - range_min) * range_adjust
range_scale = num_discrete_values / range
quantized = round(input * range_scale) - round(range_min * range_scale) + numeric_limits<T>::min()
quantized = max(quantized, numeric_limits<T>::min())
quantized = min(quantized, numeric_limits<T>::max())
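
The snippet above shows the low-level tf.quantization.quantize op and its scaling formulas. In current TensorFlow, the quantization-aware training workflow itself is usually packaged in the tensorflow_model_optimization library; here is a minimal sketch (assuming that package is installed, that model is a tf.keras model, and that train_images/train_labels are your training data, all placeholders here):

import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Wrap the Keras model so fake-quantization nodes are inserted into the graph.
q_aware_model = tfmot.quantization.keras.quantize_model(model)

# Fine-tune as usual; the forward pass now simulates int8 inference.
q_aware_model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
q_aware_model.fit(train_images, train_labels, epochs=1)

# Convert to a quantized TFLite model.
converter = tf.lite.TFLiteConverter.from_keras_model(q_aware_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_qat_model = converter.convert()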

Check the complete example here:


Tensorflow Lite Converter Example!!

Let us deploy a deep learning TensorFlow model on edge devices using TF Lite.

There are three different ways we can use the TensorFlow Lite converter:

  1. Convert TF SavedModel to TF Lite
  2. Convert Keras prebuilt model to TF Lite
  3. Concrete function to TF Lite

1. Convert TF SavedModel to TF Lite:-

Let us create a simple model using TensorFlow and save it in the TF SavedModel format. To develop this model we will use the TensorFlow API. In this example, we will show how to convert a SavedModel into a TF Lite FlatBuffer.

# We will train a basic TF model and save it.
import tensorflow as tf
import pathlib

# Construct a basic TF model.
root = tf.train.Checkpoint()
root.v1 = tf.Variable(3.)
root.v2 = tf.Variable(2.)
root.f = tf.function(lambda x: root.v1 * root.v2 * x)

# Save the model into a temp directory.
export_dir = "/tmp/test_saved_model"
input_data = tf.constant(1., shape=[1, 1])
to_save = root.f.get_concrete_function(input_data)
tf.saved_model.save(root, export_dir, to_save)

# Convert the model into TF Lite.
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()

# Save the TF Lite model.
tflite_model_files = pathlib.Path('/tmp/save_model_tflite.tflite')
tflite_model_files.write_bytes(tflite_model)

2. Convert Keras PreBuilt Model to TF Lite:-

In this section, we explore how to convert a prebuilt Keras model into a TF Lite model. We will convert a pre-trained tf.keras MobileNet model to TensorFlow Lite and run inference on it.

import numpy as np
import tensorflow as tf
import pathlib

# Load the MobileNet Keras model.
# We create a tf.keras model by loading a model pretrained on the ImageNet dataset.
model = tf.keras.applications.MobileNetV2(
    weights="imagenet", input_shape=(224, 224, 3))

# Here we have a pretrained model, so there is no need for SavedModel;
# we pass the model directly to TFLiteConverter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# If you want to save the TF Lite model, use the steps below, or else skip.
tflite_model_files = pathlib.Path('/tmp/pretrainedmodel.tflite')
tflite_model_files.write_bytes(tflite_model)

# Load the TFLite model using the interpreter and allocate tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

3. Concrete Function to TF Lite:- 

In order to convert TensorFlow 2.0 models to TensorFlow Lite, the model needs to be exported as a concrete function. If you have developed your model using TF 2.0, then this is for you. We will convert the concrete function into a TF Lite model. In this section we will again use the Keras MobileNet model.

import tensorflow as tf
# load mobilenet model of keras 
model = tf.keras.applications.MobileNetV2(weights="imagenet", input_shape=(224, 224, 3))

We will use tf.function to create a callable TensorFlow graph of our model.

# Get a callable graph from the model.
run_model = tf.function(lambda x: model(x))

# Get the concrete function from the callable graph.
concrete_funct = run_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))

# Convert the concrete function into a TF Lite model using TFLiteConverter.
converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_funct])
tflite_model = converter.convert()

# Save the model.
tflite_model_files = pathlib.Path('/tmp/concretefunc_model.tflite')
tflite_model_files.write_bytes(tflite_model)

CLI TF Lite Converter:-

Apart from the Python API, we can also use the command-line interface to convert a model, for example to convert a SavedModel to a TFLite model.

The TensorFlow Lite Converter has a command-line tool tflite_convert which supports basic models.

#!/usr/bin/env bash
tflite_convert \
  --saved_model_dir=/tmp/mobilenet_saved_model \
  --output_file=/tmp/mobilenet.tflite

  • --output_file. Type: string. Specifies the full path of the output file.
  • --saved_model_dir. Type: string. Specifies the full path to the directory containing the SavedModel generated in 1.X or 2.X.
  • --keras_model_file. Type: string. Specifies the full path of the HDF5 file containing the tf.keras model generated in 1.X or 2.X.

#!/usr/bin/env bash
tflite_convert \
  --keras_model_file=model.h5 \
  --output_file=model.tflite

The converter supports SavedModel directories, tf.keras models, and concrete functions.

For now, we will end with these options only. In the next article we will explore converting RNN models and quantized models.

Tensorflow Lite Model Deployment!

Here you go: Introduction Story of Tensorflow Lite

In the above article, we introduced TensorFlow Lite: what it is, what its purpose is, and what it is not.

In this article, we will dig deeper into the steps involved in TensorFlow Lite model deployment.

The above diagram shows the deployment flow of a TensorFlow Lite model on edge devices.

Let us go through the steps from the top of the diagram.

At a very high level, this diagram breaks down into two functionalities: the first step is the converter, and the second is the interpreter, which runs inference with the model.

  1. Train Model:- 

Train your model using TensorFlow. We can train our model using any high-level TensorFlow API such as Keras, use the low-level API, or start from a legacy TensorFlow model. You can develop your own model or use a TensorFlow built-in model.

If you have a model from another framework, you can also convert it to TensorFlow using ONNX and use it. Once the model is ready, you have to save it. We can save our model in different formats based on the API used, such as HDF5, SavedModel, or FrozenGraphDef.
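
As a minimal sketch of this step (a throwaway tf.keras model purely for illustration; the training call is left as a placeholder):

import tensorflow as tf

# A tiny Keras model, just to show the save formats.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(1)
])
model.compile(optimizer="adam", loss="mse")

# ... model.fit(x_train, y_train) goes here ...

# Save as a SavedModel directory (works directly with the TFLite converter) ...
model.save("/tmp/my_model_saved_model")

# ... or as a single HDF5 file.
model.save("/tmp/my_model.h5")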

2. Convert Model:- 

In this step, we actually use the TensorFlow Lite converter to convert the TensorFlow model into the TensorFlow Lite FlatBuffer format.

FlatBuffers is a special data serialization format that is optimized for performance. The TensorFlow Lite FlatBuffer is also known as the TF Lite model. The TensorFlow Lite converter takes a TensorFlow model and generates a TensorFlow Lite FlatBuffer file (.tflite). The converter supports SavedModel directories, tf.keras models, and concrete functions. Now our TFLite model is ready.

You can convert a model using the Python API or the command-line tool. The CLI supports very basic models.

Python API example:- 

# export_dir is the path where your TF model is saved.
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()

CLI example:

bazel run //tensorflow/lite/python:tflite_convert -- \
  --saved_model_dir=/tmp/mobilenet_saved_model \
  --output_file=/tmp/mobilenet.tflite

3. Deploy Model:-

Now our model is ready and we have the ‘.tflite’ file. We can deploy this to IoT devices, embedded devices, or mobile devices.

4. Run Inference:-

To perform inference with a TensorFlow Lite model, you must run it through an interpreter. A TensorFlow Lite model is served on a device using the interpreter, which provides a broad set of interfaces and supports a wide range of devices. The TensorFlow Lite interpreter is designed to be lean and fast. We run models locally on these devices using the TensorFlow Lite interpreter: once the model is loaded onto a device, such as an embedded, Android, or iOS device, we can take inference.

Model inferencing generally goes through the steps below (a short end-to-end sketch follows the list).

a. Loading a model:- You must load the .tflite model file into memory.

b. Transforming data:- Raw input data generally does not match the input format expected by the model. You need to transform the data.

c. Running inference:- Execute inference over transformed data.

d. Interpreting output:- When you receive results from the model inference, you must interpret the tensors in a meaningful way that’s useful in your application.
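
Putting steps a–d together in Python, a hedged end-to-end sketch for an image-classification .tflite model (the model path, input size, labels list, and image file are placeholders, and Pillow is assumed for image loading) could look like this:

import numpy as np
import tensorflow as tf
from PIL import Image

# a. Load the model into memory.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# b. Transform the raw image into the shape/dtype the model expects.
_, height, width, _ = input_details[0]['shape']
image = Image.open("cat.jpg").resize((width, height))
input_data = np.expand_dims(np.array(image, dtype=np.float32) / 255.0, axis=0)

# c. Run inference.
interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()

# d. Interpret the output probabilities into a human-readable label.
probs = interpreter.get_tensor(output_details[0]['index'])[0]
labels = ["cat", "dog"]  # placeholder label list
print(labels[int(np.argmax(probs))], float(np.max(probs)))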


TensorFlow Lite: machine learning at the edge!!

TensorFlow created a buzz in AI and deep learning forums, and it is the most popular framework in the deep learning community.


As we know, training deep learning models needs compute power, and this is the age of computation. We are now moving toward edge computing alongside cloud computing. Edge computing is the need of today's world because of innovation in the IoT domain, and because compliance and data protection laws are pushing companies to do computation on the edge side; computing the model in the cloud and sending the result back to the client device is now the legacy approach.

As TensorFlow is the most popular deep learning framework, it comes with a lightweight version for edge computation. Nowadays mobile devices have good processing power, but other edge devices have much less.

Run a deep learning model in less than 100 KB.

The official definition of Tensorflow Lite:

“TensorFlow Lite is an open-source deep learning framework for on-device inference.”

Deploy machine learning models on mobile and IoT devices.

TensorFlow Lite is a package of tools to help developers run TensorFlow models on mobile, embedded, and IoT devices. It enables on-device machine learning inference with low latency and a small binary size.

TensorFlow Lite brings machine learning to the edge devices.

Edge computing means computing locally.

Deep Dive:-

This diagram illustrates the standard flow for deploying the model using TensorFlow Lite.

Deploying model using TensorFlow Lite at the edge devices

TensorFlow Lite is not a separate deep learning framework; it provides a set of tools that help developers run TensorFlow models, or any other deep learning models, on mobile, embedded, and IoT devices.


  1. Choose a model or develop your own model.
  2. Convert the model.
  3. Deploy the model.
  4. Run inference with the model.
  5. Optimize the model and repeat the above steps.

TensorFlow Lite consists of two main components:

  1. Converter:- The TensorFlow Lite Converter converts a TensorFlow model into a TensorFlow Lite model.
  2. Interpreter:- It supports a set of core operators that are optimized for on-device applications and has a small binary size. It is basically for inferencing the model (a compressed sketch of both components together follows below).
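
Here is a compressed sketch of both components in one place (the SavedModel path is a placeholder):

import tensorflow as tf

# 1. Converter: turn a SavedModel into a .tflite FlatBuffer.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
tflite_model = converter.convert()

# 2. Interpreter: load the FlatBuffer and run inference on-device.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()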

Why Edge Computing?

Edge computing works best as a complement to cloud computing. Nowadays cloud computing is hugely popular, but there are certain requirements where edge computation beats cloud computing. Why is edge computation important, and what advantages do you derive from it?

  1. Privacy:- No data needs to leave the device; everything stays local.
  2. Latency:- There is no back-and-forth request to a server.
  3. Connectivity:- An Internet connection is not required.
  4. Power Consumption:- Connecting to a network requires power.

TensorFlow Lite is the one-stop solution to convert your deep learning model, deploy it efficiently, and enjoy inferencing. TensorFlow Lite supports both mobile devices and microcontrollers.

Architecture for IoT applications.

Architecture for IoT applications.

Architecture that’s built to heal.

If you are not aware of software architecture, then go through this tutorial: Software architecture. In this tutorial, we go through software architecture, not hardware architecture & electronic device connectivity.

What is IoT?

Before going into what IoT is, let me tell one story that we all know: the story of the blind men & the elephant, which originated in the Indian subcontinent. It is a story of a group of blind men (or men in the dark) who touch an elephant to learn what it is like. Each one feels a different part, but only one part, such as the side or the tusk. They then compare notes and learn that they are in complete disagreement. We did the same thing with IoT.

The internet of things (IoT) is the internetworking of physical devices, vehicles (also referred to as “connected devices” and “smart devices”), buildings and other items — embedded with electronics, software, sensors, actuators, and network connectivity that enable these objects to collect and exchange data.

— wikipedia

If you look at the definition, you could say there is nothing new; it is just a stack of technologies you are already familiar with. So someone says it is sensor programming, someone else says embedded programming, big data, machine learning, map reduce, etc. And then one seeing man comes along and says: this is IoT (this is the elephant).

The term "The Internet of Things" was coined by Kevin Ashton in a presentation to Procter & Gamble in 1999. Ashton is a co-founder of MIT's Auto-ID Lab. He pioneered RFID use in supply-chain management.

Now you are aware of what IoT & software architecture are. Let us combine both things.

How do we define an IoT application architecture?

For defining the architecture, I will not go into the electronics & hardware part; I will keep to the software stack and how we need to arrange the software technologies as part of our dream IoT product.

These are just the blocks as I arranged them. Let us discuss them briefly, from bottom to top.

  1. The bottom & core part of an IoT application is the sensors and electronic devices that are able to connect to things & grab data from them.

  2. The sensor collects the data, but we need to convert it into an understandable format & connect those sensor devices using some protocol, which we configure here in layer two; we also filter the data, i.e. put some threshold on your data for taking a smart decision.

  3. Network connectivity: connect your device with wireless connectivity or a wired internet connection. This connectivity changes based on context & domain.

  4. We can call this layer the security layer, application abstraction layer, or data abstraction layer, where we apply security to our product. This layer's position can change based on the domain & how we want to apply abstraction to our application.

  5. At this stage, we persist our data and logic, and use this data for taking smart decisions or for reporting purposes. This is the important layer, where our actual product & business logic comes into the picture.

  6. This is the presentation layer, or the layer where decisions are taken. Based on the requirement, we can display reports, or apply machine learning or some custom logic, take a smart decision, and send a signal back to the sensors.

  7. This is where our God exists: the user for whom we are designing this whole product. The user interacts with this layer. This is the UI layer.

This is a brief description of how we can arrange the blocks to design our IoT product.

IoT is simply the connectivity of applications, devices, and data.

In a more technical way, we can understand the same architecture as follows.

There are a number of IoT device vendors & service providers; everyone has their own SDK & different protocols, and there are also a number of ways to connect a device to a network.

To be more technical, I have designed a simple IoT architecture using the Java stack. Here is the architecture we are following for our use case.

On the left side are our sensors, the core part of our IoT application.

Defining architecture is an art. There are a number of perspectives that need to be considered while designing an architecture. This is a simple & general perspective for designing an IoT application architecture. See also: Internet of Things for Architects.

This is just an idea. Dig deeper, find more, & let me know as well.

For more stories.

Let's connect on Stack Overflow, LinkedIn, Facebook & Twitter.