Open Colab while logged in to your Google (Gmail) account.
Google Colab requires zero configuration, offers free access to GPUs, and, best of all, its notebooks are shareable. Google Colaboratory is a free service that lets developers try TensorFlow on CPU and GPU on Google's cloud instances. It is also a free way to improve your Python programming skills: developers log in with their Google account and connect to the service. Here developers can try deep learning applications using popular machine learning libraries such as Keras, TensorFlow, PyTorch, OpenCV, and others.
Sign in to Google Colab and create a new notebook for our HelloWorld example.
Go to File → New Notebook (Google sign-in is required).
Now the new notebook is ready. We want to use TF 2.0.0 for our example, which has already been released as a production version, so let us install it first. To install TensorFlow 2.0.0, run the following command.
!pip install tensorflow==2.0.0
After a successful installation, we can verify the installed version.
import tensorflow as tf
print(tf.__version__)
Now everything is ready and looks promising. We have installed TensorFlow and verified the version. Now let us take a high-level overview and create a hello world example.
To change the runtime: click Runtime → Change runtime type. A popup opens where you can choose a particular runtime and a hardware accelerator such as GPU or TPU.
There are a lot of changes between TF 1.0 and TF 2.0.0. TF 2.0.0 makes development easier and needs less code. TensorFlow 2.0.0 was developed to remove the issues and complexity of previous versions.
In TF 2.0, eager execution is enabled by default.
Eager execution mode evaluates operations immediately, without building a graph. An operation returns a concrete value instead of constructing a computational graph to execute later.
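To make the difference concrete, here is a toy sketch in plain Python. It is not TensorFlow code: the `Deferred` class and the `eager_add` function are hypothetical names invented for this illustration. `Deferred` only records an operation and computes it when `run()` is called, roughly like a TF 1.x graph evaluated in a session, while `eager_add` returns the concrete value right away, like an op under eager execution.

```python
# Toy illustration only: these names are NOT part of TensorFlow.

class Deferred:
    """Graph-style node: stores the operation, computes nothing yet."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self):
        # Evaluate any deferred inputs first, then this node.
        values = [i.run() if isinstance(i, Deferred) else i for i in self.inputs]
        return self.fn(*values)


def eager_add(a, b):
    """Eager-style op: returns the concrete value immediately."""
    return a + b


# Graph style: building the node does not produce a number.
node = Deferred(lambda a, b: a + b, 2, 3)
graph_result = node.run()  # the value appears only after an explicit run

# Eager style: the value is available as soon as the op is called.
eager_result = eager_add(2, 3)

print(type(node).__name__, graph_result, eager_result)
```

The extra `run()` step plays the role that `Session.run()` played in TF 1.x, which is why removing the session is the main visible change in the hello world example.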
We will use the same Hello world code from TensorFlow 1.x and observe the output.
# This code snippet is from TensorFlow 1.x
import tensorflow as tf
msg = tf.constant('Hello and welcome to Tensorflow world')
sess = tf.Session()
# print the message
print(sess.run(msg))
In this example, we are using TensorFlow 1.x code to print the message, but Session has been removed in TF 2.0.0, so this raises an exception:
AttributeError: module 'tensorflow' has no attribute 'Session'
We will use the same code snippet as above, with the Session removed:
import tensorflow as tf
msg = tf.constant('Hello and welcome to Tensorflow world')
# print the message
print(msg)
# print using tf.print()
tf.print(msg)
Here we have two print statements. Observe the output of each:
tf.Tensor(b'Hello and welcome to Tensorflow world', shape=(), dtype=string)
Hello and welcome to Tensorflow world
This is it for now. We will start exploring the different TF APIs in the next article.
The code is available on GitHub; you can import it directly into Colab and run it.
OpenVINO stands for Open Visual Inference and Neural Network Optimization. OpenVINO is a toolkit provided by Intel to facilitate faster inference of deep learning computer vision models. This toolkit helps developers to create cost-effective and robust computer vision applications.
If you check the diagram carefully, the optimizer contains three steps:
Preparing for inference.
OpenVINO is a toolkit, not a deep learning library for training models. It helps you optimize and serve a model on different devices.
Detailed documentation is available on how this works under the hood, so I won’t go into detail here.
Now our model is ready for inference. The optimizer CLI has converted and optimized the model. The Model Optimizer produces an intermediate representation of the model, which is the input the Inference Engine uses to run inference on the input data.
The Inference Engine is a C++ library with a set of C++ classes to infer input data (images) and get a result. The C++ library provides an API to read the Intermediate Representation, set the input and output formats, and execute the model on devices.
The best thing about the OpenVINO Inference Engine is heterogeneous execution of the model, which is possible because it uses different plug-ins for different devices.
Let us look at a few code snippets with brief descriptions.
# Initialize the class
infer_network = Network()
# Load the network to IE plugin to get shape of input layer
n, c, h, w = infer_network.load_model(args.model, args.device, 1, 1, 2, args.cpu_extension)[1]
The above code is self-explanatory.
We are just initializing the Network class and loading the model using the load_model function. The load_model function returns the plugin along with the input shape. We only need the input shape, which is why we have specified [1] after the function call.
# The exec_net function will start an asynchronous inference request.
We need to pass the request ID and the input frame for inference.
This article explains how TFLite achieves fast, lean inference across all the different types of edge devices.
There are many kinds of edge devices, such as IoT devices, mobile devices, and embedded devices. How does TFLite run inference on them seamlessly and elegantly? To understand this, let us jump in.
What is an interpreter?
As we know, TFLite is a set of tools with two core components: the converter and the interpreter.
The converter helps us convert deep learning models into the TFLite format, and the interpreter makes our life easier during inference.
The TensorFlow Lite interpreter runs specially optimized models on many different hardware types, including mobile phones, embedded Linux devices, and microcontrollers.
People often use the terms TFLite interpreter and inferencing interchangeably. The term inference refers to the process of executing a TensorFlow Lite model on edge devices in order to make predictions based on user input. To perform inference with a TensorFlow Lite model, you must run it through an interpreter.
The TFLite interpreter is designed to be lean and fast. To achieve this, it uses a static graph ordering and a custom memory allocator to ensure minimal load, initialization, and execution latency.
Steps of inferencing:
TensorFlow inference APIs are provided for most common mobile/embedded platforms such as Android, iOS, & Linux, in multiple programming languages. Across all libraries, the TensorFlow Lite API enables you to load models, feed inputs, and retrieve inference outputs.
The TFLite interpreter generally follows the steps below:
1. Loading a model:-
The first and most important step is to load the .tflite model, which contains the execution graph, into memory.
2. Transforming data:-
The model doesn’t understand raw input data. To make raw data compatible with the model, you need to transform it into the format the model expects. For example, for a computer vision model, you need to resize the input image and then provide that image to the model.
3. Running inference:-
Now that the model is in memory and the data is in the required format, let us run inference. It involves a few steps, such as building the interpreter and allocating tensors.
4. Interpreting output:-
After the third step we get the inference output, but the end user won’t understand it. Model results are usually probabilities or approximate values. We have to interpret this result into meaningful output.
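Steps 2 and 4 can be sketched in plain Python without any model at all. This is only an illustration under stated assumptions: the pixel values, the raw scores, and the label list are made-up stand-ins, and the helper names `normalize_pixels`, `softmax`, and `interpret` are hypothetical, not part of the TFLite API. It assumes a model that expects inputs scaled to [0, 1] and returns one raw score per class.

```python
import math


def normalize_pixels(pixels):
    """Step 2 (transforming data): scale 0-255 pixel values into [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]


def softmax(scores):
    """Turn raw model scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def interpret(scores, labels):
    """Step 4 (interpreting output): pick the most likely label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Made-up values standing in for a real image and a real model output.
model_input = normalize_pixels([0, 128, 255])
label, confidence = interpret([1.0, 3.0, 0.5], ["cat", "dog", "bird"])
print(model_input, label, round(confidence, 3))
```

In a real application, steps 1 and 3 sit between these two helpers: the transformed input is handed to the interpreter, and the interpreter's output tensor is what gets passed to the interpretation step.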
Let us run model inference using Python:
import numpy as np
import tensorflow as tf
# Load TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
# Test model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)
# Run inference.
interpreter.invoke()
# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)
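Here is an example in C++. Even though the language or the underlying platform changes, the steps are the same:
// Load the model
std::unique_ptr<tflite::FlatBufferModel> model =
    tflite::FlatBufferModel::BuildFromFile(filename);
// Build the interpreter
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
// Resize input tensors, if desired.
interpreter->AllocateTensors();
float* input = interpreter->typed_input_tensor<float>(0);
// Fill `input`.
interpreter->Invoke();
float* output = interpreter->typed_output_tensor<float>(0);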
In this article, we explored the TFLite interpreter, the steps involved in TFLite inferencing, and how to carry them out.
One of the most popular optimization techniques is called quantization.
Running a machine learning model and making inferences on mobile or embedded devices comes with challenges such as limited memory, power, and data storage, so it is crucial to optimize ML models before deploying them on edge devices.
It’s critical to deploy optimized machine learning models on mobile and embedded devices so that they can run efficiently. There are several optimization techniques, and one of them is quantization. In the last article, we saw how to use the TFLite Converter to optimize the model for edge devices without any modification to weights and activation types.
What is Quantization?
Quantization is generally used in mathematics and digital signal processing. Below is the wiki definition.
Quantization, in mathematics and digital signal processing, is the process of mapping input values from a large set (often a continuous set) to output values in a (countable) smaller set, often with a finite number of elements. Rounding and truncation are typical examples of quantization processes.
Quantization refers to the process of reducing the number of bits that represent a number. In the context of deep learning, the dominant numerical format used for research and deployment has so far been 32-bit floating point, or FP32. Quantization converts FP32 weights and output activations to the nearest 8-bit integer, and sometimes to 4, 2, or 1 bits as well.
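The arithmetic behind mapping FP32 values to 8-bit integers can be sketched in plain Python. This is a minimal sketch assuming the common affine scheme (a `scale` plus a `zero_point`); the function names and the sample weights are hypothetical and chosen for illustration, and real converters apply more refined per-tensor or per-channel rules.

```python
def quantize_params(min_val, max_val, num_bits=8):
    """Compute scale and zero point for a signed integer range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    # The float range must contain 0.0 so that zero maps to an exact integer.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)  # assumes max_val > min_val
    zero_point = int(round(qmin - min_val / scale))
    return scale, zero_point, qmin, qmax


def quantize(x, scale, zero_point, qmin, qmax):
    """Map a float to the nearest integer level, clamped to the valid range."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Map an integer level back to an approximate float."""
    return (q - zero_point) * scale


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # made-up FP32 weights
scale, zp, qmin, qmax = quantize_params(min(weights), max(weights))
quantized = [quantize(w, scale, zp, qmin, qmax) for w in weights]
restored = [dequantize(q, scale, zp) for q in quantized]

print(quantized)
print(restored)
```

Each restored value differs from the original by at most about one quantization step, which is the precision given up in exchange for storing one byte per weight instead of four.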
Quantization optimizes the model by quantizing the weights and activation types. TFLite uses quantization to speed up inference on edge devices. The TFLite converter is the answer to whether we can manage a deep learning model with lower precision. Now that you know what quantization is, let us dive deeper:
Quantization dramatically reduces both the memory requirement and computational cost of using neural networks.
Quantizing a deep learning model uses techniques that allow for reduced-precision representations of weights and, optionally, activations for both storage and computation.
TFLite provides several levels of support for quantization.
Quantization aware training.
Below is a table that shows the benefits of model quantization for some CNN models.
As the name implies, post-training quantization is applied after your model is trained. It is a technique for quantizing weights and activation types. It can reduce the model size while also improving CPU and hardware accelerator latency. There are different optimization options, such as weight quantization and full integer quantization, and we can choose based on our requirements.
The TensorFlow team provides a decision tree that can help us make this decision.
The simplest form of post-training quantization quantizes only the weights from floating point to 8 bits of precision. This option is available with the TFLite converter. At inference, weights are converted from 8 bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency. To improve latency further, use hybrid operators.
import tensorflow as tf
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
converter.optimizations = [tf.lite.Optimize.OPTIMIZE_FOR_SIZE]
tflite_quant_model = converter.convert()
At the time of conversion, set the optimizations flag to optimize for model size.
This optimization provides latencies close to fully fixed-point inference, but the outputs are still stored using floating point.
Full integer quantization:
We can get further latency improvements, reductions in peak memory usage, and access to integer-only hardware accelerators by making sure all model math is quantized. In full integer quantization, you need to measure the dynamic range of activations and inputs by supplying a representative dataset, which you create using an input data generator.
import tensorflow as tf

def representative_dataset_gen():
  for _ in range(num_calibration_steps):
    yield [input]
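The idea of measuring a dynamic range from a representative dataset can be sketched in plain Python. This is a simplified illustration, not what the converter does internally: the generator below yields random numbers standing in for real activations, and `representative_dataset` and `calibrate` are hypothetical names.

```python
import random


def representative_dataset(num_steps, seed=0):
    """Made-up calibration data: yields small lists of float 'activations'."""
    rng = random.Random(seed)
    for _ in range(num_steps):
        yield [rng.uniform(-3.0, 6.0) for _ in range(16)]


def calibrate(dataset):
    """Track the smallest and largest value seen across all calibration samples."""
    lo, hi = float("inf"), float("-inf")
    for sample in dataset:
        lo = min(lo, min(sample))
        hi = max(hi, max(sample))
    return lo, hi


lo, hi = calibrate(representative_dataset(num_steps=100))
# Spread the observed range over the 255 steps between 8-bit integer levels.
scale = (hi - lo) / 255.0
print(f"observed range: [{lo:.3f}, {hi:.3f}], scale: {scale:.5f}")
```

The more closely the calibration samples resemble real inputs, the better the observed range matches what the model sees in production, which is why the generator should yield typical data rather than random noise.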
The resulting model should be fully quantized, but any ops that don’t have a quantized implementation are left in floating point. Full integer-only execution gives a model with even lower latency, smaller size, and compatibility with integer-only accelerators.
You can enforce full integer quantization for all ops and use integer input and output by adding the following lines before you convert.
The converter throws an error if it encounters an operation it cannot currently quantize.
Float16 is the IEEE standard for 16-bit floating-point numbers. We can reduce the size of a floating-point model by quantizing the weights to float16. This technique reduces the model size by half with minimal loss of accuracy compared to other techniques. The model will “dequantize” the weight values to float32 when running on the CPU.
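The halving in size and the small loss in precision can be seen with Python's standard `struct` module, which supports the IEEE half-precision format through the `e` format character. This is a minimal sketch with made-up weight values; the helper names are hypothetical and it only mimics the storage aspect of float16 quantization.

```python
import struct


def to_float32_bytes(values):
    """Pack floats as IEEE 754 single precision (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


def to_float16_bytes(values):
    """Pack floats as IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)


def from_float16_bytes(data):
    """'Dequantize': read half-precision values back as Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))


weights = [0.1, -0.5, 3.14159, 1e-3]  # made-up weight values
single = to_float32_bytes(weights)
half = to_float16_bytes(weights)
restored = from_float16_bytes(half)

print(len(single), len(half))  # 16 bytes versus 8 bytes
print(restored)
```

Values such as -0.5 survive the round trip exactly, while others come back slightly rounded, which is the "minimal loss of accuracy" traded for the smaller file.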
We have seen different techniques in post-training quantization. Float16 quantization may not be a good choice if you need maximum performance; full integer quantization to fixed-point math would be better in that case. Weight quantization is the most basic form of quantization. Since weights are quantized post-training, there could be an accuracy loss, particularly for smaller networks.
Post-training quantization can cause an accuracy loss. If you don’t want to compromise model accuracy, use quantization-aware training. As we have learned, post-training quantization is applied after the model has been trained; quantization-aware training was created to overcome its drawbacks. This technique ensures that the forward pass matches precision for both training and inference. While constructing the graph, TensorFlow lets you insert fake quantization nodes in each layer to simulate the effect of quantization in the forward and backward passes and to learn ranges during training, for each layer separately.
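What a fake quantization node does in the forward pass can be sketched in plain Python. This is an illustration under assumptions, not TensorFlow's implementation: `fake_quant` and `dense_forward` are hypothetical names, the weight range is fixed by hand rather than learned, and the backward pass is not modelled.

```python
def fake_quant(x, min_val, max_val, num_bits=8):
    """Quantize then immediately dequantize: the result is still a float,
    but it can only take one of 2**num_bits values inside [min_val, max_val]."""
    levels = 2 ** num_bits - 1
    scale = (max_val - min_val) / levels
    clamped = max(min_val, min(max_val, x))
    step = round((clamped - min_val) / scale)
    return min_val + step * scale


def dense_forward(inputs, weights, w_range=(-1.0, 1.0)):
    """Forward pass of a single neuron whose weights carry the same
    rounding error they would have after quantization."""
    q_weights = [fake_quant(w, *w_range) for w in weights]
    return sum(i * w for i, w in zip(inputs, q_weights))


inputs = [0.5, -1.0, 2.0]
weights = [0.3, 0.7, -0.2]  # made-up weights
exact = sum(i * w for i, w in zip(inputs, weights))
simulated = dense_forward(inputs, weights)
print(exact, simulated)
```

Because the loss is computed from the slightly perturbed output, training can adjust the weights to values that still work well after rounding, which is the motivation for simulating quantization during training.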
There are two aspects of this technique
Operator fusion at inference time is accurately modeled at training time.
Quantization effects at inference are modeled at training time.
Let us deploy a deep learning TensorFlow model on edge devices using TF Lite.
There are three different ways we can use the TensorFlow Lite converter:
Convert TF SavedModel to TF Lite
Convert Keras PreBuilt Model to TF Lite
Concrete Function to TF Lite
1. Convert TF SavedModel to TF Lite:-
Let us create a simple model using TensorFlow and save it in the SavedModel format. To develop this model we will use the TensorFlow API. In this example, we will show how to convert a SavedModel into a TF Lite FlatBuffer.
# Save the model into a temp directory.
# `root` is the trackable object whose function `f` we export.
export_dir = "/tmp/test_saved_model"
input_data = tf.constant(1., shape=[1, 1])
to_save = root.f.get_concrete_function(input_data)
tf.saved_model.save(root, export_dir, to_save)
# Convert the model into TF Lite.
converter = tf.lite.TFLiteConverter.from_saved_model(export_dir)
tflite_model = converter.convert()
# Save the TF Lite model to disk.
import pathlib
tflite_model_file = pathlib.Path('/tmp/save_model_tflite.tflite')
tflite_model_file.write_bytes(tflite_model)
2. Convert Keras PreBuilt Model to TF Lite:-
In this section, we explore how to convert a prebuilt Keras model into a TF Lite model. We will convert a pre-trained tf.keras MobileNet model to TensorFlow Lite and run inference on it.
import numpy as np
import tensorflow as tf
# Load the MobileNet Keras model.
# We create a tf.keras model by loading weights pretrained on the ImageNet dataset.
model = tf.keras.applications.MobileNetV2(weights="imagenet", input_shape=(224, 224, 3))
# The model is already pretrained, so there is no need to use SavedModel.
# We pass the model directly to TFLiteConverter.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
# Load TFLite model using interpreter and allocate tensors.
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
3. Concrete Function to TF Lite:-
In order to convert TensorFlow 2.0 models to TensorFlow Lite, the model needs to be exported as a concrete function. If you have developed your model using TF 2.0, then this is for you. We will convert a concrete function into a TF Lite model. In this section we will again use the Keras MobileNet model.
import tensorflow as tf
# Load the Keras MobileNet model.
model = tf.keras.applications.MobileNetV2(weights="imagenet", input_shape=(224, 224, 3))
We will use tf.function to create a callable TensorFlow graph of our model.
#get callable graph from model.
run_model = tf.function(lambda x: model(x))
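# to get the concrete function from callable graph
concrete_func = run_model.get_concrete_function(
    tf.TensorSpec(model.inputs[0].shape, model.inputs[0].dtype))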
In the previous article, we introduced TensorFlow Lite: what it is, what its purpose is, and what it is not.
In this article, we will dig deeper into the steps involved in TensorFlow Lite model deployment.
The above diagram shows the deployment flow of a TensorFlow Lite model on edge devices.
Let us go through the steps from the top of the diagram.
At a very high level, this diagram breaks down into two pieces of functionality: the first is the converter, and the second is the interpreter, which runs inference on the model.
1. Train Model:-
Train your model using TensorFlow. You can train the model using a high-level TensorFlow API such as Keras or a low-level API, or you may have a legacy TensorFlow model. You can develop your own model or use a TensorFlow built-in model.
If you have a model from another framework, you can also convert it to TensorFlow using ONNX and use it. Once the model is ready, you have to save it. We can save the model in different formats depending on the API, such as HDF5, SavedModel, or FrozenGraphDef.
2. Convert Model:-
In this step, we use the TensorFlow Lite converter to convert the TensorFlow model into the TensorFlow Lite FlatBuffer format.
FlatBuffers is a data serialization format optimized for performance. A TensorFlow Lite FlatBuffer is also known as a TF Lite model. The TensorFlow Lite converter takes a TensorFlow model and generates a TensorFlow Lite FlatBuffer file (.tflite). The converter supports SavedModel directories, tf.keras models, and concrete functions. Now our TFLite model is ready.
bazel run //tensorflow/lite/python:tflite_convert -- \
  --saved_model_dir=/tmp/mobilenet_saved_model \
  --output_file=/tmp/mobilenet.tflite
3. Deploy Model:-
Now our model is ready and we have a ‘.tflite’ file. We can deploy this to IoT devices, embedded devices, or mobile devices.
4. Deploy model:-
To perform inference with a TensorFlow Lite model, you must run it through an interpreter. The TensorFlow Lite model is served on a device using the interpreter, which provides a wide range of interfaces and supports a wide range of devices. The TensorFlow Lite interpreter is designed to be lean and fast. We can run models locally on devices such as embedded devices, Android, or iOS devices using the interpreter. Once the model is loaded onto the device, we can run inference.
Model inference generally goes through the steps below.
a. Loading a model:- You must load the .tflite model file into memory.
b. Transforming data:- Raw input data generally does not match the input data format expected by the model. You need to transform the data.
c. Running inference:- Execute inference over transformed data.
d. Interpreting output:- When you receive results from the model inference, you must interpret the tensors in a meaningful way that’s useful in your application.
Train a deep neural network for free using Google Colaboratory.
GPU and TPU compute for free? Are you kidding?
Google Colaboratory is a free Jupyter notebook environment that requires no setup and runs entirely in the cloud.
With Colaboratory you can write and execute code, save and share your analyses, and access powerful computing resources, all for free from your browser. If you don’t have the money to procure a GPU and want to train a neural network or get your hands dirty with zero investment, then this is for you. Colab is a Google internal research tool for data science.
You can use GPU as a backend for free for 12 hours at a time.
It supports Python 2.7 and 3.6, but not R or Scala yet.
Many people want to train a machine learning or deep learning model, but doing so requires GPU computation and huge resources, which blocks many people from trying these things out and getting their hands dirty.
Google Colab is nothing but a cloud-hosted Jupyter notebook.
Colaboratory is a free Jupyter notebook environment provided by Google where you can use free GPUs and TPUs, which solves all these issues. The best thing about Colab is TPUs (tensor processing units), special hardware designed by Google to process tensors.
To get started, you should know Jupyter notebooks and have a Google account.
Click on the above link to access Google Colaboratory. This is not a static page but an interactive environment that lets you write and execute code in Python and other languages. You can create a new Jupyter notebook via File → New Python 3 notebook, or by clicking New Python 3 Notebook or New Python 2 Notebook.
We will create a Python 3 notebook, and Colab will save it on Google Drive for us.
Colab is an ideal way to start everything from improving your Python coding skills to working with deep learning frameworks like PyTorch, Keras, and TensorFlow. You can install any Python package your code requires, from simple sklearn and numpy to TensorFlow.
You can create notebooks in Colab, upload existing notebooks, store notebooks, share notebooks with anyone, mount your Google Drive and use whatever you’ve got stored in there, import most of your directories, upload notebooks directly from GitHub, upload Kaggle files, download your notebooks, and do whatever you are doing with your local Jupyter notebook.
On the top right you can choose to connect to hosted runtime or connect to local runtime.
Set up GPU or TPU:-
It’s as simple and straightforward as going to the “Runtime” dropdown menu, selecting “Change runtime type”, and selecting GPU/TPU in the hardware accelerator drop-down menu!
Now you can start coding and executing your code!
How to install a framework or libraries?
It’s as simple as writing an import statement in Python!
!pip install fastai
Use the normal pip install command to install different packages like TensorFlow or PyTorch and start playing with them.