Idempotent Kafka Consumer

In the last couple of articles, we looked at how Kafka guarantees message processing, and as part of that we covered idempotent Kafka producers. Idempotence is the property I appreciate perhaps most in data engineering, and most of you already have a good understanding of the word.

There are two questions people in the data industry always ask. After all, data is the new oil.

How do we guarantee all messages are processed?

How do we avoid or handle duplicate messages?

In the last article, we saw how producers support "at least once" delivery semantics and how an idempotent producer avoids writing duplicate messages.

Simple message data flow.

Let us jump into the consumer part. Kafka provides a consumer API to pull data from Kafka. When we consume or pull data from Kafka, we need to specify a consumer group. A consumer group is a set of consumers that coordinate to read data from a set of topic partitions. To track its progress, a consumer needs to save the offset of the next message it will read in each topic partition it is assigned to. Consumers are free to store their offsets wherever they want, but by default, and for all Kafka Streams applications, they are stored back in Kafka itself in an internal topic called __consumer_offsets. To use this mechanism, consumers either enable automatic periodic commits of offsets back to Kafka by setting the configuration flag enable.auto.commit to true, or make explicit calls to commit the offsets.

Idempotency guarantees consistent data with no duplicates.
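
To make the consumer group and offset-commit mechanics concrete, here is a minimal sketch of a Java consumer that joins a consumer group and commits its offsets explicitly instead of relying on enable.auto.commit. The broker address, topic name, and group id are placeholders, and the example assumes the standard kafka-clients library (2.0 or later for the Duration-based poll):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class ManualCommitConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");          // placeholder broker
            props.put("group.id", "demo-consumer-group");              // the consumer group
            props.put("enable.auto.commit", "false");                  // we commit offsets ourselves
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("demo-topic")); // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        process(record);                                // your business logic
                    }
                    consumer.commitSync(); // stored in the internal __consumer_offsets topic
                }
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.printf("partition=%d offset=%d value=%s%n",
                    record.partition(), record.offset(), record.value());
        }
    }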

Delivery Semantics of Kafka Consumer:

There are four different delivery semantics, described below.

1. At least once: Offsets are committed after the message is processed. If an exception occurs during processing, the offset is not committed, so the consumer will read the message again from Kafka. This can lead to duplicate message processing, so make sure processing a message twice won't impact your system.

2. At most once: Offsets are committed as soon as the message batch is received. If processing goes wrong, the message is lost (we cannot read it again) because the offset has already been committed. For example:

  1. The producer publishes the message successfully.
  2. The consumer consumes the message successfully and commits the offset.
  3. While processing the message, an exception occurs or the machine shuts down, so we lose that message; we cannot read it again because its offset was already committed.

3. No guarantee: There is no guarantee at all; a given message may be processed once, multiple times, or not at all. A simple scenario where you end up with these semantics is a consumer with enable.auto.commit set to true (the default) that asynchronously processes each batch of messages and saves the results to a database. The frequency of these automatic commits is determined by the configuration parameter auto.commit.interval.ms.

4. Exactly once: Exactly-once semantics was introduced in Kafka 0.11 (2017). It can be achieved for Kafka-to-Kafka workflows using the Kafka Streams API, and it makes your message processing idempotent.

By default, a consumer is effectively at least once: if we don't configure anything about offset commits, the default is auto-commit of the offset.
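
To make the commit placement concrete, here is a small sketch (not a complete application; it assumes enable.auto.commit has been set to false and that the standard kafka-clients library is used) showing how committing before or after processing gives at-most-once or at-least-once behaviour:

    import java.time.Duration;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class CommitPlacement {

        // At-least-once: commit only AFTER the batch is processed.
        // A crash before commitSync() means the batch is re-read, so duplicates are possible.
        static void atLeastOnce(KafkaConsumer<String, String> consumer) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record);
            }
            consumer.commitSync();
        }

        // At-most-once: commit as soon as the batch is received.
        // A crash during processing loses those messages, because they are already committed.
        static void atMostOnce(KafkaConsumer<String, String> consumer) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            consumer.commitSync();
            for (ConsumerRecord<String, String> record : records) {
                process(record);
            }
        }

        private static void process(ConsumerRecord<String, String> record) {
            System.out.println(record.value()); // placeholder for real processing
        }
    }

Leaving enable.auto.commit at its default of true while processing batches asynchronously corresponds to the "no guarantee" case described above.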

Idempotent Consumer:

The Kafka Streams API helps us build idempotent Kafka consumers. Kafka Streams is a library for performing stream transformations on data from Kafka. The exactly-once feature was added to Kafka Streams in the 0.11.0 release. Enabling it is a simple configuration change: set the streaming configuration parameter processing.guarantee to exactly_once (the default is at_least_once). A typical Kafka Streams workload reads data from one or more partitions, performs some data transformations, updates a state store, and then writes results to an output topic. When exactly-once semantics is enabled, Kafka Streams atomically updates consumer offsets, local state stores, state store changelog topics, and the writes to output topics. If any one of these steps fails, all of the changes are rolled back.
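
For example, enabling exactly-once in a Kafka Streams application is a one-line configuration change. A minimal sketch, assuming the kafka-streams library; the application id and broker address are placeholders:

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceStreamsConfig {
        public static Properties build() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "idempotent-stream-app"); // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
            // Switch from the default at_least_once to exactly_once processing.
            // (Newer releases also provide an exactly_once_v2 setting.)
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            return props;
        }
    }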

There is another way to make the consumer idempotent if you are not using the Kafka Streams API.

Suppose you are pulling data from Kafka, doing some processing, and then persisting each record into a database. You can guarantee data consistency by using a unique key, i.e. an "id".

There are two strategies to generate such a unique id:

  1. Kafka-generated id: you can derive a unique id from Kafka itself by concatenating the topic, partition, and offset, like below.

    String id = record.topic() + "-" + record.partition() + "-" + record.offset();

2. Application-specific unique id:

This depends on your application; it changes with the domain and functionality you are dealing with.

e.g.

  1. If you are ingesting a Twitter stream, use the tweet's own id.
  2. For a financial application, use a unique transaction id.
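
Once you have such an id, a sketch of an idempotent write looks like the following: treat the id as a unique key and let the database ignore records that were already persisted. The events table, its columns, and the PostgreSQL-style ON CONFLICT clause are assumptions for illustration; other databases offer equivalent upsert or merge statements:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    public class IdempotentWriter {

        private final Connection connection; // an already-open JDBC connection

        public IdempotentWriter(Connection connection) {
            this.connection = connection;
        }

        // Persists the record at most once, using the Kafka-derived id as a unique key.
        public void persist(ConsumerRecord<String, String> record) throws SQLException {
            String id = record.topic() + "-" + record.partition() + "-" + record.offset();
            // "events(id PRIMARY KEY, payload)" is a hypothetical table; ON CONFLICT DO NOTHING
            // turns the insert into a no-op when the same message was already processed.
            String sql = "INSERT INTO events (id, payload) VALUES (?, ?) ON CONFLICT (id) DO NOTHING";
            try (PreparedStatement stmt = connection.prepareStatement(sql)) {
                stmt.setString(1, id);
                stmt.setString(2, record.value());
                stmt.executeUpdate();
            }
        }
    }

With this in place, re-reading a message after a crash simply results in a skipped insert instead of a duplicate row.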

Pragmatic REST API design guide.

Best way to design REST API.

I have experience both as an API developer and as an API consumer. I was curious whether there are any agreed best practices; there should be some consensus. Google's recently released API design guide is one answer.

Characteristics of a good API design:

  1. Easy to learn.
  2. Easy to consume without documentation.
  3. Secure & scalable.
  4. Easy to extend.
  5. Delivers higher quality at lower cost.
  6. Appropriate to the audience.
  7. High adaptability & productivity.

The Google API design guide is pretty straightforward in its purpose, with a goal of helping “developers design simple, consistent, and easy-to-use networked APIs” — but I thought it was noteworthy that they were also looking to help “converging designs of socket-based RPC APIs with HTTP-based REST APIs.”

“This Design Guide explains how to apply REST principles to API designs independent of programming language, operating system, or network protocol. It is NOT a guide solely to creating REST APIs.”

Most companies think of API design only in the context of REST; this guide will help you think beyond the REST design style.

The best thing about this guide is that it is not only for REST. Many developers are already aware of the API style book, which is a really good guide to designing REST APIs. If you don't know the API style book, go through it and add it to your tool suite.

I love the way the Google guidelines explain how to design a good API.

API should do one thing and do it well.

Don't leak implementation details. Names really matter in API design; they should be largely self-explanatory, and the API should read like prose.
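
As a hedged illustration of names that read like prose (the resource paths and the use of JAX-RS annotations here are my own example, not taken from the guide):

    import java.util.List;
    import javax.ws.rs.GET;
    import javax.ws.rs.POST;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;

    // Resource-oriented, self-explanatory names: nouns for resources, HTTP verbs for actions.
    @Path("/users")
    public interface UserOrdersApi {

        @GET
        @Path("/{userId}/orders")      // "get the orders of a user" reads like prose
        List<String> listOrders(@PathParam("userId") String userId);

        @POST
        @Path("/{userId}/orders")      // "create an order for a user"
        String createOrder(@PathParam("userId") String userId, String orderJson);
    }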

This is my brief take on the Google API design guide. Let me know more from you: how do you design APIs in your organization?


Manage Big Ball of Mud.

Everyone wants to develop a project from scratch; no one wants to manage a big ball of mud. This article is for those who want to

“Leave the campground cleaner than you found it.”

What is a Big Ball of Mud? (Ref: Wikipedia)

A big ball of mud is a software system that lacks a perceivable architecture. A big ball of mud is a haphazardly structured, sprawling, sloppy, duct-tape-and-baling-wire, spaghetti-code jungle.

Ball vs Blocks

Picture the difference between managing a big ball and managing blocks. Carrying a big ball is tedious compared to carrying blocks. When a ball is small, we can easily carry and maintain it; as its size increases, it becomes very difficult to manage. Trying to arrange big balls into some structure is quite difficult, while arranging and structuring blocks is easy and maintainable.

How do these form?

A big ball of mud is an architectural disaster. Such systems have usually been developed over a long period of time, with different individuals working on various pieces, often by people with no formal training in software architecture or design patterns.

There are many reasons:

  1. No formal training in software architecture.
  2. No formal knowledge of design patterns.
  3. Financial or time pressure.
  4. Throwaway code.
  5. Inherent complexity.
  6. Changing requirements.
  7. Changing developers.
  8. Piecemeal growth.

How to manage BBOM?

Sometimes the best solution is simply to rewrite the application to cater to the new requirements, but that is the worst-case scenario.

The more laborious solution is to stop new development and refactor the whole system step by step. To do this:

  1. Write down test cases that capture the current behaviour (see the sketch after this list).
  2. Refactor code.
  3. Redesign & Re-architect the whole system.
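
For step 1, "write down the test cases" usually means characterization tests that pin down what the code does today, so later refactoring can be verified against it. A minimal JUnit 5 sketch; LegacyPriceCalculator and its expected result are hypothetical:

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class LegacyPriceCalculatorTest {

        @Test
        void discountedPriceMatchesCurrentBehaviour() {
            // Hypothetical legacy class; the asserted value is whatever the system returns today.
            LegacyPriceCalculator calculator = new LegacyPriceCalculator();
            double price = calculator.priceWithDiscount(100.0, "GOLD");
            assertEquals(90.0, price, 0.001);
        }
    }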

To overcome the BBOM anti-pattern, you should follow the steps below:

  1. Code review.
  2. Code refactoring over time.
  3. Follow best practices.
  4. Use design patterns & architectural patterns.
  5. If you have time constraints while developing, at least design the code in a modular way (single responsibility principle) so you can easily rearrange it later (see the sketch after this list).
  6. Use TDD (Test-Driven Development).
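
For step 5, a minimal sketch of what "design code in a modular way" can look like: one small class per responsibility instead of a single god class, so the pieces can be rearranged later. The class names are illustrative only:

    // Before: one class that parses, validates, and saves orders.
    // After: one class per responsibility, wired together at the application edge.

    class Order {
        private final String raw;
        Order(String raw) { this.raw = raw; }
        String raw() { return raw; }
    }

    class OrderParser {
        Order parse(String rawJson) { return new Order(rawJson); } // parsing only
    }

    class OrderValidator {
        boolean isValid(Order order) { return order != null && !order.raw().isEmpty(); } // validation only
    }

    class OrderRepository {
        void save(Order order) { /* persistence only, e.g. a JDBC insert */ }
    }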

A big ball of mud isn't just the absence of architecture; it is its own architectural pattern, with its own merits and trade-offs.

I am not saying this will never happen; it has happened many times to many people, including me.

This article gives you a brief idea of how to manage and overcome this issue.

Where there is a will, there is a way.

I have followed this path, and I'd love to hear from you folks.

How are you going to manage your big ball of mud?

For more stories.

Let's connect on Stackoverflow, LinkedIn, Facebook & Twitter.