Kafka Connect QuickStart


Understanding Kafka Connect

Apache Kafka is a distributed streaming platform. If you are not familiar with Apache Kafka, you can start with the articles below:

  • Kafka Installation guide
  • Kafka Idempotent Consumer
  • Kafka Idempotent Producer

Kafka Connect is an open-source component of the Apache Kafka family. It addresses common integration use cases, such as consuming data from an RDBMS and sinking it into Hadoop. The framework is built on the convention-over-configuration principle.

Kafka Connect is a framework for connecting Kafka with external systems such as databases, key-value stores, search indexes, and file systems.

https://kafka.apache.org/documentation/

Advantages of Using Kafka Connect: 

Below are some benefits of Kafka Connect.

  1. Data-Centric Pipeline
  2. Flexibility and Scalability
  3. Reusability and Extensibility
  4. Distributed and standalone modes
  5. Streaming/batch integration
  6. REST interface for better operations
  7. Automatic offset management

In other words, Kafka Connect is a tool suite for scalably and reliably streaming data between Kafka and these external systems, using so-called connectors.

Kafka Connect generally involves three main components: a source connector, a sink connector, and a Kafka topic.

A source connector collects data from a source system. Source systems can be entire databases, individual tables, or message brokers. A source connector can also collect metrics from application servers into Kafka topics, making the data available for stream processing with low latency.

A sink connector delivers data from Kafka topics into other systems, such as Elasticsearch, Hadoop, or any other database.

Prerequisite: 

  • Java 8, Maven, and basic knowledge of Kafka.
  • Kafka installed and running on your system (version used for this guide: kafka_2.10-0.10.2.1).
  • If you do not have it yet, download Kafka from the Apache Kafka downloads page; a sample fetch is shown below.
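One way to fetch and unpack this exact build is via the Apache archive; a sketch, assuming the standard archive layout (adjust the mirror or version as needed):

wget https://archive.apache.org/dist/kafka/0.10.2.1/kafka_2.10-0.10.2.1.tgz
tar -xzf kafka_2.10-0.10.2.1.tgz
export KAFKA_HOME=$PWD/kafka_2.10-0.10.2.1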

Kafka Connect Example:

In this example, we will use a very simple setup to demonstrate the power of Kafka Connect: the basic file source connector and file sink connector.

  1. Start ZooKeeper (see the command below)
  2. Start Kafka (see the command below)
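Both servers ship with start scripts and default config files; a minimal sketch, run from two separate terminals:

# Terminal 1: start ZooKeeper with the bundled default config
$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties

# Terminal 2: start the Kafka broker
$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties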

Create a Kafka topic for our testing:

$KAFKA_HOME/bin/kafka-topics.sh \
  --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 \
  --partitions 1 \
  --topic my-connect-test
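To confirm the topic exists, list the topics registered in the same ZooKeeper instance:

$KAFKA_HOME/bin/kafka-topics.sh --list --zookeeper localhost:2181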

Kafka Connect currently supports two modes of execution: standalone and distributed.

In standalone mode, all work is performed in a single process. Standalone mode is intended for development and testing.

$ bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties ...]
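For comparison, distributed mode runs workers as a cluster and is started with its own worker config; connectors are then submitted to the workers over the REST interface rather than on the command line:

$ bin/connect-distributed.sh config/connect-distributed.properties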

Source Connector Configuration

Create the source connector configuration file in the Kafka config directory (the run command later assumes this location). For the source connector, the reference configuration is below:

name=local-file-source-connector
connector.class=FileStreamSource
tasks.max=1
topic=my-connect-test
file=test-connector.txt
  • name – Unique name for the connector. Attempting to register again with the same name will fail.
  • connector.class – The Java class (or its short alias) for the connector.
  • tasks.max – The maximum number of tasks that should be created for this connector. The connector may create fewer tasks if it cannot achieve this level of parallelism.
  • topic – The topic to publish data to (source connectors).
  • file – The file to read from (specific to the file connectors).
  • key.converter – (optional) Override the default key converter set by the worker.
  • value.converter – (optional) Override the default value converter set by the worker.
  • topics – A comma-separated list of topics to use as input (sink connectors only).
  • topics.regex – A Java regular expression of topics to use as input (sink connectors only).

Sink Connector Configuration

For the sink connector, the reference configuration is below:

name=local-file-sink-connector
connector.class=FileStreamSink
tasks.max=1
file=test-sink-connector.txt
topics=my-connect-test

Now that the source and sink connectors are ready, let us set up the worker configuration. Note that the two connector files share most of their properties because both are file connectors.

Worker Configuration

For the worker, the reference configuration is below:

bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
offset.flush.interval.ms=20000
plugin.path=/share/java
  • bootstrap.servers – List of Kafka servers used to bootstrap connections to Kafka
  • offset.flush.interval.ms – Interval, in milliseconds, at which Kafka Connect tries committing offsets for tasks.
  • plugin.path – Comma-separated list of paths that contain plugins (connectors, converters, transformations).
  • key.converter – Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the keys in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro.
  • value.converter – Converter class used to convert between Kafka Connect format and the serialized form that is written to Kafka. This controls the format of the values in messages written to or read from Kafka, and since this is independent of connectors it allows any connector to work with any serialization format. Examples of common formats include JSON and Avro.
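To make the schemas.enable settings concrete: with the JsonConverter, the same string value is serialized differently in each mode (the record values below are illustrative):

# schemas.enable=false: the bare JSON value is written
"hello connect"

# schemas.enable=true: the value is wrapped in a schema envelope
{"schema":{"type":"string","optional":false},"payload":"hello connect"}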

The config property below is specific to standalone mode:

  • offset.storage.file.filename – File to store offset data in.
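In distributed mode, by contrast, this state is stored in Kafka topics instead of a local file. A minimal sketch of the corresponding distributed worker properties (the topic names shown are the conventional defaults):

group.id=connect-cluster
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status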

For more configuration options, check the official documentation (linked in the references at the end).

Now everything is ready; we have three configuration files in place:

  1. kafka/config/connect-standalone.properties 
  2. kafka/config/local-file-source-connector.properties
  3. kafka/config/local-file-sink-connector.properties

The content of test-connector.txt:

{
  color: "red",
  value: "#f00"
}
{
  color: "green",
  value: "#0f0"
}
{
  color: "blue",
  value: "#00f"
}
{
  color: "cyan",
  value: "#0ff"
}
{
  color: "magenta",
  value: "#f0f"
}
{
  color: "yellow",
  value: "#ff0"
}
{
  color: "black",
  value: "#000"
}
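One way to seed the file is with a heredoc, shown here for the first entry only; the remaining entries follow the same pattern. Note that the file source connector reads the file line by line, so each line becomes its own record:

cat > test-connector.txt <<'EOF'
{
  color: "red",
  value: "#f00"
}
EOF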

You can append content to the same file later and check that the connector reads it and sinks the data into test-sink-connector.txt.

The source connector will automatically detect the change and publish the new content to Kafka.

Make sure to insert a newline at the end of the file; otherwise, the source connector will not pick up the last line.
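A convenient way to append is with echo, which adds the trailing newline for you (the entry here is just an example):

echo '{ color: "white", value: "#fff" }' >> test-connector.txt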

Now let us run Kafka Connect and watch the pipeline in action:

$KAFKA_HOME/bin/connect-standalone.sh config/connect-standalone.properties config/local-file-source-connector.properties config/local-file-sink-connector.properties

That is a simple, complete implementation of Kafka Connect.
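To verify the pipeline end to end, you can inspect the sink file, read the topic with the console consumer, or query the worker's REST interface (which listens on port 8083 by default):

# Records written by the sink connector
cat test-sink-connector.txt

# Records as they appear on the Kafka topic
$KAFKA_HOME/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic my-connect-test --from-beginning

# Connectors registered with the running worker
curl http://localhost:8083/connectors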

Discover more Kafka connectors here: https://www.confluent.io/hub/

Reference documents:

https://docs.confluent.io/current/connect/quickstart.html

https://kafka.apache.org/documentation/#connect