Data Engineering

How Apache Kafka Works: A Deep Dive into Distributed Streaming

What is Apache Kafka and Why Does Everyone Use It?

If you've spent any time reading engineering blogs from Uber, Netflix, LinkedIn, or Shopify, you've probably noticed a common thread: they all use Apache Kafka. But what is it, and why has it become the backbone of modern data infrastructure?

Let's start with a simple analogy. Imagine a newspaper printing press:

  • The reporters write stories (producers)
  • The printing press stores and distributes the newspaper (the broker)
  • The readers pick up copies at their own pace (consumers)

The key insight is that the reporters don't hand-deliver stories to each reader. They publish to the press, and readers consume independently. If a reader goes on vacation, they can catch up on old newspapers when they return. No stories are lost.

Apache Kafka works exactly the same way, but instead of newspaper stories, it handles billions of data events — user clicks, financial transactions, IoT sensor readings, application logs, and more.

A brief history

Kafka was originally built at LinkedIn in 2010 to solve a specific problem: LinkedIn had dozens of systems (search, analytics, monitoring) that all needed real-time access to the same data streams. Connecting every system directly to every other system created an unmanageable web of point-to-point integrations. Kafka provided a single, central "highway" for all data to flow through.

It was open-sourced in 2011, became an Apache top-level project in 2012, and today it is used by over 80% of Fortune 100 companies.

Kafka's Core Concepts: Topics, Partitions, Offsets, and Consumer Groups

Topics — Organizing Your Data Streams

A topic is simply a named category or feed. Think of it as a folder in your email inbox. All "order" events go to the orders topic. All "user activity" events go to the user-activity topic.

# Creating a Kafka topic via CLI
kafka-topics.sh --create \
  --topic order-events \
  --partitions 6 \
  --replication-factor 3 \
  --bootstrap-server kafka:9092

Partitions — The Secret to Kafka's Speed

This is where Kafka gets interesting. A single topic like order-events might receive 100,000 messages per second. One server can't handle that alone. So Kafka splits each topic into multiple partitions.

Think of partitions as lanes on a highway. Instead of forcing all traffic through one lane, Kafka distributes messages across multiple lanes (partitions), each running on a different server.

Topic: order-events
├── Partition 0: [msg-0, msg-3, msg-6, msg-9, ...]   → Broker 1
├── Partition 1: [msg-1, msg-4, msg-7, msg-10, ...]  → Broker 2
├── Partition 2: [msg-2, msg-5, msg-8, msg-11, ...]  → Broker 3

Critical rule: Messages within a partition are strictly ordered (msg-0 always comes before msg-3). But there's no ordering guarantee across partitions. If you need strict ordering for a specific entity (like all events for Order #123), you use a message key — Kafka guarantees all messages with the same key land in the same partition.

// All events for the same order go to the same partition
producer.send({
  topic: 'order-events',
  messages: [{
    key: 'ORD-98712',     // Partition = hash(key) % numPartitions
    value: JSON.stringify(event)
  }]
});
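To see why the same key always lands in the same partition, here is a toy illustration of the hash-then-modulo step. Note this simple hash is for illustration only; Kafka's default partitioner actually uses murmur2.

```javascript
// Toy hash for illustration (Kafka's default partitioner uses murmur2)
function toyHash(key) {
  let h = 0;
  for (const ch of key) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit int
  }
  return h;
}

function partitionFor(key, numPartitions) {
  return toyHash(key) % numPartitions;
}

// The same key always maps to the same partition...
const p1 = partitionFor('ORD-98712', 6);
const p2 = partitionFor('ORD-98712', 6);
console.log(p1 === p2); // true — per-key ordering is preserved

// ...but note the mapping depends on numPartitions, which is why
// adding partitions to a live topic breaks existing key ordering.
```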

Offsets — Your Bookmark in the Stream

Every message in a partition gets a unique, auto-incrementing number called an offset. It's like a page number in a book.

Partition 0:
Offset:  0    1    2    3    4    5    6    7
Data:   [A]  [B]  [C]  [D]  [E]  [F]  [G]  [H]
                          ↑
                    Consumer is here (offset 3)

The consumer tracks its own offset. This is fundamentally different from traditional message queues (like RabbitMQ) where the broker tracks what each consumer has read. By shifting responsibility to the consumer, Kafka massively reduces broker overhead and unlocks a superpower: time travel. A consumer can reset its offset to 0 and replay the entire history of a topic. This is incredibly useful for rebuilding search indices, fixing bugs, or reprocessing data with updated logic.
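The consumer-side bookkeeping above can be sketched in a few lines. This is a minimal in-memory model, not the KafkaJS API: the log only ever appends, and the consumer owns its offset, so "time travel" is just resetting a number.

```javascript
// Minimal sketch of a partition log and a consumer-tracked offset
class PartitionLog {
  constructor() { this.messages = []; }
  append(msg) { this.messages.push(msg); } // offset = array index
  read(offset) { return this.messages[offset]; }
}

class Consumer {
  constructor(log) { this.log = log; this.offset = 0; }
  poll() {
    const msg = this.log.read(this.offset);
    if (msg !== undefined) this.offset += 1; // "commit" after processing
    return msg;
  }
  seek(offset) { this.offset = offset; }     // time travel: replay history
}

const log = new PartitionLog();
['A', 'B', 'C'].forEach(m => log.append(m));

const consumer = new Consumer(log);
console.log(consumer.poll()); // 'A'
console.log(consumer.poll()); // 'B'

consumer.seek(0);             // reset to the beginning
console.log(consumer.poll()); // 'A' again — nothing was deleted
```

The broker never needed to know where the consumer was; that is the overhead Kafka avoids.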

Consumer Groups — Scaling Consumers Horizontally

What if one consumer can't keep up with the message rate? Kafka uses consumer groups. Within a group, each partition is assigned to exactly one consumer, allowing parallel processing:

Consumer Group: "inventory-service"
  ├── Consumer A → reads Partition 0, Partition 1
  ├── Consumer B → reads Partition 2, Partition 3
  └── Consumer C → reads Partition 4, Partition 5

If Consumer B crashes, Kafka automatically rebalances and reassigns its partitions to A and C. No messages are lost. When Consumer B comes back, it resumes from its last committed offset.
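The rebalance behavior can be sketched with a simplified round-robin assignment. Kafka actually ships several pluggable strategies (range, round-robin, cooperative-sticky); this toy version only shows the core idea that partitions are redistributed over whoever is alive.

```javascript
// Simplified round-robin partition assignment for a consumer group
// (real Kafka supports range, round-robin, and cooperative-sticky strategies)
function assignPartitions(consumers, numPartitions) {
  const assignment = Object.fromEntries(consumers.map(c => [c, []]));
  for (let p = 0; p < numPartitions; p++) {
    assignment[consumers[p % consumers.length]].push(p);
  }
  return assignment;
}

// Three consumers, six partitions: two partitions each
// A gets [0, 3], B gets [1, 4], C gets [2, 5]
console.log(assignPartitions(['A', 'B', 'C'], 6));

// Consumer B crashes → rebalance: its partitions move to the survivors
// A gets [0, 2, 4], C gets [1, 3, 5]
console.log(assignPartitions(['A', 'C'], 6));
```

One consequence worth remembering: partitions, not consumers, are the unit of parallelism. A seventh consumer in a six-partition group would sit idle.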

Why is Kafka So Fast? (The Engineering Behind the Magic)

Here's the part that surprises most engineers: Kafka stores everything on disk, yet it routinely outperforms in-memory message queues. How?

Secret #1: Sequential I/O

Hard drives are painfully slow at random reads/writes but fast at sequential ones, and even SSDs strongly favor sequential access. Kafka exploits this by treating every partition as an append-only log file. New messages are always written to the end of the file, never in the middle. This turns disk I/O from Kafka's weakness into its greatest strength.

Random I/O:     ~100 seeks/sec on a spinning disk (seek, read, seek, read...)
Sequential I/O: hundreds of MB/sec (just keep reading forward)

For small records, that can amount to a difference of several orders of magnitude.

Secret #2: OS Page Cache

When Kafka writes data to disk, the operating system doesn't actually write to the physical disk immediately. It stores the data in a memory buffer called the Page Cache. When a consumer reads the same data milliseconds later, it comes directly from RAM — not from disk. Kafka essentially gets in-memory speed while maintaining disk durability, without managing its own cache.

Secret #3: Zero-Copy Data Transfer

In a traditional application, sending a file over the network takes 4 steps:

  1. Read from disk → Kernel buffer
  2. Kernel buffer → Application memory (user space)
  3. Application memory → Socket buffer (back to kernel)
  4. Socket buffer → Network Interface Card (NIC)

That's 4 copies and 4 context switches between kernel and user space (two for the read, two for the send). Kafka eliminates steps 2 and 3 using the Linux sendfile() system call:

  1. Read from disk → Kernel buffer
  2. Kernel buffer → NIC (directly!)

This zero-copy optimization reduces CPU usage by up to 65% and is a major reason Kafka can saturate a 10 Gbps network link.

Secret #4: Batching and Compression

Kafka doesn't send messages one-by-one. Producers accumulate messages into batches (configurable by size or time), compress the entire batch using algorithms like LZ4, Snappy, or Zstandard, and send one compressed batch as a single network request. This dramatically reduces individual network calls and network bandwidth usage.

// Producer tuning for throughput. The property names below follow the
// standard Kafka producer configs (as in the Java client and librdkafka);
// check your client library for its exact option names.
const producer = kafka.producer({
  'batch.size': 65536,          // accumulate up to 64 KB per batch
  'linger.ms': 5,               // wait up to 5 ms to fill a batch
  'compression.type': 'lz4',    // compress each batch with LZ4
  'acks': 1                     // wait for the leader's acknowledgment only
});

Kafka vs. RabbitMQ: Which One Should You Pick?

This is the most common question engineers ask. The answer is simple: they solve different problems.

Feature            | Apache Kafka                                      | RabbitMQ
Model              | Distributed log (pub/sub)                         | Message queue (point-to-point)
Message retention  | Configurable (days, weeks, forever)               | Deleted after consumption
Replay capability  | ✅ Yes (reset consumer offset)                     | ❌ No
Throughput         | Millions of msgs/sec                              | Tens of thousands/sec
Ordering           | Per-partition ordering                            | Per-queue ordering
Best for           | Event streaming, data pipelines, log aggregation  | Task queues, RPC, simple pub/sub

Use Kafka when:

  • You need to retain messages for replay (e.g., rebuilding a search index)
  • Multiple independent consumers need to read the same stream
  • You're building real-time analytics, event sourcing, or stream processing pipelines
  • Throughput matters more than latency (Kafka optimizes for throughput)

Use RabbitMQ when:

  • You need complex routing (fanout, topic-based, header-based)
  • You want simple task distribution among workers (e.g., send emails, resize images)
  • Messages should be deleted after successful processing
  • Latency matters more than throughput (RabbitMQ delivers faster per-message)

Real-World Kafka Use Cases

1. Netflix — Billions of Events for Personalization

Netflix uses Kafka to capture every user interaction — what you watched, when you paused, what you searched — and feeds it into their recommendation engine in real-time. This data pipeline processes over 1 trillion events per day.

2. Uber — Real-Time Ride Matching

When you request an Uber, your location is published as an event to Kafka. Driver locations are continuously streamed as events. The matching service consumes both streams and pairs the closest driver to you — all within seconds.

3. Shopify — Black Friday at Scale

During Black Friday 2024, Shopify processed $9.3 billion in sales. Kafka handled the fire hose of order events, inventory updates, and payment confirmations — ensuring no order was lost even at peak traffic of millions of events per second.

4. LinkedIn — The Birthplace of Kafka

LinkedIn processes over 7 trillion messages per day through Kafka. Every profile view, connection request, and job application flows through the Kafka pipeline before reaching the end user.

Getting Started with Kafka: Your First Steps

Kafka has a steep learning curve, but you don't need to boil the ocean. Here's a pragmatic roadmap:

  1. Run Kafka locally using Docker Compose. The Confluent Platform Docker setup gets you running in 5 minutes.
  2. Build a simple producer/consumer using KafkaJS (Node.js), confluent-kafka-python (Python), or Spring Kafka (Java).
  3. Understand partitioning. Experiment with message keys and observe how Kafka distributes data across partitions.
  4. Try Kafka Connect to stream data from your existing PostgreSQL/MySQL database into Kafka topics automatically (Change Data Capture).
  5. Explore Kafka Streams or ksqlDB for real-time stream processing without needing external tools like Apache Flink.

Kafka is not just a message queue — it's a distributed commit log that fundamentally changes how organizations think about data flow. Once you understand its primitives, you'll see opportunities to use it everywhere.

Author

Data Team

Senior Data Engineer

Published:
February 24, 2026

Updated:
February 24, 2026

Frequently asked questions

What is Apache Kafka and why is it widely used?

Apache Kafka is a distributed event streaming platform designed to handle large volumes of real-time data. It allows applications to publish, store, and process streams of records efficiently. Companies use Kafka to build real-time data pipelines, streaming analytics systems, and event-driven architectures that can scale to billions of events per day.

What are Kafka topics, partitions, and offsets?

In Kafka, data is organized into topics, which act as categories for messages. Each topic is divided into partitions that allow data to be distributed across multiple servers for scalability. Every message within a partition has a unique offset, which represents its position in the event log and helps consumers track which messages have already been read.

What are producers and consumers in Kafka?

Producers are applications that publish data to Kafka topics, while consumers read and process that data. Multiple consumers can read the same data independently through consumer groups, allowing systems to scale horizontally and process large data streams efficiently.

Why is Apache Kafka so fast and scalable?

Kafka achieves high performance through sequential disk writes, efficient data replication, and partition-based parallel processing. Instead of constantly updating data like traditional messaging systems, Kafka stores events in append-only logs, enabling extremely fast throughput and reliable event streaming.

What is the difference between Kafka and RabbitMQ?

Kafka is designed for high-throughput event streaming and large-scale data pipelines, while RabbitMQ is primarily used as a traditional message broker for task queues and request-response messaging. Kafka excels in handling massive event streams, whereas RabbitMQ is often preferred for simpler messaging workflows.

What are common real-world use cases of Apache Kafka?

Kafka is commonly used for real-time analytics, log aggregation, event streaming, financial transaction processing, IoT data pipelines, and microservices communication. Many large technology companies use Kafka to power data-driven applications and real-time decision systems.
