Kafka Interview Questions: A Comprehensive Guide for Distributed Streaming

Apache Kafka has become the industry standard for event streaming and real-time data processing. Whether you’re interviewing for a Kafka engineer role, a data platform engineer position, or a backend developer slot at a company running Kafka at scale, you need to understand not just the mechanics of the platform but the architectural thinking behind it. This guide covers the questions you’re likely to face, the answers interviewers expect, and the depth of understanding that separates competent engineers from those who truly grasp distributed streaming.

Kafka interviews test several layers of knowledge. At the surface level, they assess whether you understand core concepts like topics, partitions, and consumer groups. At the intermediate level, they probe your hands-on experience with producers, consumers, and the tradeoffs between different configuration approaches. At the senior level, they explore your ability to design systems, troubleshoot production issues, and make architectural decisions across replication, performance, and security. This guide addresses all three levels, providing not just answers but the reasoning and context that demonstrates true expertise.

The questions in this guide span theoretical foundations, practical implementation, operational concerns, and architectural patterns. Many interview questions test whether you can articulate tradeoffs. For example, asking about acks settings isn’t just about knowing the three options; it’s about understanding when and why you’d choose each one based on your application’s requirements. Strong candidates can discuss these tradeoffs clearly and suggest how to measure the impact of different choices.

Core Kafka Concepts and Architecture

What is Apache Kafka and what problems does it solve?

Apache Kafka is a distributed event streaming platform designed for building real-time data pipelines and streaming applications. It’s built on top of a publish-subscribe model but extends it to handle massive scale, high throughput, and fault tolerance. The core problems it solves include decoupling producers from consumers so they can evolve independently, buffering bursts of events without losing data, replaying historical events for new subscribers, and enabling multiple consumers to process the same stream independently.

At the simplest level, Kafka acts as a messaging system. But unlike traditional message brokers where a queue is consumed once and then discarded, Kafka retains messages for a configurable period and allows multiple independent consumers to replay them. This shift from “queue semantics” to “log semantics” is fundamental to understanding why Kafka became so popular for event sourcing, change data capture, and analytical data pipelines. Traditional message queues like RabbitMQ or ActiveMQ are designed for task distribution: you send a task to a queue, a worker picks it up and processes it, and the message is deleted. Kafka is designed differently: every event is appended to a log, persisted durably, and available for any number of subscribers to read independently.

The ability to replay events is profound. If you deploy a new analytics consumer six months after events started flowing, you can rewind that consumer to the beginning of time and process all events from the start. This is impossible with a traditional queue. This capability enables patterns like event sourcing (storing the complete history of state changes), rebuilding projections, and onboarding new consumers without coordination.

Explain topics, partitions, and how they relate to scalability.

A topic is a named stream of events. When you send a message to Kafka, you specify a topic. Multiple producers can write to the same topic, and multiple consumers can read from it. Topics alone wouldn’t provide much scalability though. To achieve horizontal scalability, each topic is divided into partitions. A partition is an ordered, immutable sequence of messages assigned a unique offset.

The key insight is that messages within a partition maintain order and are stored durably on disk. If you have a topic with 10 partitions, Kafka can distribute those partitions across multiple broker nodes. Each partition can be processed independently, meaning you can parallelize consumption. If you have 10 consumers in a consumer group and 10 partitions, each consumer can own one partition and process messages in parallel. This is how Kafka achieves both scalability and ordering guarantees. A single partition can be handled by a single machine, so throughput per partition is bounded by one machine’s capacity. But by adding more partitions, you add more machines, and total cluster throughput scales linearly.

However, ordering is only guaranteed within a partition. If you send messages with keys to the same topic, Kafka uses a hash function on the key to determine which partition each message goes to. All messages with the same key always land in the same partition, preserving order for that key’s stream. If you don’t provide a key, messages are distributed round-robin across partitions, and ordering across the entire topic is not guaranteed. Understanding this is critical: many engineers assume Kafka preserves global ordering until they hit a production incident where events arrive out of sequence.
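To make this concrete, here is a minimal producer sketch (broker address, topic name, and serializers are illustrative placeholders) showing that records sharing a key are routed to the same partition, while a keyless record is spread by the default partitioner:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class KeyedSendExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records use the key "order-42", so both hash to the same partition
            // and are read back in the order they were written.
            producer.send(new ProducerRecord<>("orders", "order-42", "OrderCreated"));
            producer.send(new ProducerRecord<>("orders", "order-42", "PaymentProcessed"));
            // A record with no key is spread across partitions by the default partitioner.
            producer.send(new ProducerRecord<>("orders", null, "Heartbeat"));
        }
    }
}
```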

What are offsets and how does Kafka track consumer position?

An offset is a monotonically increasing integer that uniquely identifies a message’s position within a partition. The first message in a partition has offset 0, the second has offset 1, and so on. Offsets are not assigned by the consumer; they’re assigned by Kafka when the message is written to the log. This is simple but powerful: offsets provide a deterministic way to refer to any point in a partition’s history.

Kafka tracks where a consumer has read up to by storing an offset commit. When a consumer processes a message and commits, it’s saying “I have successfully processed all messages up to and including offset X in this partition.” The next time that consumer starts (or if another consumer in the same group starts), it reads from offset X+1. This mechanism allows consumers to recover from failures without reprocessing or losing messages. If a consumer crashes after processing offset 5000 but before committing, on restart it will resume from the last committed offset, typically something like 4990, and will reprocess a few messages. This is acceptable in most systems.

Offset commits are stored in a special internal topic called __consumer_offsets. This is a compacted topic (log compaction keeps only the latest offset per partition per consumer group) that acts as the source of truth for consumer position. When you configure a consumer with group.id, it automatically manages offset commits. Modern Kafka clients commit offsets to the broker by default, but you can customize this behavior. You can commit offsets automatically after every poll, commit them manually after processing, or disable auto-commit entirely and manage them yourself. The choice depends on your tolerance for message duplication or loss during consumer restarts.

A subtle but important point: offset commits can be made synchronously or asynchronously. When you call commitSync(), the consumer sends the commit request to the broker and waits for a response, retrying retriable failures and throwing if the commit ultimately cannot be completed. With commitAsync(), the consumer doesn’t wait; it sends the commit and continues, reporting the outcome through an optional callback. This is faster but riskier if the application crashes before the broker acknowledges. Most production systems use commitSync() after a batch of messages is successfully processed.

What is a consumer group and how does partition assignment work within it?

A consumer group is a set of consumers that together consume messages from a topic. All consumers in the group share the same group.id. Kafka automatically assigns partitions to consumers in the group such that each partition is assigned to at most one consumer. This ensures that messages from a partition are processed sequentially and not duplicated across consumers in the group. This is a critical guarantee: if your consumer group has 3 instances, Kafka ensures that each partition goes to exactly one instance, not multiple.

When a consumer joins a group, Kafka triggers a rebalance. During rebalancing, Kafka pauses all consumption, revokes the current partition assignments, computes new assignments based on the group’s current members and the available partitions, and then resumes consumption. If you have a topic with 10 partitions and 3 consumers in a group, Kafka will assign approximately 3 to 4 partitions to each consumer. If a 4th consumer joins, rebalancing occurs and partitions are redistributed. The pause during rebalancing can be significant in high-throughput systems. A 30-second rebalance on a system processing 1 million messages per second means you lose 30 million message processing opportunities.

Multiple consumer groups can consume from the same topic independently. Each group maintains its own offset position. This is powerful because it means you can have one group processing events for analytics, another group processing the same events for business logic, and a third archiving them, all without interfering with each other. Offsets are per partition per group; group A’s offset for partition 0 is stored separately from group B’s offset for partition 0.

Describe the role of brokers in a Kafka cluster.

A broker is a Kafka server that stores partitions and handles client requests. A typical Kafka cluster consists of multiple brokers, often three or more for high availability. Each partition is replicated across multiple brokers. One broker is designated as the leader for a partition, and the others are replicas. All reads and writes for a partition go through the leader. This design concentrates all state changes in one place (the leader) for a partition, simplifying consistency logic. Followers don’t process client requests; they just replicate the log.

Brokers also maintain metadata about topics, partitions, and consumer groups. This metadata is used for discovery, routing requests to the correct broker, and managing consumer group assignments. In Kafka 3.0 and later, the KRaft mode removes the dependency on ZooKeeper for metadata management and makes brokers self-managing. Previously, ZooKeeper was a separate system that stored metadata, which introduced operational complexity and a separate failure domain.

Brokers persist data to disk using a log-based storage format. Each partition’s log is divided into segments, and within segments, messages are stored sequentially. This design allows Kafka to achieve both high throughput (sequential I/O is fast) and durability (messages are persisted before acknowledging to the producer). Segments are typically 1 GB each (configurable). Old segments are deleted when they exceed retention time or the log exceeds retention size. Segment boundaries are important for operations like log compaction, which processes one segment at a time.

What is replication and how does it ensure fault tolerance?

Replication means each partition’s log is copied across multiple brokers. You configure replication.factor to control how many copies exist. A replication.factor of 3 means each partition’s data is stored on 3 brokers. If one broker fails, the partition is still available on the other two, and no data is lost. Replication is the core mechanism for Kafka’s high availability. Without it, a single broker failure would cause data loss.

Replication introduces the concept of in-sync replicas (ISR). The ISR is the subset of replicas that are considered “in sync” with the leader. For a replica to remain in the ISR, it must keep up with the leader within a certain time window (controlled by replica.lag.time.max.ms, 30 seconds by default in recent Kafka versions). If a replica falls behind and doesn’t catch up in time, it’s removed from the ISR. The ISR can shrink when brokers are slow or offline.

The min.insync.replicas configuration is crucial for durability guarantees. If a producer sets acks=all, the message is considered successfully written only after all min.insync.replicas have acknowledged it. If min.insync.replicas=2 and replication.factor=3, then losing one broker doesn’t prevent writes; the leader and one other replica can acknowledge. But if you set min.insync.replicas=3, then losing even one broker will pause writes because only 2 replicas remain (leader plus 1 other), which is less than the required 3. Setting min.insync.replicas high ensures no data loss at the cost of availability during broker failures.
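As a sketch of how these settings fit together, the AdminClient snippet below (topic name, partition count, and broker address are placeholders) creates a topic with replication.factor=3 and min.insync.replicas=2, the combination discussed above:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3)    // 6 partitions, replication factor 3
                    .configs(Map.of("min.insync.replicas", "2"));      // with acks=all, every write lands on >= 2 brokers
            admin.createTopics(List.of(topic)).all().get();            // block until the controller confirms
        }
    }
}
```

Producers writing to this topic with acks=all then receive an acknowledgment only once at least two brokers hold the message.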

Explain in-sync replicas (ISR) and leader election.

The in-sync replica set is the set of replica brokers that are sufficiently up-to-date with the leader. Kafka measures “up-to-date” by checking how recently the replica has fetched messages. If a replica’s last fetch was within replica.lag.time.max.ms (30 seconds by default in recent versions), it stays in the ISR. If a replica falls behind (e.g., because the broker is overloaded and can’t keep up), Kafka removes it from the ISR. Removing a replica from the ISR immediately affects durability: if min.insync.replicas=2 and your 3-replica partition ISR shrinks to 2 (because one replica fell too far behind), you’re at the minimum threshold. If another broker fails, you can no longer achieve min.insync.replicas and writes pause.

When a partition’s leader fails, Kafka must elect a new leader from the ISR. The controller (a special broker role) orchestrates this election and picks an in-sync replica as the new leader. This ensures the new leader has all the committed messages that were acknowledged. If the ISR becomes empty (all replicas are too far behind), you have a critical situation: Kafka can be configured to either elect an out-of-sync replica as leader (risking data loss because that replica may be missing recently committed messages) or refuse to elect a leader and keep the partition offline (sacrificing availability). The choice is controlled by unclean.leader.election.enable.

What is log compaction and when would you use it?

Log compaction is a retention policy that removes older messages from a topic while keeping the latest message for each key. Instead of deleting all messages older than 7 days (time-based retention), log compaction keeps the most recent message for each unique key indefinitely. This is useful for changelog topics where you care about the current state of an entity, not its complete history.

For example, if you have a “users” topic with messages like “user_id=5 name=Alice, user_id=5 name=Alice Smith, user_id=5 name=Alice S.”, log compaction would keep only the latest one: “user_id=5 name=Alice S.”. This makes the topic a compacted changelog representing the current state of all users. New consumers that subscribe to the topic will receive all the latest states, effectively getting a snapshot of the current database. This is valuable for rebuilding state: instead of querying a database, you can consume the compacted topic.

Log compaction requires messages to have keys. Kafka periodically runs a cleaner process on compacted topics. There’s a delay between when a message is written and when it’s eligible for compaction, controlled by min.compaction.lag.ms and min.cleanable.dirty.ratio. Compaction is performed by background log-cleaner threads on the broker and is I/O- and CPU-intensive, so in practice you size the number of cleaner threads and throttle cleaner I/O so it doesn’t compete with live traffic.
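A compacted topic is simply a topic created with cleanup.policy=compact. A possible CLI invocation (topic name and values are illustrative) might look like:

```bash
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic user-profiles \
  --partitions 6 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.5 \
  --config min.compaction.lag.ms=60000
```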

Describe message retention policies and their tradeoffs.

Kafka supports multiple retention policies. Time-based retention (retention.ms) keeps messages for a specified duration. Size-based retention (retention.bytes) keeps messages until the partition size exceeds a threshold. You can combine both; Kafka deletes old segments once either condition is met. There’s also log compaction, which we covered. Retention is configured per topic, with topic-level settings overriding the broker-wide defaults.

Longer retention means more disk space required but more flexibility for consumers to lag and catch up. If you set retention.ms to 7 days but a consumer is down for 8 days, the messages it hasn’t read yet are deleted; depending on auto.offset.reset, it either restarts from the earliest offset still on disk or jumps to the latest offset, skipping everything it missed. Shorter retention reduces costs but increases operational risk. A common pattern is to set retention to 7 days for hot data, then archive older messages to S3 for cold storage analysis.

For long-term archival, you typically offload old messages to cheaper storage like S3 using tools like Kafka Connect with HDFS or S3 sinks. This lets you keep short retention on the cluster for performance while maintaining a full historical record elsewhere. The tradeoff is operational complexity: you must manage the archival process, ensure data integrity, and provide a way to rehydrate data from cold storage if needed.

Compare ZooKeeper-based Kafka with KRaft mode.

Kafka originally relied on Apache ZooKeeper for cluster metadata management. ZooKeeper maintained metadata about brokers, topics, partitions, and consumer groups. This created a dependency: you needed a separate ZooKeeper cluster for each Kafka cluster. ZooKeeper has its own failure modes and operational complexity: quorum management, leader election, session timeouts. Operating Kafka meant operating two separate distributed systems.

KRaft mode (Kafka Raft consensus) replaces ZooKeeper by making Kafka itself manage metadata using the Raft consensus protocol. Some brokers take on controller roles and replicate metadata among themselves. This simplifies deployment: you only manage Kafka brokers, not an external ZooKeeper ensemble. KRaft brokers can also serve client requests, so metadata is collocated with data.

KRaft is now the recommended approach for new clusters (GA as of Kafka 3.3). Legacy clusters using ZooKeeper still work, and the migration path involves running both systems in parallel temporarily. KRaft enables faster metadata propagation (less latency in topic creation, partition assignment), simpler operational procedures, and potential performance improvements.

What message ordering guarantees does Kafka provide?

Kafka provides strong ordering guarantees at the partition level: messages within a single partition are ordered by offset, and if a consumer processes them sequentially, it sees them in the order they were written. Additionally, if you use message keys, all messages with the same key go to the same partition, so ordering is preserved across multiple messages with the same key. Within a single partition, messages are totally ordered. This is one of Kafka’s most powerful guarantees.

However, there’s no global ordering across partitions. If you have a topic with 5 partitions, messages from different partitions can be consumed in any relative order. If you need global ordering across all messages, you must use a single partition. But this kills parallelism and scalability: a single partition is bounded by what one broker and one consumer can sustain, whereas 10 partitions spread across 10 brokers can deliver roughly 10 times that throughput.

For most applications, ordering per key is sufficient. For example, in an e-commerce system, all events for a specific order ID can have that order ID as the key, ensuring they’re processed in the order they occurred. Events for different orders can be processed in parallel. This is called per-key ordering and is the sweet spot for most systems: it preserves causality for entities while allowing horizontal scaling.

Producers in Depth

Explain the acks parameter and its impact on durability and latency.

The acks parameter controls how many brokers must acknowledge a message before the producer considers it successfully written. Three settings exist. acks=0 means the producer doesn’t wait for any acknowledgment. It sends the message and immediately continues. This has minimal latency but highest risk of data loss. If the broker crashes immediately after receiving the message but before persisting it, the message is lost. Some producers don’t even wait for the message to be sent; they fire and forget.

acks=1 (the historical default; since Kafka 3.0 the Java producer defaults to acks=all with idempotence enabled) means the leader broker acknowledges after writing the message to its local log. The producer waits for this acknowledgment. If the leader crashes immediately after acknowledging but before replicating to followers, the message is lost; the replicas don’t have it yet. This provides a middle ground between latency and safety and is a reasonable balance for many use cases. Latency is typically 1-10 milliseconds (just the time for the broker to write to its log and send back an acknowledgment).

acks=all means the leader waits for acknowledgments from all in-sync replicas before acknowledging to the producer. This guarantees that if the acknowledge is received, the message is durable even if the leader crashes immediately. Multiple replicas have the data, so data loss would require simultaneous failure of all replicas. However, it introduces latency because the producer must wait for multiple replicas. Latency is typically 10-50 milliseconds or more, depending on replication lag and network latency.

The actual durability guarantee also depends on min.insync.replicas. If you set acks=all and min.insync.replicas=1, you’re only waiting for the leader, which defeats the purpose. But if min.insync.replicas=2 and replication.factor=3, the leader needs acknowledgments from at least one additional replica. This ensures data is on at least 2 brokers.
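A durability-oriented producer configuration might look like the following properties sketch (values are illustrative, not prescriptive):

```properties
# producer.properties -- durability-leaning example
bootstrap.servers=localhost:9092
acks=all                    # wait for all in-sync replicas
enable.idempotence=true     # retries cannot introduce duplicates
retries=2147483647          # keep retrying transient failures
delivery.timeout.ms=120000  # overall cap on send plus retries
```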

What are idempotent producers and exactly-once semantics?

An idempotent producer guarantees that a message is written exactly once, even if the producer retries. Normally, if a producer sends a message and doesn’t receive an acknowledgment within the timeout, it retries. But the broker may have actually received the first message; the acknowledgment was just lost or delayed. Without idempotency, the second attempt writes a duplicate, creating two copies of the same message.

With idempotence enabled (enable.idempotence=true), the producer attaches a producer ID and sequence number to each message. If the broker receives the same message twice (same producer ID and sequence number), it deduplicates and returns the cached acknowledgment. This ensures that in the face of retries, the broker writes the message exactly once. Idempotence is transparent to the producer; the client library handles it automatically.

However, idempotence only works within a single producer session. If the producer restarts, it gets a new producer ID, and sequence numbers reset. For true exactly-once semantics across producer restarts and failures, you need transactions. Idempotence is cheap and essential for modern Kafka; the Java client enables it by default since Kafka 3.0.

Describe Kafka producer transactions and their use cases.

Kafka transactions allow a producer to write messages atomically to multiple partitions. Either all messages are written or none are. This is useful when you need to write to multiple topics or partitions in a coordinated way without leaving the system in an inconsistent state. For example, if you’re processing an order, you might want to write an OrderCreated event to one topic and an OrderMetrics event to another. Transactions ensure both are written or neither.

To use transactions, you set transactional.id to a unique producer identifier. You call beginTransaction(), write messages, and then call commitTransaction(). If anything fails between begin and commit, the transaction is aborted, and the messages aren’t visible to consumers. Consumers with isolation.level=read_committed won’t see uncommitted messages, so they only see transactionally consistent snapshots.

Transactions are slower because they require coordinator communication and multiple round-trips. They’re essential for systems that need strong consistency, like financial systems or event sourcing where you need to update multiple topics atomically. For high-throughput systems where some eventual consistency is acceptable, transactions might be overkill. Transactions add overhead (typically 10-50% latency increase), so use them only when necessary.
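A minimal transactional producer sketch (topic names and the transactional.id are assumptions) might look like this; note the initTransactions() call before the first transaction:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class TransactionalSendExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("transactional.id", "order-service-1");    // must stay stable across restarts of this instance
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();                      // register with the transaction coordinator
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("orders", "order-42", "OrderCreated"));
                producer.send(new ProducerRecord<>("order-metrics", "order-42", "count=1"));
                producer.commitTransaction();                 // both records become visible atomically
            } catch (Exception e) {
                producer.abortTransaction();                  // neither record is exposed to read_committed consumers
                throw e;
            }
        }
    }
}
```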

How do producer batching and linger.ms affect throughput and latency?

By default, the Kafka producer batches messages. Instead of sending each message individually, it accumulates messages in memory up to a certain size (batch.size) or time (linger.ms), then sends them all at once. Batching reduces network round-trips and CPU overhead, dramatically improving throughput. If you send 1000 messages without batching, that’s 1000 network round-trips. With batching, it might be 10 round-trips if each batch contains 100 messages.

batch.size is the maximum bytes per batch. If messages arrive quickly, the producer fills up the batch and sends it immediately. linger.ms is the maximum time to wait for a batch to fill. If messages arrive slowly, the producer waits up to linger.ms before sending a partial batch. Setting a higher linger.ms increases batching, improving throughput but increasing latency. Setting linger.ms=100 means every message experiences up to 100 milliseconds of latency while waiting for a batch to fill. Setting linger.ms=0 disables lingering; messages are sent immediately.

In production, typical settings are batch.size=32 KB or 64 KB and linger.ms=10 to 100 milliseconds. This provides good batching without excessive latency. For ultra-low latency systems, you might reduce linger.ms to 0, sending messages immediately, though this severely hurts throughput. For maximum throughput, you increase both values and accept higher latency. The choice depends on your application’s requirements.
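A throughput-leaning configuration might look like this sketch (the exact values depend on message size and latency budget):

```properties
# producer.properties -- throughput-leaning example
batch.size=65536        # up to 64 KB per partition batch
linger.ms=20            # wait up to 20 ms for batches to fill
compression.type=lz4    # compress each batch before sending
buffer.memory=67108864  # 64 MB of buffered, unsent records
```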

What compression codecs are available and when should you use each?

Kafka supports compression.type settings: none (no compression), gzip (good compression ratio around 10:1, higher CPU usage), snappy (lower compression ratio around 4:1, very low CPU overhead), lz4 (good balance around 6:1), and zstd (best compression around 12:1 but requires Kafka 2.1+). Compression is applied at the batch level, so each batch of messages is compressed as a unit.

gzip is useful when you’re optimizing for storage and network bandwidth and can tolerate higher CPU usage. Decompressing gzip is CPU-intensive. snappy is popular in real-time systems because it’s fast and imposes minimal CPU overhead. Decompressing a snappy batch takes microseconds. lz4 is a good general-purpose choice, balancing compression and speed. zstd provides the best compression for modern clusters with sufficient CPU resources.

The tradeoff is CPU cycles (compressing/decompressing) versus I/O bandwidth (smaller batches over the network and on disk). In compute-constrained systems (where CPU is the bottleneck), use snappy or none. In bandwidth-constrained systems (where network or disk bandwidth is the bottleneck), use gzip or zstd. Monitor compression ratio and latency impact when choosing.

Explain partitioner strategies and how to implement custom partitioners.

The partitioner determines which partition a message goes to. The default partitioner hashes the message’s key using MurmurHash2, which distributes keys fairly evenly across partitions, and maps the hash to a partition. If you don’t provide a key, recent clients use a sticky partitioner that fills a batch for one partition before moving to the next (older versions used round-robin or a random partition), so keyless messages are spread across partitions without any ordering guarantee.

You can implement a custom partitioner by implementing the Partitioner interface and overriding the partition() method. For example, if you have geographic data and want all messages from a specific region to go to the same partition, you’d extract the region from the message and map it deterministically to a partition. Custom partitioners are useful for maintaining ordering guarantees (all events for an entity to the same partition) or for load balancing (distributing based on some dimension other than a simple hash).

However, be careful: if your partition logic creates skew (many messages to a few partitions), you create bottlenecks. If 80 percent of your messages have a key that hashes to partition 0, that partition will be overloaded while others are idle. This is a common mistake. Always validate that your partitioning strategy distributes messages relatively evenly across partitions.
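A sketch of such a custom partitioner follows (the "region:entity" key format is an assumption for illustration):

```java
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import java.util.Map;

/** Routes records to a partition derived from a region prefix in the key, e.g. "eu:order-42". */
public class RegionPartitioner implements Partitioner {

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (key == null) {
            return 0;   // illustrative fallback; real code might use a sticky or round-robin scheme
        }
        String region = key.toString().split(":", 2)[0];
        // Deterministic mapping: the same region always lands on the same partition.
        return Math.floorMod(region.hashCode(), numPartitions);
    }

    @Override
    public void close() {}

    @Override
    public void configure(Map<String, ?> configs) {}
}
```

You would register it on the producer via partitioner.class and verify, with real traffic, that the resulting distribution isn’t skewed.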

How do you handle serialization and handle producer errors?

Kafka transports byte arrays. A serializer converts Java objects to bytes. Common serializers include StringSerializer for strings, ByteArraySerializer for byte arrays, and various JSON serializers. You can also implement a custom Serializer interface. The choice of serializer affects both performance (serialization speed) and storage (message size).

A good practice is to use Avro or Protobuf serialization with a schema registry. This decouples producers and consumers from tight coupling to a specific data format. If the message format changes, the schema registry can handle versioning and evolution. For example, if you add a new optional field to your message schema, old consumers can still read new messages (they ignore the new field), and new consumers can read old messages (they get a default value for the new field).

For error handling, the producer callback interface allows you to define a function called on success or failure. Producers also support retries (retries parameter, default 2147483647) and exponential backoff (retry.backoff.ms, default 100 milliseconds). However, not all errors are retryable. Authorization errors, for example, won’t succeed on retry; they’ll always fail. Your callback should log or alert on failures, and your application should decide whether to drop, queue, or propagate the error.
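A small sketch of a send with a completion callback (broker address, topic, and payload are placeholders):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.Properties;

public class CallbackExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("payments", "user-5", "{\"amount\": 10}"),
                (RecordMetadata metadata, Exception exception) -> {
                    if (exception != null) {
                        // Retriable errors were already retried by the client; this is the final failure.
                        System.err.println("Send failed: " + exception);
                    } else {
                        System.out.printf("Written to %s-%d at offset %d%n",
                                metadata.topic(), metadata.partition(), metadata.offset());
                    }
                });
        }
    }
}
```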

Consumers in Depth

Explain the consumer poll loop and backpressure handling.

The consumer API centers around the poll() method. You call poll(timeout) to fetch messages. The consumer maintains network connections to brokers, tracks offsets, and manages session state. Each poll() call returns a set of records (usually 0 to hundreds, depending on fetch.min.bytes and fetch.max.wait.ms). The consumer internally handles broker discovery, metadata refresh, and partition assignment.

You process the returned records in your application, then call poll() again. If you take too long between polls and exceed max.poll.interval.ms (default 5 minutes), the consumer is considered failed: the group coordinator removes it from the group and triggers a rebalance. Heartbeats are sent from a background thread, so session.timeout.ms (default 45 seconds) catches crashed or unreachable consumers rather than slow processing. This is a common cause of rebalancing storms: consumers process messages slowly, exceed max.poll.interval.ms, get kicked out, rejoin, and repeat.

Backpressure handling means processing messages at a rate your application can sustain. If you call poll() but can’t process the returned records before the next poll, you’ve created a backlog. You have a few options: add consumers up to the partition count (or add partitions if you’re already at the limit) to increase parallelism, reduce max.poll.records so each poll returns a smaller batch you can finish in time, or optimize the processing logic itself.
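A minimal poll loop with manual commits after each processed batch might look like this (group ID, topic, and the max.poll.records value are illustrative):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class PollLoopExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("enable.auto.commit", "false");   // commit only after records are actually processed
        props.put("max.poll.records", "200");       // keep each batch small enough to finish within max.poll.interval.ms
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);                // application logic; must stay well under max.poll.interval.ms per batch
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();          // at-least-once: commit only after the whole batch succeeded
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.printf("%s [%d] offset %d: %s%n",
                record.topic(), record.partition(), record.offset(), record.value());
    }
}
```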

What is a group coordinator and how does rebalancing work?

The group coordinator is a broker that’s responsible for a specific consumer group. It manages group membership, partition assignments, and offsets. When a consumer joins a group, the coordinator detects it (via a heartbeat), triggers a rebalance, and assigns partitions. The coordinator is elected from the brokers; for a given group, one broker is the coordinator, and it stores group metadata on that broker.

Rebalancing happens when a consumer joins, leaves, or the partition count changes. During rebalancing, the coordinator pauses all consumers, revokes their current partition assignments, computes new assignments, and resumes consumption. The time consumed by rebalancing can be noticeable in applications with high throughput. A 30-second rebalance stops all processing for 30 seconds. The length of rebalancing depends on the assignment strategy and the number of partitions.

The assignment strategy determines how partitions are distributed. Strategies include RangeAssignor (assigns contiguous partition ranges, which can create imbalance if partition counts don’t divide evenly), RoundRobinAssignor (distributes round-robin, generally more balanced), StickyAssignor (minimizes partition movement, reducing rebalancing overhead), and CooperativeStickyAssignor (a recent addition that allows consumers to maintain some partitions while new ones are assigned, reducing the pause window). The choice of strategy affects rebalancing time and the overhead of changing assignments.

Compare eager and cooperative rebalancing.

Eager rebalancing revokes all partitions from all consumers before assigning new ones. This ensures a clean slate but causes a complete pause in message processing during rebalancing. If rebalancing takes 10 seconds and you have 100 consumers joining, you could lose 1000 consumer-seconds of processing. For a high-throughput system, this is significant.

Cooperative rebalancing (introduced in Kafka 2.4) assigns partitions incrementally. During rebalancing, some consumers retain their partitions while others are revoked. This minimizes the set of partitions that go through a revoke-assign cycle. Only when necessary (e.g., a new consumer joins and needs partitions) are partitions moved. The pause window is much smaller, often just a few seconds.

To enable cooperative rebalancing, use cooperative-sticky as the partition.assignment.strategy. It requires consumers to support cooperative rebalancing, which is true for all modern Kafka clients. The tradeoff is complexity; eager is simpler but slower.
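Enabling it is roughly a one-line consumer setting:

```properties
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
```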

Explain offset commits, auto-commit vs manual, and when to use each.

Offset commits record that a consumer has successfully processed messages up to a certain offset. The consumer stores this commit in the __consumer_offsets topic. On restart, it reads its last committed offset and resumes from there. Offset management is critical for reliability: if you commit too early (before processing), you risk losing messages if the consumer crashes. If you commit too late, you risk reprocessing messages.

Auto-commit (enable.auto.commit=true) commits offsets automatically at an interval (auto.commit.interval.ms, default 5000 milliseconds). This is simple but risky: if your consumer crashes after polling messages but before processing them, on restart it resumes from the last commit and skips the unprocessed messages. Auto-commit is convenient but not suitable for systems where message loss is unacceptable.

Manual commits (enable.auto.commit=false) give you control. You process messages and call commitSync() or commitAsync() when you’re confident they’re safely processed. This prevents message loss but requires careful error handling. If your processing fails, you don’t commit, and on restart you reprocess the messages. commitSync() is safer (it waits for the broker to acknowledge), whereas commitAsync() is faster but riskier.

A hybrid approach is to disable auto-commit and manually commit after each batch of messages is successfully processed. Alternatively, use commitAsync() with a callback to handle commit failures gracefully. Choose based on your reliability requirements and application design.

What are at-least-once and exactly-once consumption semantics?

At-least-once means a message is guaranteed to be processed, but might be processed multiple times. This happens if you process a message successfully but crash before committing the offset. On restart, you reprocess the message. At-least-once is the default behavior and is acceptable for idempotent operations (operations that produce the same result whether executed once or multiple times). For example, if you’re incrementing a counter, processing the same message twice increments it twice, which is wrong. But if you’re updating a cache with a key-value pair, processing the same message twice results in the same final state.

Exactly-once means a message is processed exactly once, never duplicated. This is harder to achieve and usually requires either transactions or deduplication logic in your consumer. One approach is to set isolation.level=read_committed and use producer transactions to ensure you only see transactionally committed messages. Another approach is to write an idempotency key to your state (e.g., a database) so that reprocessing the same message is idempotent. You store the message ID with the result; if you see the same message ID again, you skip processing.
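The sketch below illustrates the deduplication idea with an in-memory set; a production version would keep the processed-ID check and the side effect in the same durable store (for example, the same database transaction):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: a real system would persist processed IDs alongside the results. */
public class DedupingHandler {
    private final Set<String> processedIds = new HashSet<>();

    public void handle(ConsumerRecord<String, String> record) {
        // topic-partition-offset is unique per record and stays the same on redelivery.
        String messageId = record.topic() + "-" + record.partition() + "-" + record.offset();
        if (!processedIds.add(messageId)) {
            return;                        // already handled: redelivery becomes a no-op
        }
        apply(record.value());             // the actual side effect (DB write, API call, ...)
    }

    private void apply(String value) {
        System.out.println("Applying " + value);
    }
}
```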

Most systems operate in an at-least-once model and make their processing idempotent. True exactly-once is expensive and often unnecessary if your applications can tolerate occasional duplicates. The choice depends on your use case: financial transactions demand exactly-once, but click-stream analytics can tolerate duplicates.

How do you measure and address consumer lag?

Consumer lag is the difference between the consumer’s committed offset and the producer’s latest offset. If the producer is at offset 1000000 and the consumer is at offset 999900, the lag is 100. Lag indicates how far behind the consumer is. A lag of 0 means the consumer is caught up and processing messages in real-time. A lag of 100000 means the consumer is about 100 seconds behind (assuming 1000 messages per second throughput).

You can monitor lag by comparing each group’s committed offsets against the partitions’ log-end offsets; kafka-consumer-groups.sh --describe shows this per partition, and tools like Burrow (LinkedIn’s Kafka monitoring tool) track it continuously. Lag of 0 means the consumer is caught up. Increasing lag means consumers are falling behind producers. Alert if lag exceeds a threshold (e.g., the equivalent of 5 minutes of traffic). This indicates a problem: either producers are writing faster than the consumer can process, or the consumer is experiencing issues.

To reduce lag, you can increase the number of consumers (up to the partition count) to parallelize processing and increase overall throughput. You can optimize your processing logic to be faster or increase the fetch sizes to reduce the number of round-trips. In some cases, you might sacrifice throughput for latency or vice versa depending on your application’s needs.
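Lag can also be computed programmatically. A sketch using the AdminClient (group name and broker address are assumptions):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the group, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("order-processors")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```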

Describe partition assignment strategies and their implications.

RangeAssignor divides partitions into ranges and assigns ranges to consumers. For a topic with 6 partitions and 2 consumers, consumer 1 gets partitions 0 to 2 and consumer 2 gets 3 to 5. This is simple but can create imbalance if partition counts don’t divide evenly. With 10 partitions and 3 consumers, consumer 1 gets 4, consumer 2 gets 3, consumer 3 gets 3. RangeAssignor is the default for backward compatibility but is rarely the best choice.

RoundRobinAssignor distributes partitions round-robin across consumers. For 6 partitions and 2 consumers, consumer 1 gets partitions 0, 2, 4 and consumer 2 gets 1, 3, 5. This is more balanced but can create more partition movement during rebalancing if consumers join or leave. StickyAssignor minimizes partition movement by sticking to the previous assignment when possible. CooperativeStickyAssignor does the same but using cooperative rebalancing, reducing pause time.

Custom assignment strategies can be implemented by extending PartitionAssignor if you need application-specific logic. For example, you might assign partitions to consumers based on geographic location or compute capacity.

Kafka Streams Framework

What are KStream, KTable, and GlobalKTable, and how do they differ?

Kafka Streams provides abstractions over Kafka topics. A KStream represents an unbounded stream of events. Each record is independent, and you can process each record individually. KStreams are stateless by default, though you can add state through operations like aggregate(). A KStream represents every message, including duplicates. If the same key appears 10 times, you get 10 records in the stream.

A KTable represents a changelog stream (a compacted topic). It’s conceptually a table where each message’s key is the primary key and the latest message’s value is the current row. When you aggregate a KStream, you often produce a KTable. KTables are useful for joining streams with the latest state of entities. A KTable effectively deduplicates by key; if you have 10 messages with the same key, the KTable represents only the latest value.

A GlobalKTable is a fully replicated KTable. Every instance of your application materializes the entire table in its local state store (RocksDB on disk). This is useful for reference data that’s frequently joined against but doesn’t change often. Instead of shipping your stream partitions across the network for a join, you look up in the local GlobalKTable. The tradeoff is local storage and load time: the full table is copied to every instance, so if your reference data is 10 GB, every instance needs 10 GB of local disk and enough page cache to keep lookups fast.
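A small Streams topology tying these together might look like the sketch below (topic names and the application ID are placeholders): it reads an event stream as a KStream and derives a per-key count as a KTable.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

public class OrderCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");   // every event, duplicates included
        KTable<String, Long> countsPerKey = orders
                .groupByKey()
                .count();                                             // KTable: latest count per key

        countsPerKey.toStream()
                .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```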

Explain stateless vs stateful operations in Streams.

Stateless operations like map(), filter(), and branch() transform or filter records without maintaining any state. Each record is processed independently. These are low-latency and horizontally scalable. A map() that converts each message is stateless; its output depends only on the input. A filter() that drops messages above a certain size is stateless.

Stateful operations like aggregate(), count(), and reduce() maintain state across records. When you aggregate a stream (e.g., sum prices per product ID), you’re accumulating state in a state store (RocksDB by default). This state must be distributed across the application instances, and each instance is responsible for its partition’s state. State stores are local to each instance; if you have 3 instances consuming 10 partitions, each instance has state for about 3-4 partitions.

Stateful operations introduce complexity. You must manage state store backups (usually in a changelog topic in Kafka), handle interactive queries (looking up state from a non-lead instance), and manage state restoration after failures. However, they enable powerful stream processing. Without stateful operations, many real-time analytics use cases are impossible.

Describe windowing types: tumbling, hopping, and session windows.

Tumbling windows divide time into non-overlapping buckets. A 1-minute tumbling window processes all messages from 00:00 to 00:59 together, then 01:00 to 01:59, and so on. You’d use this for aggregations like “count messages per minute.” Each message belongs to exactly one tumbling window. Tumbling windows are useful for periodic computations: “how many orders per minute?”, “average latency per 5 seconds?”

Hopping windows are overlapping. A 1-minute hopping window with 30-second hop size creates windows [00:00 to 00:59], [00:30 to 01:29], [01:00 to 01:59], etc. Each message belongs to multiple windows. This is useful when you want overlapping aggregations, like “count messages per minute, updated every 30 seconds.” Hopping windows provide a moving average effect.

Session windows group messages by activity. A session window closes when a gap of inactivity exceeds the configured inactivity gap (plus any grace period). If messages arrive at 00:00, 00:05, and 00:10 with a 20-second inactivity gap configured, they’re in one session. If the next message arrives at 00:35, it starts a new session. Sessions are useful for user session analysis in click streams: “How long did a user stay on the site? How many clicks per session?”
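Building on the topology sketch above, a tumbling-window count could look roughly like this (TimeWindows.ofSizeWithNoGrace assumes a recent Streams version; older clients use TimeWindows.of):

```java
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import java.time.Duration;

// Reusing the "orders" KStream from the earlier sketch:
KTable<Windowed<String>, Long> ordersPerMinute = orders
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))  // tumbling: advance == size
        .count();
```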

How does Kafka Streams guarantee exactly-once processing?

Kafka Streams processes in transactions. For each batch of input messages processed, it writes results atomically to output topics and commits the input offset. If the application crashes, the uncommitted transaction is rolled back, and on restart, the same input batch is reprocessed. This ensures that each input message is processed exactly once (each input message contributes exactly once to the output), even across crashes.

To enable exactly-once, set processing.guarantee=exactly_once_v2 (v2 is newer and more efficient than the original exactly_once). The application uses transactions internally; you don’t explicitly manage them as a user. Exactly-once in Streams relies on idempotent producers and transactional writes.

Exactly-once doesn’t mean the output is guaranteed to be unique globally; it means each input message is processed exactly once, and outputs are derived deterministically from inputs. If your output includes deduplication keys, downstream systems can deduplicate by key. Exactly-once is slower than at-least-once due to transaction overhead.

What are interactive queries and processor API in Kafka Streams?

Interactive queries allow you to query the state of a running Kafka Streams application. Instead of externalizing state to a database, you can query state stores directly from the application instance. This is useful for building dashboards or serving the current state of aggregations. For example, if you’re computing the current balance of each account, you can query the application to get an account’s balance without hitting a database.

To use interactive queries, you name your state stores and retrieve them from the running KafkaStreams instance via its store() method (passing StoreQueryParameters). Be aware that state stores are distributed; for a given key, you might need to query the instance that owns that key’s partition. The application exposes metadata about which instance owns which partitions, so you can route requests correctly.

The Processor API is a lower-level API below KStream or KTable. It gives you fine-grained control over state stores, scheduling, and record processing. Most applications use KStream or KTable, but the Processor API is available for advanced use cases where you need to schedule periodic actions or maintain complex state logic.

Kafka Connect Framework

Explain source and sink connectors and their differences.

A source connector reads data from an external system and writes it to Kafka. An example is a JDBC source connector that polls a database and publishes each row as a message to Kafka. The connector handles polling, change detection, serialization, and offset management. Source connectors are useful for capturing data into Kafka from various systems like databases, APIs, and logs.

A sink connector reads from Kafka topics and writes to an external system. An example is a JDBC sink connector that reads messages and inserts or updates rows in a database. The connector handles deserialization, batching, and commit coordination. Sink connectors are useful for exporting data from Kafka to other systems for analysis, storage, or integration.

Source connectors are useful for capturing data into Kafka. Sink connectors are useful for exporting data from Kafka to other systems. Together, they form the integration layer between Kafka and the broader data ecosystem. Kafka Connect is much simpler than writing custom producers and consumers: connectors are declarative, configurable without code, and handle many edge cases.

Compare standalone and distributed mode for Kafka Connect workers.

Standalone mode runs a single worker process. You configure connectors in a properties file, start the worker, and it processes them. Standalone is simple for development and small deployments but doesn’t provide scalability or fault tolerance. If the worker crashes, all connectors stop. Standalone is often used for testing or one-off integrations.

Distributed mode runs multiple worker processes. You submit connectors to the cluster through its REST API. Workers coordinate through Kafka’s group protocol to balance connectors and tasks across the cluster. If a worker dies, another picks up its work. Distributed mode is necessary for production deployments with multiple connectors or high throughput. Workers are effectively stateless; connector configs, offsets, and status live in internal Kafka topics, so workers can be added or removed dynamically.

What is the converter chain and how do you customize serialization?

The converter chain describes how data moves through a connector. For a source connector, records flow from the external system through the key and value converters into Kafka; for a sink connector, from Kafka back through the converters to the external system. The converters handle serialization and deserialization. Key and value converters are independent; you might use StringConverter for keys and AvroConverter for values.

Common converters include JsonConverter (serializes to JSON), StringConverter (for plain strings), and AvroConverter (for Avro format with schema registry). Each converter has configuration options; for example, JsonConverter can be configured to include a schema.

To customize, you implement the Converter interface. You might create a converter that compresses, encrypts, or applies domain-specific logic. This is powerful for integrations that require special handling.

Describe single message transforms (SMTs) and error handling with dead letter queues.

SMTs are lightweight transformations applied as messages flow through a connector. Examples include extracting a field, renaming a field, routing based on a value, or casting types. SMTs are chained; multiple transforms apply sequentially. SMTs are configured declaratively, making simple transformations easy without custom code.

SMTs are useful for data cleaning and routing without writing code. For example, a JDBC source connector can use an SMT to extract the primary key and send it as the message key, improving parallelism downstream. SMTs are less powerful than full stream processing but useful for simple cases.

Error handling in connectors uses dead letter queues (DLQs). If a message causes an error in processing (e.g., a sink connector fails to insert a row), it can be sent to a DLQ topic instead of failing the connector. Your monitoring system can then alert on DLQ messages. To enable, set errors.deadletterqueue.topic.name in the connector config.
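For a sink connector, the error-handling settings might look like this JSON fragment submitted to the Connect REST API (the connector class and connection details are assumptions; DLQ settings apply to sink connectors):

```json
{
  "name": "orders-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "orders",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "connection.user": "connect",
    "connection.password": "secret",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "orders-dlq",
    "errors.deadletterqueue.context.headers.enable": "true"
  }
}
```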

Give an example of using JDBC connectors for database integration.

A JDBC source connector can be configured to poll a table every 60 seconds, read new rows (detected by a timestamp or incrementing ID column), and publish them to a topic. The connector manages offset tracking internally, so it doesn’t re-read the same rows. This is useful for CDC: keeping Kafka in sync with database changes.

A JDBC sink connector subscribed to the topic receives messages and batches them for insertion into a target database table. It handles batching and connection pooling. If a message conflicts with an existing row (duplicate key), the connector can be configured to upsert (update-insert) using a primary key. This is useful for synchronizing data across databases.

This is a common pattern for CDC and data synchronization between databases. Instead of implementing custom code, you configure connectors.
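A possible source-side configuration (assuming the Confluent JDBC connector plugin; column names and connection details are placeholders) might look like:

```json
{
  "name": "users-jdbc-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db:5432/shop",
    "connection.user": "connect",
    "connection.password": "secret",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "topic.prefix": "db-",
    "poll.interval.ms": "60000"
  }
}
```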

Performance and Tuning

How do you optimize producer throughput and latency?

Producer throughput is maximized by batching. Increase batch.size (e.g., 64 KB) and linger.ms (e.g., 100 milliseconds) to accumulate more messages per batch. Larger batches mean fewer round-trips to brokers, improving throughput. Parallelism also helps; increase max.in.flight.requests.per.connection to allow multiple batches in flight simultaneously. This increases memory usage but improves throughput.

Latency is reduced by decreasing linger.ms and reducing batching. For ultra-low latency, set linger.ms=0 and batch.size=1, though this severely hurts throughput. Most applications benefit from 10-100 millisecond latency targets; setting linger.ms=50 provides good batching without excessive latency.

Compression helps too. Snappy or lz4 reduce network I/O without much CPU overhead. The choice depends on your bottleneck: if CPU is limited, use snappy. If bandwidth is limited, use gzip.

What’s the relationship between partition count, consumer count, and parallelism?

Parallelism is capped by the partition count. If a topic has 10 partitions, you can have at most 10 consumers in a group processing in parallel. The 11th consumer is idle. Scaling consumers beyond partitions doesn’t help; you don’t gain additional parallelism.

Choosing partition count is a balance. Too few partitions limit parallelism. Too many partitions create overhead: more metadata to manage, more storage for replicas, more network traffic for replication. A common heuristic is: partition count equals expected throughput (MB per second) divided by expected single consumer throughput (MB per second). If you expect 10 MB per second and a single consumer can handle 1 MB per second, aim for 10 to 20 partitions (20 provides headroom for scaling).

Partition count is hard to change after the fact: you can add partitions, but that changes the key-to-partition mapping and breaks per-key ordering for existing keys, and you can’t reduce the count without recreating the topic. Choose carefully during design. Start conservative (e.g., 10 partitions) and increase only if needed.

How does replication factor and min.insync.replicas affect performance?

Higher replication factor increases durability but decreases write throughput. Each message must be replicated to more brokers, increasing network I/O. It also increases storage: with replication.factor=3, you need 3 times the disk space. Replication also increases the cost of leader election: if a leader fails, a new leader must be elected from the replicas.

min.insync.replicas creates a floor for write safety. If set to 3 and a broker fails, reducing the ISR below 3, writes are paused until another replica catches up. This prevents data loss but sacrifices availability during broker failures. It’s a tuning decision based on your RTO (Recovery Time Objective) or RPO (Recovery Point Objective) requirements.

Explain broker-side tuning: compression, batching, and I/O.

Brokers benefit from compression because compressed batches take less disk I/O and network I/O. Snappy is often preferred for its speed. The broker decompresses messages for consumers automatically.

The broker’s log.segment.bytes controls segment size. Larger segments mean fewer segments to manage, reducing metadata overhead, but they increase the time to clean up old logs. A typical setting is 1 GB. log.roll.ms and log.roll.hours force a new segment after a time interval even if the size threshold isn’t reached; the default is 168 hours (7 days).

The OS page cache is crucial. Kafka relies on OS page caching to serve recent messages from memory instead of disk. Ensure the broker has enough RAM that the hottest, most recent data fits in cache. For a topic ingesting 1 GB per second, a single day of retention is roughly 86 TB, far more than any broker’s RAM; the cache hit ratio for old data is therefore poor, but the most recent hour of data likely fits in cache. This is why Kafka brokers benefit from ample RAM even though Kafka stores everything on disk.

How do you tune consumer throughput and reduce latency?

Consumer throughput is improved by increasing fetch.min.bytes and fetch.max.wait.ms, allowing the broker to batch more messages before responding. Increasing max.poll.records increases the batch size returned per poll. Parallelize by increasing the number of consumers (up to partition count). More consumers means more parallelism and higher aggregate throughput.

Latency is reduced by decreasing fetch.min.bytes and fetch.max.wait.ms so the broker responds quicker, even with fewer messages. Decrease max.poll.records to process smaller batches faster, reducing the time between polls. Latency is the time between a message being produced and a consumer processing it; batching increases latency but improves throughput.

Describe JVM heap tuning for Kafka brokers.

Kafka brokers are write-heavy but lean on the OS page cache rather than the JVM heap, so they don’t typically require large heap sizes. The 1 GB default in the startup scripts works for small or test clusters; a heap in the 4 to 6 GB range is a more typical production setting. Increasing heap size can help if you have many concurrent connections or complex request handling, but beyond a certain point, larger heaps increase GC pause times; a 32 GB heap with a poorly tuned collector can cause multi-second pauses, which is unacceptable for real-time systems.

Monitor GC logs. Young generation GCs should be frequent but fast (less than 100 milliseconds). Full GCs should be rare. If you see frequent full GCs, your heap is too small. If you see infrequent but very long full GCs, increasing heap size or switching GC algorithms might help. The G1 collector, which Kafka’s startup scripts enable by default, bounds pause times and is usually a better fit for larger heaps (e.g., 8 GB or more) than the older parallel collector.

Kafka Security

How do you configure SSL or TLS for Kafka?

Kafka supports SSL or TLS for encryption between clients and brokers. You generate certificates for each broker and for clients (if you want mutual TLS). Configure listeners on the broker: set advertised.listeners to indicate which addresses are exposed, and set listeners to the actual listening addresses. For example, advertised.listeners=SSL://public.hostname:9093 means clients connect to this address using SSL.

Configure security.protocol=SSL on clients to establish SSL connections. Clients must trust the broker’s certificate, so you provide a truststore with the broker’s CA certificate. SSL adds CPU overhead (encryption and decryption), so monitor CPU usage when enabling. Some deployments use SSL only between certain client-broker pairs and rely on network security for others.
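
On the client side, the configuration looks roughly like the sketch below; the hostnames, file paths, and passwords are placeholders, and the keystore entries are only needed for mutual TLS.

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SslConfigs;

public class SslClientConfig {
    static Properties sslProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "public.hostname:9093"); // SSL listener
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SSL");

        // Trust the CA that signed the broker certificates (path and password are placeholders).
        props.put(SslConfigs.SSL_TRUSTSTORE_LOCATION_CONFIG, "/etc/kafka/client.truststore.jks");
        props.put(SslConfigs.SSL_TRUSTSTORE_PASSWORD_CONFIG, "changeit");

        // Only needed for mutual TLS: present a client certificate to the broker.
        props.put(SslConfigs.SSL_KEYSTORE_LOCATION_CONFIG, "/etc/kafka/client.keystore.jks");
        props.put(SslConfigs.SSL_KEYSTORE_PASSWORD_CONFIG, "changeit");
        return props;
    }
}
```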

Describe SASL mechanisms: PLAIN, SCRAM, and GSSAPI.

SASL is a framework for authentication, and Kafka supports multiple SASL mechanisms. PLAIN sends credentials (username and password) in cleartext, so always pair it with SSL; it’s simple to set up but requires external password management. SCRAM is more secure: passwords are salted and hashed, and it’s generally the recommended password-based mechanism. Credentials are stored in ZooKeeper or KRaft metadata.

GSSAPI uses Kerberos for authentication. It’s enterprise-friendly but requires Kerberos infrastructure. It’s more complex to set up but provides strong authentication and integration with enterprise identity systems.
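
A minimal client configuration for SASL/SCRAM over TLS might look like this; the bootstrap address, username, and password are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.common.config.SaslConfigs;

public class ScramClientConfig {
    static Properties scramProps() {
        Properties props = new Properties();
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "broker1:9093"); // placeholder
        // SASL over TLS so credentials are never sent in the clear
        props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_SSL");
        props.put(SaslConfigs.SASL_MECHANISM, "SCRAM-SHA-512");
        props.put(SaslConfigs.SASL_JAAS_CONFIG,
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                        + "username=\"alice\" password=\"alice-secret\";"); // placeholder credentials
        return props;
    }
}
```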

What are Kafka ACLs and how do you enforce access control?

ACLs (Access Control Lists) control who can perform what actions on which resources. You can restrict read, write, create, delete, and admin operations on topics, consumer groups, and transactional IDs. For example, you can allow user “alice” to read from topic “payments” and write to topic “analytics” but deny all other access. ACLs are granular and powerful.

ACLs are enforced at the broker; unauthorized operations are rejected with an authorization error. To manage ACLs, use kafka-acls.sh or the AdminClient API. Note that ACLs handle authorization, not encryption: without TLS, anyone with access to the network can still read traffic, so combine ACLs with SSL and SASL.
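
As an alternative to the CLI, the same grant can be created programmatically. Here is a minimal sketch with the Java AdminClient; the principal, topic, and bootstrap address are illustrative.

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantReadAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow user "alice" to read topic "payments" from any host
            AclBinding allowRead = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "payments", PatternType.LITERAL),
                    new AccessControlEntry("User:alice", "*", AclOperation.READ, AclPermissionType.ALLOW));
            admin.createAcls(List.of(allowRead)).all().get();
        }
    }
}
```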

Explain encryption at rest and audit logging for compliance.

Kafka doesn’t natively encrypt data at rest: the broker writes data to disk as-is. If you require encryption at rest, use OS-level solutions like LUKS or hardware-based solutions like self-encrypting drives. Some deployments also encrypt sensitive payloads before they ever reach Kafka (or before a Kafka Connect sink persists them to an external store), so the data at rest is already ciphertext.

Audit logging is critical for compliance. You can enable security audit logs on the broker to log authorization decisions. These logs show who accessed what, when, and whether access was allowed or denied. Audit logs are useful for security investigations and compliance audits.

Architecture and Design Patterns

When should you use Kafka vs RabbitMQ vs SQS?

Kafka is optimized for high throughput, long retention, and event replay. Use Kafka when you need to process millions of messages per second, retain messages for extended periods, or replay events for new consumers. It’s the strongest fit for log aggregation, event streaming, and real-time analytics.

RabbitMQ is optimized for traditional pub-sub and queue semantics. It’s good for task queues and point-to-point messaging, and it’s simpler to operate than Kafka, but it doesn’t scale to Kafka’s throughput levels. RabbitMQ is the better fit when you need per-message routing and acknowledgement rather than long-term retention and replay.

SQS (AWS Simple Queue Service) is a managed queue service. It’s serverless and requires no operational overhead, but messages are deleted once consumed and retention is capped at 14 days, so there is no event replay. Use SQS for decoupling services when you don’t need event history. SQS is simpler than Kafka but far less flexible.

Describe event sourcing with Kafka.

Event sourcing means storing changes to application state as a series of immutable events. Instead of storing just the current state of an entity, you store every change event. Kafka is an excellent event store because it retains events, supports replay, and provides ordering guarantees. For example, in an e-commerce system, instead of storing a single row for an order with its current status, you store events like OrderCreated, ItemAdded, PaymentProcessed, ItemShipped. The current state is derived by replaying these events. Benefits include audit trail, ability to rebuild state, and supporting complex temporal queries.
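
A minimal producer sketch illustrates the keying that makes this work: every event for one order uses the order ID as the key, so all of its events land in the same partition and replay in order. Topic name, order ID, and the JSON payloads are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // durability matters for an event log

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String orderId = "order-42"; // key: all events for one order land in the same partition
            producer.send(new ProducerRecord<>("order-events", orderId, "{\"type\":\"OrderCreated\"}"));
            producer.send(new ProducerRecord<>("order-events", orderId, "{\"type\":\"ItemAdded\",\"sku\":\"ABC\"}"));
            producer.send(new ProducerRecord<>("order-events", orderId, "{\"type\":\"PaymentProcessed\"}"));
        }
    }
}
```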

What is CQRS and how does Kafka support it?

CQRS (Command Query Responsibility Segregation) separates write and read models. Commands modify state and produce events. Queries read from a separate read-optimized view. Kafka acts as the event log connecting commands and queries. A command writes an event to a Kafka topic. A view builder (Kafka Streams application or similar) reads the topic and updates a read-optimized database (e.g., Elasticsearch or a cache). Queries hit the read database. This decoupling allows independent scaling of write and read paths and enables building multiple projections from a single event log.
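
A view builder can be sketched with Kafka Streams roughly as below; it reads the command-side event log and publishes a simple per-order projection (here just an event count) to a topic the query side can materialize. Topic names, application id, and the aggregation are illustrative.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderProjection {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-projection"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");  // placeholder

        StreamsBuilder builder = new StreamsBuilder();
        // Command side writes events to "order-events"; this app maintains a read-side projection.
        KStream<String, String> events =
                builder.stream("order-events", Consumed.with(Serdes.String(), Serdes.String()));
        KTable<String, Long> eventsPerOrder = events.groupByKey().count();
        // Publish the projection so a query service (or a connector) can serve reads from it.
        eventsPerOrder.toStream().to("order-event-counts",
                Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```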

Explain the saga pattern for distributed transactions.

The saga pattern coordinates a distributed transaction across multiple services using events. Instead of a two-phase commit (which is slow and fragile), a saga chains local transactions with compensating actions. For example, a “process order” saga: Service A creates an order (emits OrderCreated). Service B reserves inventory (emits InventoryReserved). Service C processes payment (emits PaymentProcessed). If Service C fails, it emits a failure event and the earlier services run compensating transactions, such as releasing the inventory reservation. Kafka carries these events between services. Sagas scale better and tolerate failure better than two-phase-commit distributed transactions.
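
A compensating step in the inventory service might look roughly like the sketch below: it consumes payment events and, on a failure event, emits an event that releases the reservation. Topic names, group id, and the string-matching on the payload are illustrative simplifications.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class InventoryCompensator {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "inventory-service");     // placeholder
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("payment-events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // On a failed payment, emit a compensating event keyed by the same order id
                    if (record.value().contains("PaymentFailed")) {
                        producer.send(new ProducerRecord<>("inventory-events", record.key(),
                                "{\"type\":\"InventoryReleased\"}"));
                    }
                }
            }
        }
    }
}
```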

What is MirrorMaker 2 and when do you use multi-datacenter Kafka?

MirrorMaker 2 is a Kafka Connect-based tool that mirrors topics from one Kafka cluster to another. It’s useful for disaster recovery (maintaining a replica cluster), geographic distribution (keeping data close to consumers), and aggregating data from multiple regional clusters. MirrorMaker 2 preserves per-partition ordering, syncs topic configurations, and translates consumer offsets between clusters. It’s far more capable than the older MirrorMaker 1, which was essentially a consumer-producer pipe with no offset translation or configuration syncing.

Troubleshooting and Debugging

A consumer’s lag is not recovering; what are the causes and fixes?

If a consumer’s lag (difference between producer position and consumer position) is increasing or stuck, the consumer is slower than the producer. Causes include: slow processing logic, network latency, under-resourced consumer, skewed partition assignment, or consumer crashes and restarts. To debug: check consumer logs for errors or exceptions. Monitor processing time with distributed tracing. Verify CPU and memory usage on consumer instances. Check if the consumer is rebalancing (frequent rebalances indicate instability). Verify partition assignment is balanced.

Fixes include: optimize processing logic (profile and optimize slow code), increase resources (CPU or memory), increase consumer count (up to partition count), or reduce max.poll.records so each poll’s batch completes well within max.poll.interval.ms and avoids triggering rebalances.
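
Lag can also be measured programmatically instead of via kafka-consumer-groups.sh. Here is a minimal AdminClient sketch that subtracts committed offsets from log-end offsets per partition; the group id and bootstrap address are placeholders.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the consumer group (group id is a placeholder)
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("analytics").partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions
            Map<TopicPartition, OffsetSpec> query = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(query).all().get();

            // Lag = log-end offset minus committed offset, per partition
            committed.forEach((tp, meta) ->
                    System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - meta.offset()));
        }
    }
}
```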

What causes broker under-replicated partitions and how do you fix them?

Under-replicated partitions occur when a replica fails to stay in-sync with the leader. A broker crashes, and its replicas are temporarily unavailable. Or a broker is slow, causing replicas to fall behind and be removed from the ISR. To fix: if a broker is offline, restart it and let it catch up. If a broker is slow, check its logs for errors, increase its resources, or check for disk I/O issues. Use kafka-topics.sh --describe to check partition replica status and ISR. If a replica is permanently stuck, you might need to reassign replicas using kafka-reassign-partitions.sh. Under-replicated partitions indicate durability risk; address them promptly.
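
The same check can be scripted. A minimal AdminClient sketch (topic name and bootstrap address are placeholders, and it assumes a recent Kafka client for allTopicNames) flags partitions whose ISR is smaller than their replica set:

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class IsrChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("payments"))
                    .allTopicNames().get().get("payments");
            desc.partitions().forEach(p -> {
                // A partition is under-replicated when fewer replicas are in-sync than assigned
                if (p.isr().size() < p.replicas().size()) {
                    System.out.printf("Partition %d under-replicated: isr=%s replicas=%s%n",
                            p.partition(), p.isr(), p.replicas());
                }
            });
        }
    }
}
```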

How do you debug and prevent consumer rebalancing storms?

Rebalancing storms occur when consumers repeatedly drop out of and rejoin the group, triggering constant rebalancing. This kills throughput. Causes include: processing that exceeds max.poll.interval.ms, network issues, or buggy consumer code that crashes and restarts repeatedly. To debug: monitor the consumer’s rebalance metrics (for example rebalance-latency-avg and rebalance-total), log rebalance events and analyze when they occur, check consumer logs for exceptions, and verify network connectivity. Increase max.poll.interval.ms and session.timeout.ms if your processing is legitimately slow but you want to avoid rebalances.

To prevent: fix your application bugs, optimize processing, or tune timeouts appropriately. Avoid killing or restarting consumers unnecessarily.

Message ordering is broken; what might have gone wrong?

Ordering is guaranteed only within a partition. If you see out-of-order messages, you’re likely comparing messages across partitions (which have no global order) or producing without a key, so records are spread across partitions with no single ordering. To fix: ensure all messages related to an entity use the same key (e.g., a user ID or order ID) so they land in the same partition and are consumed in order. Or switch to a single-partition topic if you truly need global ordering, accepting that this kills parallelism.

If messages in the same partition are out of order, something is very wrong (producer bug or network corruption). Check broker logs for anomalies.

Producer calls are timing out; how do you investigate?

Producer timeouts indicate requests aren’t completing within request.timeout.ms (default 30 seconds). Causes include: broker overload, network latency, or broker crashes. To investigate: check broker logs for errors. Monitor broker CPU and disk I/O. Check network latency between producer and brokers. Increase request.timeout.ms if latency is genuinely high but the broker is healthy. Reduce producer batch sizes to reduce load. Monitor replication lag; if it’s high, brokers are struggling. Also check if you’re using acks=all and min.insync.replicas is set high; this increases latency.

Questions to Ask the Interviewer

Beyond answering questions, asking thoughtful questions demonstrates engagement and critical thinking. What’s your current Kafka architecture? Understand cluster size, replication factor, topic count, and throughput. This helps you assess the scale you’d be working at. What’s your current monitoring and observability setup? Good teams monitor lag, rebalancing, broker health, and application metrics. Poor monitoring indicates you’ll spend time fighting fires.

What are your biggest Kafka challenges? This is honest feedback on what the team struggles with. How do you handle schema evolution and backwards compatibility? This indicates maturity. Do you use Kafka Streams, Spark, or Flink for stream processing? This tells you what skills they want. What’s your disaster recovery strategy? Do they have backup clusters, MirrorMaker, cloud-based disaster recovery? How do you manage secrets and access control? Are they using SASL, SSL, and ACLs? What’s your onboarding experience for new services using Kafka? Great teams have standardized practices.

Final Thoughts on Kafka Interviews

Kafka is a large system with many moving parts. No one knows everything. Interviewers know this. They’re looking for clear thinking, ability to reason about tradeoffs, and honesty about knowledge boundaries. If you don’t know something, say so. Explain how you’d approach finding the answer.

Strong answers demonstrate understanding of the underlying principles, not just memorized facts. Why does acks=all provide durability? Because the leader waits for replicas to acknowledge before acknowledging to the producer, guaranteeing the message is replicated. Why does batching improve throughput? Because multiple messages are sent in one network round-trip. This kind of reasoning is what separates strong candidates.

Finally, practice on real systems. Read Kafka code, deploy a test cluster, break things intentionally, and fix them. Hands-on experience is invaluable and obvious to interviewers.

For further reading on Kafka topics, architecture, and distributed systems in general, explore our pillar resource on best answers to interview questions. You might also find our guides on Kubernetes interview questions, Terraform interview questions, and Web API interview questions valuable for building a broader foundation in distributed systems and infrastructure. Additionally, check out our resources on SDET interview questions, Snowflake interview questions, data analyst interview questions, and quality engineer interview questions to deepen your expertise across the modern data and infrastructure landscape.

