How a misconfigured Kafka cleanup policy caused silent audit data loss

During a performance test, we noticed that audit events for some messages were silently missing from MongoDB. There were no errors in the consumer logs, nothing on the dead-letter topic (DLT), and no errors from the producing services. As far as we could tell, the audit service had consumed everything that was available to it. So why weren’t the events in the database?

The answer turned out to be a one-line config change we had made weeks earlier. And it had nothing to do with the audit service itself.

The setup

In our project, we use Kafka for asynchronous communication between our microservices. One of those services is the auditing service, which listens to the audit events from every other microservice. Each microservice processes an input message and produces an audit event keyed by the unique id of that message. An audit event looks like this:

{
  "id": "8f2a91c3-4b5d-4e6a-9c1f-2d3e4f5a6b7c",
  "aggregateType": "payment",
  "eventName": "payment.processed",
  "timestamp": "2026-04-25T11:58:01.432Z"
}

and the audit service stores this in MongoDB in the following structure:

{
  "id": "8f2a91c3-4b5d-4e6a-9c1f-2d3e4f5a6b7c",
  "aggregateType": "order",
  "events": {
    "order": {
      "order.created": true
    },
    "payment": {
      "payment.processed": true
    },
    "inventory": {
      "inventory.reserved": true
    }
  }
}

So it groups events under the aggregateType of each microservice. All microservices publish events on a single topic, audit-events, which the audit service listens to. We use the unique id of the message as the partition key, so every audit event for a given message lands in the same partition. Each message goes through multiple services for processing.
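The grouping described above can be sketched in a few lines of Python. This is only an illustration of the document shape, not our actual audit service code: the helper name apply_audit_event is hypothetical, and the sketch assumes that the first event seen for a message creates the document and sets its top-level aggregateType.

```python
from typing import Any, Dict, Optional


def apply_audit_event(doc: Optional[Dict[str, Any]], event: Dict[str, Any]) -> Dict[str, Any]:
    """Fold one audit event into the stored document for its message (hypothetical helper)."""
    if doc is None:
        # Assumption: the first event seen for a message creates the document
        # and supplies its top-level aggregateType.
        doc = {"id": event["id"], "aggregateType": event["aggregateType"], "events": {}}
    # Group the event name under the aggregateType of the emitting service.
    doc["events"].setdefault(event["aggregateType"], {})[event["eventName"]] = True
    return doc


message_id = "8f2a91c3-4b5d-4e6a-9c1f-2d3e4f5a6b7c"
incoming = [
    {"id": message_id, "aggregateType": "order", "eventName": "order.created"},
    {"id": message_id, "aggregateType": "payment", "eventName": "payment.processed"},
    {"id": message_id, "aggregateType": "inventory", "eventName": "inventory.reserved"},
]

doc = None
for event in incoming:
    doc = apply_audit_event(doc, event)

print(doc["events"])
```

Every event contributes exactly one entry, so an event that never reaches the audit service shows up as a missing entry rather than as an error.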

We re-checked the audit consumer (no errors, empty DLT) and the producers (metrics confirmed that the events had been emitted). MongoDB was healthy too, with no downtime or high utilization. Something between the producers and the database was eating the events, so we looked at the audit topic’s recent config history.

We found one config change, and it touched the cleanup policy of the audit topic. We had never set a cleanup policy on this topic explicitly, so it had defaulted to delete. With the delete policy, Kafka discards old log segments once they are older than the retention period (retention.ms) or once the partition grows past the configured size (retention.bytes).

We had recently switched our CDC topics to compact, and the audit-events topic had been mistakenly included in that change. Since every service published its audit event with the same key (the message id), the topic always had multiple records per key: exactly what compact is designed to deduplicate.

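The compact policy keeps only the latest record per key and eventually removes all older ones. Compaction pays no attention to consumer offsets, so an older record can disappear whether or not anyone has read it.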
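To make the effect concrete, here is a toy model of compaction in Python. It is a simplification rather than Kafka's implementation: it treats the whole partition as one list of (key, value) records, whereas the real cleaner works segment by segment and never touches the active segment. The keys msg-1 and msg-2 are made-up message ids.

```python
from typing import List, Tuple

Record = Tuple[str, str]  # (key, value)


def compact(log: List[Record]) -> List[Record]:
    """Return the records that survive compaction: the last record per key, in offset order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [record for offset, record in enumerate(log) if latest_offset[record[0]] == offset]


# Three services emit audit events for two messages, all keyed by message id.
log = [
    ("msg-1", "order.created"),
    ("msg-2", "order.created"),
    ("msg-1", "payment.processed"),
    ("msg-1", "inventory.reserved"),
    ("msg-2", "payment.processed"),
]

print(compact(log))
```

Five audit events go in and two come out. For a CDC topic that is the desired behaviour; for an audit trail, the three dropped records are exactly the history we wanted to keep.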

This seemed like a probable cause, but we wanted proof. We needed to see when compaction triggered and whether the Kafka logs for it overlapped with the message processing window.

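Whether the log cleaner picks a partition for compaction depends on three broker configs (each also has a topic-level override: min.cleanable.dirty.ratio, min.compaction.lag.ms and max.compaction.lag.ms):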

  1. log.cleaner.min.cleanable.ratio (default 0.5)
  2. log.cleaner.min.compaction.lag.ms (default 0)
  3. log.cleaner.max.compaction.lag.ms (default 9223372036854775807, the max value of a signed 64-bit integer)

The cleaner sees each partition as a clean head (already compacted, one record per key) followed by a dirty tail (new writes since the last compaction, possibly with duplicate keys):

 ◄────────── clean head ─────────►◄────────── dirty tail ─────────►
 ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
 │ A │ B │ C │ D │ E │ F │ G │ H │ A │ I │ B │ C │ J │ A │ K │ B │
 └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘

 dirty ratio = dirty bytes / (clean + dirty bytes)

A partition becomes eligible for compaction along one of two paths:

                       partition state
                              │
                              ▼
            ┌────────────────────────────────┐
            │  oldest dirty message age      │
            │    > max.compaction.lag.ms ?   │── yes ──► compact (forced)
            └────────────────────────────────┘
                              │ no
                              ▼
            ┌────────────────────────────────┐
            │  dirty ratio                   │
            │    > min.cleanable.ratio ?     │
            │  AND                           │── no  ──► wait
            │  oldest dirty age              │
            │    > min.compaction.lag.ms ?   │
            └────────────────────────────────┘
                              │ yes
                              ▼
                           compact
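The decision flow above translates almost line for line into code. The sketch below mirrors the diagram and nothing more: PartitionState and compaction_decision are hypothetical names, the byte counts and ages are invented inputs, and the real cleaner also has to choose between several eligible partitions, which is not modelled here.

```python
from dataclasses import dataclass

LONG_MAX = 2**63 - 1  # 9223372036854775807, the default max compaction lag


@dataclass
class PartitionState:
    clean_bytes: int          # size of the already compacted head
    dirty_bytes: int          # size of the tail written since the last compaction
    oldest_dirty_age_ms: int  # age of the oldest record in the dirty tail


def dirty_ratio(state: PartitionState) -> float:
    total = state.clean_bytes + state.dirty_bytes
    return state.dirty_bytes / total if total else 0.0


def compaction_decision(
    state: PartitionState,
    min_cleanable_ratio: float = 0.5,
    min_compaction_lag_ms: int = 0,
    max_compaction_lag_ms: int = LONG_MAX,
) -> str:
    # Path 1: the oldest dirty record has waited longer than the maximum lag.
    if state.oldest_dirty_age_ms > max_compaction_lag_ms:
        return "compact (forced)"
    # Path 2: enough of the log is dirty and the minimum lag has passed.
    if (dirty_ratio(state) > min_cleanable_ratio
            and state.oldest_dirty_age_ms > min_compaction_lag_ms):
        return "compact"
    return "wait"


mostly_dirty = PartitionState(clean_bytes=400, dirty_bytes=600, oldest_dirty_age_ms=1_000)
mostly_clean = PartitionState(clean_bytes=700, dirty_bytes=300, oldest_dirty_age_ms=1_000)

print(compaction_decision(mostly_dirty))                             # defaults
print(compaction_decision(mostly_clean))                             # defaults
print(compaction_decision(mostly_clean, max_compaction_lag_ms=500))  # forced path
```

With the defaults, only the dirty ratio decides. Raising min_compaction_lag_ms in the call is the quickest way to see how a protection window would change the outcome.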

With min.compaction.lag.ms at 0 (the default), there is no protection window. Messages are eligible for compaction the moment they are written. With max.compaction.lag.ms at Long.MAX_VALUE (the default), the forced path never fires in practice. So under default settings, compaction happens purely when the dirty ratio threshold is crossed.

In our case all the configs were at their default values. During the performance test, we were publishing around 3 million messages to the audit topic. Compaction fired during the test. The audit consumer was lagging behind under the load, and any audit events that compaction dropped before the consumer pulled them were lost: a consumer can only read what is still in the log when it polls. Events the consumer had already read by the time compaction ran were safely in MongoDB. The rest never made it.
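The race between a lagging consumer and the cleaner can be reproduced with a small simulation. It is a toy model under stated assumptions: the whole partition is treated as compactable (real Kafka leaves the active segment alone), compaction runs exactly once, and the message ids and event names are illustrative.

```python
from typing import List, Set, Tuple

Record = Tuple[str, str]  # (message id used as key, event name)


def surviving_offsets(log: List[Record]) -> Set[int]:
    """Offsets that remain after compaction: the last offset written for each key."""
    latest = {}
    for offset, (key, _event) in enumerate(log):
        latest[key] = offset
    return set(latest.values())


def delivered_events(log: List[Record], consumer_position: int) -> List[Record]:
    """Records the consumer ends up reading if compaction runs while it sits at consumer_position."""
    # Everything below the consumer's position was read before compaction ran.
    read_before = list(range(consumer_position))
    # Afterwards the consumer can only fetch offsets that still exist in the log.
    survivors = surviving_offsets(log)
    read_after = [o for o in range(consumer_position, len(log)) if o in survivors]
    return [log[o] for o in read_before + read_after]


# Four messages pass through three services; each service emits one audit event per message.
stages = ["order.created", "payment.processed", "inventory.reserved"]
log = [(f"msg-{i}", event) for event in stages for i in range(4)]

for position in (0, 2, len(log)):
    delivered = delivered_events(log, position)
    print(f"consumer at offset {position}: {len(delivered)} of {len(log)} events delivered")
```

A consumer that has fully caught up loses nothing, while one sitting at offset 2 receives only half of the events. This matches what we saw: the loss depends on how far behind the consumer is when the cleaner runs, so it only shows up under load.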

We confirmed this from the Kafka cleaner logs: the compaction timestamps lined up with the last_updated timestamps in MongoDB. Same minute, same partition. The snippet below is a representative cleaner-thread log for this scenario, not the literal output from our cluster, but the format and fields are the same:

[2026-04-25 11:58:13,229] INFO Cleaner 0: Cleaning log audit-events-0 (cleaning prior to Sat Apr 25 11:55:51 UTC 2026, discarding tombstones prior to upper bound deletion horizon Thu Jan 01 00:00:00 UTC 1970)... (kafka.log.LogCleaner)
[2026-04-25 11:58:13,231] INFO Cleaner 0: Cleaning LogSegment(baseOffset=0, size=117869, ...) in log audit-events-0 into 0 with an upper bound deletion horizon 0 computed from the segment last modified time, retaining deletes. (kafka.log.LogCleaner)
[2026-04-25 11:58:13,244] INFO [kafka-log-cleaner-thread-0]:
	Log cleaner thread 0 cleaned log audit-events-0 (dirty section = [0, 5000])
[2026-04-25 11:58:32,395] INFO Cleaner 0: Cleaning LogSegment(baseOffset=0, size=1251, ...) in log audit-events-0 ... (kafka.log.LogCleaner)

In the trace, the first segment shrinks from 117,869 bytes to 1,251 bytes after compaction. 4,950 of the 5,000 messages were dropped, leaving only the latest value for each of the 50 unique keys.
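The timestamp correlation can also be checked mechanically. The sketch below parses "Cleaning log" lines in the format of the representative trace and tests whether a compaction run falls in the same minute as a database timestamp. The regular expression is written against that representative format only, the last_updated value is a made-up example, and the comparison assumes both timestamps are in the same time zone.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Matches the "Cleaning log <topic>-<partition> (" line of the cleaner trace.
CLEANING_LINE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\] INFO Cleaner \d+: "
    r"Cleaning log (?P<topic>\S+)-(?P<partition>\d+) "
)


def compaction_runs(lines: Iterable[str]) -> List[Tuple[str, int, datetime]]:
    """Extract (topic, partition, start time) for every compaction run found in the lines."""
    runs = []
    for line in lines:
        match = CLEANING_LINE.match(line)
        if match:
            started = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S")
            runs.append((match["topic"], int(match["partition"]), started))
    return runs


def same_minute(a: datetime, b: datetime) -> bool:
    return a.replace(second=0, microsecond=0) == b.replace(second=0, microsecond=0)


sample_lines = [
    "[2026-04-25 11:58:13,229] INFO Cleaner 0: Cleaning log audit-events-0 "
    "(cleaning prior to Sat Apr 25 11:55:51 UTC 2026, ...)... (kafka.log.LogCleaner)",
    "[2026-04-25 11:58:13,231] INFO Cleaner 0: Cleaning LogSegment(baseOffset=0, "
    "size=117869, ...) in log audit-events-0 into 0 ... (kafka.log.LogCleaner)",
]

# Hypothetical last_updated value read from an affected MongoDB document.
last_updated = datetime(2026, 4, 25, 11, 58, 40)

for topic, partition, started in compaction_runs(sample_lines):
    print(topic, partition, started, same_minute(started, last_updated))
```

Only the first sample line counts as a compaction run; the per-segment line is skipped on purpose so that each run is reported once.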

We verified by switching the cleanup policy back to delete and running the performance tests multiple times. No events were lost from the audit collection.

The compact and delete policies are fundamentally different models of what a topic is. A delete topic is a log: it retains history up to a size or time limit. A compact topic is closer to a key-value store: it retains only the latest state per key. Putting an append-style use case (audit events, where every message matters) on a topic with the compact policy is silent data loss waiting to happen, especially when the same key is reused across producers. The policy choice belongs in the design, not as an afterthought in a config file.
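One way to keep that decision in the design is to declare what each topic is meant to be and check the actual cleanup.policy values against the declaration, for example in CI. The sketch below is only an idea, not something we run today: the intent labels, the customer-cdc topic name and the policy_violations helper are all hypothetical, and the policies are passed in as a plain dictionary rather than read from a cluster.

```python
from typing import Dict, List

# Hypothetical design-time declaration of what each topic is for.
TOPIC_INTENT = {
    "audit-events": "log",           # every record matters
    "customer-cdc": "latest-state",  # hypothetical CDC topic, only the newest value matters
}


def policy_violations(intent: Dict[str, str], actual_policies: Dict[str, str]) -> List[str]:
    """Return the topics declared as logs whose cleanup.policy includes compact."""
    violations = []
    for topic, kind in sorted(intent.items()):
        # A topic without an explicit policy falls back to delete.
        raw = actual_policies.get(topic, "delete")
        # cleanup.policy is a comma-separated list, e.g. "compact,delete".
        policies = {p.strip() for p in raw.split(",")}
        if kind == "log" and "compact" in policies:
            violations.append(topic)
    return violations


after_bad_change = {"audit-events": "compact", "customer-cdc": "compact"}
after_fix = {"audit-events": "delete", "customer-cdc": "compact"}

print(policy_violations(TOPIC_INTENT, after_bad_change))
print(policy_violations(TOPIC_INTENT, after_fix))
```

A check like this would have flagged audit-events as soon as it was swept into the CDC change, long before a performance test had to reveal it.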

Thanks for reading. Have a great day!

