What are helpful metrics to troubleshoot Kafka connection issues?
Plotting the following consumer fetch metrics over time can help identify issues related to unexpected disconnections by your consumer:
- fetch-latency-avg: The average time taken for a fetch request.
- fetch-latency-max: The maximum time taken for a fetch request.
- fetch-rate: The number of fetch requests per second.
- fetch-size-avg: The average number of bytes fetched per request.
- fetch-size-max: The maximum number of bytes fetched per request.
- fetch-throttle-time-avg: The average throttle time in ms. When quotas are enabled, the broker may delay fetch requests to throttle a consumer that has exceeded its limit. This metric indicates how much throttling time has been added to fetch requests on average.
- fetch-throttle-time-max: The maximum throttle time in ms.
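As a rough illustration of how these metrics might be checked once collected, the sketch below takes one snapshot of metric values as a plain dictionary and reports two conditions: a fetch-latency-max above a limit you choose, and a non-zero fetch-throttle-time-avg. The function names, the snapshot shape, and the latency limit are assumptions made for this example. How you obtain the snapshot depends on your client and monitoring setup (for example, the Java client exposes these metrics through KafkaConsumer.metrics() and JMX).

```python
from typing import Dict, List

# The fetch metrics listed above.
FETCH_METRICS = (
    "fetch-latency-avg",
    "fetch-latency-max",
    "fetch-rate",
    "fetch-size-avg",
    "fetch-size-max",
    "fetch-throttle-time-avg",
    "fetch-throttle-time-max",
)


def pick_fetch_metrics(snapshot: Dict[str, float]) -> Dict[str, float]:
    """Keep only the fetch metrics from the list that appear in the snapshot."""
    return {name: snapshot[name] for name in FETCH_METRICS if name in snapshot}


def flag_fetch_issues(snapshot: Dict[str, float], latency_limit_ms: float) -> List[str]:
    """Return human-readable warnings for one metrics snapshot (hypothetical helper)."""
    warnings = []
    metrics = pick_fetch_metrics(snapshot)
    # Assumption: the caller decides what counts as "too slow" for a fetch.
    if metrics.get("fetch-latency-max", 0.0) > latency_limit_ms:
        warnings.append("fetch-latency-max above limit")
    # A non-zero average throttle time means the broker delayed fetch requests.
    if metrics.get("fetch-throttle-time-avg", 0.0) > 0.0:
        warnings.append("fetch requests are being throttled")
    return warnings


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = {
        "fetch-latency-avg": 120.0,
        "fetch-latency-max": 950.0,
        "fetch-rate": 2.5,
        "fetch-throttle-time-avg": 0.0,
    }
    print(flag_fetch_issues(sample, latency_limit_ms=500.0))
```

A check like this only looks at a single point in time. Plotting the same values over the period around a disconnection is still what shows whether latency or throttling rose beforehand.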
Here is a resource that further details Kafka consumer metrics.