Experience with Observability

When I first learned about the three pillars of observability a few years ago, I ignored them because I was a front-end engineer. I came to understand their significance once I started owning the whole system. In this piece, I summarise my observations in layman's terms.

logs

As a developer, you will use logs at least every other day. When debugging an issue, you will start here. The best and first place to look for precise information about any errors, warnings, or exceptions is in the logs.

We already have a lot of mature logging tools, such as Kibana and Datadog.

A few points to take care of while instrumenting logs in your application:

  • Make sure you are not logging sensitive information such as passwords or OTPs.

  • Make sure your log statements are precise; noisy or overly verbose logs can cause memory and cost issues.

  • Have a proper archive strategy for log data.
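To make the first point concrete, here is a minimal sketch of masking sensitive values before they reach the log output, using Python's standard logging module. The key names in SENSITIVE_KEYS, the logger name, and the key=value message format are assumptions made for illustration; a real application would adapt them to its own log format.

```python
import io
import logging
import re

# Hypothetical list of keys treated as sensitive; adjust to your application.
SENSITIVE_KEYS = ("password", "otp")
_PATTERN = re.compile(r"\b(%s)=(\S+)" % "|".join(SENSITIVE_KEYS), re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Masks the value of sensitive key=value pairs before a record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # merge the args into the message first
        record.msg = _PATTERN.sub(r"\1=***", message)
        record.args = ()  # args are already merged into msg
        return True  # keep the record, only its content was changed


def build_logger(stream) -> logging.Logger:
    """Builds a logger that writes redacted lines to the given stream."""
    logger = logging.getLogger("redaction-demo")  # assumed logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    handler.addFilter(RedactingFilter())
    logger.addHandler(handler)
    return logger


def demo() -> str:
    buffer = io.StringIO()
    log = build_logger(buffer)
    log.info("login attempt user=%s password=%s otp=%s", "alice", "hunter2", "123456")
    return buffer.getvalue()
```

Calling demo() returns the formatted log line with the password and OTP values replaced by ***, while the user name stays readable. Attaching the filter to the handler keeps the masking in one place instead of relying on every call site to remember it.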

metrics

Metrics are something you will love if you are in a senior role. They give you a bird's-eye view of your systems and APIs. We use them to track the occurrence of an event, count items, measure the time taken to perform an action, or report the current value of a resource (CPU, memory, etc.).
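The kinds of measurements listed above (event counts, durations, current values) can be sketched with a toy in-memory registry. MetricsRegistry, handle_request, and the metric names are hypothetical and exist only for illustration; a real setup would use a metrics client that exports to a backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class MetricsRegistry:
    """Toy in-memory registry; not a real metrics client."""

    def __init__(self):
        self.counters = defaultdict(int)   # how many times an event occurred
        self.timings = defaultdict(list)   # durations of an action, in seconds
        self.gauges = {}                   # current value of a resource

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def set_gauge(self, name, value):
        self.gauges[name] = value

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the timed block raises.
            self.timings[name].append(time.perf_counter() - start)


def handle_request(metrics, payload):
    """Hypothetical request handler instrumented with all three metric kinds."""
    metrics.increment("requests_total")
    with metrics.timer("request_duration_seconds"):
        result = sorted(payload)
    metrics.set_gauge("last_payload_size", len(payload))
    return result
```

After a few calls to handle_request, the counter holds the number of requests, the timings list holds one duration per request, and the gauge holds only the most recent payload size.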

Grafana is a great tool for visualising metrics (there are many more).

While on metrics, it is also useful to cover some popular jargon we use when reading them -

Bandwidth - Bandwidth is the theoretical maximum capacity of a system, in contrast with throughput, which is the capacity the system actually exhibits. In other words, bandwidth is the maximum throughput the system can have.

Latency - Latency is generally the response time of your system. As system owners, we seek low latency.

High latency is frequently caused by transmission, propagation, routing, and storage delays. A few things we can do to minimize it include moving APIs to HTTP/2, making fewer calls to other APIs, using a CDN, and caching in browsers.

common latency representation

Usually, we use the p90, p95 and p99 percentiles to represent latency.

p99, or the 99th percentile, means that 99% of requests are at least as fast as the given latency. In other words, only 1% of requests are expected to be slower.
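As a rough illustration of how these numbers are derived, here is a small sketch that computes percentiles with the nearest-rank method. The percentile function and the sample latencies are assumptions made for the example; monitoring tools may use other interpolation methods and report slightly different values.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up latencies in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))

p90 = percentile(latencies_ms, 90)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With these 100 samples, p90 is 90 ms, p95 is 95 ms and p99 is 99 ms, so exactly one request (the 100 ms one) is slower than the p99 value.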

A good read - https://www.enjoyalgorithms.com/blog/latency-in-system-design

Throughput - Throughput usually tells you the rate at which a system can process requests. As system owners, high throughput is what we strive for.

We face a lot of hardware and cost limitations in achieving high throughput.
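To show how throughput relates to bandwidth, here is a tiny sketch that measures throughput over a time window and compares it with a theoretical capacity. The function names and all the numbers (4500 requests, a 60-second window, a capacity of 100 requests per second) are made up for illustration.

```python
def throughput(completed_requests: int, window_seconds: float) -> float:
    """Requests actually processed per second over the observation window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return completed_requests / window_seconds


def utilisation(measured_throughput: float, bandwidth: float) -> float:
    """Fraction of the theoretical capacity (bandwidth) that is actually used."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    return measured_throughput / bandwidth


# Assumed numbers: 4500 requests completed in a 60-second window,
# on a system whose theoretical capacity is 100 requests per second.
measured = throughput(4500, 60)
used = utilisation(measured, 100.0)
```

Here the measured throughput is 75 requests per second, which is 75% of the assumed bandwidth.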

A good read - https://www.enjoyalgorithms.com/blog/throughput-in-system-design

trace

A trace gives us visibility into how a request is processed across multiple services in a microservices environment. Every trace needs to have a unique identifier associated with it. We commonly use Elastic APM agents for tracing (again, there are many more).

A trace gives you a holistic view of any API journey.

These three pillars will enable you to do a primary investigation of any error or product issue, as well as gather information about any user, event, or action.
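The idea of one unique identifier following a request through several services can be sketched as below. This is not how any particular APM agent works internally; the X-Trace-Id header name and the three service functions are hypothetical, and plain function calls stand in for network calls.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"  # hypothetical header name, chosen for this sketch


def ensure_trace_id(headers):
    """Reuses the incoming trace id, or creates one if the request has none."""
    out = dict(headers)
    out.setdefault(TRACE_HEADER, uuid.uuid4().hex)
    return out


def inventory_service(headers, spans):
    headers = ensure_trace_id(headers)
    spans.append((headers[TRACE_HEADER], "inventory"))
    return headers


def order_service(headers, spans):
    headers = ensure_trace_id(headers)
    spans.append((headers[TRACE_HEADER], "order"))
    # The downstream call carries the same headers, and so the same trace id.
    inventory_service(headers, spans)
    return headers


def api_gateway(spans):
    """Entry point: no trace id exists yet, so one is created here."""
    headers = ensure_trace_id({})
    spans.append((headers[TRACE_HEADER], "gateway"))
    order_service(headers, spans)
    return headers[TRACE_HEADER]
```

After one call to api_gateway, every recorded entry shares the same identifier, so filtering by that id reconstructs the journey of the request: gateway, then order, then inventory.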