
The missing piece to improving your user experience might just be adopting an observability strategy.

According to McKinsey, the average share of digital customer interactions jumped from 18% pre-COVID to 55% during the pandemic. This digital acceleration is here to stay, making user experience (UX) a critical aspect of every digital product and online service. But can you really understand UX without a proper observability approach?



Monitoring techniques 

To ensure a positive UX, DevOps teams need to collect data on how transactions are performing, which often requires several methods and monitoring tools. Primary examples include real user monitoring (RUM), synthetic monitoring, application performance management (APM), and, more recently, distributed tracing. For a more comprehensive understanding, let’s look at each of these monitoring techniques.

APM is a broad term that refers to the process of monitoring, managing, detecting, and mitigating performance issues within an application.

APM was born in the era of big monolithic applications, but with the adoption of microservices running on containers, operations teams found many new “unknown unknowns” once code moved into production (at scale). A new approach was required to understand how user transactions were behaving in a highly distributed and ephemeral environment, hence the need for distributed tracing.

Distributed tracing is a more recent method of tracking the flow of requests and responses between multiple microservices in a distributed system. This is especially important in microservice architectures, where a single user request may involve multiple microservices, making it difficult to identify performance bottlenecks.

With distributed tracing, the first service to receive a request generates a unique trace identifier, and each microservice passes it along to the next one in the transaction. This allows developers to see the complete end-to-end flow of requests and responses and identify which microservice is causing performance issues. Distributed tracing also provides visibility into each microservice's processing time, as well as how long requests take to travel between microservices. It is a critical tool for ensuring the performance and reliability of cloud-native applications.
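As a rough sketch of that propagation, the example below passes a trace identifier between two hypothetical services via a header laid out in the W3C Trace Context style (the service names and in-memory “backend” are illustrative assumptions, not a real tracing SDK):

```python
import secrets

def make_traceparent(trace_id=None):
    """Build a W3C-style traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)   # 128-bit trace id, shared end to end
    span_id = secrets.token_hex(8)                 # 64-bit id for this service's own span
    return f"00-{trace_id}-{span_id}-01", trace_id, span_id

collected_spans = []  # stand-in for a tracing backend

def checkout_service():
    """Hypothetical entry-point service: starts the trace."""
    header, trace_id, span_id = make_traceparent()
    collected_spans.append({"service": "checkout", "trace_id": trace_id, "span": span_id})
    payment_service({"traceparent": header})

def payment_service(headers):
    """Hypothetical downstream service: reuses the incoming trace id."""
    _, trace_id, parent_span, _ = headers["traceparent"].split("-")
    _, _, span_id = make_traceparent(trace_id=trace_id)
    collected_spans.append({"service": "payment", "trace_id": trace_id,
                            "span": span_id, "parent": parent_span})

checkout_service()
# Both spans now share one trace id, so a backend can stitch the journey together.
assert collected_spans[0]["trace_id"] == collected_spans[1]["trace_id"]
```

Because every span carries the same trace id, a backend can reassemble the full request path even when the services never ran on the same machine.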

Both traditional APM and distributed tracing technologies mainly search for sources of performance issues from inside the application. To properly understand the full UX, teams also need to monitor the outside world. They can achieve this by asking questions like, “Is my content delivery network (CDN) impacting performance? Or the domain name system (DNS)?”

This brings us to the need for RUM and synthetic monitoring.

RUM captures information in real time from users' browsers and sends it to the RUM solution for analysis. Using RUM, teams can identify performance issues that may impact UX in the real world. 
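To make the RUM idea concrete, here is a minimal sketch of server-side aggregation of browser-reported timings (the beacon payload shape and page paths are hypothetical; real RUM beacons typically carry Navigation Timing metrics like these):

```python
# Hypothetical RUM beacons: each browser reports its own measured timings.
beacons = [
    {"page": "/home", "load_ms": 1240, "ttfb_ms": 180},
    {"page": "/home", "load_ms": 2980, "ttfb_ms": 620},
    {"page": "/checkout", "load_ms": 890, "ttfb_ms": 140},
]

def percentile(values, pct):
    """Nearest-rank percentile over a small sample."""
    ordered = sorted(values)
    index = max(0, int(round(pct / 100 * len(ordered))) - 1)
    return ordered[index]

# Group real-user load times by page, then report a tail-latency figure,
# since averages hide the slow experiences that actually hurt UX.
by_page = {}
for b in beacons:
    by_page.setdefault(b["page"], []).append(b["load_ms"])

for page, loads in by_page.items():
    print(page, "p95 load:", percentile(loads, 95), "ms")
```

Looking at a high percentile rather than the mean is the usual RUM practice: it surfaces the real users who had a bad experience.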

Synthetic monitoring, on the other hand, uses automated scripts to simulate user interactions with the application. The goal is to identify performance issues and provide an early warning of potential problems before they impact real users. Tests can be run from one location or many, mimicking the actions of a real user accessing the application, and can be scheduled at specific intervals for continuous monitoring. The data collected can then be used to measure the availability, response time, and functionality of an application.
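A scheduled synthetic check boils down to “run the scripted transaction, time it, compare against a threshold.” The sketch below simulates that loop with a stubbed check function standing in for a real scripted HTTP transaction (the 500 ms SLO is an illustrative assumption):

```python
import time

def probe(check):
    """Run one synthetic check and record whether it worked and how long it took."""
    start = time.perf_counter()
    ok = check()                                   # stand-in for a scripted interaction
    elapsed_ms = (time.perf_counter() - start) * 1000
    return {"ok": ok, "elapsed_ms": elapsed_ms}

def evaluate(result, slo_ms=500):
    """Classify a probe result so it can trigger an early-warning alert."""
    if not result["ok"]:
        return "FAIL"                              # functional failure
    return "SLOW" if result["elapsed_ms"] > slo_ms else "PASS"

# A simulated check stands in for a real login flow run from a probe location.
fast_login = lambda: True
print(evaluate(probe(fast_login)))                 # → PASS
```

In a real deployment the scheduler would run `probe` from multiple geographic locations at a fixed interval and alert on FAIL or repeated SLOW results.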

Now imagine that you have all those tools. Does that mean that your application is observable? Are you sure that you are collecting the right data, with the right agent? 


Observability should be the central element of IT modernization strategies that aim to support UX improvements.

Observability is the ability to understand the internal state of a system by measuring its external behavior. In today's complex and distributed systems, it is crucial for understanding and troubleshooting issues, as well as making informed decisions about a system's performance and capacity.

The origins of observability can be traced back to a single blog post shared by Twitter’s observability team in 2013. The post highlighted the difficulty the team faced in making its distributed architecture “observable,” lighting the match that sparked a wildfire of industry debate. As discussed earlier, with the shift to microservices, containers, and serverless, traditional monitoring tools could no longer provide the expected visibility into these complex systems; the necessary metrics and traces were missing.

How observability came about

Our good old monitoring tools were not ready to handle the volume of metrics generated by the large-scale shift to containers, especially the explosion of cardinality that occurs when microservices and containers are combined.
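To see why cardinality explodes, consider a single request-latency metric labeled by a few common dimensions; each extra label multiplies the number of distinct time series (the label counts below are purely illustrative):

```python
# Illustrative label dimensions for ONE metric in a containerized fleet.
labels = {
    "service": 50,       # microservices
    "pod": 20,           # replicas per service, constantly churned by Kubernetes
    "endpoint": 30,      # HTTP routes
    "status_code": 10,
}

# Worst case, every label combination produces its own time series.
series = 1
for name, distinct_values in labels.items():
    series *= distinct_values

print(f"one metric fans out into {series:,} time series")  # → 300,000
```

Pod churn makes it worse: every restart can mint fresh label values, so the series count grows over time rather than staying fixed.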

Traditional APM tools and their sampling-based approach no longer allowed a clear understanding of the customer journey, hence the emergence of distributed tracing. As user expectations constantly increase, we can no longer afford to lose or overlook potentially crucial information that falls outside a traditional APM’s samples.

The problem of “real time” still remains, however. When containers can start, crash, and be restarted in a matter of seconds (or less!) by Kubernetes, what's the point of monitoring every minute? Collecting logs, metrics, and traces, in real time and exhaustively, becomes a necessity in a world where the entropy of our systems is constantly increasing.

This entropy also has an impact on the efficiency of machine learning (ML) algorithms for those who use AIOps technology to reduce alert noise and detect anomalies. We all want to improve UX, but not at the cost of greater team fatigue. Observability provides more data and insights for training and operating AI models than current monitoring approaches.

Observability is therefore a data problem: basic datasets (logs, metrics, and traces) must be collected without sampling and in real time. But to transform this data into useful information, you also need to be able to correlate it, provide context, and give it meaning.
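At its simplest, that correlation is a join on a shared trace identifier: every log line and span that carries the same id belongs to the same user transaction. A minimal sketch, with entirely hypothetical records:

```python
# Hypothetical telemetry already collected without sampling.
logs = [
    {"trace_id": "abc123", "msg": "payment declined"},
    {"trace_id": "def456", "msg": "cart updated"},
]
traces = [
    {"trace_id": "abc123", "service": "payment", "duration_ms": 1900},
    {"trace_id": "def456", "service": "cart", "duration_ms": 45},
]

def correlate(trace_id):
    """Pull every signal sharing a trace id into one contextual view."""
    return {
        "logs": [l for l in logs if l["trace_id"] == trace_id],
        "spans": [t for t in traces if t["trace_id"] == trace_id],
    }

view = correlate("abc123")
print(view["logs"][0]["msg"], "-", view["spans"][0]["duration_ms"], "ms")
```

Real observability backends do this join at ingest or query time over billions of records, but the principle is the same: shared identifiers turn three separate datasets into one story.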

Most of the network operation centers (NOCs) I've known in my career have had 20 to 30 different tools, from specialized tools like Oracle Enterprise Manager for database administrators (DBAs) to Nagios for the network, and many others. This required installing multiple agents on the servers/VMs to collect logs, metrics, and traces, and each agent is often proprietary. The concern with this approach is that maintaining agents becomes quite heavy when you have a multitude of machines: imagine the number of agents to install, update, or replace (in the case of a change of tools) in a hybrid multi-cloud environment, with legacy IT talking to microservices running in one or more clouds. Not only is this extremely difficult to manage, but the processing and memory costs will not be trivial either.

How observability and OpenTelemetry work together 

One of the key tools for achieving observability is OpenTelemetry (a.k.a. OTel). The open-source project was formed by merging the OpenTracing and OpenCensus projects, the former initiated by a group of engineers from Uber and Lightstep, among others. It is now governed by the OpenTelemetry Governance Committee, which is made up of representatives from various organizations and companies, including Google, Microsoft, Splunk, AWS, and more. The project is backed by the Cloud Native Computing Foundation (CNCF), which provides support and resources for its development and promotion.

Today, it is a vendor-neutral framework for generating, collecting, and exporting telemetry data. It provides a consistent and standardized way of instrumenting applications and services, enabling engineers to understand the health and performance of their systems across different environments and technologies.

Observability without OpenTelemetry doesn't make sense for several reasons: 

  1. Standardization: OpenTelemetry provides a common set of application programming interfaces (APIs) and protocols for instrumenting applications and services, which means that telemetry data can be collected, stored, and analyzed in a consistent manner across different environments and technologies. This standardization enables your engineers to easily understand and troubleshoot issues, regardless of the underlying infrastructure. Without OpenTelemetry, each application or service would likely have its own way of instrumenting and collecting telemetry data, resulting in a lack of consistency across different environments and technologies.
  2. Vendor neutrality: OpenTelemetry is vendor-neutral. This allows you to use the tools and services you prefer, without being locked into a particular vendor or ecosystem. You can change your back-end monitoring tool without having to re-instrument your environment.
  3. Rich data: OpenTelemetry provides a rich set of data – including metrics, traces, and logs – which enables you to understand the health and performance of your systems at different levels of granularity. This rich data enables DevOps teams to troubleshoot issues, identify performance bottlenecks, and make informed decisions about capacity and scalability.
  4. Open source: OpenTelemetry is free to use and can be modified and extended to meet the specific needs of an organization. This also means that the community can contribute to the development of the framework and add new features and integrations.
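To illustrate the standardization point, here is a deliberately simplified, stdlib-only sketch of the kind of nested-span API OpenTelemetry defines. It only mimics the shape of a tracer's span context manager; it is not the real SDK, and the handler names are invented:

```python
import contextlib
import time

finished_spans = []  # stand-in for an exporter sending spans to any backend

@contextlib.contextmanager
def start_span(name, parent=None):
    """Minimal stand-in for a standardized tracer's span context manager."""
    span = {"name": name, "parent": parent, "start": time.perf_counter()}
    try:
        yield span
    finally:
        span["duration_s"] = time.perf_counter() - span["start"]
        finished_spans.append(span)

# Instrument a request handler the same way regardless of the backend vendor.
with start_span("GET /checkout") as root:
    with start_span("query-inventory", parent=root["name"]):
        time.sleep(0.01)   # simulated work

print([s["name"] for s in finished_spans])  # → ['query-inventory', 'GET /checkout']
```

The point of the standard is that the instrumentation above never changes: only the exporter behind it does when you swap monitoring vendors.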

OpenTelemetry is an essential tool and is now considered the standard for observability in cloud-native systems; it has become the de facto lightweight single agent supported by all major vendors.

Final thoughts

Hopefully, we can now see how OpenTelemetry, APM, RUM, distributed tracing, synthetic monitoring, and even AIOps come together to help understand and improve UX.

By combining these tools, organizations can start to identify performance issues, understand root causes, and make data-driven decisions about changes to their applications that will result in a better UX.

