So, 3 months after starting a job at Honeycomb, I think I’m finally starting to understand what the heck Observability is all about. I’m still pretty new to it all, so consider these thoughts pretty idle in nature. Here’s what I’ve got so far.
Holy hell is there word soup in this space. How do people live like this?
- Distributed tracing
- Observability (the one I’m writing about now)
- Infrastructure monitoring
- Performance monitoring (different from APM? I can’t say)
- Service Maps
I know that every “space” has its own sea of vocabulary, but I really do think Observability is one of the most daunting ones I’ve seen. And there are so many different technologies out there, each of them at least capable of spitting out a log, so someone in my position has to learn about all of it. Absolutely overwhelming!
My hat is off to every SRE who’s had to endure this barrage of word soup while getting stuff done every day. Y’all have much harder jobs than your average software engineer.
And as it turns out, there’s a bunch of other companies who’ve been in the “monitoring” and “analyze your logs” space for a long time now and they’ve sorta rebranded their existing products as “observability” because that’s a lot easier than making products faster, simpler, and more enjoyable to use.
Anyone who’s googled “observability” in the past few years has undoubtedly come across the “three pillars of observability”, and also articles about why that framing sucks (and oh, hi Danyel!)
There’s these things called traces, metrics, and logs, which are really just data about your programs that come in specific formats. My understanding is that traces are the most useful of the three, but all three are still important.
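To make that concrete, here’s a loose sketch of what each of the three looks like as data. The field names here are purely illustrative, not any particular wire format:

```python
import time

# A log: a timestamped blob of text, maybe with some structured fields.
log = {
    "timestamp": time.time(),
    "level": "ERROR",
    "message": "failed to charge card",
    "user_id": "abc123",
}

# A metric: a named number sampled over time, with some tags attached.
metric = {
    "name": "http.requests.total",
    "value": 1042,
    "tags": {"endpoint": "/checkout", "status": "500"},
}

# A trace: a tree of spans, where each span is a timed, named operation
# with attributes. Spans point at their parent, so you can reconstruct
# the whole request across services.
span = {
    "trace_id": "4bf92f35",
    "span_id": "00f067aa",
    "parent_span_id": None,  # None means this is the root span of the request
    "name": "POST /checkout",
    "start_time": time.time(),
    "duration_ms": 212.0,
    "attributes": {"user_id": "abc123", "cart_size": 3},
}
```

The trace is the richest of the three because the parent/child links let you see how a single request moved through your whole system, which is presumably why people find traces the most useful.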
But these are just data structures and inputs. They’re interesting from a mechanical standpoint, but ultimately I care a lot more about solving my performance and reliability problems as quickly as possible. I’ll work to make sure I collect the right data, but at the end of the day, it’s about how fast I can analyze my data and solve whatever problem I’m having.
I found this whole “three pillars” thing (and the fight against it) pretty confusing until I realized that there’s sort of a parallel in a much more broadly-understood concept: Object-Oriented Programming!
I learned in school and from lots of places online that there are four principles of OOP:
- Abstraction
- Encapsulation
- Inheritance
- Polymorphism
But once I started actually using an OOP language in a big codebase, I learned that OOP is not actually about the so-called four principles of OOP. In fact, at least one of them (inheritance) is often derided as being the source of many code maintainability problems in large codebases!
Encapsulation via objects or generalizing stuff via polymorphism aren’t goals in themselves unless you have infinite free time. They’re merely ways to achieve real goals, such as:
- Code maintainability
- Accurate domain modeling
- Flexibility of your code over time
And as I learned over the years, there’s more that goes into achieving these goals than just the OOP principles I learned in school. Your choice of IDE, debugger, and test framework will impact your overall development experience just as much as (if not more than) how you use a language’s constructs to organize your systems.
When I started my job and googled the hell out of “observability”, I didn’t come across much that had to do with querying data. Maybe this was just Google being extra creepy, but it was mostly stuff published by Honeycomb that I’d end up on whenever I wanted to learn about anything more than the “three pillars”.
At Microsoft, I had no issue with data about the products I used being available. The problem was that querying that data with Kusto was such a pain that I’d always drag out the little “data fetch quests” as long as I could because I knew it would feel like an unproductive drag just to get a dumb number. First, the query language didn’t have great error recovery. Then the engine was super slow. And the way the data was laid out was weird as hell, but also not discoverable at all, so it took forever to answer anything other than a trivial telemetry question.
Traditional APM tools have a query language (that isn’t SQL) that you have to learn to be able to actually filter and transform your data into something useful. Query languages aren’t inherently bad; I don’t mind learning them. But it’s also a high bar to start getting any insight into what’s going on with your systems.
Honeycomb takes a different approach with a more visual query UI that’s actually pretty easy to learn how to use. And it’s fast. Being fast really matters a lot.
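For a sense of what “querying your data” actually means here, the questions you want answered tend to be filters and aggregations over request events. SQL stands in below for whatever language a given tool uses, and the events are entirely made up:

```python
import sqlite3

# Toy "events" table: one row per request, like a wide trace event.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (service TEXT, endpoint TEXT, status INTEGER, duration_ms REAL)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("checkout", "/pay",  200,   48.0),
        ("checkout", "/pay",  500, 1920.0),
        ("checkout", "/cart", 200,   12.0),
        ("search",   "/q",    200,   90.0),
        ("checkout", "/pay",  500, 2110.0),
    ],
)

# The kind of question you actually want answered, quickly:
# "which endpoint is erroring, and how slow is it when it does?"
rows = conn.execute(
    """
    SELECT endpoint, COUNT(*) AS errors, AVG(duration_ms) AS avg_ms
    FROM events
    WHERE status >= 500
    GROUP BY endpoint
    ORDER BY errors DESC
    """
).fetchall()

print(rows)  # [('/pay', 2, 2015.0)]
```

Whether you express this with SQL, a vendor query language, or a visual UI, the question is the same; what varies wildly is how long it takes you to write it, and how long you wait for an answer.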
I think other vendors should focus more on making querying data as approachable as possible. Instead, I get the feeling that it’s treated like an afterthought. I think most vendors in the Observability space have an “underpants gnomes” approach to Observability:
- Collect traces, metrics, and logs
- ???
- Observability!
What’s step 2? Entirely unclear to me as an outsider whenever I read their guides on what Observability is or how to achieve it. It seems that some are resorting to “AI” to fill in that gap so they can charge a bunch of money to run your data through a rudimentary model.
Eventually I expect some form of machine learning to get good enough at pattern recognition to surface things you should look into. But I think we’re a long way off from that being practical for everyday use.
For now, nothing beats a fast query engine. Anyone who works with databases knows this already, I suspect.
Since I’m still only 3 months into this space, I’m far from an expert on instrumenting a codebase for observability. But as far as I can tell, this shit is not easy.
I’m very used to writing “debuggable” code. Just declare some values/variables, click in your editor to set a breakpoint, hit the debug button, and BAM: you can see what the values are, step through code execution to see how things change, and pretty quickly narrow in on the problem. And to aid future developers, it’s pretty simple to keep code “debuggable” in just about any programming language that has a debugger.
Comparatively, I regularly find myself intimidated by how much work is involved to make my system “debuggable” for observability tools. Sometimes I need to install an agent, and I get anxiety over installing, deploying, and managing agents. I might need to pull in a library that’s poorly documented, or has some assumptions about how my app is loaded that don’t apply because I’m using a different library or framework that doesn’t let me control my application lifecycle. Maybe I’m working with a proxy that’s spitting out a bunch of crap I don’t care about and now I need to figure out how to shut it up, except I still want to keep some of what it spits out. Another proxy? Or do I have to write some code to “plug in” to something else now? The list goes on. It’s hard.
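At its core, though, what all that machinery is trying to accomplish is simple to state: wrap every interesting operation so it records a timed, named event. Here’s a toy single-process “tracer” that shows the idea. This is purely illustrative, not any real vendor’s or OpenTelemetry’s API:

```python
import time
from contextlib import contextmanager

# Where the toy tracer collects finished spans. A real SDK would batch
# these up and ship them to a backend instead of keeping them in a list.
collected = []

@contextmanager
def span(name, **attributes):
    """Record a timed, named event around a block of code."""
    start = time.perf_counter()
    try:
        yield
    finally:
        collected.append({
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
            "attributes": attributes,
        })

def handle_checkout(user_id):
    # Manual instrumentation: every operation you care about gets wrapped.
    with span("handle_checkout", user_id=user_id):
        with span("charge_card", user_id=user_id):
            time.sleep(0.01)  # stand-in for real work

handle_checkout("abc123")
print([s["name"] for s in collected])  # inner span finishes first
```

The toy version is easy; the hard part is that real agents and SDKs have to do this wrapping across every framework, library, and proxy in your stack, propagate context between processes, and not fall over, which is exactly where all the difficulty above comes from.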
Every vendor seems to have a solution of sorts that they claim makes it “easy” – but these tend to be proprietary, so you have to start over if you pick a different vendor. It seems likely that OpenTelemetry (OTel) is going to win out as the standard for collecting data, but it’s still so new that we’re regularly giving talks explaining what OpenTelemetry even is, so it’s got a long way to go.
More than anything, I think the key to Observability getting outside the domain of SREs is to find ways to make it as easy to instrument code as it is to “make code debuggable”. As far as I can tell, nobody’s gotten close yet.
Since I’ve had the pleasure of starting a new role where I came in knowing nothing, lemme tell ya:
Analyzing systems, even with good tools, is really hard if you’re not already an expert in those systems.
This isn’t exactly unique to Observability tools. I’m lucky to be one of the few experts on the F# language and compiler, which means that for any perf issue, I know what kind of data to collect, which tool to use, where to start looking, which things “matter” and which don’t, whether the data being collected is meaningful, etc. Anyone new to F#, or really even just to F# performance analysis, is not going to be productive at figuring out their performance bottlenecks.
As far as I can tell, if you’re not an expert on how your distributed systems (microservices?) are built, you’re going to have a hell of a time diagnosing correctness and performance issues. Then there’s the added complexity of needing to know how data is being organized in whatever observability tool you’re using.
Going back to the debugging analogy, you can actually use a debugger (like the one in Visual Studio) to learn about how your codebase works. Just set some breakpoints at various stages of your application, step through, and watch it all unfold. It’s amazing. And I haven’t seen Observability tooling offer up anything like that so far.
I think there are ways you could know up front what kind of data is being collected and offer some reasonable jumping-off points for querying that data further. That’s not an easy thing to build, though. I really hope I don’t throw my hands up in the air and say, “AIOps, fuck it”.
I wanted to end on more of a high note. Observability, once you “attain” it, is pretty fucking awesome.
I’m lucky that at Honeycomb, the engineering org has a high degree of “observability maturity”. There’s practically no question that you can’t get an answer to in sub-second time. I don’t know if I’ve had that come-to-Jesus moment about Observability just yet, but I think I’m on the path to it. I wish every engineering organization could have observable systems like this.
I also think there’s some pretty neat stuff that could be done in the future. What does observability for testing and CI systems look like? Could observability data get pulled into an IDE? Lots of possibilities.