As part of our focus on delivering the leading Salesforce DevOps platform, we’re always looking for ways to improve our internal processes and continue to build a resilient system that serves our users. One of the ways we’re doing this is through our new observability team.
Our internal DevOps team does a great job supporting our platform, and has been working on these kinds of problems for some time. The new observability team was set up to increase our capability in this area. Folks from different areas of engineering joined the team, and they continue to rotate in and out of the team to gain skills, and contribute to this important service. Read on to learn more about what observability means at 91导航, and the impact it has.
What do we mean by observability at 91导航?
For us, observability centres around the ability to understand the inner workings of our system solely based on observing and interrogating it with external tools. That means we want to be able to understand the state of our application at any time, even if it’s in a state that we hadn’t previously predicted. Ideally, that would mean at any given time we don’t need to ship new custom code to diagnose an issue that we’re already seeing in the application.
What’s the goal of the observability team at 91导航?
The goal of the team is to increase our capability to operate our systems at scale, while handling internal code updates and external changes to the environment our code operates in (based on new customers or customers changing their usage patterns).
We roughly view the four main challenges to keeping a system stable under the force of internal and external changes as:
- Detecting there’s a problem with the system
- Triaging the problem and assigning an owner
- Figuring out the root cause of the issue
- Fixing the problem
Asking why the problem happened, and how we can make sure it doesn’t happen again, is an explicit part of our write-up of each incident. For us, it’s part of “fixing the problem”, and it’s the way we do engineering at 91导航, but we realize that’s maybe not the way every company does it!
Our success at all four challenges can be improved by enhancing our observability. So a large part of the work the team has started doing focuses on improving our observability, with the aim of making us more effective at each of those four challenges.
We also have a remit to tackle all of the sociotechnical issues that arise when we’re tackling these problems. After all, there’s no point building great observability tools if nobody knows how to use them! So, we engage with other teams to make sure we’re building the right things, help them get up to speed with the tooling, share knowledge, and implement any feedback from using the platform.
How does the team work?
The team is made up of a mixture of temporary and permanent roles. This allows us to look long term and think about what the team might be doing in a year or two, and adapt as the business continues to grow.
Engineers from other teams do a three- to six-month secondment in the observability team, which is a great way to encourage learning and development in new areas for the rest of the engineering team.
Developers who join this team bring new ideas and experiences from their current roles, and take the new skills they’ve learned back to their teams to make even more of a difference. Working in this collaborative, transparent way is one of the things that makes 91导航’s engineering culture unique, and encourages professional development in the engineering team.
Any engineer can ask to join the team: it’s an explicit goal to help spread some of the knowledge from the observability team, so you definitely don’t need to be an expert already. However, we do have to make sure there’s a balance on the team at any one time, and across the rest of the engineering teams.
What projects have the observability team worked on?
One of the big projects we’ve worked on is to start using OpenTelemetry tracing to improve our observability. We’ve started using a tool called Honeycomb to complement our existing observability stack, and this has already led to lots of small improvements in our operational capacity. For example:
- We were quickly able to identify that a specific code change had led to a massive increase in latency on one of our key endpoints.
- We’ve been able to apply a number of small optimisations that allow the app to fundamentally operate faster.
- We now have more confidence around our use of queues, and where the slow points are.
The team has also been able to review some of our database usage, and put better observability tooling and techniques in place to improve how we inspect and optimise that usage.
All these projects help us achieve our overall goal: to help the whole team have more confidence in their understanding of the system. Over time, this will enable us to not only make small fixes like the ones we’re working on now, but recover faster from larger failures, and move faster with new functionality because it’ll be easier to verify at each stage of rollout that things are behaving as expected.
What’s it like working on the observability team?
[Julian Wreford]: “I’ve really enjoyed working on the operational team, it’s been a bit of a change from my prior experience doing just software development, but I’ve learnt so much by having the opportunity to really focus on the operational aspects behind running our application. It’s a cool mixture of ad-hoc response to immediate issues, and longer term planning around how we improve our general approach to running the application, and I’ve really enjoyed that mixture.”
[Oli Lane]: “Founding this team has been a real learning experience for us all, which is great. Having a team focused just on operational and observability concerns has given us the space to learn much more about how best to instrument, operate, and ultimately improve the services we run, and that’s been really fulfilling. From tackling issues, to paying down technical debt, to empowering other teams to get more insights into how the code they write runs in production, it’s easy to see the impact of the work we’re doing. And I often don’t know what I’ll be working on when I get in each morning, which definitely keeps things fresh!”
Want to join the observability team?
Then why not take a look at our engineering roles, and you could be part of making a real difference at 91导航 and for our users.
