Why Team-Level Metrics Matter in Software Engineering
Key Takeaways
- If you care about things like impact, speed, and quality, you need some form of metrics to work with so you can begin to track your progress
- The DORA Metrics of Deployment Frequency (DF), Lead Time for Changes (LTC), Mean Time to Recovery (MTTR), and Change Failure Rate (CFR) are a useful starting point
- Tracking metrics at the team level allows the team to gain insight into their performance and experiment based on real feedback
- Metrics should also be consolidated across the whole delivery organization
- Metrics alone don’t tell the whole story – they need to be interpreted with insight and care
In a world where everything can have perspective, context and data, it doesn’t make sense to limit that to just part of your software development process. Your department’s work doesn’t stop when it’s submitted to git, and it doesn’t start when you get assigned your ticket. From the time a work item first comes into focus to when it slides into place in your production code, there are many places where something can go right and just as many where something can go wrong. Measuring those areas, like any others in your pipeline, is crucial to making improvements. We’re going to spend a little bit of time reviewing our terms and concepts, but then we’re going to dive into Jobber’s development process and discover how we:
- Made the process of QA much easier by integrating Just-In-Time QA builds for our development branches
- Streamlined our PR process to get work through our approvals and testing quicker
- Integrated new services for handling failures and outages
- And discovered why we weren’t getting our engineers enough time to put their heads down and just work (spoiler: it was meetings!), and why talking to your developers is as important as employing engineering metrics
The most commonly accepted industry standard for these measurements is the Four Keys set out by the DORA team at Google: Deployment Frequency (DF), Lead Time for Changes (LTC), Mean Time to Recovery (MTTR), and Change Failure Rate (CFR). At their heart, these four metrics measure how frequently you deploy your code to production (DF), the time between the work being finished and being deployed (LTC), how long it takes to recover from a serious production issue (MTTR), and how often your newly deployed code causes issues in production (CFR). In the abstract, they are key metrics across the general categories of the impact you are having on your customers, the speed at which you are doing so, and the consistency or quality of the services you are delivering. If you care about things like impact, speed, and quality, then you need some form of metrics to work with so you can begin to track your progress.
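To make those definitions concrete, here is a minimal sketch of how the four metrics could be computed from deployment and incident records. The records, field names, and seven-day window are purely illustrative assumptions, not how any particular tool calculates them.

```python
from datetime import datetime, timedelta

# Hypothetical deployment records: when the work was finished (committed),
# when it was deployed, and whether it caused a failure in production.
deployments = [
    {"committed_at": datetime(2023, 5, 1, 9),  "deployed_at": datetime(2023, 5, 1, 15), "caused_failure": False},
    {"committed_at": datetime(2023, 5, 2, 10), "deployed_at": datetime(2023, 5, 3, 11), "caused_failure": True},
    {"committed_at": datetime(2023, 5, 4, 8),  "deployed_at": datetime(2023, 5, 4, 16), "caused_failure": False},
]

# Hypothetical incident records: when the outage started and when it was resolved.
incidents = [
    {"started_at": datetime(2023, 5, 3, 12), "resolved_at": datetime(2023, 5, 3, 14)},
]

period_days = 7  # window the records above cover

# Deployment Frequency: deploys per day over the window.
df = len(deployments) / period_days

# Lead Time for Changes: average time from "work finished" to "running in production".
ltc = sum((d["deployed_at"] - d["committed_at"] for d in deployments), timedelta()) / len(deployments)

# Mean Time to Recovery: average time from incident start to resolution.
mttr = sum((i["resolved_at"] - i["started_at"] for i in incidents), timedelta()) / len(incidents)

# Change Failure Rate: share of deployments that caused a production issue.
cfr = sum(d["caused_failure"] for d in deployments) / len(deployments)

print(f"DF: {df:.2f} deploys/day, LTC: {ltc}, MTTR: {mttr}, CFR: {cfr:.0%}")
```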
At Jobber, a provider of home service operations management software, we track these metrics and more across our Product Development department so that changes to our progress show up in a measurable way. They help our teams be agile as they make changes, and be data-driven in their execution. If a team wants to try a new method of triaging bugs, or a new PR process, we can track that in real time against not only their previous performance, but also chart that same metric across the department as a whole, eliminating the risk of larger departmental noise ruining our data. Those metrics at the individual or team level then roll up into groups, departments, and eventually our entire organization. This gives us the fidelity to drill down into any layer we’d like, from the individual all the way to Jobber as a whole, and to see how we stack up against other organizations.
The Four Keys are not the only things we track, but it’s always important to put a caveat on the data you collect. Some of the noise at an individual, group, or department level is very human in nature, and that context is invaluable. While we collect data on as many different facets as we can, we also recognize the human side of those metrics, and how a difficult project, a switch in mission, or a personal circumstance might affect one or more key metrics for an individual, team, or even department.
That being said, metrics (DORA/Four Keys and otherwise) have helped Jobber make a number of changes to our development process, including investing in build-on-demand CI/CD pipelines not only for our production environments but for our developer environments as well, drastically improving our LTC by getting test builds out to internal stakeholders and testers minutes after an engineer has proposed a fix. We’ve also streamlined our PR review process by cutting down on the steps required to push out hotfixes, new work, and major releases, significantly improving our DF. And after reviewing some of our failure metrics, we integrated new services for dealing with outages and tech incidents, meaningfully improving our MTTR. Let’s dig into each of those examples a little deeper.
When we started examining our metrics more closely, we realized that there were a number of improvements we could make to our development process. Specifically, in investigating our Lead Time for Changes and Deployment Frequency, we realized that a key step where we were behind our competition was our ability to deliver code to internal parties quickly and efficiently. We discovered that the time between a change first being ready for review and actually being reviewed was much longer than at companies of a similar size. That prompted us to look deeper, and we found that getting builds to our product owners, stakeholders, and others responsible for QA more quickly would tighten those loops and let us deploy much faster. We rolled out on-demand Bitrise mobile builds for all of our new PRs, which meant it now took only 30 minutes to deliver builds containing a change or revision to all interested parties. This not only accelerated our feature development, but also had a meaningful impact on our MTTR and CFR metrics by simply getting code through our review process much faster.
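For illustration, here is a rough sketch of what wiring on-demand builds to PR activity can look like. The webhook route, environment variables, and workflow name are hypothetical placeholders rather than Jobber’s or Bitrise’s actual setup, and the event fields assume a GitHub-style pull request webhook payload.

```python
# Hypothetical sketch: a webhook receiver that kicks off an on-demand CI build
# whenever a pull request is opened or updated, so QA and stakeholders get a
# testable artifact minutes after a change is proposed.
import os
import requests
from flask import Flask, request

app = Flask(__name__)
CI_TRIGGER_URL = os.environ["CI_TRIGGER_URL"]      # placeholder: your CI provider's build-trigger endpoint
CI_TRIGGER_TOKEN = os.environ["CI_TRIGGER_TOKEN"]  # placeholder: auth token for that endpoint

@app.route("/pr-webhook", methods=["POST"])
def on_pull_request():
    event = request.get_json()
    # "opened" = new PR, "synchronize" = new commits pushed to an existing PR
    if event.get("action") in ("opened", "synchronize"):
        branch = event["pull_request"]["head"]["ref"]
        # Ask CI for a build of this branch; "internal-qa-build" is an invented workflow name.
        requests.post(
            CI_TRIGGER_URL,
            json={"branch": branch, "workflow": "internal-qa-build"},
            headers={"Authorization": f"Bearer {CI_TRIGGER_TOKEN}"},
            timeout=10,
        )
    return "", 204
```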
When we looked at our Mean Time to Recovery and Change Failure Rate metrics, we discovered that we weren’t as efficient there as we would like, either. We were quick to respond to failures, but there was room to improve, especially in our communication and organization around incidents. We integrated Allma as an incident collaboration layer within our Slack channels, organizing and focusing communication around an incident. Before, it was difficult for people to “hop in” to help out with an issue, since discussion was typically scattered across a number of different places. Allma flows helped us sort out those misconceptions and confusions by centralizing discussion of an active issue in one place, allowing many parties to jump in, monitor, or contribute to its resolution. Here, as in the previous case, monitoring our metrics led us to process and tooling changes in addition to specific technical or framework changes.
I want to zoom in, though, on a specific problem that crept up in a really interesting way. When we looked through Jellyfish, our engineering measurement tool, we noticed a fundamental problem: our ICs (individual contributors) weren’t coding enough! We measure how many PRs our engineers push out as well as “Coding Days,” a rough approximation of how much of an engineer’s day is spent working on code versus the other demands on their time. We saw that over the year, our ICs were spending less and less time working on code and more and more time on everything else. A simple and obvious solution of “just code more” jumps to the untrained mind, but as with any problem involving metrics or data, a signal may be telling you something while the act of measuring it produces a lot of noise. The best way to work through that noise is to zoom in from the distant view we usually take of our metrics to the personal experiences of your ICs, and that sometimes takes tough conversations.
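As a rough illustration of the kind of signal involved, here is a minimal sketch of a “coding days”-style calculation over commit timestamps. The data and the definition used here are simplified assumptions for the example, not how Jellyfish computes its metric.

```python
from datetime import datetime

# Hypothetical commit timestamps for one engineer over a working week.
commit_times = [
    datetime(2023, 5, 1, 10, 12),
    datetime(2023, 5, 1, 16, 40),
    datetime(2023, 5, 3, 11, 5),
    datetime(2023, 5, 4, 9, 30),
]

# "Coding days" here: the number of distinct days with at least one commit.
coding_days = len({t.date() for t in commit_times})
working_days = 5

print(f"{coding_days} of {working_days} working days had code activity "
      f"({coding_days / working_days:.0%})")
```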
Underpinning those conversations has to be an element of trust, or else you won’t get any useful context out of them. At Jobber, while we examine as much data as we can to help inform our decisions, we recognize that it’s at most only half of the equation, with the lived experiences of those being measured being the other half. A Jobberino is never measured by their data; it’s only used to contextualize what we’re seeing on our teams and groups, never to define them or their work. This means that together we can look at metrics not as a measurement of someone’s worth to the company, but rather as signals that might hint at a real problem. So when we sat down to make sense of the drop in our PRs, the first place we went was right to the source, exploring the issue directly with our engineers to see what was keeping them from their important work of building features and crushing bugs.
In what I’m sure is a familiar refrain to many who read this, when we dug into the data and the context around it, meetings were, unsurprisingly, the culprit. Specifically, meetings placed at inopportune times that would break that all-important technical flow. As you may or may not know, a significant number of the problems engineers deal with require at least four hours of focused time to solve, so meetings that break up that flow often meaningfully set back their resolution. You pay an opportunity cost too, since those are frequently the hardest and most significant problems, and they are the ones set back the most. In this case, a relatively ordinary metric (the amount of time our ICs spend coding) was burying a mountain of useful information and potential changes, and we’re now actively monitoring the large chunks of uninterrupted time our engineers have available, in addition to roughly measuring their productive time. We would never have found that information if we had not first measured our engineering efforts, and we definitely would not have had those critical contextual conversations without an environment of trust that allowed us to tackle the signal, not the noise.
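To give a flavor of what monitoring those large chunks of time can mean in practice, here is a small sketch that scans one day’s meetings for the longest uninterrupted block. The calendar data and the four-hour threshold are illustrative assumptions, not our actual tooling.

```python
from datetime import datetime, timedelta

# Hypothetical calendar for one engineer's day: (start, end) of each meeting.
workday_start = datetime(2023, 5, 1, 9)
workday_end = datetime(2023, 5, 1, 17)
meetings = [
    (datetime(2023, 5, 1, 10, 0), datetime(2023, 5, 1, 10, 30)),
    (datetime(2023, 5, 1, 13, 0), datetime(2023, 5, 1, 14, 0)),
]

# Walk the day in order and track the longest gap between meetings.
longest_block = timedelta()
cursor = workday_start
for start, end in sorted(meetings):
    longest_block = max(longest_block, start - cursor)
    cursor = max(cursor, end)
longest_block = max(longest_block, workday_end - cursor)

# Flag days where no gap is long enough for deep, focused work.
focus_threshold = timedelta(hours=4)
print(f"Longest free block: {longest_block}; "
      f"deep-work day: {longest_block >= focus_threshold}")
```

Run against this sample day, the longest free block is only three hours, which is exactly the kind of day that looks fine on a calendar but never gives an engineer the stretch of focus a hard problem needs.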
Beyond the Four Keys lie all sorts of other interesting metrics. We measure not only the total number of defects resolved, but also how many we’re closing per week. Or the amount of time a PR takes before it’s closed (as well as the number of PRs reviewed and the comments on those PRs!). We even measure how many times teams and departments update our internal documentation and wiki resources, how often we reinvest back into other developers or documentation each week. It ultimately comes down to this: if you don’t track it, you can’t measure it. And if you don’t measure it, you can’t improve it. Especially at the engineering manager level and beyond, as you tweak, modify, and adjust your policies, processes, and tools, you want visibility into the success or failure of any particular change. We have no qualms about slapping OKRs onto our features, and we should bring that same vim and vigor to tracking our contributions to the business of development. Just make sure to always get the boots-on-the-ground view before you act on some magical trend line in the data.
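As one last illustration of the kind of roll-up mentioned above, here is a minimal sketch of a PR cycle-time and review-volume calculation; the records and field names are invented for the example.

```python
from datetime import datetime, timedelta

# Hypothetical PR records: when each PR was opened and closed, plus review activity.
pull_requests = [
    {"opened_at": datetime(2023, 5, 1, 9),  "closed_at": datetime(2023, 5, 1, 17), "reviews": 2, "comments": 5},
    {"opened_at": datetime(2023, 5, 2, 14), "closed_at": datetime(2023, 5, 4, 10), "reviews": 1, "comments": 12},
]

# Average time a PR stays open, plus review and comment volume per PR.
cycle_time = sum((p["closed_at"] - p["opened_at"] for p in pull_requests), timedelta()) / len(pull_requests)
avg_reviews = sum(p["reviews"] for p in pull_requests) / len(pull_requests)
avg_comments = sum(p["comments"] for p in pull_requests) / len(pull_requests)

print(f"Avg PR cycle time: {cycle_time}, reviews/PR: {avg_reviews:.1f}, comments/PR: {avg_comments:.1f}")
```

Roll-ups like this are cheap to compute, which is exactly why they are worth pairing with the boots-on-the-ground context described above before acting on them.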