Implementing production readiness: challenges, solutions, and how to get developers on board
Production readiness is well-understood as the gold standard, but teams are often too busy with day-to-day tasks to make meaningful progress on this initiative. In this guide, we'll talk through how you can start building a culture of service maturity in your organization.
No matter what you’re building, you want to create a level of service maturity that streamlines developer workflows, reduces vulnerabilities, and ultimately delivers a secure and robust customer experience.
At your organization, this level of maturity might be called production readiness, application maturity, service maturity, operational excellence, or another name. Ultimately, all these terms encapsulate the idea that organizations need to build high-quality services that are reliable and secure. Although it goes by many names, we’ll call it production readiness in this guide.
Developers know that production readiness is important, but are often so busy with their day-to-day tasks that it falls by the wayside. Organizations often accumulate tech debt because of a lack of early attention to production readiness. And, unfortunately, this debt can lead to issues down the line.
That’s why it’s important to implement production readiness in your organization. In this guide, we’ll cover what production readiness is, why you should care about it, and some of the challenges that come with implementation, as well as ideas for how to solve them. We’ll also cover how to get developers on board with production readiness more broadly.
Why production readiness?
Production readiness is the idea that your production services meet your operating standards. That is, do they meet your standards for reliability, observability, security, quality, and other areas that you’ve defined as important as you build your services?
Production readiness matters because we operate in complex technical environments where there’s no such thing as perfect software. Things will break. Even so, our goal as technical leaders is to minimize issues by removing the most common mistakes. When preventable errors or incidents do occur, they often reveal issues that can then be improved. But even when improvements are made, they’re often local to a specific team or domain, meaning that you might experience these issues again elsewhere if you don’t have a process for documenting and sharing your best practices.
This leads to more incidents, more development work, and can potentially threaten your reputation and revenue. Not only that, but it also reduces developer velocity and can impact engineering morale. Ultimately, these issues burn money and time, all while being preventable.
The main challenges with production readiness
Although production readiness is important, that doesn’t mean it’s easy to implement. In fact, companies face three main challenges: the discovery problem, the measurement problem, and the cultural problem.
The discovery problem
Lots of services and systems with little visibility
Most organizations have a lot of services and systems. For example, most OpsLevel customers have 200+ microservices, and even small startups with lean engineering teams regularly have services in the hundreds.
With all these services percolating, it can be difficult to get a holistic view and an accurate catalog of what actually exists. Without an accurate catalog, companies may lose track of services. Through re-orgs and employee turnover, services may wind up ownerless. Without a comprehensive view, engineering leadership lacks insight, making it impossible to make strategic, proactive improvements. After all, you can't fix what you don't know about.
How to solve it
It may seem obvious, but before you can get your services and systems production ready, you need to know what services actually exist. The best way to do this is to create an automated service catalog, which will act as your foundation as you strive for higher standards.
This service catalog should be more than a simple list of services. You also need to include who owns each service and all the metadata associated with it: its runbooks and documentation, plus links to other pieces of its toolchain, like deploys and observability dashboards.
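To make that concrete, here is a minimal sketch of what one catalog entry could hold and how you might flag the gaps. Every field name, service name, and URL below is hypothetical and only for illustration; this is not the OpsLevel data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ServiceEntry:
    """One catalog record. All field names are illustrative assumptions."""
    name: str
    owner: Optional[str] = None        # owning team; None means ownerless
    runbook_url: Optional[str] = None
    docs_url: Optional[str] = None
    # Links into the toolchain, keyed by kind (e.g. "deploys", "dashboard").
    tool_links: Dict[str, str] = field(default_factory=dict)


def missing_metadata(entry: ServiceEntry) -> List[str]:
    """Return the names of the catalog fields that are still empty."""
    gaps = []
    if not entry.owner:
        gaps.append("owner")
    if not entry.runbook_url:
        gaps.append("runbook_url")
    if not entry.docs_url:
        gaps.append("docs_url")
    if not entry.tool_links:
        gaps.append("tool_links")
    return gaps


# Two made-up services: one fully described, one left behind after a re-org.
catalog = [
    ServiceEntry(
        name="billing-api",
        owner="payments",
        runbook_url="https://example.com/runbooks/billing-api",
        docs_url="https://example.com/docs/billing-api",
        tool_links={"dashboard": "https://example.com/dashboards/billing-api"},
    ),
    ServiceEntry(name="legacy-exporter"),
]

gaps_by_service = {entry.name: missing_metadata(entry) for entry in catalog}
```

In this sketch, `gaps_by_service` surfaces the ownerless, undocumented service right away. That is the kind of question a bare list of service names can't answer.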
It may be tempting to create such a catalog in a spreadsheet or on a Notion or Confluence page. Indeed, we’ve seen many teams go this route, building the solution by hand. Unfortunately, these types of catalogs quickly get out of date, as they require a fair amount of manual upkeep. This leaves development leaders and their teams without the complete picture they need.
Ultimately, your catalog will never be complete and up to date unless it's maintained automatically rather than by hand.
The measurement problem
What your production readiness model looks like and how to evaluate it
Once your catalog is in place, the next step is to figure out how to implement, structure, and scale production readiness so that developers can engage with it. This requires measurement, as you can’t improve what you can’t measure.
But measurement is harder than it seems, especially since many teams not only struggle to define what production readiness actually means, but also rely on homegrown, manual checklists for evaluation. There are two main reasons measurement is difficult:
- Data collection. Although a checklist may seem simple, it can take a lot of effort to fill one in manually. After all, someone has to ask questions and track down information. For example: Are we on the latest version of our frameworks? Is our data encrypted to our standards? Answering these questions takes time, and what constitutes “complete” may be inconsistent between teams.
- Evaluation. A checklist is easy enough to use before a service goes into production. But what happens in the future when new requirements have been added? How do you actually know that the services you have running in your production infrastructure are still compliant? When you’re relying on a manual checklist, this can be hard to track.
How to solve it
Many teams start with a checklist of every item they care about within their greater production infrastructure. Although this isn’t a bad place to start, it’s much like the spreadsheet solution for creating a service catalog in that it’s difficult to scale. A simple checklist won’t serve the entire engineering organization, especially as you grow.
Your best bet is to focus on automated checks, as you ultimately don’t want your developers to be responsible for manually checking each box. For production readiness, the goal should be declaratively storing your checks and having your services be automatically evaluated. When done right, you’ll be able to seamlessly integrate the measurement that you care about with your service catalog.
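As a rough sketch of what "declaratively stored, automatically evaluated" could look like, the example below keeps the checks as data and runs all of them against a service record. The check names, the service fields, and the minimum framework version are assumptions made up for this example, not an actual OpsLevel API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical org-wide standard: the oldest framework version still allowed.
MIN_FRAMEWORK_VERSION = (7, 0)


@dataclass(frozen=True)
class Check:
    """A named rule that a service record either passes or fails."""
    name: str
    passes: Callable[[dict], bool]


# The checks live in one declarative list, so adding a new requirement
# means adding one entry here rather than re-surveying every team.
CHECKS: List[Check] = [
    Check("has_owner", lambda svc: bool(svc.get("owner"))),
    Check("has_runbook", lambda svc: bool(svc.get("runbook_url"))),
    Check(
        "framework_current",
        lambda svc: svc.get("framework_version", (0,)) >= MIN_FRAMEWORK_VERSION,
    ),
    Check("encrypted_at_rest", lambda svc: svc.get("encrypted_at_rest") is True),
]


def evaluate(service: dict) -> Dict[str, bool]:
    """Run every check against one service and report pass/fail by name."""
    return {check.name: check.passes(service) for check in CHECKS}


# A made-up service record, as it might come out of the catalog.
billing_api = {
    "name": "billing-api",
    "owner": "payments",
    "runbook_url": "https://example.com/runbooks/billing-api",
    "framework_version": (6, 1),
    "encrypted_at_rest": True,
}

results = evaluate(billing_api)
failing = [name for name, ok in results.items() if not ok]
```

Because evaluation is just a function over the catalog, it can be re-run on a schedule. That addresses the "is it still compliant?" question that a one-time checklist leaves open.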
At OpsLevel, we use automated checks to categorize services in three tiers: bronze, silver, and gold. Not only can leaders set standards for each category, but they can also go to their service catalog and gain an understanding of the health of each service.
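To illustrate how tiering can work in principle, here is a minimal sketch that assumes levels are cumulative: a service earns a level only if it passes that level's checks and every lower level's checks. The check names and their assignment to levels are hypothetical, and this is not a description of how OpsLevel computes levels internally.

```python
from typing import Dict, List, Optional

# Ordered from lowest to highest.
LEVELS: List[str] = ["bronze", "silver", "gold"]

# Hypothetical assignment of checks to the level they unlock.
REQUIRED_CHECKS: Dict[str, List[str]] = {
    "bronze": ["has_owner"],
    "silver": ["has_runbook", "has_dashboard"],
    "gold": ["encrypted_at_rest"],
}


def level_for(results: Dict[str, bool]) -> Optional[str]:
    """Return the highest level whose checks, and all lower levels' checks, pass.

    Returns None when the service does not even reach the lowest level.
    A check missing from `results` is treated as failing.
    """
    achieved = None
    for level in LEVELS:
        if all(results.get(name, False) for name in REQUIRED_CHECKS[level]):
            achieved = level
        else:
            break  # levels are cumulative, so stop at the first gap
    return achieved


# Passing a gold check does not help while a silver check is still failing.
example_level = level_for({
    "has_owner": True,
    "has_runbook": True,
    "has_dashboard": False,
    "encrypted_at_rest": True,
})
```

Making levels cumulative is a design choice: it gives each team one unambiguous next step, the first failing check at the lowest unmet level.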
The cultural problem
Getting your team on board with production readiness
Once you’ve created a service catalog and implemented a system for measurement, how can you then get your developers on board? Unlike the first two challenges with production readiness, this last one is a cultural problem, which makes it particularly hard to solve. You don’t want to expend resources to set up a solution if no one in your organization is willing to implement it and use it.
How to solve it
When it comes to incentivizing teams to work on production readiness, three approaches work well. However, the approach that will work best depends on your engineering team and its culture. These are not one-size-fits-all. And, regardless of the approach that you take, invest as much as you can in automation.
Automating much of the process will make it easy for engineers to improve their systems. It will reduce the lift required by development teams to do the work you ask of them. Ultimately, if what you’re asking is relatively clearly defined and low effort, you’re likely to see more success. Here are the three approaches we recommend:
1. The people-oriented approach
The people-oriented approach ties production readiness directly into your existing goal-setting cycle.
If production readiness is something that you use to measure manager and team performance, it'll be more top of mind and you'll directly reward teams and people when they do well on these dimensions.
You can also add reporting on the health of services into regular operational review meetings. OpsLevel customers have seen success by sharing underperforming services with failing checks in regular meetings, as it gives visibility to these issues and flags them as important to leadership.
If you take a people-oriented approach, you need to make sure there's constant visibility to engineering leadership on the progress that teams are making on operational maturity. This is something that should be brought back to them regularly, whether it’s through meetings, regular reporting, OKRs, or another channel.
2. The process-oriented approach
The process-oriented approach is to carve out regular time where each team is directly focused on service maturity work. This works well with more autonomous teams. It can be done in a number of different ways depending on how you do software development and how your teams function.
For example, if you're working with a sprint model, you might say that 20% of the points in each of your sprints are going to be specifically for production readiness work. Or, you might decide that one team member is going to be solely dedicated to production readiness during the sprint. Perhaps you dedicate 100% of every fifth or sixth sprint to production readiness work.
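If it helps to see the arithmetic, the sketch below works through two of those options. The function names and the sample numbers are made up for illustration; plug in your own sprint capacity and cadence.

```python
def readiness_points(sprint_points: int, share_percent: int = 20) -> int:
    """Points reserved for production readiness work, rounded down."""
    return sprint_points * share_percent // 100


def is_readiness_sprint(sprint_number: int, every: int = 5) -> bool:
    """True when a whole sprint is dedicated to readiness (every Nth sprint)."""
    return sprint_number % every == 0


# A hypothetical team with 40 points of capacity per sprint.
reserved = readiness_points(40)

# Alternatively, sprints 5 and 10 out of the first ten go entirely to readiness.
dedicated = [n for n in range(1, 11) if is_readiness_sprint(n)]
```

Over a long enough horizon, reserving 20% of each sprint and dedicating every fifth sprint set aside the same share of capacity. The difference is whether the work arrives as a steady trickle or in focused bursts.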
You can also create a production readiness rubric ahead of time that shows the organization’s priorities. With the rubric in front of them, teams will know the next most important change to make to their services.
3. The systems-oriented approach
The systems-oriented approach integrates production readiness directly into your delivery processes. We've seen some organizations that were hesitant to roll out continuous deployment discover that adding production readiness checks gave them the confidence to turn it on.
This type of approach ensures that every service meets a minimum level of maturity. It helps teams move faster and can be a powerful motivator, as well.
This is a more challenging approach than taking a people- or process-oriented tack, but we’ve seen many customers successfully embrace it.
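One way to picture a systems-oriented gate is a small step in your delivery pipeline that refuses to deploy a service below a minimum level. The sketch below assumes the bronze, silver, and gold tiers described earlier and assumes the service's current level has already been looked up; the function names and the exit-code convention are hypothetical, not a specific CI product's API.

```python
from typing import Optional

# Rank each level so they can be compared; None means "no level reached yet".
LEVEL_RANK = {None: 0, "bronze": 1, "silver": 2, "gold": 3}


def may_deploy(service_level: Optional[str], minimum: str = "bronze") -> bool:
    """True when the service's level meets or exceeds the required minimum."""
    return LEVEL_RANK[service_level] >= LEVEL_RANK[minimum]


def gate(service_name: str, service_level: Optional[str], minimum: str = "bronze") -> int:
    """Return a process-style exit code: 0 lets the pipeline continue, 1 stops it."""
    if may_deploy(service_level, minimum):
        print(f"{service_name}: level {service_level} meets minimum {minimum}, deploying")
        return 0
    print(f"{service_name}: level {service_level} is below minimum {minimum}, blocking deploy")
    return 1


# Made-up services: one that clears the bar and one that has no level yet.
ok_code = gate("billing-api", "silver")
blocked_code = gate("legacy-exporter", None)
```

A pipeline step would typically pass the returned code to `sys.exit`, so a failing gate halts the deploy. Starting with a low minimum and raising it over time keeps the gate from blocking every team on day one.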
Solving challenges and embracing production readiness
Ultimately, you want to build more secure and compliant services while having visibility into your current environment. You want to be production ready at all times.
There are certainly challenges, but they can be overcome. To recap, you want to:
- fix your discovery issues by building an automated service catalog,
- solve your measurement problems by implementing service maturity with a continuous, automated system of checks, and
- align people, processes, and systems around these checks.
To get started on the path to production readiness, request an OpsLevel demo today.