Production Readiness in Depth: A Guide and Checklist

Our software ecosystems grow more complex every year. With new frameworks, dependencies, and technologies to help automate or simplify every step of the development life cycle, keeping track of the requirements that provide reliability and security can become difficult. That’s where production readiness reviews and checklists come in: they reduce cognitive load so you can focus more on features or potential failure points.

So how can you get your team or organization started with production readiness? Read on to learn more. And then see a sample checklist that provides a great start for your teams.

Your Guide to Production Readiness

First, let’s discuss the benefits of production readiness. Then, we’ll look at opportunities for rolling out production readiness in your organization. Spoiler: all applications can benefit from incorporating production readiness, even those that are already in production!

So let’s get started.

Benefits of Production Readiness

Production readiness checklists aren’t just another task on the to-do list. These reviews can tell us how ready a service and its team are for a production launch.

Production readiness is an investment. It’s not free, but by far the biggest return is improved customer experience. If your app is down, slow, or hacked, you will quickly lose revenue and customer trust. If your service doesn’t have the right resiliency, security, or observability to support customers, then you’ll want to know sooner rather than later.

Production readiness reviews also offer an opportunity for teams to take advantage of automation or tooling that simplifies the process while improving quality. Requiring automated testing or code scans streamlines code reviews and reduces the need for manual testing.

Also, as mentioned in the intro, production readiness checklists can reduce the cognitive load of having to remember all the different vulnerability and failure points we need to consider in our complex landscape.

Opportunities for Production Readiness

When first rolling out production readiness concepts, consider what teams or products may benefit the most.

For example, with services that haven’t launched in production, you can introduce a review prior to launch. However, make sure to leave time for the team to incorporate necessary changes or documentation in order to complete the mandatory items on the list.

For an application that’s just starting up, begin early and build in a production readiness mindset from day one. While the team builds out their services, they can refer to the production readiness checklist to see if there are design or tool considerations that they can incorporate. Even better is automating parts of how new services are created so they automatically adhere to the production readiness checklist.

Another opportunity for rolling out a production readiness process involves incidents. If a service struggles with incidents or security vulnerabilities, production readiness can help get the owning team back on track quickly, providing guidelines and time to implement changes that will benefit the service.

For existing services in production and with few incidents, the process can roll out gradually. The team will assess and complete the checklist over time. As another opportunity for existing apps, consider upcoming marketing launches or a known surge in demand, like Black Friday sales. Here, you’ve found another opportunity to introduce the concept to your teams.

Once you’ve exploited those opportunities, you can then move to make production readiness a continuous process. You can monitor your checklist continually to ensure you’re improving—or at least not degrading—your production readiness.
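
One way to picture continuous monitoring is to compare two snapshots of a service's checklist and flag anything that used to pass but no longer does. The sketch below is only an illustration: the `Snapshot` shape, the check names, and the `payments` service are all hypothetical, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Checklist results for one service at one point in time (hypothetical shape)."""
    service: str
    results: dict[str, bool]  # check name -> did it pass?


def score(snapshot: Snapshot) -> float:
    """Fraction of checks passing, from 0.0 to 1.0."""
    if not snapshot.results:
        return 0.0
    return sum(snapshot.results.values()) / len(snapshot.results)


def regressions(previous: Snapshot, current: Snapshot) -> list[str]:
    """Checks that passed before but fail (or are missing) now."""
    return sorted(
        name
        for name, passed in previous.results.items()
        if passed and not current.results.get(name, False)
    )


# Made-up data for illustration only.
last_week = Snapshot("payments", {"runbooks": True, "backups": True, "load_test": False})
today = Snapshot("payments", {"runbooks": True, "backups": False, "load_test": True})

print(f"score: {score(last_week):.2f} -> {score(today):.2f}")
print("regressed:", regressions(last_week, today))
```

Note that the overall score is unchanged in this example even though one check regressed, which is why tracking individual checks can matter as much as tracking the total.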

Production Readiness Checklist

Next, let’s look at what a production readiness checklist contains.

Depending on your application needs, you may have different requirements for production readiness, just like you have different needs for service-level objectives (SLOs). So, for a small internal service that’s not mission-critical, you don’t need the same level of operational readiness as for a customer-facing application that pays the bills.
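
As a rough sketch of tier-based requirements, the snippet below maps each application tier to the checks it must complete. The tier names and check names are assumptions made up for this example; substitute whatever tiers and checks your organization defines.

```python
# Hypothetical tiers and check names; adjust both to your organization.
REQUIRED_CHECKS: dict[str, set[str]] = {
    # Customer-facing, revenue-critical services.
    "tier1": {"ownership", "slos", "on_call", "runbooks", "dr_plan", "load_test"},
    # Small internal services that are not mission-critical.
    "tier3": {"ownership", "runbooks"},
}


def missing_checks(tier: str, completed: set[str]) -> set[str]:
    """Return the required checks for this tier that are not yet completed."""
    return REQUIRED_CHECKS[tier] - completed


done = {"ownership", "runbooks"}
print("tier3 missing:", sorted(missing_checks("tier3", done)))
print("tier1 missing:", sorted(missing_checks("tier1", done)))
```

The same set of completed checks satisfies the lower tier while leaving gaps for the higher one, which mirrors the idea that readiness requirements scale with how critical the service is.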

The Checklist

Finally, we get to see a sample checklist that you can use with your services.

As with anything software-related, consider what your customers and your systems require. You may have different production readiness needs based on your application tier. Having said that, this provides a great start for you. We split it into different categories, and you can further break down some tasks into fine-grained items if needed.

General

  1. Ownership: Service owners are identified. Contact information and methods are provided.
  2. Onboarding: Integration instructions for APIs are documented.
  3. Defined service-level indicators (SLIs) / service-level objectives (SLOs) / service-level agreements (SLAs): The SLIs and SLOs are documented and accessible. If applicable, you’ve also documented the SLAs.
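
To make the SLI/SLO item more concrete, here is a minimal sketch of an availability SLI and the error budget that follows from an SLO target. The function names and the request counts are illustrative assumptions; real SLIs depend on how your service defines a "good" request.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: the share of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # No traffic means nothing failed.
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means the budget is untouched, 0.0 means it is exactly used up,
    and a negative value means the SLO has been breached.
    """
    allowed_failure = 1.0 - slo_target
    if allowed_failure <= 0:
        raise ValueError("slo_target must be below 1.0 to leave an error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Made-up numbers: 500 failures out of one million requests, against a 99.9% SLO.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI: {sli:.4%}, error budget remaining: {remaining:.0%}")
```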

Disaster Recovery

  1. Disaster recovery (DR): DR plans have been documented and tested.
  2. Backups: Backups of data occur regularly.
  3. Redundancy: Services should include at least two instances and could require deployment in multiple regions or locations.

Deployment

  1. Deployment strategy: The automated deployment strategy has been documented. For example, strategies include blue-green, canary, or others to create safer zero-downtime deployments.
  2. Continuous integration: When engineers commit their changes, the system kicks off automated builds, tests, and deployment to a lower-level environment.
  3. Continuous delivery: Deploying to production involves nothing more than approval and a click of a button. Changelogs and release notes indicate what changes exist in each environment.
  4. Static code analysis: Code is automatically scanned, formatted, or linted according to coding standards.

Operations

  1. On-call policy: The service has an on-call system that pages the owning team for incidents. Ideally, this involves tools like PagerDuty or Squadcast.
  2. Incident management: The incident management and escalation processes have been documented. This includes processes for postmortem and long-term remediation.
  3. Runbooks: Runbooks have been written and are accessible, with known failure scenarios. You update runbooks whenever a new scenario is uncovered.
  4. Logging: The service utilizes centralized logs, and the logs can be accessed easily.
  5. Metrics: At a minimum, the Four Golden Signals are available for the service.
  6. Tracing: The application transactions can be traced, using the appropriate tools and sampling configuration for the service.
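
For the metrics item, the Four Golden Signals are latency, traffic, errors, and saturation. The sketch below derives them from a list of request records, purely as an illustration. The `Request` shape is hypothetical, saturation is passed in as a CPU utilization figure because it comes from resource metrics and not from requests, and in practice a metrics system would compute these for you.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One handled request (hypothetical record shape)."""
    duration_ms: float
    status: int  # HTTP status code


def golden_signals(requests: list[Request], window_seconds: float,
                   cpu_utilization: float) -> dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one time window.

    Assumes window_seconds is positive.
    """
    if not requests:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": cpu_utilization}
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value with at least 99% of samples at or below it.
    p99_index = math.ceil(0.99 * len(durations)) - 1
    server_errors = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": durations[p99_index],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": server_errors / len(requests),
        "saturation": cpu_utilization,
    }


# Made-up sample: four requests observed over a two-second window.
sample = [Request(100, 200), Request(200, 200), Request(300, 500), Request(400, 200)]
print(golden_signals(sample, window_seconds=2.0, cpu_utilization=0.6))
```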

Testing

  1. Unit tests: Unit tests execute at every code push, automatically.
  2. Integration tests: If appropriate, automated integration tests execute and pass successfully.
  3. End-to-end or acceptance tests: Automated end-to-end or acceptance tests run as part of the continuous integration / continuous deployment (CI/CD) pipeline. If manual testing is required, test results are documented.
  4. Broken tests: Failing tests break the build.

Resiliency

  1. Load testing: Load tests are automated or occur on a regular cadence. You document and publish the results.
  2. Stress testing: Stress tests are automated or occur on a regular cadence. You document and publish the results.
  3. Chaos engineering: Once the applications have proven the ability to stand up to load and stress, chaos engineering is integrated to identify weak points and opportunities to reduce failures.

Security

  1. Authentication/authorization: Each service or application requires proper authentication and authorization.
  2. Secrets management: Secrets are secured properly in a vault or secret store. Tools like truffleHog or git-secrets scan code to identify potential secrets.
  3. Static application security testing (SAST): Static code analysis tools like Checkmarx or Snyk monitor code in the CI/CD pipeline. The build breaks any time there are security vulnerabilities above a certain threshold. Thresholds are set based on service needs.
  4. Dynamic application security testing (DAST) / penetration (pen) testing: Automated DAST runs at appropriate intervals. Manual DAST or pen testing runs according to the security requirements of the service or company. As a note, some companies require DAST or pen testing prior to large changes or launches. Others run them quarterly. Your production readiness checklist should include the appropriate cadence for your situation.
  5. Dependency scan: All dependencies are using the latest or patched versions. For this, consider automating the scan using tools like FOSSA or Nexus Vulnerability Scanner to validate versions and licenses.
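
The idea of breaking the build when vulnerabilities cross a threshold can be sketched as a small gate over scanner findings. This is an assumption-heavy illustration: the finding format, the severity names, and the choice to block at or above the threshold are all made up here and do not reflect the output of Checkmarx, Snyk, or any other specific tool.

```python
# Hypothetical severity scale; real scanners define their own levels.
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}


def blocking_findings(findings: list[dict], threshold: str) -> list[dict]:
    """Return the findings whose severity is at or above the threshold."""
    limit = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f["severity"]] >= limit]


def gate_exit_code(findings: list[dict], threshold: str) -> int:
    """Exit code for a CI step: 0 lets the build continue, 1 breaks it."""
    return 1 if blocking_findings(findings, threshold) else 0


# Made-up findings for illustration only.
findings = [
    {"id": "EXAMPLE-1", "severity": "low"},
    {"id": "EXAMPLE-2", "severity": "high"},
]

print("threshold=high     ->", gate_exit_code(findings, "high"))
print("threshold=critical ->", gate_exit_code(findings, "critical"))
```

A CI step could pass the returned value to `sys.exit` so that a nonzero code fails the pipeline. Keeping the threshold as a parameter lets each service set it according to its own needs.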

Governance, Risk, and Compliance (GRC)

  1. GRC documentation: GRC checklists have been completed as required. Many companies have a separate GRC system available. In that case, this checklist indicates its completion and documentation.
  2. Confidentiality, integrity, availability (CIA) rating: The CIA rating of the service has been documented and published.

Anything Else?

As your organization learns more about its applications and your tech stack’s specific needs, you’ll want to revise the checklist. Perhaps you need additional checks specific to either front- or back-end applications. Or additional items specific to your deployment pipeline. On the other hand, you may also find opportunities to remove checklist items that don’t add value. As with code, you should iterate and revise the checklist to ensure that it meets your company’s changing needs.

Where Should We Keep Our Checklist?

For the last section, let’s cover where you’ll want to store or update your production readiness checklist.

Some companies will include the checklist inside each service’s GitHub repository as a markdown (.md) file. This is a quick and easy way of keeping the checklist close to the code, making it easy to update. However, depending on who has access to the repository or GitHub itself, it may not be easily accessible.
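
If the checklist lives in a markdown file, a small script can report progress from it. The sketch below assumes the checklist uses task-list syntax (`- [ ]` and `- [x]`); the sample content and item names are hypothetical.

```python
import re

# Matches task-list items such as "- [x] Runbooks written" or "* [ ] Backups tested".
TASK_PATTERN = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def parse_checklist(markdown: str) -> dict[str, bool]:
    """Map each task-list item to whether it is checked off."""
    items: dict[str, bool] = {}
    for line in markdown.splitlines():
        match = TASK_PATTERN.match(line)
        if match:
            items[match.group(2).strip()] = match.group(1).lower() == "x"
    return items


# Made-up checklist content for illustration only.
SAMPLE = """\
Production readiness

- [x] Service owners identified
- [ ] DR plan documented and tested
- [x] Runbooks written
"""

items = parse_checklist(SAMPLE)
done = sum(items.values())
print(f"{done}/{len(items)} items complete")
for name, checked in items.items():
    if not checked:
        print("open:", name)
```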

Alternatively, some companies rely on tools like Excel to track production readiness. A spreadsheet may be accessible to all, but it makes auditing updates and tracking changes difficult. And it runs the risk of someone deleting things accidentally. Hey, it happens.

Another alternative includes tools designed with production readiness in mind, like OpsLevel. With OpsLevel checks, you can design your checklist right into your service catalog. It’s visible to everyone and can be updated easily. Also, it lives right next to other service info like runbooks, monitoring and logging tools, and SLI/SLO information.

Parting Thoughts

No matter where your checklist resides or what you include, consider making production readiness reviews part of your company norms and processes.

And while you’re here, request a demo to see if OpsLevel can help with your production readiness and service ownership needs.

This post was written by Sylvia Fronczak. Sylvia is a software developer and SRE manager who has worked in various industries with various software methodologies. She’s currently focused on design practices that the whole team can own, understand, and evolve over time.
