OpsLevel Logo
Product

Visibility

Catalog

Keep an automated record of truth

Integrations

Unify your entire tech stack

OpsLevel AI

Restoring knowledge & generating insight

Standards

Scorecards

Measure and improve software health

Campaigns

Action on cross-cutting initiatives with ease

Checks

Get actionable insights

Developer Autonomy

Service Templates

Spin up new services within guardrails

Self-service Actions

Empower devs to do more on their own

Knowledge Center

Tap into API & Tech Docs in one single place

Featured Resource

OpsLevel's new MCP Server powers your AI Assistant with real-time context
OpsLevel's new MCP Server powers your AI Assistant with real-time context
Read more
Use Cases

Use cases

Improve Standards

Set and rollout best practices for your software

Drive Ownership

Build accountability and clarity into your catalog

Developer Experience

Free up your team to focus on high-impact work

Featured Resource

Production readiness checklist: An in-depth guide
Production readiness checklist: An in-depth guide
Read more
Customers
Our customers

We support leading engineering teams to deliver high-quality software, faster.

More customers
Hudl
Hudl goes from Rookie to MVP with OpsLevel
Read more
Hudl
Keller Williams
Keller Williams’ software catalog becomes a vital source of truth
Read more
Keller Williams
Duolingo
How Duolingo automates service creation and maintenance to tackle more impactful infra work
Read more
Duolingo
Resources
Our resources

Explore our library of helpful resources and learn what your team can do with OpsLevel.

All resources

Resource types

Blog

Resources, tips, and the latest in engineering insights

Guide

Practical resources to roll out new programs and features

Demo

Videos of our product and features

Events

Live and on-demand conversations

Interactive Demo

See OpsLevel in action

Pricing

Flexible and designed for your unique needs

Docs
Log In
Book a demo
Log In
Book a demo

Just launched: OpsLevel MCP

‍

Share this
Table of contents
 link
 
Resources
Blog

Production readiness checklist: An in-depth guide

Insights
Standardization
Platform engineer
Production readiness checklist: An in-depth guide
Megan Dorcey
|
June 30, 2025

Our software ecosystems grow more complex every year. With new frameworks, dependencies, AI integrations, and technologies to help automate or simplify every step of the development life cycle, keeping track of requirements that provide reliability and security can become difficult. And that’s why production readiness reviews and checklists help eliminate cognitive load. They let you focus more on features or potential failure points.Keep reading to learn the basics of production readiness and review a sample checklist to get started.

Automate your production readiness with OpsLevel's internal developer portal

‍

Benefits of Production Readiness

Production readiness checklists aren’t just another task on the to-do list. These reviews can tell us how well service teams are ready for production launch.

Production readiness is an investment. It’s not free, but by far the biggest return is improved customer experience. If your app is down, slow, or hacked, you will quickly lose revenue and customer trust. If your service doesn’t have the right resiliency, security, or observability to support customers, then you’ll want to know sooner rather than later.

Production readiness checklists can reduce the cognitive load of having to remember all the different vulnerability and failure points we need to consider in our complex landscape.

Production readiness reviews also offer an opportunity for teams to take advantage of automation or tooling that simplifies the process while improving quality. Requiring automated testing or code scans streamlines code reviews and removes manual testing.

Production readiness checklists can also reduce the cognitive load of having to remember all the different vulnerability and failure points we need to consider in our complex landscape.

Opportunities for production readiness

When first rolling out production readiness concepts, consider what teams or products may benefit the most.

For example, with services that haven’t launched in production, you can introduce a review prior to launch. However, make sure to leave time for the team to incorporate necessary changes or documentation in order to complete the mandatory items on the list.

For an application that’s just starting up, begin early and build in a production readiness mindset from day one. While the team builds out their services, they can refer to the production readiness checklist to see if there are design or tool considerations that they can incorporate. You can even automate parts of how new services are created so they automatically adhere to the production readiness checklist.

Another opportunity for rolling out a production readiness process involves incidents. If there’s a service that struggles with incidents or security vulnerabilities, production readiness can help get them back on track quickly, providing guidelines and time to implement changes that will benefit the service.

For existing services in production and with few incidents, the process can roll out gradually. The team will assess and complete the checklist over time. 

Once you’ve exploited those opportunities, you can then move to make production readiness a continuous process. You can monitor your checklist continually to ensure you’re improving, or at least not degrading, your production readiness.

Production readiness checklist

Let’s look at what a production readiness checklist contains.

Depending on your application needs, you may have different requirements for production readiness, just like you have different needs for service-level objectives (SLOs). For a small internal service that’s not mission-critical, you don’t need the same level of operational readiness as for a customer-facing application that pays the bills.

As with anything software-related, consider what your customers and your systems require. You may have different production readiness needs based on your application tier. We split the checklist into different categories, and you can further break down some tasks into fine-grained items if needed.

General

  1. Ownership: Service owners are identified. Contact information and methods are provided.
  2. Onboarding: Integration instructions for APIs are documented.
  3. Defined service-level indicators (SLIs) / service-level objectives (SLOs) / service-level agreements (SLAs): The SLIs and SLOs are documented and accessible. If applicable, you’ve also documented the SLAs.
  4. AI usage transparency: Clearly document where and how AI or ML is used in the service, including external dependencies like LLM APIs or internal models.

Disaster recovery

  1. Disaster recovery (DR): DR plans have been documented and tested.
  2. Backups: Backups of data occur regularly.
  3. Redundancy: Services should include at least two instances and could require deployment in multiple regions or locations.
  4. AI model recovery: Backup and version AI models or prompt templates. Document how to restore critical models or retrain them from clean datasets if needed.

Deployment

  1. Deployment strategy: The automated deployment strategy has been documented. For example, strategies include blue-green, canary, or others to create safer zero-downtime deployments.
  2. Continuous integration: When engineers commit their changes, the system kicks off automated builds, tests, and deployment to a lower-level environment.
  3. Continuous delivery: Deploying to production involves nothing more than approval and a click of a button. Changelogs and release notes indicate what changes exist in each environment.
  4. Static code analysis: Code is automatically scanned, formatted, or linted according to coding standards.
  5. AI model versioning and promotion: AI models or prompt logic are versioned, reviewed, and promoted across environments through a defined approval process.

Operations

  1. On-call policy: The service has an on-call system that pages the owning team for incidents.
  2. Incident management: The incident management and escalation processes have been documented. This includes processes for postmortem and long-term remediation.
  3. Runbooks: Runbooks have been written and are accessible, with known failure scenarios. You update runbooks whenever a new scenario is uncovered.
  4. Logging: The service utilizes centralized logs, and the logs can be accessed easily.
  5. Metrics: At a minimum, the Four Golden Signals are available for the service.
  6. Tracing: The application transactions can be traced, using the appropriate tools and sampling configuration for the service.
  7. AI observability: Track model-specific metrics (i.e., latency, accuracy, drift) and log inputs/outputs for debugging (with redaction as needed).

Testing

  1. Unit tests: Unit tests execute at every code push, automatically.
  2. Integration tests: If appropriate, automated integration tests execute and pass successfully.
  3. End-to-end or acceptance tests: Automated end-to-end or acceptance tests run as part of the continuous integration / continuous deployment (CI/CD pipeline). If manual testing is required, test results are documented.
  4. Broken tests: Failing tests break the build.
  5. AI model evaluation: Evaluate AI models for performance, safety, bias, or hallucination. Include baseline tests for accuracy, fairness, or expected behavior.
  6. Prompt/response validation: Validate LLM prompts and responses to ensure safe, scoped output. Include regression tests for AI prompt templates.

Resiliency

  1. Load testing: Load tests are automated or occur on a regular cadence. You document and publish the results.
  2. Stress testing: Stress tests are automated or occur on a regular cadence. You document and publish the results.
  3. Chaos engineering: Once the applications have proven the ability to stand up to load and stress, chaos engineering is integrated to identify weak points and opportunities to reduce failures.
  4. AI fallback strategies: Test fallback logic for AI components (i.e., default responses, cached outputs) if models or APIs are unavailable or misbehave.

Security

  1. Authentication/authorization: Each service or application requires proper authentication and authorization.
  2. Secrets management: Secrets are secured properly in a vault or secret store. 
  3. Static application security testing (SAST): Static code analysis tools monitor code in the CI/CD pipeline. The build breaks any time there are security vulnerabilities above a certain threshold. Thresholds are set based on service needs.
  4. Dynamic application security testing (DAST) / penetration (pen) testing: Automated DAST runs at appropriate intervals. Manual DAST or pen testing runs according to the security requirements of the service or company. As a note, some companies require DAST or pen testing prior to large changes or launches. Others run them quarterly. Your production readiness checklist should include the appropriate cadence for your situation.
  5. Dependency scan: All dependencies are using the latest or patched versions. For this, consider automating the scan using tools like FOSSA or Nexus Vulnerability Scanner to validate versions and licenses.
  6. Prompt injection protection: Validate and sanitize user inputs passed into AI prompts. Use allowlists, structured inputs, or escape mechanisms to prevent prompt injection.
  7. Model misuse protection: For generative AI, include safeguards against generating insecure code, private data, or unsafe outputs.

Governance, risk, and compliance (GRC)

  1. GRC documentation: GRC checklists have been completed as required. Many companies have a separate GRC system available. In that case, this checklist indicates its completion and documentation.
  2. Confidentiality, integrity, availability (CIA) rating: The CIA rating of the service has been documented and published.
  3. AI compliance checks: Ensure AI use complies with internal policies and external regulations (i.e., EU AI Act, CPRA, SOC 2). Log model changes and approvals for audit purposes.
  4. Responsible AI usage: Include review and documentation of AI-related risks, especially for user-facing generative applications or automated decision systems.

Revising the production readiness checklist 

As your organization learns more about its applications and your tech stack’s specific needs, you’ll want to revise the checklist. Perhaps you need additional checks specific to either front- or back-end applications. Or additional items specific to your deployment pipeline. On the other hand, you may also find opportunities to remove checklist items that don’t add value. As with code, you should iterate and revise the checklist to ensure that it meets your company’s changing needs.

Where should you keep your production readiness checklist?

Some companies will include the checklist inside the GitHub repo of each repository as a markdown (.md) file. This is a quick and easy way of keeping the checklist close to the code, making it easy to update. However, depending on who has access to the repository or GitHub itself, it may not be easily accessible.

Alternatively, some companies rely on tools like Excel to track production readiness. This may be accessible to all but becomes difficult to audit updates and track changes. And it runs the risk of someone deleting things accidentally. Hey, it happens.

Another alternative includes tools designed with production readiness in mind, like OpsLevel. With OpsLevel checks, you can design your checklist right into your service catalog. It’s visible to everyone and can be updated easily. Also, it lives right next to other service info like runbooks, monitoring and logging tools, and SLI/SLO information.

Make production readiness routine

Production readiness isn’t just a launch checklist, it’s your safety net for scaling reliable, secure, and AI-powered services. As your stack evolves, so should your readiness process. With OpsLevel, you can operationalize that process across every service, automate checks, and keep your teams focused on what matters: building great software.

Ready to simplify production readiness? Request a demo and see how OpsLevel makes it part of your everyday workflow.

‍

More resources

AI coding assistants are everywhere, but are developers really using them?
Blog
AI coding assistants are everywhere, but are developers really using them?

AI coding tools are at maximum hype, but are teams actually getting value from this new technology?

Read more
Fast code, firm control: An AI coding adoption overview for leaders
Blog
Fast code, firm control: An AI coding adoption overview for leaders

AI is writing your code; are you ready?

Read more
March Product Updates
Blog
March Product Updates

Some of the big releases from the month of March.

Read more
Product
Software catalogMaturityIntegrationsSelf-serviceKnowledge CenterBook a meeting
Company
About usCareersContact usCustomersPartnersSecurity
Resources
DocsEventsBlogPricingDemoGuide to Internal Developer PortalsGuide to Production Readiness
Comparisons
OpsLevel vs BackstageOpsLevel vs CortexOpsLevel vs Atlassian CompassOpsLevel vs Port
Subscribe
Join our newsletter to stay up to date on features and releases.
By subscribing you agree to with our Privacy Policy and provide consent to receive updates from our company.
SOC 2AICPA SOC
© 2024 J/K Labs Inc. All rights reserved.
Terms of Use
Privacy Policy
Responsible Disclosure
By using this website, you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. View our Data Processing Agreement for more information.
Okay!