Production readiness checklist: An in-depth guide
Our software ecosystems grow more complex every year. With new frameworks, dependencies, AI integrations, and technologies to help automate or simplify every step of the development life cycle, keeping track of requirements that provide reliability and security can become difficult. And that’s why production readiness reviews and checklists help eliminate cognitive load. They let you focus more on features or potential failure points.Keep reading to learn the basics of production readiness and review a sample checklist to get started.
Automate your production readiness with OpsLevel's internal developer portal
Benefits of Production Readiness
Production readiness checklists aren’t just another task on the to-do list. These reviews can tell us how well service teams are ready for production launch.
Production readiness is an investment. It’s not free, but by far the biggest return is improved customer experience. If your app is down, slow, or hacked, you will quickly lose revenue and customer trust. If your service doesn’t have the right resiliency, security, or observability to support customers, then you’ll want to know sooner rather than later.
Production readiness checklists can reduce the cognitive load of having to remember all the different vulnerability and failure points we need to consider in our complex landscape.
Production readiness reviews also offer an opportunity for teams to take advantage of automation or tooling that simplifies the process while improving quality. Requiring automated testing or code scans streamlines code reviews and removes manual testing.
Production readiness checklists can also reduce the cognitive load of having to remember all the different vulnerability and failure points we need to consider in our complex landscape.
Opportunities for production readiness
When first rolling out production readiness concepts, consider what teams or products may benefit the most.
For example, with services that haven’t launched in production, you can introduce a review prior to launch. However, make sure to leave time for the team to incorporate necessary changes or documentation in order to complete the mandatory items on the list.
For an application that’s just starting up, begin early and build in a production readiness mindset from day one. While the team builds out their services, they can refer to the production readiness checklist to see if there are design or tool considerations that they can incorporate. You can even automate parts of how new services are created so they automatically adhere to the production readiness checklist.
Another opportunity for rolling out a production readiness process involves incidents. If there’s a service that struggles with incidents or security vulnerabilities, production readiness can help get them back on track quickly, providing guidelines and time to implement changes that will benefit the service.
For existing services in production and with few incidents, the process can roll out gradually. The team will assess and complete the checklist over time.
Once you’ve exploited those opportunities, you can then move to make production readiness a continuous process. You can monitor your checklist continually to ensure you’re improving, or at least not degrading, your production readiness.
Production readiness checklist
Let’s look at what a production readiness checklist contains.
Depending on your application needs, you may have different requirements for production readiness, just like you have different needs for service-level objectives (SLOs). For a small internal service that’s not mission-critical, you don’t need the same level of operational readiness as for a customer-facing application that pays the bills.
As with anything software-related, consider what your customers and your systems require. You may have different production readiness needs based on your application tier. We split the checklist into different categories, and you can further break down some tasks into fine-grained items if needed.
General
- Ownership: Service owners are identified. Contact information and methods are provided.
- Onboarding: Integration instructions for APIs are documented.
- Defined service-level indicators (SLIs) / service-level objectives (SLOs) / service-level agreements (SLAs): The SLIs and SLOs are documented and accessible. If applicable, you’ve also documented the SLAs.
- AI usage transparency: Clearly document where and how AI or ML is used in the service, including external dependencies like LLM APIs or internal models.
Disaster recovery
- Disaster recovery (DR): DR plans have been documented and tested.
- Backups: Backups of data occur regularly.
- Redundancy: Services should include at least two instances and could require deployment in multiple regions or locations.
- AI model recovery: Backup and version AI models or prompt templates. Document how to restore critical models or retrain them from clean datasets if needed.
Deployment
- Deployment strategy: The automated deployment strategy has been documented. For example, strategies include blue-green, canary, or others to create safer zero-downtime deployments.
- Continuous integration: When engineers commit their changes, the system kicks off automated builds, tests, and deployment to a lower-level environment.
- Continuous delivery: Deploying to production involves nothing more than approval and a click of a button. Changelogs and release notes indicate what changes exist in each environment.
- Static code analysis: Code is automatically scanned, formatted, or linted according to coding standards.
- AI model versioning and promotion: AI models or prompt logic are versioned, reviewed, and promoted across environments through a defined approval process.
Operations
- On-call policy: The service has an on-call system that pages the owning team for incidents.
- Incident management: The incident management and escalation processes have been documented. This includes processes for postmortem and long-term remediation.
- Runbooks: Runbooks have been written and are accessible, with known failure scenarios. You update runbooks whenever a new scenario is uncovered.
- Logging: The service utilizes centralized logs, and the logs can be accessed easily.
- Metrics: At a minimum, the Four Golden Signals are available for the service.
- Tracing: The application transactions can be traced, using the appropriate tools and sampling configuration for the service.
- AI observability: Track model-specific metrics (i.e., latency, accuracy, drift) and log inputs/outputs for debugging (with redaction as needed).
Testing
- Unit tests: Unit tests execute at every code push, automatically.
- Integration tests: If appropriate, automated integration tests execute and pass successfully.
- End-to-end or acceptance tests: Automated end-to-end or acceptance tests run as part of the continuous integration / continuous deployment (CI/CD pipeline). If manual testing is required, test results are documented.
- Broken tests: Failing tests break the build.
- AI model evaluation: Evaluate AI models for performance, safety, bias, or hallucination. Include baseline tests for accuracy, fairness, or expected behavior.
- Prompt/response validation: Validate LLM prompts and responses to ensure safe, scoped output. Include regression tests for AI prompt templates.
Resiliency
- Load testing: Load tests are automated or occur on a regular cadence. You document and publish the results.
- Stress testing: Stress tests are automated or occur on a regular cadence. You document and publish the results.
- Chaos engineering: Once the applications have proven the ability to stand up to load and stress, chaos engineering is integrated to identify weak points and opportunities to reduce failures.
- AI fallback strategies: Test fallback logic for AI components (i.e., default responses, cached outputs) if models or APIs are unavailable or misbehave.
Security
- Authentication/authorization: Each service or application requires proper authentication and authorization.
- Secrets management: Secrets are secured properly in a vault or secret store.
- Static application security testing (SAST): Static code analysis tools monitor code in the CI/CD pipeline. The build breaks any time there are security vulnerabilities above a certain threshold. Thresholds are set based on service needs.
- Dynamic application security testing (DAST) / penetration (pen) testing: Automated DAST runs at appropriate intervals. Manual DAST or pen testing runs according to the security requirements of the service or company. As a note, some companies require DAST or pen testing prior to large changes or launches. Others run them quarterly. Your production readiness checklist should include the appropriate cadence for your situation.
- Dependency scan: All dependencies are using the latest or patched versions. For this, consider automating the scan using tools like FOSSA or Nexus Vulnerability Scanner to validate versions and licenses.
- Prompt injection protection: Validate and sanitize user inputs passed into AI prompts. Use allowlists, structured inputs, or escape mechanisms to prevent prompt injection.
- Model misuse protection: For generative AI, include safeguards against generating insecure code, private data, or unsafe outputs.
Governance, risk, and compliance (GRC)
- GRC documentation: GRC checklists have been completed as required. Many companies have a separate GRC system available. In that case, this checklist indicates its completion and documentation.
- Confidentiality, integrity, availability (CIA) rating: The CIA rating of the service has been documented and published.
- AI compliance checks: Ensure AI use complies with internal policies and external regulations (i.e., EU AI Act, CPRA, SOC 2). Log model changes and approvals for audit purposes.
- Responsible AI usage: Include review and documentation of AI-related risks, especially for user-facing generative applications or automated decision systems.
Revising the production readiness checklist
As your organization learns more about its applications and your tech stack’s specific needs, you’ll want to revise the checklist. Perhaps you need additional checks specific to either front- or back-end applications. Or additional items specific to your deployment pipeline. On the other hand, you may also find opportunities to remove checklist items that don’t add value. As with code, you should iterate and revise the checklist to ensure that it meets your company’s changing needs.
Where should you keep your production readiness checklist?
Some companies will include the checklist inside the GitHub repo of each repository as a markdown (.md) file. This is a quick and easy way of keeping the checklist close to the code, making it easy to update. However, depending on who has access to the repository or GitHub itself, it may not be easily accessible.
Alternatively, some companies rely on tools like Excel to track production readiness. This may be accessible to all but becomes difficult to audit updates and track changes. And it runs the risk of someone deleting things accidentally. Hey, it happens.
Another alternative includes tools designed with production readiness in mind, like OpsLevel. With OpsLevel checks, you can design your checklist right into your service catalog. It’s visible to everyone and can be updated easily. Also, it lives right next to other service info like runbooks, monitoring and logging tools, and SLI/SLO information.
Make production readiness routine
Production readiness isn’t just a launch checklist, it’s your safety net for scaling reliable, secure, and AI-powered services. As your stack evolves, so should your readiness process. With OpsLevel, you can operationalize that process across every service, automate checks, and keep your teams focused on what matters: building great software.
Ready to simplify production readiness? Request a demo and see how OpsLevel makes it part of your everyday workflow.