Understanding and Managing Configuration Drift

OpsLevel | July 9, 2022

For most enterprises, microservices and agile methodologies tend to go together. So, when you adopt a microservice architecture, you’re embracing more than just a new paradigm for building services. You’re also committing to deploying new code and configurations more often.

More deploys mean more change, and no matter how hard you try, code and configuration changes will slip through the change management cracks.

You end up with systems that are out of sync with your deployment pipelines, release packages, and source control. That means you don’t completely understand what’s running in production. That’s a condition you never want to find yourself in. It leads to mysterious outages, unexpected regressions, and unhappy customers.

When configuration falls out of sync, we refer to it as configuration drift. Let’s look at the different ways this happens, how we can avoid it, and how to fix it.

What is Configuration Drift?

First, let’s define what configuration drift is and why it occurs.

Configuration drift is when production infrastructure configurations fall out of sync with their expected state. For example, when primary and secondary networking systems have different configurations, they have “drifted” apart from each other. Or, when a software application’s configuration file differs from its latest package, it has drifted from its expected settings.
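As a minimal sketch of that definition, assume a configuration can be loaded as a flat set of key/value pairs. The function names `find_drift` and `has_drifted` and the sample settings are hypothetical, chosen only for illustration. The comparison reports settings that changed, settings that went missing, and settings that appeared unexpectedly.

```python
from typing import Any, Dict


def find_drift(expected: Dict[str, Any], actual: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare a running configuration against its expected state."""
    drift: Dict[str, Dict[str, Any]] = {"changed": {}, "missing": {}, "unexpected": {}}
    for key, want in expected.items():
        if key not in actual:
            # The expected setting is absent from the running system.
            drift["missing"][key] = want
        elif actual[key] != want:
            # The setting exists but holds a different value.
            drift["changed"][key] = {"expected": want, "actual": actual[key]}
    for key, have in actual.items():
        if key not in expected:
            # Someone added a setting the expected state does not know about.
            drift["unexpected"][key] = have
    return drift


def has_drifted(expected: Dict[str, Any], actual: Dict[str, Any]) -> bool:
    """Return True if any category of drift is non-empty."""
    return any(find_drift(expected, actual).values())


if __name__ == "__main__":
    # Hypothetical settings: the packaged file versus what is running.
    packaged = {"listen_port": 8080, "buffer_size": 4096, "tls": True}
    running = {"listen_port": 8080, "buffer_size": 16384, "debug": True}
    print(find_drift(packaged, running))
```

Real configurations are often nested or order-sensitive, so a production check would need a comparison that understands the file format in question.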

The consequences of configuration drift can be serious. It exposes your systems to potential data loss and extended outages. If a router fails and the secondary takes over with a different configuration, it may not function correctly, so a failover situation becomes a failure scenario.

If an engineer repairs a software application in production with a manual configuration change and doesn’t capture it in source control for the next release, the next release will write over the fix and reintroduce the problem.

As bad as it sounds, configuration drift is a fact of life. Changes happen, and even the most robust management system isn’t faultless.

Configuration drift is one of the primary reasons why disaster recovery and high availability systems fail. While you should make every possible effort to prevent it, you need to put procedures in place for discovering drift and recovering from it when it happens.

How Does Configuration Drift Happen?

The short answer to how configuration drift happens is “when someone subverts or skips the deployment process.” But that answer is oversimplified, and it assumes that a sound deployment process is in place.

Manual Configuration Changes

There’s an outage in production. Nearly every engineer’s been in this position, and all of them want to do the right thing: fix it as soon as humanly possible.

Sometimes the fix is a simple configuration change. A port forward is missing on a firewall. You need to toggle an application setting because of a new client. A buffer value is suddenly too small because of increased client traffic.

The fastest way to fix those problems is to update the config and restart the process or let it re-read its configuration. Problem solved!

But now, you need to reflect that change in your failover systems and source control. If that doesn’t happen, you have configuration drift.

Failed or Incomplete Deployments

If a configuration update isn’t deployed to all your production systems, they fall out of sync with their expected state and, depending on the problem, with each other.

This problem can occur when a configuration change isn’t included in a new software release due to a failed merge or a packaging error. Or it can happen when a deployment fails, and some systems receive the change while others do not. Either way, you have configuration drift.
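One way to spot a partial deployment is to fingerprint each system's configuration and see whether the fleet agrees. The sketch below assumes each host's configuration has already been collected into a dictionary. The host names, settings, and function names are hypothetical, and how you gather the configurations is outside its scope.

```python
import hashlib
import json
from collections import defaultdict
from typing import Any, Dict, List


def fingerprint(config: Dict[str, Any]) -> str:
    """Hash a configuration in a key-order-independent way."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def group_hosts_by_config(fleet: Dict[str, Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group host names by the fingerprint of their configuration."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for host, config in fleet.items():
        groups[fingerprint(config)].append(host)
    return {digest: sorted(hosts) for digest, hosts in groups.items()}


def out_of_sync_hosts(fleet: Dict[str, Dict[str, Any]], expected: Dict[str, Any]) -> List[str]:
    """List hosts whose configuration does not match the expected state."""
    want = fingerprint(expected)
    return sorted(host for host, config in fleet.items() if fingerprint(config) != want)


if __name__ == "__main__":
    # Hypothetical fleet: the deployment reached app-1 and app-2 but not app-3.
    expected = {"buffer_size": 16384, "listen_port": 8080}
    fleet = {
        "app-1": {"listen_port": 8080, "buffer_size": 16384},
        "app-2": {"buffer_size": 16384, "listen_port": 8080},
        "app-3": {"listen_port": 8080, "buffer_size": 4096},
    }
    print(len(group_hosts_by_config(fleet)), "distinct configurations")
    print("out of sync:", out_of_sync_hosts(fleet, expected))
```

More than one group means the systems have drifted from each other. Comparing against the expected fingerprint also catches the case where every host is consistent but wrong, such as a change that never made it into the release.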

Identifying and Preventing Configuration Drift

So, how do you go about preventing configuration drift? How do you know when it’s happening? These questions go hand in hand because the answers depend on how you manage your environments and how you want to manage changes.

Configuration Drift and Configuration State

We’ve referred to updating your configuration in source control a few times so far. That’s because we hope that you’ve already adopted configuration as code. If you haven’t, it’s time.

The term configuration drift implies that your configuration has an expected state and one or more systems no longer match it. This means that you need an expected state stored somewhere, and there’s no better place than source control.

Avoiding Configuration Drift

The easiest way to manage manual changes is to never make them. This sounds glib, but it’s not a bad approach, either. You can get all or part of the way there with the right tools and processes.

Manual changes to infrastructure aren’t necessary if you manage your system changes via infrastructure as code (IaC) tools like Puppet, Chef, Ansible, or Salt. You can also take steps to make manual changes impossible for all or most engineers by locking down your system permissions.

For application code, you can avoid manual changes with Continuous Integration/Continuous Deployment (CI/CD) pipelines that make deploying changes fast, simple, and reproducible. Pipelines also help avoid failed deployments since they make it easier to detect errors.

Completely eliminating manual changes is probably impossible for most enterprises, though. IaC technology is powerful and mature, but it rarely covers all circumstances. Even though the technology covers a lot of ground, getting the processes and culture in place is difficult.

The key is to make manual changes unnecessary in as many cases as possible. If doing the right thing is easy, engineers will choose it over the more difficult option every time.

Eliminating Drift

If you’re not interested in finding drift before you fix it, you can periodically destroy and rebuild your environments. This may sound like a good option if you have a robust CI/CD pipeline and automated system tools. You may already be doing this.

But, as we said above, configuration drift happens. Wiping out a manual change might result in reintroducing a problem that someone fixed earlier, but failed to record.

Identifying Drift

If you’re interested in capturing drift and evaluating the changes before deleting them, you need a way to proactively look for changes and highlight them. This gives you a chance to see if the drifted changes should be committed back to your master copy.

If your expected configuration state is stored in source control, it’s reproducible. So, with the right tools, you can compare it to production and find the changes. The IaC tools we mentioned above can do much of this work for you.
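For a sense of what that comparison involves, here is a minimal sketch using Python's standard `difflib` module. It assumes you have already read the expected file from a source control checkout and fetched the matching file from a production system. The function name `drift_report`, the file name `app.conf`, and the sample settings are hypothetical.

```python
import difflib


def drift_report(expected_text: str, actual_text: str, name: str = "app.conf") -> str:
    """Return a unified diff from the expected config to the running one.

    An empty string means the two texts are identical (no drift).
    """
    diff = difflib.unified_diff(
        expected_text.splitlines(keepends=True),
        actual_text.splitlines(keepends=True),
        fromfile=f"source-control/{name}",
        tofile=f"production/{name}",
    )
    return "".join(diff)


if __name__ == "__main__":
    # Hypothetical contents: someone raised the buffer size by hand in production.
    expected = "port=8080\nbuffer=4096\n"
    actual = "port=8080\nbuffer=16384\n"
    report = drift_report(expected, actual)
    print(report if report else "no drift detected")
```

A diff like this gives an engineer something concrete to review before deciding whether a drifted change belongs in source control or should be overwritten.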

Managing Configuration Drift

In this post, we started out by defining configuration drift. We saw how it’s a common problem that can cause serious downtime and data loss. Then we looked at its two most common causes, manual changes and deployment problems, and discussed several methods for avoiding those problems and detecting the configuration issues they often cause.

Configuration drift is part of managing a large set of applications and infrastructure, but you can manage it with the right tools and procedures. OpsLevel can help with tools to track your microservices and integrate with your source control to track changes to configuration files. If you’re ready to start tracking your configuration changes and manage configuration drift, request your OpsLevel demo today.

This post was written by Eric Goebelbecker. Eric has worked in the financial markets in New York City for 25 years, developing infrastructure for market data and financial information exchange (FIX) protocol networks. He loves to talk about what makes teams effective (or not so effective!).
