
Most GitOps rollouts don't fail because of tooling. They fail because teams bring an imperative mindset into a declarative system.
I see this when teams that have spent a decade running push-based deployments try to replicate that same model inside ArgoCD or Flux. Instead of asking "How would I approach this from scratch?" they ask "How do I make this work like it used to?" That question leads to workarounds that undermine the very advantages GitOps offers: reproducibility, audit trails, and disaster recovery that isn't a high-stakes guessing game.
The litmus test is simple. Can your team confidently answer "What happens if I destroy everything and rebuild from Git?" If yes, your implementation is in good shape. If not, these best practices are where to focus.
What follows comes from years of helping companies like GameStop and UKi scale GitOps across multiple clusters, teams, and compliance boundaries. The principles of GitOps hold up fine at any scale; what breaks is the infrastructure around them.
The repo structure that works for one team and one cluster will start cracking once a second or third team comes on board. PRs pile up, engineers accidentally overwrite each other's configs, and nobody can tell who owns what.
This is one of the first things I look at when we’re brought in to help scale a GitOps setup, because a bad repo structure creates coordination overhead that worsens with every new team. If your engineers are spending more time resolving merge conflicts and waiting on PR approvals than shipping, the repo is the bottleneck.
There's no universally correct answer here, and anyone selling you one is probably selling a platform.
Monorepos provide visibility and make centralized policy enforcement easier, but they struggle with the weight of multiple teams committing at different frequencies. Multi-repo setups give teams autonomy, but you need stronger conventions and tooling to maintain consistency.
What I've found works best for enterprise teams is separating repos by concern type. Application config, platform config, and policy definitions have different ownership models, different change frequencies, and different review requirements. Mixing them in the same repo creates friction that compounds over time.
When we worked with UKi, a cybersecurity training platform operating across both commercial AWS and GovCloud, separating platform config from application config early on turned out to be one of the highest-leverage decisions in the engagement. When FedRAMP compliance requirements later forced changes to the platform layer, application teams didn't have to touch their repos. That separation saved weeks of cross-team coordination.
Branch-per-environment sounds logical in a planning meeting, but breaks down in practice.
A scenario I’ve seen play out repeatedly:
Your staging branch has an API timeout set to 30 seconds, but production needs 60. Someone changes the retry logic in staging, tests it, and tries to merge into the production branch. They hit a merge conflict that has nothing to do with their actual change, as the timeout had diverged months ago across a dozen config files buried in branch history.
Trunk-based development with directory-per-environment avoids this entirely. You can see the state of any environment without switching context, promoting changes is explicit, and drift between environments becomes visible instead of hiding inside branch history that nobody reviews.
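A minimal sketch of that layout, using a Kustomize-style structure with directory and file names purely illustrative:

```
app-config/
├── base/                        # manifests shared by every environment
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
└── envs/
    ├── staging/
    │   ├── kustomization.yaml   # references ../../base
    │   └── timeout-patch.yaml   # 30-second timeout
    └── production/
        ├── kustomization.yaml
        └── timeout-patch.yaml   # 60-second timeout
```

Every environment lives on one branch, and a diff between the staging and production directories shows exactly how they differ.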
This gets overlooked until someone is paged in the middle of the night and spends ten minutes in Slack trying to figure out which repo owns the broken deployment.
Consistent naming conventions, logical directory structures, and clear ownership labels matter far more at scale than teams expect. At a minimum, standardize all three: one naming scheme applied everywhere, a directory layout that's predictable from one team's repo to the next, and ownership labels on every application so the owning team is visible in the manifests rather than in tribal knowledge.
The goal: when something breaks at 2 AM, the on-call engineer can trace the alert to the owning team and relevant config in under a minute. If that path takes longer, your repo structure needs work.
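What makes that trace possible is ownership metadata living in the manifests themselves. A minimal sketch, with the label and annotation keys as assumed conventions rather than anything standard:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments
  labels:
    team: payments                                # owning team, queryable from any alert
  annotations:
    ownership/config-repo: https://github.com/example-org/payments-config  # where the config lives
    ownership/oncall-channel: "#payments-oncall"  # where alerts should land
```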
The entire GitOps model is built on Git being the source of truth, but secrets obviously can't be stored as plaintext. This is where teams first discover that adopting GitOps tools and actually operating in a GitOps model are two different things.
Before reaching for tooling, first define the exact problem your secrets workflow is solving. Compliance? Operational speed? Reducing blast radius when credentials leak? Too many teams pick tools based on what's easiest to set up rather than what scales, and the root cause is almost always skipping this question.
Most teams handle secrets reactively: base64-encoded Kubernetes Secrets, .gitignore for local files, and a verbal agreement that no one commits credentials. When that inevitably breaks, they reach for one of two common tools:
Sealed Secrets is usually first. It works for small setups, but the encryption key lives inside each cluster, which creates problems at scale: secrets are sealed against a specific cluster's key, so they can't be moved between clusters without re-encrypting, and if a cluster is lost without that key backed up, every sealed secret in Git becomes unrecoverable.
SOPS, when used with ArgoCD or Flux, is more flexible (supporting AWS KMS, GCP KMS, age, and PGP), but it pushes encryption responsibility to individual developers. Everyone has to encrypt locally before committing, which becomes a bottleneck and an error surface as the team grows.
Both tools follow the same approach: encrypting secrets and storing them in Git. That keeps everything together and is simpler to start with, but once you're managing multiple teams or dealing with compliance requirements, the overhead starts to work against you.
For organizations operating at scale, the pattern I keep coming back to is storing secret references in Git while keeping the actual values in AWS Secrets Manager or Vault. External Secrets Operator handles sync into clusters at runtime, which means Git sees the audit history without holding sensitive data.
This approach scales more cleanly because rotation happens in the secrets manager instead of through Git commits and redeploys, access control and audit live in one place, plaintext never touches Git, and developers aren't individually responsible for encrypting files before committing.
For cross-account permissions, OIDC with IRSA eliminates long-lived credentials.
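A minimal sketch of that pattern with External Secrets Operator against AWS Secrets Manager; the names, namespace, region, and secret paths are placeholders, and the service account is assumed to be annotated with an IAM role via IRSA:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets-manager
  namespace: payments
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-gov-west-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa    # IRSA-annotated service account, no static keys
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: payments
spec:
  refreshInterval: 1h                    # periodic re-sync picks up rotations automatically
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-credentials                 # Kubernetes Secret created in the cluster at runtime
  data:
    - secretKey: password
      remoteRef:
        key: prod/payments/db            # path in AWS Secrets Manager
        property: password
```

Git only ever sees this reference; the value stays in the secrets manager and rotates there.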
When we built this for UKi's cybersecurity environment, we backed secrets with AWS KMS, so rotation happens in the secrets manager. This was part of a broader effort to position UKi for FedRAMP certification, which requires demonstrating airtight secrets management, audit trails, and access controls across its GovCloud infrastructure.
The principles remain consistent regardless of tooling: least privilege access, automated rotation, complete audit trails, and plaintext never touching Git.
Configuration drift will happen at enterprise scale. Someone will run kubectl edit during a production incident, an autoscaler will adjust replica counts, or a hotfix will bypass the PR flow because the site is down and the approvers are asleep.
None of that is surprising, but what catches teams off guard is how quickly undetected drift accumulates and breaks the core promise of GitOps—that Git is the source of truth.
When your declared state and actual state diverge, disaster recovery becomes unreliable, compliance audits become tougher, and debugging production issues requires investigating multiple sources of truth.
The most effective approach to preventing drift is to make Git the only way changes reach production. Lock down direct kubectl apply, console edits, and any CI scripts that push changes outside the GitOps flow. Your GitOps controller should be the only entity with write access to production resources.
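One piece of that lock-down, sketched as plain Kubernetes RBAC: humans get the built-in read-only role, while the GitOps controller's service account keeps its own write permissions. The group name is a placeholder, and a real lock-down also has to cover cloud console access and CI credentials.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: developers-read-only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                  # built-in read-only ClusterRole
subjects:
  - kind: Group
    name: developers          # placeholder group from your identity provider
    apiGroup: rbac.authorization.k8s.io
```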
That said, incidents don’t wait for PR approvals. When the site is down at 3 AM, someone’s going to fix it directly with break-glass credentials. That's fine, and pretending otherwise means the workarounds happen without any tracking.
What matters is having a clear process for what happens during and after the incident: break-glass access is scoped and logged, the manual change gets captured while it's still fresh, and once the fire is out it's either committed back to Git or deliberately reverted so declared and actual state converge again.
Drift rarely persists because of a lack of detection tooling. ArgoCD and Flux both flag when actual state diverges from declared state. The problem is that those signals are buried in dashboards that only platform engineers check, if anyone checks them at all.
Detection only works if the alerts reach the people who can act on them, and if there's a clear policy for how quickly drift gets resolved. For most enterprise teams I've worked with, that means routing drift alerts to the team that owns the affected application rather than to a central dashboard, and setting an explicit window within which the drift has to be committed back to Git or reverted.
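With ArgoCD, for example, that routing can be a custom notifications trigger plus a per-application subscription. A minimal sketch, assuming the Slack integration is already configured and with the trigger, template, and channel names purely illustrative:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  trigger.on-out-of-sync: |
    - when: app.status.sync.status == 'OutOfSync'    # live state no longer matches Git
      send: [app-drift]
  template.app-drift: |
    message: "{{.app.metadata.name}} has drifted from the state declared in Git."
  service.slack: |
    token: $slack-token
```

Each team then subscribes its own applications with an annotation like notifications.argoproj.io/subscribe.on-out-of-sync.slack: payments-oncall, so the alert lands with the people who own the fix.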
The operational question for leadership is, "Do we have visibility into how often our teams are working outside of Git, and do we have a policy for bringing those changes back?" If the answer to either is no, drift is accumulating whether you see it or not.
If your team keeps bypassing Git to make the same kinds of changes, that's worth paying attention to. Frequent drift in the same area usually points to a gap in your declared state: a missing config option, an incomplete manifest, or a workflow that's too slow for real operational pace.
I don't think the goal is being purely declarative or purely imperative. It's knowing where the line is. What parts of the system should be declared and owned by tooling? What parts still need custom logic? And when the ecosystem catches up, you delete the custom logic and fold that piece back into the declarative model.
The best GitOps implementations evolve because they treat drift as input. When you repeatedly see the same manual change, improve the declared state to account for it. That's how you turn a recurring problem into a permanent fix.
When you're managing a handful of applications, governance can live in PR reviews and institutional knowledge. Someone senior catches the misconfiguration, leaves a comment, and it gets fixed before merge.
That falls apart when you're running dozens of teams, hundreds of applications, and auditors are asking for evidence rather than assurances.
I've found that the biggest infrastructure mistakes don't happen because teams choose the wrong trade-off in the moment. They happen because teams assume the trade-off that matters today will still be the one that matters in a year.
Policy-as-code is how you build governance that adapts when priorities shift, because they always do.
The switch from "we review things in PRs" to "our policies enforce themselves" is one of the highest-leverage moves an enterprise team can make. Instead of relying on someone senior to catch that a deployment is missing resource limits or running a privileged container, you codify those rules so they're enforced automatically.
Open Policy Agent (with Gatekeeper) and Kyverno both handle this well, and the choice depends on your platform team's preferences. What matters at the leadership level is the coverage model: policies enforced in the pipeline, where developers get feedback before a change merges, and policies enforced at admission time, where the cluster rejects non-compliant resources no matter how they arrive.
Neither alone is sufficient at scale. Pipeline enforcement without admission control means a direct kubectl apply can bypass every rule you've written. Admission control without pipeline enforcement means developers won't discover their config is non-compliant until deployment, slowing everyone down.
Combining both gives consistent, auditable enforcement that doesn't depend on who's reviewing the code or how busy they are that week.
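As a concrete sketch of what codifying a rule looks like, here is a Kyverno policy requiring resource limits; the policy and rule names are illustrative, and the same definition can be checked against manifests in CI with the Kyverno CLI and enforced in-cluster at admission time:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce       # reject non-compliant resources at admission
  rules:
    - name: check-container-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must declare CPU and memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"            # any non-empty value satisfies the pattern
                    memory: "?*"
```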
This is worth calling out because it’s a gap in almost every enterprise GitOps setup I’ve audited. Teams invest significant effort in encrypting and managing application secrets, then deploy those secrets through CI pipelines running on AWS access keys or service account tokens that were set once and haven’t been rotated since.
OIDC federation solves this by replacing static credentials with short-lived tokens. Instead of storing long-lived keys that can leak or be forgotten, your CI provider authenticates directly with your cloud provider through identity federation.
Replacing long-lived CI credentials with OIDC is a high-impact, low-effort security improvement in a GitOps pipeline, and I recommend it to virtually every enterprise team we work with.
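What that looks like depends on your CI provider. A minimal sketch for GitHub Actions deploying to AWS, with the role ARN and region as placeholders:

```yaml
name: deploy
on: push
permissions:
  id-token: write        # allows the job to request an OIDC token from GitHub
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy   # placeholder role
          aws-region: us-east-1
      # later steps run with short-lived credentials; no static AWS keys stored in CI
```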
Policies should live in Git and go through the same PR review process as everything else. This gives you a proper audit trail for policy changes and prevents the shadow governance problem, where rules exist only in someone's head or buried in a wiki page that hasn't been updated in a year.
I also recommend keeping policies in a separate repo from application config. Policies are typically owned by a platform or security team and change on a different cadence than the services they govern. Mixing the two creates the same ownership confusion we talked about in the repository structure section.
Most GitOps documentation assumes a single cluster, which is a fine starting point, but not where enterprise teams live. The reality is usually multiple clusters spread across environments, regions, or compliance boundaries.
Running GitOps on each cluster is the easy part. Keeping them all consistent with each other and promoting changes reliably without duplicating configuration everywhere is where it gets hard.
Running multiple clusters is fundamentally a business decision, not just a technical one. You do it for compliance boundaries, regional requirements, or fault isolation. Instead of focusing on which architectural pattern to use, understand what you're trying to achieve with the separation and whether your architecture supports that goal.
There are a few common patterns for managing multiple clusters, and choosing one deliberately saves you from painful re-architecture later:
Hub-and-spoke runs a central management cluster that handles deployments to all remote clusters. You get centralized visibility and control, but the hub becomes a single point of dependency. If it goes down, deployments stop everywhere.
Decentralized runs independent controllers in each cluster. It offers better fault isolation, and each cluster can operate autonomously—but you need stronger conventions to maintain consistency since there's no single dashboard showing you everything.
Hybrid separates controllers by compliance or business boundary while unifying the repo structure and promotion flow across environments.
The right choice depends on your team structure, security boundaries, and operational maturity. Both ArgoCD and Flux support all of these patterns. The important thing is making the decision early and intentionally rather than letting each cluster evolve its own setup organically.
Promoting changes from dev to staging to production is where most multi-environment setups introduce risk. When promotion means copying config files between directories or cherry-picking commits, you're one tired engineer away from shipping the wrong version to production.
The pattern that works at scale is separating what stays the same across all environments from what actually differs. Changes land in a base configuration, and environment-specific overlays handle the differences (region, resource sizing, compliance-specific labels) through Kustomize patches or Helm values files. Promoting a change means updating an overlay reference, not rewriting config.
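With Kustomize, for example, a production overlay might look like the sketch below; the names mirror the earlier timeout example and are illustrative:

```yaml
# envs/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                     # everything shared comes from base, unchanged
patches:
  - path: timeout-patch.yaml       # only what genuinely differs: the 60-second timeout
    target:
      kind: Deployment
      name: checkout-api
```

Promotion then becomes a small, reviewable change to the overlay, such as a new image tag or base reference, rather than a copy of the whole config.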
Adding automated gates to the promotion path, like test suites that must pass before a promotion proceeds and approval steps for production changes, reduces the manual surface while keeping control where it matters.
Once you're past a handful of clusters, configuring each one by hand guarantees inconsistency. Helm charts or Kustomize bases should define the standard, with overlays reserved for things that genuinely differ between clusters: region, resource sizing, compliance-specific labels.
ArgoCD's ApplicationSets and Flux's Kustomization targeting both let you deploy the same config across many clusters from a single definition. This drastically reduces the maintenance surface and makes it much harder for individual clusters to silently drift into snowflake territory.
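A minimal ApplicationSet sketch using the cluster generator; the repo URL, project, and path are placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: platform-baseline
  namespace: argocd
spec:
  generators:
    - clusters: {}                 # one Application per cluster registered in ArgoCD
  template:
    metadata:
      name: 'platform-{{name}}'    # cluster name filled in by the generator
    spec:
      project: platform
      source:
        repoURL: https://github.com/example-org/platform-config   # placeholder repo
        targetRevision: main
        path: platform/base        # the same config rolled out to every cluster
      destination:
        server: '{{server}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
```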
This is exactly what we did with UKi. When we started, they had a single environment that was built manually and wasn't reproducible. If it failed, recovery was uncertain. We moved them to declarative, reproducible environments, and what had been weeks of provisioning work became a repeatable process. UKi's co-founder, Dr. Scott Wells, says it best:
"I initially thought we were 3 years out of being able to build this ourselves because of the complexity. I was amazed when I realized how quickly things were moving along."
We had a working proof of concept in 2.5 months and an MVP in 6, with a 97% reduction in deployment time.
The practices in this article all trace back to the same insight: enterprise GitOps breaks not because the tools are wrong, but because the decisions around those tools were made too early, too quickly, and became painful to change later.
If your team can confidently rebuild everything from Git, promote changes across environments without manual copying, rotate secrets without redeploying, and prove compliance through code rather than conversations, your GitOps implementation is in solid shape.
If any of those feel shaky, you're in the phase where most teams Pelotech works with reach out. We've worked through these exact problems for organizations ranging from growth-stage startups to regulated enterprises and large-scale operations.
I'll also be the first to tell you when something isn't worth doing. Sometimes the best move is to simplify, defer, or kill a project entirely. But if your GitOps architecture stretches shipping timelines and the cost of getting it wrong is high, that's the kind of problem we solve daily.