5 GitOps Best Practices for Enterprises: Scaling Secrets, Drift, and Multi-Cluster Management

Learn enterprise GitOps best practices for secrets management, drift detection, policy-as-code, and multi-cluster scaling from Kubernetes-certified engineers.

Author: Pelotech

Published on: March 3, 2026

Most GitOps rollouts don't fail because of tooling. They fail because teams bring an imperative mindset into a declarative system.

I see this when teams that have spent a decade running push-based deployments try to replicate that same model inside ArgoCD or Flux. Instead of asking "How would I approach this from scratch?" they ask "How do I make this work like it used to?" That question leads to workarounds that undermine the very advantages GitOps offers: reproducibility, audit trails, and disaster recovery that isn't a high-stakes guessing game.

The litmus test is simple. Can your team confidently answer "What happens if I destroy everything and rebuild from Git?" If yes, your implementation is in good shape. If not, these best practices are where to focus.

What follows comes from years of helping companies like GameStop and UKi scale GitOps across multiple clusters, teams, and compliance boundaries. The principles of GitOps hold up fine at any scale; what breaks is the infrastructure around them.

1. Design Your Repo Structure for Multiple Teams

The repo structure that works for one team and one cluster will start cracking once a second or third team comes on board. PRs pile up, engineers accidentally overwrite each other's configs, and nobody can tell who owns what.

This is one of the first things I look at when we’re brought in to help scale a GitOps setup, because a bad repo structure creates coordination overhead that worsens with every new team. If your engineers are spending more time resolving merge conflicts and waiting on PR approvals than shipping, the repo is the bottleneck.

Pelotech GitOps Visuals

Why Mixed Repos Break as Teams Scale

The problem isn't the tools. Application config, platform config, and policy definitions have different owners, different change frequencies, and different review requirements, and mixing them creates friction that compounds with every new team.

Mixed repo, one team:

  gitops-repo/
    apps/        deployment.yaml, service.yaml
    infra/       cluster-config.yaml, ingress-controller.yaml
    policies/    resource-limits.yaml, network-policy.yaml

  • PRs pile up. Team A's app deploy blocks Team B's infra change because they're queuing on the same repo.
  • Nobody knows who owns what. A policy change buried in an app PR gets merged without platform review.
  • Compliance forces a platform change, and every app team has to touch their manifests too: weeks of coordination.

Separated by concern, multiple teams:

  • app-config/ (app teams): services/, deployments/, ingress/
    Ships dozens of times a day. Each team owns its own path, so there are no cross-team conflicts.
  • platform-config/ (platform team): cluster/, networking/, secrets-operator/
    Changes infrequently. When FedRAMP forces a platform update, app teams don't touch this repo.
  • policy/ (security/compliance): gatekeeper/, kyverno/, rbac/
    Different owners, different review requirements. Policies live here, not buried in an app PR.
"If you slow one platform engineer down 10%, but everyone else gets 10 times faster, that's the trade you make every time. Repo structure decisions should optimize for organizational speed, not individual convenience."

Monorepo or Multi-Repo Depends on Team Structure

There's no universally correct answer here, and anyone selling you one is probably selling a platform. 

Monorepos provide visibility and make centralized policy enforcement easier, but they struggle with the weight of multiple teams committing at different frequencies. Multi-repo setups give teams autonomy, but you need stronger conventions and tooling to maintain consistency.

What I've found works best for enterprise teams is separating repos by concern type. Application config, platform config, and policy definitions have different ownership models, different change frequencies, and different review requirements. Mixing them in the same repo creates friction that compounds over time.

When we worked with UKi, a cybersecurity training platform operating across both commercial AWS and GovCloud, separating platform config from application config early on turned out to be one of the highest-leverage decisions in the engagement. When FedRAMP compliance requirements later forced changes to the platform layer, application teams didn't have to touch their repos. That separation saved weeks of cross-team coordination.

Use Trunk-Based Development, Not Branch-Per-Environment

Branch-per-environment sounds logical in a planning meeting, but breaks down in practice.

A scenario I’ve seen play out repeatedly: 

Your staging branch has an API timeout set to 30 seconds, but production needs 60. Someone changes the retry logic in staging, tests it, and tries to merge into the production branch. They hit a merge conflict that has nothing to do with their actual change, because the timeout diverged months ago across a dozen config files buried in branch history.

Trunk-based development with directory-per-environment avoids this entirely. You can see the state of any environment without switching context, promoting changes is explicit, and drift between environments becomes visible instead of hiding inside branch history that nobody reviews.
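As a concrete sketch of that layout (directory and file names are illustrative), a single main branch holds one directory per environment:

```
app-config/
  base/                        # manifests shared by every environment
    deployment.yaml
    kustomization.yaml
  envs/
    staging/
      kustomization.yaml       # references ../../base
      timeout-patch.yaml       # apiTimeout: 30s
    production/
      kustomization.yaml       # references ../../base
      timeout-patch.yaml       # apiTimeout: 60s
```

The staging and production timeouts sit side by side in the same tree, so their divergence is visible in a single diff instead of hiding in branch history.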

Design for the 2 AM Call

This gets overlooked until someone is paged in the middle of the night and spends ten minutes in Slack trying to figure out which repo owns the broken deployment.

Consistent naming conventions, logical directory structures, and clear ownership labels matter far more at scale than teams expect. At a minimum, do these three things:

  • Encode ownership directly into paths
  • Add CODEOWNERS files so the right reviewers are tagged automatically
  • Label Kubernetes resources with team identifiers 

The goal: when something breaks at 2 AM, the on-call engineer can trace the alert to the owning team and relevant config in under a minute. If that path takes longer, your repo structure needs work.
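A minimal sketch of the first two items, with hypothetical org, team, and path names:

```
# CODEOWNERS: PRs touching these paths automatically request review
# from the owning team (supported by GitHub and GitLab)
/apps/payments/     @acme/payments-team
/apps/search/       @acme/search-team
/platform/          @acme/platform-team
/policy/            @acme/security-team
```

Pair this with a matching label on the Kubernetes resources themselves (for example, `team: payments-team`) so the alert, the manifest, and the reviewers all point at the same owner.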

2. Store Secret References in Git

The entire GitOps model is built on Git being the source of truth, but secrets obviously can't be stored as plaintext. This is where teams first discover that adopting GitOps tools and actually operating in a GitOps model are two different things.

Before reaching for tooling, first define the exact problem your secrets workflow is solving. Compliance? Operational speed? Reducing blast radius when credentials leak? Too many teams pick tools based on what's easiest to set up rather than what scales, and the root cause is almost always skipping this question.

Choosing a Secrets Management Approach

How the three most common patterns compare as your GitOps setup scales:

  • Sealed Secrets
    Best fit: small teams running 1–2 clusters with low compliance overhead.
    Main limitation at scale: the encryption key lives per-cluster. The same credential requires a separate SealedSecret per cluster, and a compromised key exposes all secrets committed to that cluster.
    Pelotech recommendation: a good starting point. Migrate before you hit your second cluster.

  • SOPS + ArgoCD/Flux
    Best fit: mid-size teams with strong developer discipline; AWS KMS, GCP KMS, or age backends.
    Main limitation at scale: encryption responsibility falls on each developer locally before commit, which becomes a bottleneck and an error surface as the team grows.
    Pelotech recommendation: workable with strong conventions, but harder to enforce consistently across large teams.

  • External Secrets Operator
    Best fit: enterprises with multi-cluster or regulated environments; AWS Secrets Manager or Vault backend.
    Main limitation at scale: higher initial setup cost, a mature secrets manager required, and more moving parts to operate.
    Pelotech recommendation: our default for enterprises. Secrets exist once; Git tracks references, not values.

Key principle across all approaches: Git should store secret references, never plaintext values. Pair with least-privilege IAM, automated rotation, and OIDC federation to eliminate long-lived CI credentials.

What Teams Start With (And When They Stop Working)

Most teams handle secrets reactively: base64-encoded Kubernetes Secrets, .gitignore for local files, and a verbal agreement that no one commits credentials. When that inevitably breaks, they reach for one of two common tools: 

Sealed Secrets is usually first. It works for small setups, but the encryption key lives inside each cluster, which creates problems at scale:

  • Deploying the same credential to ten clusters requires ten separate SealedSecret objects
  • Promoting from staging to production means re-encrypting
  • A compromised cluster key makes secrets committed to that cluster readable in Git history

SOPS, when used with ArgoCD or Flux, is more flexible (supporting AWS KMS, GCP KMS, age, and PGP), but it pushes encryption responsibility to individual developers. Everyone has to encrypt locally before committing, which becomes a bottleneck and an error surface as the team grows.

Both tools follow the same approach: encrypting secrets and storing them in Git. That keeps everything together and is simpler to start with, but once you're managing multiple teams or dealing with compliance requirements, the overhead starts to work against you.

What Works Best for Enterprise Teams

For organizations operating at scale, the pattern I keep coming back to is storing secret references in Git while keeping the actual values in AWS Secrets Manager or Vault. External Secrets Operator handles sync into clusters at runtime, which means Git sees the audit history without holding sensitive data.

This approach scales more cleanly because:

  • Secrets exist once in a central store, not duplicated per cluster or re-encrypted per environment
  • Rotation happens in the secrets manager without requiring Git commits or redeployments
  • Access control lives in IAM policies rather than encryption key distribution
  • Audit trails are clean, with Git tracking what’s referenced and secrets manager tracking access

For cross-account permissions, OIDC with IRSA eliminates long-lived credentials.

When we built this for UKi's cybersecurity environment, we backed secrets with AWS KMS, so rotation happens in the secrets manager. This was part of a broader effort to position UKi for FedRAMP certification, which requires demonstrating airtight secrets management, audit trails, and access controls across its GovCloud infrastructure.

The principles remain consistent regardless of tooling: least privilege access, automated rotation, complete audit trails, and plaintext never touching Git.

3. Detect and Prevent Drift Before It Compounds

Configuration drift will happen at enterprise scale. Someone will run kubectl edit during a production incident, an autoscaler will adjust replica counts, or a hotfix will bypass the PR flow because the site is down and the approvers are asleep.

None of that is surprising, but what catches teams off guard is how quickly undetected drift accumulates and breaks the core promise of GitOps—that Git is the source of truth. 

When your declared state and actual state diverge, disaster recovery becomes unreliable, compliance audits become tougher, and debugging production issues requires investigating multiple sources of truth.

Make Git the Only Path to Production

The most effective approach to preventing drift is to make Git the only way changes reach production. Lock down direct kubectl apply, console edits, and any CI scripts that push changes outside the GitOps flow. Your GitOps controller should be the only entity with write access to production resources.
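One way to enforce this is read-only RBAC for humans, while the GitOps controller keeps its own write-capable service account. A minimal sketch, assuming an SSO group named `engineers`:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: human-read-only
rules:
  - apiGroups: ["*"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]   # no create/update/patch/delete for humans
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: engineers-read-only
subjects:
  - kind: Group
    name: engineers                    # hypothetical group from your identity provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: human-read-only
  apiGroup: rbac.authorization.k8s.io
```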

That said, incidents don’t wait for PR approvals. When the site is down at 3 AM, someone’s going to fix it directly with break-glass credentials. That's fine, and pretending otherwise means the workarounds happen without any tracking. 

What matters is having a clear process for what happens during and after the incident:

  • During the incident, a senior engineer uses break-glass credentials, makes the fix directly in-cluster, and documents the change in incident channels
  • Within a defined window afterward, the change gets committed back to Git, the PR links to the incident postmortem, and the GitOps controller reconciles to confirm that Git matches reality again
The Break-Glass Incident Loop

Exceptions are inevitable. What matters is making them trackable and closing the loop back to Git.

During the incident:

  1. A senior engineer uses break-glass credentials: direct cluster access, bypassing the GitOps controller.
  2. The fix is applied directly in-cluster. The site recovers, but the change exists only in actual state, not declared state.
  3. The change is documented in the incident channel: timestamp, what changed, who made it.

After the incident, within a defined window:

  4. The change is committed back to Git. The PR links to the incident postmortem, creating the audit trail.
  5. The GitOps controller reconciles, confirming actual state matches declared state; the drift alert clears.
  6. If the pattern keeps recurring, escalate it as an architecture issue and improve the declared state to prevent the next occurrence.

Drift alert routing: route drift alerts to the same channels as incident alerts, not a dashboard nobody checks.
Resolution SLA: production drift gets one business day maximum; lower environments prefer automated correction.
Repeated drift signal: the same manual change made twice means a gap in declared state. Improve the manifest; don't just close the ticket.

Create Clear Policies for Drift Alerts and Resolution

Drift typically doesn’t occur from a lack of detection tooling. ArgoCD and Flux both flag when actual state diverges from declared state. The problem is that those signals are buried in dashboards that only platform engineers check, if anyone checks them at all.

Detection only works if the alerts reach the people who can act on them, and if there's a clear policy for how quickly drift gets resolved. For most enterprise teams I've worked with, that means:

  • Drift alerts go to the same channels as incident alerts
  • Production drift has a defined resolution window, usually one business day
  • Lower environments use automated correction, so drift is caught early
  • Recurring drift gets escalated as an architecture issue
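With ArgoCD, for instance, routing drift into the incident channel can be sketched with the notifications engine (the trigger name, template wording, and Slack wiring here are illustrative, and the Slack token itself should live in a secret, per practice #2):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-notifications-cm
  namespace: argocd
data:
  service.slack: |
    token: $slack-token            # resolved from argocd-notifications-secret, not stored here
  trigger.on-out-of-sync: |
    - when: app.status.sync.status == 'OutOfSync'
      send: [app-out-of-sync]
  template.app-out-of-sync: |
    message: "Drift detected: {{.app.metadata.name}} is OutOfSync with Git."
```

Applications then opt in with an annotation such as `notifications.argoproj.io/subscribe.on-out-of-sync.slack: <channel>`, so the alert lands where on-call is already looking.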

The operational question for leadership is, "Do we have visibility into how often our teams are working outside of Git, and do we have a policy for bringing those changes back?" If the answer to either is no, drift is accumulating whether you see it or not.

Treat Recurring Drift as a Signal, Not Just a Problem To Fix

If your team keeps bypassing Git to make the same kinds of changes, that's worth paying attention to. Frequent drift in the same area usually points to a gap in your declared state: a missing config option, an incomplete manifest, or a workflow that's too slow for real operational pace.

I don't think the goal is being purely declarative or purely imperative. It's knowing where the line is. What parts of the system should be declared and owned by tooling? What parts still need custom logic? And when the ecosystem catches up, you delete the script and move that logic back into the declarative layer.

The best GitOps implementations evolve because they treat drift as input. When you repeatedly see the same manual change, improve the declared state to account for it. That's how you turn a recurring problem into a permanent fix.

4. Enforce Governance in Code

When you're managing a handful of applications, governance can live in PR reviews and institutional knowledge. Someone senior catches the misconfiguration, leaves a comment, and it gets fixed before merge.

That falls apart when you're running dozens of teams, hundreds of applications, and auditors are asking for evidence rather than assurances.

I've found that the biggest infrastructure mistakes don't happen because teams choose the wrong trade-off in the moment. They happen because teams assume the trade-off that matters today will still be the one that matters in a year.

Policy-as-code is how you build governance that adapts when priorities shift, because they always do.

Turn Infrastructure Rules Into Enforceable Policy

The switch from "we review things in PRs" to "our policies enforce themselves" is one of the highest-leverage moves an enterprise team can make. Instead of relying on someone senior to catch that a deployment is missing resource limits or running a privileged container, you codify those rules so they're enforced automatically.

Open Policy Agent (with Gatekeeper) and Kyverno both handle this well, and the choice depends on your platform team’s preferences. What matters at the leadership level is the coverage model:

  • Enforce policies in the CI pipeline before merge, so problems get caught before they ever hit Git. This keeps your repo clean and reduces the volume of issues that need to be caught later.
  • Enforce policies in-cluster at admission time, so anything that slips through the pipeline gets blocked before it runs in production. This is your safety net.

Neither alone is sufficient at scale. Pipeline enforcement without admission control means a direct kubectl apply can bypass every rule you've written. Admission control without pipeline enforcement means developers won't discover their config is non-compliant until deployment, slowing everyone down.

Combining both gives consistent, auditable enforcement that doesn't depend on who's reviewing the code or how busy they are that week.
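As an illustration of the admission-time half, a Kyverno policy requiring resource limits might look like this (a minimal sketch, not a production-ready policy):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce   # block at admission; start with Audit to trial the rule
  rules:
    - name: check-container-limits
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "All containers must declare CPU and memory limits."
        pattern:
          spec:
            containers:
              - resources:
                  limits:
                    cpu: "?*"        # any non-empty value
                    memory: "?*"
```

Running the same policy in CI (for example, with `kyverno apply` against rendered manifests) gives you the pipeline half from one source of truth.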

Eliminate Long-Lived Credentials in CI/CD

This is worth calling out because it’s a gap in almost every enterprise GitOps setup I’ve audited. Teams invest significant effort in encrypting and managing application secrets, then deploy those secrets through CI pipelines running on AWS access keys or service account tokens that were set once and haven’t been rotated since.

OIDC federation solves this by replacing static credentials with short-lived tokens. Instead of storing long-lived keys that can leak or be forgotten, your CI provider authenticates directly with your cloud provider through identity federation.

Replacing long-lived CI credentials with OIDC is a high-impact, low-effort security improvement in a GitOps pipeline, and I recommend it to virtually every enterprise team we work with.
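For teams on GitHub Actions and AWS, for instance, the swap can be sketched like this (the role ARN and region are placeholders, and it assumes an IAM role whose trust policy already accepts GitHub's OIDC provider):

```yaml
name: deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write      # allow the job to request a short-lived OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deployer   # hypothetical role
          aws-region: us-west-2
      # subsequent steps receive temporary credentials; no stored access keys to rotate or leak
```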

Version Your Policies Alongside Your Config

Policies should live in Git and go through the same PR review process as everything else. This gives you a proper audit trail for policy changes and prevents the shadow governance problem, where rules exist only in someone's head or buried in a wiki page that hasn't been updated in a year.

I also recommend keeping policies in a separate repo from application config. Policies are typically owned by a platform or security team and change on a different cadence than the services they govern. Mixing the two creates the same ownership confusion we talked about in the repository structure section.

5. Get Multi-Cluster Management Right Early

Most GitOps documentation assumes a single cluster, which is a fine starting point, but not where enterprise teams live. The reality is usually multiple clusters spread across environments, regions, or compliance boundaries. 

Running GitOps on each cluster is the easy part. Keeping them all consistent with each other and promoting changes reliably without duplicating configuration everywhere is where it gets hard.

Running multiple clusters is fundamentally a business decision, not just a technical one. You do it for compliance boundaries, regional requirements, or fault isolation. Instead of focusing on which architectural pattern to use, understand what you're trying to achieve with the separation and whether your architecture supports that goal.

Pick an Architectural Pattern

There are a few common patterns for managing multiple clusters, and choosing one deliberately saves you from painful re-architecture later:

Hub-and-spoke runs a central management cluster that handles deployments to all remote clusters. You get centralized visibility and control, but the hub becomes a single point of dependency. If it goes down, deployments stop everywhere.

Decentralized runs independent controllers in each cluster. It offers better fault isolation, and each cluster can operate autonomously—but you need stronger conventions to maintain consistency since there's no single dashboard showing you everything.

Hybrid separates controllers by compliance or business boundary while unifying the repo structure and promotion flow across environments. 

The right choice depends on your team structure, security boundaries, and operational maturity. Both ArgoCD and Flux support all of these patterns. The important thing is making the decision early and intentionally rather than letting each cluster evolve its own setup organically.

Which Multi-Cluster Pattern Fits Your Situation?

Each pattern makes a different trade-off. The right choice depends on what you're optimizing for.

  • Hub-and-spoke
    Architecture: a central management cluster deploys to clusters A, B, and C.
    Core trade-off: centralized control, but the hub going down stops deployments everywhere.
    Choose this if: you want operational simplicity and can tolerate hub dependency.

  • Decentralized
    Architecture: each cluster runs its own controller against the Git repo.
    Core trade-off: each cluster runs independently, but keeping them consistent takes strong conventions.
    Choose this if: fault isolation is your priority and your platform team is mature.

  • Hybrid
    Architecture: separate hubs per boundary (for example, commercial and GovCloud), each managing its own clusters from a shared repo.
    Core trade-off: hard separation at compliance boundaries with a shared repo and promotion flow.
    Choose this if: you have hard compliance boundaries, such as GovCloud alongside commercial.

The decision that matters most: pick one deliberately before your second cluster goes live. Re-architecting later, when teams are already working across an ad-hoc setup, is the expensive version of this choice.

Standardize Environment Promotion

Promoting changes from dev to staging to production is where most multi-environment setups introduce risk. When promotion means copying config files between directories or cherry-picking commits, you're one tired engineer away from shipping the wrong version to production.

The pattern that works at scale is separating what stays the same across all environments from what actually differs. Changes land in a base configuration, and environment-specific overlays handle the differences (region, resource sizing, compliance-specific labels) through Kustomize patches or Helm values files. Promoting a change means updating an overlay reference, not rewriting config.

Adding automated gates to the promotion path, like test suites that must pass before promotion proceeds and approval steps for production changes, reduces the manual surface while keeping control where it matters.
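A sketch of promotion-by-overlay with Kustomize (the image name, tag, and patch file are illustrative):

```yaml
# envs/production/kustomization.yaml
# Promotion means bumping the pinned tag via PR, not copying manifests between directories.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                 # everything shared across environments lives in base
images:
  - name: registry.example.com/payments
    newTag: v1.42.0            # promote by updating this pin
patches:
  - path: timeout-patch.yaml   # production-only difference: the 60-second API timeout
```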

Define the Config Standard, Then Enforce With Templates

Once you're past a handful of clusters, configuring each one by hand guarantees inconsistency. Helm charts or Kustomize bases should define the standard, with overlays reserved for things that genuinely differ between clusters: region, resource sizing, compliance-specific labels.

ArgoCD's ApplicationSets and Flux's Kustomization targeting both let you deploy the same config across many clusters from a single definition. This drastically reduces the maintenance surface and makes it much harder for individual clusters to silently drift into snowflake territory.
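For example, an ArgoCD ApplicationSet with the cluster generator stamps out one Application per registered cluster from a single definition (the repo URL, app name, and overlay paths are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments
  namespace: argocd
spec:
  generators:
    - clusters: {}                     # one Application per cluster registered with ArgoCD
  template:
    metadata:
      name: "{{name}}-payments"        # {{name}} resolves to the cluster name
    spec:
      project: default
      source:
        repoURL: https://github.com/example/app-config   # hypothetical repo
        targetRevision: main
        path: "envs/{{name}}"          # per-cluster overlay directory
      destination:
        server: "{{server}}"
        namespace: payments
```

Adding a new cluster then means registering it with ArgoCD and adding an overlay directory, not hand-wiring another deployment pipeline.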

This is exactly what we did with UKi. When we started, they had a single environment that was built manually and wasn't reproducible. If it failed, recovery was uncertain. We moved them to declarative, reproducible environments, and what had been weeks of provisioning work became a repeatable process. UKi's co-founder, Dr. Scott Wells, says it best:

"I initially thought we were 3 years out of being able to build this ourselves because of the complexity. I was amazed when I realized how quickly things were moving along."

We had a working proof of concept in 2.5 months and an MVP in 6, with a 97% reduction in deployment time.

Enterprise GitOps Breaks Differently. Fix It Accordingly.

The practices in this article all trace back to the same insight: enterprise GitOps breaks not because the tools are wrong, but because the decisions around those tools were made too early, too quickly, and became painful to change later.

If your team can confidently rebuild everything from Git, promote changes across environments without manual copying, rotate secrets without redeploying, and prove compliance through code rather than conversations, your GitOps implementation is in solid shape.

If any of those feel shaky, you're in the phase where most teams Pelotech works with reach out. We've worked through these exact problems for organizations ranging from growth-stage startups to regulated enterprises and large-scale operations.

I'll also be the first to tell you when something isn't worth doing. Sometimes the best move is to simplify, defer, or kill a project entirely. But if your GitOps architecture stretches shipping timelines and the cost of getting it wrong is high, that's the kind of problem we solve daily. 

Let's talk about your specific architecture.
