Terraform State In A Team: The Setup That Stops Two Engineers From Corrupting Prod

The team has a terraform.tfstate file in the repo. Engineer A runs terraform apply. So does engineer B, on a slightly different version of the code. Both push their state files. The merged state file is now inconsistent with what’s actually in AWS: some resources tracked, some not, some duplicated. The next apply proposes deleting half the production infrastructure.

Local state is fine for one person. For a team, you need remote state with locking: state stored in a shared backend (S3, Terraform Cloud, GCS), with a lock that prevents two applies from running simultaneously. About 10 lines of HCL, and it eliminates an entire category of disaster.

This post is the working setup, the workspaces-vs-directories debate, and the four habits that keep multi-engineer Terraform sane.

Remote state with locking

The S3 + DynamoDB pattern is the AWS-native standard:

terraform {
  backend "s3" {
    bucket         = "company-tf-state"
    key            = "prod/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock"
  }
}

Set this up once, in a separate project (the chicken-and-egg problem: the bucket and table must exist before any backend can use them):

resource "aws_s3_bucket" "tf_state" {
  bucket = "company-tf-state"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_dynamodb_table" "tf_state_lock" {
  name         = "tf-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
}

The S3 bucket stores the state; versioning protects you from accidental overwrites. The DynamoDB table is the lock: at any moment, only one apply can hold the lock. The second engineer running apply gets:

Error: Error acquiring the state lock

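…and the run stops there instead of corrupting state. By default Terraform fails immediately; pass -lock-timeout=5m (or similar) to make it keep retrying for that long.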
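
Newer Terraform releases can also take the lock inside the S3 bucket itself, with no DynamoDB table. A minimal sketch, assuming Terraform 1.10 or later (the release that added the use_lockfile argument; check the S3 backend docs for your version) and the same bucket name as above:

```hcl
terraform {
  # Assumption: use_lockfile is only available on recent Terraform releases.
  required_version = ">= 1.10.0"

  backend "s3" {
    bucket       = "company-tf-state"
    key          = "prod/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # lock with a .tflock object stored next to the state file
  }
}
```

If you are migrating an existing setup, both use_lockfile and dynamodb_table can be set at the same time until every engineer and CI runner is on a release that understands the new argument.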

For GCP, use a GCS bucket; for Azure, a Storage Account container. The idea is the same, and both backends lock natively, so there is no separate lock table to create.
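
As a rough GCP equivalent of the S3 backend block, assuming a bucket called company-tf-state already exists (the name and prefix are placeholders):

```hcl
terraform {
  backend "gcs" {
    bucket = "company-tf-state" # hypothetical bucket; create it before running init
    prefix = "prod"             # state objects are stored under this prefix
  }
}
```

Enable object versioning on the bucket for the same reason as on S3: it gives you a way back after a bad write.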

Terraform Cloud as a managed alternative

Terraform Cloud (now branded HCP Terraform; the free tier covers small teams) gives you remote state, locking, run logs, and access controls without the S3+DynamoDB setup:

terraform {
  cloud {
    organization = "your-org"
    workspaces { name = "prod" }
  }
}

Pros: no infrastructure to maintain, run history is visible to the team, role-based access. Cons: external dependency, free tier is genuinely small.

For most teams the S3+DynamoDB approach is fine. For teams with multiple environments and many engineers, Terraform Cloud is worth the price.

State files per environment and component, not one giant file

A common mistake: one giant state file with everything (prod, staging, dev, all services). The first time you need to refactor a single service, the state file is monolithic and terraform plan takes 10 minutes.

The right structure:

infra/
├── prod/
│   ├── network/      ← own state file
│   ├── eks/          ← own state file
│   ├── rds/          ← own state file
│   └── apps/
│       ├── api/      ← own state file
│       └── worker/   ← own state file
├── staging/
│   └── ...

Each directory has its own backend config and state. Changes to one component don’t risk another. Plans are fast.

The trade-off: cross-component references require explicit terraform_remote_state data sources or output sharing. That’s fine; it makes dependencies explicit.
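
A minimal sketch of that cross-component reference, assuming the network component exports a vpc_id output and keeps its state under prod/network/ in the same bucket (the resource names and the key are illustrative, not taken from a real setup):

```hcl
# In infra/prod/network/outputs.tf the network component would publish:
#
#   output "vpc_id" {
#     value = aws_vpc.main.id
#   }

# In infra/prod/apps/api/main.tf, read the network component's state (read-only).
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "company-tf-state"
    key    = "prod/network/terraform.tfstate" # assumed key for the network component
    region = "us-east-1"
  }
}

# Consume the published output instead of hard-coding the VPC ID.
resource "aws_security_group" "api" {
  name   = "api"
  vpc_id = data.terraform_remote_state.network.outputs.vpc_id
}
```

Only values declared as outputs are visible this way, which is what keeps the dependency explicit: the api component cannot reach into arbitrary network resources.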

Workspaces vs directories

Terraform’s “workspaces” feature lets one config target multiple environments (terraform workspace select staging). Tempting, controversial.

The case for workspaces: less duplication, single source.

The case against: it’s easy to apply staging changes to prod by accident (you think you’re in the staging workspace, but you forgot to switch and prod is still selected). Conditional logic on the workspace name (terraform.workspace == "prod") gets ugly.

For most teams, directories (one folder per environment) are safer. Workspaces are useful for ephemeral environments (per-PR previews) where the lifecycle is short and the safety risk is low.
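
If you do use workspaces, a guard can catch the wrong-workspace mistake at plan time. This is a sketch under assumptions: expected_workspace is a hypothetical variable that each environment's tfvars file would set, and terraform_data needs Terraform 1.4 or later.

```hcl
# Hypothetical variable: staging.tfvars would contain expected_workspace = "staging".
variable "expected_workspace" {
  type        = string
  description = "Workspace this variable file is meant for."
}

# terraform_data manages nothing real; it only carries the precondition.
resource "terraform_data" "workspace_guard" {
  lifecycle {
    precondition {
      condition     = terraform.workspace == var.expected_workspace
      error_message = "Selected workspace does not match expected_workspace. Run terraform workspace select first."
    }
  }
}

locals {
  # Prefix resource names with the workspace so per-PR previews do not collide.
  name_prefix = "app-${terraform.workspace}"
}
```

Running terraform plan -var-file=staging.tfvars while the prod workspace is selected then fails before anything is changed.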

The four habits that keep IaC sane

1. Always run plan before apply. Read the plan output. Confirm only the resources you expect are being modified. CI should make this automatic: plan on PRs, apply only on main after review.

2. Pin provider versions.

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.30.0"
    }
  }
}

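Without a version constraint, terraform init on a fresh checkout selects the newest provider release, which can be a new major version that changes resource schemas. Pin tightly; bump deliberately. Commit .terraform.lock.hcl as well, so every engineer and CI run resolves the same provider build.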

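3. Don’t manually edit state. terraform state rm and terraform state mv exist, but they change state with no plan and no review, so treat them as a last resort. If a resource’s settings drifted (someone changed it in the cloud console), either update the code to match or let the next apply revert the change. Reserve terraform import for resources that exist in the cloud but are missing from state.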

4. Use modules for repetition. A reusable module is one place to fix bugs. Inline copy-paste is many places to fix bugs.

module "api_service" {
  source = "../modules/ecs-service"
  name   = "api"
  image  = "ghcr.io/company/api:abc123"
  cpu    = 1024
  memory = 2048
}

Build a small library of internal modules (ECS service, RDS, Lambda) so adding a new service is 5 lines.
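
Back on habit 3: most state surgery can be written as reviewable config instead of CLI commands. A sketch, assuming Terraform 1.7 or later (moved blocks arrived in 1.1, import blocks in 1.5, removed blocks in 1.7); every resource name and ID here is made up for illustration:

```hcl
# Rename a resource without terraform state mv.
moved {
  from = aws_security_group.api
  to   = aws_security_group.api_service
}

resource "aws_security_group" "api_service" {
  name = "api"
}

# Adopt a bucket that was created by hand, without running terraform import.
import {
  to = aws_s3_bucket.access_logs
  id = "company-access-logs" # hypothetical existing bucket name
}

resource "aws_s3_bucket" "access_logs" {
  bucket = "company-access-logs"
}

# Stop managing a queue without destroying it, instead of terraform state rm.
# The matching resource block must already be deleted from the config.
removed {
  from = aws_sqs_queue.legacy

  lifecycle {
    destroy = false
  }
}
```

Each of these shows up in terraform plan and goes through the same PR review as any other change, which is the point.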

CI/CD for Terraform

A working pipeline:

# .github/workflows/tf.yml
on:
  pull_request:
    paths: ['infra/**']
  push:
    branches: [main]
    paths: ['infra/**']

# Cloud credentials (for example via OIDC) are assumed to be configured separately.
permissions:
  contents: read
  pull-requests: write   # lets the plan job comment on the PR

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    - uses: hashicorp/setup-terraform@v3
    - name: terraform plan
      working-directory: infra/prod/api
      run: |
        # Without pipefail, tee's exit code would hide a failed plan.
        set -o pipefail
        terraform init -input=false
        terraform plan -input=false -out=tfplan -no-color | tee plan.txt
    - name: comment on PR
      uses: actions/github-script@v7
      with:
        script: |
          const fs = require('fs');
          const plan = fs.readFileSync('infra/prod/api/plan.txt', 'utf8');
          // Await the call so an API failure fails the step.
          await github.rest.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: '```\n' + plan.slice(0, 60000) + '\n```',
          });

  apply:
    # No "needs: plan" here: plan is skipped on push, and a job that
    # needs a skipped job is skipped too, so apply would never run.
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    environment: prod    # require manual approval
    steps:
    - uses: actions/checkout@v4
    - uses: hashicorp/setup-terraform@v3
    - name: terraform apply
      working-directory: infra/prod/api
      run: |
        terraform init -input=false
        terraform apply -input=false -auto-approve

PRs get a plan posted to the PR. Merging triggers an apply that requires manual approval (via GitHub Environments). No engineer runs apply from a laptop.

Atlantis and Terraform Cloud automate this pattern further. For small teams, the GitHub Actions version above is enough.

Drift detection

Resources change outside Terraform: manual fixes during incidents, AWS console edits, automatic policy adjustments. After enough drift, terraform plan proposes “fixing” things that were intentionally changed.

A scheduled drift-detection run catches this:

# Run weekly: plan-only, alert if anything to change.
- run: terraform plan -detailed-exitcode
  # exit 0 = no changes, 1 = error, 2 = changes proposed

Integrate the alert with your incident system. The team can decide: import the change into state, or revert it.
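
When the outside change is intentional and ongoing (an autoscaler adjusting capacity, for example), a third option is to tell Terraform to stop tracking that one attribute. A sketch with hypothetical names and input variables:

```hcl
# Hypothetical inputs, declared so the example stands on its own.
variable "private_subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "workers" {
  name                = "workers"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2 # initial value only; scaling policies change it later
  vpc_zone_identifier = var.private_subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Scaling policies own this attribute, so plan should not try to reset it.
    ignore_changes = [desired_capacity]
  }
}
```

Use this sparingly: every ignored attribute is one the drift check can no longer see.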

Don’t put secrets in tfvars or variable defaults

# DON'T
variable "db_password" {
  default = "actual-password"
}

The value is committed to the repo, and the state file also stores it in plaintext in the S3 bucket. Anyone with repo or bucket access has the password.

Instead:

# DO: read the secret from a secrets manager at plan/apply time.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  # ...other arguments omitted...
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

The state file still records the value: both the data source result and the aws_db_instance password attribute are written to state. Treat the state file as sensitive: encrypt it at rest and restrict who can read the bucket.
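
For RDS specifically, one way to keep the password out of Terraform altogether is to let RDS generate and store it. A sketch, assuming an AWS provider version that supports manage_master_user_password; the identifier, sizes, and names are placeholders:

```hcl
resource "aws_db_instance" "main" {
  identifier        = "prod-main" # hypothetical values throughout
  engine            = "postgres"
  instance_class    = "db.t4g.medium"
  allocated_storage = 50
  username          = "app"

  # RDS creates the master password and keeps it in Secrets Manager,
  # so no password argument is set here.
  manage_master_user_password = true

  final_snapshot_identifier = "prod-main-final"
}

# Expose the secret's ARN (not its value) so applications can look it up.
output "db_secret_arn" {
  value = aws_db_instance.main.master_user_secret[0].secret_arn
}
```

Applications then read the secret through their own IAM permissions, and the state file holds only the ARN.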

Modules vs Terragrunt

Terragrunt wraps Terraform with conventions for DRY environment configs and easier remote state setup. It is excellent for teams managing many environments. For a small team starting out, plain Terraform with directories is enough.

If you adopt Terragrunt later, the migration is straightforward. Terragrunt is a layer on top, not a replacement.
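
To give a feel for what the layer adds, here is a sketch of a root Terragrunt config that generates the S3 backend block for every component. The file layout is an assumption, and naming conventions for the root file have shifted between Terragrunt releases, so check the docs for your version:

```hcl
# infra/terragrunt.hcl (root config; hypothetical layout)
remote_state {
  backend = "s3"

  # Write a backend.tf into each component directory at run time.
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }

  config = {
    bucket         = "company-tf-state"
    # Each component gets a key derived from its path, e.g. prod/network/terraform.tfstate.
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "tf-state-lock"
  }
}

# infra/prod/network/terragrunt.hcl would then only need:
#
#   include "root" {
#     path = find_in_parent_folders()
#   }
```

The per-component state keys from the directory layout earlier fall out of the path automatically, which removes one copy-paste backend block per directory.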

The takeaway

Local Terraform state is for solo work. For a team: remote state with locking (S3 + DynamoDB or Terraform Cloud), separate state files per environment / component, plan-on-PR / apply-on-merge in CI, modules for reuse, drift detection on a cron. Pin provider versions, don’t edit state by hand, keep secrets out of tfvars.

The setup takes a day. It pays for itself the first week somebody else on the team tries to apply infrastructure changes simultaneously.

A note from Yojji

The kind of infrastructure-as-code discipline that scales from one engineer to twenty (remote state, locking, modules, CI gates) is the kind of long-haul DevOps engineering Yojji’s teams put into the cloud platforms they ship for clients.

Yojji is an international custom software development company founded in 2016, with teams across Europe, the US, and the UK. They specialize in the JavaScript ecosystem, cloud platforms (AWS, Azure, GCP), and infrastructure operations, including the Terraform structure and process that decides whether your infra stays manageable as the team grows.