Drift detection, scheduled plan, and alerting

Drift is the gap between your HCL and what actually exists in the cloud. Someone edits a security group by hand in the Console and never reverts it. Another team's apply overwrites your IAM policy. AWS injects default tags. All of that is drift. The baseline pattern: a cron job in CI runs `terraform plan -detailed-exitcode`; exit code 2 means drift, and the job sends an alert to Slack. For broader coverage, driftctl, AWS Config, and Snyk IaC add cataloguing and attribution.

aka: terraform-drift, terraform-drift-detection, scheduled-terraform-plan

What drift is

Terraform treats state as the single source of truth. In practice, the cloud moves on its own:

  • Manual edits in the AWS Console: "I'll tweak this security group temporarily and roll it back later." They never roll it back.
  • Another team touches the same resource. An IAM role shared between two projects gets its policy overwritten by a foreign apply.
  • AWS changes things under the hood: default tags, new fields on data sources, automatically created resources.
  • A provider upgrade starts seeing an attribute it previously ignored.

terraform plan refreshes state from the cloud and then shows the difference against your HCL. When the code has already been applied, that difference is drift.

How to catch it

The minimal pattern is a cron job with -detailed-exitcode:

yaml
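# .github/workflows/drift.yml
on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
permissions:
  id-token: write   # required to assume the role through OIDC
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # keep terraform's raw exit code
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/tf-drift-readonly
          aws-region: us-east-1
      - run: terraform init -input=false
      - id: plan
        run: |
          set +e
          # -lock=false: a read-only role cannot write the state lock
          terraform plan -detailed-exitcode -no-color -input=false -lock=false -out=drift.tfplan
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
          set -e
      - if: steps.plan.outputs.exitcode == '1'
        name: Fail on plan error
        run: exit 1   # a broken plan must not look like a clean run
      - if: steps.plan.outputs.exitcode == '2'
        name: Notify Slack
        env:
          # Assumes the webhook is stored as a repository secret with this name.
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          terraform show -no-color drift.tfplan > drift.txt
          payload=$(jq -Rs --arg text "Drift detected in linuxlab/terraform-infra" \
            '{text: $text, attachments: [{text: .}]}' < drift.txt)
          curl -fsS -X POST -H 'Content-Type: application/json' \
            --data "$payload" "$SLACK_WEBHOOK_URL"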

The logic:

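  1. terraform plan -detailed-exitcode returns exit 0 (plan succeeded, no changes), 1 (error), or 2 (plan succeeded with a non-empty diff). On a scheduled run against main, a non-empty diff means drift or merged code that was never applied.
  2. On exit 2, collect the plan output and post it to Slack.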
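
Outside GitHub Actions, the same exit-code handling can be sketched as a small shell helper. This is a minimal sketch, not part of Terraform: classify_plan_exit, run_drift_check, and the DRIFT_LOG variable are hypothetical names introduced for illustration.

```bash
# classify_plan_exit (hypothetical helper): translate the exit code of
# `terraform plan -detailed-exitcode` into a label.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;    # plan succeeded, diff is empty
    1) echo "error" ;;    # plan failed; the check itself is broken
    2) echo "drift" ;;    # plan succeeded, diff is non-empty
    *) echo "unknown" ;;  # not an exit code the flag documents
  esac
}

# run_drift_check (hypothetical helper): run the plan, keep its output in
# DRIFT_LOG (an assumed variable, default drift.log), and print the label.
# Extra arguments are passed through to terraform plan.
run_drift_check() {
  code=0
  terraform plan -detailed-exitcode -no-color "$@" \
    > "${DRIFT_LOG:-drift.log}" 2>&1 || code=$?
  classify_plan_exit "$code"
}
```

A wrapper script can call run_drift_check, alert on "drift", and fail the job on "error", so that a broken plan is never mistaken for a clean run.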

The drift job's IAM role must be read-only (ReadOnlyAccess or a custom equivalent). It never applies anything; it only inspects.

Read-only IAM role for drift

A cron job with apply permissions is dangerous. If the pipeline breaks and the drift-detection job triggers an auto-apply, you can cause unintended changes.

The correct approach:

hcl
# Assumes aws_iam_openid_connect_provider.github is defined elsewhere in the stack.
resource "aws_iam_role" "tf_drift_readonly" {
  name = "tf-drift-readonly"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # Only tokens issued for AWS STS, and only from the main branch of this repo.
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          "token.actions.githubusercontent.com:sub" = "repo:linuxlab/terraform-infra:ref:refs/heads/main"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "tf_drift_readonly" {
  role       = aws_iam_role.tf_drift_readonly.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

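With ReadOnlyAccess, the role can refresh and diff. It cannot change anything, and that includes the state lock: if the backend locks state (for example S3 with a DynamoDB lock table), run the plan with -lock=false.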
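
As an extra safety net, the job can refuse to run when it holds any identity other than the read-only role. The guard below is a minimal sketch: assert_readonly_role is a hypothetical helper name, and it assumes the AWS CLI is installed and that credentials come from an assumed role, so the caller ARN has the form arn:aws:sts::ACCOUNT:assumed-role/ROLE/SESSION.

```bash
# assert_readonly_role (hypothetical helper): succeed only when the active
# AWS identity is the expected drift role. The default role name is the one
# used in the example above.
assert_readonly_role() {
  expected="${1:-tf-drift-readonly}"
  # Ask STS who we are; fail closed if the call itself fails.
  arn=$(aws sts get-caller-identity --query Arn --output text) || return 1
  case "$arn" in
    *":assumed-role/${expected}/"*) return 0 ;;
    *) echo "refusing to run: unexpected identity ${arn}" >&2
       return 1 ;;
  esac
}
```

Calling assert_readonly_role before terraform plan turns a misconfigured credential step into an immediate failure instead of a plan that runs with broader permissions than intended.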

What to do with detected drift

Drift detected does not mean auto-fix. Your response depends on the type.

| Drift type | Action |
| --- | --- |
| Someone edited the Console "temporarily" | Revert manually via the Console or with apply. |
| Another team's apply overwrote a resource | Coordinate with that team and split ownership. |
| AWS default tags | Add the tags to HCL or use ignore_changes. |
| Provider upgrade exposed new fields | Pin the provider version; upgrade deliberately. |
| A legitimate change that should be codified | Update HCL and commit. |

Anti-pattern: having the cron job auto-apply drift. That is "fixing by remote control" and can delete ad-hoc changes that no one has explained to you yet.

False positives

The most common "drifts" that are not actually drift:

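  • Computed attributes the provider recalculates on every refresh. A read-only attribute such as aws_db_instance.endpoint does not create a planned change on its own, and listing it in ignore_changes has no effect (Terraform warns that the entry is redundant). The noise usually comes from arguments that are both optional and computed, or from resources that consume the changed value. Fix: add the argument that actually appears in the diff to ignore_changes in that resource's lifecycle block.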

  • Tags injected by default_tags in the provider. If your provider block has default_tags { tags = { Owner = "team" } }, those tags are appended to everything. Older resources that predate them will show as drift.

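  • Order-sensitive fields. The API sometimes returns list elements or rules in a different order than the configuration declares them, and security groups with inline ingress blocks are a frequent source of this noise. Many teams replace the inline ingress block with standalone rule resources (aws_security_group_rule, or the newer aws_vpc_security_group_ingress_rule) to sidestep the problem. Do not mix inline and standalone rules on the same group: they overwrite each other and produce a diff on every plan.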

  • Time-based values. An expression such as timestamp() is re-evaluated on every plan, so anything derived from it always shows a change. Use the time_static resource from the hashicorp/time provider instead: it records the timestamp in state once and returns the same value on later runs.

These cases accumulate. At some point the team starts ignoring the drift job entirely, and the signal is lost. The fix is periodic HCL cleanup, targeted ignore_changes entries, and straightening out the schema, not suppressing alerts.
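
To find out which addresses keep reappearing, it helps to reduce each plan to one line per changed resource and compare runs over time. The sketch below assumes jq is installed and relies on the resource_changes array of Terraform's JSON plan representation; summarize_plan_changes is a hypothetical helper name.

```bash
# summarize_plan_changes (hypothetical helper): read a JSON plan on stdin
# and print "<actions><TAB><address>" for every resource that is not a no-op.
summarize_plan_changes() {
  jq -r '.resource_changes[]?
         | select(.change.actions != ["no-op"])
         | "\(.change.actions | join("+"))\t\(.address)"'
}

# Typical use, assuming drift.tfplan was written by `terraform plan -out`:
#   terraform show -json drift.tfplan | summarize_plan_changes
```

Appending each day's summary to a log and counting repeated lines shows which resources are chronic noise and deserve an HCL fix or a targeted ignore_changes entry.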

driftctl and alternatives

For serious drift detection, terraform plan alone is not enough. It only sees what is already in state. Resources that exist in AWS but not in state (created by hand or by another team) are invisible to Terraform.

driftctl (Snyk, open source, now in maintenance mode):

bash
driftctl scan --from tfstate+s3://my-tf-state/main.tfstate

It compares AWS reality against state and reports "unmanaged resources" (present in the cloud, absent from state) and "deleted resources" (present in state, gone from the cloud, the same thing terraform refresh would catch).

AWS Config takes a different approach: AWS itself tracks resource configuration and alerts on changes, independently of Terraform. It goes deeper and can surface changes that terraform refresh will never show, such as ELB target group settings.

Snyk IaC (paid) sits on top of git and the cloud, finds drift, and proposes a PR to fix the HCL.

Cron frequency

| Environment | Frequency |
| --- | --- |
| Dev | Once per day or after business hours |
| Stage | Once per day |
| Prod | Once per day at minimum; once per hour for critical stacks |

Running the cron every five minutes is overkill: it hammers STS and runs into AWS rate limits. Once per day catches drift within 24 hours at worst, which is acceptable for most teams.
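
The frequencies above can be kept in one place so that every pipeline derives its schedule from the same mapping. This is a minimal sketch: drift_cron_for is a hypothetical helper, the environment names mirror the table, "prod-critical" is an assumed label for critical production stacks, and the specific hours are assumptions.

```bash
# drift_cron_for (hypothetical helper): print a cron expression (UTC)
# for the given environment, following the frequency table.
drift_cron_for() {
  case "$1" in
    dev)           echo "0 20 * * *" ;;  # once per day, after business hours
    stage)         echo "0 6 * * *"  ;;  # once per day
    prod)          echo "0 6 * * *"  ;;  # once per day at minimum
    prod-critical) echo "0 * * * *"  ;;  # once per hour for critical stacks
    *) echo "unknown environment: $1" >&2
       return 1 ;;
  esac
}
```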

Reporting

Slack is the standard. Other options:

  • GitHub Issue: auto-create an issue with the plan output. The team triages in the issue tracker.
  • PagerDuty or OpsGenie: for production-critical stacks where drift means calling the on-call engineer.
  • Datadog or Grafana: a "drift events count" gauge graphed over time.

A practical rule of thumb: dev and stage send alerts to the team's Slack channel; prod sends a Slack alert plus a GitHub issue for traceability.
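
For the GitHub issue option, a minimal sketch using the GitHub CLI might look like the following. It assumes gh is installed and authenticated (in GitHub Actions that means a GH_TOKEN environment variable and the issues: write permission), that a "drift" label already exists in the repository, and that the plan output was saved to a file. open_drift_issue is a hypothetical helper name.

```bash
# open_drift_issue (hypothetical helper): create one issue per detected
# drift, with the saved plan output as the issue body.
#   $1 = stack name (used in the title)
#   $2 = file holding the plan output (default: drift.txt)
open_drift_issue() {
  stack="$1"
  plan_file="${2:-drift.txt}"
  if [ ! -r "$plan_file" ]; then
    echo "plan output not found: $plan_file" >&2
    return 1
  fi
  gh issue create \
    --title "Drift detected: ${stack} ($(date -u +%Y-%m-%d))" \
    --label "drift" \
    --body-file "$plan_file"
}
```

Putting the date in the title keeps one issue per day and stack, which makes repeated drift on the same stack easy to spot in the tracker.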

Pitfalls

  • The drift job contends for the shared state lock. A plan takes the lock by default, so a drift run that overlaps an auto-apply pipeline makes one of the two wait or fail. Schedule the drift cron outside business hours, run the plan with -lock=false, or work from a read-only state snapshot.

  • AWS rate limits. A large state with thousands of resources makes thousands of API calls during refresh. If your stack is that large, stagger the cron runs to stay under the Throttling threshold. Alternatively, break the stack into smaller state files (see tf-large-scale-state).

  • driftctl only covers what it knows. Not every provider resource type is supported. If your state contains an unusual resource type, driftctl may ignore it or label it "unsupported".

  • Alert fatigue from "drift detected". Five false positives every day and the team turns the alert off. Then real drift happens and no one sees it. Fix the noise, do not suppress the signal.

  • Refresh is not full drift detection. terraform refresh only examines attributes in the provider schema. If AWS adds a field the provider does not yet know about, that drift passes unnoticed. Periodic provider upgrades matter.

  • terraform refresh as a standalone command is outdated. Older pipelines ran terraform refresh && terraform plan. The standalone command has been deprecated since Terraform v0.15.4 in favor of the -refresh-only option on plan and apply, and a normal plan already performs the refresh step. Do not complicate the pipeline.
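
For the rate-limit pitfall above, one way to stagger runs is to derive each stack's start minute from its name, so that stacks do not all begin refreshing at the top of the hour. This is a minimal sketch under that assumption; stagger_minute and stagger_cron are hypothetical helper names, and the stack names are whatever your repository uses.

```bash
# stagger_minute (hypothetical helper): map a stack name to a stable
# minute in the range 0-59 using the CRC that cksum prints.
stagger_minute() {
  # cksum prints "<crc> <byte count>" for stdin; keep only the CRC.
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % 60 ))
}

# stagger_cron (hypothetical helper): daily cron expression for a stack.
#   $1 = stack name, $2 = hour in UTC
stagger_cron() {
  echo "$(stagger_minute "$1") $2 * * *"
}
```

The same stack always lands on the same minute, so schedules stay stable across runs while different stacks spread across the hour.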

See also in LinuxLab

  • cmd-cron-crontab: a simple on-premises way to run drift checks without GitHub Actions, suitable for a self-hosted runner or a dev environment.
  • systemd-timers: a more modern alternative with journal logs and unit dependencies, preferable to cron for a production agent.

§ commands

bash
terraform plan -detailed-exitcode -no-color

Canonical drift check. Exit 0 = clean, 2 = drift, 1 = error.

bash
terraform plan -refresh-only

Refresh only, no comparison against HCL. Shows what changed in the cloud relative to state.

bash
driftctl scan --from tfstate+s3://bucket/state

External tool that finds resources outside of state. Complements terraform plan.

bash
aws configservice get-compliance-summary-by-resource-type

AWS Config view at the AWS level, independent of Terraform: counts of compliant and non-compliant resources per resource type, as evaluated by your Config rules.
