What drift is
Terraform treats state as the single source of truth. In practice, the cloud moves on its own:
- Manual edits in the AWS Console: "I'll tweak this security group temporarily and roll it back later." They never roll it back.
- Another team touches the same resource. An IAM role shared between two projects gets its policy overwritten by a foreign apply.
- AWS changes things under the hood: default tags, new fields on data sources, automatically created resources.
- A provider upgrade starts seeing an attribute it previously ignored.
Running terraform plan after a refresh shows the difference. That is drift.
How to catch it
The minimal pattern is a cron job with -detailed-exitcode:
# .github/workflows/drift.yml
on:
  schedule:
    - cron: "0 6 * * *" # every day at 06:00 UTC
permissions:
  id-token: write # needed to assume the AWS role through OIDC
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false # so $? below is terraform's own exit status
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/tf-drift-readonly
          aws-region: us-east-1
      - run: terraform init
      - id: plan
        run: |
          set +e
          terraform plan -detailed-exitcode -no-color -out=drift.tfplan
          echo "exitcode=$?" >> "$GITHUB_OUTPUT"
          set -e
      - if: steps.plan.outputs.exitcode == '2'
        name: Notify Slack
        env:
          # assumes a repository secret with this name exists
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}
        run: |
          terraform show -no-color drift.tfplan > drift.txt
          payload=$(jq -Rs --arg text "Drift detected in linuxlab/terraform-infra" \
            '{text: $text, attachments: [{text: .}]}' < drift.txt)
          curl -X POST -H 'Content-Type: application/json' \
            --data "$payload" "$SLACK_WEBHOOK_URL"
      - if: steps.plan.outputs.exitcode == '1'
        name: Fail on plan error
        run: exit 1
The logic:
- terraform plan -detailed-exitcode returns exit 0 (clean), 1 (error), or 2 (drift exists).
- On exit 2, collect the plan output and post it to Slack.
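The same exit-code handling can be wrapped in a small shell function, for example when the check runs from plain cron instead of GitHub Actions. This is a minimal sketch: the function name classify_plan_exit and the status words are illustrative and not part of Terraform.

```shell
#!/usr/bin/env bash
# Map the exit status of `terraform plan -detailed-exitcode` to a status word.
# classify_plan_exit is a hypothetical helper name, not a Terraform command.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;   # plan succeeded, empty diff
    1) echo "error" ;;   # the plan itself failed
    2) echo "drift" ;;   # plan succeeded, non-empty diff
    *) echo "unknown" ;; # anything else is unexpected
  esac
}

# Usage (assumes terraform is installed and initialised in the current directory):
#   terraform plan -detailed-exitcode -no-color -out=drift.tfplan
#   status=$(classify_plan_exit "$?")
```

Keeping "error" separate from "drift" matters: a failed plan should fail the job rather than be reported as a clean run.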
The drift job's IAM role must be read-only (ReadOnlyAccess or a custom equivalent). It never applies anything; it only inspects.
Read-only IAM role for drift
A cron job with apply permissions is dangerous. If the pipeline breaks and the drift-detection job triggers an auto-apply, you can cause unintended changes.
The correct approach:
resource "aws_iam_role" "tf_drift_readonly" {
  name = "tf-drift-readonly"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:linuxlab/terraform-infra:ref:refs/heads/main"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "tf_drift_readonly" {
  role       = aws_iam_role.tf_drift_readonly.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}
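With ReadOnlyAccess, the role can refresh and diff. It cannot change anything. One caveat: if the backend takes a state lock during plan (for example, S3 with a DynamoDB lock table), a strictly read-only role cannot write the lock. Either grant write access to the lock table only, or run the plan with -lock=false.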
What to do with detected drift
Drift detected does not mean auto-fix. Your response depends on the type.
| Drift type | Action |
|---|---|
| Someone edited the Console "temporarily" | Revert manually via the Console or with apply. |
| Another team's apply overwrote a resource | Coordinate with that team and split ownership. |
| AWS default tags | Add the tags to HCL or use ignore_changes. |
| Provider upgrade exposed new fields | Pin the provider version; upgrade deliberately. |
| A legitimate change that should be codified | Update HCL and commit. |
Anti-pattern: having the cron job auto-apply drift. That is "fixing by remote control" and can delete ad-hoc changes that no one has explained to you yet.
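Triage starts with knowing which resources the plan would touch and how. The sketch below condenses the text produced by terraform show -no-color into one line per resource. The helper name summarize_plan is hypothetical, and the pattern relies on the "# <address> will be ..." and "# <address> must be ..." header lines in the human-readable plan output.

```shell
#!/usr/bin/env bash
# Print one "address: action" line per resource that the plan would change.
# summarize_plan is a hypothetical helper; it reads the plan text on stdin and
# keeps only the "# <address> will be|must be <action>" header lines.
summarize_plan() {
  sed -E -n 's/^[[:space:]]*# (.+) (will be|must be) (.+)$/\1: \3/p'
}

# Usage (drift.txt is the file written by `terraform show -no-color drift.tfplan`):
#   summarize_plan < drift.txt
```

Text scraping like this is fragile across Terraform versions; for anything beyond a quick summary, terraform show -json is the more robust input.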
False positives
The most common "drifts" that are not actually drift:
- Computed attributes the provider recalculates on every refresh. For example, aws_db_instance.endpoint sometimes shifts on refresh. Fix: lifecycle { ignore_changes = [...] } on the configured argument that actually flaps. Listing a purely computed attribute such as endpoint has no effect; Terraform warns that the entry is redundant.
- Tags injected by default_tags in the provider. If your provider block has default_tags { tags = { Owner = "team" } }, those tags are appended to everything. Older resources that predate them will show as drift.
- Order-sensitive fields. aws_security_group.ingress has a defined order inside Terraform, but the provider sometimes returns rules in a different order. Many teams replace the inline ingress block with individual aws_security_group_rule resources to sidestep the problem.
- Time-based values. time_static.created_at can shift on refresh if you read it through a computed expression. Leave time_static alone; use static_value or store the value directly in state.
These cases accumulate. At some point the team starts ignoring the drift job
entirely, and the signal is lost. The fix is periodic HCL cleanup, targeted
ignore_changes entries, and straightening out the schema, not suppressing
alerts.
driftctl and alternatives
For serious drift detection, terraform plan alone is not enough. It only sees
what is already in state. Resources that exist in AWS but not in state (created
by hand or by another team) are invisible to Terraform.
driftctl (Snyk, open-source, now in maintenance mode):
driftctl scan --from tfstate+s3://my-tf-state/main.tfstate
It compares AWS reality against state and reports "unmanaged resources" (present
in the cloud, absent from state) and "deleted resources" (present in state, gone
from the cloud, the same thing terraform refresh would catch).
AWS Config takes a different approach: AWS itself tracks resource configuration
and alerts on changes, independently of Terraform. It goes deeper and can surface
changes that terraform refresh will never show, such as ELB target group
settings.
Snyk IaC (paid) sits on top of git and the cloud, finds drift, and proposes a PR to fix the HCL.
Cron frequency
| Environment | Frequency |
|---|---|
| Dev | Once per day or after business hours |
| Stage | Once per day |
| Prod | Once per day at minimum; once per hour for critical stacks |
Running the cron every five minutes is overkill: it hammers STS and runs into AWS rate limits. Once per day catches drift within 24 hours at worst, which is acceptable for most teams.
Reporting
Slack is the standard. Other options:
- GitHub Issue: auto-create an issue with the plan output. The team triages in the issue tracker.
- PagerDuty or OpsGenie: for production-critical stacks where drift means calling the on-call engineer.
- Datadog or Grafana: a "drift events count" gauge graphed over time.
A practical rule of thumb: dev and stage send alerts to the team's Slack channel; prod sends a Slack alert plus a GitHub issue for traceability.
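The rule of thumb above can be encoded as a small lookup so that every pipeline routes alerts the same way. This is a sketch only: the function name alert_targets and the target labels "slack" and "github-issue" are made up for illustration, and the actual delivery is left out.

```shell
#!/usr/bin/env bash
# Decide where a drift alert goes, following the dev/stage/prod rule of thumb.
# alert_targets and the target labels are illustrative names.
alert_targets() {
  case "$1" in
    dev|stage) echo "slack" ;;
    prod)      echo "slack github-issue" ;;
    *)         echo "unknown environment: $1" >&2; return 1 ;;
  esac
}

# Usage: loop over the targets for the current environment.
#   for target in $(alert_targets prod); do echo "would notify: $target"; done
```

Failing on an unknown environment is deliberate: a typo in the environment name should break the job, not silently drop the alert.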
Pitfalls
- The drift job locks shared state. If an auto-apply pipeline runs at the same time, both jobs compete for the state lock. Schedule the drift cron outside business hours, or work from a read-only state snapshot.
- AWS rate limits. A large state with thousands of resources makes thousands of API calls during refresh. If your stack is that large, stagger the cron runs to stay under the Throttling threshold. Alternatively, break the stack into smaller state files (see tf-large-scale-state).
- driftctl only covers what it knows. Not every provider resource type is supported. If your state contains an unusual resource type, driftctl may ignore it or label it "unsupported".
- Alert fatigue from "drift detected". Five false positives every day and the team turns the alert off. Then real drift happens and no one sees it. Fix the noise; do not suppress the signal.
- Refresh is not full drift detection. terraform refresh only examines attributes in the provider schema. If AWS adds a field the provider does not yet know about, that drift passes unnoticed. Periodic provider upgrades matter.
- terraform refresh as a standalone command is outdated. Older pipelines ran terraform refresh && terraform plan. Since 0.15.4, refresh is deprecated as a separate command; plan already performs the refresh step, and terraform plan -refresh-only covers the case where you want only the refresh. Do not complicate the pipeline.
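For the rate-limit pitfall, one way to stagger runs is to derive each stack's start minute from its name, so that schedules stay stable without a hand-maintained table. A minimal sketch, assuming one cron entry per stack; stagger_minute and the stack name network-prod are hypothetical.

```shell
#!/usr/bin/env bash
# Derive a stable minute offset (0-59) from a stack name so that scheduled
# drift checks for different stacks do not all start at the same minute.
# stagger_minute is a hypothetical helper; cksum is the POSIX CRC utility.
stagger_minute() {
  local sum
  sum=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( sum % 60 ))
}

# Usage: build a daily cron expression for a (made-up) stack name.
#   echo "$(stagger_minute network-prod) 6 * * *"
```

Two stacks can still land on the same minute. The goal is to spread the load, not to guarantee unique slots.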
See also in LinuxLab
- cmd-cron-crontab: a simple on-premises way to run drift checks without GitHub Actions, suitable for a self-hosted runner or a dev environment.
- systemd-timers: a more modern alternative with journal logs and unit dependencies, preferable to cron for a production agent.