Drift detection, scheduled plan, and alerting

What drift is

Terraform treats state as the single source of truth. In practice, the cloud moves on its own:

Manual edits in the AWS Console: "I'll tweak this security group temporarily and roll it back later." They never roll it back.
Another team touches the same resource. An IAM role shared between two projects gets its policy overwritten by a foreign apply.
AWS changes things under the hood: default tags, new fields on data sources, automatically created resources.
A provider upgrade starts seeing an attribute it previously ignored.

terraform plan refreshes state from the cloud first and then shows the difference against your configuration. That difference is drift.

How to catch it

The minimal pattern is a cron job with -detailed-exitcode:

yaml

# .github/workflows/drift.yml
on:
  schedule:
    - cron: "0 6 * * *"   # every day at 06:00 UTC
permissions:
  id-token: write   # required for OIDC role assumption
  contents: read
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false   # assumption: keeps the wrapper from altering exit codes and output
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::ACCOUNT:role/tf-drift-readonly
          aws-region: us-east-1
      - run: terraform init -input=false
      - id: plan
        run: |
          set +e
          terraform plan -detailed-exitcode -no-color -input=false -out=drift.tfplan
          code=$?
          set -e
          echo "exitcode=$code" >> "$GITHUB_OUTPUT"
          # Exit 1 is a real failure: fail the job instead of hiding it.
          if [ "$code" -eq 1 ]; then exit 1; fi
      - if: steps.plan.outputs.exitcode == '2'
        name: Notify Slack
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK_URL }}   # assumed secret name
        run: |
          terraform show -no-color drift.tfplan > drift.txt
          payload=$(jq -Rs --arg text "Drift detected in linuxlab/terraform-infra" \
            '{text: $text, attachments: [{text: .}]}' < drift.txt)
          curl -fsS -X POST -H 'Content-Type: application/json' \
            --data "$payload" "$SLACK_WEBHOOK_URL"

The logic:

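terraform plan -detailed-exitcode returns exit 0 (no changes), 1 (error), or 2 (plan succeeded with changes). On a branch where every commit has already been applied, exit 2 means drift.
On exit 2, collect the plan output and post it to Slack. On exit 1, fail the job so a broken plan is not mistaken for a clean one.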
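
The exit-code handling can be pulled into a small shell helper. This is a minimal sketch, not the workflow's actual script: classify_plan_exit and run_drift_plan are hypothetical names, and the fallback to /dev/null exists only so the function also runs outside GitHub Actions.

```shell
#!/usr/bin/env bash
# Map the exit code of `terraform plan -detailed-exitcode` to a label.
# 0 = no changes, 1 = plan failed, 2 = plan succeeded with changes.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper: run the plan, record the label as a step output,
# and return non-zero only when the plan itself failed.
run_drift_plan() {
  local code=0
  terraform plan -detailed-exitcode -no-color -out=drift.tfplan || code=$?
  local label
  label="$(classify_plan_exit "$code")"
  echo "status=${label}" >> "${GITHUB_OUTPUT:-/dev/null}"
  [ "$label" != "error" ]
}
```

Treating every unexpected code as an error is deliberate: a job that cannot plan should turn red rather than report a clean run.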

The drift job's IAM role must be read-only (ReadOnlyAccess or a custom equivalent). It never applies anything; it only inspects.

Read-only IAM role for drift

A cron job with apply permissions is dangerous. If a pipeline bug or a careless edit ever makes the drift job run apply, it changes infrastructure with nobody watching.

The correct approach:

hcl

# Assumes aws_iam_openid_connect_provider.github is defined elsewhere.
resource "aws_iam_role" "tf_drift_readonly" {
  name = "tf-drift-readonly"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # Default audience sent by aws-actions/configure-aws-credentials.
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:linuxlab/terraform-infra:ref:refs/heads/main"
        }
      }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "tf_drift_readonly" {
  role       = aws_iam_role.tf_drift_readonly.name
  policy_arn = "arn:aws:iam::aws:policy/ReadOnlyAccess"
}

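With ReadOnlyAccess, the role can refresh and diff. It cannot change anything. If your backend uses state locking (for example a DynamoDB lock table with the S3 backend), acquiring the lock is a write: either grant the role access to the lock alone or run the plan with -lock=false.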

What to do with detected drift

Drift detected does not mean auto-fix. Your response depends on the type.

Drift type	Action
Someone edited the Console "temporarily"	Revert manually via the Console or with `apply`.
Another team's apply overwrote a resource	Coordinate with that team and split ownership.
AWS default tags	Add the tags to HCL or use `ignore_changes`.
Provider upgrade exposed new fields	Pin the provider version; upgrade deliberately.
A legitimate change that should be codified	Update HCL and commit.

Anti-pattern: having the cron job auto-apply drift. That is "fixing by remote control" and can delete ad-hoc changes that no one has explained to you yet.

False positives

The most common "drifts" that are not actually drift:

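Computed attributes the provider recalculates on every refresh. For example, aws_db_instance.endpoint can change on refresh. ignore_changes does not help on the attribute itself: it applies only to arguments set in configuration, and Terraform warns that it is redundant on a provider-computed attribute. Such a value produces a diff only when another resource consumes it, so put ignore_changes on the consuming argument or feed that argument a stable value.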
Tags injected by default_tags in the provider. If your provider block has default_tags { tags = { Owner = "team" } }, those tags are appended to everything. Older resources that predate them will show as drift.
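Order-sensitive fields. aws_security_group.ingress has a defined order inside Terraform, but the provider sometimes returns rules in a different order. Many teams replace the inline ingress block with standalone rule resources (aws_security_group_rule, or the newer aws_vpc_security_group_ingress_rule that the AWS provider documentation recommends) to sidestep the problem. Do not mix inline and standalone rules on the same group: the two overwrite each other and produce permanent drift.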
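Time-based values. An expression built on timestamp() yields a new value on every plan, so anything derived from it shows a permanent diff. Capture the time once with a time_static resource (hashicorp/time provider), which keeps the value in state, and reference its rfc3339 attribute instead of recomputing the time.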

These cases accumulate. At some point the team starts ignoring the drift job entirely, and the signal is lost. The fix is periodic HCL cleanup, targeted ignore_changes entries, and straightening out the schema, not suppressing alerts.
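
One way to find the recurring offenders is to log which resource addresses show up in each drift plan and rank them. The sketch below assumes jq is installed and reads the JSON that terraform show -json drift.tfplan prints; changed_addresses, record_and_rank, and the history file name are hypothetical.

```shell
#!/usr/bin/env bash
# Reads plan JSON on stdin and prints the address of every resource whose
# planned action is anything other than a no-op.
changed_addresses() {
  jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | .address'
}

# Appends the addresses from the current plan (stdin) to a history file,
# then prints how often each address has appeared, most frequent first.
record_and_rank() {
  local history_file="${1:-drift-history.txt}"
  changed_addresses >> "$history_file"
  sort "$history_file" | uniq -c | sort -rn
}
```

A typical call would be terraform show -json drift.tfplan | record_and_rank drift-history.txt. Addresses that top the list day after day are candidates for an HCL fix or a targeted ignore_changes entry.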

driftctl and alternatives

For serious drift detection, terraform plan alone is not enough. It only sees what is already in state. Resources that exist in AWS but not in state (created by hand or by another team) are invisible to Terraform.

driftctl (Snyk, open-source, now in maintenance mode):

bash

driftctl scan --from tfstate+s3://my-tf-state/main.tfstate

It compares AWS reality against state and reports "unmanaged resources" (present in the cloud, absent from state) and "missing resources" (present in state, gone from the cloud; terraform plan catches these too).
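
If driftctl runs in the same scheduled job, its exit code needs separate handling, because it does not follow the terraform plan convention. The sketch below rests on an assumption to verify against the driftctl version you run: 0 means in sync, 1 means not in sync, 2 means the scan failed. classify_driftctl_exit and scan_state are hypothetical names.

```shell
#!/usr/bin/env bash
# ASSUMPTION: driftctl exit codes are 0 (in sync), 1 (not in sync), 2 (error).
# Confirm this in the documentation for your driftctl version.
classify_driftctl_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    1) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Scan one state file and print the resulting label.
scan_state() {
  local state_uri="$1" code=0
  driftctl scan --from "$state_uri" || code=$?
  classify_driftctl_exit "$code"
}
```

For example, scan_state "tfstate+s3://my-tf-state/main.tfstate" prints the scan report followed by the label. Under the assumption above, 1 and 2 carry the opposite meaning from terraform plan -detailed-exitcode, so the two tools should not share one exit-code check.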

AWS Config takes a different approach: AWS itself tracks resource configuration and alerts on changes, independently of Terraform. It goes deeper and can surface changes that terraform refresh will never show, such as ELB target group settings.

Snyk IaC (paid) sits on top of git and the cloud, finds drift, and proposes a PR to fix the HCL.

Cron frequency

Environment	Frequency
Dev	Once per day or after business hours
Stage	Once per day
Prod	Once per day at minimum; once per hour for critical stacks

Running the cron every five minutes is overkill: every run refreshes every resource, hammers the AWS APIs, and runs into rate limits. Once per day catches drift within 24 hours at worst, which is acceptable for most teams.

Reporting

Slack is the standard. Other options:

GitHub Issue: auto-create an issue with the plan output. The team triages in the issue tracker.
PagerDuty or OpsGenie: for production-critical stacks where drift means calling the on-call engineer.
Datadog or Grafana: a "drift events count" gauge graphed over time.

A practical rule of thumb: dev and stage send alerts to the team's Slack channel; prod sends a Slack alert plus a GitHub issue for traceability.
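
Plan output for a large stack can exceed what a chat message or an issue body accepts, so it helps to trim the report before sending it. A rough sketch, assuming the gh CLI is installed and authenticated through GH_TOKEN in the job; truncate_report, open_drift_issue, and both byte limits are illustrative choices, not documented thresholds.

```shell
#!/usr/bin/env bash
# Print a report file, cut to at most max_bytes. When the file is larger,
# a marker line is appended so readers know the output is incomplete.
truncate_report() {
  local file="$1" max_bytes="${2:-3000}"
  local size
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -le "$max_bytes" ]; then
    cat "$file"
  else
    head -c "$max_bytes" "$file"
    printf '\n[truncated: %s of %s bytes shown]\n' "$max_bytes" "$size"
  fi
}

# Hypothetical helper: open a GitHub issue containing the trimmed report.
open_drift_issue() {
  local report="$1" stack="$2"
  truncate_report "$report" 60000 > issue-body.txt
  gh issue create \
    --title "Drift detected in ${stack} ($(date -u +%F))" \
    --body-file issue-body.txt
}
```

The same truncate_report call can feed the Slack payload, with a smaller limit, so both channels show the beginning of the plan and say plainly when the rest was cut.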

Pitfalls

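The drift job locks shared state. plan takes the state lock by default, so if an auto-apply pipeline runs at the same time, both jobs compete for it. Schedule the drift cron outside business hours, run the plan with -lock=false (reasonable for a job that never writes state), or work from a read-only state snapshot.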
AWS rate limits. A large state with thousands of resources makes thousands of API calls during refresh. If your stack is that large, stagger the cron runs to stay under the Throttling threshold. Alternatively, break the stack into smaller state files (see tf-large-scale-state).
driftctl only covers what it knows. Not every provider resource type is supported. If your state contains an unusual resource type, driftctl may ignore it or label it "unsupported".
Alert fatigue from "drift detected". Five false positives every day and the team turns the alert off. Then real drift happens and no one sees it. Fix the noise, do not suppress the signal.
Refresh is not full drift detection. The refresh step only examines attributes in the provider schema. If AWS adds a field the provider does not yet know about, that drift passes unnoticed. Periodic provider upgrades matter.
terraform refresh as a standalone command is outdated. Older pipelines ran terraform refresh && terraform plan. Since v0.15.4, refresh is deprecated in favor of terraform apply -refresh-only, and plan already performs the refresh step. Do not complicate the pipeline.
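
Two of the pitfalls above, lock contention and API throttling, can be eased from the shell. The sketch below derives a stable cron minute from a stack name so that stacks do not all refresh at once, and runs the plan without the state lock and with lower parallelism. stagger_minutes, drift_cron, and plan_gently are hypothetical names, and the value 5 is an arbitrary choice below Terraform's default parallelism of 10.

```shell
#!/usr/bin/env bash
# Derive a stable minute offset (0-59) from a stack name. The same name
# always yields the same minute, and different names tend to spread out.
stagger_minutes() {
  local crc
  crc="$(printf '%s' "$1" | cksum | cut -d' ' -f1)"
  echo $(( crc % 60 ))
}

# Print a daily cron expression for the stack, in the 06:00 UTC hour.
drift_cron() {
  printf '%s 6 * * *\n' "$(stagger_minutes "$1")"
}

# Plan without taking the state lock and with fewer concurrent API calls.
# Only appropriate for a job that never writes state.
plan_gently() {
  terraform plan -detailed-exitcode -no-color -lock=false -parallelism=5
}
```

Running drift_cron once per stack and pasting the result into each workflow's schedule keeps the offsets fixed without any sleeping on the runner.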