lesson ── terraform-production ── ~14 min ── 6 steps

Drift detection, scheduled plan, and alerting

This is the last lesson of the production track. Drift is the gap between your HCL and what actually exists in the cloud: someone edits a resource by hand, default tags get applied, another team runs a stray apply. You catch it with a scheduled CI job that runs terraform plan -detailed-exitcode; exit code 2 means the plan is not empty, which on an unchanged branch means drift. In this lesson you build a baseline, break it through aws-cli, and watch plan detect the drift.

interactive sandbox

A pair of containers starts up: terraform 1.9 and localstack 3.8 on the same network. A terminal opens in the browser, and you can run terraform init right away. Every step is checked automatically. TTL 45 minutes, no registration.

stack ── terraform · localstack · 1 GB RAM · self-destructs after 45 min of inactivity

Steps

01
Build the baseline infra
bash
cd /home/student/tf-drift
cat > main.tf <<'EOF'
resource "aws_s3_bucket" "drift_demo" {
  bucket = "linuxlab-drift-demo"

  tags = {
    ManagedBy = "terraform"
    Owner     = "student"
  }
}
EOF
terraform init -no-color > /dev/null
terraform apply -auto-approve -no-color
The bucket is created and state matches reality. That is the baseline.

✓ Baseline is in place. State == cloud.

02
detailed-exitcode with no drift
bash
set +e
terraform plan -detailed-exitcode -no-color > /dev/null 2>&1
code=$?
set -e
echo "exit: $code"
It should be 0: clean, no drift. This is what we expect in production when nothing is broken.

If you add -refresh=false now, the exit code stays the same, but the check is weaker: plan no longer reads the real resources and compares HCL against state only (see the pitfalls).
✓ Plan is clean. No drift, a clean cron pass.

03
Break reality through aws-cli
"Someone went into the Console and edited the tags":
bash
aws --endpoint-url=http://localstack:4566 \
  s3api put-bucket-tagging \
  --bucket linuxlab-drift-demo \
  --tagging 'TagSet=[{Key=ManagedBy,Value=manual},{Key=Hacker,Value=was-here}]'
aws --endpoint-url=http://localstack:4566 \
  s3api get-bucket-tagging --bucket linuxlab-drift-demo

The real bucket now has:
- ManagedBy: manual (was terraform)
- Hacker: was-here (new)
- Owner: gone, because put-bucket-tagging replaces the whole tag set

Terraform state knows none of this.
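The three kinds of change in this step (a changed value, an added key, a removed key) can be told apart mechanically. Below is a minimal sketch that compares two tag sets given as key=value lines. The function name tag_drift is hypothetical, the tag values are copied from this lesson, and tag values containing "=" are not handled.

```bash
#!/usr/bin/env bash
# tag_drift EXPECTED ACTUAL
# Both arguments are newline-separated key=value pairs.
# Prints one line per difference, sorted for stable output.
tag_drift() {
  awk -F= '
    NR == FNR { want[$1] = $2; next }   # first input: tags from HCL
    { have[$1] = $2 }                   # second input: tags in the cloud
    END {
      for (k in want) {
        if (!(k in have))            print "removed " k
        else if (have[k] != want[k]) print "changed " k ": " want[k] " -> " have[k]
      }
      for (k in have) if (!(k in want)) print "added " k
    }
  ' <(printf '%s\n' "$1") <(printf '%s\n' "$2") | sort
}

expected=$'ManagedBy=terraform\nOwner=student'
actual=$'ManagedBy=manual\nHacker=was-here'
tag_drift "$expected" "$actual"
```

This is only an illustration of what "drift" means at the attribute level; in the lesson itself terraform plan does this comparison for you.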

✓ Drift introduced. Terraform will detect it next.

04
plan -detailed-exitcode == 2
bash
set +e
terraform plan -detailed-exitcode -no-color 2>&1 | tail -20
code=${PIPESTATUS[0]}  # exit code of terraform, not of tail
set -e
echo "exit: $code"
It should show a diff (the tags differ) and exit 2. That is the drift signal.
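One detail is easy to get wrong when plan output is piped through tail: after a pipeline, $? holds the status of the last command (tail), so the exit 2 from terraform is lost unless it is read from bash's PIPESTATUS array or pipefail is set. The sketch below shows the difference; fake_plan is a hypothetical stand-in for terraform plan -detailed-exitcode so the snippet runs anywhere.

```bash
#!/usr/bin/env bash
# fake_plan is a hypothetical stand-in for `terraform plan -detailed-exitcode`:
# it prints some output and returns 2, as plan does for a non-empty diff.
fake_plan() {
  echo "~ tags changed"
  return 2
}

# Naive version: $? is the status of tail, the last command in the pipeline.
fake_plan | tail -20
naive=$?

# PIPESTATUS holds the status of every command in the most recent pipeline;
# index 0 is the first command, which is the one we care about.
fake_plan | tail -20
code=${PIPESTATUS[0]}

echo "naive: $naive, real: $code"
```

PIPESTATUS must be read on the very next command, because any later command overwrites it.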

In CI:
- exit 0 → clean, no action.
- exit 1 → an error (state corrupt, provider failing, etc.).
- exit 2 → drift; alert Slack/PD/issue.
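The mapping above can be written as a small function that a CI wrapper could call. This is a sketch only: the function name plan_verdict and the action labels none/alert/fail are hypothetical, not part of Terraform.

```bash
#!/usr/bin/env bash
# Map the exit code of `terraform plan -detailed-exitcode` to a CI action.
plan_verdict() {
  case "$1" in
    0) echo "none"  ;;   # clean: nothing to do
    2) echo "alert" ;;   # non-empty plan: notify Slack/PD or open an issue
    *) echo "fail"  ;;   # 1 or anything unexpected: the check itself is broken
  esac
}

for code in 0 1 2; do
  echo "exit $code -> $(plan_verdict "$code")"
done
```

Treating every code other than 0 and 2 as a failure keeps an unexpected exit status from being read as "clean".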
You can read the diff in detail:
bash
terraform plan -no-color -out=drift.tfplan 2>&1 | grep -A20 "drift_demo" | head -40
You can see exactly what diverged: Terraform wants to return the tags to the HCL description.
✓ Drift caught. exit 2, the signal for a cron alert.

05
A scheduled-drift script
A production shell script for the scheduled job:
bash
cat > drift-check.sh <<'EOF'
#!/usr/bin/env bash
# No -e at the top: plan exits 2 on drift, and that must not abort the script.
set -uo pipefail

cd /home/student/tf-drift
terraform init -input=false -no-color > /dev/null

set +e
terraform plan \
  -detailed-exitcode \
  -input=false \
  -no-color \
  -lock-timeout=2m \
  -out=drift.tfplan
code=$?
set -e

case $code in
  0)
    echo "drift-check: clean, no changes"
    exit 0
    ;;
  2)
    echo "drift-check: DRIFT DETECTED"
    terraform show -no-color drift.tfplan > drift.txt
    # Here a webhook to Slack would go:
    # curl -X POST -H 'Content-Type: application/json' \
    #   --data "{\"text\": \"Drift detected:\n$(cat drift.txt | head -50)\"}" \
    #   "$SLACK_WEBHOOK_URL"
    echo "--- begin drift ---"
    head -30 drift.txt
    echo "--- end drift ---"
    exit 1  # CI treats drift as a failure
    ;;
  *)
    echo "drift-check: ERROR (exit $code)"
    exit "$code"
    ;;
esac
EOF
chmod +x drift-check.sh
./drift-check.sh 2>&1 | tail -30
echo "script exit: ${PIPESTATUS[0]}"  # status of the script, not of tail

It should show DRIFT DETECTED and exit 1 (because there is drift).
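The commented-out webhook call in drift-check.sh interpolates drift.txt straight into a JSON string. Plan output contains double quotes and newlines, so the payload would not be valid JSON as written. A minimal sketch of an escaping helper follows; json_escape and slack_payload are hypothetical names, and only backslash, double quote, newline, carriage return and tab are handled.

```bash
#!/usr/bin/env bash
# Escape a string for use inside a JSON string literal.
# Other control characters are not handled in this sketch.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}       # backslash first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Build a {"text": "..."} payload from arbitrary text.
slack_payload() {
  printf '{"text": "%s"}' "$(json_escape "$1")"
}

sample=$'Drift detected:\n  ~ tags = { "ManagedBy" = "terraform" }'
slack_payload "$sample"
echo
```

With such a helper the call could become curl -X POST -H 'Content-Type: application/json' --data "$(slack_payload "$(head -50 drift.txt)")" "$SLACK_WEBHOOK_URL". If jq is available in the CI image, letting it build the JSON is the more robust choice.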

✓ The cron script is ready. In GitHub Actions this runs from a schedule trigger with a cron expression.

The same thing on OpenTofu

OpenTofu keeps the CLI and state compatible with Terraform for the commands in this step, so migration usually comes down to mv .terraform .terraform.bak; tofu init -upgrade. On a first switch, though, back up the state and do a trial run on a feature branch: the differences cluster in the newer features (variables in backend configuration, state encryption, OCI registry-backed modules). See tf-opentofu-parity for the full matrix.

→ OpenTofu parity

06
Reconcile vs ignore, what to do with drift
There are three strategies:

1. Reconcile: apply returns the cloud to the HCL.
bash
terraform apply -auto-approve drift.tfplan
This removes the Hacker:was-here tag and restores ManagedBy:terraform and Owner:student. It fits when the drift is unwanted.

2. Update HCL: the cloud is right.
bash
# do nothing to the cloud, and in HCL set the tags to what the cloud has now:
# tags = { ManagedBy = "manual", Hacker = "was-here" }
This is for when "someone edited the Console" but the change is wanted; you legalize it in HCL. Then run terraform apply -refresh-only to sync state.

3. Ignore: there is drift, but it does not matter. Inside the resource block:
hcl
lifecycle {
  ignore_changes = [tags["Hacker"]]
}
Terraform stops reporting the Hacker tag. Use this when that tag is set by another system (k8s-operator, AWS Config) and has nothing to do with Terraform.

Reconcile it:
bash
terraform apply -auto-approve -no-color > /dev/null
aws --endpoint-url=http://localstack:4566 \
  s3api get-bucket-tagging --bucket linuxlab-drift-demo
The tags are back to the HCL version.
✓ Drift reconciled. The production track is done.

When HCL does not cover everything that exists in the cloud
terraform plan sees drift only for resources that are in state. If someone created a bucket by hand, it exists in the cloud, not in state, and plan will not see it.

For that, use driftctl:
bash
# install
curl -L https://github.com/snyk/driftctl/releases/latest/download/driftctl_linux_amd64 \
  -o /usr/local/bin/driftctl
chmod +x /usr/local/bin/driftctl
# scan a local state file and print the report to the console
driftctl scan \
  --from tfstate://terraform.tfstate \
  --output console://
It shows:
- Managed: resources that have drift.
- Unmanaged: in the cloud but not in state.
- Deleted: in state but not in the cloud.
This is broader than terraform plan. In production the usual stack is a frequent cron job with terraform plan -detailed-exitcode plus a less frequent driftctl scan, for example once a week.
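The unmanaged and deleted categories are plain set differences between "what state knows" and "what the cloud has". A minimal sketch with comm follows. Both lists are hypothetical sample names; how you obtain comparable identifiers from state and from the cloud is left out, and this is not how driftctl is implemented.

```bash
#!/usr/bin/env bash
# Hypothetical sample data: resource names known to state and found in the cloud.
in_state=$'bucket-a\nbucket-b'
in_cloud=$'bucket-b\nbucket-c'

# comm needs sorted input. Column 1 = only in the first list, column 2 = only
# in the second, column 3 = in both; the -NN flags suppress unwanted columns.
managed=$(comm -12 <(sort <<<"$in_state") <(sort <<<"$in_cloud"))
unmanaged=$(comm -13 <(sort <<<"$in_state") <(sort <<<"$in_cloud"))
deleted=$(comm -23 <(sort <<<"$in_state") <(sort <<<"$in_cloud"))

echo "in both:   $managed"
echo "unmanaged: $unmanaged"
echo "deleted:   $deleted"
```

terraform plan only ever looks at the first list, which is why the bucket created by hand stays invisible to it.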

The AWS-native alternative is AWS Config, a service that records configuration changes to your resources. Use it when you want cross-account or cross-team visibility and an audit trail matters more than a Terraform-specific signal.

See tf-drift-detection.
- → Drift detection theory
- → Plan-as-artifact

What you learned

terraform plan -detailed-exitcode returns exit 0 (clean), 1 (error), or 2 (drift). A cron job in CI runs it once a day or once an hour and, on 2, alerts through Slack/PD/an issue. The plan job reads state and the cloud but changes nothing, so give it a read-only IAM role.

commands

terraform plan -detailed-exitcode -no-color: the canonical drift check. The exit code is the verdict.
terraform plan -refresh-only: refresh only, with no comparison against HCL; it shows what changed in the cloud relative to state.
aws s3api put-bucket-tagging --bucket X --tagging '...': an example of what a stray manual change does: it creates drift.

concepts

· detailed-exitcode 2 means drift; 1 means an error, not drift
· A read-only role matters: the drift job must never apply by accident
· False positives wear the team down; clear them with ignore_changes