Workflow and CI/CD

What happens between plan and apply. Why fmt and validate belong in the pipeline. How to reach AWS from GitHub Actions without long-lived keys. Drift detection, require-approval, splitting state by environment. Questions for DevOps/SRE roles.

6 questions · ~23 min read

#plan-vs-apply-semantics

junior · often

What does `terraform plan` do versus `apply`? What does apply do with a plan?

What to answer

Plan reads state plus HCL, queries the provider (refresh), computes the diff, and prints what it intends to do and why. It changes nothing in the provider. Apply takes an already computed plan, either from the file written by `-out=plan.tfplan` or one computed on the fly and confirmed at the interactive prompt, and runs the change set. If apply runs without a saved plan file, it makes a plan under the hood and then applies it, which is convenient locally. In CI the right way is plan, then artifact, then apply from the artifact. That way the reviewer sees exactly what ships.

What they want to hear

A senior should:
- name the key difference: plan is side-effect free (apart from refresh), apply mutates and updates state
- explain `-out=plan.tfplan`: a binary artifact that pins a snapshot of state plus the change set, and apply runs exactly that change set
- say that state can change between plan and apply (another apply ran in between), in which case Terraform rejects the saved plan with "Saved plan is stale"; a manual change in the Console does not trip that check, but it can make the apply fail partway through with a provider error
- mention `terraform show -json plan.tfplan` for parsing and post-processing in CI (cost estimation, OPA policy)
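As a rough illustration of the post-processing step, the sketch below counts resources per action from the JSON that `terraform show -json plan.tfplan` prints. It assumes the documented `resource_changes[].change.actions` layout; the function name `summarize_plan` and the sample document are made up for the example.

```python
import json
from collections import Counter


def summarize_plan(plan: dict) -> Counter:
    """Count resources per action set, skipping resources with no change."""
    counts = Counter()
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        # A replacement shows up as two actions, e.g. ["delete", "create"].
        counts["+".join(actions)] += 1
    return counts


# Hypothetical, heavily trimmed plan document used only for illustration.
SAMPLE_PLAN = json.loads("""
{
  "resource_changes": [
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}},
    {"address": "aws_instance.web", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_iam_role.ci", "change": {"actions": ["no-op"]}}
  ]
}
""")

if __name__ == "__main__":
    print(dict(summarize_plan(SAMPLE_PLAN)))
```

In a pipeline the input would come from the saved plan itself, for instance by piping `terraform show -json plan.tfplan` into the script and reading it with `json.load(sys.stdin)`.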

Pitfalls

✗ Running `apply -auto-approve` in CI without a saved plan. What the reviewer saw and what shipped can differ
✗ Forgetting that state can move between plan and apply. Terraform refuses a saved plan once the state it was built from has changed, so the plan has to be regenerated
✗ Saving the plan file as a public artifact. It holds state secrets in plain text

Follow-up

? Why use `-out` when you can run `terraform apply` directly?
? What happens if apply tries to apply a stale plan?
? How do you protect the tfplan artifact in CI? What in it is sensitive?

Deeper in the knowledge base

terraform plan: see what Terraform is about to do
terraform apply: apply a plan to a real cloud
[[tf-plan-apply-ci]]

#fmt-validate-in-pipeline

junior · often

What does `fmt` do, what does `validate` do, and why both in CI?

What to answer

`fmt` formats HCL: indentation, alignment, quotes. It does not check meaning. `validate` parses HCL and checks internal consistency: types, references to variables or resources that do not exist, required arguments. It makes no calls to the provider's remote API and does not refresh, so it is fast. CI needs both: `fmt -check` blocks a PR with unformatted code, and `validate` catches typos before someone waits 10 minutes on a plan only to see "Reference to undeclared input variable."

What they want to hear

The candidate should:
- separate the roles: fmt is style, validate is syntax plus basic semantics
- say that `fmt -check -recursive` is for CI and plain `fmt` is for local fix-on-save
- note that validate does not catch errors like "that instance type does not exist in AWS"; you learn that only at plan, through the provider
- mention `tflint` as deeper static analysis: it knows provider specifics (valid instance types, deprecated arguments)
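To make `validate` failures easier to read in a CI log, a pipeline can ask for machine-readable output with `terraform validate -json`. The sketch below turns that report into one line per diagnostic. It assumes the `diagnostics[].severity`, `summary`, and `range` fields of that report; the helper name `format_diagnostics` and the sample report are hypothetical.

```python
import json


def format_diagnostics(report: dict) -> list[str]:
    """Render each diagnostic as 'severity: file:line: summary'."""
    lines = []
    for diag in report.get("diagnostics", []):
        rng = diag.get("range") or {}
        where = rng.get("filename", "?")
        line = (rng.get("start") or {}).get("line")
        if line is not None:
            where = f"{where}:{line}"
        lines.append(f"{diag['severity']}: {where}: {diag['summary']}")
    return lines


# Hypothetical report, trimmed to the fields this sketch reads.
SAMPLE_REPORT = json.loads("""
{
  "valid": false,
  "error_count": 1,
  "warning_count": 0,
  "diagnostics": [
    {
      "severity": "error",
      "summary": "Reference to undeclared input variable",
      "range": {"filename": "main.tf", "start": {"line": 12, "column": 17}}
    }
  ]
}
""")

if __name__ == "__main__":
    for text in format_diagnostics(SAMPLE_REPORT):
        print(text)
```

A CI step could fail the job whenever the report's `valid` field is false and print these lines as the reason.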

Pitfalls

✗ Putting only `validate` in CI without `fmt -check`. Code in mixed styles lands in the repo, and the diffs get noisy
✗ Thinking validate catches ALL errors. It does not catch logic ones: a runaway `count`, a wrong region in the provider
✗ Running `validate` in every subfolder separately. On a large repo that is heavy; it is enough to validate each root module after `init`, because the child modules it calls are checked along with it

Follow-up

? How does `tflint` go deeper than `terraform validate`?
? Why use `fmt -recursive`, and why is it not recursive by default?
? What do you need to do before `validate` so it does not complain about providers?

#oidc-aws-no-static-keys

senior · often

How do you reach AWS from GitHub Actions without long-lived keys?

What to answer

OIDC. GitHub hands the runner a short-lived JWT with claims about the repository and branch. In AWS you set up an IAM identity provider for `token.actions.githubusercontent.com` and an IAM role with a trust policy that accepts a JWT with specific claims (`sub = repo:owner/repo:ref:refs/heads/main`). The workflow calls `aws-actions/configure-aws-credentials@v4`; the action exchanges the JWT for temporary credentials through `AssumeRoleWithWebIdentity`. The credentials expire on their own after the session duration (an hour by default), so `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` no longer need to be stored in Secrets.

What they want to hear

A senior should:
- explain the trust policy with a specific `sub` claim: only a run from a specific branch or org/repo is accepted, everything else gets a 403
- name `permissions: id-token: write` at the job or workflow level; without it GitHub will not issue the JWT
- say that temporary credentials expire shortly after the run, so a leak from the logs gives no long-term access
- mention the harmful pattern `sub = repo:owner/repo:*`, which accepts a JWT from any branch and any PR, potentially including forks; the right way is to limit it to a branch or environment
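The trust policy described above can be sketched as a small generator. This is an illustration under assumptions, not a reference policy: the account ID and the `owner/repo` subject are placeholders, the helper name `github_trust_policy` is made up, and the wildcard check is just one way to enforce the "no `*` in `sub`" rule.

```python
import json

# Issuer host of GitHub's OIDC provider as registered in IAM.
GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_trust_policy(account_id: str, subject: str) -> dict:
    """Build a role trust policy that pins both the audience and the subject."""
    if "*" in subject:
        raise ValueError("refusing a wildcard subject; pin a branch or environment")
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        f"{GITHUB_OIDC_HOST}:sub": subject,
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account and repository.
    policy = github_trust_policy("123456789012", "repo:owner/repo:ref:refs/heads/main")
    print(json.dumps(policy, indent=2))
```

To scope the role to a GitHub environment instead of a branch, the subject would take the form `repo:owner/repo:environment:prod`.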

Pitfalls

✗ Giving a trust policy with a wildcard `sub = repo:owner/repo:*`. Any workflow run in the repo can then assume the role, PR builds included, and fork PRs too if the workflow triggers on `pull_request_target`
✗ Forgetting `permissions: id-token: write`. GitHub then issues no JWT at all, and the credentials step fails before it ever reaches AWS
✗ Mismatching the audience. The action requests `sts.amazonaws.com` by default, and the IAM identity provider and the `aud` condition in the trust policy must match it, otherwise AssumeRoleWithWebIdentity rejects the token

Follow-up

? What should the `sub` claim contain for the trust policy to accept it?
? How does `id-token: write` differ from the other permissions?
? How do you scope the role to a specific environment in GitHub?

Deeper in the knowledge base

[[tf-oidc-aws]]
[[tf-plan-apply-ci]]

#drift-detection-cron

intermediate · sometimes

How do you set up drift detection in CI? What do you monitor?

What to answer

A cron job in GitHub Actions once an hour: `terraform init && terraform plan -detailed-exitcode -lock-timeout=5m`. Exit 0 means no changes, exit 2 means there are changes, exit 1 means an error. On exit 2 you send an alert to Slack/PagerDuty with a link to the run logs. The alert tells the owner that something in the infrastructure changed without a PR. You can respond two ways: run an apply through a normal PR to bring the infrastructure back in line with the HCL, or absorb the drift by updating the HCL.

What they want to hear

A senior should:
- name `-detailed-exitcode` as the building block; without it, parsing stdout is fragile
- separate noise (cloud-managed attrs: ASG capacity, a Lambda function ARN after a blue/green deploy) from real drift (a security group opened a port, a tag was removed)
- mention `ignore_changes` for attributes that change normally outside Terraform
- cover Terraform Cloud / TFE drift detection as the out-of-the-box option with a UI and notifications
- note that drift on shared modules is a signal that several roots touch one resource, so the ownership needs sorting out
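The exit-code contract of `-detailed-exitcode` can be sketched as follows. The function names are hypothetical, and the alerting itself (Slack, PagerDuty) is left out; the point is only that the scheduled job branches on the return code and never parses stdout.

```python
import subprocess


def classify_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to an outcome."""
    if code == 0:
        return "clean"  # plan succeeded, no changes
    if code == 2:
        return "drift"  # plan succeeded, changes are pending
    return "error"      # exit 1 (or anything unexpected): the plan itself failed


def run_drift_check(workdir: str) -> str:
    """Run the plan in `workdir` and classify the result (needs terraform on PATH)."""
    result = subprocess.run(
        [
            "terraform", "plan",
            "-detailed-exitcode",
            "-lock-timeout=5m",
            "-input=false",
            "-no-color",
        ],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_exit_code(result.returncode)
```

A scheduled job would call `run_drift_check` once per root module, alert on `"drift"`, and treat `"error"` as a broken check rather than as drift.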

Pitfalls

✗ Standing up drift detection with no exclusion list. Alerts pour in every hour, and engineers start ignoring them (alert fatigue)
✗ Using `plan` with `-lock=false`. You compete with a real apply and may see a transient state
✗ Running drift detection on staging more often than on prod. Prod is where drift matters most, so it usually gets the tighter schedule

Follow-up

? What is cloud-managed drift, and how do you tell it apart from real drift?
? How does drift detection in Terraform Cloud differ from a cron job in GitHub Actions?
? How do you set up alert routing to avoid fatigue?

Deeper in the knowledge base

[[tf-drift-detection]]
terraform plan: see what Terraform is about to do
[[tf-plan-apply-ci]]

#state-per-env-and-isolation

intermediate · often

How do you split state by environment: dev / stage / prod?

What to answer

Three approaches. One is workspaces: one root, separate state files named per environment. Simple, but one set of HCL covers every environment, which makes different topologies hard. Two is separate root directories (`envs/dev/`, `envs/prod/`), with shared code through child modules. The control is explicit and module versions can differ, but the backend config is duplicated. Three is Terragrunt or a similar tool: a generator that expands the environment config into a root on the fly. Most production teams pick (2) or (3); workspaces stay for feature branches inside dev.

What they want to hear

A senior should:
- separate the three approaches and name the strength of each
- say that prod needs different settings from dev (multi-az, RDS size, backup policy), and workspaces fit poorly, because conditional branching on `terraform.workspace` breeds `if-else` in the HCL
- mention that each environment is its own backend (its own bucket for state) so the blast radius stays bounded
- note that variables across environments are better through .tfvars files or remote secrets than through `terraform.workspace` switching
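One way to keep a separate backend per environment without copying the backend block into every directory is partial backend configuration: the root declares an empty backend and CI passes the settings through `-backend-config=KEY=VALUE` at `init`. The sketch below builds that command line. The bucket naming scheme, state key, and region are invented for the example, and `backend_init_args` is a hypothetical helper.

```python
ENVIRONMENTS = ("dev", "stage", "prod")


def backend_init_args(env: str) -> list[str]:
    """Build a `terraform init` command with a per-environment state bucket."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env}")
    settings = {
        # Hypothetical naming: one bucket per environment bounds the blast radius.
        "bucket": f"example-tfstate-{env}",
        "key": "network/terraform.tfstate",
        "region": "eu-west-1",
    }
    args = ["terraform", "init", "-input=false"]
    args += [f"-backend-config={key}={value}" for key, value in settings.items()]
    return args


if __name__ == "__main__":
    print(" ".join(backend_init_args("prod")))
```

Because each environment resolves to its own bucket, IAM access to stage state does not imply access to prod state.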

Pitfalls

✗ Using workspaces for prod/staging. The conditional logic spreads, and the diff is hard to review
✗ Putting every environment's state in one bucket under different keys. You hand stage access to prod state through IAM by accident
✗ Duplicating HCL instead of using child modules. A fix lands in one environment only, and the copies drift apart

Follow-up

? When are workspaces actually a good fit?
? How does terragrunt help with a multi-env layout?
? How do you avoid duplicating the backend config across env directories?

#pipeline-approval-and-secrets

senior · sometimes

How is a typical Terraform CI built: PR, plan, approval, apply?

What to answer

The flow: a PR opens, and CI runs `init`, `fmt -check`, `validate`, `tflint`/`checkov`, and `terraform plan -out=tfplan` under a read-only role. The plan result is posted to the PR as a comment or a sticky thread; the reviewer sees the change set and approves. The merge then runs apply from that same tfplan artifact under a write role, behind a manual approval gate. Secrets do not live in the repo, CI gets an OIDC role with least privilege, and the tfplan is kept as an artifact with 7-14 day retention.

What they want to hear

A senior should:
- name the role split between a plan role (read-only) and an apply role (write). If a PR does something odd, the read-only role won't let it break anything
- say that the apply gate should be a manual approval with specific approvers, not "automerge on passing tests"
- separate prod from non-prod: stage can auto-apply on merge to main, prod is always manual
- mention that a plan comment in the PR speeds up review many times over; Atlantis, Spacelift, and Terraform Cloud do it out of the box
- note that the tfplan artifact should live until apply and no longer, because beyond that it becomes a leaked snapshot of state
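On plain GitHub Actions the plan comment has to be assembled by hand. A minimal sketch, assuming the JSON from `terraform show -json tfplan` as input: it lists each changed resource address with its actions and truncates the body. The 65536-character limit is an assumption about GitHub's comment size cap, and `render_plan_comment` is a hypothetical helper; posting the text is left to whatever the pipeline already uses.

```python
def render_plan_comment(plan: dict, limit: int = 65536) -> str:
    """Render changed resources as a markdown list, truncated to `limit` chars."""
    rows = []
    for rc in plan.get("resource_changes", []):
        actions = rc["change"]["actions"]
        if actions == ["no-op"]:
            continue
        rows.append(f"- `{rc['address']}`: {', '.join(actions)}")
    body = "\n".join(rows) if rows else "No changes."
    text = f"### Terraform plan\n\n{body}\n"
    if len(text) > limit:
        notice = "\n\n(truncated; see the CI run for the full plan)\n"
        text = text[: limit - len(notice)] + notice
    return text


if __name__ == "__main__":
    # Hypothetical one-resource plan for illustration.
    sample = {
        "resource_changes": [
            {"address": "aws_s3_bucket.logs", "change": {"actions": ["create"]}}
        ]
    }
    print(render_plan_comment(sample))
```

Listing only addresses and actions, rather than full attribute diffs, also keeps sensitive values out of the PR thread.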

Pitfalls

✗ Using one role for plan and apply. A PR from an outside contributor gets access to write operations
✗ Not posting the plan in the PR. The reviewer approves blind
✗ Keeping the tfplan around as a public artifact. It is a readable snapshot of state with secrets

Follow-up

? What is wrong with Terraform Cloud when you need full isolation?
? How do you post a plan comment in a PR on plain GitHub Actions?
? What are the minimal permissions for the plan role in AWS?

Deeper in the knowledge base

[[tf-plan-apply-ci]]
[[tf-oidc-aws]]
[[tf-policy-as-code]]
