DevOps · #terraform #iac #infrastructure

Terraform Infrastructure as Code in Practice

2024.06.26 7 min 3.0k

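Introduction

Infrastructure as Code (IaC) is one of the core practices of DevOps. Terraform is among the most popular IaC tools and manages infrastructure across multiple cloud providers. This article walks through Terraform from HCL syntax basics to production-grade best practices.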

Terraform Architecture

graph TB
    subgraph Workflow["Terraform Workflow"]
        Write["Write .tf files"] --> Init["terraform init<br>Initialize providers"]
        Init --> Plan["terraform plan<br>Generate execution plan"]
        Plan --> Apply["terraform apply<br>Apply changes"]
        Apply --> State["terraform.tfstate<br>State file"]
    end

    subgraph Providers["Providers"]
        AWS["AWS Provider"]
        GCP["GCP Provider"]
        Azure["Azure Provider"]
        K8s["Kubernetes Provider"]
    end

    Init --> Providers
    Apply --> |"API calls"| Cloud["Cloud Resources"]
    State --> |"Records mapping"| Cloud

Core Concepts

graph LR
    Config[".tf configuration files"] --> |"terraform plan"| Plan["Execution plan<br>(+create, ~update, -destroy)"]
    Plan --> |"terraform apply"| Resources["Cloud resources"]
    Resources --> |"Recorded in"| State["State file"]
    State --> |"Compared during terraform plan"| Config

HCL Syntax Basics

Provider Configuration

# versions.tf - Terraform and provider version constraints
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    kubernetes = {
      source  = "hashicorp/kubernetes"
      version = "~> 2.25"
    }
  }

  # Remote state storage
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "production/terraform.tfstate"
    region         = "ap-northeast-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

provider "aws" {
  region = var.aws_region

  # Tags applied to every resource created through this provider
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
      Project     = var.project_name
    }
  }
}

Variables and Outputs

# variables.tf
variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "ap-northeast-1"
}

variable "environment" {
  description = "Environment name"
  type        = string
  validation {
    condition     = contains(["dev", "staging", "production"], var.environment)
    error_message = "Environment must be one of: dev, staging, production."
  }
}

variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"
}

variable "private_subnets" {
  description = "Private subnet CIDRs"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
}

variable "cluster_config" {
  description = "EKS cluster configuration"
  type = object({
    name           = string
    version        = string
    node_count     = number
    instance_types = list(string)
    enable_logging = bool
  })
  default = {
    name           = "main"
    version        = "1.29"
    node_count     = 3
    instance_types = ["m6i.large"]
    enable_logging = true
  }
}

# outputs.tf
output "vpc_id" {
  description = "VPC ID"
  value       = aws_vpc.main.id
}

output "cluster_endpoint" {
  description = "EKS cluster endpoint"
  value       = aws_eks_cluster.main.endpoint
  sensitive   = true
}

output "subnet_ids" {
  description = "Private subnet IDs"
  value       = aws_subnet.private[*].id
}
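
Of the variables declared above, only `environment` has no default, so it must be supplied at plan time; the others can be overridden. As a minimal sketch, a `terraform.tfvars` file could set them like this. The values are illustrative assumptions, not taken from a real environment.

```hcl
# terraform.tfvars (illustrative values only)

# Required: has no default and must pass the validation rule
environment = "staging"

# Optional overrides of the defaults declared in variables.tf
vpc_cidr        = "10.1.0.0/16"
private_subnets = ["10.1.1.0/24", "10.1.2.0/24", "10.1.3.0/24"]

# An object variable must provide every attribute declared in its type
cluster_config = {
  name           = "staging"
  version        = "1.29"
  node_count     = 2
  instance_types = ["m6i.large"]
  enable_logging = false
}
```

Terraform loads a file named `terraform.tfvars` from the working directory automatically; other file names need the `-var-file` option.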

Resource Definitions

# vpc.tf - VPC network infrastructure
resource "aws_vpc" "main" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = {
    Name = "${var.project_name}-${var.environment}-vpc"
  }
}

# Create multiple subnets with count
resource "aws_subnet" "private" {
  count = length(var.private_subnets)

  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = {
    Name                              = "${var.project_name}-private-${count.index + 1}"
    "kubernetes.io/role/internal-elb" = "1"
  }
}

# Generate security group ingress rules with a dynamic block and for_each
resource "aws_security_group" "app" {
  name_prefix = "${var.project_name}-app-"
  vpc_id      = aws_vpc.main.id

  dynamic "ingress" {
    for_each = var.app_ports
    content {
      from_port   = ingress.value.port
      to_port     = ingress.value.port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
      description = ingress.value.description
    }
  }

  # Allow all outbound traffic
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }

  lifecycle {
    create_before_destroy = true
  }
}

# NAT Gateway (aws_eip.nat, aws_subnet.public and aws_internet_gateway.main
# are not shown in this article and are assumed to be defined elsewhere)
resource "aws_nat_gateway" "main" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public[0].id

  tags = {
    Name = "${var.project_name}-nat"
  }

  depends_on = [aws_internet_gateway.main]
}
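
The resources above reference `var.project_name` and `var.app_ports`, neither of which is declared in the `variables.tf` shown earlier. The following is one possible pair of declarations, inferred from how the values are used (`ingress.value.port`, `ingress.value.protocol`, and so on). The descriptions and the default rule are assumptions for illustration.

```hcl
# Hypothetical additions to variables.tf; names match the references in vpc.tf

variable "project_name" {
  description = "Project name used in resource names and tags"
  type        = string
}

variable "app_ports" {
  description = "Ingress rules for the application security group"
  type = list(object({
    port        = number
    protocol    = string
    cidr_blocks = list(string)
    description = string
  }))

  # Illustrative default: HTTPS from inside the VPC only
  default = [
    {
      port        = 443
      protocol    = "tcp"
      cidr_blocks = ["10.0.0.0/16"]
      description = "HTTPS from inside the VPC"
    },
  ]
}
```

Each element of the list produces one `ingress` block, so adding a port becomes a change to input data rather than to the resource definition.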

Data Sources

# Query existing resources
data "aws_availability_zones" "available" {
  state = "available"
}

data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

data "aws_caller_identity" "current" {}

data "aws_eks_cluster_auth" "main" {
  name = aws_eks_cluster.main.name
}
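
Data sources are read-only, and their attributes are referenced with the `data.` prefix. A short sketch of how the queries above might be consumed follows; the local and output names are hypothetical.

```hcl
locals {
  # Account ID of the credentials Terraform is running with
  account_id = data.aws_caller_identity.current.account_id

  # First three available AZ names, one per private subnet.
  # slice() fails if the region reports fewer than three zones.
  az_names = slice(data.aws_availability_zones.available.names, 0, 3)
}

output "ubuntu_ami_id" {
  description = "ID of the most recent matching Ubuntu 22.04 AMI"
  value       = data.aws_ami.ubuntu.id
}
```

Because `most_recent = true` can resolve to a new AMI whenever Canonical publishes one, exposing the ID as an output makes such changes visible in plan output.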

Modular Design

Module Structure

infrastructure/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── eks/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── iam.tf
│   └── rds/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   │   ├── main.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   └── production/
│       ├── main.tf
│       ├── terraform.tfvars
│       └── backend.tf
└── modules.tf
graph TB
    subgraph Environments["Environment configuration"]
        Dev["dev/main.tf"]
        Staging["staging/main.tf"]
        Prod["production/main.tf"]
    end

    subgraph Modules["Reusable modules"]
        VPC["modules/vpc"]
        EKS["modules/eks"]
        RDS["modules/rds"]
    end

    Dev --> VPC
    Dev --> EKS
    Staging --> VPC
    Staging --> EKS
    Staging --> RDS
    Prod --> VPC
    Prod --> EKS
    Prod --> RDS
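
In the layout above, each environment directory has its own `backend.tf`, so every environment keeps a separate state file. A minimal sketch of what `environments/dev/backend.tf` might contain is shown below. It reuses the bucket and lock table names from the provider configuration earlier in the article; the `dev/` key prefix is an assumption.

```hcl
# environments/dev/backend.tf (sketch; the state key is an assumed naming scheme)
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "dev/terraform.tfstate" # one state object per environment
    region         = "ap-northeast-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}
```

Backend blocks cannot reference variables, which is one reason the per-environment values are written out literally in each directory.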

Module Definition

# modules/vpc/main.tf
resource "aws_vpc" "this" {
  cidr_block           = var.vpc_cidr
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(var.tags, {
    Name = "${var.name}-vpc"
  })
}

resource "aws_subnet" "private" {
  for_each = { for idx, cidr in var.private_subnet_cidrs : idx => cidr }

  vpc_id            = aws_vpc.this.id
  cidr_block        = each.value
  availability_zone = var.azs[each.key]

  tags = merge(var.tags, {
    Name = "${var.name}-private-${each.key}"
    Tier = "private"
  })
}

# modules/vpc/variables.tf
variable "name" {
  type = string
}

variable "vpc_cidr" {
  type = string
}

variable "private_subnet_cidrs" {
  type = list(string)
}

variable "azs" {
  type = list(string)
}

variable "tags" {
  type    = map(string)
  default = {}
}

# modules/vpc/outputs.tf
output "vpc_id" {
  value = aws_vpc.this.id
}

output "private_subnet_ids" {
  value = [for s in aws_subnet.private : s.id]
}
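
The module above keys each subnet by its list index, so removing an entry from the middle of `private_subnet_cidrs` shifts the later keys. One possible alternative, shown here as a sketch rather than as the article's method, keys subnets by availability zone name using `zipmap`. It would replace the `aws_subnet.private` resource in the module and assumes `var.azs` and `var.private_subnet_cidrs` have the same length.

```hcl
# Hypothetical variant of the subnet resource in modules/vpc/main.tf
resource "aws_subnet" "private" {
  # Builds a map such as { "ap-northeast-1a" = "10.0.1.0/24", ... }
  for_each = zipmap(var.azs, var.private_subnet_cidrs)

  vpc_id            = aws_vpc.this.id
  cidr_block        = each.value
  availability_zone = each.key

  tags = merge(var.tags, {
    Name = "${var.name}-private-${each.key}"
    Tier = "private"
  })
}
```

With this keying, resource addresses look like `aws_subnet.private["ap-northeast-1a"]` and stay stable when other zones are added or removed.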

Calling Modules

# environments/production/main.tf
module "vpc" {
  source = "../../modules/vpc"

  name                 = "production"
  vpc_cidr             = "10.0.0.0/16"
  private_subnet_cidrs = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  azs                  = ["ap-northeast-1a", "ap-northeast-1c", "ap-northeast-1d"]

  tags = local.common_tags
}

module "eks" {
  source = "../../modules/eks"

  cluster_name    = "production"
  cluster_version = "1.29"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  node_count      = 5
  instance_types  = ["m6i.xlarge"]

  tags = local.common_tags
}

module "rds" {
  source = "../../modules/rds"

  name              = "production"
  engine_version    = "15.4"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 100
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  multi_az          = true

  tags = local.common_tags
}

locals {
  common_tags = {
    Environment = "production"
    Project     = "myproject"
    ManagedBy   = "terraform"
  }
}
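
Module outputs are only visible to the calling module. To surface them from the environment itself (for `terraform output` or for other configurations to read), the root module has to re-export them. A small sketch, assuming a hypothetical `outputs.tf` in the same directory:

```hcl
# environments/production/outputs.tf (hypothetical file)
output "vpc_id" {
  description = "ID of the production VPC"
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "IDs of the private subnets created by the vpc module"
  value       = module.vpc.private_subnet_ids
}
```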

State Management

Remote State Storage

# Create the infrastructure for the S3 backend
resource "aws_s3_bucket" "terraform_state" {
  bucket = "mycompany-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# DynamoDB table used for state locking
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
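
State files can contain secrets in plain text, so the bucket is usually locked down further. The article does not include this step; as a hedged addition, the AWS provider's `aws_s3_bucket_public_access_block` resource can be attached to the same bucket:

```hcl
# Block every form of public access to the state bucket
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```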

State Operations

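# List and inspect resources in the current state
terraform state list
terraform state show aws_vpc.main

# Move a resource (rename)
terraform state mv aws_vpc.main aws_vpc.primary

# Import an existing resource
terraform import aws_vpc.main vpc-12345678

# Remove a resource from state (the real resource is not deleted)
terraform state rm aws_vpc.legacy

# Pull the remote state
terraform state pull > state_backup.json

# Refresh state (terraform refresh is deprecated; use a refresh-only apply)
terraform apply -refresh-only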
sequenceDiagram
    participant Dev as Developer A
    participant Lock as DynamoDB Lock
    participant State as S3 State
    participant Cloud as AWS

    Dev->>Lock: Acquire state lock
    Lock-->>Dev: Lock acquired
    Dev->>State: Read current state
    State-->>Dev: terraform.tfstate
    Dev->>Cloud: API calls (create/update/delete)
    Cloud-->>Dev: Operation result
    Dev->>State: Update state file
    Dev->>Lock: Release lock
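
The `terraform import` and `terraform state rm` commands shown earlier change state immediately and leave no trace in the configuration. Newer Terraform releases also offer declarative equivalents that go through the normal plan and review cycle: `import` blocks (Terraform 1.5 and later) and `removed` blocks (1.7 and later, which is newer than the `>= 1.6.0` constraint used in this article). A sketch using the same addresses and ID as the CLI examples:

```hcl
# Declarative equivalent of: terraform import aws_vpc.main vpc-12345678
import {
  to = aws_vpc.main
  id = "vpc-12345678"
}

# Declarative equivalent of: terraform state rm aws_vpc.legacy
# (the resource block for aws_vpc.legacy must be deleted from the configuration)
removed {
  from = aws_vpc.legacy

  lifecycle {
    destroy = false # forget the resource without destroying it
  }
}
```

Both blocks can be deleted from the configuration once the change has been applied.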

Workspaces

# Create workspaces
terraform workspace new staging
terraform workspace new production

# Switch workspace
terraform workspace select production

# List workspaces
terraform workspace list

# Using the workspace in configuration:
# terraform.workspace returns the name of the current workspace
# Per-workspace configuration differences
locals {
  env_config = {
    dev = {
      instance_type = "t3.small"
      node_count    = 2
      multi_az      = false
    }
    staging = {
      instance_type = "t3.medium"
      node_count    = 3
      multi_az      = false
    }
    production = {
      instance_type = "m6i.large"
      node_count    = 5
      multi_az      = true
    }
  }

  # Select the settings for the current workspace
  config = local.env_config[terraform.workspace]
}

resource "aws_instance" "app" {
  count         = local.config.node_count
  instance_type = local.config.instance_type
  ami           = data.aws_ami.ubuntu.id

  tags = {
    Name        = "app-${terraform.workspace}-${count.index}"
    Environment = terraform.workspace
  }
}
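
One caveat with the lookup above: every working directory starts in a workspace named `default`, which has no entry in `env_config`, so `local.env_config[terraform.workspace]` fails there. A possible guard, sketched under the assumption that falling back to the `dev` settings is acceptable, wraps the lookup in `try`. The local name `config_or_dev` is hypothetical, and the block relies on the `env_config` local defined above.

```hcl
locals {
  # Fall back to the dev settings when the current workspace has no entry,
  # for example the built-in "default" workspace
  config_or_dev = try(
    local.env_config[terraform.workspace],
    local.env_config["dev"],
  )
}
```

Failing loudly may be the safer choice for production use; the fallback is mainly convenient for local experiments.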

Advanced Features

Lifecycle Rules

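resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "m6i.large"

  # The rules below are shown together for illustration only. Note that
  # prevent_destroy also rejects plans that replace the resource, so it
  # conflicts with replace_triggered_by in practice.
  lifecycle {
    # Create the replacement before destroying the old resource
    create_before_destroy = true

    # Block destruction (remove this rule before running destroy)
    prevent_destroy = true

    # Ignore changes made outside Terraform to these attributes
    ignore_changes = [
      tags["LastModified"],
      user_data,
    ]

    # Replace this resource when the referenced resource changes
    replace_triggered_by = [
      aws_security_group.app.id,
    ]
  }
}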
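
The `lifecycle` block can also hold custom conditions (`precondition` and `postcondition`, available since Terraform 1.2), which the article does not cover. As a brief sketch, a precondition could check that the AMI found by the data source matches the instance family. The resource name `checked` is hypothetical.

```hcl
resource "aws_instance" "checked" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "m6i.large" # an x86_64 instance family

  lifecycle {
    # Evaluated during plan; a false condition stops the run with this message
    precondition {
      condition     = data.aws_ami.ubuntu.architecture == "x86_64"
      error_message = "The selected AMI must be for the x86_64 architecture."
    }
  }
}
```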

Moved Blocks (Refactoring)

# When renaming a resource, a moved block avoids destroy + create
moved {
  from = aws_instance.web
  to   = aws_instance.app
}

# Migrating from count to for_each
moved {
  from = aws_subnet.private[0]
  to   = aws_subnet.private["az-a"]
}
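
The same mechanism covers moving a resource into a module, which is the typical step when extracting the flat `vpc.tf` from earlier into `modules/vpc`. A sketch, assuming the root module now calls the module as `module "vpc"` and the module names its VPC resource `this`, as in the module definition above:

```hcl
# Placed in the root module that calls module "vpc"
moved {
  from = aws_vpc.main
  to   = module.vpc.aws_vpc.this
}
```

After the move has been applied in every environment, the block can be removed, although keeping it for a while is harmless.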

Conditional Expressions and Loops

# Conditional creation
resource "aws_cloudwatch_log_group" "app" {
  count = var.enable_logging ? 1 : 0
  name  = "/app/${var.environment}"
}

# for expressions
locals {
  # Transform a list
  upper_names = [for name in var.names : upper(name)]

  # Transform a map
  tag_map = { for k, v in var.raw_tags : lower(k) => v }

  # Filter
  production_instances = [
    for instance in aws_instance.app :
    instance.id
    if instance.tags["Environment"] == "production"
  ]
}

# Iterate with for_each
resource "aws_iam_user" "users" {
  for_each = toset(var.user_names)
  name     = each.value
}
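
Resources created with `count` or `for_each` are collections, so references to them need slightly different handling. The sketch below shows two common patterns for the resources above; the output names are hypothetical.

```hcl
# A count = 0/1 resource is a list with zero or one element.
# one() returns the single element, or null when the list is empty.
output "log_group_arn" {
  description = "ARN of the log group, or null when logging is disabled"
  value       = one(aws_cloudwatch_log_group.app[*].arn)
}

# A for_each resource is a map keyed by each.key, here the user name
output "user_arns" {
  description = "Map of user name to IAM user ARN"
  value       = { for name, user in aws_iam_user.users : name => user.arn }
}
```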

CI/CD Integration

# .github/workflows/terraform.yaml
# Cloud credentials are assumed to be provided to the jobs separately.
name: Terraform

on:
  pull_request:
    paths: ['infrastructure/**']
  push:
    branches: [main]
    paths: ['infrastructure/**']

jobs:
  plan:
    runs-on: ubuntu-latest
    # Needed so the job can comment on the pull request
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Terraform Init
        run: terraform init
        working-directory: infrastructure/environments/production

      - name: Terraform Plan
        id: plan
        run: terraform plan -no-color -out=tfplan
        working-directory: infrastructure/environments/production

      - name: Comment PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        env:
          # Pass the plan through the environment so that backticks or ${...}
          # in the output cannot break the script
          PLAN: ${{ steps.plan.outputs.stdout }}
        with:
          script: |
            const plan = process.env.PLAN;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## Terraform Plan\n\`\`\`\n${plan}\n\`\`\``
            });

  apply:
    needs: plan
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4

      # Pin the same version as the plan job
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Terraform Apply
        run: |
          terraform init
          terraform apply -auto-approve
        working-directory: infrastructure/environments/production

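Best Practices

  1. Never commit state files to Git: use a remote backend (S3/GCS/Azure Blob)
  2. Enable state locking: it prevents concurrent operations from corrupting state
  3. Use modular design: follow the DRY principle and reuse shared modules
  4. Pin provider versions: this avoids breaking changes from unexpected upgrades
  5. Review plans: run plan for every change and apply only after review
  6. Manage sensitive data: mark sensitive outputs with sensitive = true and do not store secrets in tfvars
  7. Standardize tags: manage resource tags centrally through default_tags
  8. Make small changes: avoid large one-off changes to reduce risk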
graph LR
    A["Write code"] --> B["terraform plan"]
    B --> C["Code review"]
    C --> D["terraform apply"]
    D --> E["Verify resources"]
    E --> F["Commit code"]

    style C fill:#FF9800,color:#fff
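
For the sensitive-data practice in the list above, the same `sensitive` flag used on outputs also applies to input variables. A minimal sketch with a hypothetical `db_password` variable:

```hcl
# Hypothetical variable; the value is redacted in plan and apply output
variable "db_password" {
  description = "Database master password"
  type        = string
  sensitive   = true
}
```

Rather than writing the value into `terraform.tfvars`, it can be supplied through the `TF_VAR_db_password` environment variable, for example from a CI secret. The flag only affects how the value is displayed: it is still written to the state file, which is another reason to encrypt and restrict access to the remote state.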

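Summary

Terraform manages infrastructure declaratively. Combined with modular design and remote state management, it makes infrastructure versioned, auditable, and reusable. In team settings, a CI/CD pipeline and a strict plan-review-apply workflow make it possible to manage complex multi-cloud infrastructure safely and efficiently.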

Author · zt
Published · 2024-06-26
Length · 3.0k words · 7 min
License · CC BY-SA 4.0