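Kubernetes Resource Management and Scheduling Policies

Preface

Resource management and scheduling policy are central to running a Kubernetes cluster reliably. Setting resource requests and limits sensibly, and understanding how the scheduler works, helps avoid wasted capacity and application failures. This article walks through these core concepts systematically.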

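Resource Model

Requests and Limits

Kubernetes manages container resources along two dimensions, requests and limits:

requests: the minimum resources guaranteed to the container; the scheduler uses this value to choose a node
limits: the maximum a container may use; exceeding it causes throttling (CPU), an OOM kill (memory), or eviction of the Pod (ephemeral-storage)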

apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: 250m       # 0.25 CPU core
          memory: 256Mi   # 256 MiB of memory
          ephemeral-storage: 1Gi
        limits:
          cpu: 500m       # 0.5 CPU core
          memory: 512Mi   # 512 MiB of memory
          ephemeral-storage: 2Gi

graph TB
    subgraph Node["Node (allocatable: 4 CPU, 8Gi memory)"]
        subgraph Allocatable["Allocatable resources"]
            subgraph Used["Allocated (Requests)"]
                P1["Pod A<br>req: 1CPU, 2Gi"]
                P2["Pod B<br>req: 0.5CPU, 1Gi"]
            end
            Free["Free: 2.5CPU, 5Gi<br>available for new Pods"]
        end
        Reserved["System reserved<br>kube-reserved + system-reserved"]
    end

    style Used fill:#FF9800,color:#fff
    style Free fill:#4CAF50,color:#fff
    style Reserved fill:#9E9E9E,color:#fff
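
The arithmetic behind the diagram can be reproduced in a few lines. The following is a minimal Python sketch rather than Kubernetes code: `parse_cpu`, `parse_memory` and `free_capacity` are hypothetical helper names, only the `m` and `Ki/Mi/Gi/Ti` suffixes are handled, and 4 CPU / 8Gi is treated as the node's allocatable capacity.

```python
from fractions import Fraction

# Binary suffixes used by Kubernetes memory quantities (subset only).
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_cpu(quantity: str) -> int:
    """Return CPU in millicores: '250m' -> 250, '0.5' -> 500, '2' -> 2000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(Fraction(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Return memory in bytes: '256Mi' -> 268435456; a bare integer is bytes."""
    for suffix, factor in _BINARY.items():
        if quantity.endswith(suffix):
            return int(Fraction(quantity[: -len(suffix)]) * factor)
    return int(quantity)


def free_capacity(allocatable: dict, pod_requests: list) -> dict:
    """Subtract the sum of Pod requests from the node's allocatable resources."""
    cpu = parse_cpu(allocatable["cpu"])
    memory = parse_memory(allocatable["memory"])
    for req in pod_requests:
        cpu -= parse_cpu(req["cpu"])
        memory -= parse_memory(req["memory"])
    return {"cpu_millicores": cpu, "memory_bytes": memory}


free = free_capacity(
    {"cpu": "4", "memory": "8Gi"},
    [{"cpu": "1", "memory": "2Gi"}, {"cpu": "500m", "memory": "1Gi"}],
)
print(free)  # {'cpu_millicores': 2500, 'memory_bytes': 5368709120}
```

The scheduler reasons about requests, not live usage, which is why the free figure depends only on what the Pods asked for.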

CPU Resource Details

CPU is measured in millicores: 1000m = 1 CPU core. CPU is a compressible resource, so a container that hits its limit is throttled rather than killed.

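# Show the Pod's actual CPU usage
kubectl top pod app-pod

# Check CPU throttling (cgroup v2; run inside the container)
cat /sys/fs/cgroup/cpu.stat
# usage_usec 12345678
# user_usec 10000000
# system_usec 2345678
# nr_periods 1000
# nr_throttled 150    # number of periods in which the container was throttled
# throttled_usec 500000  # total throttled time, in microseconds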
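
A convenient way to read these counters is as a ratio of throttled periods to elapsed periods. The sketch below is illustrative only: `parse_cpu_stat` and `throttle_ratio` are hypothetical helpers, and the sample text simply repeats the output shown above.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup v2 cpu.stat file into integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(stats: dict) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


SAMPLE = """\
usage_usec 12345678
user_usec 10000000
system_usec 2345678
nr_periods 1000
nr_throttled 150
throttled_usec 500000
"""

stats = parse_cpu_stat(SAMPLE)
print(f"{throttle_ratio(stats):.0%}")  # 15%
```

A ratio that stays high over time suggests the CPU limit is too tight for the workload.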

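Memory Resource Details

Memory is measured in bytes (the Ki/Mi/Gi suffixes are supported). Memory is an incompressible resource, so a container that exceeds its limit is OOM-killed.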

flowchart TD
    A["Container memory usage grows"] --> B{Exceeds limits?}
    B -->|Yes| C["OOM Kill<br>container terminated"]
    B -->|No| D{Node under memory pressure?}
    D -->|Yes| E{Pod QoS class?}
    D -->|No| F["Runs normally"]
    E -->|BestEffort| G["Evicted first"]
    E -->|Burstable| H["Evicted next"]
    E -->|Guaranteed| I["Evicted last"]
    C --> J["restartPolicy decides whether to restart"]

QoS Classes

Kubernetes automatically assigns each Pod a QoS (Quality of Service) class based on how its requests and limits are configured:

Guaranteed

Every container sets both requests and limits for CPU and memory, and each request equals its limit.

apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m       # limits == requests
          memory: 256Mi   # limits == requests

Burstable

At least one container sets a request or a limit, but the Pod does not meet the Guaranteed criteria.

apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: 250m
          memory: 128Mi
        limits:
          cpu: 500m       # limits > requests
          memory: 512Mi

BestEffort

No container sets any requests or limits.

apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      # no resources set
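
The three definitions can be condensed into one small function. This is a sketch of the classification rules as described in this section, not the kubelet's implementation: `qos_class` is a hypothetical name, containers are plain dicts, and quantities are compared as strings, so `500m` and `0.5` would be treated as different values.

```python
QOS_RESOURCES = ("cpu", "memory")  # only these two resources affect the QoS class


def qos_class(containers: list) -> str:
    """Classify a Pod from its containers' requests and limits."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in QOS_RESOURCES:
            req, lim = requests.get(resource), limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # A missing request defaults to the limit, so only the limit is
            # mandatory; an explicit request must equal the limit.
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


guaranteed_pod = [{"requests": {"cpu": "500m", "memory": "256Mi"},
                   "limits": {"cpu": "500m", "memory": "256Mi"}}]
burstable_pod = [{"requests": {"cpu": "250m", "memory": "128Mi"},
                  "limits": {"cpu": "500m", "memory": "512Mi"}}]
besteffort_pod = [{}]

print(qos_class(guaranteed_pod))  # Guaranteed
print(qos_class(burstable_pod))   # Burstable
print(qos_class(besteffort_pod))  # BestEffort
```

Note that a single sidecar without limits is enough to move an otherwise Guaranteed Pod to Burstable.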

graph TB
    subgraph QoS["QoS classes (eviction order)"]
        direction TB
        BE["BestEffort<br>evicted first<br>oom_score_adj=1000"]
        BU["Burstable<br>evicted next<br>oom_score_adj=2~999"]
        GU["Guaranteed<br>evicted last<br>oom_score_adj=-997"]
    end

    BE --> BU --> GU

    style BE fill:#f44336,color:#fff
    style BU fill:#FF9800,color:#fff
    style GU fill:#4CAF50,color:#fff

LimitRange

A LimitRange sets default values and constraints on resource usage at the namespace level:

apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: production
spec:
  limits:
    # Container-level constraints
    - type: Container
      default:          # default limits
        cpu: 500m
        memory: 512Mi
      defaultRequest:   # default requests
        cpu: 125m       # 500m / 125m = 4, within maxLimitRequestRatio
        memory: 128Mi
      max:              # maximum limits
        cpu: 2
        memory: 2Gi
      min:              # minimum requests
        cpu: 50m
        memory: 64Mi
      maxLimitRequestRatio:  # maximum limits/requests ratio
        cpu: 4
        memory: 4
    # Pod-level constraints
    - type: Pod
      max:
        cpu: 4
        memory: 4Gi
    # PVC-level constraints
    - type: PersistentVolumeClaim
      min:
        storage: 1Gi
      max:
        storage: 100Gi
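
To make the container-level constraints concrete, the sketch below checks one container's CPU values against a `min`, `max` and `maxLimitRequestRatio` rule. It is a simplified illustration, not the admission controller: `cpu_violations` is a hypothetical helper, values are in millicores, and defaulting of missing values is not modelled.

```python
# CPU rule mirroring the Container entry above, in millicores.
CPU_RULE = {"min": 50, "max": 2000, "max_ratio": 4}


def cpu_violations(request: int, limit: int, rule: dict) -> list:
    """Return the list of constraints that a request/limit pair breaks."""
    problems = []
    if request < rule["min"]:
        problems.append("request below min")
    if limit > rule["max"]:
        problems.append("limit above max")
    if request > 0 and limit / request > rule["max_ratio"]:
        problems.append("limit/request ratio above maxLimitRequestRatio")
    return problems


print(cpu_violations(250, 500, CPU_RULE))   # []
print(cpu_violations(100, 500, CPU_RULE))   # ratio 5 exceeds 4
print(cpu_violations(20, 4000, CPU_RULE))   # all three constraints broken
```

The second call shows why default values matter: a default limit of 500m paired with a default request of 100m would itself exceed a ratio of 4.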

ResourceQuota

A ResourceQuota caps the total resource consumption of a namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    # Compute resources
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    # Object counts
    pods: "50"
    services: "20"
    services.loadbalancers: "2"
    persistentvolumeclaims: "20"
    configmaps: "50"
    secrets: "50"
    # Storage
    requests.storage: 500Gi

---
# Quota scoped by PriorityClass (applies only to Pods that use the "high" PriorityClass)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: guaranteed-quota
  namespace: production
spec:
  hard:
    pods: "10"
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high"]

# Check quota usage
kubectl describe resourcequota compute-quota -n production
# Name:             compute-quota
# Namespace:        production
# Resource          Used    Hard
# --------          ----    ----
# limits.cpu        12      40
# limits.memory     24Gi    80Gi
# pods              15      50
# requests.cpu      6       20
# requests.memory   12Gi    40Gi
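
The Used and Hard columns are what quota admission compares: a new object is rejected if it would push any tracked resource past its hard value. A minimal sketch of that comparison, using the CPU and Pod figures from the output above (the `exceeded` helper is hypothetical, and CPU is expressed in whole cores):

```python
def exceeded(hard: dict, used: dict, delta: dict) -> list:
    """Return the quota entries that would be exceeded after adding delta."""
    return sorted(
        name
        for name, limit in hard.items()
        if used.get(name, 0) + delta.get(name, 0) > limit
    )


HARD = {"requests.cpu": 20, "limits.cpu": 40, "pods": 50}
USED = {"requests.cpu": 6, "limits.cpu": 12, "pods": 15}

# Three replicas, each requesting 1 CPU with a 2 CPU limit: still within quota.
print(exceeded(HARD, USED, {"requests.cpu": 3, "limits.cpu": 6, "pods": 3}))  # []

# A workload requesting 15 more CPUs would exceed requests.cpu (6 + 15 > 20).
print(exceeded(HARD, USED, {"requests.cpu": 15, "limits.cpu": 15, "pods": 5}))  # ['requests.cpu']
```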

Scheduler Algorithm

Scheduling Flow

sequenceDiagram
    participant Queue as Scheduling queue
    participant Sched as Scheduler
    participant Filter as Filter
    participant Score as Score
    participant Bind as Bind
    participant API as API Server

    Queue->>Sched: Pop a pending Pod
    Sched->>Filter: Filter out infeasible nodes
    Note over Filter: NodeResourcesFit<br>NodeAffinity<br>PodTopologySpread<br>TaintToleration
    Filter-->>Sched: Candidate nodes
    Sched->>Score: Score candidate nodes
    Note over Score: NodeResourcesFit (LeastAllocated)<br>NodeResourcesBalancedAllocation<br>ImageLocality
    Score-->>Sched: Ranked scores
    Sched->>Bind: Pick the highest-scoring node
    Bind->>API: Set Pod.spec.nodeName
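
The Filter and Score steps can be sketched as two small functions. This is a toy model under stated assumptions, not the kube-scheduler: nodes and Pods are plain dicts, CPU is in millicores and memory in MiB, only a resource-fit filter is applied, and the score is a simplified version of the least-allocated idea (more free capacity after placement gives a higher score).

```python
RESOURCES = ("cpu", "memory")  # cpu in millicores, memory in MiB


def fits(node: dict, pod: dict) -> bool:
    """Filter: the node must have room for the Pod's requests."""
    return all(
        node["requested"][r] + pod[r] <= node["allocatable"][r] for r in RESOURCES
    )


def least_allocated_score(node: dict, pod: dict) -> float:
    """Score: mean percentage of each resource left free after placement."""
    shares = [
        (node["allocatable"][r] - node["requested"][r] - pod[r])
        * 100
        / node["allocatable"][r]
        for r in RESOURCES
    ]
    return sum(shares) / len(shares)


def schedule(nodes: list, pod: dict):
    """Return the name of the best feasible node, or None if none fits."""
    feasible = [n for n in nodes if fits(n, pod)]
    if not feasible:
        return None
    return max(feasible, key=lambda n: least_allocated_score(n, pod))["name"]


NODES = [
    {"name": "node-a", "allocatable": {"cpu": 4000, "memory": 8192},
     "requested": {"cpu": 1500, "memory": 3072}},
    {"name": "node-b", "allocatable": {"cpu": 4000, "memory": 8192},
     "requested": {"cpu": 3000, "memory": 6144}},
    {"name": "node-c", "allocatable": {"cpu": 2000, "memory": 4096},
     "requested": {"cpu": 1900, "memory": 1024}},
]
POD = {"cpu": 500, "memory": 512}

print(schedule(NODES, POD))  # node-a
```

Here node-c is filtered out for lack of CPU, and node-a wins the scoring round because it keeps the most headroom.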

Scheduling Plugin Extension Points

graph LR
    QSort["QueueSort"] --> PreFilter["PreFilter"]
    PreFilter --> Filter["Filter"]
    Filter --> PostFilter["PostFilter"]
    PostFilter --> PreScore["PreScore"]
    PreScore --> Score["Score"]
    Score --> NormalizeScore["NormalizeScore"]
    NormalizeScore --> Reserve["Reserve"]
    Reserve --> Permit["Permit"]
    Permit --> PreBind["PreBind"]
    PreBind --> Bind["Bind"]
    Bind --> PostBind["PostBind"]

Node Affinity

apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  affinity:
    # Node affinity
    nodeAffinity:
      # Hard requirement: must be satisfied
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64"]
              - key: node-type
                operator: In
                values: ["compute", "gpu"]
      # Soft preference: satisfied when possible
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["zone-a"]
        - weight: 20
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["zone-b"]
  containers:
    - name: app
      image: myapp:1.0
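
The evaluation rules behind this manifest are easy to misread: the entries in `nodeSelectorTerms` are ORed, the `matchExpressions` inside one term are ANDed, and the weights of all matching preferences are summed. A minimal sketch of those rules follows; the function names are hypothetical, and only the `In`, `NotIn`, `Exists` and `DoesNotExist` operators are covered.

```python
def match_expr(labels: dict, expr: dict) -> bool:
    """Evaluate a single matchExpressions entry against node labels."""
    key, op = expr["key"], expr["operator"]
    if op == "In":
        return labels.get(key) in expr["values"]
    if op == "NotIn":
        return labels.get(key) not in expr["values"]
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def term_matches(labels: dict, term: dict) -> bool:
    """All expressions within one term must match (AND)."""
    return all(match_expr(labels, e) for e in term["matchExpressions"])


def required_ok(labels: dict, terms: list) -> bool:
    """At least one nodeSelectorTerm must match (OR)."""
    return any(term_matches(labels, t) for t in terms)


def preferred_score(labels: dict, preferred: list) -> int:
    """Sum the weights of every matching preference."""
    return sum(p["weight"] for p in preferred if term_matches(labels, p["preference"]))


REQUIRED = [{"matchExpressions": [
    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
    {"key": "node-type", "operator": "In", "values": ["compute", "gpu"]},
]}]
PREFERRED = [
    {"weight": 80, "preference": {"matchExpressions": [
        {"key": "zone", "operator": "In", "values": ["zone-a"]}]}},
    {"weight": 20, "preference": {"matchExpressions": [
        {"key": "zone", "operator": "In", "values": ["zone-b"]}]}},
]

node = {"kubernetes.io/arch": "amd64", "node-type": "gpu", "zone": "zone-b"}
print(required_ok(node, REQUIRED), preferred_score(node, PREFERRED))  # True 20
```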

Pod Affinity and Anti-Affinity

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        # Pod affinity: schedule onto the same node as a cache Pod
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["cache"]
              topologyKey: kubernetes.io/hostname
        # Pod anti-affinity: spread web Pods across different nodes
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["web"]
                topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.25

Taints and Tolerations

A taint marks a node so that it repels Pods that do not tolerate it; a toleration allows a Pod to be scheduled onto a node with a matching taint.

# Add taints
kubectl taint nodes node-gpu dedicated=gpu:NoSchedule
kubectl taint nodes node-1 maintenance=true:NoExecute

# View taints
kubectl describe node node-gpu | grep Taints

# Remove a taint
kubectl taint nodes node-gpu dedicated=gpu:NoSchedule-

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
    # Exact match on key and value
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    # Match any value of this key
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300  # tolerate for 300 seconds, then be evicted
  containers:
    - name: gpu-app
      image: gpu-app:1.0
      resources:
        limits:
          nvidia.com/gpu: 1

graph TB
    subgraph Scheduling["Scheduling decision"]
        Pod1["Pod A<br>no toleration"] --> |"rejected"| TaintNode["Node-GPU<br>Taint: dedicated=gpu:NoSchedule"]
        Pod2["Pod B<br>Toleration: dedicated=gpu"] --> |"allowed"| TaintNode
        Pod3["Pod C<br>no toleration"] --> |"allowed"| NormalNode["Normal Node<br>no taint"]
    end

    style TaintNode fill:#f44336,color:#fff
    style NormalNode fill:#4CAF50,color:#fff

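Taint effects:

Effect	Behavior
`NoSchedule`	New Pods without a matching toleration are not scheduled onto the node
`PreferNoSchedule`	The scheduler tries to avoid the node; a soft constraint
`NoExecute`	No new Pods are scheduled, and running Pods without a matching toleration are evicted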
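
The matching rule between a toleration and a taint can be written down compactly. The sketch below follows the rule as described in this section (an empty effect or key on the toleration acts as a wildcard, `Exists` ignores the value, `Equal` compares it); `tolerates` and `can_schedule` are hypothetical names, and `PreferNoSchedule` is ignored because it is only a soft constraint.

```python
HARD_EFFECTS = ("NoSchedule", "NoExecute")


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if this toleration matches this taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True  # Exists matches any value
    return toleration.get("value", "") == taint.get("value", "")


def can_schedule(taints: list, tolerations: list) -> bool:
    """Every hard taint on the node must be tolerated by the Pod."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] in HARD_EFFECTS
    )


GPU_TAINT = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
GPU_TOLERATION = {"key": "dedicated", "operator": "Equal",
                  "value": "gpu", "effect": "NoSchedule"}

print(can_schedule([GPU_TAINT], []))                # False (Pod A)
print(can_schedule([GPU_TAINT], [GPU_TOLERATION]))  # True  (Pod B)
print(can_schedule([], []))                         # True  (Pod C)
```

A toleration only permits placement on the tainted node; it does not attract the Pod there, which is why dedicated nodes are usually combined with node affinity.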

Priority and Preemption

When cluster resources are insufficient, a higher-priority Pod can preempt lower-priority Pods to free up resources.

# Define PriorityClasses
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For business-critical applications"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard
value: 100
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "Default priority"

---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 10
globalDefault: false
preemptionPolicy: Never  # never preempts; waits in the queue
description: "Batch jobs"

# Use a PriorityClass
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      priorityClassName: critical
      containers:
        - name: app
          image: critical-app:1.0
          resources:
            requests:
              cpu: 1
              memory: 1Gi
            limits:
              cpu: 2
              memory: 2Gi

sequenceDiagram
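    participant HP as High-priority Pod
    participant Sched as Scheduler
    participant Node as Node
    participant LP as Low-priority Pod

    HP->>Sched: Requests scheduling
    Sched->>Node: Check available resources
    Node-->>Sched: Insufficient resources
    Sched->>Sched: Find Pods that can be preempted
    Sched->>LP: Evict the victim Pod
    LP->>LP: Graceful termination
    LP-->>Node: Release resources
    Sched->>Node: Schedule the high-priority Pod
    Node-->>HP: Pod running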
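
The "find Pods that can be preempted" step can be illustrated with a deliberately simplified model. The sketch below considers a single node and CPU only, and ignores PodDisruptionBudgets, affinity and graceful termination; `pick_victims` is a hypothetical function. It assumes one plausible strategy: treat every lower-priority Pod as removable, then keep as many as possible, starting with the highest priority.

```python
def pick_victims(free_cpu: int, running: list, incoming: dict):
    """Return names of Pods to evict so `incoming` fits, or None if impossible.

    CPU values are in millicores. Only Pods with a strictly lower priority
    than the incoming Pod are candidates.
    """
    if incoming["preemption_policy"] == "Never":
        return None  # this Pod waits in the queue instead of preempting
    lower = [p for p in running if p["priority"] < incoming["priority"]]
    available = free_cpu + sum(p["cpu"] for p in lower)
    if available < incoming["cpu"]:
        return None  # even evicting every candidate would not free enough CPU
    victims = []
    # Try to spare candidates, highest priority first.
    for pod in sorted(lower, key=lambda p: p["priority"], reverse=True):
        if available - pod["cpu"] >= incoming["cpu"]:
            available -= pod["cpu"]  # the Pod can stay
        else:
            victims.append(pod["name"])
    return victims


RUNNING = [
    {"name": "batch-1", "priority": 10, "cpu": 1000},
    {"name": "batch-2", "priority": 10, "cpu": 500},
    {"name": "web", "priority": 100, "cpu": 500},
    {"name": "critical-db", "priority": 1000000, "cpu": 1800},
]
INCOMING = {"priority": 1000000, "cpu": 1000,
            "preemption_policy": "PreemptLowerPriority"}

print(pick_victims(200, RUNNING, INCOMING))  # ['batch-1']
```

In this run, evicting batch-1 alone frees enough CPU (200m + 1000m), so web and batch-2 are spared, and critical-db is never a candidate because its priority is not lower than the incoming Pod's.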

Topology Spread Constraints

Topology spread constraints keep Pods evenly distributed across topology domains such as zones and nodes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: balanced-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: balanced-app
  template:
    metadata:
      labels:
        app: balanced-app
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: balanced-app
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: balanced-app
      containers:
        - name: app
          image: myapp:1.0
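
The effect of `maxSkew` is easiest to see with numbers. Skew for a domain is the number of matching Pods it would hold after placement minus the smallest count in any domain; with `DoNotSchedule`, a domain is allowed only if that skew stays within `maxSkew`. A minimal sketch, assuming the per-zone counts are already known (`allowed_domains` is a hypothetical helper):

```python
def allowed_domains(counts: dict, max_skew: int) -> list:
    """Return the domains where one more Pod keeps skew within max_skew."""
    lowest = min(counts.values())
    return sorted(
        domain for domain, count in counts.items() if count + 1 - lowest <= max_skew
    )


# Five of six replicas are placed: only zone-c can take the sixth.
print(allowed_domains({"zone-a": 2, "zone-b": 2, "zone-c": 1}, max_skew=1))
# ['zone-c']

# With an even spread, any zone is acceptable.
print(allowed_domains({"zone-a": 2, "zone-b": 2, "zone-c": 2}, max_skew=1))
# ['zone-a', 'zone-b', 'zone-c']
```

With `ScheduleAnyway`, the same skew would only lower a node's score instead of excluding it.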

Resource Management Best Practices

# Recommended configuration example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: production-app
  template:
    metadata:
      labels:
        app: production-app
    spec:
      # Priority
      priorityClassName: standard
      # Topology spread
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: production-app
      # Anti-affinity keeps Pods spread across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: production-app
                topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: myapp:1.0
          resources:
            # Set from actual monitoring data
            requests:
              cpu: 200m     # P50 usage
              memory: 256Mi # steady-state usage
            limits:
              cpu: 1        # burst ceiling
              memory: 512Mi # includes safety headroom
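
Deriving such values from monitoring data might look like the sketch below. The rule used here is an assumption for illustration, not a recommendation from this article: the request is the median (P50) of observed CPU usage, and the limit is the observed peak plus a percentage of headroom. `suggest_cpu` and the sample values are hypothetical.

```python
import statistics


def suggest_cpu(samples_millicores: list, headroom_pct: int = 120) -> tuple:
    """Return (request, limit) in millicores from observed usage samples.

    request: median (P50) usage.
    limit: peak usage scaled by headroom_pct (120 means 20% headroom).
    """
    request = int(statistics.median(samples_millicores))
    limit = max(samples_millicores) * headroom_pct // 100
    return request, limit


# Hypothetical usage samples, e.g. exported from a metrics system.
samples = [150, 180, 200, 220, 800]
print(suggest_cpu(samples))  # (200, 960)
```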

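Summary

Key points of Kubernetes resource management:

Always set requests and limits, and aim for a QoS class of Guaranteed or Burstable
Base requests on actual monitoring data to avoid over-requesting and wasting resources
Use LimitRange and ResourceQuota for namespace-level resource governance
Use affinity and topology spread constraints to distribute Pods and improve availability
Give critical workloads a high priority so that they are the last candidates for preemption when resources are tight
Review resource configuration regularly, using Prometheus and Grafana to track usage trends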