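Preface
Resource management and scheduling policy are central to running a Kubernetes cluster reliably. Configuring resource requests and limits sensibly, and understanding how the scheduler works, helps avoid wasted resources and application failures. This article walks through these core concepts systematically.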
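Resource Model
Requests and Limits
Kubernetes manages container resources along two dimensions, requests and limits:
requests: the minimum amount of a resource guaranteed to the container; the scheduler picks a node based on this value
limits: the maximum amount of a resource the container may use; exceeding it leads to throttling (CPU) or an OOM kill (memory)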
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: 250m
          memory: 256Mi
          ephemeral-storage: 1Gi
        limits:
          cpu: 500m
          memory: 512Mi
          ephemeral-storage: 2Gi
```
```mermaid
graph TB
    subgraph Node["Node (4 CPU, 8Gi allocatable)"]
        subgraph Allocatable["Allocatable resources"]
            subgraph Used["Allocated (requests)"]
                P1["Pod A<br>req: 1 CPU, 2Gi"]
                P2["Pod B<br>req: 0.5 CPU, 1Gi"]
            end
            Free["Free: 2.5 CPU, 5Gi<br>available for new Pods"]
        end
        Reserved["System reserved<br>kube-reserved + system-reserved"]
    end
    style Used fill:#FF9800,color:#fff
    style Free fill:#4CAF50,color:#fff
    style Reserved fill:#9E9E9E,color:#fff
```
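CPU Details
CPU is measured in millicores: 1000m = 1 CPU core. CPU is a compressible resource, so a container that exceeds its limit is throttled rather than killed.
```bash
# Show the Pod's current CPU and memory usage (requires metrics-server)
kubectl top pod app-pod

# Inside the container: throttling counters for its cgroup (cgroup v2 path)
cat /sys/fs/cgroup/cpu.stat
```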
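To make the throttling behaviour concrete, here is a minimal Python sketch. It assumes the default CFS period of 100 ms and the cgroup v2 field names `nr_periods` and `nr_throttled`. The helper names and the sample `cpu.stat` contents are made up for illustration and were not captured from a real container.

```python
def cpu_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time (microseconds) a container may use per CFS period for a given limit."""
    return limit_millicores * period_us // 1000


def throttle_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.strip().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


# Hypothetical cpu.stat contents (illustrative values only).
SAMPLE = """\
usage_usec 1200000
nr_periods 200
nr_throttled 50
throttled_usec 340000
"""

print(cpu_quota_us(500))       # 50000: a 500m limit allows 50 ms of CPU per 100 ms
print(throttle_ratio(SAMPLE))  # 0.25: throttled in a quarter of the periods
```

A ratio that stays high over time usually suggests the CPU limit is too tight for the workload.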
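Memory Details
Memory is measured in bytes (the Ki/Mi/Gi suffixes are supported). Memory is an incompressible resource, so a container that exceeds its limit is OOM-killed.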
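As a small illustration of the units, the sketch below converts a quantity string to bytes and compares usage against a limit. It only handles plain integers and the binary suffixes mentioned above, which is a subset of what Kubernetes accepts. The function names are hypothetical.

```python
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def memory_bytes(quantity: str) -> int:
    """Convert a quantity such as '256Mi' (or a plain byte count) to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def over_limit(usage_bytes: int, limit: str) -> bool:
    """True when usage is above the limit, the condition that leads to an OOM kill."""
    return usage_bytes > memory_bytes(limit)


print(memory_bytes("512Mi"))                     # 536870912
print(over_limit(600 * 1024 * 1024, "512Mi"))    # True
```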
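```mermaid
flowchart TD
    A["Container memory usage grows"] --> B{Above limits?}
    B -->|Yes| C["OOM kill<br>container is terminated"]
    B -->|No| D{Is node memory under pressure?}
    D -->|Yes| E{Pod QoS class?}
    D -->|No| F["Runs normally"]
    E -->|BestEffort| G["Evicted first"]
    E -->|Burstable| H["Evicted next"]
    E -->|Guaranteed| I["Evicted last"]
    C --> J["restartPolicy decides whether it restarts"]
```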
QoS Classes
Kubernetes assigns each Pod a QoS (Quality of Service) class automatically, based on how its requests and limits are configured:
Guaranteed
Every container sets both requests and limits for CPU and memory, and each request equals its limit.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: 500m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 256Mi
```
Burstable
At least one container sets a request or a limit, but the Pod does not meet the Guaranteed criteria.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: burstable-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: 250m
          memory: 128Mi
        limits:
          cpu: 500m
          memory: 512Mi
```
BestEffort
No container sets any CPU or memory requests or limits.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: besteffort-pod
spec:
  containers:
    - name: app
      image: myapp:1.0
```
```mermaid
graph TB
    subgraph QoS["QoS classes (eviction order)"]
        direction TB
        BE["BestEffort<br>evicted first<br>oom_score_adj=1000"]
        BU["Burstable<br>evicted next<br>oom_score_adj=2~999"]
        GU["Guaranteed<br>evicted last<br>oom_score_adj=-997"]
    end
    BE --> BU --> GU
    style BE fill:#f44336,color:#fff
    style BU fill:#FF9800,color:#fff
    style GU fill:#4CAF50,color:#fff
```
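The three rules above can be sketched as a small classifier. This is a simplified illustration, not the kubelet's implementation: the `qos_class` name and the dict layout are assumptions, and quantities are compared as written, so "1" and "1000m" would count as different.

```python
def qos_class(containers):
    """Classify a Pod from a list of {"requests": {...}, "limits": {...}} dicts."""
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A request that is not set defaults to the limit for that resource.
        requests = {**limits, **c.get("requests", {})}
        for res in ("cpu", "memory"):
            req, lim = requests.get(res), limits.get(res)
            if req is not None or lim is not None:
                any_set = True
            if lim is None or req != lim:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


same = {"cpu": "500m", "memory": "256Mi"}
print(qos_class([{"requests": same, "limits": same}]))                 # Guaranteed
print(qos_class([{"requests": {"cpu": "250m", "memory": "128Mi"},
                  "limits": {"cpu": "500m", "memory": "512Mi"}}]))     # Burstable
print(qos_class([{}]))                                                 # BestEffort
```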
LimitRange
A LimitRange sets default values and constraints for resource usage at the namespace level:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: resource-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:             # limits applied when a container sets none
        cpu: 500m
        memory: 512Mi
      defaultRequest:      # requests applied when a container sets none
        cpu: 100m
        memory: 128Mi
      max:
        cpu: 2
        memory: 2Gi
      min:
        cpu: 50m
        memory: 64Mi
      maxLimitRequestRatio:
        cpu: 5             # must cover default/defaultRequest (500m/100m = 5)
        memory: 4
    - type: Pod
      max:
        cpu: 4
        memory: 4Gi
    - type: PersistentVolumeClaim
      min:
        storage: 1Gi
      max:
        storage: 100Gi
```
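The way defaults and constraints interact is easy to get wrong, so here is a rough Python sketch of the check for a single container's CPU. It is a simplification under stated assumptions: values are plain millicores, `check_cpu` and the `LR` dict are hypothetical, and the numbers are illustrative (note the ratio cap of 4, which the defaults themselves violate).

```python
def check_cpu(request_m, limit_m, lr):
    """Apply defaults, then validate. Returns (request, limit, errors), CPU in millicores."""
    if limit_m is None:
        limit_m = lr["default"]
        if request_m is None:
            request_m = lr["defaultRequest"]
    elif request_m is None:
        request_m = limit_m  # a lone limit is copied into the request
    errors = []
    if request_m < lr["min"]:
        errors.append("request below min")
    if limit_m > lr["max"]:
        errors.append("limit above max")
    if request_m > limit_m:
        errors.append("request above limit")
    elif request_m > 0 and limit_m / request_m > lr["maxLimitRequestRatio"]:
        errors.append("limit/request ratio too high")
    return request_m, limit_m, errors


# Hypothetical constraints, in millicores.
LR = {"default": 500, "defaultRequest": 100, "min": 50, "max": 2000,
      "maxLimitRequestRatio": 4}

print(check_cpu(250, 500, LR))    # (250, 500, [])
print(check_cpu(None, None, LR))  # (100, 500, ['limit/request ratio too high'])
print(check_cpu(700, None, LR))   # (700, 500, ['request above limit'])
```

The last two calls show why the default values should be checked against the other constraints: a container that sets nothing, or only a request, can end up rejected by the defaults it inherits.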
ResourceQuota
A ResourceQuota caps the total resource consumption of a namespace:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
    services: "20"
    services.loadbalancers: "2"
    persistentvolumeclaims: "20"
    configmaps: "50"
    secrets: "50"
    requests.storage: 500Gi
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: guaranteed-quota
  namespace: production
spec:
  hard:
    pods: "10"
  scopeSelector:           # counts only Pods whose PriorityClass is "high"
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values: ["high"]
```
```bash
# Show the quota's hard limits next to current usage
kubectl describe resourcequota compute-quota -n production
```
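Conceptually, quota enforcement adds a new object's usage to what the namespace already consumes and rejects the object if any hard cap would be exceeded. A minimal sketch follows. The function name and the usage figures are assumptions, with CPU expressed in millicores.

```python
def quota_violations(hard, used, new_pod):
    """Return the quota keys that admitting new_pod would push over the hard cap."""
    return sorted(
        key for key, cap in hard.items()
        if used.get(key, 0) + new_pod.get(key, 0) > cap
    )


hard = {"requests.cpu": 20_000, "limits.cpu": 40_000, "pods": 50}
used = {"requests.cpu": 19_800, "limits.cpu": 30_000, "pods": 12}  # hypothetical usage
pod = {"requests.cpu": 250, "limits.cpu": 500, "pods": 1}

print(quota_violations(hard, used, pod))  # ['requests.cpu']
```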
Scheduler Algorithm
Scheduling Flow
```mermaid
sequenceDiagram
    participant Queue as Scheduling queue
    participant Sched as Scheduler
    participant Filter as Filter (predicates)
    participant Score as Score (priorities)
    participant Bind as Bind
    participant API as API Server
    Queue->>Sched: Pop a pending Pod
    Sched->>Filter: Filter out nodes that cannot run it
    Note over Filter: NodeResourcesFit<br>NodeAffinity<br>PodTopologySpread<br>TaintToleration
    Filter-->>Sched: List of candidate nodes
    Sched->>Score: Score the candidate nodes
    Note over Score: NodeResourcesFit (LeastAllocated)<br>NodeResourcesBalancedAllocation<br>ImageLocality
    Score-->>Sched: Ranked scores
    Sched->>Bind: Pick the highest-scoring node
    Bind->>API: Update Pod.spec.nodeName
```
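The filter-then-score flow can be sketched in a few lines of Python. This is a simplified model, not the scheduler's code: it looks only at CPU and memory requests, scores nodes by the average percentage left free after placement (in the spirit of least-allocated scoring), and uses made-up node data and names.

```python
def schedule(pod, nodes):
    """Return the name of the chosen node, or None if no node fits."""
    # Filter: keep nodes whose remaining allocatable covers the Pod's requests.
    feasible = [
        n for n in nodes
        if all(n["requested"][r] + pod[r] <= n["allocatable"][r] for r in pod)
    ]
    if not feasible:
        return None

    # Score: average percentage of each resource still free after placement.
    def score(n):
        free = [
            (n["allocatable"][r] - n["requested"][r] - pod[r]) * 100 / n["allocatable"][r]
            for r in pod
        ]
        return sum(free) / len(free)

    return max(feasible, key=score)["name"]


# Hypothetical cluster state: CPU in millicores, memory in MiB.
NODES = [
    {"name": "node-a", "allocatable": {"cpu": 4000, "memory": 8192},
     "requested": {"cpu": 3000, "memory": 6144}},
    {"name": "node-b", "allocatable": {"cpu": 4000, "memory": 8192},
     "requested": {"cpu": 1500, "memory": 3072}},
    {"name": "node-c", "allocatable": {"cpu": 2000, "memory": 4096},
     "requested": {"cpu": 1800, "memory": 1024}},
]

print(schedule({"cpu": 500, "memory": 1024}, NODES))  # node-b
```

Here node-c is filtered out because 1800m + 500m exceeds its 2000m, and node-b wins the scoring step because it is left 50% free while node-a would be left 12.5% free.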
Scheduler Plugin Extension Points
```mermaid
graph LR
    QSort["QueueSort"] --> PreFilter["PreFilter"]
    PreFilter --> Filter["Filter"]
    Filter --> PostFilter["PostFilter<br>(runs only if no node passes Filter)"]
    PostFilter --> PreScore["PreScore"]
    PreScore --> Score["Score"]
    Score --> NormalizeScore["NormalizeScore"]
    NormalizeScore --> Reserve["Reserve"]
    Reserve --> Permit["Permit"]
    Permit --> PreBind["PreBind"]
    PreBind --> Bind["Bind"]
    Bind --> PostBind["PostBind"]
```
Node Affinity
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: affinity-pod
spec:
  affinity:
    nodeAffinity:
      # Hard requirement: the node must match
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/arch
                operator: In
                values: ["amd64"]
              - key: node-type
                operator: In
                values: ["compute", "gpu"]
      # Soft preference: matching nodes score higher
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 80
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["zone-a"]
        - weight: 20
          preference:
            matchExpressions:
              - key: zone
                operator: In
                values: ["zone-b"]
  containers:
    - name: app
      image: myapp:1.0
```
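To show how such rules are evaluated, the sketch below matches node labels against selector terms: terms are ORed, the expressions inside one term are ANDed, and the weights of matching preferences are summed. It covers only the `In`, `NotIn`, `Exists` and `DoesNotExist` operators. The function names and the node labels are assumptions for illustration.

```python
def expr_matches(labels, expr):
    """Evaluate one matchExpressions entry against a node's labels."""
    key, op, values = expr["key"], expr["operator"], expr.get("values", [])
    if op == "In":
        return labels.get(key) in values
    if op == "NotIn":
        return labels.get(key) not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def required_ok(labels, terms):
    """nodeSelectorTerms are ORed; expressions inside one term are ANDed."""
    return any(
        all(expr_matches(labels, e) for e in term["matchExpressions"])
        for term in terms
    )


def preferred_score(labels, preferences):
    """Sum the weights of every preference the node satisfies."""
    return sum(
        p["weight"] for p in preferences
        if all(expr_matches(labels, e) for e in p["preference"]["matchExpressions"])
    )


REQUIRED = [{"matchExpressions": [
    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
    {"key": "node-type", "operator": "In", "values": ["compute", "gpu"]},
]}]
PREFERRED = [
    {"weight": 80, "preference": {"matchExpressions": [
        {"key": "zone", "operator": "In", "values": ["zone-a"]}]}},
    {"weight": 20, "preference": {"matchExpressions": [
        {"key": "zone", "operator": "In", "values": ["zone-b"]}]}},
]

node = {"kubernetes.io/arch": "amd64", "node-type": "gpu", "zone": "zone-a"}
print(required_ok(node, REQUIRED))       # True
print(preferred_score(node, PREFERRED))  # 80
```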
Pod Affinity and Anti-Affinity
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      affinity:
        # Must run on a node that already hosts a cache Pod
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values: ["cache"]
              topologyKey: kubernetes.io/hostname
        # Prefer not to share a node with other web Pods
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values: ["web"]
                topologyKey: kubernetes.io/hostname
      containers:
        - name: web
          image: nginx:1.25
```
Taints and Tolerations
A taint marks a node so that it repels Pods that do not tolerate it; a toleration allows a Pod to be scheduled onto a node carrying a matching taint.
```bash
# Add a taint: Pods without a matching toleration are not scheduled here
kubectl taint nodes node-gpu dedicated=gpu:NoSchedule
# NoExecute also evicts running Pods that do not tolerate the taint
kubectl taint nodes node-1 maintenance=true:NoExecute
# Show the taints on a node
kubectl describe node node-gpu | grep Taints
# Remove a taint (note the trailing "-")
kubectl taint nodes node-gpu dedicated=gpu:NoSchedule-
```
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
    - key: "node.kubernetes.io/not-ready"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 300
  containers:
    - name: gpu-app
      image: gpu-app:1.0
      resources:
        limits:
          nvidia.com/gpu: 1
```
```mermaid
graph TB
    subgraph Scheduling["Scheduling decision"]
        Pod1["Pod A<br>no toleration"] --> |"rejected"| TaintNode["Node-GPU<br>Taint: dedicated=gpu:NoSchedule"]
        Pod2["Pod B<br>Toleration: dedicated=gpu"] --> |"allowed"| TaintNode
        Pod3["Pod C<br>no toleration"] --> |"allowed"| NormalNode["Normal Node<br>no taint"]
    end
    style TaintNode fill:#f44336,color:#fff
    style NormalNode fill:#4CAF50,color:#fff
```
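Taint effects:
NoSchedule: new Pods are not scheduled onto the node
PreferNoSchedule: the scheduler tries to avoid the node, a soft restriction
NoExecute: new Pods are not scheduled, and existing Pods that do not tolerate the taint are evicted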
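The matching rule between a toleration and a taint can be sketched as follows. This is an illustrative simplification: the function names are hypothetical, an empty `key` or `effect` on a toleration is treated as a wildcard, and `PreferNoSchedule` taints are ignored because they never block scheduling outright.

```python
def tolerates(toleration, taint):
    """True if one toleration matches one taint."""
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    if toleration.get("key") and toleration["key"] != taint["key"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations, taints):
    """A Pod may land on a node only if every hard taint is tolerated."""
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )


GPU_TAINT = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
TOLERATIONS = [
    {"key": "dedicated", "operator": "Equal", "value": "gpu", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/not-ready", "operator": "Exists", "effect": "NoExecute"},
]

print(schedulable(TOLERATIONS, [GPU_TAINT]))  # True
print(schedulable([], [GPU_TAINT]))           # False
```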
Priority and Preemption
When the cluster runs short of resources, a higher-priority Pod can preempt lower-priority Pods to obtain the resources it needs.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "For business-critical applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: standard
value: 100
globalDefault: true
preemptionPolicy: PreemptLowerPriority
description: "Default priority"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: batch
value: 10
globalDefault: false
preemptionPolicy: Never
description: "Batch jobs"
```
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      priorityClassName: critical
      containers:
        - name: app
          image: critical-app:1.0
          resources:
            requests:
              cpu: 1
              memory: 1Gi
            limits:
              cpu: 2
              memory: 2Gi
```
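```mermaid
sequenceDiagram
    participant HP as High-priority Pod
    participant Sched as Scheduler
    participant Node as Node
    participant LP as Low-priority Pod
    HP->>Sched: Request scheduling
    Sched->>Node: Check available resources
    Node-->>Sched: Not enough resources
    Sched->>Sched: Look for Pods that can be preempted
    Sched->>LP: Send preemption signal
    LP->>LP: Graceful termination
    LP-->>Node: Release resources
    Sched->>Node: Schedule the high-priority Pod
    Node-->>HP: Pod running
```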
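The victim-selection step in the sequence above can be approximated with a greedy sketch. This is a deliberately simplified model and not the scheduler's actual algorithm: it considers a single node and CPU only, evicts the lowest-priority Pods first, and uses hypothetical names and numbers.

```python
def pick_victims(needed_m, free_m, pods, incoming_priority):
    """Return the names of Pods to evict, or None if preemption cannot free enough CPU.

    CPU values are in millicores. Only Pods with a strictly lower priority
    than the incoming Pod are candidates.
    """
    candidates = sorted(
        (p for p in pods if p["priority"] < incoming_priority),
        key=lambda p: p["priority"],
    )
    victims = []
    for p in candidates:
        if free_m >= needed_m:
            break
        victims.append(p["name"])
        free_m += p["cpu"]
    return victims if free_m >= needed_m else None


# Hypothetical Pods already running on the node.
RUNNING = [
    {"name": "db", "priority": 1_000_000, "cpu": 1000},
    {"name": "web", "priority": 100, "cpu": 600},
    {"name": "batch-1", "priority": 10, "cpu": 500},
]

print(pick_victims(1000, 200, RUNNING, incoming_priority=1_000_000))
# ['batch-1', 'web']
```

A Pod whose PriorityClass sets `preemptionPolicy: Never` would skip this step entirely and simply wait in the queue.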
Topology Spread Constraints
Topology spread constraints keep Pods evenly distributed across topology domains such as zones and nodes:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balanced-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: balanced-app
  template:
    metadata:
      labels:
        app: balanced-app
    spec:
      topologySpreadConstraints:
        # Hard constraint across zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: balanced-app
        # Soft constraint across nodes
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: balanced-app
      containers:
        - name: app
          image: myapp:1.0
```
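The effect of `maxSkew` is easiest to see with numbers. In the sketch below, skew is taken as the number of matching Pods a domain would hold after placement minus the current minimum across all domains. The function name and the Pod counts are assumptions for illustration.

```python
def allowed_domains(counts, max_skew):
    """Domains where adding one more Pod keeps the skew within max_skew."""
    lowest = min(counts.values())
    return sorted(
        domain for domain, n in counts.items()
        if n + 1 - lowest <= max_skew
    )


# Hypothetical number of matching Pods per zone.
print(allowed_domains({"zone-a": 3, "zone-b": 2, "zone-c": 1}, max_skew=1))
# ['zone-c']
print(allowed_domains({"zone-a": 2, "zone-b": 2, "zone-c": 2}, max_skew=1))
# ['zone-a', 'zone-b', 'zone-c']
```

With `whenUnsatisfiable: DoNotSchedule`, domains outside this list are filtered out. With `ScheduleAnyway`, they only receive a lower score.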
Resource Management Best Practices
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: production-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: production-app
  template:
    metadata:
      labels:
        app: production-app
    spec:
      priorityClassName: standard
      # Spread replicas evenly across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: production-app
      # Prefer one replica per node
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: production-app
                topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: myapp:1.0
          resources:
            requests:
              cpu: 200m
              memory: 256Mi
            limits:
              cpu: 1
              memory: 512Mi
```
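Summary
Key points of Kubernetes resource management:
Always set requests and limits, and aim for a Guaranteed or Burstable QoS class
Base requests on real monitoring data, to avoid over-requesting and wasting resources
Use LimitRange and ResourceQuota to govern resources at the namespace level
Use affinity and topology constraints sensibly to spread Pods out and improve availability
Give critical workloads a high priority, so that they are not the ones preempted when resources run short
Review resource configuration regularly, tracking usage trends with Prometheus and Grafana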