Architecture · #circuit-breaker#hystrix#sentinel#resilience

The Circuit Breaker Pattern: From Hystrix to Sentinel

2024.03.06 7 min 2.9k

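Introduction

In a distributed system, the call chain between services can be long. When a downstream service fails or slows down and its callers keep waiting, requests pile up until the callers exhaust their own resources. This is a "cascading failure". The Circuit Breaker pattern fails fast to stop the fault from propagating along the chain, which makes it an essential resilience mechanism in a microservice architecture.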

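The Cascading Failure Problem

graph LR
    A[Service A] --> B[Service B]
    B --> C[Service C]
    C --> D[Service D ☠️]

    style D fill:#f66

    A2[Service A<br/>thread pool exhausted] --> B2[Service B<br/>thread pool exhausted]
    B2 --> C2[Service C<br/>timeouts piling up]
    C2 --> D2[Service D<br/>down]

    style A2 fill:#f96
    style B2 fill:#f96
    style C2 fill:#f96
    style D2 fill:#f66

When service D goes down:

1. Service C's calls to D time out, and the calling threads block.
2. Service C's thread pool gradually runs dry, and C starts rejecting requests.
3. Service B's calls to C start timing out as well.
4. Eventually the entire call chain is paralyzed.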
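The thread-exhaustion step can be reproduced in a few lines. The sketch below is a toy model, not production code: `CascadeDemo`, the pool size of 2, and the 200 ms deadline are all assumptions chosen for illustration. Calls to the dead service are simulated by tasks that block on a latch, and the method reports whether an unrelated request can still get a worker thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo: "service C" has a pool of 2 worker threads, and every
// call to the dead "service D" blocks until downstreamRecovered is released.
class CascadeDemo {
    /** Returns true if a request unrelated to D still completes within 200 ms. */
    static boolean healthyRequestSucceeds(int stuckCalls) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        CountDownLatch downstreamRecovered = new CountDownLatch(1);
        try {
            for (int i = 0; i < stuckCalls; i++) {
                pool.submit(() -> {
                    try {
                        downstreamRecovered.await(); // simulates waiting on D
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Future<String> healthy = pool.submit(() -> "ok");
            try {
                return "ok".equals(healthy.get(200, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                return false; // queued behind the stuck calls, never scheduled
            } catch (InterruptedException | ExecutionException e) {
                throw new IllegalStateException(e);
            }
        } finally {
            downstreamRecovered.countDown();
            pool.shutdownNow();
        }
    }
}
```

With no stuck calls the healthy request completes; with two stuck calls both workers are occupied and the healthy request times out in the queue, which is the situation service C is in at step 2.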

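The Circuit Breaker State Machine

A circuit breaker has three states:

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: error rate exceeds threshold
    Open --> HalfOpen: open timeout elapsed
    HalfOpen --> Closed: probe request succeeds
    HalfOpen --> Open: probe request fails

    state Closed {
        [*] --> Monitoring
        Monitoring: let requests through
        Monitoring: track the error rate
    }

    state Open {
        [*] --> Rejecting
        Rejecting: fail fast
        Rejecting: return a fallback response
    }

    state HalfOpen {
        [*] --> Probing
        Probing: let a few probe requests through
        Probing: decide the next state from the results
    }
  • Closed: the normal state. Requests pass through while the breaker tracks the error rate.
  • Open: the tripped state. Every request fails fast and receives a fallback response.
  • Half-Open: the recovery-probing state. A small number of requests pass through to check whether the downstream service has recovered.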
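The transitions in the diagram can be written down as a pure function, which is convenient for testing the state logic on its own. This is only a sketch: `BreakerStateMachine`, `State`, and `Event` are hypothetical names, and events that do not apply to the current state are assumed to leave it unchanged.

```java
// Hypothetical names: State and Event are illustrative, not from any library.
class BreakerStateMachine {
    enum State { CLOSED, OPEN, HALF_OPEN }

    enum Event { ERROR_RATE_EXCEEDED, OPEN_TIMEOUT_ELAPSED, PROBE_SUCCEEDED, PROBE_FAILED }

    /** Returns the next state, or the current one if the event does not apply. */
    static State next(State current, Event event) {
        switch (current) {
            case CLOSED:
                return event == Event.ERROR_RATE_EXCEEDED ? State.OPEN : current;
            case OPEN:
                return event == Event.OPEN_TIMEOUT_ELAPSED ? State.HALF_OPEN : current;
            case HALF_OPEN:
                if (event == Event.PROBE_SUCCEEDED) return State.CLOSED;
                if (event == Event.PROBE_FAILED) return State.OPEN;
                return current;
            default:
                throw new IllegalArgumentException("unknown state: " + current);
        }
    }
}
```

For example, `next(State.OPEN, Event.PROBE_FAILED)` stays in `OPEN`, because probes only exist in the half-open state.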

Basic Implementation

import java.util.function.Supplier;

/**
 * A minimal circuit breaker. Two simplifications to be aware of:
 * it trips on consecutive failures rather than on an error rate, and
 * HALF_OPEN does not cap the number of probe requests it lets through.
 */
public class CircuitBreaker {
    private enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failureCount = 0;   // consecutive failures while CLOSED
    private int successCount = 0;   // successful probes while HALF_OPEN
    private final int failureThreshold;
    private final int successThreshold;
    private final long openTimeoutMs;
    private long lastStateChangeTime;
    private final Object lock = new Object();

    public CircuitBreaker(int failureThreshold, int successThreshold, long openTimeoutMs) {
        this.failureThreshold = failureThreshold;
        this.successThreshold = successThreshold;
        this.openTimeoutMs = openTimeoutMs;
    }

    /** Runs the action if the breaker allows it; otherwise returns the fallback. */
    public <T> T execute(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();
        }

        try {
            T result = action.get();
            recordSuccess();
            return result;
        } catch (Exception e) {
            recordFailure();
            return fallback.get();
        }
    }

    private boolean allowRequest() {
        synchronized (lock) {
            switch (state) {
                case CLOSED:
                    return true;
                case OPEN:
                    // Once the open timeout has elapsed, start probing.
                    if (System.currentTimeMillis() - lastStateChangeTime >= openTimeoutMs) {
                        transitionTo(State.HALF_OPEN);
                        return true;
                    }
                    return false;
                case HALF_OPEN:
                    return true;
                default:
                    return false;
            }
        }
    }

    private void recordSuccess() {
        synchronized (lock) {
            if (state == State.HALF_OPEN) {
                successCount++;
                if (successCount >= successThreshold) {
                    transitionTo(State.CLOSED);
                }
            }
            // Any success resets the consecutive-failure counter.
            failureCount = 0;
        }
    }

    private void recordFailure() {
        synchronized (lock) {
            failureCount++;
            if (state == State.HALF_OPEN) {
                // A single failed probe reopens the breaker.
                transitionTo(State.OPEN);
            } else if (failureCount >= failureThreshold) {
                transitionTo(State.OPEN);
            }
        }
    }

    // Callers must hold the lock.
    private void transitionTo(State newState) {
        state = newState;
        lastStateChangeTime = System.currentTimeMillis();
        failureCount = 0;
        successCount = 0;
    }
}
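The state diagram talks about an error rate, while the class above counts failures since the last success. One way to bridge that gap is a count-based sliding window over the most recent outcomes. The following is a sketch under that assumption; `ErrorRateWindow` and its method names are hypothetical and are not part of the listing above.

```java
// Hypothetical helper: tracks the failure ratio over the last N call outcomes.
class ErrorRateWindow {
    private final boolean[] failed; // ring buffer of the most recent outcomes
    private int next = 0;           // index the next outcome is written to
    private int size = 0;           // outcomes recorded so far, capped at capacity
    private int failures = 0;       // failures currently inside the window

    ErrorRateWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.failed = new boolean[capacity];
    }

    synchronized void record(boolean failure) {
        if (size == failed.length) {
            if (failed[next]) {
                failures--;         // evict the oldest outcome
            }
        } else {
            size++;
        }
        failed[next] = failure;
        if (failure) {
            failures++;
        }
        next = (next + 1) % failed.length;
    }

    synchronized double errorRate() {
        return size == 0 ? 0.0 : (double) failures / size;
    }

    /** True once enough calls were seen and the failure ratio reaches the threshold. */
    synchronized boolean shouldOpen(int minCalls, double threshold) {
        return size >= minCalls && errorRate() >= threshold;
    }
}
```

A breaker could call `record` from its success and failure paths and use `shouldOpen` in place of the `failureCount >= failureThreshold` check. The `minCalls` guard keeps a single early failure from tripping the breaker at a 100% error rate.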

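Hystrix Architecture

Core Design

Hystrix is Netflix's open-source fault-tolerance library. It is now in maintenance mode, but its design is still a valuable reference.

graph TB
    subgraph Hystrix workflow
        Request[Request] --> CB{Circuit breaker<br/>open?}
        CB -->|Open| Fallback1[Fallback response]
        CB -->|Closed| Pool{Thread pool / semaphore<br/>full?}
        Pool -->|Full| Fallback2[Fallback response]
        Pool -->|Not full| Execute[Run command]
        Execute -->|Success| Response[Return response]
        Execute -->|Failure / timeout| Fallback3[Fallback response]

        Response --> Metrics[Metrics]
        Fallback1 --> Metrics
        Fallback2 --> Metrics
        Fallback3 --> Metrics
    end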
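The decision flow in the diagram can be approximated in plain Java, which may help when reading the Hystrix API. This is an illustrative sketch only: `GuardedCommand` and its members are hypothetical names, the breaker check is reduced to a `BooleanSupplier`, a `Semaphore` stands in for the pool or semaphore limit, and timeouts are left out.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Illustrative sketch only: these names are hypothetical, not Hystrix classes.
class GuardedCommand<T> {
    private final BooleanSupplier circuitOpen; // stands in for the breaker check
    private final Semaphore permits;           // stands in for the pool/semaphore limit
    final AtomicInteger successes = new AtomicInteger(); // metrics: successful runs
    final AtomicInteger fallbacks = new AtomicInteger(); // metrics: degraded responses

    GuardedCommand(BooleanSupplier circuitOpen, int maxConcurrent) {
        this.circuitOpen = circuitOpen;
        this.permits = new Semaphore(maxConcurrent);
    }

    T execute(Supplier<T> run, Supplier<T> fallback) {
        if (circuitOpen.getAsBoolean()) {
            return degrade(fallback);          // breaker open: fail fast
        }
        if (!permits.tryAcquire()) {
            return degrade(fallback);          // capacity exhausted: reject
        }
        try {
            T result = run.get();
            successes.incrementAndGet();
            return result;
        } catch (RuntimeException e) {
            return degrade(fallback);          // command failed
        } finally {
            permits.release();
        }
    }

    private T degrade(Supplier<T> fallback) {
        fallbacks.incrementAndGet();
        return fallback.get();
    }
}
```

All three fallback branches feed the same counter, mirroring how every path in the diagram ends at the metrics node.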

Hystrix Usage Example

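// OrderServiceCommand.java: the HystrixCommand subclass style
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class OrderServiceCommand extends HystrixCommand<Order> {
    private final String orderId;
    private final OrderClient orderClient;

    public OrderServiceCommand(String orderId, OrderClient orderClient) {
        super(HystrixCommand.Setter
            .withGroupKey(HystrixCommandGroupKey.Factory.asKey("OrderService"))
            .andCommandKey(HystrixCommandKey.Factory.asKey("GetOrder"))
            .andCommandPropertiesDefaults(
                HystrixCommandProperties.Setter()
                    .withCircuitBreakerRequestVolumeThreshold(20)              // need at least 20 requests in the window
                    .withCircuitBreakerErrorThresholdPercentage(50)            // trip at a 50% error rate
                    .withCircuitBreakerSleepWindowInMilliseconds(5000)         // stay open for 5 s before probing
                    .withExecutionTimeoutInMilliseconds(3000)                  // 3 s execution timeout
                    .withMetricsRollingStatisticalWindowInMilliseconds(10000)  // 10 s rolling statistics window
            )
            .andThreadPoolPropertiesDefaults(
                HystrixThreadPoolProperties.Setter()
                    .withCoreSize(10)                       // core thread count
                    .withMaxQueueSize(100)                  // queue capacity
                    .withQueueSizeRejectionThreshold(50)    // reject once the queue holds 50 tasks
            ));
        this.orderId = orderId;
        this.orderClient = orderClient;
    }

    @Override
    protected Order run() throws Exception {
        return orderClient.getOrder(orderId);
    }

    @Override
    protected Order getFallback() {
        // Fallback strategy: return cached data or a default value
        return Order.defaultOrder(orderId);
    }
}

// OrderService.java: the annotation style (Spring Cloud, hystrix-javanica).
// This goes in a separate file because the annotation is also named HystrixCommand.
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private final OrderClient orderClient;

    public OrderService(OrderClient orderClient) {
        this.orderClient = orderClient;
    }

    @HystrixCommand(
        fallbackMethod = "getOrderFallback",
        commandProperties = {
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "20"),
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
        },
        threadPoolProperties = {
            @HystrixProperty(name = "coreSize", value = "10")
        }
    )
    public Order getOrder(String orderId) {
        return orderClient.getOrder(orderId);
    }

    // The fallback must take the same parameters as the guarded method.
    private Order getOrderFallback(String orderId) {
        return Order.defaultOrder(orderId);
    }
}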

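Hystrix Isolation Strategies

graph TB
    subgraph Thread pool isolation
        TP1[Service A pool<br/>10 threads] --> ServiceA[Service A]
        TP2[Service B pool<br/>20 threads] --> ServiceB[Service B]
        TP3[Service C pool<br/>15 threads] --> ServiceC[Service C]
    end

    subgraph Semaphore isolation
        SEM[Semaphore counter] --> Handler[Request handling<br/>runs on the calling thread]
    end

| Aspect | Thread pool isolation | Semaphore isolation |
| --- | --- | --- |
| Thread overhead | Thread context-switch overhead | No extra threads |
| Timeout support | Supported (the separate thread can be interrupted) | No active timeout |
| Resource isolation | Fully isolated | Shares the calling thread |
| Typical use | Network calls | In-memory cache access |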
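The timeout row is easier to see in code. The sketch below imitates thread pool isolation with the standard `ExecutorService`; it is not how Hystrix is implemented, and `IsolatedCall.callWithTimeout` is a hypothetical helper. Because the work runs on a separate thread, the caller can give up after a deadline and interrupt the worker, which a call running on the caller's own thread cannot do.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper that imitates thread pool isolation with a deadline.
class IsolatedCall {
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task,
                                 long timeoutMs, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback;             // the dependency's pool is saturated
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);         // interrupt the worker thread
            return fallback;
        } catch (ExecutionException e) {
            return fallback;             // the task itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        }
    }
}
```

Giving each dependency its own bounded pool means a slow dependency can exhaust only that pool, at the cost of one thread hand-off per call.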

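Sentinel Architecture

Why Sentinel

Sentinel is Alibaba's open-source traffic protection component. Compared with Hystrix, which is no longer actively developed, Sentinel offers richer flow control features and better real-time monitoring.

graph TB
    subgraph Sentinel core
        Entry[Resource entry] --> Slots[Slot Chain]

        subgraph Slot Chain
            NS[NodeSelectorSlot<br/>node selection] --> CS[ClusterBuilderSlot<br/>cluster statistics]
            CS --> LS[LogSlot<br/>logging]
            LS --> SS[StatisticSlot<br/>real-time statistics]
            SS --> AU[AuthoritySlot<br/>authority rules]
            AU --> SP[SystemSlot<br/>system protection]
            SP --> FC[FlowSlot<br/>flow control]
            FC --> DG[DegradeSlot<br/>degradation / circuit breaking]
        end
    end

    DG --> Resource[Protected resource]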
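The slot chain is a chain of responsibility: each slot either records something and passes the request on, or blocks it by throwing. The sketch below shows only that shape. `SlotChainSketch`, `Slot`, and `BlockedException` are hypothetical names, Sentinel's real slot interface is richer, and the counter here is not thread-safe.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types that only mirror the shape of a slot chain.
class SlotChainSketch {
    static class BlockedException extends RuntimeException {
        BlockedException(String message) {
            super(message);
        }
    }

    interface Slot {
        /** Either appends to the trace and returns, or throws to block the request. */
        void entry(String resource, List<String> trace);
    }

    /** A slot that only records its own name. */
    static Slot named(String name) {
        return (resource, trace) -> trace.add(name);
    }

    /** A flow slot that lets maxPasses requests through and blocks the rest. */
    static Slot flowLimit(int maxPasses) {
        int[] passed = {0}; // not thread-safe; fine for a single-threaded sketch
        return (resource, trace) -> {
            if (passed[0] >= maxPasses) {
                throw new BlockedException("flow limit hit for " + resource);
            }
            passed[0]++;
            trace.add("FlowSlot");
        };
    }

    /** Runs the resource through every slot in order and returns the trace. */
    static List<String> enter(List<Slot> chain, String resource) {
        List<String> trace = new ArrayList<>();
        for (Slot slot : chain) {
            slot.entry(resource, trace);
        }
        return trace;
    }
}
```

Because a blocking slot throws, slots later in the chain never see a rejected request, which is why the order of the slots matters.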

Flow Control

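import com.alibaba.csp.sentinel.annotation.SentinelResource;
import com.alibaba.csp.sentinel.slots.block.BlockException;

// Annotation-based flow control
@SentinelResource(
    value = "getOrder",
    blockHandler = "getOrderBlockHandler",
    fallback = "getOrderFallback"
)
public Order getOrder(String orderId) {
    return orderClient.getOrder(orderId);
}

// Runs when a rule blocks the call. The parameter list must match the
// guarded method, with a trailing BlockException.
public Order getOrderBlockHandler(String orderId, BlockException ex) {
    // Invoked when the request is rate-limited or circuit-broken
    return Order.limitedOrder(orderId);
}

// Runs when the business logic throws an exception
public Order getOrderFallback(String orderId, Throwable throwable) {
    return Order.defaultOrder(orderId);
}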
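import java.util.ArrayList;
import java.util.List;

import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

// Programmatic flow rules
private void initFlowRules() {
    List<FlowRule> rules = new ArrayList<>();

    // QPS limit
    FlowRule qpsRule = new FlowRule();
    qpsRule.setResource("getOrder");
    qpsRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    qpsRule.setCount(100); // 100 QPS
    qpsRule.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP);
    qpsRule.setWarmUpPeriodSec(10); // warm up over 10 s
    rules.add(qpsRule);

    // Concurrent thread limit
    FlowRule threadRule = new FlowRule();
    threadRule.setResource("getOrder");
    threadRule.setGrade(RuleConstant.FLOW_GRADE_THREAD);
    threadRule.setCount(50); // at most 50 concurrent threads
    rules.add(threadRule);

    // Related-resource limit: throttle the read API when the write API is under pressure
    FlowRule relatedRule = new FlowRule();
    relatedRule.setResource("queryOrder");
    relatedRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
    relatedRule.setStrategy(RuleConstant.STRATEGY_RELATE);
    relatedRule.setRefResource("createOrder"); // the related resource
    relatedRule.setCount(50);
    rules.add(relatedRule);

    FlowRuleManager.loadRules(rules);
}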
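To make the QPS rule concrete, here is a deliberately simple fixed-window counter that passes up to a limit per second and rejects the rest. It is an assumption-laden stand-in, not Sentinel's algorithm: Sentinel keeps sliding-window statistics and also supports warm-up and queueing. `FixedWindowLimiter` is a hypothetical name, and the clock is injected so the behavior can be checked deterministically.

```java
import java.util.function.LongSupplier;

// Hypothetical limiter: at most maxPerSecond passes in each 1-second window.
class FixedWindowLimiter {
    private final int maxPerSecond;
    private final LongSupplier clockMillis; // injected clock, e.g. System::currentTimeMillis
    private long windowStart;
    private int count;

    FixedWindowLimiter(int maxPerSecond, LongSupplier clockMillis) {
        this.maxPerSecond = maxPerSecond;
        this.clockMillis = clockMillis;
        this.windowStart = clockMillis.getAsLong();
    }

    /** Returns true if the request may pass, false if it should be rejected. */
    synchronized boolean tryPass() {
        long now = clockMillis.getAsLong();
        if (now - windowStart >= 1000) {
            windowStart = now;  // start a fresh window
            count = 0;
        }
        if (count >= maxPerSecond) {
            return false;       // direct rejection, like the default control behavior
        }
        count++;
        return true;
    }
}
```

A fixed window can admit up to twice the limit across a window boundary, which is one reason production limiters prefer sliding windows.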

Degrade Rules

import java.util.ArrayList;
import java.util.List;

import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
import com.alibaba.csp.sentinel.slots.block.degrade.circuitbreaker.CircuitBreakerStrategy;

// Degrade (circuit-breaking) rules
private void initDegradeRules() {
    List<DegradeRule> rules = new ArrayList<>();

    // Trip on slow-call ratio
    DegradeRule slowCallRule = new DegradeRule();
    slowCallRule.setResource("getOrder");
    slowCallRule.setGrade(CircuitBreakerStrategy.SLOW_REQUEST_RATIO.getType());
    slowCallRule.setCount(500);               // a call slower than 500 ms counts as slow
    slowCallRule.setSlowRatioThreshold(0.5);  // slow-call ratio: 50%
    slowCallRule.setMinRequestAmount(10);     // minimum requests before evaluating
    slowCallRule.setStatIntervalMs(10000);    // statistics window: 10 s
    slowCallRule.setTimeWindow(30);           // stay open for 30 s
    rules.add(slowCallRule);

    // Trip on error ratio
    DegradeRule errorRatioRule = new DegradeRule();
    errorRatioRule.setResource("createOrder");
    errorRatioRule.setGrade(CircuitBreakerStrategy.ERROR_RATIO.getType());
    errorRatioRule.setCount(0.5);             // error ratio: 50%
    errorRatioRule.setMinRequestAmount(10);
    errorRatioRule.setStatIntervalMs(10000);
    errorRatioRule.setTimeWindow(60);         // stay open for 60 s
    rules.add(errorRatioRule);

    // Trip on error count
    DegradeRule errorCountRule = new DegradeRule();
    errorCountRule.setResource("payOrder");
    errorCountRule.setGrade(CircuitBreakerStrategy.ERROR_COUNT.getType());
    errorCountRule.setCount(10);              // error count: 10
    errorCountRule.setStatIntervalMs(60000);  // statistics window: 1 min
    errorCountRule.setTimeWindow(120);        // stay open for 2 min
    rules.add(errorCountRule);

    DegradeRuleManager.loadRules(rules);
}
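How the slow-call parameters interact is easiest to see as a small predicate. The sketch below is a simplification: `SlowCallRatioCheck` is a hypothetical helper that evaluates one statistics window given as an array of response times, and the strict comparisons (`>`) are assumptions about boundary behavior rather than a statement of how Sentinel treats the exact threshold.

```java
// Hypothetical helper that evaluates one statistics window of response times.
class SlowCallRatioCheck {
    /**
     * @param responseTimesMs    response times observed in the window
     * @param slowThresholdMs    a call slower than this counts as slow (e.g. 500)
     * @param slowRatioThreshold trip when the slow ratio exceeds this (e.g. 0.5)
     * @param minRequestAmount   do not evaluate windows with fewer calls (e.g. 10)
     */
    static boolean shouldOpen(long[] responseTimesMs, long slowThresholdMs,
                              double slowRatioThreshold, int minRequestAmount) {
        if (responseTimesMs.length < minRequestAmount) {
            return false; // too few samples to judge
        }
        int slow = 0;
        for (long rt : responseTimesMs) {
            if (rt > slowThresholdMs) {
                slow++;
            }
        }
        return (double) slow / responseTimesMs.length > slowRatioThreshold;
    }
}
```

With the values from the first rule above (500 ms, 50%, 10 requests), a window of ten calls trips the breaker when six of them take longer than 500 ms, while a window of nine calls never does, however slow they are.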

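System Protection

System protection rules, which Hystrix has no counterpart for, throttle inbound traffic adaptively based on overall system load.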

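import java.util.ArrayList;
import java.util.List;

import com.alibaba.csp.sentinel.slots.system.SystemRule;
import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;

// System protection rules
private void initSystemRules() {
    List<SystemRule> rules = new ArrayList<>();

    SystemRule rule = new SystemRule();
    rule.setHighestSystemLoad(3.0);  // trigger when the system load average exceeds 3
    rule.setHighestCpuUsage(0.8);    // CPU usage above 80%
    rule.setAvgRt(200);              // average RT above 200 ms
    rule.setMaxThread(500);          // maximum concurrent threads
    rule.setQps(5000);               // inbound QPS ceiling
    rules.add(rule);

    SystemRuleManager.loadRules(rules);
}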
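graph TB
    subgraph Adaptive system protection
        Load[System load monitoring] --> Decision{Load above threshold?}
        Decision -->|Yes| Limit[Adaptive throttling]
        Decision -->|No| Pass[Pass through]

        Limit --> BBR[BBR algorithm<br/>computes the allowed QPS adaptively]
        BBR --> |Allowed QPS =<br/>maxSuccessQPS ×<br/>minRT / avgRT| Actual[Effective limit]
    end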
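The formula in the diagram can be checked with a quick calculation. The helper below simply evaluates it; `AdaptiveLimit.allowedQps` is a hypothetical name and the numbers in the note are made up for illustration.

```java
// Hypothetical helper that evaluates the formula from the diagram.
class AdaptiveLimit {
    /**
     * allowed QPS = maxSuccessQps * minRt / avgRt
     *
     * @param maxSuccessQps highest successful throughput observed recently
     * @param minRtMs       lowest response time observed recently, in ms
     * @param avgRtMs       current average response time, in ms
     */
    static double allowedQps(double maxSuccessQps, double minRtMs, double avgRtMs) {
        if (avgRtMs <= 0) {
            throw new IllegalArgumentException("avgRtMs must be positive");
        }
        return maxSuccessQps * minRtMs / avgRtMs;
    }
}
```

Suppose the best recent throughput was 1000 QPS at a minimum RT of 20 ms. That corresponds to roughly 1000 × 0.02 = 20 requests in flight. If the average RT has since risen to 50 ms, keeping about 20 requests in flight allows 1000 × 20 / 50 = 400 QPS, so the limit tightens as the system slows down and relaxes again as RT recovers.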

Dashboard Configuration

# Sentinel Dashboard configuration
# application.yml
spring:
  cloud:
    sentinel:
      transport:
        dashboard: localhost:8080
        port: 8719
      datasource:
        flow:
          nacos:
            server-addr: localhost:8848
            dataId: ${spring.application.name}-flow-rules
            groupId: SENTINEL_GROUP
            data-type: json
            rule-type: flow
        degrade:
          nacos:
            server-addr: localhost:8848
            dataId: ${spring.application.name}-degrade-rules
            groupId: SENTINEL_GROUP
            data-type: json
            rule-type: degrade
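# Start the Sentinel Dashboard.
# Sentinel reads csp.sentinel.* from JVM system properties, so pass them with -D before -jar.
java -Dserver.port=8080 \
     -Dcsp.sentinel.dashboard.server=localhost:8080 \
     -Dproject.name=sentinel-dashboard \
     -jar sentinel-dashboard-1.8.7.jar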

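Hystrix vs Sentinel

| Dimension | Hystrix | Sentinel |
| --- | --- | --- |
| Maintenance status | Maintenance mode, no active development | Actively maintained |
| Isolation strategy | Thread pool / semaphore | Semaphore style (concurrent thread limit) |
| Circuit-breaking strategy | Error rate | Slow-call ratio / error ratio / error count |
| Flow control strategy | - | QPS / thread count / related resource / call chain |
| System protection | - | Supported (CPU / load / RT) |
| Flow control behavior | - | Reject immediately / warm up / queue |
| Console | Basic | Feature-rich, real-time monitoring |
| Rule persistence | - | Nacos / Apollo / ZooKeeper |
| Extensibility | Moderate | Many SPI extension points |

graph LR
    subgraph Selection advice
        New[New project] --> Sentinel1[Sentinel recommended]
        Legacy[Existing Hystrix project] --> Migrate{Migrate?}
        Migrate -->|Yes| Sentinel2[Migrate to Sentinel]
        Migrate -->|No| Resilience4j[Or consider Resilience4j]
    end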

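Designing Fallback Strategies

graph TB
    Request[Request] --> CB{Circuit breaker state?}
    CB -->|Closed| Execute[Execute normally]
    CB -->|Open| Fallback{Fallback strategy}

    Fallback --> Cache[Return cached data]
    Fallback --> Default[Return a default value]
    Fallback --> Stub[Return stub data]
    Fallback --> Retry[Switch to a backup service]
    Fallback --> Queue[Queue the request]

    Execute -->|Success| Response[Return response]
    Execute -->|Failure| Fallback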
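import java.util.concurrent.TimeUnit;

import com.alibaba.csp.sentinel.annotation.SentinelResource;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

// Multi-level fallback strategy
@Service
public class ProductService {
    @Autowired
    private ProductClient productClient;
    @Autowired
    private RedisTemplate<String, Product> redis;

    @SentinelResource(value = "getProduct", fallback = "getProductFallback")
    public Product getProduct(String productId) {
        Product product = productClient.getById(productId);
        // Refresh the cache on success
        redis.opsForValue().set("product:" + productId, product, 1, TimeUnit.HOURS);
        return product;
    }

    // Fallback level 1: try the cache first
    public Product getProductFallback(String productId, Throwable t) {
        Product cached = redis.opsForValue().get("product:" + productId);
        if (cached != null) {
            return cached;
        }
        // Fallback level 2: nothing cached, return skeleton data
        return Product.skeleton(productId);
    }
}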
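The cache-then-skeleton logic above generalizes to an ordered list of fallback levels. The following sketch shows one way to express that, assuming each level returns an `Optional` and that a level which throws is simply skipped. `FallbackChain.firstAvailable` is a hypothetical helper, not part of Sentinel or Spring.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: tries each fallback level in order.
class FallbackChain {
    /**
     * Returns the first value any level provides. A level that returns an
     * empty Optional or throws is skipped. lastResort is returned when
     * every level comes up empty.
     */
    static <T> T firstAvailable(List<Supplier<Optional<T>>> levels, T lastResort) {
        for (Supplier<Optional<T>> level : levels) {
            try {
                Optional<T> value = level.get();
                if (value.isPresent()) {
                    return value.get();
                }
            } catch (RuntimeException e) {
                // This level is itself unavailable (e.g. the cache is down); try the next.
            }
        }
        return lastResort;
    }
}
```

Catching failures inside the chain matters because a fallback that depends on another remote system, such as Redis, can fail at the same time as the primary call.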

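Summary

The circuit breaker pattern is central to resilient design in distributed systems. Hystrix set the standard for circuit breakers in the Java ecosystem, but it is no longer actively developed. Sentinel, with richer flow control, adaptive system protection, and an active community, is now the usual first choice. In practice, rate limiting, circuit breaking, and fallback work best as one combined strategy, with a sensible fallback designed for every external dependency. Monitoring and alerting deserve equal attention so that failures are detected and handled promptly.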

Author · zt
Published · 2024-03-06
Length · 2.9k characters · 7 min
License · CC BY-SA 4.0