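Introduction
In a distributed system, the call chain between services can be long. When a downstream service fails or slows down and the caller keeps waiting, requests pile up until the caller exhausts its own resources. This is a "cascading failure". The Circuit Breaker pattern cuts the propagation path by failing fast, which makes it an essential resilience mechanism in a microservice architecture.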
The Cascading Failure Problem
graph LR
A[Service A] --> B[Service B]
B --> C[Service C]
C --> D[Service D ☠️]
style D fill:#f66
A2[Service A<br/>thread pool exhausted] --> B2[Service B<br/>thread pool exhausted]
B2 --> C2[Service C<br/>timeouts piling up]
C2 --> D2[Service D<br/>down]
style A2 fill:#f96
style B2 fill:#f96
style C2 fill:#f96
style D2 fill:#f66
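When service D goes down:
1. Service C's calls to D time out, and the calling threads block.
2. Service C's thread pool gradually runs out, and C starts rejecting requests.
3. Service B's calls to C start timing out as well.
4. Eventually the entire call chain is down.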
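Step 2 of that sequence can be reproduced with nothing but the JDK. The sketch below is an illustration under stated assumptions, not taken from any of the libraries discussed here: `CascadingFailureDemo` and `countRejected` are hypothetical names, the worker pool stands in for service C, and a latch that is never released stands in for a hung call to service D.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo: service C's worker pool, with every worker stuck on a hung call to D. */
final class CascadingFailureDemo {
    private CascadingFailureDemo() {}

    /** Returns how many of {@code requests} were rejected because all workers were blocked. */
    static int countRejected(int poolSize, int requests) {
        // Stands in for "D answers"; it is only released during cleanup.
        CountDownLatch downstreamAnswers = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new SynchronousQueue<>(),               // no queue: each request needs a free thread
                new ThreadPoolExecutor.AbortPolicy());  // reject once the pool is saturated
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            downstreamAnswers.await(); // the thread blocks "calling D"
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // C has no thread left, so its own callers start failing
                }
            }
        } finally {
            downstreamAnswers.countDown();
            pool.shutdownNow();
        }
        return rejected;
    }
}
```

With a pool of 2 threads and 5 incoming requests, the first two requests occupy both workers and the remaining three are rejected, even though service C itself has no bug.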
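Circuit Breaker State Machine
A circuit breaker has three states:
stateDiagram-v2
[*] --> Closed
Closed --> Open: error rate exceeds threshold
Open --> HalfOpen: open timeout elapsed
HalfOpen --> Closed: probe request succeeds
HalfOpen --> Open: probe request fails
state Closed {
[*] --> Monitoring
Monitoring: let requests through
Monitoring: track the error rate
}
state Open {
[*] --> Rejecting
Rejecting: fail fast
Rejecting: return a fallback response
}
state HalfOpen {
[*] --> Probing
Probing: let a few probe requests through
Probing: decide the next state from the results
}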
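- Closed: the normal state. Requests pass through while the breaker tracks the error rate.
- Open: the tripped state. Every request fails fast and receives a fallback response.
- Half-Open: the recovery-probing state. A small number of requests are let through to check whether the downstream service has recovered.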
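The transitions described above fit in a single pure function, which is a convenient way to unit-test the state machine separately from timing and counting. This is a minimal sketch; the names `BreakerState`, `BreakerEvent`, and `BreakerTransitions` are made up for illustration.

```java
/** The three states from the diagram. */
enum BreakerState { CLOSED, OPEN, HALF_OPEN }

/** Hypothetical event names, one per arrow in the diagram. */
enum BreakerEvent { ERROR_RATE_EXCEEDED, OPEN_TIMEOUT_ELAPSED, PROBE_SUCCEEDED, PROBE_FAILED }

final class BreakerTransitions {
    private BreakerTransitions() {}

    /** Returns the next state; events that do not apply to the current state leave it unchanged. */
    static BreakerState next(BreakerState state, BreakerEvent event) {
        switch (state) {
            case CLOSED:
                return event == BreakerEvent.ERROR_RATE_EXCEEDED ? BreakerState.OPEN : state;
            case OPEN:
                return event == BreakerEvent.OPEN_TIMEOUT_ELAPSED ? BreakerState.HALF_OPEN : state;
            case HALF_OPEN:
                if (event == BreakerEvent.PROBE_SUCCEEDED) {
                    return BreakerState.CLOSED;
                }
                if (event == BreakerEvent.PROBE_FAILED) {
                    return BreakerState.OPEN;
                }
                return state;
            default:
                throw new IllegalStateException("unknown state: " + state);
        }
    }
}
```

Note that there is no direct path from Open back to Closed: recovery always goes through Half-Open.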
Basic Implementation

    import java.util.function.Supplier;

    public class CircuitBreaker {
        private enum State { CLOSED, OPEN, HALF_OPEN }

        private State state = State.CLOSED;
        private int failureCount = 0;
        private int successCount = 0;
        private final int failureThreshold;   // consecutive failures that trip the breaker
        private final int successThreshold;   // probe successes needed to close it again
        private final long openTimeoutMs;     // how long to stay open before probing
        private long lastStateChangeTime;
        private final Object lock = new Object();

        public CircuitBreaker(int failureThreshold, int successThreshold, long openTimeoutMs) {
            this.failureThreshold = failureThreshold;
            this.successThreshold = successThreshold;
            this.openTimeoutMs = openTimeoutMs;
        }

        /** Runs the action if the breaker allows it; otherwise, or on failure, returns the fallback. */
        public <T> T execute(Supplier<T> action, Supplier<T> fallback) {
            if (!allowRequest()) {
                return fallback.get();
            }

            try {
                T result = action.get();
                recordSuccess();
                return result;
            } catch (Exception e) {
                recordFailure();
                return fallback.get();
            }
        }

        private boolean allowRequest() {
            synchronized (lock) {
                switch (state) {
                    case CLOSED:
                        return true;
                    case OPEN:
                        // After the open timeout, move to HALF_OPEN and let this request probe.
                        if (System.currentTimeMillis() - lastStateChangeTime >= openTimeoutMs) {
                            transitionTo(State.HALF_OPEN);
                            return true;
                        }
                        return false;
                    case HALF_OPEN:
                        // Simplification: every request is let through while half-open;
                        // a production breaker would cap the number of probes.
                        return true;
                    default:
                        return false;
                }
            }
        }

        private void recordSuccess() {
            synchronized (lock) {
                if (state == State.HALF_OPEN) {
                    successCount++;
                    if (successCount >= successThreshold) {
                        transitionTo(State.CLOSED);
                    }
                }
                // Any success resets the counter, so failureCount tracks consecutive failures.
                failureCount = 0;
            }
        }

        private void recordFailure() {
            synchronized (lock) {
                failureCount++;
                if (state == State.HALF_OPEN) {
                    // A single failed probe reopens the breaker.
                    transitionTo(State.OPEN);
                } else if (failureCount >= failureThreshold) {
                    transitionTo(State.OPEN);
                }
            }
        }

        // Callers must hold the lock.
        private void transitionTo(State newState) {
            state = newState;
            lastStateChangeTime = System.currentTimeMillis();
            failureCount = 0;
            successCount = 0;
        }
    }
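The basic implementation trips on a count of consecutive failures, while the state diagram talks about an error rate. One way to bridge that gap is a count-based sliding window over the most recent calls. The sketch below is an assumption-laden illustration: `SlidingWindowErrorRate` and its methods are hypothetical names, and the ring buffer is only one of several possible window designs.

```java
/** Hypothetical helper: tracks the failure ratio over the last N recorded calls. */
final class SlidingWindowErrorRate {
    private final boolean[] failed; // ring buffer: true = that call failed
    private int nextIndex = 0;
    private int size = 0;
    private int failures = 0;

    SlidingWindowErrorRate(int windowSize) {
        if (windowSize <= 0) {
            throw new IllegalArgumentException("windowSize must be positive");
        }
        this.failed = new boolean[windowSize];
    }

    /** Records one call outcome, evicting the oldest outcome once the window is full. */
    synchronized void record(boolean failure) {
        if (size == failed.length) {
            if (failed[nextIndex]) {
                failures--; // the evicted call was a failure
            }
        } else {
            size++;
        }
        failed[nextIndex] = failure;
        if (failure) {
            failures++;
        }
        nextIndex = (nextIndex + 1) % failed.length;
    }

    synchronized double errorRate() {
        return size == 0 ? 0.0 : (double) failures / size;
    }

    /** Trip only when enough calls were seen, so one early failure cannot open the breaker. */
    synchronized boolean shouldOpen(int minCalls, double threshold) {
        return size >= minCalls && errorRate() >= threshold;
    }
}
```

A breaker would call `record(...)` where the basic implementation calls `recordSuccess()` or `recordFailure()`, and use `shouldOpen(...)` in place of the `failureCount >= failureThreshold` check.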
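Hystrix Architecture
Core Design
Hystrix is a fault-tolerance library open-sourced by Netflix. Although it is now in maintenance mode, its design is still well worth studying.
graph TB
subgraph Hystrix workflow
Request[Request] --> CB{Circuit breaker<br/>open?}
CB -->|Open| Fallback1[Fallback response]
CB -->|Closed| Pool{Thread pool / semaphore<br/>full?}
Pool -->|Full| Fallback2[Fallback response]
Pool -->|Not full| Execute[Execute command]
Execute -->|Success| Response[Return response]
Execute -->|Failure / timeout| Fallback3[Fallback response]
Response --> Metrics[Metrics]
Fallback1 --> Metrics
Fallback2 --> Metrics
Fallback3 --> Metrics
end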
Hystrix Usage Example

    // OrderServiceCommand.java: programmatic style, one command class per remote call
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandKey;
    import com.netflix.hystrix.HystrixCommandProperties;
    import com.netflix.hystrix.HystrixThreadPoolProperties;

    public class OrderServiceCommand extends HystrixCommand<Order> {
        private final String orderId;
        private final OrderClient orderClient;

        public OrderServiceCommand(String orderId, OrderClient orderClient) {
            super(HystrixCommand.Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("OrderService"))
                .andCommandKey(HystrixCommandKey.Factory.asKey("GetOrder"))
                .andCommandPropertiesDefaults(
                    HystrixCommandProperties.Setter()
                        // Evaluate the breaker only after 20 requests in the rolling window
                        .withCircuitBreakerRequestVolumeThreshold(20)
                        // Trip when at least 50% of those requests failed
                        .withCircuitBreakerErrorThresholdPercentage(50)
                        // Stay open for 5 s before letting a probe request through
                        .withCircuitBreakerSleepWindowInMilliseconds(5000)
                        // Time out a single execution after 3 s
                        .withExecutionTimeoutInMilliseconds(3000)
                        // Rolling statistics window of 10 s
                        .withMetricsRollingStatisticalWindowInMilliseconds(10000)
                )
                .andThreadPoolPropertiesDefaults(
                    HystrixThreadPoolProperties.Setter()
                        .withCoreSize(10)
                        .withMaxQueueSize(100)
                        // Reject once 50 tasks are queued, even though the queue holds 100
                        .withQueueSizeRejectionThreshold(50)
                ));
            this.orderId = orderId;
            this.orderClient = orderClient;
        }

        @Override
        protected Order run() throws Exception {
            return orderClient.getOrder(orderId);
        }

        @Override
        protected Order getFallback() {
            return Order.defaultOrder(orderId);
        }
    }

    // OrderService.java: annotation style (hystrix-javanica with Spring)
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.stereotype.Service;

    @Service
    public class OrderService {
        @Autowired
        private OrderClient orderClient;

        @HystrixCommand(
            fallbackMethod = "getOrderFallback",
            commandProperties = {
                @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "20"),
                @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
                @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds", value = "3000")
            },
            threadPoolProperties = {
                @HystrixProperty(name = "coreSize", value = "10")
            }
        )
        public Order getOrder(String orderId) {
            return orderClient.getOrder(orderId);
        }

        // Must take the same parameters as getOrder
        private Order getOrderFallback(String orderId) {
            return Order.defaultOrder(orderId);
        }
    }
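Stripped of Hystrix itself, the core of a thread-isolated command is "run the call on a dedicated pool, wait at most N milliseconds, fall back otherwise". The following JDK-only sketch illustrates that idea under stated assumptions: `TimedCall` and `callWithTimeout` are hypothetical names, and the sketch leaves out the metrics and the circuit breaker that Hystrix layers on top.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Hypothetical helper: runs a call on a dedicated pool with a timeout and a fallback. */
final class TimedCall {
    private TimedCall() {}

    static <T> T callWithTimeout(ExecutorService pool, Callable<T> action,
                                 long timeoutMs, Supplier<T> fallback) {
        Future<T> future;
        try {
            future = pool.submit(action);
        } catch (RejectedExecutionException e) {
            return fallback.get(); // pool saturated or shut down: fail fast
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);   // interrupt the worker so it does not stay blocked
            return fallback.get();
        } catch (ExecutionException e) {
            return fallback.get(); // the action itself threw
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback.get();
        }
    }
}
```

Because the caller only waits on the `Future`, a hung dependency costs the caller at most `timeoutMs`, and it can only ever occupy threads of the pool dedicated to it.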
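Hystrix Isolation Strategies
graph TB
subgraph Thread pool isolation
TP1[Service A pool<br/>10 threads] --> ServiceA[Service A]
TP2[Service B pool<br/>20 threads] --> ServiceB[Service B]
TP3[Service C pool<br/>15 threads] --> ServiceC[Service C]
end
subgraph Semaphore isolation
SEM[Semaphore counter] --> Handler[Request handling<br/>runs on the calling thread]
end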
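| Aspect | Thread pool isolation | Semaphore isolation |
| --- | --- | --- |
| Thread overhead | Thread-switching overhead | No extra threads |
| Timeout support | Supported (the separate thread can be interrupted) | No active timeout |
| Resource isolation | Fully isolated | Shares the calling thread |
| Typical use | Network calls | In-memory cache access |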
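Semaphore isolation comes down to a permit counter around the call, executed on the caller's own thread. A minimal sketch using `java.util.concurrent.Semaphore` follows; `SemaphoreBulkhead` is a hypothetical name and this is not Hystrix's internal implementation.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

/** Hypothetical helper: caps concurrent calls to one dependency without extra threads. */
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // limit reached: reject immediately instead of waiting
        }
        try {
            return action.get();   // runs on the calling thread, so it cannot be timed out here
        } catch (RuntimeException e) {
            return fallback.get();
        } finally {
            permits.release();
        }
    }
}
```

The comment on `action.get()` is the "no active timeout" row of the table in practice: once the call has started, nothing in this class can interrupt it, which is why the approach suits fast, local work.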
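Sentinel Architecture
Why Sentinel
Sentinel is a traffic-protection component open-sourced by Alibaba. Compared with Hystrix, which is no longer actively developed, Sentinel offers richer flow-control features and better real-time monitoring.
graph TB
subgraph Sentinel core
Entry[Resource entry] --> Slots[Slot Chain]
subgraph Slot Chain
NS[NodeSelectorSlot<br/>node selection] --> CS[ClusterBuilderSlot<br/>cluster statistics]
CS --> LS[LogSlot<br/>logging]
LS --> SS[StatisticSlot<br/>real-time statistics]
SS --> AU[AuthoritySlot<br/>authority rules]
AU --> SP[SystemSlot<br/>system protection]
SP --> FC[FlowSlot<br/>flow control]
FC --> DG[DegradeSlot<br/>circuit breaking]
end
end
DG --> Resource[Protected resource]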
Flow Control

    import com.alibaba.csp.sentinel.annotation.SentinelResource;
    import com.alibaba.csp.sentinel.slots.block.BlockException;

    @SentinelResource(
        value = "getOrder",
        blockHandler = "getOrderBlockHandler",
        fallback = "getOrderFallback"
    )
    public Order getOrder(String orderId) {
        return orderClient.getOrder(orderId);
    }

    // Called when Sentinel blocks the request (flow control, circuit breaking, system protection)
    public Order getOrderBlockHandler(String orderId, BlockException ex) {
        return Order.limitedOrder(orderId);
    }

    // Called when getOrder itself throws an exception
    public Order getOrderFallback(String orderId, Throwable throwable) {
        return Order.defaultOrder(orderId);
    }

    import java.util.ArrayList;
    import java.util.List;

    import com.alibaba.csp.sentinel.slots.block.RuleConstant;
    import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
    import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

    private void initFlowRules() {
        List<FlowRule> rules = new ArrayList<>();

        // QPS limit: at most 100 requests per second, reached gradually over a 10 s warm-up
        FlowRule qpsRule = new FlowRule();
        qpsRule.setResource("getOrder");
        qpsRule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        qpsRule.setCount(100);
        qpsRule.setControlBehavior(RuleConstant.CONTROL_BEHAVIOR_WARM_UP);
        qpsRule.setWarmUpPeriodSec(10);
        rules.add(qpsRule);

        // Concurrency limit: at most 50 threads inside the resource at once
        FlowRule threadRule = new FlowRule();
        threadRule.setResource("getOrder");
        threadRule.setGrade(RuleConstant.FLOW_GRADE_THREAD);
        threadRule.setCount(50);
        rules.add(threadRule);

        // Related-resource limit: throttle queryOrder once createOrder reaches 50 QPS
        FlowRule relatedRule = new FlowRule();
        relatedRule.setResource("queryOrder");
        relatedRule.setStrategy(RuleConstant.STRATEGY_RELATE);
        relatedRule.setRefResource("createOrder");
        relatedRule.setCount(50);
        rules.add(relatedRule);

        FlowRuleManager.loadRules(rules);
    }
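To make the idea of a QPS threshold concrete, here is a deliberately simple fixed-window counter. It is a teaching sketch only and not how Sentinel computes QPS (Sentinel keeps sliding-window statistics and also offers warm-up and queueing behaviours). `FixedWindowQpsLimiter` is a hypothetical name, and the clock is injected so the behaviour can be checked without sleeping.

```java
import java.util.function.LongSupplier;

/** Hypothetical helper: allows at most maxPerSecond calls in each one-second window. */
final class FixedWindowQpsLimiter {
    private final int maxPerSecond;
    private final LongSupplier clockMillis; // e.g. System::currentTimeMillis
    private long windowStart = Long.MIN_VALUE;
    private int countInWindow = 0;

    FixedWindowQpsLimiter(int maxPerSecond, LongSupplier clockMillis) {
        this.maxPerSecond = maxPerSecond;
        this.clockMillis = clockMillis;
    }

    /** Returns true if the request may pass, false if it should be rejected. */
    synchronized boolean tryPass() {
        long now = clockMillis.getAsLong();
        long currentWindow = now - Math.floorMod(now, 1000L); // start of the current second
        if (currentWindow != windowStart) {
            windowStart = currentWindow; // a new second has begun: reset the counter
            countInWindow = 0;
        }
        if (countInWindow >= maxPerSecond) {
            return false;
        }
        countInWindow++;
        return true;
    }
}
```

A fixed window admits bursts of up to twice the limit across a window boundary, which is one reason production limiters prefer sliding windows or token buckets.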
Degrade Rules

    import java.util.ArrayList;
    import java.util.List;

    import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
    import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;
    import com.alibaba.csp.sentinel.slots.block.degrade.circuitbreaker.CircuitBreakerStrategy;

    private void initDegradeRules() {
        List<DegradeRule> rules = new ArrayList<>();

        // Slow-call ratio: a call slower than 500 ms counts as slow
        DegradeRule slowCallRule = new DegradeRule();
        slowCallRule.setResource("getOrder");
        slowCallRule.setGrade(CircuitBreakerStrategy.SLOW_REQUEST_RATIO.getType());
        slowCallRule.setCount(500);               // max allowed RT in ms
        slowCallRule.setSlowRatioThreshold(0.5);  // trip when the slow ratio exceeds 50%
        slowCallRule.setMinRequestAmount(10);     // ignore windows with fewer than 10 requests
        slowCallRule.setStatIntervalMs(10000);    // 10 s statistics window
        slowCallRule.setTimeWindow(30);           // stay open for 30 s
        rules.add(slowCallRule);

        // Error ratio: trip when more than 50% of calls fail
        DegradeRule errorRatioRule = new DegradeRule();
        errorRatioRule.setResource("createOrder");
        errorRatioRule.setGrade(CircuitBreakerStrategy.ERROR_RATIO.getType());
        errorRatioRule.setCount(0.5);
        errorRatioRule.setMinRequestAmount(10);
        errorRatioRule.setStatIntervalMs(10000);
        errorRatioRule.setTimeWindow(60);         // stay open for 60 s
        rules.add(errorRatioRule);

        // Error count: trip after 10 errors within one minute
        DegradeRule errorCountRule = new DegradeRule();
        errorCountRule.setResource("payOrder");
        errorCountRule.setGrade(CircuitBreakerStrategy.ERROR_COUNT.getType());
        errorCountRule.setCount(10);
        errorCountRule.setStatIntervalMs(60000);
        errorCountRule.setTimeWindow(120);        // stay open for 120 s
        rules.add(errorCountRule);

        DegradeRuleManager.loadRules(rules);
    }
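How the slow-call parameters interact is easier to see as a plain function. The sketch below evaluates one statistics window using the values from the first rule (500 ms, ratio 0.5, minimum 10 requests). It is a simplified model, not Sentinel's code: `SlowCallRatioCheck` is a hypothetical name, and treating the threshold as a strict "greater than" comparison is an assumption.

```java
/** Hypothetical helper: decides whether one statistics window should open the breaker. */
final class SlowCallRatioCheck {
    private SlowCallRatioCheck() {}

    /**
     * @param responseTimesMs    response times observed in the window, in milliseconds
     * @param maxRtMs            calls slower than this count as slow (setCount in the rule)
     * @param slowRatioThreshold fraction of slow calls that trips the breaker
     * @param minRequestAmount   windows with fewer requests never trip
     */
    static boolean shouldOpen(long[] responseTimesMs, long maxRtMs,
                              double slowRatioThreshold, int minRequestAmount) {
        if (responseTimesMs.length < minRequestAmount) {
            return false; // too little traffic to judge
        }
        int slowCalls = 0;
        for (long rt : responseTimesMs) {
            if (rt > maxRtMs) {
                slowCalls++;
            }
        }
        double slowRatio = (double) slowCalls / responseTimesMs.length;
        return slowRatio > slowRatioThreshold; // assumption: strictly greater than
    }
}
```

The `minRequestAmount` guard matters at low traffic: without it, a single slow request in an otherwise idle window would produce a 100% slow ratio and open the breaker.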
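System Protection
System-level protection rules are specific to Sentinel: they throttle inbound traffic adaptively, based on the load of the whole application.

    import java.util.ArrayList;
    import java.util.List;

    import com.alibaba.csp.sentinel.slots.system.SystemRule;
    import com.alibaba.csp.sentinel.slots.system.SystemRuleManager;

    private void initSystemRules() {
        List<SystemRule> rules = new ArrayList<>();

        SystemRule rule = new SystemRule();
        rule.setHighestSystemLoad(3.0);  // system load (load1) threshold
        rule.setHighestCpuUsage(0.8);    // CPU usage threshold, 80%
        rule.setAvgRt(200);              // average RT of inbound traffic, in ms
        rule.setMaxThread(500);          // max concurrent inbound threads
        rule.setQps(5000);               // max inbound QPS
        rules.add(rule);

        SystemRuleManager.loadRules(rules);
    }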
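graph TB
subgraph Adaptive system protection
Load[System load monitoring] --> Decision{Load above threshold?}
Decision -->|Yes| Limit[Adaptive throttling]
Decision -->|No| Pass[Pass through]
Limit --> BBR[BBR algorithm<br/>adaptively computes the allowed QPS]
BBR --> |allowed QPS =<br/>maxSuccessQPS ×<br/>minRT / avgRT| Actual[Effective limit]
end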
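The formula in the diagram can be read through Little's law: the number of in-flight requests is roughly QPS × average RT, and the system's estimated capacity is maxSuccessQPS × minRT. To the best of my reading of the open-source implementation, Sentinel applies the check in its concurrency form (in-flight threads versus maxSuccessQps × minRt / 1000, with RT in milliseconds) and only once the load threshold has been crossed. The sketch below models both forms; `SystemLoadCheck` and its method names are hypothetical, and the details should be treated as assumptions rather than as Sentinel's exact code.

```java
/** Hypothetical model of a BBR-style admission check under high system load. */
final class SystemLoadCheck {
    private SystemLoadCheck() {}

    /** QPS form from the diagram: capacity divided by the current average RT. */
    static double allowedQps(double maxSuccessQps, double minRtMs, double avgRtMs) {
        return maxSuccessQps * minRtMs / avgRtMs;
    }

    /** Concurrency form: admit unless load is high and in-flight threads exceed capacity. */
    static boolean allow(double currentLoad, double highestLoad,
                         int currentThreads, double maxSuccessQps, double minRtMs) {
        if (currentLoad <= highestLoad) {
            return true; // the load rule has not been triggered, nothing to check
        }
        // Estimated number of requests the system can hold in flight at its best observed speed.
        double capacity = maxSuccessQps * minRtMs / 1000.0;
        return !(currentThreads > 1 && currentThreads > capacity);
    }
}
```

For example, with a best observed throughput of 1000 QPS and a minimum RT of 20 ms, the estimated capacity is 20 in-flight requests; if the average RT has degraded to 40 ms, the QPS form yields an allowed rate of 500 QPS.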
Dashboard Configuration

    # application.yml
    spring:
      cloud:
        sentinel:
          transport:
            dashboard: localhost:8080   # address of the Sentinel dashboard
            port: 8719                  # local port the client exposes to the dashboard
          datasource:
            flow:
              nacos:
                server-addr: localhost:8848
                dataId: ${spring.application.name}-flow-rules
                groupId: SENTINEL_GROUP
                data-type: json
                rule-type: flow
            degrade:
              nacos:
                server-addr: localhost:8848
                dataId: ${spring.application.name}-degrade-rules
                groupId: SENTINEL_GROUP
                data-type: json
                rule-type: degrade

    # Start the dashboard; csp.sentinel.* settings are read as JVM system properties, so pass them with -D
    java -Dserver.port=8080 \
         -Dcsp.sentinel.dashboard.server=localhost:8080 \
         -jar sentinel-dashboard-1.8.7.jar
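Hystrix vs Sentinel
| Aspect | Hystrix | Sentinel |
| --- | --- | --- |
| Maintenance status | Maintenance mode (no active development) | Actively maintained |
| Isolation strategy | Thread pool / semaphore | Semaphore (concurrent-thread limiting) |
| Circuit-breaking strategy | Error rate | Slow-call ratio / error ratio / error count |
| Flow-control strategy | None | QPS / thread count / related resource / call chain |
| System protection | None | Supported (CPU / load / RT) |
| Traffic shaping | None | Reject immediately / warm up / queueing |
| Console | Basic | Feature-rich, real-time monitoring |
| Rule persistence | None | Nacos / Apollo / ZooKeeper |
| Extensibility | Moderate | Many SPI extension points |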
graph LR
subgraph Selection advice
New[New project] --> Sentinel1[Sentinel recommended]
Legacy[Existing Hystrix project] --> Migrate{Migrate?}
Migrate -->|Yes| Sentinel2[Migrate to Sentinel]
Migrate -->|No| Resilience4j[Or consider Resilience4j]
end
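Designing Fallback Strategies
graph TB
Request[Request] --> CB{Circuit breaker state?}
CB -->|Closed| Execute[Execute normally]
CB -->|Open| Fallback{Fallback strategy}
Fallback --> Cache[Return cached data]
Fallback --> Default[Return a default value]
Fallback --> Stub[Return stub data]
Fallback --> Retry[Switch to a backup service]
Fallback --> Queue[Queue the request]
Execute -->|Success| Response[Return response]
Execute -->|Failure| Fallback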

    import java.util.concurrent.TimeUnit;

    import com.alibaba.csp.sentinel.annotation.SentinelResource;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.data.redis.core.RedisTemplate;
    import org.springframework.stereotype.Service;

    @Service
    public class ProductService {
        @Autowired
        private ProductClient productClient;
        @Autowired
        private RedisTemplate<String, Product> redis;

        @SentinelResource(value = "getProduct", fallback = "getProductFallback")
        public Product getProduct(String productId) {
            Product product = productClient.getById(productId);
            // Refresh the cache on every successful call so the fallback has recent data
            redis.opsForValue().set("product:" + productId, product, 1, TimeUnit.HOURS);
            return product;
        }

        // Fallback: prefer cached data, then a skeleton object
        public Product getProductFallback(String productId, Throwable t) {
            Product cached = redis.opsForValue().get("product:" + productId);
            if (cached != null) {
                return cached;
            }
            return Product.skeleton(productId);
        }
    }
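The "cache first, then a skeleton object" order in the fallback above generalises to an ordered chain of fallback sources. The sketch below is one possible shape, with hypothetical names (`FallbackChain`, `firstAvailable`). It also guards against a fallback source that itself fails, for example when the cache is down at the same time as the primary service.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical helper: tries fallback sources in order and never throws. */
final class FallbackChain {
    private FallbackChain() {}

    /**
     * @param lastResort value returned when every source is empty or fails
     * @param sources    fallback sources, most preferred first
     */
    @SafeVarargs
    static <T> T firstAvailable(T lastResort, Supplier<Optional<T>>... sources) {
        for (Supplier<Optional<T>> source : sources) {
            try {
                Optional<T> value = source.get();
                if (value.isPresent()) {
                    return value.get();
                }
            } catch (RuntimeException e) {
                // A broken fallback source must not fail the request; try the next one.
            }
        }
        return lastResort;
    }
}
```

In the product example, the first source would wrap the Redis lookup in `Optional.ofNullable(...)` and the last resort would be the skeleton product.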
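Summary
The circuit breaker pattern is central to resilience design in distributed systems. Hystrix set the de facto standard for circuit breakers in the Java ecosystem, but it is no longer actively developed. With richer flow control, adaptive system protection, and an active community, Sentinel has become a preferred choice. In practice, rate limiting, circuit breaking, and degradation need to work together as one strategy, and every external dependency should have a sensible fallback. Monitoring and alerting matter just as much, so that failures are detected and handled promptly.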