Chapter 12: Exception Handling and Recovery
This chapter blends community insights with current practice, covering production-grade fault-tolerance patterns such as retries, fallbacks, circuit breakers, and graceful degradation.
Exception Handling Pattern Overview
Robust agentic systems must be able to handle errors and unexpected situations gracefully.
The exception handling pattern involves detecting errors, recovering from failures, and maintaining system stability.
Types of Exceptions
Tool Failures: External tools or APIs may fail or return unexpected results.
LLM Errors: The language model may produce invalid outputs or fail to generate responses.
Input Validation: User input may be invalid or unexpected.
Resource Limits: The system may run out of memory, API quota, or other resources.
Recovery Strategies
Retry: Attempt the failed operation again, potentially with different parameters.
Fallback: Use an alternative approach when the primary method fails.
Escalation: Notify a human when the system cannot handle an error.
Graceful Degradation: Continue operating in a reduced capacity when full operation is not possible.
Hands-On Code Examples
1. Error Handling
The following code implements an error-handling mechanism for an agentic system:
```javascript
// Error Types
class AgentError extends Error {
  constructor(message, type, recoverable = false) {
    super(message);
    this.name = 'AgentError';
    this.type = type;
    this.recoverable = recoverable;
    this.timestamp = Date.now();
  }
}

class ToolError extends AgentError {
  constructor(message, toolName, originalError = null) {
    super(message, 'TOOL_ERROR', true);
    this.name = 'ToolError';
    this.toolName = toolName;
    this.originalError = originalError;
  }
}

class LLMError extends AgentError {
  constructor(message, model, originalError = null) {
    super(message, 'LLM_ERROR', true);
    this.name = 'LLMError';
    this.model = model;
    this.originalError = originalError;
  }
}

class ValidationError extends AgentError {
  constructor(message, field, value) {
    super(message, 'VALIDATION_ERROR', false);
    this.name = 'ValidationError';
    this.field = field;
    this.value = value;
  }
}

class ResourceLimitError extends AgentError {
  constructor(message, resourceType, limit) {
    super(message, 'RESOURCE_LIMIT', true);
    this.name = 'ResourceLimitError';
    this.resourceType = resourceType;
    this.limit = limit;
  }
}

// Error Handler
class ErrorHandler {
  constructor() {
    this.errorLog = [];
    this.errorHandlers = new Map();
  }

  // Register a handler for a specific error type
  registerHandler(errorType, handler) {
    this.errorHandlers.set(errorType, handler);
  }

  // Handle an error: log it, dispatch to a registered handler, or apply default handling
  async handle(error) {
    console.error(`[Error] ${error.name}: ${error.message}`);

    // Log error
    this.errorLog.push({
      error: error.message,
      type: error.type,
      timestamp: error.timestamp,
      recoverable: error.recoverable,
    });

    // Find a registered handler
    const handler = this.errorHandlers.get(error.type);
    if (handler) {
      try {
        return await handler(error);
      } catch (handlerError) {
        console.error(`[Error] Handler failed: ${handlerError.message}`);
      }
    }

    // Default handling
    if (error.recoverable) {
      return { action: 'retry', reason: 'recoverable error' };
    }
    return { action: 'escalate', reason: 'unrecoverable error' };
  }

  // Get error statistics
  getStats() {
    const stats = { total: this.errorLog.length, byType: {}, recoverable: 0, unrecoverable: 0 };
    this.errorLog.forEach((e) => {
      stats.byType[e.type] = (stats.byType[e.type] || 0) + 1;
      if (e.recoverable) stats.recoverable++;
      else stats.unrecoverable++;
    });
    return stats;
  }
}

// Usage
const errorHandler = new ErrorHandler();

// Register custom handlers
errorHandler.registerHandler('TOOL_ERROR', async (error) => {
  console.log(`[Handler] Attempting to recover from tool error in ${error.toolName}`);
  return { action: 'fallback', tool: error.toolName };
});

errorHandler.registerHandler('VALIDATION_ERROR', async (error) => {
  console.log(`[Handler] Validation failed for field: ${error.field}`);
  return { action: 'reject', reason: 'invalid input' };
});

// Test error handling
async function demoErrors() {
  const toolError = new ToolError('API timeout', 'weather_api', new Error('ETIMEDOUT'));
  const result1 = await errorHandler.handle(toolError);
  console.log('Tool error result:', result1);

  const validationError = new ValidationError('Invalid email', 'email', 'not-an-email');
  const result2 = await errorHandler.handle(validationError);
  console.log('Validation error result:', result2);

  console.log('\nError Statistics:');
  console.log(errorHandler.getStats());
}

demoErrors();
```
2. Retry Mechanism
The following code implements a retry mechanism with exponential backoff:
```javascript
// Note: ToolError is defined in the error-handling example above.

// Retry Strategy
class RetryStrategy {
  constructor(options = {}) {
    this.maxRetries = options.maxRetries || 3;
    this.initialDelay = options.initialDelay || 1000; // ms
    this.maxDelay = options.maxDelay || 30000; // ms
    this.backoffMultiplier = options.backoffMultiplier || 2;
    this.retryableErrors = options.retryableErrors || ['TOOL_ERROR', 'LLM_ERROR', 'NETWORK_ERROR'];
  }

  // Calculate delay with exponential backoff
  calculateDelay(attempt) {
    const delay = Math.min(
      this.initialDelay * Math.pow(this.backoffMultiplier, attempt),
      this.maxDelay,
    );
    // Add jitter
    return delay + Math.random() * 1000;
  }

  // Decide whether to retry
  shouldRetry(error, attempt) {
    if (attempt >= this.maxRetries) {
      return { shouldRetry: false, reason: 'max retries exceeded' };
    }
    if (!this.retryableErrors.includes(error.type)) {
      return { shouldRetry: false, reason: 'non-retryable error' };
    }
    return { shouldRetry: true, delay: this.calculateDelay(attempt) };
  }
}

// Retryable function wrapper
async function withRetry(fn, strategy = new RetryStrategy()) {
  let attempt = 0;
  while (true) {
    try {
      return await fn();
    } catch (error) {
      const { shouldRetry, delay, reason } = strategy.shouldRetry(error, attempt);
      console.log(`Attempt ${attempt + 1} failed: ${error.message}`);
      console.log(`Should retry: ${shouldRetry}, Reason: ${reason}`);
      if (!shouldRetry) {
        throw error;
      }
      attempt++;
      console.log(`Waiting ${Math.round(delay)}ms before retry...`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Retry Executor with Circuit Breaker
class RetryExecutor {
  constructor() {
    this.strategy = new RetryStrategy({ maxRetries: 3, initialDelay: 500 });
    this.circuitState = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.failureCount = 0;
    this.successCount = 0;
    this.circuitThreshold = 5;
    this.circuitTimeout = 30000;
  }

  async execute(operation, operationName = 'operation') {
    // Check circuit breaker
    if (this.circuitState === 'OPEN') {
      throw new Error(`Circuit breaker OPEN: ${operationName} temporarily unavailable`);
    }
    try {
      const result = await withRetry(operation, this.strategy);
      // Success - reset circuit
      this.successCount++;
      this.failureCount = 0;
      if (this.circuitState === 'HALF_OPEN') {
        this.circuitState = 'CLOSED';
        console.log('[Circuit Breaker] Closed after successful call');
      }
      return result;
    } catch (error) {
      this.failureCount++;
      console.log(`[Circuit Breaker] Failure count: ${this.failureCount}/${this.circuitThreshold}`);
      // Open circuit after repeated failures
      if (this.failureCount >= this.circuitThreshold) {
        this.circuitState = 'OPEN';
        console.log('[Circuit Breaker] Opened due to repeated failures');
        // Auto-reset to half-open after the cool-down period
        setTimeout(() => {
          this.circuitState = 'HALF_OPEN';
          console.log('[Circuit Breaker] Half-open for testing');
        }, this.circuitTimeout);
      }
      throw error;
    }
  }

  getStatus() {
    return {
      circuitState: this.circuitState,
      failureCount: this.failureCount,
      successCount: this.successCount,
    };
  }
}

// Usage
async function demoRetry() {
  const executor = new RetryExecutor();

  // Simulate a flaky operation that succeeds on the third call
  let callCount = 0;
  const flakyOperation = async () => {
    callCount++;
    console.log(`\n--- Call attempt ${callCount} ---`);
    if (callCount < 3) {
      throw new ToolError('Service unavailable', 'test_service');
    }
    return 'Operation succeeded!';
  };

  // Execute with retry
  try {
    const result = await executor.execute(flakyOperation, 'test_operation');
    console.log('\nResult:', result);
  } catch (error) {
    console.log('\nFinal error:', error.message);
  }

  console.log('\nCircuit Breaker Status:');
  console.log(executor.getStatus());
}

demoRetry();
```
3. Fallback Strategy
The following code implements a multi-level fallback strategy:
```javascript
import { ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

// Different LLM instances for fallback
const primaryLLM = new ChatOpenAI({ model: 'gpt-4', temperature: 0.7, maxRetries: 2 });
const fallbackLLM1 = new ChatOpenAI({ model: 'gpt-4o-mini', temperature: 0.7, maxRetries: 2 });
const fallbackLLM2 = new ChatOpenAI({ model: 'gpt-3.5-turbo', temperature: 0.7, maxRetries: 2 });

// Tool fallback
class ToolFallback {
  constructor() {
    this.primaryTools = new Map();
    this.fallbackTools = new Map();
  }

  register(toolName, primary, fallback) {
    this.primaryTools.set(toolName, primary);
    this.fallbackTools.set(toolName, fallback);
  }

  async execute(toolName, ...args) {
    const primary = this.primaryTools.get(toolName);
    const fallback = this.fallbackTools.get(toolName);
    if (!primary && !fallback) {
      throw new Error(`Tool '${toolName}' not registered`);
    }
    if (primary) {
      try {
        console.log(`[Fallback] Using primary tool: ${toolName}`);
        return await primary(...args);
      } catch (error) {
        console.log(`[Fallback] Primary tool failed: ${error.message}`);
        if (!fallback) throw error;
      }
    }
    console.log(`[Fallback] Switching to fallback tool: ${toolName}`);
    return fallback(...args);
  }
}

// LLM Fallback Chain
class LLMFallbackChain {
  constructor() {
    this.llms = [primaryLLM, fallbackLLM1, fallbackLLM2];
    this.currentIndex = 0;
  }

  async generate(prompt) {
    const errors = [];
    for (let i = this.currentIndex; i < this.llms.length; i++) {
      const llm = this.llms[i];
      console.log(`[Fallback] Trying LLM ${i + 1}: ${llm.modelName}`);
      try {
        const response = await llm.invoke(prompt);
        this.currentIndex = i; // Remember the last working LLM
        return response;
      } catch (error) {
        console.log(`[Fallback] LLM ${i + 1} failed: ${error.message}`);
        errors.push({ llm: llm.modelName, error: error.message });
      }
    }
    throw new Error('All LLM fallbacks failed');
  }

  reset() {
    this.currentIndex = 0;
  }
}

// Response Fallback
class ResponseFallback {
  constructor() {
    this.strategies = [];
  }

  addStrategy(priority, generator) {
    this.strategies.push({ priority, generator });
    this.strategies.sort((a, b) => b.priority - a.priority);
  }

  async generate(context) {
    for (const strategy of this.strategies) {
      try {
        console.log(`[Response Fallback] Trying strategy with priority ${strategy.priority}`);
        const result = await strategy.generator(context);
        if (result && result.trim().length > 0) {
          return result;
        }
      } catch (error) {
        console.log(`[Response Fallback] Strategy failed: ${error.message}`);
      }
    }
    return 'I apologize, but I am unable to process your request at this time.';
  }
}

// Complete Fallback System
class FallbackSystem {
  constructor() {
    this.toolFallback = new ToolFallback();
    this.llmChain = new LLMFallbackChain();
    this.responseFallback = new ResponseFallback();

    // Highest priority: detailed response with the primary model
    this.responseFallback.addStrategy(10, async (ctx) => {
      const prompt = ChatPromptTemplate.fromTemplate(`Provide a detailed response to: {query}`);
      const chain = prompt.pipe(primaryLLM).pipe(new StringOutputParser());
      return chain.invoke({ query: ctx.query });
    });

    // Medium priority: simplified prompt on a cheaper model
    this.responseFallback.addStrategy(5, async (ctx) => {
      const prompt = ChatPromptTemplate.fromTemplate(`Briefly answer: {query}`);
      const chain = prompt.pipe(fallbackLLM1).pipe(new StringOutputParser());
      return chain.invoke({ query: ctx.query });
    });

    // Lowest priority: static fallback
    this.responseFallback.addStrategy(1, async (ctx) => {
      return `I understand you're asking about "${ctx.query}". Please try rephrasing your question.`;
    });
  }

  async executeWithFallback(query) {
    console.log(`\n[System] Processing query: ${query}`);
    try {
      // Try the main pipeline first
      const prompt = ChatPromptTemplate.fromTemplate(`Answer: {query}`);
      const chain = prompt.pipe(primaryLLM).pipe(new StringOutputParser());
      return await chain.invoke({ query });
    } catch (error) {
      console.log(`[System] Main pipeline failed: ${error.message}`);
      try {
        // Then the LLM fallback chain
        return await this.llmChain.generate(query);
      } catch (chainError) {
        console.log(`[System] LLM chain failed: ${chainError.message}`);
        // Finally, the response-level fallback
        return await this.responseFallback.generate({ query });
      }
    }
  }
}

// Usage
async function demoFallback() {
  const system = new FallbackSystem();

  // Register a tool with a fallback implementation
  system.toolFallback.register(
    'get_weather',
    async (city) => {
      // Primary - might fail
      throw new Error('Weather API unavailable');
    },
    async (city) => {
      // Fallback - return cached/static data
      return `Weather for ${city}: Sunny, 72°F (cached data)`;
    },
  );

  // Execute query
  const result = await system.executeWithFallback('What is the capital of France?');
  console.log('\nFinal Result:', result);

  // Test tool fallback
  console.log('\n--- Tool Fallback Test ---');
  const toolResult = await system.toolFallback.execute('get_weather', 'New York');
  console.log('Tool result:', toolResult);
}

demoFallback();
```
4. Graceful Degradation
The following code implements a graceful degradation mechanism:
```javascript
import { ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const llm = new ChatOpenAI({ temperature: 0.7 });

// Service Health Status
const HealthStatus = {
  HEALTHY: 'healthy',
  DEGRADED: 'degraded',
  UNHEALTHY: 'unhealthy',
};

// Service with Health Check
class Service {
  constructor(name, healthCheckFn, operationFn) {
    this.name = name;
    this.healthCheckFn = healthCheckFn;
    this.operationFn = operationFn;
    this.status = HealthStatus.HEALTHY;
    this.lastCheck = null;
    this.errorCount = 0;
  }

  async checkHealth() {
    try {
      await this.healthCheckFn();
      this.status = HealthStatus.HEALTHY;
      this.errorCount = 0;
    } catch (error) {
      this.errorCount++;
      this.status = this.errorCount > 5 ? HealthStatus.UNHEALTHY : HealthStatus.DEGRADED;
    }
    this.lastCheck = Date.now();
    return this.status;
  }

  async execute(...args) {
    if (this.status === HealthStatus.UNHEALTHY) {
      throw new Error(`Service ${this.name} is unhealthy`);
    }
    return this.operationFn(...args);
  }
}

// Graceful Degradation Manager
class GracefulDegradationManager {
  constructor() {
    this.services = new Map();
    this.degradationLevel = 0;
    this.maxDegradationLevel = 3;
  }

  registerService(name, primaryFn, degradedFn, minimalFn) {
    this.services.set(name, {
      primary: primaryFn,
      degraded: degradedFn || primaryFn,
      minimal: minimalFn || (() => 'Service temporarily unavailable'),
    });
  }

  setDegradationLevel(level) {
    this.degradationLevel = Math.min(level, this.maxDegradationLevel);
    console.log(`[Degradation] Level set to ${level}`);
  }

  async execute(serviceName, ...args) {
    const service = this.services.get(serviceName);
    if (!service) {
      throw new Error(`Service ${serviceName} not registered`);
    }

    // Select implementation based on the current degradation level
    let fn;
    let mode;
    if (this.degradationLevel === 0) {
      fn = service.primary;
      mode = 'primary';
    } else if (this.degradationLevel === 1) {
      fn = service.degraded;
      mode = 'degraded';
    } else {
      fn = service.minimal;
      mode = 'minimal';
    }

    console.log(`[Degradation] Executing ${serviceName} in ${mode} mode`);
    try {
      return await fn(...args);
    } catch (error) {
      console.log(`[Degradation] ${mode} mode failed: ${error.message}`);
      // Fall back to minimal mode before giving up
      if (mode !== 'minimal') {
        try {
          return await service.minimal(...args);
        } catch {
          throw error;
        }
      }
      throw error;
    }
  }

  getStatus() {
    const status = {};
    for (const name of this.services.keys()) {
      status[name] = {
        degradationLevel: this.degradationLevel,
        mode:
          this.degradationLevel === 0
            ? 'primary'
            : this.degradationLevel === 1
              ? 'degraded'
              : 'minimal',
      };
    }
    return status;
  }
}

// Feature Flags with Degradation
class FeatureFlags {
  constructor() {
    this.flags = new Map();
  }

  setFlag(name, enabled, degradedEnabled = true) {
    this.flags.set(name, { enabled, degradedEnabled });
  }

  isEnabled(name) {
    return this.flags.get(name)?.enabled || false;
  }

  isDegradedEnabled(name) {
    return this.flags.get(name)?.degradedEnabled || false;
  }

  // Disable a feature gracefully
  disableFeature(name) {
    const flag = this.flags.get(name);
    if (flag) {
      flag.enabled = false;
      console.log(`[Feature] Disabled: ${name}`);
    }
  }

  // Enable degraded mode for a feature
  enableDegradedMode(name) {
    const flag = this.flags.get(name);
    if (flag) {
      flag.degradedEnabled = true;
      console.log(`[Feature] Degraded mode enabled: ${name}`);
    }
  }
}

// Complete Graceful Degradation System
class DegradationSystem {
  constructor() {
    this.degradationManager = new GracefulDegradationManager();
    this.featureFlags = new FeatureFlags();

    // Register the analysis service with three tiers
    this.degradationManager.registerService(
      'analysis',
      // Primary - full analysis
      async (data) => {
        const prompt = ChatPromptTemplate.fromTemplate(`Provide detailed analysis: {data}`);
        const chain = prompt.pipe(llm).pipe(new StringOutputParser());
        return await chain.invoke({ data });
      },
      // Degraded - simpler analysis
      async (data) => {
        const prompt = ChatPromptTemplate.fromTemplate(`Briefly analyze: {data}`);
        const chain = prompt.pipe(llm).pipe(new StringOutputParser());
        return await chain.invoke({ data });
      },
      // Minimal - basic response
      async (data) => `Analysis complete for: ${data.substring(0, 50)}...`,
    );

    // Setup feature flags (the 'analysis' flag gates the service above)
    this.featureFlags.setFlag('analysis', true, true);
    this.featureFlags.setFlag('advanced_analysis', true, true);
    this.featureFlags.setFlag('detailed_explanations', true, false);
    this.featureFlags.setFlag('real_time_data', true, false);
  }

  async analyze(query) {
    // Check whether the feature is available at all
    if (!this.featureFlags.isEnabled('analysis')) {
      return 'Analysis service is currently unavailable.';
    }

    // Force degraded mode when the flag disallows full operation
    if (!this.featureFlags.isDegradedEnabled('analysis')) {
      this.degradationManager.setDegradationLevel(1);
    }

    try {
      return await this.degradationManager.execute('analysis', query);
    } catch (error) {
      console.log(`[System] Analysis failed: ${error.message}`);
      // Increase degradation level and retry
      this.degradationManager.setDegradationLevel(this.degradationManager.degradationLevel + 1);
      return await this.degradationManager.execute('analysis', query);
    }
  }

  getSystemStatus() {
    return {
      degradation: this.degradationManager.getStatus(),
      features: Object.fromEntries(
        Array.from(this.featureFlags.flags.entries()).map(([k, v]) => [
          k,
          { enabled: v.enabled, degradedEnabled: v.degradedEnabled },
        ]),
      ),
    };
  }
}

// Usage
async function demoDegradation() {
  const system = new DegradationSystem();

  // Normal operation
  console.log('--- Normal Operation ---');
  const result1 = await system.analyze('Analyze the tech industry trends');
  console.log('Result:', result1.substring(0, 100), '...\n');

  // Simulate degradation
  console.log('--- Degraded Operation ---');
  system.featureFlags.disableFeature('detailed_explanations');
  system.degradationManager.setDegradationLevel(1);
  const result2 = await system.analyze('Analyze healthcare sector');
  console.log('Result:', result2.substring(0, 100), '...\n');

  // Show system status
  console.log('--- System Status ---');
  console.log(system.getSystemStatus());
}

demoDegradation();
```
5. Complete Exception Handling System
The following code implements a complete exception handling and recovery system:
```javascript
import { ChatOpenAI } from '@langchain/openai';
import { ChatPromptTemplate } from '@langchain/core/prompts';
import { StringOutputParser } from '@langchain/core/output_parsers';

const llm = new ChatOpenAI({ temperature: 0.7 });

// Note: ErrorHandler, RetryExecutor, FallbackSystem, GracefulDegradationManager,
// and the error classes (ToolError, etc.) are defined in the previous examples.

// Complete Exception Handling System
class AgentExceptionSystem {
  constructor() {
    this.errorHandler = new ErrorHandler();
    this.retryExecutor = new RetryExecutor();
    this.fallbackSystem = new FallbackSystem();
    this.degradationManager = new GracefulDegradationManager();
    this.setupErrorHandlers();
    this.setupDegradationLevels();
  }

  setupErrorHandlers() {
    // Tool errors: switch to a fallback tool
    this.errorHandler.registerHandler('TOOL_ERROR', async (error) => {
      console.log(`[System] Handling tool error: ${error.toolName}`);
      return { action: 'fallback', useFallback: true };
    });

    // LLM errors: retry
    this.errorHandler.registerHandler('LLM_ERROR', async (error) => {
      console.log(`[System] Handling LLM error: ${error.model}`);
      return { action: 'retry', maxRetries: 2 };
    });

    // Validation errors: reject the input
    this.errorHandler.registerHandler('VALIDATION_ERROR', async (error) => {
      console.log(`[System] Handling validation error: ${error.field}`);
      return { action: 'reject', message: `Invalid ${error.field}` };
    });

    // Resource errors: degrade the service
    this.errorHandler.registerHandler('RESOURCE_LIMIT', async (error) => {
      console.log(`[System] Handling resource limit: ${error.resourceType}`);
      return { action: 'degrade', level: 1 };
    });
  }

  setupDegradationLevels() {
    // Chat service: full functionality, no degraded tier, minimal static response
    this.degradationManager.registerService(
      'chat',
      async (prompt) => {
        const chain = ChatPromptTemplate.fromTemplate(prompt)
          .pipe(llm)
          .pipe(new StringOutputParser());
        return chain.invoke({});
      },
      null,
      () => 'Service is operating in reduced mode.',
    );

    // Analysis service: full, degraded, and minimal tiers
    this.degradationManager.registerService(
      'analysis',
      async (data) => {
        const chain = ChatPromptTemplate.fromTemplate(`Analyze: {data}`)
          .pipe(llm)
          .pipe(new StringOutputParser());
        return chain.invoke({ data });
      },
      async (data) => `Basic analysis: ${data.substring(0, 100)}`,
      () => 'Analysis not available.',
    );
  }

  async executeOperation(operation, options = {}) {
    const {
      useRetry = true,
      useFallback = true,
      useDegradation = true,
      operationName = 'operation',
    } = options;

    try {
      if (useRetry) {
        return await this.retryExecutor.execute(() => operation(), operationName);
      }
      return await operation();
    } catch (error) {
      console.log(`[System] Operation failed: ${error.message}`);

      // Ask the error handler for a recovery action
      const handling = await this.errorHandler.handle(error);

      // Execute the recovery action
      switch (handling.action) {
        case 'retry':
          if (useRetry) {
            console.log('[System] Retrying operation...');
            return await this.executeOperation(operation, {
              ...options,
              useRetry: false, // Prevent infinite retry
            });
          }
          break;
        case 'fallback':
          if (useFallback && handling.useFallback) {
            console.log('[System] Using fallback...');
            // Fallback logic would be executed here
          }
          break;
        case 'degrade':
          if (useDegradation) {
            console.log('[System] Degrading service...');
            this.degradationManager.setDegradationLevel(handling.level || 1);
          }
          break;
        case 'escalate':
          console.log('[System] Escalating to human...');
          return { error: error.message, escalated: true };
        case 'reject':
          return { error: handling.message, rejected: true };
      }
      throw error;
    }
  }

  getSystemHealth() {
    return {
      errorStats: this.errorHandler.getStats(),
      circuitBreaker: this.retryExecutor.getStatus(),
      degradation: this.degradationManager.getStatus(),
    };
  }
}

// Usage
async function demoCompleteSystem() {
  const system = new AgentExceptionSystem();

  console.log('=== Scenario 1: Normal Operation ===');
  try {
    const result = await system.executeOperation(
      async () => {
        const chain = ChatPromptTemplate.fromTemplate('Say hello')
          .pipe(llm)
          .pipe(new StringOutputParser());
        return chain.invoke({});
      },
      { operationName: 'greeting' },
    );
    console.log('Result:', result);
  } catch (e) {
    console.log('Error:', e.message);
  }

  console.log('\n=== Scenario 2: Simulated Failure ===');
  try {
    await system.executeOperation(
      async () => {
        throw new ToolError('API timeout', 'external_api');
      },
      { operationName: 'external_call' },
    );
  } catch (e) {
    console.log('Final Error:', e.message);
  }

  console.log('\n=== System Health ===');
  console.log(system.getSystemHealth());
}

demoCompleteSystem();
```
Practical Applications & Use Cases
Community Discussion and Practice Notes
Exception handling and recovery emerged as a core challenge in agentic AI production deployments over 2025-2026, and community discussion has centered on retry strategies, circuit breakers, and graceful degradation.
Why AI Agents Need Different Error Handling
The community consensus: traditional microservice error-handling strategies cannot be copied wholesale into agent systems. Agents Arcade points out that LLMs are probabilistic, stateful, and sensitive to input phrasing. When an agent fails, it loses conversation history, learned preferences, and accumulated expertise, none of which a simple restart can restore.
The Overuse of Retries
Production guides published by Portkey and Maxim in early 2026 warn that retries are the most overused reliability mechanism in agentic systems. Retrying a stateless model call is usually safe, but retrying a tool call that writes to a database, sends an email, or triggers a downstream workflow is often "a bug disguised as resilience."
SparkCo's 2025 best practices note that agents need retry boundaries aligned with intent rather than with individual steps, and recommend exponential backoff with jitter combined with adaptive error handling.
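To make that distinction concrete, here is a minimal sketch of a retry helper that applies full-jitter exponential backoff and refuses to retry calls flagged as non-idempotent. The function and option names (`retryIdempotent`, `idempotent`, `baseDelay`) are illustrative assumptions, not from any of the cited guides.

```javascript
// Sketch: retry only idempotent operations, with full-jitter exponential backoff.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// "Full jitter": pick a random delay in [0, min(cap, base * 2^attempt)).
function jitteredDelay(attempt, base = 500, cap = 30000) {
  return Math.random() * Math.min(cap, base * 2 ** attempt);
}

async function retryIdempotent(fn, { idempotent = true, maxRetries = 3, baseDelay = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Side-effecting calls (DB writes, emails, downstream triggers) are not retried blindly.
      if (!idempotent) throw new Error(`Not retrying non-idempotent call: ${error.message}`);
      if (attempt >= maxRetries) throw error;
      await sleep(jitteredDelay(attempt, baseDelay));
    }
  }
}
```

The `idempotent` flag pushes the retry decision to the call site, which is where intent is known: a weather lookup can set it to true, while a payment tool must leave it false and rely on a different recovery path.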
Circuit Breakers: Preventing Cascading Failures
A 2026 article from DasRoot stresses that without a circuit breaker, a single agent can take down a critical business tool through runaway retries. A circuit breaker monitors the failure rate of calls to an external service and automatically "opens" the circuit once a threshold is exceeded, blocking requests for a cool-down period.
On the tooling side, Resilience4j 2.2.0 (Java), Polly (C#), and PyBreaker (Python) integrate well with LLM gateway architectures.
Rollback vs. Compensating Actions
A key community insight: in a distributed agent system a true rollback is rarely possible; you can only execute compensating actions. Galileo recommends the Saga orchestration pattern, which uses compensating actions to automatically unwind failed multi-step workflows.
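A minimal sketch of what saga-style compensation can look like in an agent workflow follows. The shape of the `steps` array and the `runSaga` name are illustrative assumptions, not Galileo's API.

```javascript
// Sketch: saga-style orchestration for a multi-step agent workflow.
// Each step pairs a forward action with a compensating action; on failure,
// completed steps are compensated in reverse order.
async function runSaga(steps) {
  const completed = [];
  try {
    for (const step of steps) {
      const result = await step.action();
      completed.push({ step, result });
    }
    return { status: 'committed', results: completed.map((c) => c.result) };
  } catch (error) {
    // Distributed side effects cannot be rolled back, so compensate instead.
    for (const { step, result } of completed.reverse()) {
      try {
        await step.compensate(result);
      } catch (compError) {
        console.error(`Compensation for '${step.name}' failed: ${compError.message}`);
      }
    }
    return { status: 'compensated', error: error.message };
  }
}
```

If a later step such as a payment fails, earlier steps like a reservation are undone by their compensating actions (cancel, refund), which is the closest a distributed workflow can get to a rollback.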
The Ambiguity of Timeouts
In an agentic system a timeout is not a failure; it is uncertainty. The tool may still be running, and the message may still be in flight. Treating a timeout as a hard failure and retrying immediately is a classic way to create duplicate side effects. Mature agents treat a timeout as an ambiguous outcome and resolve it by querying state rather than guessing.
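The "query state, don't guess" idea can be sketched as follows. Here `checkStatus` stands in for whatever status endpoint or tool the operation exposes; it, and the `executeOnce` name, are assumptions for illustration.

```javascript
// Sketch: treat a timeout as an ambiguous outcome, not a failure.
// Instead of blindly retrying (risking duplicate side effects), query the
// operation's status by its id before deciding what to do.
function withTimeout(promise, ms) {
  return Promise.race([
    promise.then((value) => ({ outcome: 'completed', value })),
    new Promise((resolve) => setTimeout(() => resolve({ outcome: 'ambiguous' }), ms)),
  ]);
}

async function executeOnce(operationId, start, checkStatus, timeoutMs = 5000) {
  const result = await withTimeout(start(operationId), timeoutMs);
  if (result.outcome === 'completed') return result.value;
  // Ambiguous: the operation may still have succeeded. Ask, don't guess.
  const status = await checkStatus(operationId);
  if (status.state === 'succeeded') return status.value; // finished after the deadline
  if (status.state === 'running') return { pending: true, operationId }; // poll again later
  // Confirmed failure: surface an error so a retry policy can take over safely.
  throw new Error(`Operation ${operationId} confirmed failed; safe to retry`);
}
```

Only a confirmed failure is handed back to the retry layer; a confirmed success or an in-flight operation never triggers a duplicate call.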
Amazon's Production Experience at Scale
In February 2026, AWS shared lessons from building agentic systems at Amazon: evaluation frameworks must measure an agent's ability to recognize a range of failure scenarios, including poor planning, invalid tool calls, malformed parameters, authentication failures, and memory-retrieval errors. Production-grade agents must demonstrate consistent error-recovery behavior.
References
Production Practice Guides
- Portkey - Retries, Fallbacks, and Circuit Breakers in LLM Apps (Jan 2026)
- Maxim - Retries, Fallbacks, and Circuit Breakers: A Production Guide (Feb 2026)
- AWS - Evaluating AI Agents: Lessons from Amazon (Feb 2026)
- DasRoot - Building Resilient Systems: Circuit Breakers and Retry Patterns (Feb 2026)
Frameworks and Patterns
- Galileo - Multi-Agent AI Failure Recovery That Actually Works (Jul 2025)
- GoCodeo - Error Recovery and Fallback Strategies in AI Agent Development
- Datagrid - 5 Steps to Build Exception Handling for AI Agent Failures (Dec 2025)
- Agents Arcade - Error Handling in Agentic Systems (Jan 2026)