Branch: feature/circuit-breaker (merged to main)
Commit: 241481b
Date: 2026-02-26
Library: Resilience4j 2.2.0 via Spring Cloud Circuit Breaker 3.2.0
Problem
Auth-service calls member-service via ServiceClient (WebClient) with:
- No timeout — a slow member-service hangs auth-service indefinitely
- No circuit breaker — repeated failures keep hitting a dead service
- Cascading failure risk: member-service down → auth-service down → gateway backs up → everything dead
Call Chain Before
Client → Gateway → Auth Service --REST (no timeout, no protection)--> Member Service → PostgreSQL
Solution
Added Resilience4j circuit breaker to ServiceClient in common-api module, protecting ALL inter-service calls automatically.
How It Works
The circuit breaker has 3 states:
| State | Behavior |
|---|---|
| CLOSED (normal) | All requests pass through. Failures are counted. |
| OPEN (tripped) | All requests fail fast with 503. No calls to member-service. |
| HALF-OPEN (testing) | 3 test calls allowed. If their failure rate stays below the threshold → CLOSED. Otherwise → OPEN. |
Configuration
| Parameter | Value | Meaning |
|---|---|---|
| sliding-window-size | 10 | Evaluate the last 10 calls |
| minimum-number-of-calls | 5 | Need at least 5 calls before evaluating |
| failure-rate-threshold | 50% | Open circuit if ≥50% of calls fail |
| wait-duration-in-open-state | 30s | Stay open for 30s before trying again |
| permitted-number-of-calls-in-half-open-state | 3 | Allow 3 test calls in half-open |
| timeout-duration | 5s | Fail if member-service doesn’t respond in 5s |
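To make the interaction of these parameters concrete, here is a minimal plain-Java sketch of a count-based breaker. It is an illustration under simplifying assumptions, not Resilience4j's implementation: the class names `SketchBreaker` and `BreakerSketchDemo` are hypothetical, the 30s wait is replaced by an explicit `waitDurationElapsed()` call, and the 5s timeout is left out.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Demo entry point: replays five failed calls against the sketch breaker. */
class BreakerSketchDemo {
    public static void main(String[] args) {
        SketchBreaker breaker = new SketchBreaker(10, 5, 50.0, 3);
        for (int i = 0; i < 5; i++) {
            breaker.record(false); // false = failed call
        }
        System.out.println(breaker.state()); // OPEN
    }
}

/** Simplified count-based breaker. Illustration only, not Resilience4j's code. */
class SketchBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int windowSize;      // sliding-window-size
    private final int minimumCalls;    // minimum-number-of-calls
    private final double thresholdPct; // failure-rate-threshold
    private final int halfOpenCalls;   // permitted-number-of-calls-in-half-open-state
    private final Deque<Boolean> window = new ArrayDeque<>(); // true = success
    private State state = State.CLOSED;

    SketchBreaker(int windowSize, int minimumCalls, double thresholdPct, int halfOpenCalls) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.thresholdPct = thresholdPct;
        this.halfOpenCalls = halfOpenCalls;
    }

    State state() {
        return state;
    }

    /** Stand-in for the 30s timer: the caller signals that the wait has elapsed. */
    void waitDurationElapsed() {
        if (state == State.OPEN) {
            state = State.HALF_OPEN;
            window.clear();
        }
    }

    /** Records one call outcome and re-evaluates the state. */
    void record(boolean success) {
        if (state == State.OPEN) {
            return; // an open circuit rejects calls, so there is nothing to record
        }
        window.addLast(success);
        if (window.size() > windowSize) {
            window.removeFirst(); // keep only the most recent calls
        }
        int required = (state == State.HALF_OPEN) ? halfOpenCalls : minimumCalls;
        if (window.size() < required) {
            return; // not enough data to evaluate yet
        }
        long failures = window.stream().filter(ok -> !ok).count();
        double failureRate = 100.0 * failures / window.size();
        if (failureRate >= thresholdPct) {
            state = State.OPEN;
        } else if (state == State.HALF_OPEN) {
            state = State.CLOSED;
            window.clear(); // start a fresh window after recovery
        }
    }
}
```

With these values, four failures in a row leave the circuit CLOSED because fewer than 5 calls have been recorded; the fifth failure opens it.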
Key Design Decision: 4xx vs 5xx
Only 5xx errors and network failures trip the circuit breaker. 4xx errors (USER_NOT_FOUND, DUPLICATE_EMAIL, etc.) are business logic errors and do NOT count as failures.
This is enforced at two levels:
- Java config (Resilience4jConfig.java): Custom recordException predicate that only counts ServiceClientException with httpStatus >= 500 as failures
- YAML config (auth-service.yml / web-service.yml): ignore-exceptions: ServiceClientClientException, which tells the named circuit breaker instance to completely ignore 4xx exceptions
Both are needed: the Java config sets the default for all circuit breakers, while the YAML ignore-exceptions applies to named instances that override the default.
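The Java-level rule can be sketched as a plain predicate. This is a rough illustration, not the contents of Resilience4jConfig.java: `SketchServiceClientException` is a hypothetical stand-in for the project's ServiceClientException, and its constructor and `httpStatus()` accessor are assumptions made for the example.

```java
import java.util.function.Predicate;

/** Demo entry point: evaluates the predicate for a 4xx and a 5xx error. */
class FailurePredicateDemo {
    public static void main(String[] args) {
        Predicate<Throwable> countsAsFailure = FailurePredicateSketch.recordFailure();
        System.out.println(countsAsFailure.test(
                new SketchServiceClientException("USER_NOT_FOUND", 404)));       // false
        System.out.println(countsAsFailure.test(
                new SketchServiceClientException("SERVER_NOT_AVAILABLE", 503))); // true
    }
}

/** Hypothetical stand-in for the project's ServiceClientException. */
class SketchServiceClientException extends RuntimeException {
    private final int httpStatus;

    SketchServiceClientException(String errorCode, int httpStatus) {
        super(errorCode);
        this.httpStatus = httpStatus;
    }

    int httpStatus() {
        return httpStatus;
    }
}

class FailurePredicateSketch {
    /** Counts an error as a failure only if it is a client exception with a 5xx status. */
    static Predicate<Throwable> recordFailure() {
        return error -> error instanceof SketchServiceClientException
                && ((SketchServiceClientException) error).httpStatus() >= 500;
    }
}
```

Because network failures and timeouts surface as ServiceClientException with status 503 (see Error Flow), a single status check covers them as well.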
Bug Fix: YAML Instance Naming Mismatch (2026-03-04)
The initial implementation had a silent configuration bug — the YAML circuit breaker config was never actually applied at runtime.
Root cause: ServiceClient.extractServiceName() derives the circuit breaker name from the Eureka URL (e.g., http://MEMBER-SERVICE → "MEMBER-SERVICE" uppercase). But the YAML instance was configured as memberService (camelCase). Resilience4j does an exact match on instance names, so the YAML config for memberService was dead config — never matched.
What was actually running: Only the Java default from Resilience4jConfig.defaultCustomizer() was applied. It happened to carry the same settings, so runtime behavior was unchanged, but the YAML-level ignore-exceptions was not in effect.
Fix applied:
- auth-service.yml: Renamed instance from memberService → MEMBER-SERVICE
- web-service.yml: Added new instance GEOSPATIAL-SERVICE with matching config + actuator endpoints
Lesson: When using ReactiveCircuitBreakerFactory.create(name), the name parameter must exactly match the YAML instances.<name> key. In this project, circuit breaker names come from Eureka service IDs (uppercase with hyphens), not camelCase.
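The mismatch is easy to reproduce in isolation. The sketch below is an assumption-laden re-creation rather than the real ServiceClient code: `extractServiceName` here is a guess at the behavior described above (take the host part of the URL unchanged), the request path is made up, and a plain map stands in for the YAML instances section.

```java
import java.util.Map;

/** Demo entry point: shows that the camelCase key never matches the derived name. */
class InstanceNameDemo {
    public static void main(String[] args) {
        // Hypothetical path; only the host part matters here.
        String name = InstanceNameSketch.extractServiceName("http://MEMBER-SERVICE/some/path");
        Map<String, String> before = Map.of("memberService", "yaml-config");
        Map<String, String> after = Map.of("MEMBER-SERVICE", "yaml-config");
        System.out.println(InstanceNameSketch.resolveConfig(before, name)); // java-default
        System.out.println(InstanceNameSketch.resolveConfig(after, name));  // yaml-config
    }
}

class InstanceNameSketch {
    /** Returns the host part of the URL exactly as written, including its case. */
    static String extractServiceName(String url) {
        int schemeEnd = url.indexOf("://");
        String rest = schemeEnd >= 0 ? url.substring(schemeEnd + 3) : url;
        int end = rest.length();
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (c == '/' || c == ':' || c == '?') {
                end = i; // host ends at the first path, port, or query delimiter
                break;
            }
        }
        return rest.substring(0, end);
    }

    /** Exact-match lookup: an unknown name silently falls back to the default. */
    static String resolveConfig(Map<String, String> instances, String name) {
        return instances.getOrDefault(name, "java-default");
    }
}
```

The silent fallback is what made the bug hard to spot: nothing fails, the named config is just never selected.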
Files Changed (8 files, 194 insertions)
New Files
| File | Purpose |
|---|---|
| common-api/.../config/Resilience4jConfig.java | Circuit breaker config: failure predicate (only 5xx trips circuit), sliding window, timeout |
Modified Files
| File | Change |
|---|---|
| common-api/pom.xml | Added spring-cloud-starter-circuitbreaker-reactor-resilience4j dependency |
| common-api/.../client/ServiceClient.java | Wrapped postReactive() and getReactive() with circuit breaker + 5s timeout |
| config-service/.../config/auth-service.yml | Added Resilience4j YAML config + actuator endpoints. Fixed: instance name memberService → MEMBER-SERVICE |
| config-service/.../config/web-service.yml | Added (2026-03-04): Resilience4j config for GEOSPATIAL-SERVICE instance + ignore-exceptions + actuator endpoints |
| auth-service/.../exceptions/code/AuthErrorCode.java | Added SERVICE_CIRCUIT_OPEN (503), fixed SERVER_NOT_AVAILABLE from 500 → 503 |
| auth-service/.../service/AuthService.java | Added CIRCUIT_OPEN case to error mapping switch |
| common-api/.../client/ServiceClientReactiveTest.java | Updated tests with no-op circuit breaker factory |
| auth-service/.../service/AuthServiceComponentTest.java | Updated tests with no-op circuit breaker factory |
Error Flow
When member-service is down (circuit CLOSED, counting failures):
ServiceClient: WebClientRequestException
→ ServiceClientException("SERVER_NOT_AVAILABLE", 503)
→ AuthService.mapServiceClientException()
→ MemberServiceException(SERVER_NOT_AVAILABLE)
→ GlobalExceptionHandler → 503 to client
When circuit is OPEN (fail fast):
ServiceClient: circuit breaker blocks the call immediately
→ ServiceClientException("CIRCUIT_OPEN", 503)
→ AuthService.mapServiceClientException()
→ MemberServiceException(SERVICE_CIRCUIT_OPEN)
→ GlobalExceptionHandler → 503 "Member service is temporarily unavailable"
When call times out (>5 seconds):
ServiceClient: TimeoutException after 5s
→ ServiceClientException("SERVER_NOT_AVAILABLE", 503)
→ same flow as network error above
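The three flows above reduce to two mapping steps, sketched here in plain Java. This is a simplified illustration, not the project's code: `SketchCircuitOpenException` is a hypothetical stand-in for the rejection signal (Resilience4j itself raises CallNotPermittedException when a circuit is open), java.net.ConnectException stands in for WebClientRequestException, and the `UNKNOWN_ERROR` catch-all is an assumption.

```java
import java.net.ConnectException;
import java.util.concurrent.TimeoutException;

/** Demo entry point: traces an open-circuit rejection through both mapping steps. */
class ErrorFlowDemo {
    public static void main(String[] args) {
        String clientCode = ErrorFlowSketch.toClientErrorCode(new SketchCircuitOpenException());
        System.out.println(clientCode + " -> " + ErrorFlowSketch.toAuthErrorCode(clientCode));
    }
}

/** Hypothetical stand-in for the exception raised while the circuit is open. */
class SketchCircuitOpenException extends RuntimeException { }

class ErrorFlowSketch {
    /** Step 1 (ServiceClient): translate the low-level failure into an error code. */
    static String toClientErrorCode(Throwable error) {
        if (error instanceof SketchCircuitOpenException) {
            return "CIRCUIT_OPEN";
        }
        if (error instanceof ConnectException || error instanceof TimeoutException) {
            return "SERVER_NOT_AVAILABLE"; // network errors and timeouts share one code
        }
        return "UNKNOWN_ERROR"; // hypothetical catch-all, not from the source
    }

    /** Step 2 (AuthService): map the client error code to the auth-service error code. */
    static String toAuthErrorCode(String clientErrorCode) {
        switch (clientErrorCode) {
            case "CIRCUIT_OPEN":
                return "SERVICE_CIRCUIT_OPEN";
            case "SERVER_NOT_AVAILABLE":
                return "SERVER_NOT_AVAILABLE";
            default:
                return "UNKNOWN_ERROR";
        }
    }
}
```

Both mapped codes end up as HTTP 503, so the client can tell "rejected without calling" from "called and failed" only by the error code.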
Monitoring
Actuator endpoints exposed for circuit breaker health:
- GET /actuator/health: shows circuit breaker state in health check
- GET /actuator/circuitbreakers: list all circuit breakers and their state
- GET /actuator/circuitbreakerevents: recent events (state transitions, errors)
These can be wired to Grafana (already running at grafana.adventuretube.net) via Prometheus for real-time dashboards.
Verified in Production
Tested with member-service down — auth-service returns:
{
"success": false,
"message": "Member service is temporarily unavailable, please try again later",
"errorCode": "SERVICE_CIRCUIT_OPEN",
"data": "com.adventuretube.auth.service.AuthService.mapServiceClientException : auth-service",
"timestamp": "2026-02-26T18:12:32.19415"
}
HTTP Status: 503 Service Unavailable (previously would have been 500 or hung indefinitely)
Testing
All 21 existing tests pass:
- 7 tests: ServiceClientReactiveTest (common-api)
- 14 tests: AuthServiceComponentTest + ReactiveAuthenticationManagerTest + MemberMapperTest (auth-service)
Tests use a no-op circuit breaker factory that passes through all calls without tripping.
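The idea of a pass-through breaker can be sketched without Spring. The interface below is hypothetical and only loosely modeled on the `run(toRun, fallback)` shape of Spring Cloud Circuit Breaker; the project's actual test factory is reactive and is not shown in this document. Routing exceptions to the fallback is also an assumption.

```java
import java.util.function.Function;
import java.util.function.Supplier;

/** Demo entry point: a successful call passes straight through the no-op breaker. */
class NoOpBreakerDemo {
    public static void main(String[] args) {
        SketchCircuitBreaker breaker = new NoOpSketchBreaker();
        String result = breaker.run(() -> "member-1", error -> "fallback");
        System.out.println(result); // member-1
    }
}

/** Hypothetical minimal breaker contract used only for this sketch. */
interface SketchCircuitBreaker {
    <T> T run(Supplier<T> toRun, Function<Throwable, T> fallback);
}

/** Never counts failures and never opens: it just runs the call. */
class NoOpSketchBreaker implements SketchCircuitBreaker {
    @Override
    public <T> T run(Supplier<T> toRun, Function<Throwable, T> fallback) {
        try {
            return toRun.get();
        } catch (RuntimeException error) {
            return fallback.apply(error); // assumption: errors still reach the fallback
        }
    }
}
```

Keeping the breaker inert in unit tests means assertions exercise the error mapping itself rather than breaker state carried over from earlier test cases.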
Architecture Impact
Before
Auth Service --REST (no protection)--> Member Service
(infinite wait)
(cascading failure risk)
After
Auth Service --[Timeout 5s]--[Circuit Breaker]--REST--> Member Service
(fail fast when open)
(503 instead of hanging)
(auto-recovery via half-open)
Future Work
- Web-service circuit breaker: Done (2026-03-04). Added GEOSPATIAL-SERVICE instance config + actuator endpoints
- Gateway-level circuit breaker: Spring Cloud Gateway has built-in Resilience4j support for route-level circuit breaking
- Retry pattern: Add retry for transient failures (1-2 retries with backoff) before circuit breaker evaluates
- Bulkhead: Limit concurrent calls to prevent connection pool exhaustion
- Grafana dashboard: Wire Resilience4j Micrometer metrics to Prometheus → Grafana
