Circuit Breaker Implementation (Resilience4j)

Branch: feature/circuit-breaker (merged to main)
Commit: 241481b
Date: 2026-02-26
Library: Resilience4j 2.2.0 via Spring Cloud Circuit Breaker 3.2.0


Problem

Auth-service calls member-service via ServiceClient (WebClient) with:

  • No timeout — a slow member-service hangs auth-service indefinitely
  • No circuit breaker — repeated failures keep hitting a dead service
  • Cascading failure risk: member-service down → auth-service down → gateway backs up → everything dead

Call Chain Before

Client → Gateway → Auth Service --REST (no timeout, no protection)--> Member Service → PostgreSQL

Solution

Added a Resilience4j circuit breaker to ServiceClient in the common-api module, so every inter-service call made through ServiceClient is protected automatically.

How It Works

The circuit breaker has 3 states:

State               | Behavior
CLOSED (normal)     | All requests pass through. Failures are counted.
OPEN (tripped)      | All requests fail fast with 503. No calls to member-service.
HALF-OPEN (testing) | 3 test calls allowed. If they succeed → CLOSED. If they fail → OPEN.
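
The transitions in the table can be sketched as a small pure function. This is only an illustrative model, not Resilience4j's implementation; the BreakerStates class, its State and Event enums, and the next() method are hypothetical names introduced for this example.

```java
/** Illustrative model of the three circuit breaker states (not Resilience4j code). */
final class BreakerStates {
    enum State { CLOSED, OPEN, HALF_OPEN }

    /** Hypothetical events that can move the breaker between states. */
    enum Event {
        FAILURE_RATE_EXCEEDED,   // too many failures while CLOSED
        WAIT_DURATION_ELAPSED,   // the open-state wait has passed
        TEST_CALLS_SUCCEEDED,    // half-open test calls were healthy
        TEST_CALLS_FAILED        // half-open test calls failed again
    }

    /** Returns the next state, or the current state if the event does not apply. */
    static State next(State current, Event event) {
        switch (current) {
            case CLOSED:
                return event == Event.FAILURE_RATE_EXCEEDED ? State.OPEN : current;
            case OPEN:
                return event == Event.WAIT_DURATION_ELAPSED ? State.HALF_OPEN : current;
            case HALF_OPEN:
                if (event == Event.TEST_CALLS_SUCCEEDED) return State.CLOSED;
                if (event == Event.TEST_CALLS_FAILED) return State.OPEN;
                return current;
            default:
                throw new IllegalStateException("Unknown state: " + current);
        }
    }
}
```

Modelling the transitions as a function of (state, event) makes it easy to see that an OPEN circuit can only recover by passing through HALF-OPEN first.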

Configuration

Parameter                                    | Value | Meaning
sliding-window-size                          | 10    | Evaluate the last 10 calls
minimum-number-of-calls                      | 5     | Need at least 5 calls before evaluating
failure-rate-threshold                       | 50%   | Open circuit if ≥50% of calls fail
wait-duration-in-open-state                  | 30s   | Stay open for 30s before trying again
permitted-number-of-calls-in-half-open-state | 3     | Allow 3 test calls in half-open
timeout-duration                             | 5s    | Fail if member-service doesn’t respond in 5s
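
To make the interaction between the first three parameters concrete, here is a simplified count-based sliding window. It is a sketch written with plain JDK types, not the Resilience4j implementation; the SlidingWindowSketch class and its method names are hypothetical, and returning -1 while the window is still too small is just the convention chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Simplified count-based sliding window; a model of the settings above, not Resilience4j itself. */
final class SlidingWindowSketch {
    private final int windowSize;             // sliding-window-size
    private final int minimumCalls;           // minimum-number-of-calls
    private final float failureRateThreshold; // failure-rate-threshold, in percent
    private final Deque<Boolean> outcomes = new ArrayDeque<>(); // true = failed call

    SlidingWindowSketch(int windowSize, int minimumCalls, float failureRateThreshold) {
        this.windowSize = windowSize;
        this.minimumCalls = minimumCalls;
        this.failureRateThreshold = failureRateThreshold;
    }

    /** Records one call outcome, evicting the oldest outcome once the window is full. */
    void record(boolean failed) {
        if (outcomes.size() == windowSize) {
            outcomes.removeFirst();
        }
        outcomes.addLast(failed);
    }

    /** Failure rate in percent, or -1 while fewer than minimumCalls outcomes are recorded. */
    float failureRate() {
        if (outcomes.size() < minimumCalls) {
            return -1f;
        }
        long failures = outcomes.stream().filter(failed -> failed).count();
        return 100f * failures / outcomes.size();
    }

    /** True when the rate can be evaluated and has reached the threshold. */
    boolean shouldOpen() {
        float rate = failureRate();
        return rate >= 0f && rate >= failureRateThreshold;
    }
}
```

With the values from the table (10, 5, 50), four consecutive failures do not open the circuit because the minimum of 5 calls has not been reached; the fifth failure does.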

Key Design Decision: 4xx vs 5xx

Only 5xx errors and network failures trip the circuit breaker. 4xx errors (USER_NOT_FOUND, DUPLICATE_EMAIL, etc.) are business logic errors and do NOT count as failures.

This is enforced at two levels:

  1. Java config (Resilience4jConfig.java): Custom recordException predicate — only counts ServiceClientException with httpStatus >= 500 as failures
  2. YAML config (auth-service.yml / web-service.yml): ignore-exceptions: ServiceClientClientException — tells the named circuit breaker instance to completely ignore 4xx exceptions

Both are needed: the Java config sets the default for all circuit breakers, while the YAML ignore-exceptions applies to named instances that override the default.
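
The failure predicate described in point 1 can be sketched with plain JDK types. This is an approximation, not the contents of Resilience4jConfig.java: the nested ServiceClientException is a hypothetical stand-in for the project's class, and FailurePredicateSketch and RECORD_AS_FAILURE are names invented for the example.

```java
import java.util.function.Predicate;

/** Sketch of a "only 5xx counts as a failure" predicate (hypothetical names throughout). */
final class FailurePredicateSketch {

    /** Hypothetical stand-in for the project's ServiceClientException. */
    static class ServiceClientException extends RuntimeException {
        private final int httpStatus;

        ServiceClientException(String errorCode, int httpStatus) {
            super(errorCode);
            this.httpStatus = httpStatus;
        }

        int getHttpStatus() {
            return httpStatus;
        }
    }

    /**
     * Returns true only for ServiceClientException with a 5xx status.
     * 4xx business errors and any other exception type are not recorded as failures.
     */
    static final Predicate<Throwable> RECORD_AS_FAILURE = t ->
            t instanceof ServiceClientException
                    && ((ServiceClientException) t).getHttpStatus() >= 500;
}
```

In Resilience4j, a predicate of this shape is the kind of argument accepted by CircuitBreakerConfig.custom().recordException(...). Network failures still trip the circuit under this rule because, as the error flow in this post shows, ServiceClient wraps them in a ServiceClientException with status 503.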

Bug Fix: YAML Instance Naming Mismatch (2026-03-04)

The initial implementation had a silent configuration bug — the YAML circuit breaker config was never actually applied at runtime.

Root cause: ServiceClient.extractServiceName() derives the circuit breaker name from the Eureka URL (e.g., http://MEMBER-SERVICE → "MEMBER-SERVICE", uppercase). But the YAML instance was configured as memberService (camelCase). Resilience4j does an exact match on instance names, so the YAML config for memberService was dead config that never matched.

What was actually running: Only the Java default from Resilience4jConfig.defaultCustomizer() was applied. Because it happened to carry the same window, threshold, and timeout settings, runtime behavior was unaffected in practice. The YAML-level ignore-exceptions, however, was not in effect.

Fix applied:

  • auth-service.yml: Renamed instance from memberService → MEMBER-SERVICE
  • web-service.yml: Added new instance GEOSPATIAL-SERVICE with matching config + actuator endpoints

Lesson: When using ReactiveCircuitBreakerFactory.create(name), the name parameter must exactly match the YAML instances.<name> key. In this project, circuit breaker names come from Eureka service IDs (uppercase with hyphens), not camelCase.
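
The mismatch is easy to reproduce in isolation. The sketch below is not the project's ServiceClient code: it assumes the breaker name is simply the host part of the URL, models the configured instances as a plain Map, and uses hypothetical names (BreakerNameSketch, resolveConfig, the "/some/path" URL suffix, and the "yaml-instance" / "java-default" labels).

```java
import java.net.URI;
import java.util.Map;

/** Sketch of why a camelCase YAML key never matches an uppercase Eureka service ID. */
final class BreakerNameSketch {

    /** Hypothetical version of extractServiceName(): uses the URL host as the breaker name. */
    static String extractServiceName(String url) {
        return URI.create(url).getHost();
    }

    /** Exact-match lookup: returns the instance config, or the default when no key matches. */
    static String resolveConfig(Map<String, String> instances, String breakerName) {
        return instances.getOrDefault(breakerName, "java-default");
    }
}
```

For example, extractServiceName("http://MEMBER-SERVICE/some/path") yields "MEMBER-SERVICE". Looking that name up in a map keyed by "memberService" silently falls back to the default, while a map keyed by "MEMBER-SERVICE" returns the instance config.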


Files Changed (8 files, 194 insertions)

New Files

File | Purpose
common-api/.../config/Resilience4jConfig.java | Circuit breaker config: failure predicate (only 5xx trips circuit), sliding window, timeout

Modified Files

File | Change
common-api/pom.xml | Added spring-cloud-starter-circuitbreaker-reactor-resilience4j dependency
common-api/.../client/ServiceClient.java | Wrapped postReactive() and getReactive() with circuit breaker + 5s timeout
config-service/.../config/auth-service.yml | Added Resilience4j YAML config + actuator endpoints. Fixed: instance name memberService → MEMBER-SERVICE
config-service/.../config/web-service.yml | Added (2026-03-04): Resilience4j config for GEOSPATIAL-SERVICE instance + ignore-exceptions + actuator endpoints
auth-service/.../exceptions/code/AuthErrorCode.java | Added SERVICE_CIRCUIT_OPEN (503), fixed SERVER_NOT_AVAILABLE from 500 → 503
auth-service/.../service/AuthService.java | Added CIRCUIT_OPEN case to error mapping switch
common-api/.../client/ServiceClientReactiveTest.java | Updated tests with no-op circuit breaker factory
auth-service/.../service/AuthServiceComponentTest.java | Updated tests with no-op circuit breaker factory

Error Flow

When member-service is down (circuit CLOSED, counting failures):

ServiceClient: WebClientRequestException
  → ServiceClientException("SERVER_NOT_AVAILABLE", 503)
    → AuthService.mapServiceClientException()
      → MemberServiceException(SERVER_NOT_AVAILABLE)
        → GlobalExceptionHandler → 503 to client

When circuit is OPEN (fail fast):

ServiceClient: circuit breaker blocks the call immediately
  → ServiceClientException("CIRCUIT_OPEN", 503)
    → AuthService.mapServiceClientException()
      → MemberServiceException(SERVICE_CIRCUIT_OPEN)
        → GlobalExceptionHandler → 503 "Member service is temporarily unavailable"

When call times out (>5 seconds):

ServiceClient: TimeoutException after 5s
  → ServiceClientException("SERVER_NOT_AVAILABLE", 503)
    → same flow as network error above
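
The three flows above share the same two hops: a low-level failure becomes a service error code, which is then mapped to an auth error code. The sketch below models both hops with JDK types only. It is an assumption-laden illustration, not the project's code: IOException stands in for WebClientRequestException, and CircuitOpenException, the "UNKNOWN" code, and the INTERNAL_ERROR fallback are hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.TimeoutException;

/** Sketch of the two-step error mapping (hypothetical stand-ins for the real classes). */
final class ErrorFlowSketch {

    /** Subset of AuthErrorCode as described in this post; INTERNAL_ERROR is a hypothetical fallback. */
    enum AuthErrorCode {
        SERVER_NOT_AVAILABLE(503),
        SERVICE_CIRCUIT_OPEN(503),
        INTERNAL_ERROR(500);

        final int httpStatus;

        AuthErrorCode(int httpStatus) {
            this.httpStatus = httpStatus;
        }
    }

    /** Hypothetical marker for a call rejected by an open circuit. */
    static class CircuitOpenException extends RuntimeException {
    }

    /** First hop: the error code the client layer reports for each low-level failure. */
    static String toServiceErrorCode(Throwable t) {
        if (t instanceof CircuitOpenException) {
            return "CIRCUIT_OPEN";
        }
        if (t instanceof TimeoutException || t instanceof IOException) {
            return "SERVER_NOT_AVAILABLE"; // timeouts and network errors share one code
        }
        return "UNKNOWN";
    }

    /** Second hop: switch from service error code to AuthErrorCode. */
    static AuthErrorCode mapServiceClientException(String serviceErrorCode) {
        switch (serviceErrorCode) {
            case "CIRCUIT_OPEN":
                return AuthErrorCode.SERVICE_CIRCUIT_OPEN;
            case "SERVER_NOT_AVAILABLE":
                return AuthErrorCode.SERVER_NOT_AVAILABLE;
            default:
                return AuthErrorCode.INTERNAL_ERROR;
        }
    }
}
```

Both the timeout and the network error end in SERVER_NOT_AVAILABLE, while an open circuit gets its own SERVICE_CIRCUIT_OPEN code; all three reach the client as HTTP 503.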

Monitoring

Actuator endpoints exposed for circuit breaker health:

  • GET /actuator/health — shows circuit breaker state in health check
  • GET /actuator/circuitbreakers — list all circuit breakers and their state
  • GET /actuator/circuitbreakerevents — recent events (state transitions, errors)

These can be wired to Grafana (already running at grafana.adventuretube.net) via Prometheus for real-time dashboards.


Verified in Production

Tested with member-service down — auth-service returns:

{
  "success": false,
  "message": "Member service is temporarily unavailable, please try again later",
  "errorCode": "SERVICE_CIRCUIT_OPEN",
  "data": "com.adventuretube.auth.service.AuthService.mapServiceClientException : auth-service",
  "timestamp": "2026-02-26T18:12:32.19415"
}

HTTP Status: 503 Service Unavailable (previously would have been 500 or hung indefinitely)


Testing

All 21 existing tests pass:

  • 7 tests — ServiceClientReactiveTest (common-api)
  • 14 tests — AuthServiceComponentTest + ReactiveAuthenticationManagerTest + MemberMapperTest (auth-service)

Tests use a no-op circuit breaker factory that passes through all calls without tripping.
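
The idea behind a no-op breaker can be shown with a very small pass-through wrapper. The project's actual test factory works with Spring Cloud's reactive circuit breaker types, which are not reproduced here; the Breaker interface, NoOpBreaker class, and create() method below are hypothetical, synchronous stand-ins.

```java
import java.util.function.Supplier;

/** Sketch of a pass-through ("no-op") circuit breaker for tests (hypothetical types). */
final class NoOpBreakerSketch {

    /** Hypothetical minimal breaker abstraction. */
    interface Breaker {
        <T> T run(Supplier<T> call);
    }

    /** Always invokes the call directly; it keeps no state and never opens. */
    static final class NoOpBreaker implements Breaker {
        @Override
        public <T> T run(Supplier<T> call) {
            return call.get();
        }
    }

    /** Hypothetical factory method: every name gets a no-op breaker. */
    static Breaker create(String name) {
        return new NoOpBreaker();
    }
}
```

Because the wrapper neither counts failures nor rejects calls, tests exercise the client's own success and error paths without depending on breaker state left over from earlier test cases.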


Architecture Impact

Before

Auth Service --REST (no protection)--> Member Service
             (infinite wait)
             (cascading failure risk)

After

Auth Service --[Timeout 5s]--[Circuit Breaker]--REST--> Member Service
             (fail fast when open)
             (503 instead of hanging)
             (auto-recovery via half-open)

Future Work

  • Web-service circuit breaker — Done (2026-03-04). Added GEOSPATIAL-SERVICE instance config + actuator endpoints
  • Gateway-level circuit breaker — Spring Cloud Gateway has built-in Resilience4j support for route-level circuit breaking
  • Retry pattern — Add retry for transient failures (1-2 retries with backoff) before circuit breaker evaluates
  • Bulkhead — Limit concurrent calls to prevent connection pool exhaustion
  • Grafana dashboard — Wire Resilience4j Micrometer metrics to Prometheus → Grafana
