
Quartz Cluster on Spring Boot

Multiple pods · one execution per fire · misfire policies · failover recovery · all interactive

1. What is Quartz and why "clustered"? click each option
Scenario: You have a scheduled job (e.g. a nightly billing report). Your service runs as 3 pods on Kubernetes. How do you guarantee the job runs exactly once at every fire time? Pick an approach below.
2. The lock race — only 1 pod wins each fire live sim
When a trigger fires, every pod tries SELECT … FROM QRTZ_LOCKS WHERE LOCK_NAME='TRIGGER_ACCESS' FOR UPDATE. The DB grants the row lock to exactly one pod — the rest block briefly, then see no triggers to acquire and back off.
Predict-then-verify: over 100 fires with 4 pods, what's the rough split? (Run it and find out.)
Every 10 sec Every minute Every 5 min Hourly Daily 02:00 Weekday 09:00
Sim time: — Next fire: — Total fires: 0 Per-pod execution count below ▼
SQL terminal โ€” actual lock queries
# Press "Start" to begin simulation
Why this works โ€” the core insight

SELECT FOR UPDATE is a row-level pessimistic lock. The first transaction to acquire the row blocks all others until it commits or rolls back.

Quartz uses this on a single row in QRTZ_LOCKS with LOCK_NAME='TRIGGER_ACCESS'. While one pod holds it, no other pod can acquire triggers.

The winner picks up the trigger from QRTZ_TRIGGERS, marks it ACQUIRED, inserts a row into QRTZ_FIRED_TRIGGERS, releases the lock, then runs the job.

Cost: this serializes trigger acquisition cluster-wide. Beyond ~3 nodes, contention starts to matter.
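The race can be sketched in plain Java: a ReentrantLock stands in for the QRTZ_LOCKS row and an AtomicInteger stands in for the trigger's ACQUIRED flag. This is a toy model, not Quartz code, and in-process thread scheduling skews the per-pod split far more than a real DB would, but it shows the invariant: exactly one execution per fire, no matter how many pods race.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockRaceSim {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock triggerAccess = new ReentrantLock();   // stands in for the QRTZ_LOCKS row
        Map<String, Integer> executions = new ConcurrentHashMap<>();
        int pods = 4, fires = 100;

        for (int fire = 0; fire < fires; fire++) {
            AtomicInteger acquired = new AtomicInteger();    // 0 = trigger still WAITING
            CountDownLatch done = new CountDownLatch(pods);
            for (int p = 0; p < pods; p++) {
                String pod = "pod-" + p;
                new Thread(() -> {
                    triggerAccess.lock();                    // losers block here, like FOR UPDATE
                    try {
                        if (acquired.compareAndSet(0, 1)) {  // trigger unclaimed this fire?
                            executions.merge(pod, 1, Integer::sum);
                        }
                        // else: trigger already ACQUIRED -> back off, nothing to do
                    } finally {
                        triggerAccess.unlock();
                    }
                    done.countDown();
                }).start();
            }
            done.await();
        }
        int total = executions.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("total executions = " + total + " (one per fire)");
        System.out.println("per-pod split    = " + executions);
    }
}
```

Whatever the split looks like, the total is always 100: the CAS inside the lock plays the role of marking the trigger ACQUIRED, so the losers find nothing left to fire.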

3. QRTZ_ database tables — what's written when 11 tables
Click any table to expand it. Watch rows appear/update as triggers fire. Three tables matter most: QRTZ_LOCKS (the lock row), QRTZ_FIRED_TRIGGERS (currently running jobs), and QRTZ_SCHEDULER_STATE (heartbeats — used for failover detection).
What Quartz writes to the DB at each stage of a job's life
4. Misfire policies — what happens when the scheduler is down timeline sim
A misfire happens when a trigger should have fired but couldn't (scheduler shutdown, no thread available, DB unreachable). The policy decides what to do when the scheduler comes back.
Challenge: Hourly job, scheduler down for 4 hours โ€” which policy gives 1 execution? 4? 0?
Scheduled fire
Executed
Replayed (catch-up)
Skipped
Downtime
—
Total scheduled fires (in window)
—
Actual executions
—
Skipped (no execution)
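The challenge's arithmetic can be worked out in plain Java. The enum names below are informal labels of mine, not Quartz constants; they correspond to the misfire instructions Quartz calls "ignore misfires", "fire and proceed" (fire once now), and "do nothing":

```java
// Hourly trigger, scheduler down for 4 hours -> 4 missed fires.
public class MisfireMath {
    enum Policy { IGNORE_MISFIRES, FIRE_AND_PROCEED, DO_NOTHING }

    static int catchUpExecutions(Policy policy, int missedFires) {
        switch (policy) {
            case IGNORE_MISFIRES:  return missedFires;              // replay every missed fire
            case FIRE_AND_PROCEED: return Math.min(1, missedFires); // one catch-up, then resume
            case DO_NOTHING:       return 0;                        // skip; wait for next fire
            default: throw new IllegalStateException();
        }
    }

    public static void main(String[] args) {
        int missed = 4;  // hourly job, 4 hours of downtime
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + catchUpExecutions(p, missed) + " catch-up execution(s)");
        }
    }
}
```

So for the hourly-job challenge: ignore-misfires gives 4 executions, fire-and-proceed gives 1, do-nothing gives 0 until the next regular fire.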
5. Failover & recovery — pod dies mid-job kill sim
Click "Kill" on any pod while it's running a job. Watch QRTZ_SCHEDULER_STATE stop heartbeating, the cluster detect the failure after the checkin interval expires, and (if requestsRecovery=true) re-fire the job on a healthy pod.
If unchecked, killed jobs are dropped on the floor
Use the Kill buttons on the pod cards above ▲
Cluster event log — failure detection & recovery
Cluster idle. Start the simulation and kill a pod to see failover in action.
6. Health check — Spring Boot Actuator + Quartz indicator live JSON
Toggle a component to fail and watch /actuator/health output update. Kubernetes uses this for readiness/liveness probes — a NotReady pod stops receiving traffic.
Challenge: Cause the probe to fail โ€” which specific component reports DOWN?
GET /actuator/health
kubectl get pods (after readiness probe outcome)
Production setup: use /actuator/health/readiness for the K8s readiness probe and /actuator/health/liveness for liveness. Don't use /actuator/health for both — a transient DB blip can liveness-fail your pod and trigger a restart loop.
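One way to wire that split looks like the fragment below (Spring Boot 2.3+ Actuator probe support; the quartz entry assumes a custom HealthIndicator bean named quartz as in the demo above, and property names should be checked against your Boot version):

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true            # exposes /actuator/health/liveness and /readiness
      group:
        readiness:
          include: readinessState,db,quartz   # DB or Quartz trouble -> NotReady, traffic stops
        liveness:
          include: livenessState              # keep liveness narrow: no DB, no restart loop
```

The point of the split: readiness may include slow or flaky dependencies, liveness should only say "this process is alive".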
7. Long-running jobs — when job duration > trigger interval interactive
Drag the sliders to set how long your job takes and how often it triggers. Without @DisallowConcurrentExecution, each trigger fires a new job instance even if the last one is still running โ€” threads pile up until the pool is exhausted. Flip the annotation on and see the difference.
Uncheck → multiple instances run at once
Execution timeline — first 5 minutes
Normal execution
Delayed (missed trigger)
Scheduled trigger fire
QRTZ_TRIGGERS — trigger state
Thread pool (max 10 threads)
1 thread in use
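The pile-up is pure arithmetic, and a plain-Java sketch (not Quartz code; numbers chosen for illustration) makes the cliff visible. With a 10-minute job firing every minute and no @DisallowConcurrentExecution, the 10-thread pool is exhausted by the 10th fire:

```java
public class PileUpMath {
    public static void main(String[] args) {
        int durationSec = 600, intervalSec = 60, poolSize = 10;
        // Without @DisallowConcurrentExecution: the instance started at fire k is
        // still running at time t if k*interval <= t < k*interval + duration.
        for (int t = 0; t <= 600; t += 60) {
            int running = 0;
            for (int fire = 0; fire * intervalSec <= t; fire++) {
                if (t < fire * intervalSec + durationSec) running++;
            }
            running = Math.min(running, poolSize); // the pool caps concurrency; later fires wait
            System.out.println("t=" + t + "s  concurrent instances=" + running
                    + (running == poolSize ? "  <- pool exhausted" : ""));
        }
    }
}
```

With the annotation on, the count stays at 1 and triggers past the running instance are delayed instead of stacked.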
What happens when the pod dies mid-job?
Quartz only detects a dead pod when another pod notices its QRTZ_SCHEDULER_STATE row has gone stale — it does not use OS signals or Kubernetes events.
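The staleness test itself is simple: a pod is presumed dead once its last checkin is older than the checkin interval plus some slack. A minimal sketch of that rule (my own simplification; Quartz's internal check is more elaborate, but the shape is the same):

```java
import java.time.Instant;

public class FailoverDetector {
    // A pod is presumed dead when its last checkin is older than the
    // checkin interval plus a tolerance.
    static boolean isStale(Instant lastCheckin, Instant now,
                           long checkinIntervalMs, long toleranceMs) {
        return lastCheckin.plusMillis(checkinIntervalMs + toleranceMs).isBefore(now);
    }

    public static void main(String[] args) {
        Instant now     = Instant.parse("2024-01-01T00:00:30Z");
        Instant healthy = Instant.parse("2024-01-01T00:00:25Z"); // checked in 5s ago
        Instant dead    = Instant.parse("2024-01-01T00:00:05Z"); // silent for 25s
        long interval = 7500, tolerance = 7500;                  // clusterCheckinInterval + slack
        System.out.println("healthy pod stale? " + isStale(healthy, now, interval, tolerance));
        System.out.println("killed  pod stale? " + isStale(dead, now, interval, tolerance));
    }
}
```

This is also why failover latency is bounded below by clusterCheckinInterval: nobody can notice the stale row faster than the next checkin sweep.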
Recovery timeline
Event sequence
JobDetail.requestsRecovery:
Design patterns for at-least-once jobs
🔁
Idempotent job
Check before you act. Guard every side-effect with a "already done?" query.
💾
Checkpointing
Save progress into JobDataMap or a DB table. On recovery, skip already-done work.
🔁
ctx.isRecovering()
Quartz tells you this is a recovery run. Take a different code path for cleanup.
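The first pattern, idempotency, can be sketched without any Quartz dependency. The Set below stands in for a DB table with a unique constraint; in a real job you would key on the JobKey plus ctx.getScheduledFireTime() so a recovery re-run of the same fire is a no-op (names and structure here are illustrative, not a Quartz API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotency sketch: guard the side-effect with an "already done?" check
// keyed by fire time, so re-running the same fire does nothing.
public class IdempotentReportJob {
    private final Set<String> completedFires = ConcurrentHashMap.newKeySet();

    boolean runOnce(String jobKey, long scheduledFireTime) {
        String fireId = jobKey + "@" + scheduledFireTime;
        if (!completedFires.add(fireId)) {
            return false;            // already done: recovery run, skip the side-effect
        }
        // ... send the report exactly once for this fire ...
        return true;
    }

    public static void main(String[] args) {
        IdempotentReportJob job = new IdempotentReportJob();
        System.out.println("first run executed:  " + job.runOnce("reportJob", 1700000000000L));
        System.out.println("recovery re-run:     " + job.runOnce("reportJob", 1700000000000L));
    }
}
```

With at-least-once delivery from requestsRecovery, this guard is what turns "fired twice" into "executed once".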
Java annotations
@DisallowConcurrentExecution   // cluster-wide: no two instances run at once
@PersistJobDataAfterExecution  // saves JobDataMap changes — required for checkpointing
public class ReportJob implements Job {
    @Override
    public void execute(JobExecutionContext ctx) throws JobExecutionException {
        if (ctx.isRecovering()) {
            // pod died last time — decide: restart from scratch or resume from checkpoint
            log.warn("Recovery run detected — last execution was interrupted");
        }
        // ... job work ...
    }
}

Reference — Spring Boot config, gotchas, the 11 tables

Spring Boot Quartz config — application.yml
spring:
  quartz:
    job-store-type: jdbc
    jdbc:
      initialize-schema: never  # run quartz_tables.sql manually in production
    properties:
      org.quartz:
        scheduler:
          instanceName: ClusteredScheduler
          instanceId: AUTO          # auto-generates unique ID per pod
        jobStore:
          class: org.springframework.scheduling.quartz.LocalDataSourceJobStore
          driverDelegateClass: org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
          tablePrefix: QRTZ_
          isClustered: true         # THE switch — without this, behavior is undefined
          clusterCheckinInterval: 7500   # ms — how often each pod heartbeats
          useProperties: false
        threadPool:
          class: org.quartz.simpl.SimpleThreadPool
          threadCount: 10
The 11 QRTZ_ tables — what each one stores
@DisallowConcurrentExecution — the cluster-wide gotcha

By default, Quartz is happy to fire the same job concurrently on different pods (or on the same pod for staggered triggers). If your job is non-idempotent — like sending an email or updating a balance — you don't want that.

Adding @DisallowConcurrentExecution on the JobDetail prevents concurrent runs cluster-wide. While job instance A is running on any pod, no other pod will fire the same JobKey.

Trap: if pod A crashes mid-job, its row in QRTZ_FIRED_TRIGGERS lingers and the job looks stuck. Cluster recovery releases it after the next checkin detects the stale instance. With requestsRecovery=true the job re-fires on a healthy pod; without it, recovery just unblocks future fires.

Misfire instructions — the actual constants

Set per-trigger when building it: .withSchedule(cronSchedule(...).withMisfireHandlingInstructionFireAndProceed())

Performance cliff at 3+ nodes

The cluster-wide lock on QRTZ_LOCKS:TRIGGER_ACCESS serializes trigger acquisition. With 3 nodes the contention is mild. With 8+ nodes the lock becomes the bottleneck — pods spend more time waiting for the lock than running jobs.

Workarounds: acquire triggers in batches so each lock round-trip grabs several at once (org.quartz.scheduler.batchTriggerAcquisitionMaxCount > 1, paired with org.quartz.jobStore.acquireTriggersWithinLock=true); keep the cluster small and scale thread pools vertically instead of adding pods; or split unrelated jobs onto separate schedulers with different instanceName values, since the lock rows are keyed per scheduler.

Common production bugs & symptoms