
Quartz Cluster on Spring Boot

Multiple pods · one execution per fire · misfire policies · failover recovery · all interactive

1. What is Quartz and why "clustered"? click each option
Scenario: You have a scheduled job (e.g. a nightly billing report). Your service runs as 3 pods on Kubernetes. How do you guarantee the job runs exactly once at every fire time? Pick an approach below.
2. The lock race — only 1 pod wins each fire live sim
When a trigger fires, every pod tries SELECT … FROM QRTZ_LOCKS WHERE LOCK_NAME='TRIGGER_ACCESS' FOR UPDATE. The DB grants the row lock to exactly one pod — the rest block briefly, then see no triggers to acquire and back off.
Predict-then-verify: over 100 fires with 4 pods, what's the rough split? (Run it and find out.)
Every 10 sec Every minute Every 5 min Hourly Daily 02:00 Weekday 09:00
Sim time: — Next fire: — Total fires: 0 Per-pod execution count below ▼
SQL terminal โ€” actual lock queries
# Press "Start" to begin simulation
Why this works โ€” the core insight

SELECT FOR UPDATE is a row-level pessimistic lock. The first transaction to acquire the row blocks all others until it commits or rolls back.

Quartz uses this on a single row in QRTZ_LOCKS with LOCK_NAME='TRIGGER_ACCESS'. While one pod holds it, no other pod can acquire triggers.

The winner picks up the trigger from QRTZ_TRIGGERS, marks it ACQUIRED, inserts a row into QRTZ_FIRED_TRIGGERS, releases the lock, then runs the job.

Cost: this serializes trigger acquisition cluster-wide. Beyond ~3 nodes, contention starts to matter.
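The race can be sketched in plain Java: a ReentrantLock stands in for the QRTZ_LOCKS row and an AtomicInteger stands in for the trigger's ACQUIRED flag. This is a toy model, not Quartz code, and in-process thread scheduling skews the per-pod split far more than a real DB would, but it shows the invariant: exactly one execution per fire, no matter how many pods race.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockRaceSim {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock triggerAccess = new ReentrantLock();   // stands in for the QRTZ_LOCKS row
        Map<String, Integer> executions = new ConcurrentHashMap<>();
        int pods = 4, fires = 100;

        for (int fire = 0; fire < fires; fire++) {
            AtomicInteger acquired = new AtomicInteger();    // 0 = trigger still WAITING
            CountDownLatch done = new CountDownLatch(pods);
            for (int p = 0; p < pods; p++) {
                String pod = "pod-" + p;
                new Thread(() -> {
                    triggerAccess.lock();                    // losers block here, like FOR UPDATE
                    try {
                        if (acquired.compareAndSet(0, 1)) {  // trigger unclaimed this fire?
                            executions.merge(pod, 1, Integer::sum);
                        }
                        // else: trigger already ACQUIRED -> back off, nothing to do
                    } finally {
                        triggerAccess.unlock();
                    }
                    done.countDown();
                }).start();
            }
            done.await();
        }
        int total = executions.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println("total executions = " + total + " (one per fire)");
        System.out.println("per-pod split    = " + executions);
    }
}
```

Whatever the split looks like, the total is always 100: the CAS inside the lock plays the role of marking the trigger ACQUIRED, so the losers find nothing left to fire.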

3. QRTZ_ database tables — what's written when 11 tables
Click any table to expand it. Watch rows appear/update as triggers fire. Three tables matter most: QRTZ_LOCKS (the lock row), QRTZ_FIRED_TRIGGERS (currently running jobs), and QRTZ_SCHEDULER_STATE (heartbeats — used for failover detection).
What Quartz writes to the DB at each stage of a job's life
4. Misfire policies — what happens when the scheduler is down timeline sim
A misfire happens when a trigger should have fired but couldn't (scheduler shutdown, no thread available, DB unreachable). The policy decides what to do when the scheduler comes back.
Challenge: Hourly job, scheduler down for 4 hours โ€” which policy gives 1 execution? 4? 0?
Scheduled fire
Executed
Replayed (catch-up)
Skipped
Downtime
—
Total scheduled fires (in window)
—
Actual executions
—
Skipped (no execution)
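The challenge's arithmetic can be worked out in plain Java. The enum names below are informal labels of mine, not Quartz constants; they correspond to the misfire instructions Quartz calls "ignore misfires", "fire and proceed" (fire once now), and "do nothing":

```java
// Hourly trigger, scheduler down for 4 hours -> 4 missed fires.
public class MisfireMath {
    enum Policy { IGNORE_MISFIRES, FIRE_AND_PROCEED, DO_NOTHING }

    static int catchUpExecutions(Policy policy, int missedFires) {
        switch (policy) {
            case IGNORE_MISFIRES:  return missedFires;              // replay every missed fire
            case FIRE_AND_PROCEED: return Math.min(1, missedFires); // one catch-up, then resume
            case DO_NOTHING:       return 0;                        // skip; wait for next fire
            default: throw new IllegalStateException();
        }
    }

    public static void main(String[] args) {
        int missed = 4;  // hourly job, 4 hours of downtime
        for (Policy p : Policy.values()) {
            System.out.println(p + " -> " + catchUpExecutions(p, missed) + " catch-up execution(s)");
        }
    }
}
```

So for the hourly-job challenge: ignore-misfires gives 4 executions, fire-and-proceed gives 1, do-nothing gives 0 until the next regular fire.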
5. Failover & recovery — pod dies mid-job kill sim
Click "Kill" on any pod while it's running a job. Watch QRTZ_SCHEDULER_STATE stop heartbeating, the cluster detect the failure after the checkin interval expires, and (if requestsRecovery=true) re-fire the job on a healthy pod.
If unchecked, killed jobs are dropped on the floor
Use the Kill buttons on the pod cards above ▲
Cluster event log — failure detection & recovery
Cluster idle. Start the simulation and kill a pod to see failover in action.
6. Health check — Spring Boot Actuator + Quartz indicator live JSON
Toggle a component to fail and watch /actuator/health output update. Kubernetes uses this for readiness/liveness probes — a NotReady pod stops receiving traffic.
Challenge: Cause the probe to fail โ€” which specific component reports DOWN?
GET /actuator/health
kubectl get pods (after readiness probe outcome)
Production setup: use /actuator/health/readiness for the K8s readiness probe and /actuator/health/liveness for liveness. Don't use /actuator/health for both — a transient DB blip can liveness-fail your pod and trigger a restart loop.
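One way to wire that split looks like the fragment below (Spring Boot 2.3+ Actuator probe support; the quartz entry assumes a custom HealthIndicator bean named quartz as in the demo above, and property names should be checked against your Boot version):

```yaml
management:
  endpoint:
    health:
      probes:
        enabled: true            # exposes /actuator/health/liveness and /readiness
      group:
        readiness:
          include: readinessState,db,quartz   # DB or Quartz trouble -> NotReady, traffic stops
        liveness:
          include: livenessState              # keep liveness narrow: no DB, no restart loop
```

The point of the split: readiness may include slow or flaky dependencies, liveness should only say "this process is alive".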
7. Long-running jobs — when job duration > trigger interval interactive
Drag the sliders to set how long your job takes and how often it triggers. Without @DisallowConcurrentExecution, each trigger fires a new job instance even if the last one is still running โ€” threads pile up until the pool is exhausted. Flip the annotation on and see the difference.
Uncheck → multiple instances run at once
Execution timeline — first 5 minutes
Normal execution
Delayed (missed trigger)
Scheduled trigger fire
QRTZ_TRIGGERS — trigger state
Thread pool (max 10 threads)
1 thread in use
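The pile-up is pure arithmetic, and a plain-Java sketch (not Quartz code; numbers chosen for illustration) makes the cliff visible. With a 10-minute job firing every minute and no @DisallowConcurrentExecution, the 10-thread pool is exhausted by the 10th fire:

```java
public class PileUpMath {
    public static void main(String[] args) {
        int durationSec = 600, intervalSec = 60, poolSize = 10;
        // Without @DisallowConcurrentExecution: the instance started at fire k is
        // still running at time t if k*interval <= t < k*interval + duration.
        for (int t = 0; t <= 600; t += 60) {
            int running = 0;
            for (int fire = 0; fire * intervalSec <= t; fire++) {
                if (t < fire * intervalSec + durationSec) running++;
            }
            running = Math.min(running, poolSize); // the pool caps concurrency; later fires wait
            System.out.println("t=" + t + "s  concurrent instances=" + running
                    + (running == poolSize ? "  <- pool exhausted" : ""));
        }
    }
}
```

With the annotation on, the count stays at 1 and triggers past the running instance are delayed instead of stacked.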
What happens when the pod dies mid-job?
Quartz only detects a dead pod when another pod notices its QRTZ_SCHEDULER_STATE row has gone stale — it does not use OS signals or Kubernetes events.
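The staleness test itself is simple: a pod is presumed dead once its last checkin is older than the checkin interval plus some slack. A minimal sketch of that rule (my own simplification; Quartz's internal check is more elaborate, but the shape is the same):

```java
import java.time.Instant;

public class FailoverDetector {
    // A pod is presumed dead when its last checkin is older than the
    // checkin interval plus a tolerance.
    static boolean isStale(Instant lastCheckin, Instant now,
                           long checkinIntervalMs, long toleranceMs) {
        return lastCheckin.plusMillis(checkinIntervalMs + toleranceMs).isBefore(now);
    }

    public static void main(String[] args) {
        Instant now     = Instant.parse("2024-01-01T00:00:30Z");
        Instant healthy = Instant.parse("2024-01-01T00:00:25Z"); // checked in 5s ago
        Instant dead    = Instant.parse("2024-01-01T00:00:05Z"); // silent for 25s
        long interval = 7500, tolerance = 7500;                  // clusterCheckinInterval + slack
        System.out.println("healthy pod stale? " + isStale(healthy, now, interval, tolerance));
        System.out.println("killed  pod stale? " + isStale(dead, now, interval, tolerance));
    }
}
```

This is also why failover latency is bounded below by clusterCheckinInterval: nobody can notice the stale row faster than the next checkin sweep.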
Recovery timeline
Event sequence
JobDetail.requestsRecovery:
Design patterns for at-least-once jobs
🔁
Idempotent job
Check before you act. Guard every side-effect with a "already done?" query.
💾
Checkpointing
Save progress into JobDataMap or a DB table. On recovery, skip already-done work.
🔁
ctx.isRecovering()
Quartz tells you this is a recovery run. Take a different code path for cleanup.
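The first pattern, idempotency, can be sketched without any Quartz dependency. The Set below stands in for a DB table with a unique constraint; in a real job you would key on the JobKey plus ctx.getScheduledFireTime() so a recovery re-run of the same fire is a no-op (names and structure here are illustrative, not a Quartz API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Idempotency sketch: guard the side-effect with an "already done?" check
// keyed by fire time, so re-running the same fire does nothing.
public class IdempotentReportJob {
    private final Set<String> completedFires = ConcurrentHashMap.newKeySet();

    boolean runOnce(String jobKey, long scheduledFireTime) {
        String fireId = jobKey + "@" + scheduledFireTime;
        if (!completedFires.add(fireId)) {
            return false;            // already done: recovery run, skip the side-effect
        }
        // ... send the report exactly once for this fire ...
        return true;
    }

    public static void main(String[] args) {
        IdempotentReportJob job = new IdempotentReportJob();
        System.out.println("first run executed:  " + job.runOnce("reportJob", 1700000000000L));
        System.out.println("recovery re-run:     " + job.runOnce("reportJob", 1700000000000L));
    }
}
```

With at-least-once delivery from requestsRecovery, this guard is what turns "fired twice" into "executed once".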
Java annotations
@DisallowConcurrentExecution   // cluster-wide: no two instances run at once
@PersistJobDataAfterExecution  // saves JobDataMap changes — required for checkpointing
public class ReportJob implements Job {
    @Override
    public void execute(JobExecutionContext ctx) throws JobExecutionException {
        if (ctx.isRecovering()) {
            // pod died last time — decide: restart from scratch or resume from checkpoint
            log.warn("Recovery run detected — last execution was interrupted");
        }
        // ... job work ...
    }
}

Reference — Spring Boot config, gotchas, the 11 tables

Spring Boot Quartz config — application.yml
spring:
  quartz:
    job-store-type: jdbc
    jdbc:
      initialize-schema: never  # run quartz_tables.sql manually in production
    properties:
      org.quartz:
        scheduler:
          instanceName: ClusteredScheduler
          instanceId: AUTO          # auto-generates unique ID per pod
        jobStore:
          class: org.springframework.scheduling.quartz.LocalDataSourceJobStore
          driverDelegateClass: org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
          tablePrefix: QRTZ_
          isClustered: true         # THE switch — without this, behavior is undefined
          clusterCheckinInterval: 7500   # ms — how often each pod heartbeats
          useProperties: false
        threadPool:
          class: org.quartz.simpl.SimpleThreadPool
          threadCount: 10
The 11 QRTZ_ tables — what each one stores
@DisallowConcurrentExecution — the cluster-wide gotcha

By default, Quartz is happy to fire the same job concurrently on different pods (or on the same pod for staggered triggers). If your job is non-idempotent — like sending an email or updating a balance — you don't want that.

Adding @DisallowConcurrentExecution on the JobDetail prevents concurrent runs cluster-wide. While job instance A is running on any pod, no other pod will fire the same JobKey.

Trap: if pod A crashes mid-job, its row in QRTZ_FIRED_TRIGGERS lingers and the job looks stuck. Cluster recovery releases it after the next checkin detects the stale instance. With requestsRecovery=true the job re-fires on a healthy pod; without it, recovery just unblocks future fires.

Misfire instructions — the actual constants

Set per-trigger when building it: .withSchedule(cronSchedule(...).withMisfireHandlingInstructionFireAndProceed())

Performance cliff at 3+ nodes

The cluster-wide lock on QRTZ_LOCKS:TRIGGER_ACCESS serializes trigger acquisition. With 3 nodes the contention is mild. With 8+ nodes the lock becomes the bottleneck — pods spend more time waiting for the lock than running jobs.

Workarounds: acquire triggers in batches so each lock round-trip grabs several at once (org.quartz.scheduler.batchTriggerAcquisitionMaxCount > 1, paired with org.quartz.jobStore.acquireTriggersWithinLock=true); keep the cluster small and scale thread pools vertically instead of adding pods; or split unrelated jobs onto separate schedulers with different instanceName values, since the lock rows are keyed per scheduler.

Common production bugs & symptoms