Scheduling background jobs without cron headaches: patterns

Learn patterns for scheduling background jobs using workflows and a jobs table to run reminders, daily summaries, and cleanup reliably.

Why cron feels simple until it does not

Cron is great on day one: write a line, pick a time, forget about it. For one server and one task, it often works.

The problems show up when you rely on scheduling for real product behavior: reminders, daily summaries, cleanup, or sync jobs. Most “missed run” stories aren’t cron failing. They’re everything around it: a server reboot, a deploy that overwrote crontab, a job that ran longer than expected, or a clock or time zone mismatch. And once you run multiple app instances, you can get the opposite failure mode: duplicates, because two machines think they should run the same task.

Testing is another weak spot. A cron line doesn’t give you a clean way to run “what would happen at 9:00 AM tomorrow” in a repeatable test. So scheduling turns into manual checks, production surprises, and log hunting.

Before you pick an approach, be clear about what you’re scheduling. Most background work falls into a few buckets:

  • Reminders (send at a specific time, only once)
  • Daily summaries (aggregate data, then send)
  • Cleanup tasks (delete, archive, expire)
  • Periodic syncs (pull or push updates)

Sometimes you can skip scheduling entirely. If something can happen right when an event occurs (a user signs up, a payment succeeds, a ticket changes status), event-driven work is usually simpler and more reliable than time-driven work.

When you do need time, reliability mostly comes down to visibility and control. You want a place to record what should run, what did run, and what failed, plus a safe way to retry without creating duplicates.

The basic pattern: scheduler, jobs table, worker

A simple way to avoid cron headaches is to split responsibilities:

  • A scheduler decides what should run and when.
  • A worker does the work.

Keeping those roles separate helps in two ways. You can change timing without touching business logic, and you can change business logic without breaking the schedule.

A jobs table becomes the source of truth. Instead of hiding state inside a server process or a cron line, every unit of work is a row: what to do, who it’s for, when it should run, and what happened last time. When something goes wrong, you can inspect it, retry it, or cancel it without guessing.

A typical flow looks like this:

  • The scheduler scans for due jobs (for example, run_at <= now and status = queued).
  • It claims a job so only one worker takes it.
  • A worker reads the job details and performs the action.
  • The worker records the result back to the same row.

The key idea is to make work resumable, not magical. If a worker crashes halfway through, the job row should still tell you what happened and what to do next.

Designing a jobs table that stays useful

A jobs table should answer two questions quickly: what needs to run next, and what happened last time.

Start with a small set of fields that cover identity, timing, and progress:

  • id, type: a unique id plus a short type like send_reminder or daily_summary.
  • payload: validated JSON with only what the worker needs (for example user_id, not the whole user object).
  • run_at: when the job becomes eligible to run.
  • status: queued, running, succeeded, failed, canceled.
  • attempts: incremented on each try.

Then add a few operational columns that make concurrency safe and incidents easier to handle. locked_at, locked_by, and locked_until let one worker claim a job so you don’t run it twice. last_error should be a short message (and optionally an error code), not a full stack trace dump that bloats rows.

Finally, keep timestamps that help both support and reporting: created_at, updated_at, and finished_at. These let you answer questions like “How many reminders failed today?” without digging through logs.
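
Put together, a minimal version of that table in PostgreSQL might look like the sketch below. Column names follow the fields above; the exact types and defaults are one reasonable choice, not the only one.

-- sketch: a minimal jobs table
CREATE TABLE jobs (
  id           BIGSERIAL PRIMARY KEY,
  type         TEXT NOT NULL,                   -- e.g. 'send_reminder', 'daily_summary'
  payload      JSONB NOT NULL DEFAULT '{}',     -- identifiers and parameters only
  run_at       TIMESTAMPTZ NOT NULL,            -- when the job becomes eligible
  status       TEXT NOT NULL DEFAULT 'queued',  -- queued | running | succeeded | failed | canceled
  attempts     INT NOT NULL DEFAULT 0,
  locked_at    TIMESTAMPTZ,                     -- concurrency: who holds the job, until when
  locked_by    TEXT,
  locked_until TIMESTAMPTZ,
  last_error   TEXT,                            -- short message, not a stack trace dump
  created_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  updated_at   TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  finished_at  TIMESTAMPTZ
);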

Indexes matter because your system constantly asks “what’s next?” Two that usually pay for themselves:

  • (status, run_at) to fetch due jobs fast
  • (type, status) to inspect or pause one job family during issues
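
In PostgreSQL, those two indexes are one line each:

CREATE INDEX jobs_due_idx    ON jobs (status, run_at);  -- "what's due?"
CREATE INDEX jobs_family_idx ON jobs (type, status);    -- "how is this job family doing?"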

For payloads, prefer small, focused JSON and validate it before inserting the job. Store identifiers and parameters, not snapshots of business data. Treat the payload shape like an API contract so older queued jobs still run after you change your app.

Job lifecycle: statuses, locking, and idempotency

A job runner stays reliable when every job follows a small, predictable lifecycle. That lifecycle is your safety net when two workers start at once, a server restarts mid-run, or you need to retry without creating duplicates.

A simple state machine is usually enough:

  • queued: ready to run at or after run_at
  • running: claimed by a worker
  • succeeded: finished and shouldn’t run again
  • failed: finished with an error and needs attention
  • canceled: intentionally stopped (for example, user opted out)

Claiming jobs without double work

To prevent duplicates, claiming a job needs to be atomic. The common approach is a lock with a timeout (a lease): a worker claims a job by setting status=running and writing locked_by plus locked_until. If the worker crashes, the lock expires and another worker can reclaim it.

A practical claiming rule set:

  • claim only queued jobs whose run_at <= now
  • set status, locked_by, and locked_until in the same update
  • reclaim running jobs only when locked_until < now
  • keep the lease short and extend it if the job is long
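
As a sketch, the first three rules can be expressed as one atomic PostgreSQL statement. Here $1 stands for the claiming worker's id, and the two-minute lease is an example value:

-- claim one due job, or reclaim one whose lease has expired
UPDATE jobs
SET status = 'running',
    locked_by = $1,
    locked_at = NOW(),
    locked_until = NOW() + INTERVAL '2 minutes',
    attempts = attempts + 1
WHERE id = (
  SELECT id FROM jobs
  WHERE (status = 'queued' AND run_at <= NOW())
     OR (status = 'running' AND locked_until < NOW())
  ORDER BY run_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id, type, payload;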

Idempotency (the habit that saves you)

Idempotency means: if the same job runs twice, the result is still correct.

The simplest tool is a unique key. For example, for a daily summary you can enforce one job per user per day with a key like summary:user123:2026-01-25. If a duplicate insert happens, it points to the same job rather than creating a second one.
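
A sketch of that guard, assuming you add a unique_key column to the jobs table (it isn't in the field list above):

-- at most one job per key, enforced by the database
CREATE UNIQUE INDEX jobs_unique_key_idx ON jobs (unique_key);

INSERT INTO jobs (type, unique_key, payload, run_at)
VALUES ('daily_summary',
        'summary:user123:2026-01-25',
        '{"user_id": "user123", "date": "2026-01-25"}',
        '2026-01-25 08:00+00')
ON CONFLICT (unique_key) DO NOTHING;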

Mark success only when the side effect is truly done (email sent, record updated). If you retry, the retry path must not create a second email or duplicate write.

Retries and failure handling without drama

Retries are where job systems either become dependable or turn into noise. The goal is straightforward: retry when a failure is likely temporary, stop when it isn’t.

A default retry policy usually includes:

  • max attempts (for example, 5 total tries)
  • a delay strategy (fixed delay or exponential backoff)
  • stop conditions (don’t retry “invalid input” type errors)
  • jitter (a small random offset to avoid retry spikes)

Instead of inventing a new status for retries, you can often reuse queued: set run_at to the next attempt time and put the job back in the queue. That keeps the state machine small.
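
Re-queueing with backoff can be a single update. In this sketch the one-minute base, the 30-second jitter, and the five-attempt cap are example numbers, and attempts is assumed to have been incremented when the job was claimed:

-- retry later: exponential backoff plus jitter, reusing the queued status
UPDATE jobs
SET status = 'queued',
    run_at = NOW()
             + INTERVAL '1 minute' * POWER(2, attempts)
             + random() * INTERVAL '30 seconds',
    last_error = $2,
    locked_by = NULL,
    locked_until = NULL
WHERE id = $1
  AND attempts < 5;  -- past the cap, a separate update marks the job failed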

When a job can make partial progress, treat that as normal. Store a checkpoint so a retry can continue safely, either in the job payload (like last_processed_id) or in a related table.

Example: a daily summary job generates messages for 500 users. If it fails at user 320, store the last successful user ID and retry from 321. If you also store a summary_sent record per user per day, a rerun can skip users already done.
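
Recording that checkpoint in the payload is a one-line update, assuming payload is a jsonb column:

-- remember how far the job got so a retry resumes instead of restarting
UPDATE jobs
SET payload = jsonb_set(payload, '{last_processed_id}', to_jsonb(320))
WHERE id = $1;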

Logging that actually helps

Log enough to debug in minutes:

  • job id, type, and attempt number
  • key inputs (user/team id, date range)
  • timing (started_at, finished_at, next run time)
  • short error summary (plus stack trace if you have one)
  • side effects count (emails sent, rows updated)

Step by step: build a simple scheduler loop

A scheduler loop is a small process that wakes up on a fixed rhythm, looks for due work, and hands it off. The goal is boring reliability, not perfect timing. For many apps, “wake up every minute” is enough.

Pick your wake-up frequency based on how time-sensitive the jobs are and how much load your database can take. If reminders must be near-real-time, run every 30 to 60 seconds. If daily summaries can drift a bit, every 5 minutes is fine and cheaper.

A simple loop:

  1. Wake up and get the current time (use UTC).
  2. Select due jobs where status = 'queued' and run_at <= now.
  3. Claim jobs safely so only one worker can take them.
  4. Hand each claimed job to a worker.
  5. Sleep until the next tick.

The claim step is where many systems break. You want to mark a job as running (and store locked_by and locked_until) in the same transaction that selects it. Many databases support “skip locked” reads so multiple schedulers can run without stepping on each other.

-- concept example (PostgreSQL): claim a batch atomically
WITH due AS (
  SELECT id FROM jobs
  WHERE status = 'queued' AND run_at <= NOW()
  ORDER BY run_at
  LIMIT 100
  FOR UPDATE SKIP LOCKED
)
UPDATE jobs
SET status = 'running',
    locked_by = 'worker-1',  -- placeholder for this worker's id
    locked_until = NOW() + INTERVAL '5 minutes'
FROM due
WHERE jobs.id = due.id
RETURNING jobs.id;

Keep the batch size small (like 50 to 200). Bigger batches can slow down the database and make crashes more painful.

If the scheduler crashes mid-batch, the lease saves you. Jobs stuck in running become eligible again after locked_until. Your worker should be idempotent so a reclaimed job doesn’t create duplicate emails or double charges.

Patterns for reminders, daily summaries, and cleanup

Most teams end up with the same three kinds of background work: messages that need to go out on time, reports that run on a schedule, and cleanup that keeps storage and performance healthy. The same jobs table and worker loop can handle all of them.

Reminders

For reminders, store everything needed to send the message in the job row: who it’s for, which channel (email, SMS, Telegram, in-app), which template, and the exact send time. The worker should be able to run the job without “looking around” for extra context.

If many reminders are due at the same time, add rate limiting. Cap messages per minute per channel and let extra jobs wait for the next run.

Daily summaries

Daily summaries fail when the time window is fuzzy. Pick one stable cutoff time (for example, 08:00 in the user’s local time), and define the window clearly (for example, “yesterday 08:00 to today 08:00”). Store the cutoff and the user time zone with the job so reruns produce the same result.

Keep each summary job small. If it needs to process thousands of records, split it into chunks (per team, per account, or by ID range) and enqueue follow-up jobs.

Cleanup tasks

Cleanup is safer when you separate “delete” from “archive.” Decide what can be removed forever (temporary tokens, expired sessions) and what should be archived (audit logs, invoices). Run cleanup in predictable batches to avoid long locks and sudden load spikes.
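
A batched delete in PostgreSQL might look like the sketch below; the sessions table and the 1,000-row batch size are assumptions:

-- delete expired sessions in small batches; reschedule until nothing is left
DELETE FROM sessions
WHERE id IN (
  SELECT id FROM sessions
  WHERE expires_at < NOW()
  LIMIT 1000
);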

Time and time zones: the hidden source of bugs

Many failures are time bugs: a reminder goes out an hour early, a daily summary skips Monday, or cleanup runs twice.

A good default is to store schedule timestamps in UTC and store the user’s time zone separately. Your run_at should be one UTC moment. When a user says “9:00 AM my time,” convert that to UTC when scheduling.

Daylight saving time is where naive setups break. “Every day at 9:00 AM” is not the same as “every 24 hours.” On DST shifts, 9:00 AM maps to a different UTC time, and some local times don’t exist (spring forward) or happen twice (fall back). The safer approach is to compute the next local occurrence each time you reschedule, then convert it to UTC again.
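
PostgreSQL can do the local-to-UTC conversion using the tz database, which handles DST for you (America/New_York is just an example zone):

-- "9:00 AM local" maps to different UTC instants across a DST boundary
SELECT (DATE '2026-03-07' + TIME '09:00') AT TIME ZONE 'America/New_York';  -- 14:00 UTC (EST)
SELECT (DATE '2026-03-08' + TIME '09:00') AT TIME ZONE 'America/New_York';  -- 13:00 UTC (EDT)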

For a daily summary, decide what “a day” means before you write code. A calendar day (midnight to midnight in the user’s time zone) matches human expectations. “Last 24 hours” is simpler but drifts and surprises people.

Late data is inevitable: an event arrives after a retry, or a note is added a few minutes after midnight. Decide whether late events belong to “yesterday” (with a grace period) or “today,” and keep that rule consistent.

A practical buffer can prevent misses:

  • scan for jobs due up to 2 to 5 minutes ago
  • make the job idempotent so reruns are safe
  • record the covered time range in the payload so summaries stay consistent

Common mistakes that cause missed or duplicate runs

Most pain comes from a few predictable assumptions.

The biggest is assuming “exactly once” execution. In real systems, workers restart, network calls time out, and locks can be lost. You typically get “at least once” delivery, which means duplicates are normal and your code must tolerate them.

Another is doing effects first (send email, charge card) without a dedupe check. A simple guard often solves this: a sent_at timestamp, a unique key like (user_id, reminder_type, date), or a stored dedupe token.
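
A sketch of that guard at the effect level, assuming a summary_sent table with a unique (user_id, summary_date) pair:

-- claim the side effect before performing it
INSERT INTO summary_sent (user_id, summary_date)
VALUES ($1, $2)
ON CONFLICT (user_id, summary_date) DO NOTHING
RETURNING user_id;
-- no row returned means this summary was already handled: don't send again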

Visibility is the next gap. If you can’t answer “what is stuck, since when, and why,” you’ll end up guessing. The minimum data to keep close is status, attempt count, next scheduled time, last error, and worker id.

The mistakes that show up most often:

  • designing jobs as if they run exactly once, then being surprised by duplicates
  • writing side effects without a dedupe check
  • running one huge job that tries to do everything and hits timeouts mid-way
  • retrying forever with no cap
  • skipping basic queue visibility (no clear view of backlog, failures, long-running items)

A concrete example: a daily summary job loops over 50,000 users and times out at user 20,000. On retry, it starts over and sends summaries again to the first 20,000 unless you track per-user completion or split it into per-user jobs.

Quick checklist for a reliable job system

A job runner is only “done” when you can trust it at 2 a.m.

Make sure you have:

  • Queue visibility: counts for queued vs running vs failed, plus the oldest queued job.
  • Idempotency by default: assume every job can run twice; use unique keys or “already processed” markers.
  • Retry policy per job type: an attempt cap, backoff timing, and a clear stop condition.
  • Consistent time storage: keep run_at in UTC; convert only at input and display.
  • Recoverable locks: a lease so crashes don’t leave jobs running forever.

Also cap batch size (how many jobs you claim at once) and worker concurrency (how many run at the same time). Without caps, one spike can overload your database or starve other work.

A realistic example: reminders and summaries for a small team

A small SaaS tool has 30 customer accounts. Each account wants two things: a 9:00 AM reminder for any open tasks, and a 6:00 PM daily summary of what changed today. They also need weekly cleanup so the database doesn’t fill up with old logs and expired tokens.

They use a jobs table plus a worker that polls for due jobs. When a new customer signs up, the backend schedules the first reminder and summary runs based on the customer’s time zone.

Jobs get created at a few common moments: on signup (create recurring schedules), on certain events (enqueue one-off notifications), on a schedule tick (insert upcoming runs), and on maintenance day (enqueue cleanup).

One Tuesday, the email provider has a temporary outage at 8:59 AM. The worker tries to send reminders, gets a timeout, and reschedules those jobs by setting run_at using backoff (for example, 2 minutes, then 10, then 30), incrementing attempts each time. Because each reminder job has an idempotency key like account_id + date + job_type, retries don’t produce duplicates if the provider recovers mid-flight.

Cleanup runs weekly in small batches, so it doesn’t block other work. Instead of deleting a million rows in one job, it deletes up to N rows per run and reschedules itself until done.

When a customer complains “I never got my summary,” the team checks the jobs table for that account and day: the job status, the attempt count, the current lock fields, and the last error returned by the provider. That turns “it should have sent” into “here’s exactly what happened.”

Next steps: implement, observe, then scale

Pick one job type and build it end to end before adding more. A single reminder job is a good starter because it touches everything: scheduling, claiming due work, sending a message, and recording outcomes.

Start with a version you can trust:

  • create the jobs table and one worker that processes one job type
  • add a scheduler loop that claims and runs due jobs
  • store enough payload to run the job without extra guessing
  • log every attempt and outcome so “Did it run?” is a 10-second question
  • add a manual rerun path for failed jobs so recovery doesn’t require a deploy

Once it runs, make it observable for humans. Even a basic admin view pays off quickly: search jobs by status, filter by time, inspect payload, cancel a stuck job, rerun a specific job id.

If you prefer building this kind of scheduler and worker flow with visual backend logic, AppMaster (appmaster.io) can model the jobs table in PostgreSQL and implement the claim-process-update loop as a Business Process, while still generating real source code for deployment.
