Retries and failure

The contract is small: a #[process] method returns anyhow::Result<()>, Ok(()) acknowledges the job, Err(_) marks it failed. After that the backend decides — and Redis (via apalis) gives you a fixed retry budget per method, a panic safety net, and a failed-jobs list. Anything richer than that is bolted on top.

#[process(queue = "audio", concurrency = 5, retries = 3)]
async fn transcode(&self, job: TranscodeDto) -> Result<()> {
    self.svc.transcode(&job.file).await
}
  • Ok(()) ⇒ the job is removed from the queue. apalis acknowledges it; that’s the end of its life.
  • Err(e) ⇒ the job goes back for another attempt. apalis tracks the attempt count; once the first try and all retries have failed, the job is moved to the queue’s failed_jobs list.
  • A panic inside the method is caught by a CatchPanicLayer and converted to Error::Abort. The job ends up in the failed list and the worker keeps running, so one bad job doesn’t kill the queue’s consumer.

Hot-path layer order: CatchPanicLayer sits inside RetryLayer, which sits inside ErrorHandlingLayer. RetryLayer reacts to Err, not to panics, so without the panic layer a single unwrap() in user code would tear down the worker. The order is fixed in nest-rs-redis’s build_worker.

retries = N uses RetryPolicy::retries(N) from apalis. That’s a fixed retry count, no backoff — the policy retries up to N times back-to-back. The number of total attempts a job sees is N + 1 (the first try plus N retries). Default if you omit retries: 0 — the job is tried once.
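The layering and the N + 1 attempt count described above can be modeled with the standard library alone. This is a minimal sketch, not the apalis or nest-rs-redis implementation: JobError, catch_panic, and retry are hypothetical names, the handler is a plain synchronous closure, and ErrorHandlingLayer is left out.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Hypothetical error type standing in for the errors the real layers exchange.
#[derive(Debug, PartialEq)]
enum JobError {
    /// Ordinary `Err` returned by the handler; eligible for retry.
    Failed(String),
    /// A panic converted into an error; never retried.
    Abort(String),
}

/// Inner layer: run the handler once and turn a panic into `JobError::Abort`.
fn catch_panic<F>(handler: &mut F) -> Result<(), JobError>
where
    F: FnMut() -> Result<(), String>,
{
    match catch_unwind(AssertUnwindSafe(|| (*handler)())) {
        Ok(Ok(())) => Ok(()),
        Ok(Err(msg)) => Err(JobError::Failed(msg)),
        Err(_) => Err(JobError::Abort("handler panicked".to_string())),
    }
}

/// Outer layer: one first try plus up to `retries` immediate retries.
/// Returns the final result and the number of attempts actually made.
fn retry<F>(retries: u32, mut handler: F) -> (Result<(), JobError>, u32)
where
    F: FnMut() -> Result<(), String>,
{
    let mut attempts = 0;
    loop {
        attempts += 1;
        match catch_panic(&mut handler) {
            // Budget left: retry back-to-back, with no delay.
            Err(JobError::Failed(_)) if attempts <= retries => continue,
            // Success, abort, or exhausted budget: stop here.
            other => return (other, attempts),
        }
    }
}
```

With retries set to 3, a handler that always returns Err is called four times, while a handler that panics is called once and comes back as an abort. Because the panic is caught inside the retry loop, it never unwinds past the worker.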

If you need a different retry curve, today the path is to wrap the business call in the service:

use std::time::Duration;

use anyhow::Result;

impl AudioService {
    pub async fn transcode(&self, file: &str) -> Result<()> {
        let mut attempt: u32 = 0;
        loop {
            match self.do_transcode(file).await {
                Ok(()) => return Ok(()),
                // Up to five retries, sleeping 100 ms, 200 ms, ... 1600 ms in between.
                Err(e) if attempt < 5 => {
                    let delay = Duration::from_millis(100 * 2u64.pow(attempt));
                    tracing::warn!(target: "features::audio", %file, attempt, error = %e, "transcode failed; retrying");
                    tokio::time::sleep(delay).await;
                    attempt += 1;
                }
                // Retries exhausted: hand the last error back to the #[process] method.
                Err(e) => return Err(e),
            }
        }
    }
}

The #[process] budget then acts as the outer safety net — a couple of immediate retries after the service-level backoff already gave up, useful for transient framework-level issues (the container resolution failing under load, a panic in deserialization).
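To see what the two budgets add up to, the arithmetic can be worked out in a few lines. This is a sketch under the numbers used in the examples on this page (five service-level retries with a 100 ms base delay, retries = 3 on the method); backoff_delay, backoff_schedule, and worst_case_calls are hypothetical helpers, not framework APIs.

```rust
use std::time::Duration;

/// Delay slept after failed attempt number `attempt` (0-based): 100 ms * 2^attempt.
fn backoff_delay(attempt: u32) -> Duration {
    Duration::from_millis(100 * 2u64.pow(attempt))
}

/// All delays for a loop that retries while `attempt < max_retries`.
fn backoff_schedule(max_retries: u32) -> Vec<Duration> {
    (0..max_retries).map(backoff_delay).collect()
}

/// Worst-case number of inner calls for one job: each of the
/// `process_retries + 1` handler attempts makes `service_retries + 1` calls.
fn worst_case_calls(service_retries: u32, process_retries: u32) -> u32 {
    (service_retries + 1) * (process_retries + 1)
}
```

The schedule for five retries is 100, 200, 400, 800, and 1600 ms, about 3.1 s of sleeping per handler attempt. With retries = 3 on the method, a job that never succeeds calls do_transcode up to 24 times before it lands in the failed list.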

apalis moves a job whose budget is exhausted (or that aborted via CatchPanicLayer) to a Redis sorted set named after the queue, with :failed appended (an apalis-internal naming convention you generally read through apalis’s CLI or admin tools). The framework does not ship a dead-letter routing layer of its own — the failed list is the storage’s failed list, and inspecting / replaying it is an operational task you run with apalis tooling or your own admin handler.

If you need richer dead-lettering — alerting, replay, archival to a separate queue — write a wrapper service that catches the Err from the work and republishes the payload to a dedicated queue (e.g. audio.dead) before returning the error to apalis. A #[processor] on audio.dead then runs your DLQ logic. That’s the same pattern any non-Redis backend would expose.

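#[process(queue = "audio", retries = 3)]
async fn transcode(&self, job: TranscodeDto) -> Result<()> {
    match self.svc.transcode(&job.file).await {
        Ok(()) => Ok(()),
        Err(e) => {
            // Runs on every failed attempt, not only the last one, so
            // report_dead should be idempotent.
            if let Err(report_err) = self.svc.report_dead(&job, &e).await {
                // Log the reporting failure but keep the original error,
                // which is the one apalis should see.
                tracing::error!(error = %report_err, "failed to republish job to audio.dead");
            }
            Err(e)
        }
    }
}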

Panics, deserialization errors, container misses

Three failure modes the framework converts into the same Err path, so you don’t have to think about them separately:

  • Panics in the method body — caught by CatchPanicLayer, surfaced as Error::Abort. The job goes to the failed list immediately (regardless of remaining retries), the worker stays up.
  • Deserialization errors — the macro-emitted JobHandler performs the serde_json::from_value call. A payload that doesn’t match the method’s job type returns Err before your method even runs. Retrying won’t help because the payload shape is wrong: the retries budget burns through, then the job ends in the failed list.
  • Container resolution failures — Container::get of the #[processor] host can fail if a dep wasn’t seeded. The handler returns Err with the resolution error. This usually means a configuration mistake at boot, not a transient failure: investigate, don’t rely on retries.

A processor reads only payloads it understands. The wire envelope ({ "v": 1, "payload": ... }) carries a version; the macro-emitted handler validates it. A future release that bumps WIRE_FORMAT_VERSION to 2 makes a v: 1 worker refuse to decode v: 2 jobs: it fails closed with a clear error rather than misinterpreting bytes. During a rolling deploy this means jobs queue up; afterward the new worker drains them.

Unversioned legacy payloads (a bare job JSON, no v key) are decoded directly with a tracing::warn — so jobs left in Redis from a release before envelopes existed don’t get stuck. New producers always wrap.
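The decision the handler makes on each incoming payload can be sketched as follows. This is a model under stated assumptions, not the macro-emitted code: JSON parsing is left out, Wire, DecodeError, and unwrap_envelope are hypothetical names, the warning goes to stderr instead of tracing::warn, and the sketch assumes a worker accepts only the exact version it was built with.

```rust
/// Envelope version this worker understands (value assumed for the sketch).
const WIRE_FORMAT_VERSION: u32 = 1;

/// What a worker may find in the queue. `payload` and the legacy body
/// stand in for the still-encoded job JSON.
#[derive(Debug, PartialEq)]
enum Wire {
    /// `{ "v": <version>, "payload": ... }`
    Envelope { v: u32, payload: String },
    /// Bare job JSON from before envelopes existed (no `v` key).
    Legacy(String),
}

#[derive(Debug, PartialEq)]
enum DecodeError {
    UnsupportedVersion { found: u32, supported: u32 },
}

/// Returns the job body to deserialize, or fails closed on an unknown version.
fn unwrap_envelope(wire: Wire) -> Result<String, DecodeError> {
    match wire {
        // Known version: hand the payload on to deserialization.
        Wire::Envelope { v, payload } if v == WIRE_FORMAT_VERSION => Ok(payload),
        // Any other version: refuse instead of guessing at the layout.
        Wire::Envelope { v, .. } => Err(DecodeError::UnsupportedVersion {
            found: v,
            supported: WIRE_FORMAT_VERSION,
        }),
        // No envelope at all: decode directly, but leave a trace of it.
        Wire::Legacy(body) => {
            eprintln!("warning: decoding unversioned legacy payload");
            Ok(body)
        }
    }
}
```

A v: 2 envelope reaching this worker produces UnsupportedVersion with found = 2 and supported = 1, which then follows the same Err path as any other failed job.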

  • Observability — what shows up in spans and logs when a job retries or fails.
  • OpenTelemetry — wiring the queue spans into a trace backend.
  • crates/nest-rs-redis/src/worker/consumer.rs — build_worker: the apalis layer stack, the CatchPanicLayer placement, the RetryLayer budget.
  • crates/nest-rs-queue/src/inventory.rs — WIRE_FORMAT_VERSION, the envelope spec.