Based on Garry Tan's gstack skill framework

Eng Plan Review

Input

Our microservices architecture has 15 services talking via REST. Should we add a message queue?

Result

# Architectural Decision: REST → Message Queue?

You're asking the right question. Let me push back before answering, because "add a queue" is one of the most common over-engineering moves in microservices.

## First, what problem are you actually solving?

15 services on REST is not inherently broken. Stripe ran on synchronous HTTP for years. The question isn't "is REST holding us back?" — it's "what specific pain are we feeling, and is a queue the right tool?"

Before recommending anything, I need to know which of these is true:

**Symptoms that justify a queue:**
- Cascading failures (Service A goes down, B/C/D start timing out, customer requests fail)
- Long-running work blocking request threads (>2s operations on the hot path)
- Fan-out patterns (one event needs to hit 5+ services, currently done with 5 sequential HTTP calls)
- Backpressure problems (Service B can't keep up with Service A's load, you're dropping requests or scaling B way past what its actual work requires)
- Cross-service transactions failing partially (you wrote to A, the call to B failed, now your data is inconsistent)
- Replay/audit needs (you need to reprocess events, regulators want an event log)

**Symptoms that do NOT justify a queue:**
- "It feels old-fashioned"
- "Everyone else uses Kafka"
- "We're worried about scale" (with no current scale problem)
- "Synchronous calls feel coupled" (they are coupled, but a queue doesn't fix that — it just hides it)

Which of these match your actual experience? If it's the second list, stop. You don't have a queue problem. You have an "I read a blog post" problem.

## The real cost of adding a queue

People underestimate this. A queue is not a library you import. It's a new piece of stateful infrastructure with its own failure modes, operational overhead, and conceptual complexity.

What you're actually signing up for:

**Operational:**
- A new system to monitor, patch, scale, back up, and get paged for at 3am
- Dead letter queues (where do failed messages go? who looks at them?)
- Consumer lag monitoring (how do you know when a consumer is falling behind?)
- Message ordering guarantees (is your queue actually FIFO? per-partition? do you care?)
- Idempotency on every consumer (messages WILL get delivered twice, your handlers must handle this)
- Schema evolution (when Service A changes the message format, how do consumers not break?)
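
To make the idempotency point concrete, here's a minimal sketch of a consumer that tolerates duplicate delivery. Everything in it is an assumption for illustration: the `processed` and `balances` tables, the `handle_deposit` handler, and the message shape are hypothetical, and SQLite stands in for whatever database the consumer owns. The idea is just that the "have I seen this message ID?" marker and the side effect commit in the same transaction.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory stand-in for the consumer's own database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE balances (account TEXT PRIMARY KEY, cents INTEGER NOT NULL)"
    )
    return conn


def handle_deposit(conn: sqlite3.Connection, message: dict) -> bool:
    """Apply a deposit at most once per message ID.

    Returns True if the deposit was applied, False if the message was a
    duplicate delivery and was skipped.
    """
    with conn:  # dedupe marker and side effect commit (or roll back) together
        try:
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message["id"],)
            )
        except sqlite3.IntegrityError:
            return False  # primary key clash: this message was already handled
        conn.execute(
            "INSERT OR IGNORE INTO balances (account, cents) VALUES (?, 0)",
            (message["account"],),
        )
        conn.execute(
            "UPDATE balances SET cents = cents + ? WHERE account = ?",
            (message["cents"], message["account"]),
        )
    return True
```

Delivering the same message twice leaves the balance unchanged after the first call. The catch is that every handler with a side effect needs something like this, which is part of the cost you're signing up for.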

**Conceptual:**
- Async debugging is dramatically harder than sync. "Why didn't this thing happen?" becomes a multi-hour investigation across logs, queue depths, and consumer health.
- Eventual consistency creeps into your data model. Now your UI has to handle "I created the thing but it doesn't show up yet."
- You lose the stack trace. With REST, a failure has a clear caller. With a queue, you stare at a failed message and try to figure out who produced it and why.
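
One common way to claw back some of that lost stack trace is to wrap every message in an envelope that records who produced it and which request it belongs to. This is a sketch under assumptions, not a standard: the `Envelope` fields, `describe_failure`, and the service names are all hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Envelope:
    """Metadata carried with every message so failures can be traced back."""

    event_type: str
    payload: dict
    producer: str        # which service published this
    correlation_id: str  # ID of the originating request, propagated end to end
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


def describe_failure(raw: str, error: Exception) -> str:
    """Build a log line for a failed message that names the producer."""
    env = json.loads(raw)
    return (
        f"{env['event_type']} from {env['producer']} "
        f"(correlation_id={env['correlation_id']}, "
        f"message_id={env['message_id']}) failed: {error}"
    )
```

It doesn't give you a real stack trace, but a failed message at least tells you where to start looking.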

**Team:**
- Every engineer now needs to understand async messaging, at-least-once delivery, idempotency, and event-driven thinking. That's a real skill ramp.

This is your "innovation token." Make sure you're spending it on something that's actually broken.

## If you do need async, the decision tree

Assuming you've identified a real pain point, here's the order I'd think about it:

**1. Is the problem just "long-running work on the request thread"?**

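Don't add a queue. Add a background job runner. Sidekiq (Ruby), BullMQ (Node), Celery (Python), or even just a Postgres-backed queue (Oban, River, pg-boss). These are dramatically simpler than Kafka/RabbitMQ. The Postgres-backed options ride on a dependency you already have (your database), so there's no new infra. Sidekiq and BullMQ need Redis, and Celery needs a broker such as Redis or RabbitMQ, so factor that in if you don't already run one.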

In my experience this solves something like 60% of "we need a queue" problems and costs you almost nothing operationally.
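
To show how little machinery a database-backed job queue needs, here's a minimal sketch. It is not how Oban, River, or pg-boss are implemented: the `jobs` table and the `enqueue`, `claim_next`, and `complete` functions are hypothetical, and SQLite stands in for Postgres so the example runs anywhere. It also assumes a single worker. With several workers on Postgres, the claim step would typically use `SELECT ... FOR UPDATE SKIP LOCKED` so two workers never grab the same row.

```python
import json
import sqlite3
from typing import Optional


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " kind TEXT NOT NULL,"
        " args TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )
    return conn


def enqueue(conn: sqlite3.Connection, kind: str, args: dict) -> int:
    """Called from the request handler: record the work and return at once."""
    with conn:
        cur = conn.execute(
            "INSERT INTO jobs (kind, args) VALUES (?, ?)", (kind, json.dumps(args))
        )
    return cur.lastrowid


def claim_next(conn: sqlite3.Connection) -> Optional[dict]:
    """Called from the worker loop: take the oldest queued job, if any.

    Single-worker sketch. With concurrent workers on Postgres, select the
    row with FOR UPDATE SKIP LOCKED inside the same transaction.
    """
    with conn:
        row = conn.execute(
            "SELECT id, kind, args FROM jobs "
            "WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return {"id": row[0], "kind": row[1], "args": json.loads(row[2])}


def complete(conn: sqlite3.Connection, job_id: int) -> None:
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
```

The request path calls `enqueue` and returns; a separate worker process loops on `claim_next`, does the slow work, and calls `complete`. A real runner adds retries, scheduling, and crash recovery on top, which is why you use a library rather than this.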

**2. Is the problem fan-out — one event needs to hit many services?**

Now you're in queue territory. But ask: do all the consumers need it in real time? Would a nightly batch job that hits each of them work? If yes, do that instead.

If you genuinely need real-time fan-out to N services, then yes — pub/sub messaging. Start with the boring choice: RabbitMQ or SNS/SQS on AWS. Not Kafka. Kafka is a log, not a queue, and the operational complexity is a different sport.
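
If it helps to picture the shape, here's a toy in-memory sketch of fan-out: one publish, and every subscriber gets its own copy in its own queue. That's roughly the SNS-topic-feeding-one-SQS-queue-per-consumer layout. The `Topic` class and subscriber names are hypothetical, and a real broker adds durability, acknowledgements, and redelivery that this leaves out.

```python
from collections import deque
from typing import Deque, Dict, Optional


class Topic:
    """Toy pub/sub topic: each subscriber has an independent queue."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}

    def subscribe(self, name: str) -> None:
        self._queues.setdefault(name, deque())

    def publish(self, event: dict) -> int:
        """Deliver one event to every subscriber; returns how many got it."""
        for queue in self._queues.values():
            queue.append(event)
        return len(self._queues)

    def poll(self, name: str) -> Optional[dict]:
        """Each consumer drains its own queue at its own pace."""
        queue = self._queues[name]
        return queue.popleft() if queue else None
```

The producer makes one call and doesn't know or care how many consumers exist. A slow `analytics` consumer falls behind on its own queue without slowing down `billing`, which is the property five sequential HTTP calls can't give you.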

**3. Is the problem cross-service transactions?**

A queue doesn't fix this. You need either:
- Saga pattern (with or without a queue, but explicit compensation logic)
- Outbox pattern (write to your DB and an outbox table in one transaction, then a separate process publishes from the outbox)
- Or, more often, "stop trying to do cross-service transactions" — reshape your service boundaries so the transaction lives in one service
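
For the saga option, "explicit compensation logic" means something like the sketch below: each step carries its own undo, and a failure unwinds the completed steps in reverse order. The `run_saga` function and the step names are hypothetical, and a production saga would also persist its progress so it can resume after a crash.

```python
from typing import Callable, List, Tuple

# (step name, action, compensation that undoes the action)
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: List[Step]) -> List[str]:
    """Run steps in order; on the first failure, compensate in reverse.

    Returns a log of what happened, for illustration.
    """
    log: List[str] = []
    completed: List[Tuple[str, Callable[[], None]]] = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo everything that already succeeded, newest first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            break
        log.append(f"done:{name}")
        completed.append((name, undo_for(compensate)))
    return log


def undo_for(compensate: Callable[[], None]) -> Callable[[], None]:
    """Hook point: real code would add retries and logging around the undo."""
    return compensate
```

Note what this does not give you: isolation. Between "reserve stock" succeeding and "charge card" failing, other readers can see the reserved stock. That's the trade you accept with sagas.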

The outbox pattern in particular is worth knowing about. It solves the "I wrote to my DB but the message publish failed" problem without distributed transactions. Boring, reliable, well-understood.
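
Here's a minimal sketch of the outbox pattern, assuming hypothetical `orders` and `outbox` tables and a `publish` callable that stands in for whatever broker client you'd use. SQLite is only there to keep the example self-contained.

```python
import json
import sqlite3
from typing import Callable


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " topic TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def create_order(conn: sqlite3.Connection, order_id: int, total_cents: int) -> None:
    """Write the business row and the outgoing event in ONE transaction."""
    with conn:  # both inserts commit together or not at all
        conn.execute(
            "INSERT INTO orders (id, total_cents) VALUES (?, ?)",
            (order_id, total_cents),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            (
                "order.created",
                json.dumps({"order_id": order_id, "total_cents": total_cents}),
            ),
        )


def relay_once(conn: sqlite3.Connection, publish: Callable[[str, dict], None]) -> int:
    """Separate process: publish pending outbox rows, then mark them sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, topic, payload in rows:
        # If publish raises, the row stays unpublished and is retried next run.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent
```

If the relay crashes after publishing but before marking the row, the event goes out again on the next run. So the outbox gives you at-least-once delivery, and consumers still need to be idempotent.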

**4. Is the problem cascading failures?**

A queue helps here, but so does a circuit breaker. Adding `resilience4j` (Java), `polly` (.NET), or `gobreaker` (Go) to your HTTP clients is a 1-day change that buys you most of the resilience benefit without new infrastructure. Do that first. If you're still seeing cascades after circuit breakers + retries with backoff + bulkheading, then queue.
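
To show what those libraries are doing for you, here's a stripped-down circuit breaker. It's a sketch of the general technique, not the API of `resilience4j`, `polly`, or `gobreaker`: the `CircuitBreaker` class, its parameters, and the thresholds are all hypothetical, and it isn't thread-safe.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call a failing dependency."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                # Open: fail fast instead of tying up a thread on a timeout.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

You'd wrap each downstream HTTP call, for example `breaker.call(lambda: fetch_inventory(sku))` with a hypothetical `fetch_inventory`, and keep one breaker per dependency. When Service A is down, B stops waiting on timeouts and returns an error or a fallback immediately, which is what keeps the failure from cascading to C and D.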

## My recommendation

Don't add a message queue based on the framing in your question. You haven't named a problem. You've named an architecture.

Instead:
1. Spend a week instrumenting what's actually slow, what's actually failing, and where requests are actually blocked. Get real numbers.
2. Identify the top 1-2 actual pain points.
3. For each one, pick the simplest tool that solves it. That might be a background job runner. It might be circuit breakers. It might be the outbox pattern. It might, eventually, be a queue.
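
For step 1, "real numbers" can be as simple as ranking service-to-service calls by tail latency and error rate. Here's a sketch, assuming you can export request records with a route, a latency in milliseconds, and a status code. The record shape and the `summarize` function are hypothetical; in practice you'd get the same view from your tracing or metrics tool.

```python
from collections import defaultdict
from typing import Dict, List


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


def summarize(records: List[dict]) -> List[dict]:
    """Group by route and rank by p99 latency, worst first."""
    by_route: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        by_route[record["route"]].append(record)

    summary = []
    for route, rows in by_route.items():
        errors = sum(1 for r in rows if r["status"] >= 500)
        summary.append(
            {
                "route": route,
                "count": len(rows),
                "p99_ms": percentile([r["ms"] for r in rows], 99),
                "error_rate": errors / len(rows),
            }
        )
    return sorted(summary, key=lambda s: s["p99_ms"], reverse=True)
```

The top few rows of that output are your candidate pain points for step 2. If nothing near the top involves long-running work, fan-out, or a struggling downstream service, a queue probably isn't the fix.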

If after that exercise you have a real fan-out or backpressure problem on 3+ services, then yes, add RabbitMQ or SNS/SQS. Start with one async path. Get good at it. Then expand.

The teams I've seen succeed with messaging are the ones that adopted it for a specific problem and grew the surface area carefully. The teams I've seen burn out are the ones that decided "we're event-driven now" and rewrote everything. The second group spent 18 months in migration hell and ended up with a system that was simultaneously async AND tightly coupled, because they didn't actually fix the coupling — they just made it harder to see.

## What I'd want to know to give you a sharper answer

- What does "talking via REST" mean in practice? Average request depth? Service A calls B calls C calls D? Or is it mostly flat?
- What's your current scale? RPS, services per request, p99 latencies.
- What's the failure mode that prompted this question? Be specific. "We had an outage" beats "we're worried about scale."
- Who would operate the queue? Do you have anyone who has run RabbitMQ or Kafka in production before?

Drop those in and I'll give you a much more concrete recommendation. Right now I'm pattern-matching, and the pattern is "team about to spend an innovation token on something that won't fix their actual problem." Happy to be wrong about that, but I want to see the evidence before I sign off.

gstack.ai · office hours