Our Slack bot handled 200 messages per day for three months without breaking a sweat. Then a tech blogger mentioned it in a newsletter, and we went from 200 to 12,000 messages in 48 hours.
Everything broke. Not dramatically — the server didn’t catch fire or anything. It just… slowed down. And slowed down more. And then started dropping messages. And then went completely silent while 12,000 people wondered why the AI bot they’d just heard about wasn’t responding.
Here’s what happened, what we did, and how we scaled from “fun side project” to “thing that people actually depend on” in a week.
The First 6 Hours: Denial and Panic
The newsletter hit inboxes at 9 AM on a Tuesday. By 10 AM, our message queue had a 400-message backlog. By noon, the queue was at 2,000 and the response time was 45 seconds (normally under 3 seconds).
My first reaction: “Huh, that’s a lot of messages.” My second reaction, 20 minutes later: “Oh no.”
The bottleneck wasn’t CPU or memory — it was the AI model API. Each message required an API call, and we were hitting rate limits hard. Our free-tier API plan allowed 60 requests per minute. We needed 200+ per minute.
Quick fix: upgrade the API plan. Got our rate limit to 500 requests per minute within 30 minutes by switching to a paid tier. Queue started draining. Crisis partially averted.
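If you've never hit this, "rate limited" just means the provider caps how many calls you can make per minute, and everything above the cap waits or fails. A minimal sliding-window limiter like the sketch below is roughly how you stay under a per-minute cap from the client side; it's not our actual client, and `model_client` / `.complete()` are stand-ins for whichever API you call.

```python
import threading
import time
from collections import deque

class RateLimiter:
    """Block callers so total calls never exceed max_calls per period seconds."""

    def __init__(self, max_calls: int, period: float = 60.0):
        self.max_calls = max_calls
        self.period = period
        self.calls: deque[float] = deque()   # timestamps of recent calls
        self.lock = threading.Lock()

    def acquire(self) -> None:
        while True:
            with self.lock:
                now = time.monotonic()
                # Drop timestamps that have aged out of the window
                while self.calls and now - self.calls[0] > self.period:
                    self.calls.popleft()
                if len(self.calls) < self.max_calls:
                    self.calls.append(now)
                    return
                wait = self.period - (now - self.calls[0])
            time.sleep(wait)

# Every worker routes model calls through one shared limiter.
limiter = RateLimiter(max_calls=500, period=60.0)

def call_model(model_client, prompt: str) -> str:
    limiter.acquire()
    return model_client.complete(prompt)   # hypothetical client and method
```

Blocking the caller is crude, but it keeps you under the limit while the real fix (a higher tier, or the queue we built later) lands.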
But then the second wave hit.
Hours 6-24: Everything Else Breaks
Increasing the API throughput revealed all the other bottlenecks we hadn’t noticed at low volume.
Database connections maxed out. Every message triggered a database lookup for user context. At 200 messages/day, no problem. At 12,000, our connection pool was exhausted. Users got “service unavailable” errors.
Fix: increased the application's connection pool size, put PgBouncer in front of Postgres as an external pooler, and routed the context lookups to read replicas.
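For anyone hitting the same wall, here's roughly what the pooling side looks like with SQLAlchemy. The DSNs, pool numbers, and the `user_context` table are illustrative rather than our production config.

```python
from sqlalchemy import create_engine, text

# Writes go through PgBouncer in front of the primary; the pool numbers
# here are illustrative, not production values.
primary = create_engine(
    "postgresql+psycopg2://bot@pgbouncer:6432/botdb",
    pool_size=20,          # steady-state connections held open
    max_overflow=10,       # allows short bursts above pool_size
    pool_pre_ping=True,    # drop dead connections instead of erroring
)

# Context lookups are read-only, so they can hit a replica.
replica = create_engine(
    "postgresql+psycopg2://bot@replica:5432/botdb",
    pool_size=20,
    pool_pre_ping=True,
)

def load_user_context(user_id: str) -> dict:
    with replica.connect() as conn:
        row = conn.execute(
            text("SELECT context FROM user_context WHERE user_id = :uid"),
            {"uid": user_id},
        ).first()
        return row[0] if row else {}
```

The point of PgBouncer is that many application connections get multiplexed onto a small number of Postgres backends, so application-side pool sizes stop translating one-to-one into server connections.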
Memory leak in the message handler. A variable that stored conversation context was growing without being cleaned up. At low volume, it grew slowly and got cleared by occasional restarts. At high volume, it consumed all available memory in about 4 hours.
Fix: added proper cleanup after each message is processed. This bug had been there since day one — it just never mattered until it did.
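Our handler isn't literally structured like this, but the shape of both the bug and the fix is easy to show: per-conversation state that only ever grew, plus the eviction step that was missing.

```python
import time

MAX_HISTORY = 20            # messages of context kept per conversation
MAX_IDLE_SECONDS = 30 * 60  # evict conversations idle this long

# Before the fix this dict only ever grew; restarts were the only "cleanup".
conversation_context: dict[str, dict] = {}

def remember(channel_id: str, message: str) -> list[str]:
    ctx = conversation_context.setdefault(
        channel_id, {"history": [], "last_seen": time.monotonic()}
    )
    ctx["history"].append(message)
    ctx["last_seen"] = time.monotonic()
    return ctx["history"]

def cleanup(channel_id: str) -> None:
    """Run after every message: bound the history and evict idle conversations."""
    ctx = conversation_context.get(channel_id)
    if ctx:
        ctx["history"] = ctx["history"][-MAX_HISTORY:]
    now = time.monotonic()
    stale = [cid for cid, c in conversation_context.items()
             if now - c["last_seen"] > MAX_IDLE_SECONDS]
    for cid in stale:
        del conversation_context[cid]
```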
Single-threaded processing. Messages were being processed sequentially. One at a time. At 200 messages/day, this was fine. At 12,000, it meant every message waited behind every other message.
Fix: implemented concurrent processing with a proper job queue. Messages get distributed across multiple workers. This alone cut average response time from 45 seconds to under 5.
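I'll skip the details of our exact stack, but an RQ-style setup (Redis-backed, which matches what we ended up with) captures the idea: the web process acknowledges Slack immediately and enqueues the work, and any number of workers pull jobs in parallel. `load_user_context`, `call_model`, `build_prompt`, and `post_to_slack` below are hypothetical helpers.

```python
# tasks.py: worker side. Each worker pulls jobs independently, so one slow
# model call no longer holds up every other message behind it.
from redis import Redis
from rq import Queue, Retry

redis_conn = Redis(host="localhost", port=6379)
queue = Queue("messages", connection=redis_conn)

def process_message(payload: dict) -> None:
    # Hypothetical helpers standing in for the real pipeline:
    context = load_user_context(payload["user"])                 # context lookup
    reply = call_model(build_prompt(context, payload["text"]))   # model call
    post_to_slack(payload["channel"], reply)                     # send the reply

# Receiver side: acknowledge Slack immediately, enqueue the real work.
def on_slack_event(event: dict) -> None:
    queue.enqueue(process_message, event, retry=Retry(max=3))  # retry failures
```

Start as many workers as you need with `rq worker messages`; each one is another lane for messages, and adding a worker is how you add throughput.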
The “Oh, We Need Real Infrastructure” Moment
By hour 24, I realized our “it works on a $10/month VPS” architecture was not going to handle sustained growth. We needed:
A proper load balancer. Not because we needed multiple servers yet, but because we needed health checks, automatic restarts, and the ability to deploy updates without downtime.
A message queue. A Redis-backed job queue decouples message receipt from message processing. If the AI model is slow, messages wait in the queue instead of timing out. If a worker crashes, the message gets retried instead of being lost.
Monitoring that actually alerts. We had logging. We didn’t have alerting. The difference matters when things break at 2 AM and nobody’s watching the logs.
Horizontal scaling. The ability to add more workers when load increases. Our architecture now auto-scales: if the queue depth exceeds a threshold, new workers spin up automatically.
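That auto-scaling step sounds fancier than it is. Stripped to the core, it's a loop that watches queue depth and launches workers when the backlog grows. The sketch below uses an RQ-style queue and a local subprocess to stay self-contained; a real setup would call your container platform or autoscaling API instead, and the threshold and worker cap are illustrative.

```python
import subprocess
import time

from redis import Redis
from rq import Queue

QUEUE_DEPTH_THRESHOLD = 100   # illustrative; tune to your latency target
MAX_WORKERS = 20
CHECK_INTERVAL = 15           # seconds between checks

queue = Queue("messages", connection=Redis())
workers: list[subprocess.Popen] = []

while True:
    # Reap workers that have exited so the count stays honest.
    workers = [w for w in workers if w.poll() is None]

    depth = queue.count   # jobs waiting in the queue, not yet picked up
    if depth > QUEUE_DEPTH_THRESHOLD and len(workers) < MAX_WORKERS:
        # Locally this is just another process; in production the same
        # decision would call your container platform or autoscaling API.
        workers.append(subprocess.Popen(["rq", "worker", "messages"]))

    time.sleep(CHECK_INTERVAL)
```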
What We Shipped in a Week
Day 1-2: Emergency rate limit upgrade, connection pool fix, memory leak fix.
Day 3-4: Message queue implementation, concurrent processing.
Day 5-6: Load balancer, monitoring with alerts, horizontal scaling.
Day 7: Finally slept.
Total infrastructure cost went from $10/month to about $120/month. But we went from supporting 200 messages/day to comfortably handling 50,000. And the architecture can scale further just by adding workers.
The Scaling Checklist I Wish I’d Had
If your AI bot is gaining traction and you want to be ready before the spike hits:
Set up monitoring with alerts now. Response time, error rate, queue depth, memory usage. Alert thresholds at 2x normal values. You want to know about problems before users tell you.
Implement a message queue. Even at low volume. It decouples receipt from processing, enables retries, and makes horizontal scaling trivial later.
Profile your per-message resource usage. How many database queries per message? How much memory? How many API calls? Multiply these by your growth target and see where the bottlenecks will be.
Test at 10x your current load. Use a load testing tool to simulate 10x message volume for an hour (a minimal sketch follows this checklist). Watch what breaks. Fix it before it breaks in production.
Have a scale-up plan documented. “If traffic doubles, do these three things.” Having the plan written down means you can execute it at 2 AM when you’re half-asleep instead of trying to architect solutions under pressure.
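On the load-testing point, a tool like Locust makes the 10x test a ten-minute job. The sketch below posts Slack-style event payloads to a `/slack/events` endpoint; the path and payload shape are stand-ins for whatever your bot actually receives.

```python
# locustfile.py: simulate Slack-style event posts. The /slack/events path and
# payload shape are stand-ins; swap in your bot's real endpoint and body.
import random

from locust import HttpUser, between, task

class SlackTraffic(HttpUser):
    wait_time = between(0.5, 2.0)   # seconds between messages per simulated user

    @task
    def send_message(self) -> None:
        self.client.post(
            "/slack/events",
            json={
                "type": "event_callback",
                "event": {
                    "type": "message",
                    "user": f"U{random.randint(1, 5000):06d}",
                    "channel": "C0LOADTEST",
                    "text": "load test message",
                },
            },
        )
```

Run it headless with something like `locust -f locustfile.py --headless -u 200 -r 20 --run-time 1h --host https://your-bot.example.com`, then watch queue depth and response time while it's running; the user count that approximates 10x depends on your own numbers.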
What I Learned About AI at Scale
The AI model isn’t usually the bottleneck — everything around it is. Database queries, context management, output formatting, message routing — all the “boring” infrastructure that you skip when building a prototype. At scale, the boring stuff matters more than the AI stuff.
Also: rate limits are the most underappreciated scaling constraint in AI applications. Your brilliant architecture doesn’t matter if the model API only allows 60 requests per minute. Check your limits before you launch, and have a plan for when you exceed them.
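"Have a plan" can be as small as not treating a 429 as a hard failure. A minimal sketch, assuming an HTTP model API that returns 429 and may send a Retry-After header; the URL is a placeholder:

```python
import random
import time

import requests

MODEL_API_URL = "https://api.example.com/v1/complete"   # placeholder URL

def call_model_with_backoff(payload: dict, max_retries: int = 5) -> dict:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(MODEL_API_URL, json=payload, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()
            return resp.json()
        # Respect Retry-After when the provider sends it; otherwise back off
        # exponentially, with jitter so retries don't arrive in lockstep.
        wait = float(resp.headers.get("Retry-After", delay))
        time.sleep(wait + random.uniform(0, 0.5))
        delay = min(delay * 2, 30.0)
    raise RuntimeError("model API still rate limiting after retries")
```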
The viral spike was stressful but ultimately positive. It forced us to build the infrastructure we should have built from the start. And now we’re ready for the next spike — whenever it comes.
Originally published: January 8, 2026