Build Automation Systems That Don't Break at 2 AM
Most automation agencies build impressive 50+ module workflows that break mysteriously. Learn why staged systems with error handling beat tightly-coupled automations.
"Why is this so complicated?"
Dan was staring at the N8N workflow his previous team built. 57 modules. Nothing named properly. No clear path from input to output.
I told him: "Because they were an automation agency."
Here's what I mean:
Automation agencies build impressive workflows. "Look! We connected 8 systems!"
It works. For a while. Then QuickBooks changes their API. Or Airtable goes down for 30 minutes.
And you realize: your automation multiplies the failure; it doesn't contain it.
The Compounding Failure Problem
Airtable went down for a few hours last month.
Here's what happened to Dan's 57-module workflow:
- Module 1 pulls from QuickBooks ✓
- Modules 2-8 transform the data ✓
- Module 9 writes to Airtable ✗ [Airtable down]
- Modules 10-57 never run
- No error message sent
- No staging
The workflow just... stopped. Silently.
Dan found out three days later when a client asked where their PDF was.
That's compounding failure:
One brief outage → three days of broken workflows → manual recovery of 40+ transactions → lost client trust.
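Here's the failure mode in a few lines of Python. Every name below is a hypothetical stand-in for Dan's modules, but the shape is exactly what his workflow did:

```python
# A tightly-coupled chain: every step assumes the one before it succeeded.
def pull_from_quickbooks() -> list[dict]:
    return [{"id": "txn_1", "amount": 250.00}]   # stand-in for the real API call

def write_to_airtable(record: dict) -> None:
    raise ConnectionError("Airtable is down")    # simulating the outage

def generate_pdf(record: dict) -> None:
    print(f"generated PDF for {record['id']}")   # stand-in for modules 10-57

for txn in pull_from_quickbooks():
    cleaned = {**txn, "amount_cents": int(txn["amount"] * 100)}
    write_to_airtable(cleaned)   # raises here, so...
    generate_pdf(cleaned)        # ...this never runs, and no alert ever fires
```

The traceback lands in a log nobody reads. That's the "silently" part.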
Why Complex Automations Fail Harder
Simple workflow (3 steps):
- QuickBooks → Staging Table → Done
- When Airtable goes down: Staging table still has data
- Recovery: Resume from staging
Complex workflow (57 modules):
- QuickBooks → Parse → Clean → Transform → Validate → Enrich → Calculate → Write
- When Airtable goes down: Everything after module 9 fails
- Recovery: Re-run entire workflow or manually process
Amazon has circuit breakers for this. Your 57-module workflow doesn't.
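If "circuit breaker" is new to you, here's the pattern in miniature. This is a generic sketch of the idea, not anything Amazon ships:

```python
import time

class CircuitBreaker:
    """After max_failures consecutive errors, stop calling the service
    for `cooldown` seconds instead of hammering something that's down."""

    def __init__(self, max_failures: int = 3, cooldown: float = 300.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: skipping call")
            self.opened_at = None          # cooldown over, allow one try
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()   # open the circuit
            raise
        self.failures = 0                  # success resets the count
        return result
```

Wrap the risky call as `breaker.call(write_to_airtable, record)` and a platform that's down gets skipped cleanly instead of taking 48 modules down with it.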
I Don't Build Automations. I Build Systems.
Automation thinking:
- How do I connect A to B?
- Can I do it all in one workflow?
Systems thinking:
- Where should I stage this data?
- What happens when Airtable goes down?
- How will we know it failed?
Dan's Workflow: Before and After
Before:
QuickBooks → Parse → Clean → Transform → Write to Airtable → Generate PDF
57 modules. Tightly coupled. No staging.
After:
- QuickBooks → Raw Data Table
- Raw Data → Cleaned Data Fields
- Cleaned Data Fields → Airtable Synced Table
- Button → Generate PDF
Same result. But now:
- When Airtable goes down, raw data is still captured
- When QuickBooks API changes, only step 1 needs fixing
- Failures are obvious, not mysterious
With staging, the process doesn't stop. It waits.
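Here's that staged shape as a rough sketch, with SQLite standing in for the raw data table. The schema and names are mine for illustration, not Dan's:

```python
import sqlite3

db = sqlite3.connect("staging.db")
db.execute("""CREATE TABLE IF NOT EXISTS raw_transactions (
    id TEXT PRIMARY KEY, payload TEXT, synced INTEGER DEFAULT 0)""")

def capture(txn_id: str, payload: str) -> None:
    # Step 1: always lands locally, even while Airtable is down.
    db.execute("INSERT OR IGNORE INTO raw_transactions (id, payload) VALUES (?, ?)",
               (txn_id, payload))
    db.commit()

def sync_pending(push_to_airtable) -> None:
    # Step 3: runs on its own schedule. A failure leaves rows staged, not lost.
    rows = db.execute(
        "SELECT id, payload FROM raw_transactions WHERE synced = 0").fetchall()
    for txn_id, payload in rows:
        try:
            push_to_airtable(txn_id, payload)
        except ConnectionError:
            break   # Airtable is down: everything stays staged for the next run
        db.execute("UPDATE raw_transactions SET synced = 1 WHERE id = ?", (txn_id,))
        db.commit()
```

Capture and sync never block each other. That's the whole point.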
What I've Seen Fail
A financial company had a 100-node automation. Worked for 6 months. Then QuickBooks changed one field name. Took 2 days to debug.
A client came to me with a 40-step Zapier workflow. Asked me to document it so they could maintain it. I couldn't. Rebuilt it as 5 separate automations.
Another had a Make.com scenario handling six different cases in one module. Perfect until case #7 appeared. Had to rewrite everything.
The Platform Reality
Zapier, Make, N8N, Airtable—none are bad tools. But they all have failure modes.
When you build a 57-module workflow that depends on all of them staying up simultaneously, you're building a house of cards.
99.9% uptime doesn't multiply in your favor. Compounded across 57 modules, that's 0.999^57, roughly 94.5% reliability. About one run in eighteen fails.
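The arithmetic, if you want to check it:

```python
# Every module has to succeed for the run to succeed, so uptimes multiply.
modules, uptime = 57, 0.999
print(f"{uptime ** modules:.3f}")   # 0.945 -> roughly one run in eighteen fails
```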
The Questions I Ask
- "What happens when Airtable goes down?"
Not if. When. - "What happens when this API changes?"
QuickBooks, Google, Airtable—they all push updates without warning. - "Where should we stage this data?"
One failure shouldn't kill everything downstream. - "What's the manual backup?"
If this breaks at 2 AM, can your team recover without you? - "How will we know it's broken?"
Silent failures are the worst kind.
What Dan Needed
His previous team built automations: Impressive. Complex. Fragile.
What he needed was systems: Boring. Staged. Resilient.
I Build the Boring Version
Not 57 modules. Maybe 12. Each one with:
- Clear naming ("QBO Transaction Pull" not "Webhook 3")
- Staging before transformation
- Error handling with Slack alerts (sketched below)
- Fallback logic for when platforms go down
- Documentation
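For the error-handling bullet, the pattern is a thin wrapper that makes failures loud. A sketch assuming a standard Slack incoming webhook (the URL is a placeholder):

```python
import requests

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

def alert(step: str, error: Exception) -> None:
    # Incoming webhooks accept a simple JSON payload with a "text" field.
    requests.post(SLACK_WEBHOOK, json={
        "text": f":rotating_light: {step} failed: {error!r}"
    }, timeout=10)

def run_step(name: str, fn, *args, **kwargs):
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        alert(name, exc)   # the failure is loud, not silent
        raise
```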
It doesn't look impressive in screenshots.
But it doesn't multiply failures either.
The Bottom Line
When Airtable goes down for 30 minutes, your system should:
- Continue capturing data elsewhere
- Send you an alert
- Resume automatically when it's back up (see the sketch below)
Not silently fail for three days.
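"Resume automatically" doesn't need anything exotic. A sketch, assuming a `flush_staged` function like the sync step above that raises `ConnectionError` while the platform is down:

```python
import time

def sync_until_recovered(flush_staged, interval: float = 600.0,
                         max_attempts: int = 144) -> None:
    # Checks every 10 minutes for up to 24 hours.
    for _ in range(max_attempts):
        try:
            flush_staged()
            return                # platform is back; staged rows are flushed
        except ConnectionError:
            time.sleep(interval)  # still down; wait and try again
    raise RuntimeError("still down after 24 hours; page a human")
```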
The question isn't "will it break?"
The question is "what happens when it breaks?"
If the answer is "I don't know," you have an automation.
If the answer is "data stages in table X, alert goes to Slack, resume from step 3," you have a system.
Ready to stop fighting failures and start building systems that handle them? Let's talk.