Why Most Automation Fails (Lessons Learned from Google SRE)


Issue #4

A framework for understanding processes before you automate them

The Two-Month "Simple" Task

"Just move the cluster to the new datacenter. How hard could it be?"

That was my thinking when I started what seemed like a straightforward automation project at Google. We needed to move SmartAds' click-through rate prediction system, a critical ML pipeline that helps decide which ads to show users. The manual process existed, people had done it before, so automating it should be simple, right?

Two months later, I was still documenting the process.

Not automating it. Not even starting to write code. Just figuring out what actually happened when humans did this "simple" task.

What I discovered during those two months changed how I think about automation forever. And it's why most automation projects fail before they even begin.

The Iceberg Beneath the Surface

The original process looked straightforward: shut down the old cluster, spin up the new one, redirect traffic. A few configuration changes. Tweaking some knobs.

But as I started documenting each step, I kept uncovering layers of complexity that weren't visible from the surface:

Hidden Team Rituals: The ML engineers didn't just flip a switch to retrain models. They had specific sequences for warming up the new cluster, particular ways they validated model accuracy after a move, and tribal knowledge about which models needed special handling.

Storage: Moving the cluster meant moving huge amounts of training data. Before any of that data could move, we had to set up the underlying storage units and layers, and we had to schedule the work around maintenance windows, backups, and available bandwidth. What looked like "copy files" was actually a choreographed sequence across multiple systems.

Network: Redirecting traffic wasn't just updating a config file. It involved load balancer health checks, draining procedures, and fallback plans for when things went wrong.

Monitoring: The extended team had specific dashboards, alerts, and playbooks that assumed certain cluster configurations. Moving the cluster meant updating dozens of monitoring systems, many of which we only identified once the move was underway.

Downstream Dependencies: Then there are those unfortunate times when you discover that a completely different team depends on your service or datacenter, and no one knew the dependency existed.

Each conversation revealed more dependencies. Each dependency revealed more complexity. What started as a clear task became an intricate ecosystem of interconnected processes, each with its own logic, constraints, and failure modes.
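
One way to keep track of what these conversations turn up is to record each step alongside the steps it depends on, then let a topological sort propose an order. The sketch below is illustrative only: the step names are hypothetical stand-ins loosely based on the areas above, not the actual SmartAds runbook, and it assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical steps. Each key maps to the set of steps that must finish first.
migration_steps = {
    "provision_storage": set(),
    "copy_training_data": {"provision_storage"},
    "warm_up_new_cluster": {"copy_training_data"},
    "validate_model_accuracy": {"warm_up_new_cluster"},
    "update_monitoring": {"warm_up_new_cluster"},
    "redirect_traffic": {"validate_model_accuracy", "update_monitoring"},
    "shut_down_old_cluster": {"redirect_traffic"},
}


def plan_order(steps):
    """Return one valid execution order.

    Raises graphlib.CycleError if the recorded dependencies are circular,
    which is itself a useful discovery about the process.
    """
    return list(TopologicalSorter(steps).static_order())


order = plan_order(migration_steps)
for position, step in enumerate(order, start=1):
    print(f"{position}. {step}")
```

Every newly discovered dependency becomes one more entry in the dictionary, and rerunning the sort shows how the order has to change.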

The lesson: Before you can automate a process, you need to become an archaeologist of that process.

The visible portion of any process is typically just the tip of the iceberg. Beneath the surface lie dependencies, tribal knowledge, exception handling procedures, and cross-team integrations that aren't captured in official documentation but are critical to success.

Why This Matters to You Right Now

Here's the hard truth: most automation projects fail not because the technology is wrong, but because teams skip the understanding phase. They jump straight to "let's automate this" when they should be asking "do we actually know what we're automating?"

The result? Brittle automations that break when edge cases appear. Partial solutions that handle 80% of cases but require manual intervention for the rest. Technical debt from automating the wrong process. And team resistance because the automation doesn't match how work actually gets done.

I've seen this happen across engineering and product teams, marketing departments, and operations groups. The pattern is always the same: good intentions, incomplete understanding, costly failure.

The good news? There's a predictable framework for understanding any process before you automate it. It's the same approach Google's Site Reliability Engineers use to safely automate critical systems. And it works at any organization, in any department.

What I Learned from Two Months of Process Archaeology

After that SmartAds cluster move, I documented the framework that saved the project. It's a four-phase approach that takes you from surface-level documentation, through a thorough understanding of the process, to an automation roadmap.

Here's what you'll learn to do:

  • Uncover the undocumented steps your team actually follows (that nobody wrote down)
  • Ask specific discovery questions that expose hidden complexity
  • Identify automation opportunities vs. steps that need human judgment
  • Build a phased automation plan instead of one risky bet
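
As a rough sketch of the last two points, a step inventory can be tagged by whether each step needs human judgment and how risky it is, then grouped into phases. The fields, the risk scale, the threshold, and the step names here are all assumptions made for illustration; they are not taken from the toolkit.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    needs_judgment: bool  # True if a person has to make a call
    risk: int             # hypothetical score from 1 (low) to 5 (high)


def assign_phase(step: Step) -> str:
    """Place a step in a phase: judgment calls stay manual, low risk goes first."""
    if step.needs_judgment:
        return "keep manual"
    if step.risk <= 2:
        return "phase 1"
    return "phase 2"


# Hypothetical inventory gathered during discovery.
inventory = [
    Step("copy training data", needs_judgment=False, risk=2),
    Step("update dashboards", needs_judgment=False, risk=3),
    Step("validate model accuracy", needs_judgment=True, risk=4),
]

plan = {}
for step in inventory:
    plan.setdefault(assign_phase(step), []).append(step.name)

for phase in ("phase 1", "phase 2", "keep manual"):
    print(phase, "->", plan.get(phase, []))
```

Automating the low-risk, judgment-free steps first keeps each phase small enough to roll back, which is the point of a phased plan over one risky bet.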

This isn't theoretical. It's based on real experience automating mission-critical systems at scale.

Once I understood the full process, the actual automation became straightforward. More importantly, it was robust. It handled edge cases gracefully because I knew they existed. It integrated smoothly with other teams because I understood their constraints. And it was adopted quickly because it matched how work actually got done.

Ready to Apply This to Your Next Project?

The teams that skip the archaeology phase end up spending far more time fixing broken automations than they would have spent understanding the process upfront.

The next time you're tempted to dive straight into automation tools, ask yourself: "Do I really understand this process, or do I just think I do?"

I've created the Process Archaeology Toolkit to help you answer that question confidently. It's a free PDF that walks you through the exact framework I developed at Google, with discovery questions, assessment tools, role-specific guides, and a phased automation roadmap you can adapt to your specific process.

Get the Process Archaeology Toolkit

Download it, walk through the framework with your team, and discover what you've been missing about your processes. Your future self and your teammates will thank you for taking the time to get this right.


Looking For Personalized Guidance?

Want to know which of your processes are ready to automate? Book a free 30-minute strategy call to map out your biggest opportunity.

Schedule Your Call

One More Thing

Want to share your own process archaeology discoveries? I'd love to hear about the hidden complexities you've uncovered when automating tasks that seemed "simple" on the surface. Reply to this email. I read every message!

Let's Chillaborate!

Dina


About this series: This article is part of "Automate Yourself", a podcast and newsletter exploring how professionals can use this era of automation and AI as an opportunity for career growth and reinvention. We share practical guides, real-world automation stories, and mindset shifts to help you turn uncertainty about AI into opportunity.

Subscribe here for more stories, tips, and frameworks.

More from Chill Labs

Dina Levitan was recently featured on the latest episode of the Product Science Podcast. Check it out on Spotify!


Chill Labs

Chill Labs is a boutique consultancy helping companies think strategically, solve business problems, and streamline operations utilizing Product Management, Software Engineering principles and AI. Combining a decade of experience running complex, globally distributed software products with expertise in product discovery, user research, and strategy, Chill Labs helps companies build products that users want and do so in a way that supports growth and scale. Dina Levitan, Founder and Principal at Chill Labs, based out of Seattle, WA, brings over 15 years of experience as a product and technical leader ranging from startups to companies like Google.
