Resiliency Through Purposeful Chaos: Gremlin’s Failure-as-a-Service Platform Helps Engineers Proactively Avoid Disaster

TL;DR: Gremlin’s chaos engineering techniques empower users to safely and proactively identify weaknesses in their systems and fix them before they become a problem. By intentionally stressing systems in various ways, the company ultimately transforms failure into resilience. With additional resources offered through the Gremlin community, the company is creating opportunities for users all over the world to build more reliable software.

As counterintuitive as it may seem to intentionally break your technology in the name of reliability, a new approach to DevOps suggests doing just that. Chaos engineering, a disciplined method of injecting harm into a system to bring weaknesses to light, is making an impact on the way we improve reliability in the software engineering space.

In fact, the discipline’s popularity has soared during the last few years. Just a decade ago, when Kolton Andrus joined Amazon as a Software Development Engineer, the approach still lacked a formal name.

“One of my first projects involved this idea of proactive failure testing for infrastructure,” Kolton said. “We did our homework and built a robust self-service system with many different failure modes, an API, a user interface — the whole gamut.”

The system proved effective at helping developers identify and address weaknesses around network partitions and consistency, which boosted uptime and availability. After four years, Kolton took what he learned at Amazon to Netflix, where he focused on building a proactive failure testing platform for applications. According to Kolton, that effort took uptime from 99.9% to 99.99%.
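To put those two uptime figures in perspective, here is a minimal sketch that converts an availability percentage into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration, not figures from Kolton.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, 99.9% uptime leaves room for roughly 526 minutes (about 8.8 hours) of downtime a year, while 99.99% leaves only about 53 minutes.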

Gremlin helps businesses proactively weed out risk, preventing costly failures.

Kolton saw his early successes at both Amazon and Netflix — plus the industry’s shift toward the cloud and containerization — as signs that chaos engineering would prove valuable as a service. In 2016, he joined forces with former Amazon colleague Matt Fornaciari, and the pair founded Gremlin.

Safely and Securely Identify Weaknesses in Your System

Kolton said Gremlin’s engineering team is made up of top talent from the likes of Amazon, Google, Netflix, and Dropbox. The company spent its first year building out the Gremlin platform, getting it in the hands of customers, soliciting feedback, and making modifications as necessary. It spent the second year focused on internal expansion as the staff ballooned from a dozen people to nearly 75.

“Now we’re at the point where we’re seeing the market open up — people are embracing the idea of chaos engineering,” Kolton said. “We’re on our third iteration of building a great product and really helping customers address their pain points.”

Gremlin makes it safe and easy to uncover weaknesses in the system before they become problematic.

Kolton said it’s no longer a matter of whether businesses should adopt chaos engineering — it’s a matter of how. And that’s where Gremlin comes in.

“As we go out to the broader market and we’re talking to engineers who don’t have as much experience in this space, what they’re really looking for is guidance,” he said. “And I think it’s been great for us because we collectively know how we achieved what we did at Amazon, Netflix, Google, or Dropbox, and now we’re making it work at ‘normal’ companies.”

Gremlin’s chaos engineering platform draws on an ever-growing library of attacks to recreate almost any failure scenario a business might encounter in production, revealing how the technology being tested will behave in the face of failure. The process is built with safeguards: If something unexpected happens during testing, Gremlin’s safety features automatically halt the experiment and return the system to a steady state.
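The general idea of an experiment that halts itself can be sketched in a few lines. This is not Gremlin’s API or implementation; every name below (run_experiment, ExperimentResult, the 5% error-rate threshold, the simulated system) is a hypothetical stand-in chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ExperimentResult:
    completed: bool   # True only if every step ran without tripping the guard
    steps_run: int
    reason: str


def run_experiment(
    inject_steps: Iterable[Callable[[], None]],
    read_error_rate: Callable[[], float],
    rollback: Callable[[], None],
    max_error_rate: float = 0.05,  # hypothetical abort threshold
) -> ExperimentResult:
    """Apply failure-injection steps one by one, halting if the system degrades too far."""
    steps_run = 0
    for step in inject_steps:
        step()
        steps_run += 1
        # Check the health metric after each step; abort and restore if it is exceeded.
        if read_error_rate() > max_error_rate:
            rollback()
            return ExperimentResult(False, steps_run, "halted: error rate above threshold")
    # Always restore the steady state, even when the experiment finishes normally.
    rollback()
    return ExperimentResult(True, steps_run, "completed: all steps ran")


# Simulated system for demonstration: each injected step adds 3% errors.
state = {"error_rate": 0.0}


def add_failure() -> None:
    state["error_rate"] += 0.03


def reset() -> None:
    state["error_rate"] = 0.0


result = run_experiment([add_failure] * 3, lambda: state["error_rate"], reset)
print(result)
```

In this simulation the second step pushes the error rate past the threshold, so the run stops early and the rollback restores the starting state. A real platform would watch production metrics rather than a dictionary, but the control flow is the same shape.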

Build Resilient Systems and Prevent Costly Outages

There’s no doubt that downtime poses a significant threat to businesses operating in an increasingly online marketplace. According to estimates from the research firm Gartner, the average cost of network downtime is $5,600 per minute, which works out to more than $300,000 per hour.
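As a quick check on those numbers, the sketch below multiplies the per-minute figure quoted above by the length of an outage. The function name and the flat per-minute rate are illustrative assumptions; actual costs vary widely from one business to the next.

```python
COST_PER_MINUTE_USD = 5_600  # Gartner average quoted in the article


def downtime_cost_usd(minutes: int, cost_per_minute: int = COST_PER_MINUTE_USD) -> int:
    """Estimate the cost of an outage, assuming a flat per-minute rate."""
    return minutes * cost_per_minute


if __name__ == "__main__":
    for outage_minutes in (1, 15, 60):
        cost = downtime_cost_usd(outage_minutes)
        print(f"{outage_minutes}-minute outage: ${cost:,}")
```

At the quoted rate, a 60-minute outage comes to $336,000.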

In addition to the financial cost, downtime also consumes engineers’ time. “I was recently speaking with a financial services institute on the east coast of the U.S. [about an incident] which caused 75 engineers to get on a call,” Kolton said. “Regardless of how long that call lasted, it was immensely expensive, and then there’s the time and effort looking into the root causes to make sure it doesn’t happen again.”

With a tool like Gremlin, businesses can run mock incidents with a safety net in case things go wrong. The proactive approach helps prevent costly and reputation-damaging outages. And if something does go wrong, it’s better to be prepared.

The platform also serves as a robust training tool.

“When it’s two in the morning, and you have the VP on the phone, you don’t want to ask a dumb question,” Kolton said. “But in the middle of the day, you have an opportunity to practice for any situation.”

Kolton said that investments in digital transformation, such as moving to the cloud or adopting Kubernetes, aren’t cheap — and Gremlin’s goal is to help protect them. In a March 11, 2019, blog post, for example, the company explained that organizations that plan to migrate to the cloud should adopt chaos engineering to test how the system will behave once traffic is switched over. Doing so will significantly reduce the potential for unexpected failure and outages.

Tap Into Additional Resources within the Gremlin Community

Kolton told us Gremlin is committed to drinking its own champagne, a phrase used to signify that a company has enough confidence in its products to use them internally.

“We’re a company focused on reliability, so we’d better have a reliable product,” he said. “To ensure we’re at the top of our game, we run complete failure tests to harden our builds before they go out.”

Gremlin understands that not everyone is confident in running experiments in production. Kolton told us a lot of businesses are concerned about where they stand in relation to their peers when it comes to reliability.

“They’re often a little gun-shy because they think they’re too far behind,” he said. “One thing that I would tell the industry is we’re all fighting the same battle: many of us were in the same position early on and are working our way forward.”

Kolton said he would love to get to a point where businesses are open to discussing their failures so the industry at large can learn from others’ mistakes. To that end, the Gremlin community offers the resources and relationship-building opportunities businesses need to build more resilient systems together.

Between hands-on tutorials, sponsored meetups across the globe, inspiring presentations, and engaging discussion forums, these resources encourage collaboration across the industry. Be sure to keep an eye on upcoming conferences, webinars, and more for an opportunity near you.

Reproduce and Learn from Real-World Outages

Gremlin is currently preparing for Chaos Conf, an inclusive industry event for chaos engineering practitioners and developers that takes place on September 26, 2019, in San Francisco.

The event will also feature keynote presentations from Dave Rensin, Director of SRE at Google; Crystal Hirschorn, VP of Engineering and Cloud Platforms at Condé Nast; and Kolton himself, plus a number of sessions exploring the different aspects of chaos engineering.

Kolton said Gremlin is also announcing a new feature that will empower users to build their own attack libraries to help reproduce real-world outages. “Stay tuned for a big announcement in September,” he said.

Christine Preusler
