How Microservices Saved the Day: Sean McCullough on Explosive Growth at Groupon
How would you like to maintain the eCommerce website for the fastest growing company in the world?
How would you deal with massive, unpredictable spikes in traffic, a patchwork technology stack, and rapidly changing demands from product owners and executives in your company?
It’s a story Sean McCullough knows well.
Today, McCullough is a principal software engineer and senior team lead at Atlassian.
But from 2010 to 2014, he was a senior software developer for Groupon, the fastest growing company in the world at the time.
Keeping Pace with Explosive Growth
In 2010, Groupon’s subscriber list exploded.
In 12 months, the company went from 3.4 million subscribers to 83.1 million subscribers.
It was the job of Groupon’s development team to keep the site running under the massive increase in web traffic.
Groupon’s Growing Tech Debt
Groupon’s development team did what any development team would do in its situation: whatever it took.
“We had this giant monolithic Ruby on Rails application,” McCullough told us. “To maintain developer velocity with a giant monolith like that, you need to have a lot of tooling, and a lot of restraint and engineering practices in place to keep that code base clean.”
“Groupon just didn’t have the resources to do that correctly,” he told us. “We were just trying to keep up.”
Groupon needed a clean, fast, easy-to-change website that could keep up with the flurry of changes happening in the business.
Instead, it had a Frankenstein.
A Variety of Causes
As with most organizations, a variety of reasons—taken together—were causing the company’s problems.
McCullough explained a few of the bigger issues they were facing:
1. Hacks on Hacks
To keep the site running, the development team had implemented a variety of hacks over the years.
The patchwork assembly of hacks made the code base more complex. That, in turn, slowed the development process.
“We always found our developer velocity lagging with our Ruby on Rails stack, and that was probably a result of the way we designed the thing,” McCullough said. “As the company grew, we were trying to hit all these various benchmarks. To be honest, we added hacks on hacks in a lot of areas.”
2. Separate Mobile and Desktop Sites
Complex code forced Groupon to build and maintain two completely separate sites: one for mobile and one for desktop.
By necessity, Groupon took a desktop-first approach.
Features were developed for desktop first. Then—later—those same features were built into the mobile site.
It was necessary given the technology they were working with. But it perennially left customers on mobile without access to some of Groupon’s newest and best features.
3. Limitations of Ruby on Rails
As the site grew, the hardware needed for the Ruby on Rails framework was becoming prohibitively expensive.
“Running a Ruby on Rails site at scale is incredibly expensive,” McCullough said. “We had these giant machines that could only service a few hundred requests per second, which was nowhere near the capacity we needed.”
4. Business Acquisitions
Between 2010 and 2013, Groupon made 29 acquisitions, buying up coupon and daily deal sites all over the world.
Ideally, Groupon would have integrated its acquired sites into the Ruby on Rails framework, making the process consistent across its different properties.
But the speed at which acquisitions were happening and the limitations of Groupon’s Ruby on Rails framework made that path impractical.
5. A Monolithic, Horizontal Org Structure
Like its website, Groupon’s internal organizational structure was monolithic. The team was divided into three basic tiers:
- Back end
- Front end
“The stratification began to break down,” McCullough said. “The problem was that we weren’t shipping as fast as we wanted to. We had this giant queue behind the front-end team to deliver features, and it was taking far too long to ship them.”
The First Change: Switching to Vertical Teams
Groupon needed to move faster. And its management team knew it.
At the core of their strategy to improve was a focus on one key metric: developer velocity.
The first attempt to improve came when the organization switched its internal org chart from a horizontal to a vertical structure.
“Groupon had its main deal business. For example, its Getaways business, a goods business, and several others,” McCullough said. “Each of those was a fundamentally different business, but from a development perspective, we were treating them all the same.”
“We changed things so the goods organization had a completely separate engineering team from our Getaways organization,” he added. “That allowed each business vertical to build and ship their own features as needed.”
At first, the change seemed to work well.
Changing to a vertical structure solved the backlog problem the company had faced. Teams could now build and deliver changes more quickly—without getting backed up waiting on front-end engineers to get to a project.
That was a good start. But, as Groupon quickly found, it still had other problems to solve.
‘Regressions that Lasted for Days’
Switching to vertical teams helped, but it also painfully revealed the technical debt Groupon had accumulated during its period of hypergrowth.
Groupon’s new vertical teams were still working in the same old monolithic code base.
If one team committed a bug or had a regression, it could affect one page, or it could affect every page on the website.
“Sometimes someone would commit a bug or have a regression and it would block everyone else’s changes for days,” McCullough said. “Performance became a big issue.”
Another symptom of the company’s tech debt was that it didn’t have sophisticated tooling. When deploying a new change, the team lacked the tools to understand how the changes would affect the site.
“The complexity of the code got to the point where nobody could really understand the implications of the change until you did it,” McCullough said.
How Microservices Saved the Day
The change to vertical teams was an improvement.
Now Groupon needed to make a similar change to its tech stack.
It needed to rebuild its code base in a way that wasn’t full of hacks.
Most of all, to keep developer velocity high, it needed a site that didn’t break when different teams launched new features or made changes.
Building a Prototype with Node.js During a Hackathon
In 2013, McCullough took action.
He and a fellow engineer used a hackathon day to build a rough prototype of a page of the site using Node.js.
They picked Node.js because it seemed like an obvious choice.
The two developers took their prototype to several influential individuals within the organization to get their feedback. With the blessing of these influencers, they then took the prototype to management.
6 Months to Build a Working Page
Groupon’s management team liked what it saw from McCullough’s Node.js prototype.
They assigned McCullough to a small team of just three developers and gave them six months to build a real, working page of the website using Node.js.
“We had to build something that could replicate the functionality of one of the existing pages on the website,” McCullough said. “We needed to be able to measure the impact of the new platform and compare it to the existing platform in some kind of meaningful way.”
The project worked.
Six months later, McCullough and his teammates had completed a working page of the website.
“That kicked off a real conversation,” McCullough said. “Then management could look at if we had enough information to make a decision to rebuild the entire web platform this way.”
“Eventually, they decided we did.”
A Microservices Approach
Things moved quickly once the decision was made.
With many more developers working on the conversion, the company was able to rewrite its entire front-end code base in roughly four months.
At the same time, server-side engineers were working to upgrade the back-end layer as well.
With its new framework, Groupon was able to break apart many of the functions happening within the app.
Its ordering protocol, for example, became two separate processes: one for ordering and the other for purchasing.
Its API data model broke tasks into many separate microservices in the back end of the site.
“Then we could roll them up into a consistent API model for the front-end,” McCullough added.
The microservices approach worked well for the team, reducing the likelihood that a developer could take down the whole site with a single bug.
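To make the idea concrete, here is a minimal TypeScript sketch of rolling two back-end services up into one response for the front end. It is an illustration only, not Groupon's code: the `Deal`, `Inventory`, and `DealPage` shapes, the `withFallback` helper, and the `buildDealPage` function are all hypothetical names invented for this example, and the services are stubbed as plain functions rather than real network calls.

```typescript
// Hypothetical data shapes; none of these names come from Groupon's actual API.
interface Deal { id: string; title: string; priceCents: number; }
interface Inventory { dealId: string; remaining: number; }

// The single, consistent shape handed to the front end.
interface DealPage {
  deal: Deal;
  remaining: number | null; // null when the inventory service is unavailable
}

// A service call is modeled as any function that returns a Promise.
type ServiceCall<T> = () => Promise<T>;

// Run a service call, returning a fallback value if it fails
// instead of letting the failure propagate to the caller.
async function withFallback<T>(call: ServiceCall<T>, fallback: T): Promise<T> {
  try {
    return await call();
  } catch {
    return fallback;
  }
}

// Aggregate two independent services into one page model.
// The deal itself is required; inventory is treated as optional.
async function buildDealPage(
  fetchDeal: ServiceCall<Deal>,
  fetchInventory: ServiceCall<Inventory>,
): Promise<DealPage> {
  // Both calls are started before either is awaited, so they run concurrently.
  const [deal, inventory] = await Promise.all([
    fetchDeal(),
    withFallback<Inventory | null>(fetchInventory, null),
  ]);
  return { deal, remaining: inventory ? inventory.remaining : null };
}

// Stub services standing in for real back-end calls (assumed for illustration).
const fetchDeal: ServiceCall<Deal> = async () => ({
  id: "d1",
  title: "Half-price pizza",
  priceCents: 1000,
});
const brokenInventory: ServiceCall<Inventory> = async () => {
  throw new Error("inventory service down");
};
```

Calling `buildDealPage(fetchDeal, brokenInventory)` still resolves with the deal and a `remaining` value of `null`. In this sketch, a bug in one service degrades one part of the page rather than the whole site, which is the kind of isolation the article describes. Which services count as optional is a design decision this example simply assumes.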
This was the model Groupon had been looking for—one that allowed its developers to quickly make changes without breaking other areas of the site.
It also allowed Groupon to bring its mobile and desktop sites together in one framework for the first time.
And just in time too.
“After the rebuild, our mobile applications and our website were at parity in a way they never were before,” McCullough said. “And it happened just as we hit the inflection point where mobile became the dominant part of our traffic.”
Sean’s Advice to Others
At the end of our conversation, we talked with McCullough about some of the broader lessons he learned from his experience working on Groupon’s site.
Here’s what he had to say:
1. Choose Tools Based on Your Goals
At the very beginning, when the team was considering whether to use Node.js or other tools, they thought about holding a bake-off to see which tool was best.
Ultimately, they decided against that approach.
“A bake-off wasn’t going to tell us what we needed,” McCullough said. “We wanted to optimize for developer velocity. We could have shown how many requests per second can we pump through this particular language or runtime. But that’s not what we needed. The results of a bake-off wouldn’t have been material to what we were trying to evaluate.”
Instead, they went with Node.js because of the advantages it provided for talent Groupon already had in house.
2. Don’t Fall into the Trap of Being a “Ruby on Rails” or “Java” Shop
At the time, Groupon was known as one of the biggest Ruby on Rails sites. After the transition to microservices, however, the company broke away from being primarily a “Ruby on Rails” shop.
“I think a lot of companies tend to feel like they need to pick one thing and stick with it—even if it’s not exactly the right fit for what you’re trying to accomplish,” McCullough said. “As the technology industry diversifies, it doesn’t mean you have to adopt every single technology. But also realize that there are some technologies that have specific benefits over the ones you might be used to.”