The Great Cloud Migration

The cloud has quickly become the de facto platform for launching apps and services, in both the B2B and B2C spaces. Today’s service providers can quickly offer the necessary computing power and scalability, as well as increased flexibility and speed-to-market. But what about an existing offering that’s been around for a while? What’s the best approach for migrating it—or even multiple apps and services—to the cloud?

It’s tempting to focus lots of energy on decisions about technology and architecture choices. Certainly those matter (a lot), for a variety of reasons. But migrating a product to the cloud also involves major changes to business processes and company culture that require buy-in across the organization. If that doesn’t happen, you could be in for a bumpy ride resulting in a half-baked product that frustrates your development team and your customers.

SparkPost’s Cloud Migration Journey

I’d like to share some of my own company’s experiences with this sort of transition. SparkPost released the first beta version of our cloud-based email delivery service over three years ago. When we launched, a handful of customers sent a few million emails a month. Now, our API is used by tens of thousands of customers—including Pinterest, Zillow, Workday, and Intercom—to send more than 16 billion emails a month.

In that time, our business made the shift from providing on-premises email infrastructure to operating as a fully cloud-based email delivery service.

I recently wrote about some of the decisions we made regarding service architecture and API design (and evolution) in an article at DevOps.com. But as I suggested above, the tech considerations were only part of what we needed to decide. So, today, I thought I’d take a step back and look at some basic questions that informed those development choices.

Here are four questions we needed to ask before embarking on our cloud migration adventure. They might also help you make sure your newly cloud-enabled apps and services can lead lives full of innovative updates from your team that will keep your customers happy.

1. Why do you want to do it?

“Because other companies are doing it” isn’t the best response here. “Because the rest of our company is doing it” is a better one, especially if there’s a push for a standard architecture across all of your offerings. Here are some other good reasons to do it:

  • You want to reduce operational overhead and the need to manage underlying infrastructure
  • You’d like more agility and flexibility when it comes to deploying (or decommissioning) services
  • You need elastic resource scaling because customer demand is increasing, and sometimes it really spikes
  • You need a disaster recovery system for a data center, and the cost to implement a physical setup is daunting
  • You want to expand into new geographic territories and the thought of setting up infrastructure in each one fills you with dread
  • A data center lease is expiring in several months, so an opportunity has opened up

Whatever the reason is, just make sure it’s a compelling one. You don’t want to find yourself knee-deep in a migration project only to wonder why you’re even doing it.

2. Have you considered the potential drawbacks?

I’m all-in on the cloud. But there are trade-offs—and some apps or services face extra challenges in a cloud setup. You may want to be careful of issues such as:

  • The virtual network and hardware paradigm is radically different from classic models developed in the on-premises world. Do you have expertise in-house to avoid making fundamental mistakes early on?
  • The costs of cloud infrastructure can be surprisingly high, especially if you try to simply replicate a traditional server model rather than architecting and fine-tuning for the cloud
  • The cost structure of the cloud can change how you perform financial calculations such as capitalization and amortization
  • You store sensitive data and could face additional questions, processes, and changes to your operational tooling due to privacy laws and other regulations

3. Are you prepared to audit your apps and services?

Before you say “yes” to a cloud migration, conduct a thorough audit of the apps and services on your list, along with any related technologies used at your company. You’ll soon discover that some migrations aren’t too complicated, while others will be heavy lifts.

Whenever possible, start by migrating the low-complexity apps and services first, which will not only give you some morale-boosting quick wins but also offer valuable lessons for the next products on the list. You may learn along the way that your high-complexity apps require more effort than they’re worth – it could even make sense to retire ones that really don’t serve a purpose anymore, or whose functionality could be covered by another app that runs on a modern platform. That’s especially likely to be true in the case of highly demanding functions that require specialized infrastructure and expertise. These use cases might be better served by a specialist provider (such as in the case of running email in the cloud).
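
As a rough illustration of that ordering, here is a minimal sketch in Python. The `App` fields and the 1-to-5 complexity score are hypothetical stand-ins for whatever your own audit produces; this is not a description of any particular company's process.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    complexity: int     # hypothetical audit score: 1 (simple) to 5 (heavy lift)
    still_needed: bool  # False if the audit found no remaining purpose


def plan_migration(apps):
    """Split audited apps into a migration order and a retire list."""
    retire = [a.name for a in apps if not a.still_needed]
    # Lowest complexity first, so the early migrations are the quick wins.
    migrate = sorted((a for a in apps if a.still_needed), key=lambda a: a.complexity)
    return [a.name for a in migrate], retire


apps = [
    App("billing", 5, True),
    App("status-page", 1, True),
    App("legacy-reports", 4, False),
    App("docs-site", 2, True),
]
order, retire = plan_migration(apps)
print(order)   # ['status-page', 'docs-site', 'billing']
print(retire)  # ['legacy-reports']
```

In practice the score would probably combine several factors (dependencies, data sensitivity, specialized infrastructure), but even a crude ranking makes the "easy ones first" conversation concrete.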

4. Are you prepared to make a serious cultural shift in your company?

If your apps and services have been around for a while, there’s a good chance that some teams have become entrenched in their business processes. Before beginning the technical migration process, make sure you plan out the cultural migration process too.

Success in the cloud requires a move toward loosely coupled, independently scalable services that can take advantage of benefits such as autoscaling, containerization, and serverless technology supported by your cloud provider. Moving in that direction takes a different mindset than traditional development and sysadmin roles in the on-prem world.

This is why most cloud businesses have adopted the DevOps philosophy. DevOps makes it easier to implement continuous deployment and a microservice-based architecture, both of which enable teams to work more independently and at a faster pace. Microservices increase flexibility – you can develop and deploy smaller units of functional value, with the ability to quickly roll back a release if show-stopping errors are found, without impacting other microservices in the product.

Making the Leap

Increased speed, agility, and flexibility are big wins for businesses that move to the cloud. And while the tech decisions are a big part of that change, they’ve got to be made with a good understanding of cultural and process issues as well.

I hope thinking about these questions will help you begin the cloud conversation at your company and lay the foundation for a successful migration to the cloud. Got questions about cloud migration? Send us a tweet!

-Chris

P.S. More food for thought.

Last year, I was a guest on Jeffrey Meyerson’s Software Engineering Daily podcast. He asked me some really thoughtful questions about how we built the SparkPost service for the cloud. We touched on these kinds of questions and also got into the technology choices and challenges. I enjoyed the conversation, and if you’re in the midst of a cloud migration yourself, I think you might enjoy it too.

SparkPost’s CTO and co-founder, George Schlossnagle, also wrote about some of what he learned about these challenges. His piece has lessons about the evolution of an on-premises infrastructure like an MTA from his perspective as both a technology and a business leader.

 

In Part 1 of this series, I reviewed our initial agile adoption and move to the cloud. Then, in Part 2, I reviewed how we adopted continuous delivery and deployment automation to become more nimble. In this final part of the series, I share lessons we learned as our service rapidly grew in 2016, along with some of what we have in store for this year.

3rd Generation Infrastructure

We learned many valuable lessons operating email infrastructure at scale in AWS, and by mid-2015 we had completed the move to our third generation of infrastructure. Now we properly leverage Amazon’s VPCs, security groups, ELBs, EC2 instance types, CloudFormation provisioning instead of Terraform, and EBS and ephemeral storage, and we get even more service resilience from clusters spanning multiple availability zones. We switched from Nagios to Circonus for monitoring. An outbound email proxy now separates the management of outbound IP addresses from the MTAs, which allows us to add more MTAs independently of the number of IP addresses we send email from.
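
To show why that separation helps, here is a toy sketch in Python. It is only an illustration of the idea, not our proxy: the `OutboundProxy` class, the round-robin IP selection, and the MTA names are all hypothetical, and the addresses come from the reserved documentation range.

```python
from itertools import cycle


class OutboundProxy:
    """Toy proxy: it owns the sending IPs so the MTAs do not have to."""

    def __init__(self, sending_ips):
        # Round-robin is an assumption made for this sketch only.
        self._ips = cycle(sending_ips)

    def relay(self, mta_name, message_id):
        # Any MTA can hand off a message; the proxy picks the source IP.
        return {"mta": mta_name, "message": message_id, "source_ip": next(self._ips)}


proxy = OutboundProxy(["192.0.2.10", "192.0.2.11"])
mtas = ["mta-1", "mta-2", "mta-3"]  # more MTAs than IPs is fine
for i, mta in enumerate(mtas):
    print(proxy.relay(mta, f"msg-{i}"))
```

Because the MTAs never hold an IP address themselves, adding a fourth MTA here means appending a name to the list and nothing else.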

Improved Operations

During this time we formalized on-call schedules across not just the operations team but also the development teams, with the understanding that everyone shares responsibility for the health of our production environment. This was increasingly important since most changes were deployed to production using automated deployment pipelines built by the dev teams. Our Deliverability team and Technical Account Managers also use Opsgenie for on-call rotations. Besides shortening the resolution time for production issues, this approach empowered the development teams to make the improvements needed to minimize, and often eliminate, the sources of production issues, resulting in a much more reliable service.

We improved a number of important processes, including Change Management (CM) and Root Cause Analysis (RCA). Our CM procedures cover all deployments and changes to our production environments. Change Management helps prevent negative impact to customers by enforcing a thorough testing and review process for all production changes. This approach has greatly improved transparency and risk management for us and reduced the number of off-hours fire drills. Not all CM tickets are the same: we account for differences between standard and emergent changes, and we do not require separate CM tickets for changes deployed through our automated deployment pipelines.

Our RCA process helps us properly identify the root cause of customer-impacting events and follows the “5 whys” approach. We don’t use RCAs to place personal blame but focus on the corrective actions instead – technology, process, or training – to ensure we do not fail the same way twice.

It’s important we optimize our time to find and fix a bug in production rather than slow things down too much in a futile effort to prevent all bugs with testing. We use our continuous delivery and deployment processes to quickly fix and deploy a patch confidently.

Improved Communication

With so many customer-facing changes made by different teams, we need effective internal communication without introducing unnecessary blocking dependencies. A core group from Product Management, Product Owners, Support, and technical team leads has a “scrum of scrums” each week to ensure there is sufficient awareness of coming changes.

To further help spread awareness of changes, we automatically post a daily summary of “customer-impacting” JIRA tickets to an internal Slack channel, and one of our Slack bots automatically posts any major changes there throughout the day. Slack is a fantastic tool, and we use it extensively throughout the company for team, topic, and interpersonal communication.
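
For readers who want to try something similar, here is a minimal sketch in Python of how such a daily summary could be assembled. The ticket fields (`key`, `summary`, `customer_impacting`) are hypothetical, and the sketch only builds the JSON body in the simple `{"text": ...}` form that Slack incoming webhooks accept; it makes no network calls and is not our actual bot.

```python
import json


def build_daily_summary(tickets):
    """Return a JSON payload summarizing the customer-impacting tickets."""
    impacting = [t for t in tickets if t["customer_impacting"]]
    if not impacting:
        text = "No customer-impacting changes today."
    else:
        lines = [f"- {t['key']}: {t['summary']}" for t in impacting]
        text = "Customer-impacting changes today:\n" + "\n".join(lines)
    return json.dumps({"text": text})


# Hypothetical tickets, as they might look after being pulled from a tracker.
tickets = [
    {"key": "API-101", "summary": "New field in webhook events", "customer_impacting": True},
    {"key": "OPS-202", "summary": "Rotate internal build cache", "customer_impacting": False},
]
print(build_daily_summary(tickets))
```

Keeping the message building separate from the posting step makes the filter easy to test before anything reaches a real channel.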

We also improved how we communicate changes to customers. New methods include a public change log, an #announcements channel in our community Slack team, and a status page.

#MandrillMigration

By early 2016, when MailChimp announced they were discontinuing their developer-oriented standalone Mandrill transactional email service, we were in an excellent position to fill that gap for developers and offer them a much-needed alternative.

As these developers came over to SparkPost in droves, we were able to easily scale our platform and quickly add features that were in high demand, including subaccounts and dedicated IP pools.

Additionally, we ramped up our Developer Relations team to support this influx of developers. This team is grounded in the SparkPost Engineering department. The team’s mission is to support developers through client libraries, tools, content and even *gasp* direct human interaction. We love to interact with our developer community at various events, hackathons and on our community Slack team. You can find our upcoming events on developers.sparkpost.com.

SRE

In the summer of 2016 we collapsed Tech Ops into Engineering to improve collaboration and efficiency. At this time we created a new Site Reliability Engineering (SRE) team that incorporated a number of functions into a single cross-functional team within Engineering. This finally broke down any remaining walls between development and operations. This new team fully embraces the “infrastructure as code” approach and has oversight of cloud infrastructure, deployments, upgrades, monitoring, and auto-scaling while promoting discipline, safety, and a positive customer experience.

Due to the efforts of this team, we made significant improvements in our availability and reduced the number of customer-impacting events. We also built out more comprehensive and actionable monitoring and alerting, which has improved the overall customer experience – and boosted the morale of the team.

The Road Ahead

Besides lots of great features on the roadmap, we are rolling out the fourth generation of our AWS infrastructure. This latest iteration includes further decoupling of service tiers, improved automation and monitoring, replacing or augmenting some traditional technologies with AWS services such as SQS and DynamoDB, advances in auto-scaling of service tiers, and improvements in our outbound email proxy technology. We have some API performance improvements in the works as well, which users will love. We will complete the final transition of application configuration management from Puppet to Ansible and fill in any system configuration management gaps with Puppet. Early experiments with Amazon’s container services should begin to make their way into production. All of these advances will help make SparkPost APIs even more reliable, scalable, and fast.

This year we will likely move some services to continuous deployment. This means eliminating the manual button pushing for the final production deployment step. There is a little more work to do on our automated smoke tests and rollback scripts before we are ready, and we are exploring options such as canary releases.
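
To make the canary idea concrete, here is a minimal sketch in Python of the kind of automated gate that could replace the manual button. The thresholds, the function name, and the three outcomes are assumptions chosen for illustration, not a description of our pipeline.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=2.0, min_requests=100):
    """Compare canary and baseline error rates; return 'wait', 'promote', or 'rollback'."""
    if canary_requests < min_requests:
        return "wait"  # not enough canary traffic to judge yet
    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests
    # A small floor keeps a zero-error baseline from failing the canary on one stray error.
    allowed = max(baseline_rate * max_ratio, 0.001)
    return "promote" if canary_rate <= allowed else "rollback"


print(canary_decision(10, 10000, 1, 1000))   # promote
print(canary_decision(10, 10000, 30, 1000))  # rollback
print(canary_decision(10, 10000, 0, 50))     # wait
```

A real gate would likely look at latency and smoke-test results as well as error rates, but the shape is the same: the new release takes a small share of traffic, and the rollback script runs automatically if it looks worse than the baseline.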

Through disciplined DevOps practices and a lot of hard work over the past few years, the SparkPost team has achieved tremendous success. But we realize this is a journey and there is always more to do as we continue to scale up and move faster.

If you have any questions or comments about continuous delivery and DevOps at SparkPost, please don’t hesitate to connect with us on Twitter – and we are always hiring.

Chris McFadden
VP Engineering and Cloud Operations
