Configuration Management and Provisioning
At SparkPost, our needs have changed over time, along with our understanding of the best tool for each job. Two areas where this has evolved significantly are configuration management and provisioning. After some trial and error, we settled on CloudFormation for provisioning and a mix of Puppet and Ansible for configuration management. Hopefully what we learned from our experiences can help you select the right tools for your environment.
Puppet vs Ansible
A few years ago we were using Puppet for both OS and application configuration management. Over time we discovered some challenges with this. First, Puppet is designed around scheduled agent runs rather than on-demand changes. Second, Puppet gave us no way to test a change and then roll it back if there was a problem. Finally, Puppet was not very accessible to the developers at SparkPost and not well suited to their use cases; it’s an Ops tool.
We explored using Ansible for application-level configuration. Ansible behaves fundamentally differently from Puppet: while Puppet maintains state, Ansible is good at transforming it. We can run Ansible on demand. We also like its flow control, which lets us test a change and roll it back immediately. Most importantly, our application developers can maintain Ansible playbooks in concert with code and database changes. Overall, it integrates very well with our Continuous Integration and Deployment pipeline.
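To make the test-and-rollback idea concrete, here is a minimal Python sketch of the control flow. It is not Ansible and not our playbooks; Ansible's block/rescue construct is one way to express the same pattern in a playbook. The function name `deploy_with_rollback` and the in-memory `state` dictionary are hypothetical, chosen only so the example runs on its own.

```python
from typing import Callable


def deploy_with_rollback(
    apply_change: Callable[[], None],
    health_check: Callable[[], bool],
    roll_back: Callable[[], None],
) -> bool:
    """Apply a change, verify it, and undo it if verification fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    try:
        apply_change()
        if not health_check():
            raise RuntimeError("health check failed after deploy")
    except Exception:
        # Any failure in the apply or verify step triggers the rollback,
        # similar in spirit to a rescue section in an Ansible block.
        roll_back()
        return False
    return True


if __name__ == "__main__":
    # Hypothetical stand-in for a server's deployed version.
    state = {"version": "1.0"}

    kept = deploy_with_rollback(
        apply_change=lambda: state.update(version="2.0"),
        health_check=lambda: False,  # simulate a failed smoke test
        roll_back=lambda: state.update(version="1.0"),
    )
    print(kept, state["version"])
```

The point of the sketch is that the deploy, the check, and the rollback live in one on-demand run, which is what a scheduled convergence model does not give you.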
Our next iteration was to gradually split the configuration management problem between the pieces managed by our Ops team with Puppet and the pieces managed by the Development teams with Ansible. While this approach “worked”, we found it hard to understand what the actual running config would be, since Puppet would override some config stanzas managed by Ansible. These overlapping responsibilities between Puppet and Ansible were messy and error-prone.
Puppet + Ansible
To resolve these problems, we decided to standardize on Puppet for OS-level configuration only. This includes system tuning, mounted volumes, local users, authentication, etc. In contrast, we now use Ansible exclusively for all application software deployment and configuration. This greatly simplifies things, since we now use each tool for the purpose for which it is intended and well suited. A key catalyst for this breakthrough was an organizational change: we broke down the divisions between development and technical operations. There is now a single Engineering organization at SparkPost, which has helped us overcome the organizational silos that had contributed to a suboptimal approach.
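One way to keep a split like this honest is to check that no configuration item is claimed by both tools. The sketch below is purely illustrative: the `ownership` mapping and the item names are hypothetical examples loosely based on the categories above, not an inventory of our systems.

```python
from collections import defaultdict


def find_overlaps(ownership: dict[str, set[str]]) -> dict[str, list[str]]:
    """Return every config item that more than one tool claims to manage."""
    claimed_by = defaultdict(list)
    for tool, items in ownership.items():
        for item in items:
            claimed_by[item].append(tool)
    return {
        item: sorted(tools)
        for item, tools in claimed_by.items()
        if len(tools) > 1
    }


if __name__ == "__main__":
    # Hypothetical ownership map: OS-level items for Puppet,
    # application-level items for Ansible.
    ownership = {
        "puppet": {"sysctl", "mounted_volumes", "local_users", "authentication"},
        "ansible": {"app_package", "app_config"},
    }
    # An empty result means the responsibilities do not overlap.
    print(find_overlaps(ownership))
```

If an item such as `app_config` appeared under both tools, the function would report it, which is exactly the situation that made the running config hard to predict.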
Another area where we have evolved our use of tools is cloud provisioning. We first started in AWS by building out all of our provisioning scripts by hand. This worked well until we had to quickly scale out.
We needed better automation. We chose Terraform to provision resources in AWS because it is a vendor-agnostic tool. Terraform uses a layer of abstraction that was initially very attractive but over time became problematic for our use case. First, Terraform keeps local state files that need to be distributed or otherwise available every time a change is made. Second, Terraform at the time lacked native AWS support for some important aspects of our infrastructure, which required extra code to implement workarounds outside of Terraform. This negated many of the promised benefits of using Terraform. Finally, we simply did not need the complexity and layers of abstraction that come with Terraform.
We eventually went down the path of creating a Python tool suite that uses CloudFormation for AWS. By this time, CloudFormation was a very mature tool. We wanted to simplify our provisioning by using built-in AWS CloudFormation features. We also started leveraging CloudFormation as the source of truth for our infrastructure. This approach results in significantly less code than we needed with Terraform. Additionally, we are able to handle provisioning failures with far less effort, since we have removed the extra abstraction layers and workarounds.
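As a rough illustration of what driving CloudFormation from Python can look like, here is a minimal sketch. It is not our tool suite. The function names, the `ExampleBucket` resource, and the stack name are hypothetical; the `cfn_client` argument is assumed to behave like the client returned by `boto3.client("cloudformation")`, and only its `create_stack` and `get_waiter` calls are used.

```python
import json


def build_template(bucket_logical_id: str = "ExampleBucket") -> dict:
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Illustrative stack only",
        "Resources": {
            # An S3 bucket needs no required properties, which keeps
            # the example short.
            bucket_logical_id: {"Type": "AWS::S3::Bucket"},
        },
    }


def create_stack(cfn_client, stack_name: str, template: dict) -> str:
    """Create a stack and wait until CloudFormation reports completion.

    cfn_client is assumed to be a boto3 CloudFormation client (or a
    compatible stand-in). Returns the new stack's ID.
    """
    response = cfn_client.create_stack(
        StackName=stack_name,
        TemplateBody=json.dumps(template),
    )
    # The waiter raises if creation fails; by default CloudFormation
    # rolls back the failed stack itself.
    cfn_client.get_waiter("stack_create_complete").wait(StackName=stack_name)
    return response["StackId"]


if __name__ == "__main__":
    # Printing the template requires no AWS access.
    print(json.dumps(build_template(), indent=2))
    # With boto3 installed and credentials configured, a real call
    # would look like:
    #   import boto3
    #   create_stack(boto3.client("cloudformation"), "example-stack",
    #                build_template())
```

Because the template is the full description of the stack and CloudFormation tracks the resulting resources itself, there is no separate state file to distribute.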
Learning to Keep It Simple
After experimenting with various configuration management and provisioning tools and approaches, we learned how important it is not to get too attached to any particular tool. We needed to be flexible and change our tool selection and use as our understanding of the specific use cases evolved. Planning too far ahead, and picking tools for needs you might have in the future but do not have in the near term, can end up wasting a lot of time. We also experienced firsthand the pitfalls of letting organizational structure drive the selection and use of tools. These lessons helped us arrive at a simpler, more modular approach that takes advantage of the technologies most aligned with our specific needs.
You can learn more about our DevOps journey, and if you have any questions about cloud provisioning and configuration management in AWS, don’t hesitate to connect on Twitter. Our SRE team is also always hiring.
VP Engineering and Cloud Operations
Many thanks to John Peacock and Leonard Lawton on our SRE team for their input to this blog post.