When we configured Trulia’s desktop and mobile web sites to serve requests exclusively over HTTPS, we felt intense elation, closely followed by a collective sigh of relief. After many months of work, the project was finally over.
Enabling HTTPS for our sites was not technically challenging; rather, the challenge was rallying and coordinating across many of our teams. It was a collective effort spanning Trulia’s Dev, QA, IT Ops, Ad Ops and other business units, and I strongly believe that if it weren’t for our culture and everyone’s willingness to work together, it would have taken twice as long and been even more painful. Here’s a look at how we migrated from HTTP to HTTPS, the technologies we used and some of our key learnings from the journey:
Our approach
While this migration was a technical project, we knew that to succeed we needed clear and consistent communication. Communication is key to establishing the trust that’s needed among the different teams involved in large, complex projects like this one. So we led with that philosophy: business choices were constantly communicated to all stakeholders, and regular status updates were shared so that teams could act quickly when needed. In fact, after much back and forth, one of the most powerful things we did was set a go-live date, communicate it to all stakeholders and march toward it. This level of transparency is critical when you’re working on large changes like this migration, and it’s something I recommend to any organization considering a similar move.
Additionally, to keep all teams on track, we made an effort to break things down into smaller tasks and establish clear due dates. We started discussing this project in 2015, but it wasn’t until early 2016 that we started taking our first steps. We met weekly, and when we couldn’t meet in person, we updated Wiki pages and sent status emails, so no one ever missed a beat.
Technologies used
There were a lot of technical changes that the Dev teams worked hard on for this migration, but from an Ops standpoint, there were a few interesting things to call out:
- We switched to an open source load balancer solution. We didn’t do this only for the cost savings; we also love open source and try to find vendor-agnostic solutions whenever possible.
- We migrated our CDN to Amazon CloudFront and were happy to discover that HTTP/2 support had recently become generally available. So, when we enabled HTTPS we also got HTTP/2 “for free.”
- We used Packer, Puppet and Terraform to build and deploy AMIs that handled some custom traffic routing features that weren’t available in CloudFront.
- We used Terraform to create and manage CloudFront distributions, making it easy for the rest of the Ops team to understand what was going on when reviewing pull requests.
- AWS allowed us to quickly prototype different ideas to find out what didn’t work, and the knowledge we gained from roadblocks allowed us to quickly come up with the best solution for our needs.
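This post doesn’t show our actual routing rules, but as a rough illustration of what “exclusively over HTTPS” means at the routing layer, here is a minimal Python sketch of the decision a load balancer or edge rule makes for each incoming URL. The function name `https_redirect_target` and the example URLs are hypothetical, and the sketch ignores ports and IPv6 literals for simplicity.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def https_redirect_target(url: str) -> Optional[str]:
    """Return the HTTPS URL a plain-HTTP request should be redirected to.

    Returns None when the request is already HTTPS and needs no redirect.
    Hypothetical helper for illustration only; any explicit port is dropped.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        return None
    host = parts.hostname or ""
    # Preserve path, query string and fragment so deep links keep working.
    return urlunsplit(("https", host, parts.path or "/", parts.query, parts.fragment))


if __name__ == "__main__":
    print(https_redirect_target("http://www.example.com/for_sale?beds=2"))
    print(https_redirect_target("https://www.example.com/"))
```

In practice this logic usually lives in load balancer or CDN configuration rather than application code; writing it out as a function simply makes the expected behavior easy to test.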
Key lessons
Plan for the unexpected: As Dwight D. Eisenhower once said, “Plans are worthless, but planning is everything,” and it’s absolutely true, especially in Ops. Two days before launch, we uncovered a dependency that we couldn’t have planned for, because it only became visible once we actually started the migration. It could have been a huge hurdle for us, but fortunately, the process we had put in place helped us quickly resolve the issue and move on.
If you’re too focused on trying to come up with the perfect plan that accounts for every scenario, as soon as something unplanned happens, you’ll be caught flat-footed. The key is to focus on the things that are mission critical and plan to be nimble.
Recognize the importance of stage mirroring production: This sounds like a no-brainer, but over time, our Stage and Production environments at Trulia drifted apart. Historically, our front-end Stage environment bypassed the load balancer, so we weren’t testing any of the load balancer rules. With everyone moving full steam ahead, we unfortunately didn’t address this until late in the project. The good news is that we did finally fix it, and doing so allowed us to catch some large issues before it was too late.
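One lightweight way to notice this kind of drift earlier is to compare the rule sets of the two environments automatically. The following is only a sketch under assumptions: it supposes each environment’s rules can be exported as a simple name-to-value mapping, and the function name `diff_rules` and the sample rules are hypothetical, not taken from our setup.

```python
from typing import Dict, List


def diff_rules(stage: Dict[str, str], production: Dict[str, str]) -> Dict[str, List[str]]:
    """Report rule names that differ between Stage and Production.

    Hypothetical helper: assumes rules are exported as {name: definition}.
    """
    stage_names, prod_names = set(stage), set(production)
    return {
        # Rules Production has but Stage never exercises.
        "missing_in_stage": sorted(prod_names - stage_names),
        # Rules that exist only in Stage.
        "extra_in_stage": sorted(stage_names - prod_names),
        # Rules present in both but defined differently.
        "changed": sorted(
            name for name in stage_names & prod_names
            if stage[name] != production[name]
        ),
    }


if __name__ == "__main__":
    # Illustrative sample data only.
    stage = {"redirect_http": "301 to https", "gzip": "on"}
    production = {"redirect_http": "302 to https", "gzip": "on", "hsts": "max-age=31536000"}
    print(diff_rules(stage, production))
```

A report with all three lists empty would mean the exported rules match; anything else is a prompt to investigate before launch rather than after.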
Be prepared for post-launch issues: We always try our best to ensure our users get the best experience possible at Trulia, but unfortunately mistakes happen. After we enabled HTTPS, a number of Trulians reported issues. Some were not related to this migration, but because we had created dedicated post-launch teams, we were able to discover and fix problems quickly. I would recommend communicating ahead of time how you want your team members to report issues, and designating a couple of people, one from Dev and one from Ops, to triage the incoming issues.
This project highlighted the importance of having a generative organizational culture. Being performance-oriented, transparent and always communicating your intent along the way ensures that everyone focuses on the end goal instead of picking apart the idea or not engaging at all. Now, get out there and encrypt your sites. Start writing a plan and prepare to throw it away, but above all else, have fun doing it!