Recently we had the pleasure of sitting down for an extended and insightful chat with Luke Melia, Co-Founder and CTO of Yapp. Yapp has been experiencing some incredible growth, and Luke shared his thoughts on a number of related topics, including getting a business started on cloud, scalability, and much more. This is the first of two parts. Check out Part 2 here.
Thoughts? Let us know on Twitter @CloudGathering.
Gathering Clouds: So tell me about Yapp.
Luke Melia: We began as close as you can get to a garage startup in New York City. Garages are too expensive here, so we worked out of my business partner’s basement. I was the tech half of the business; she is the ‘everything else’ half. We wanted to get a prototype up so we could start raising some money. And I had a Herculean task ahead of me, which was to build a platform that in previous companies I probably would have asked for a team of six to build, largely by myself. So the last thing that I wanted to think about was infrastructure. I tried to push any non-core decisions outside of my company. Much of that involved using services for parts of our stack, open source libraries, and toolkits. One of our key requirements in terms of hosting infrastructure was to use the cloud, and in our case to use Heroku, which is a step up in abstraction, into the platform-as-a-service (PaaS) layer.
So the core driver for outsourcing our infrastructure was being completely overwhelmed — really needing to clear my plate of anything that I didn’t absolutely have to build myself. The open question for me was how long we would be able to grow with those choices. And I was perfectly okay if we learned six months in that Heroku wasn’t going to cut it and we would have to build “a real hosting environment.” What I found in the early days of the business, as we started our private beta, was that there’s a whole bunch of things that I really loved about hosting in the cloud. One of them was simply that the pay-as-you-go model was great for a young company.
We had a private beta in the beginning, but we had very little traffic. At first we let a few people onto the site, then there were a few dozen, eventually up into the hundreds and thousands during the private beta portion of our launch. In those early days it was nonsensical to have more than two of our servers running. And that’s the kind of situation where you realize it’s not even worth building an environment that small if you know that you’re going to need to scale. The other thing that we got indirectly was the set of constraints put on our architecture choices by our hosting. We were in a situation where our hosting resources were essentially ephemeral and we weren’t going to get the same ones twice. So nothing could rely on the file system for anything except temporary files. All of those things ended up — sort of by accident — serving us really, really well, and it made me a convert to the cloud approach. I think the time we realized that was when we started having to scale due to press. During our private beta launch, and our public launch, we had some good press. We were covered in a couple of the tech publications.
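That ephemeral-filesystem constraint boils down to one rule: treat local disk as scratch space only. A minimal sketch of the pattern in Python — the function is illustrative, not Yapp’s actual code:

```python
import os
import tempfile

def process_upload(data: bytes) -> int:
    """Work with a scratch file we never expect to see again.

    On ephemeral hosts (e.g. Heroku dynos), local disk survives only
    until the instance is recycled, so anything durable must go to an
    external store; only temporary files belong on local disk.
    """
    # NamedTemporaryFile lives in the OS temp dir and is deleted on close.
    with tempfile.NamedTemporaryFile(delete=True) as tmp:
        tmp.write(data)
        tmp.flush()
        size = os.path.getsize(tmp.name)
    # Durable output would be pushed to an external store (S3, a
    # database) here, never written to a local path that disappears
    # with the instance.
    return size
```

The payoff Luke describes is that any instance can be replaced at any time without losing state.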
GC: We saw the Good Morning America spot. It was very cool.
LM: After our public launch we graduated to a review in the Wall Street Journal, which was terrific and drove a lot more traffic than we had seen, but it’s print, which means it wasn’t any kind of concentrated burst. And then we found out — with about four days to spare — that we were going to be on Good Morning America. I’d had a product appear on Good Morning America once before in my career. At that time we had a team of six engineers that spent about two-and-a-half weeks preparing for it, along with a whole bunch of help from our hosting provider at the time.
So here I was: not only did I find out with four days to spare — I wasn’t even home. We were in Boston when we found out. I had to take a train home that night, and on the train try to figure out okay, what do we have to get done and how are we going to do it in four days to be ready for this kind of spike? And so we did a bunch of calculations based on my previous experience having a web property appear on GMA.
And so we did load testing to my estimated level, and even the load testing was pretty interesting from a cloud perspective because we didn’t have a ton of time. We used a service called Blitz.io, which basically spins up a bunch of Elastic Compute Cloud (EC2) instances, hammers the specific site path that you define with as much traffic as you want, and then spins everything back down. It was all metered, so we went from having no load testing of the infrastructure whatsoever to having a pretty massive one in probably about half an hour, which is pretty cool.
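The core of what such a service does — ramp concurrent requests against one path, measure, tear down — can be sketched with Python’s standard library alone. This is a toy model of the idea, not Blitz.io’s API (the service has since shut down), and the URL and numbers are placeholders:

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def hammer(url: str, total: int, concurrency: int) -> dict:
    """Fire `total` GET requests at `url` using `concurrency` worker
    threads; report status-code counts and elapsed wall-clock time."""
    def hit(_):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.status
        except Exception:
            return 0  # bucket failures under status 0

    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        statuses = list(pool.map(hit, range(total)))
    elapsed = time.monotonic() - start

    counts = {}
    for s in statuses:
        counts[s] = counts.get(s, 0) + 1
    return {"elapsed_s": elapsed, "statuses": counts}
```

A real load-testing service adds distributed traffic generation and ramp profiles, but the request/measure/summarize loop is the same shape.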
And… I totally underestimated what the spikes were going to be, by like a factor of 10.
GC: It seems your strategy as you’ve described it had been somewhat reactive. But how have you used cloud to frame a strategy moving forward that’s more proactive to grow in line with your business?
LM: One of the things that we learned from the experience of scaling up in this reactive way was that our architecture scales really, really well, largely because of the constraints I mentioned that came with our vendor choice. Sitting here today and looking out, I don’t know if Heroku will be right for us forever. Being based on top of EC2, we’ve been hit with the same kind of outages that EC2 has had over the last year or two. They’re not frequent, but they’re not infrequent either, and that’s a cause for concern over the long term. Over the short term, it’s just the price of doing business this way, which is okay for where we’re at. But my main take-away is that the choices that go into the cloud architecture are completely key. Designing around ephemeral server instances — those kinds of architectural decisions enabled us to scale out quickly.
I mentioned that we underestimated those traffic spikes by about 10x. We were basically able to, within about 45 seconds, dial up our capacity on Heroku from their GUI to meet the traffic, which ended up being about 30 times higher for the day, but on a spike level, it was about 100 times higher than anything we had ever seen before.
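The same dial-up is scriptable: Heroku’s Platform API lets you set the dyno quantity for a process type with a PATCH to the app’s formation. A sketch that only builds the request — the app name and token are placeholders, and nothing is actually sent:

```python
import json
import urllib.request

def build_scale_request(app: str, quantity: int, token: str) -> urllib.request.Request:
    """Construct (but do not send) a Heroku Platform API request that
    scales the `web` process type of `app` to `quantity` dynos."""
    url = f"https://api.heroku.com/apps/{app}/formation/web"
    body = json.dumps({"quantity": quantity}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Content-Type": "application/json",
            # The Platform API requires this versioned Accept header.
            "Accept": "application/vnd.heroku+json; version=3",
            "Authorization": f"Bearer {token}",
        },
    )
```

Sending it is a one-liner (`urllib.request.urlopen(req)`), and the CLI equivalent is `heroku ps:scale web=30`.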
GC: Aside from Heroku, what are some of the other cloud tools you’re using?
LM: We use a hosted Redis service, for one. All of our logging is sent over to a service called Papertrail, which is nice because on press days we’re able to scale up temporarily, and then back down as the traffic spikes subside. Some of the user interactions in the product rely on real-time notifications, so we’re using a service called Pusher, which basically uses web sockets and modern browsers to open up a connection from the user’s browser to their service. We’re then able to push data down to either specific users or specific classes of users within milliseconds of knowing about the event on our servers. That’s been pretty cool. And again, that was something where we were able to scale up to larger plans and scale down as we needed to.
In a sense, we are able to be reactive, whereas in situations that I’ve been in before with hosting environments and architectural decisions — even if you were willing to react, there was no way that you could.
GC: Right, so perhaps the better term is “adapt”?
LM: Yeah, I like that. Because when you’re operating a business at scale you get a real sense of the rhythm of the business in terms of seasonality — after you’ve been through a year or two of a Christmas season, or similar cycles like that. But when you’ve got a brand new business, year one is drastically different than year two, and even year three can be a whole different ball of wax. So it’s just a lot harder to get that kind of predictability. And one of the things that I’ve learned in my evolution as a software developer, that I’ve had to personally come to grips with, is that estimates are really, really hard. That goes for server capacity, and probably more so for development timelines, especially if you want to stay committed to a level of quality. It’s tough to do. And so some of the practices for software development that we use are about acknowledging and embracing the fact that our predictions are going to be directionally correct but often inaccurate. I think that the cloud lets me take a similar approach. For instance, hosting capacity: if I don’t have to predict it, if I don’t have to be right, then all the better, because the chance of me actually being right is low.
GC: Aside from the scalability the AWS platform offers, what have you been able to do with it? What is its overall value to you?
LM: I think that the key value, really, has been in the ecosystem of Software-as-a-Service providers that live at AWS, so you’re able to take advantage of relatively low latency. That’s the same idea as there being a part of town that’s the startup hub or the tech hub. You want to locate your business there because that’s where the other events are going to be, that’s where you can easily grab coffee with somebody who can help you out. So I think of it kind of similarly, because it’s essentially the most populated cloud, if you will — the benefits, the effects of everybody being there help to just make a lot of options available.
GC: What are some of the top takeaways in terms of your cloud approach to matching increases in demand?
LM: One thing you don’t get for free with the cloud — or with any other environment — but I think has been really important to us — is a zero-downtime deployment model. Unfortunately, the status quo for most early-stage web services is that when you need to do a major release of your software, you’re going to have some sort of maintenance window. You’re typically going to try to schedule that for 2:00 in the morning in your most popular time zone and put up some advance warnings to your users. Once we started to get more traffic, we realized that there is no good time to do maintenance. There’s no good time to take down your service and interrupt things for users. And so that’s one of the things that I would have loved to be able to kind of outsource and get for free somehow.
Unfortunately, there’s not a model out there that I know of that gets you that for free, so we’ve made a lot of investment in our own tools in order to deploy a new version of our apps to production, start it up, and, once it’s up, switch traffic over to it. There’s a whole process around how data migrations work and when to update a database schema, and that adds a lot of complexity. You begin to develop a set of rules and processes for accomplishing those goals in a safe way — moving the environment ahead one step at a time so that everything’s running smoothly throughout the whole process.
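The cut-over step of that process can be reduced to a tiny illustration: two versions run side by side, and an atomic pointer flip decides which one receives traffic. This is a toy model of the blue/green idea, not Yapp’s actual tooling:

```python
import threading

class TrafficSwitch:
    """Route requests to whichever backend is currently live; the flip
    is a single atomic pointer swap, so no request ever sees a
    half-updated state."""

    def __init__(self, initial_backend):
        self._live = initial_backend
        self._lock = threading.Lock()

    def route(self, request):
        # The live backend keeps serving even while a new version
        # is starting up alongside it.
        return self._live(request)

    def cut_over(self, new_backend):
        # Flip only after the new version is up and passing health
        # checks; the old version can then be retired (or flipped
        # back to, if something goes wrong).
        with self._lock:
            self._live = new_backend

# Two illustrative "versions" of the app.
blue = lambda req: f"v1 handled {req}"
green = lambda req: f"v2 handled {req}"

switch = TrafficSwitch(blue)
```

The migration rules Luke mentions exist so that the database schema works for both versions during the window when they overlap.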
So that’s something that’s not directly related to the cloud, but I think it is part of the promise of the cloud. It’s enabled by the fact that I can easily double my capacity while I’m bringing up a new version of the software, and a few minutes later bring those resources back down to the original levels.
In the early days when it was just my partner and me in the basement, we would deploy a few times a day as we had new stuff to look at, and we didn’t have any users to interrupt. The freedom to deploy small adjustments and updates any time is really powerful for a development team, especially when you can get instant feedback from your user base on a change that you made. It’s also great in that you can have an idea at lunch for a way to improve your interface, come back, code it up, and within an hour send it to production. That kind of feedback loop lets your development and product team execute a whole lot more effectively. And it’s a key part of taking advantage of what the cloud has to offer.
In the past I’ve always had to make choices about how to size our QA environment versus our production environment. In an ideal world where cost is not an issue, you want those environments to be basically identical, because that’s going to be your best chance to catch performance issues before an app goes to production. But it was always a tradeoff: are you really going to spend enough for top-end gear that basically sits idle 90% of the time? The cloud lets you avoid that choice. You can test on whatever size environment you want, when you want, and you don’t have to keep it running all the time. So when we did our load testing, we were able to do it against our QA environment, which we made identical to what we planned to run during Good Morning America. But most of the time our QA environment sits at a very small number of instances and costs us under $100 a month to run.
By Jake Gardner