Site Reliability and Digital Business

For digital organizations, an unreliable site is bad news.


Dec 03, 2016

For digital organizations, an unreliable site is bad news. When your site is down, customers cannot access their data, purchase goods or services, or seek support. At worst, the impact can be catastrophic, but an unreliable site always risks alienating customers and damaging fragile relationships.

For Airbnb, the quality of the digital customer experience has a direct impact on consumer trust in the company, in the accommodations presented on the site, and in the hosts who list property.

Learn how Airbnb has integrated its site reliability team into the core of how the company develops, delivers, and supports its service using the DevOps approach. Through a chain of linked tools and processes, Airbnb and its site reliability team create a consistent digital experience for customers.


Michael Krigsman: I'm Michael Krigsman, industry analyst and host of CXOTalk. And we're here at Future Stack '16, which is New Relic's conference being held in San Francisco. And I'm talking with Cameron Tuckerman-Lee, who is a site reliability engineer for Airbnb. Hey Cameron, how are you doing?

Cameron Tuckerman-Lee: Really good! How about you?

Michael Krigsman: Good! We all know what Airbnb does, but what does a site reliability engineer do?

Cameron Tuckerman-Lee: I think that's a good question. I think the role is very different depending on what company you're at. So, at a lot of companies, your SREs are your operators. You have developers in one part of your building that develop your applications, and then throw them over the metaphorical wall to your operators, who make sure that they're running in production.

Michael Krigsman: So, silos.

Cameron Tuckerman-Lee: Yeah. So, at Airbnb, we don't subscribe to that model; we are in the DevOps model that is becoming very popular lately. So, the same engineers that are building applications are also the ones that are running them, scaling them, and dealing with incidents. But because of that, there's a new class of tools that are required to make sure that they're doing that efficiently and using best practices; and so that's what the SRE team does: it makes sure that the entire site is reliable and available, and we do that by supporting the other teams that own their applications.

Michael Krigsman: What kind of tools help with this?

Cameron Tuckerman-Lee: So, some of it is ... a lot of it is learning. So when there are incidents, how do you make sure that there's good follow-up to that; that there's learning from that. And so, there is tooling around things like post-mortems, and around making sure that when incidents do occur, if there are previous incidents that were like this, you are able to get that data very quickly and understand it. It's also getting the right people in the room. So, how you do [that] with pager escalations, how you deal with alerting; those are also owned by the site reliability team. You know, we're also the ones that own and maintain the integrations with some of our monitoring tools, like StatsD and New Relic. These are how, when there are incidents, we're able to quickly triangulate where the problem is and what the impact was.

Michael Krigsman: So it's a combination of technology tools, but also processes and approaches combined with data.

Cameron Tuckerman-Lee: Absolutely. So, I think there's lots of different good ways to go about incident response, but a really not-great way to do that would be to have everybody be doing it their own way, and have no consistency. So, having a team like SRE means that Airbnb has a consistent approach to incident response, so when there are problems that need to get escalated up the chain, they can get picked up and handled very quickly.

Michael Krigsman: And you're very focused on using the end user as a reference point.

Cameron Tuckerman-Lee: Absolutely.

Michael Krigsman: Tell us about that.

Cameron Tuckerman-Lee: I think no business likes having downtime. Obviously, there are financial implications to any business, but there is a really personal human aspect to downtime at Airbnb. The situation I like to remind myself of to motivate me is, you can imagine, you know: you're going on vacation, just got off the plane, you're in the cab, you're heading to your listing, you open up your application to get its address, and you just see a 500. It would be a pretty bad or potentially scary situation.

Michael Krigsman: Yeah, very painful.

Cameron Tuckerman-Lee: Yeah. And so, Airbnb really is nothing without our community. I can't imagine what the product would be without the guests and hosts that trust us; so, making sure that we're not just up and available for taking bookings, but that people are able to rely on us is really important to our business.

Michael Krigsman: You mentioned the word "trust". How does trust relate to technology, relate to user experience; how does that web work?

Cameron Tuckerman-Lee: It's a good question. So, some might say that Airbnb is a hospitality company, but some might also argue that we're selling trust: the trust that you're going to be able to go to a stranger's home, and feel welcome and have a good experience, and be able to experience that neighborhood like a local. And so, the technology that goes into making sure that people are who they say they are, that you're able to interact with your host, and get to know each other beforehand; that you're able to, when you're searching for a listing, find a place that's going to fit with the kind of neighborhood that you're looking for; I think these all contribute to making sure that when you go someplace, you trust that it's going to be a good experience.

Michael Krigsman: And how does that, then, connect to site reliability engineering, and to other engineering functions inside Airbnb? How do you think about the connections?

Cameron Tuckerman-Lee: I think this comes down to engineers feeling like they're very involved in the product. I don't think that many engineers at Airbnb feel like they're just doing what they're told - they're shipping code, and once it's deployed, they don't care about it anymore. They really feel like they need to own their own impact; that's the term that we throw around a lot.

Michael Krigsman: "Own your own impact."

Cameron Tuckerman-Lee: "Own your own impact." So, if you think something needs to get done, if you think something's not being done the right way, it's up to you to stand up and make that change happen. And so, this goes for everybody, from product teams developing new features for guests and hosts to make their experience better, all the way to the site reliability team: if we see that there are issues that need to get resolved, or there are some parts of processes that aren't working out, we need to step up and do something to make sure that our guests and hosts are going to have the best experience that they can [get].

Michael Krigsman: So you really do see it as a kind of chain of linked tools and processes that have this ultimate combined impact on the user.

Cameron Tuckerman-Lee: Absolutely. We want to have teams build on top of each other, all the way until the teams that are building the actual experience that our users see. We want to have a really strong foundation for them, so that when they are building JavaScript frameworks [for] user interfaces, they're able to trust that the back-end is going to stay up, and they're able to trust that if there are issues that go to production, we're able to tackle them very quickly and roll back. And so, it really is a pyramid of supporting each other.

Michael Krigsman: And finally, what's the data that you look at?

Cameron Tuckerman-Lee: There are a couple of different parts of the data that my team cares about. It's everything from your traditional SRE metrics, like mean time to resolve and mean time to acknowledge, you know, when [it comes to] incident response. My team is also starting to really care about metrics around making sure that our on-call engineers are living healthy, productive lives; making sure that work-life balance is something that extends [to] when you're on call at 2 AM. I think it's something important for the industry to start looking at. Lastly, there are the ones that are aligned with how our users are seeing things; these are what a lot of companies would call "service-level objectives": making sure that our response times are fast and our error rates are low, and that [it is] not just about sending out bytes to our CDN fast, but also making sure that when the browser does get that information, it's also having fast load times. And that's where things like application monitoring with companies and products like New Relic come into play.
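
To make the metrics mentioned above concrete, here is a minimal Python sketch of how mean time to acknowledge, mean time to resolve, a count of off-hours pages, and a simple error-rate objective might be computed. It is an illustration only, not Airbnb's tooling: the `Incident` record, the function names, the 22:00 to 06:00 definition of "off hours," the sample data, and the error-rate threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    opened: datetime        # when the alert fired
    acknowledged: datetime  # when an on-call engineer acknowledged it
    resolved: datetime      # when the incident was closed


def mean_time_to_acknowledge(incidents):
    """Average delay between an alert firing and its acknowledgement."""
    seconds = mean([(i.acknowledged - i.opened).total_seconds() for i in incidents])
    return timedelta(seconds=seconds)


def mean_time_to_resolve(incidents):
    """Average delay between an alert firing and its resolution."""
    seconds = mean([(i.resolved - i.opened).total_seconds() for i in incidents])
    return timedelta(seconds=seconds)


def off_hours_pages(incidents, night_start=22, night_end=6):
    """Count incidents opened at night, a rough proxy for on-call load.

    The 22:00-06:00 window is an assumed definition of "off hours".
    """
    return sum(
        1 for i in incidents
        if i.opened.hour >= night_start or i.opened.hour < night_end
    )


def meets_error_rate_slo(total_requests, failed_requests, max_error_rate):
    """Return True if the observed error rate is within the objective."""
    if total_requests == 0:
        return True  # no traffic, so nothing violated the objective
    return failed_requests / total_requests <= max_error_rate


# Made-up sample data: one 2 AM page and one afternoon incident.
SAMPLE = [
    Incident(datetime(2016, 11, 1, 2, 0), datetime(2016, 11, 1, 2, 4),
             datetime(2016, 11, 1, 2, 34)),
    Incident(datetime(2016, 11, 2, 14, 0), datetime(2016, 11, 2, 14, 2),
             datetime(2016, 11, 2, 14, 20)),
]

if __name__ == "__main__":
    print("MTTA:", mean_time_to_acknowledge(SAMPLE))
    print("MTTR:", mean_time_to_resolve(SAMPLE))
    print("Off-hours pages:", off_hours_pages(SAMPLE))
    print("Error-rate SLO met:", meets_error_rate_slo(10_000, 5, 0.001))
```

For the two sample incidents, the acknowledgement delays are 4 and 2 minutes and the resolution delays are 34 and 20 minutes, so the averages come out to 3 and 27 minutes. Browser-side load times, which the answer above also mentions, would need data from real-user monitoring and are left out of this sketch.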

Michael Krigsman: So, it is a very holistic view.

Cameron Tuckerman-Lee: Absolutely.

Michael Krigsman: We have been speaking with Cameron Tuckerman-Lee, who is a site reliability engineer at Airbnb. Cameron, thanks a lot!

Cameron Tuckerman-Lee: Thank you so much!

Published Date: Dec 03, 2016

Author: Michael Krigsman

Episode ID: 399