Watch this session for a discussion on rearchitecting for the cloud. Learn why services should be redesigned to take advantage of containerization, microservices architectures, and secure-by-default principles. We also dig into how companies can get ahead of common challenges, so they don't disappoint customers who expect secure and reliable services, and so their engineers can spend more time building and coding than working on call.


Transcript:

Cameron: Welcome back, we're going to jump right into our next session. Our next speaker is the founder and CEO of Shoreline.io, a company that enables site reliability engineers to improve the efficiency, performance, availability, and security of the systems they manage. Prior to Shoreline, he ran analytic and relational database services as a vice president at Amazon Web Services, and he was also vice president of engineering for performance management and business intelligence at Oracle. Quite a resume. Please welcome Anurag Gupta. Anurag, welcome, the stage is yours.

Anurag: Hey, thanks so much, and I appreciate all of you joining the talk and the conference. It's awesome. Today I'll be talking about delivering reliable services in the cloud, and particularly, as many of us are doing, in a multi-cloud world. This is related but different to what I do nowadays at Shoreline; it's going to be based more on my experiences at AWS. When I joined AWS, they gave me eight people and said, go disrupt data warehousing and transaction processing. Overall, the services I started and took on there grew from about $3 million to $5 billion over my seven and a half years. So I learned a lot about launching and scaling services, and that's a lot of what we're going to be talking about today. Because at the end of the day, people don't care about your features or your performance if your service isn't up.

Making services reliable, particularly when you don't necessarily own the software they're based on, for example Oracle, SQL Server, etc., can be something of a challenge. So let's just jump in. I'm from New York, I talk fast, so I suspect there's going to be ample time for your questions; please keep them coming. Let's first start by talking about some of the challenges in delivering reliable cloud services. The first one is that people expect higher availability in the cloud than they were providing to their own customers on-prem. The first service I launched was Redshift, the data warehouse on AWS. One of my first customers was Netflix, which was previously running Teradata on-prem. They wanted 99.99% availability, which is just 4.4 minutes of downtime per month.

And I asked them, well, what are you getting from Teradata? And they said 99%. So they were expecting an out-of-the-box, brand-new service to deliver 100 times better availability than they were getting on-prem. That's a challenge, and you should have those expectations of yourself and the services you run. The second thing that's really important is that customers care not about fleet-wide availability, but about their availability. If you read the Google SRE books, which are frankly awesome, they're a little bit inward-facing around fleet availability. But, you know, I live in California; if I call up my local utility provider, PG&E, and say, hey, the power is out at my house, and they were to tell me, did you know we have six nines of power availability in the state of California, I kind of don't care. The power is still out, and I need it back up.
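As a quick check on that availability math, here is a minimal sketch converting targets into monthly downtime budgets (the month length is an average, so the exact budget shifts slightly month to month):

```python
# Monthly downtime budget implied by an availability target.
# Average month = 8766 hours/year / 12 = 730.5 hours.
MINUTES_PER_MONTH = 8766 * 60 / 12  # ~43,830 minutes

for label, availability in [("99%", 0.99), ("99.9%", 0.999), ("99.99%", 0.9999)]:
    budget = MINUTES_PER_MONTH * (1 - availability)
    print(f"{label} available -> {budget:.1f} minutes of downtime per month")

# 99%    -> ~438 minutes/month (what Netflix was getting on-prem)
# 99.99% -> ~4.4 minutes/month (what they asked of Redshift)
```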

You really should think about your services like a utility. In fact, back at AWS, our engineering team was known as utility computing. The goal was to provide compute, storage, databases, and another 140 services the same way PG&E provides electricity and gas to my house: metered, on demand, scale up, scale down, easy to use, just plug your appliances into the wall. It's a lot harder to do that for data services. If I look at something like EC2, if an instance goes down, or even an entire datacenter goes down, you can migrate to another one. It's an interruption, but a brief one. You can't do that with a database. I can't tell you, okay, your database is unavailable, why don't you take a new database? Because it's not about the database, it's about the data.

So for those of you running data services, it's way harder than for people running more, shall we say, cloudy, ephemeral kinds of resources. Even for something as complex as Google search: if I type a query into the search box and an entire region is down, they can just reroute to another place, and I'll just get slower results back. That's okay. You can't do that with, say, a database or email or things of that nature. It's yet harder as you go multi-cloud, as you go to microservices, or as you go to containers rather than VMs. And that's a little bit of an odd statement, right? Because it should be easier, it should be faster. Yes, these units are smaller, but there are frankly more things that can break, there's a larger environment that's harder to diagnose, and there are just more resources that you have to keep up. Those are big challenges. All of that is leading, particularly in the cloud, to production ops becoming the new bottleneck for engineering and operations teams, because we went to the cloud to make innovation faster, engineering faster.

And by and large, we did: we're able to deliver things faster than we ever were before. Most of the other steps in the software development lifecycle are also now automated; you can test, configure, deploy, and build artifacts automatically, and if you're not doing that today, you really should. But pity the poor SRE or engineer on call, who is getting this absolute tsunami of things going on: more frequent releases, more complex environments, a lot more moving parts they need to understand, and, as I was saying earlier, higher customer expectations. It's really tough. You can see that in the fact that demand for site reliability engineers is skyrocketing. The number of people you see with the SRE title, or cloud engineer, or similar titles, is doubling every year, and the number of reqs for SREs relative to developers is getting very close to one to one.

And that's kind of weird, because there are obviously 40 times more developers out there. And people burn out in this job; the average tenure at a company is less than 18 months nowadays. So it's hard to hire them, and it's hard to keep them. Part of the problem is that SRE, as it's done today, is largely manual. And that's a problem, because you'll never grow your teams at the rate your fleets are growing. So there's just this rising tide of work to do. That all sounds pretty grim, so let's see what we can do about it.

So the first thing we should do is understand that the infrastructure we're going to be running on is unreliable. Availability zones, in particular in AWS, are not intended to be high availability. Cloud services are deployed AZ by AZ, and every time something goes in, there's a probability of an error, which may be automatically rolled back, but it's an issue. So you should think of the AZ as your largest unit of potential failure. If you've got three-way replication but it's all in one AZ, it can all go down; you really need to think about these things across AZs. When I was designing Amazon Aurora, we did a four-out-of-six quorum, where we had two storage instances in each of three AZs. So if we lost one AZ, yes, we'd lose two instances, but we'd keep four. More importantly, if you're running a large fleet, there's going to be something down all the time.

And so you want to make sure that you don't break quorum in those cases either. For that, you need to keep a read quorum, which is three out of six. It's basically AZ+1 that you want to tolerate in your environment, and that's what led to the four-out-of-six quorum. All the cloud providers recommend multi-AZ and the use of regional services where possible, because they're going to be failure tolerant. For example, S3, DynamoDB, and Aurora on AWS are all good choices. If you are building your own data service, it's again way harder for you, because cloud volumes like EBS are not intrinsically multi-AZ, and in managed Kubernetes like EKS, StatefulSets are not multi-AZ safe. So you've got to build the replication yourself, and that's particularly hard if your recovery point objective is one where you can't lose data; you're going to need some sort of streaming replication in most cases. We can get into that in the detailed comments. Let me keep going.
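To make the quorum arithmetic concrete, here is a minimal sketch of the counting argument (this is not Aurora's actual code, just the invariant): with two copies in each of three AZs, a 4-of-6 write quorum survives an AZ loss, and a 3-of-6 read quorum survives AZ+1.

```python
# Six copies: two storage nodes in each of three AZs.
# Write quorum 4 and read quorum 3 overlap (4 + 3 > 6), so any read
# quorum always intersects the latest acknowledged write.
COPIES_PER_AZ = {"az-1": 2, "az-2": 2, "az-3": 2}
WRITE_QUORUM, READ_QUORUM = 4, 3

def alive(lost_azs=(), extra_lost=0):
    """Copies remaining after losing whole AZs plus extra single nodes."""
    up = sum(n for az, n in COPIES_PER_AZ.items() if az not in lost_azs)
    return up - extra_lost

assert alive(lost_azs=("az-1",)) >= WRITE_QUORUM               # AZ down: writes survive
assert alive(lost_azs=("az-1",), extra_lost=1) >= READ_QUORUM  # AZ+1: reads and repair survive
```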

So how do we think about data services availability? The first thing to understand is that the flows between sources and sinks fail 10 times more often than either the source or the target. But you care more about the flows than the target, because presumably the target is your data service. One key thing to understand is that the data pipeline is code: it needs to go through CI/CD for format changes, just the same way all your other code does. You should monitor the queue lengths, transit times, and source timeouts to figure out when something's broken and why. Is it because the source is bad? Are you unable to reach your target, so the queue length is growing and you've got latency?
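Here is a minimal sketch of what monitoring the flow might look like; the metric names and thresholds are hypothetical, and the point is that you alarm on queue depth, transit time, and timeouts rather than only on the endpoints:

```python
from dataclasses import dataclass

@dataclass
class FlowStats:
    queue_length: int           # records waiting between source and sink
    transit_p99_seconds: float  # p99 time from source to sink
    source_timeouts_per_min: float

# Hypothetical thresholds -- tune per pipeline.
MAX_QUEUE, MAX_TRANSIT_S, MAX_TIMEOUTS = 10_000, 30.0, 5.0

def diagnose(s: FlowStats) -> list[str]:
    problems = []
    if s.source_timeouts_per_min > MAX_TIMEOUTS:
        problems.append("source unreachable or flaky (timeouts rising)")
    if s.queue_length > MAX_QUEUE:
        problems.append("sink can't keep up (queue growing)")
    if s.transit_p99_seconds > MAX_TRANSIT_S:
        problems.append("latency budget blown (slow transit)")
    return problems

print(diagnose(FlowStats(queue_length=50_000, transit_p99_seconds=4.2,
                         source_timeouts_per_min=0.1)))
```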

And then you really need to make the remediations automatic, because if you start losing your flow, you can't afford to drop data on the floor; the most recent data is the data people want most. So you need to auto-scale your queues, you need to restart nodes, and Kubernetes is an asset in this case, and then you need some sort of mitigation. I always talk about building escalators, not elevators. Let me explain what I mean by that. An elevator is way faster, but when it breaks, it turns into an elevator shaft. When an escalator breaks, it turns into a staircase: it's still usable, just slower. So for example, a mitigation for a reliable flow might be to push your data to S3 and read from S3. At least you're not losing data; while the flow is down, you're just stacking up things that you can later work back through the queue.
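As a rough sketch of that escalator pattern with boto3 (the bucket name and queue sender are hypothetical): if the normal flow fails, spill records durably to S3 instead of dropping them, and drain the spill later.

```python
import json
import boto3

s3 = boto3.client("s3")
SPILL_BUCKET = "my-pipeline-spill"  # hypothetical bucket name

def send(record: dict, enqueue) -> None:
    """Fast path first; on failure, degrade to the 'staircase'."""
    try:
        enqueue(record)  # normal path: hand off to the streaming system
    except Exception:
        # Degraded but durable: a separate drain job later reads these
        # objects back and replays them through the queue.
        s3.put_object(
            Bucket=SPILL_BUCKET,
            Key=f"spill/{record['id']}.json",
            Body=json.dumps(record).encode(),
        )
```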

Another real challenge for data services in the cloud is that customers typically want to run inside their own cloud account, not your account. People sometimes tell me, how does that make sense? Customers were obviously running inside your account with RDS. That's true, but that's AWS; the customer knows they've got a dependency on AWS, and whether it's RDS or EC2, it's kind of the same thing. It's different if you're an external provider. Why do they want to run inside their own account? Because it helps them with SOC 2, PII, and all the other compliance regimes. At the same time, they want you to manage the service, because they don't want one more headache of things they need to manage and understand that are presumably also changing fast. And they won't give you SSH access, right? Because that breaks the whole premise of running inside their account if you also have access to it.

The way I solved that in the past is with an outbound HTTP/2 connection from the data plane running in their cloud account to the control plane running in yours. In general, people have port 443 up and open, and HTTP/2 gives you a long-lived bidirectional connection, so even if they only allow outbound traffic, you can now interact in both directions. I think that's way better than what a lot of people do nowadays, which is, oh, I'm going to use email as my transport. That's just ridiculous; it's not going to have the latency you need during a live event. The other thing I'd recommend for all of you is to use envelope encryption with bring-your-own-key for data at rest; we talked about data in transit before this. What I mean by that is you wrap the per-database key that you generate with their key stored in KMS or something else, and perhaps even wrap it yet again with your own key. The reason you do that is the outer wrapping ensures you can rotate keys whenever you want without bothering your customer. And the reason you want them to have a key is that they can disable your access at any time and turn the data into garbage; now they're just protecting the key, not the data.

So you can't access the data, but the amount of data that needs to be re-encrypted on rotation is pretty small. That's really important, particularly for backups. The last thing you want to do, if you're running a large cloud dataset, is mess with your backups; messing with the keys to your backups is a much safer problem, because that dataset is small. So those are some suggestions. All that said, it was remarkable to me when I was at AWS how much better I could run a database in the cloud than when I was running it on-prem. On-prem, you basically mailed somebody a CD and you hoped. You didn't have any visibility into their workloads or anything like that. All you could do was run benchmarks like TPC-C or TPC-H for transactional or analytic workloads. But benchmarks are artificial; they have nothing to do with how customers are actually using your product. On Redshift, what I did is every week I got a Jupyter notebook that showed one hour of activity, the same local hour of time, in each of the regions and each of the databases we were running: a high-level sort of query shape, of course, not the actual queries.
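Here is a minimal sketch of that envelope pattern against AWS KMS via boto3 (the key ARN is hypothetical, and error handling is omitted): the customer's KMS key wraps the per-database key, so revocation and rotation only ever touch the small wrapped key, never the bulk data.

```python
import boto3

kms = boto3.client("kms")
# Hypothetical ARN for the customer-owned KMS key.
CUSTOMER_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/customer-key"

def create_database_key():
    """Generate a per-database data key, wrapped by the customer's key."""
    resp = kms.generate_data_key(KeyId=CUSTOMER_KEY_ARN, KeySpec="AES_256")
    plaintext_key = resp["Plaintext"]     # use in memory only, never persist
    wrapped_key = resp["CiphertextBlob"]  # safe to store next to the database
    return plaintext_key, wrapped_key

def unwrap_database_key(wrapped_key: bytes) -> bytes:
    """Works only while the customer grants us access to their KMS key.
    If they revoke it, the data -- and every backup -- becomes garbage."""
    return kms.decrypt(CiphertextBlob=wrapped_key)["Plaintext"]
```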

And we tried to find one project each week to improve our performance; we got 700% better performance in 18 months doing that. In large part that was because people were using systems differently than I expected. There were as many insert, update, and delete requests as there were selects, which is mind-blowing for a data warehouse, but that's the way it was. The next thing you can do is train models across customers, which gives you a large dataset to work with, and then apply them per customer. One way I used that: we would predict disk and network failures based on the number of errors over a period. So, I think a disk is going to fail within the next week if I've seen two read or write errors within the last day, something along those lines; and similarly for network cards, similarly for EC2 instances, etc.
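That rule of thumb is simple enough to sketch; the threshold of two errors in a day is his illustrative number, not a published model:

```python
from collections import Counter

def disks_likely_to_fail(io_errors, threshold=2):
    """io_errors: list of (disk_id, kind) events from the last day, where
    kind is 'read' or 'write'. Any disk at or over the threshold is
    predicted to fail within the week, so drain and replace it early."""
    per_disk = Counter(disk_id for disk_id, _ in io_errors)
    return {disk for disk, n in per_disk.items() if n >= threshold}

# Example: d42 logged two write errors today -> flag it for replacement.
events = [("d42", "write"), ("d42", "write"), ("d07", "read")]
assert disks_likely_to_fail(events) == {"d42"}
```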

Now, you need a large data corpus to do that, so it's hard to do on the number of databases most of us might have on-prem, but it's easy to do in the cloud. The other thing we did was predict query execution time based on the shape of the query: how many joins there were, etc. That's useful because otherwise you can run the query optimizer, which is perhaps more accurate, but you're already inside the database at that point; it's a little too late to stall things or route to one queue versus another (there's a small sketch of this below). And lastly, and probably most importantly, you can elastically scale up and down, even to zero. One of the surprisingly successful services I had at AWS was Aurora Serverless. It was basically a load balancer in front of the database, which let us, behind the scenes, scale the database up, scale it down, even scale it to zero.

That was important because there are a lot of databases and data services out there that are really only heavily used for a portion of the year. For example, your performance review process: you might run that against a database twice a year, and the rest of the time a handful of people need access to it, but nothing like during performance reviews. So the fact that you can scale that up and down is super helpful. And it's both the scaling and masking the downtime by using routing: scaling up shouldn't cost you anything, and scaling down shouldn't cost you anything from a latency perspective, because you can keep the system going while you're replicating it onto a different instance. Those are things that are super hard to do unless you have elastic resources.
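Back to the query-shape predictor mentioned a moment ago, here is a toy version. The features and weights are invented, but they show the point: coarse shape features available at admission time are enough to route a query before the optimizer, let alone execution, ever runs.

```python
# Hypothetical linear model, trained offline across customers and
# applied per customer when a query is admitted.
WEIGHTS = {"num_joins": 4.0, "num_aggregates": 1.5, "gb_scanned": 0.8}
INTERCEPT_SECONDS = 0.5

def predicted_seconds(features: dict) -> float:
    return INTERCEPT_SECONDS + sum(WEIGHTS[k] * v for k, v in features.items())

def route(features: dict) -> str:
    # Decide before execution: inside the optimizer it's already too
    # late to stall the query or send it to a different queue.
    return "short-queue" if predicted_seconds(features) < 10 else "long-queue"

print(route({"num_joins": 1, "num_aggregates": 2, "gb_scanned": 0.5}))   # short-queue
print(route({"num_joins": 5, "num_aggregates": 4, "gb_scanned": 40.0}))  # long-queue
```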

The last thing I'll suggest for all of you, and this is a little bit self-serving: automate away one issue each week. When we were trying to grow services at AWS, we were just looking for a 1% improvement in usage or revenue each week, and that gave us 68% over the course of the year, because it accrues, it builds on itself. Now, at the end of the day, if you're going to get 1% more usage each week, you're not going to get 1% more resources in terms of people to manage your databases each week; there's no way. So you need to get 1% better at automation each week as well. That lets your developers spend more time on innovation and creating value, because you really don't get paid for keeping the lights on; people expect that from you. What you get paid for, what you get more revenue for, is innovation. So putting in a constant process of continuous improvement for your production ops automation, just as you do for things like QA, simply makes sense. Okay, that's my last slide. I'd love to take any questions.
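The 68% figure is just weekly compounding; a one-line check:

```python
print(f"{1.01 ** 52 - 1:.1%}")  # 67.8% -- a 1% gain every week for a year
```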

Cameron: Perfect, thanks, Anurag. We have a question here: can you give us a sense of the tools and technologies that can be leveraged, that can be brought to bear here? It seems like there are a lot of techniques you described; are there any tools that can help developers, give them some rails, I guess, to manage and automate all these cloud services and data?

Anurag: That's a great question. What I focused on in the presentation itself were things one could do in code. But the changes you make in code take a long time to go in, particularly sophisticated things that are architectural in nature. So you want to fund those projects, because you will surely pay for them someday, one way or the other. In terms of production ops, if you look at that landscape, there are a lot of DevOps tools out there, but most of them help you with observability and reducing mean time to detect. Then there are a bunch that help you with incident management, getting a ticket to the right person to work on it, or whoever's on call; those are useful, and if you don't have them, you must get them. The problem is that there's very little out there to help you with automation.

That's why I started Shoreline, frankly. If you're going to have a human in the loop, it's going to take an hour plus to fix something: they take some time to wake up, orient themselves, and maybe they've got other things going on. So that's tough. At least for everything that is ongoing, you really want those things automated away. No one should wake up at night to fix a broken disk; that should just be taken care of for you. And then you want to reduce escalations, so that the person who gets the ticket can fix the ticket. That goes all the way to support: people in support want to be able to fix things, not just beg and plead amongst six dev teams for somebody to take a look. So, long answer to a short question, but hopefully that makes sense. That's the kind of stuff Shoreline does.

Cameron: That's perfect. Another question came from an audience member here. We all sort of understand that the cloud enables scalability, flexibility, all these benefits, but it seems like people are still having problems getting their organizations' leaders to buy in to moving to the cloud, at least in some organizations. Do you have any thoughts on where to start that conversation, how to convince organizations to go ahead and do this migration to the cloud?

Anurag: It's a big transition, a big journey, a long journey, so the only thing you can do is take it one step at a time. There are no teleportation devices out there, to my knowledge, however much the cloud vendors say, no, we just make it simple. It takes time, and it carries risk, and it's understandable that there will be hesitation. So the question is, what are the low-risk environments you can go into? Obviously dev/test is a clear one. Websites are a clear one. Analytics is another clear one; none of us has the capacity to keep growing our on-premises environments, and that's a great use of Incorta, for example. I think those are the easy ones. Otherwise, if you're trying to get your own service up in the cloud, what I'd suggest is start with dev/test.

Cameron: That's a great suggestion. Let them sample the benefits and move forward from there. Here's another quick one, hopefully: how can data engineers assess whether their data is reliable? Any thoughts on techniques or tools to assess or measure reliability?

Anurag: Yeah, there are a lot of techniques there. Where I start is that you want to apply the same SRE principles you use for your services to your data pipelines. For example: how many errors are you seeing per record? What is the latency? What is the throughput? Those are obviously the same metrics you'd put on a website; that's what Google would measure for their search system. But they make equally as much sense for a data pipeline. The next thing is to absolutely make sure that no one can change a data format without also changing the pipeline. They need to get checked in together, go through code review together, and get tested together. I remember once having a meeting with six CMU PhDs in AI who were working at Fitbit. I asked them what they were doing, expecting to get blown away by all the model development, and they kept telling me: you know what I do? I wake up in the middle of the night to deal with the fact that someone upstream, I'll avoid the term he used, checked in one more field to their JSON string. It's just broken as an engineering process. So if you're doing such things...
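That failure mode is mechanical to guard against: keep the expected record format in the same repo as the pipeline, and validate it in CI and at ingest, so a new JSON field can't land without a matching pipeline change. A minimal sketch, with hypothetical field names:

```python
import json

# The pipeline's expected contract. It lives in the same repo as the
# pipeline code, so a format change and its code change ship together.
EXPECTED_FIELDS = {"user_id", "timestamp", "heart_rate"}

def validate(raw: str) -> dict:
    record = json.loads(raw)
    missing = EXPECTED_FIELDS - set(record)
    unexpected = set(record) - EXPECTED_FIELDS
    if missing or unexpected:
        # Fail in CI or quarantine at ingest -- instead of paging someone
        # at 3 a.m. when a downstream stage chokes on the new field.
        raise ValueError(f"schema drift: missing={missing}, unexpected={unexpected}")
    return record

validate('{"user_id": 1, "timestamp": 1700000000, "heart_rate": 62}')  # ok
```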

Cameron: Super, we're near the end of our time here. I assume people can go to Shoreline.io to see some tools to help with site reliability?

Anurag: Yes, and they can reach out to me at anurag@shoreline.io if they have any questions.

Cameron: Perfect, perfect, thank you so much, super interesting.


Speaker:

Anurag Gupta

CEO & Founder


Interested in partnering with us for next year's event? E-mail sponsors@incorta.com.