ZG Session | What Data Engineers Should Know About Building Real-Time Data Pipelines

Data processing technologies are constantly advancing, having gone through big data, cloud, and machine learning evolutions over the last dozen years. Most recently, we've seen the rise of real-time analytics. Not only are there more sources and greater volumes of real-time data to be leveraged, but organizations are realizing more value from real-time analytics use cases, in areas such as risk operations, security analytics, logistics tracking, and real-time personalization.

What will it take to build the next generation of data systems and pipelines capable of handling real-time data? This session discusses the industry's move toward real-time analytics and the technical requirements this will involve. Learn how flexible schemas, support for complex queries, and the ability to handle bursts in traffic and out-of-order events help pave the way for real-time analytics.

Transcript:

Cameron: Our next speaker is the CTO and co-founder of Rockset. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier, he was at Yahoo, where he was one of the founding engineers of the Hadoop Distributed File System. He was also a contributor to the open source Apache HBase project. Dhruba previously held various roles at Veritas Software, founded an e-commerce startup, and contributed to the Andrew File System at IBM Transarc Labs. He's been a busy guy. Please welcome Dhruba Borthakur. Dhruba, the stage is yours.

Dhruba: Hey, thank you. Thanks a lot, Cameron. I really appreciate you inviting me and letting me talk to a lot of the data pipeline developers. We're super excited to be talking to a great technical audience. I'm hoping my talk today, which is about real-time data pipelines, is useful to at least some of your audience, and that they can use some of the insights I share to build data pipelines in their own work. So I have a set of slides, but I'd really appreciate questions. I'm going to talk a lot about real-time data pipelines. Data pipelines in general have been used by many developers over the last decade or so, since the time Hadoop started, but today I'm going to focus mostly on the real-time part. First, a short, very high-level overview of myself. Right now, I'm the CTO and co-founder at Rockset. Rockset is a cloud data service for real-time analytics. So if you have analytical needs on large datasets and you want to get results quickly, rockset.com is something that you can try.

Before that, I was a founding engineer of RocksDB at Facebook. It is a key-value store that I helped build when I was a developer at Facebook, and it is widely used in a lot of Facebook backends. I also worked on the Hadoop file system and other backend systems. Most of those systems are associated with servers and storage, and that has been my technical expertise all this while. In today's talk, I'm going to talk a little bit about how data systems have evolved, because I have seen Hadoop grow from infancy to very mature software now; I was maybe the second or third person on the team. So I'll talk a little bit about how data processing has evolved. And then I'm going to talk more about real-time data processing, real-time analytics, and real-time data pipelines.

I'll also deep dive into some of the technical requirements for setting up a real-time data pipeline, and the things you should be cognizant of as a data engineer to get these data pipelines to deliver data or insights in real time. And again, I love questions, so please keep asking questions, and I'll try to answer them at the end of my talk. So yeah, Hadoop started in 2006. I was a developer at Yahoo, and I was the project lead of the Apache Hadoop file system, the open source one. At that time, I didn't really realize that Hadoop would be so big ten years later. I still remember the first intern at Yahoo, who was building a lot of Hadoop search systems, mostly focusing on batch processing. At the time, Hadoop was purely a batch processing system: you submit large queries and results come back, maybe in a few minutes or an hour.

Then came Spark, back in 2011 or 2012. Spark in combination with Kafka moved these analytic systems more toward streaming, right, where things happen when new data arrives, and you can publish insights to other places. But then for the last two or three years, at Rockset and some other companies, the focus has all been on real-time analytics: how can you get data in real time, and insights in real time, within a few milliseconds of the data getting produced? And to get there, real-time data pipelines are super important, right? Because that's the pipe through which data is delivered from one place to your analytical system for it to get processed.

So that's the part that I'm going to cover more in depth. Again, I remember back in the days when I was working at Yahoo, and when I was working at Facebook, a lot of these systems were batch analytics. For example, some of you might have heard about PYMK, People You May Know. This was an application that we developed when I was at Facebook, back in 2007 or 2008 maybe. It was about finding associations or connections between different people and trying to suggest who could be your friends. And this application was one of the reasons why Facebook became so viral back in the day. But the key part for data engineering is that everything was batch. The data pipeline wasn't that mature: you copy things from one place to another once a day or so, run all these algorithms, and give results.

But then, when I was at Yahoo, there was a lot of ad placement. Yahoo had something like 26 properties, like Yahoo Finance, Yahoo Mail, Yahoo Maps, and others, and you had to place ads in all of these properties. And this is where I saw a lot of streaming happen, where, depending on what the person was browsing on the desktop, ads get shown that are customized to the user. So more stream processing was starting to happen. And then in the last few years at Rockset, what I've seen is that these analytical systems have become really real time. Take, for example, a fleet management system. At Rockset, we have one large customer who is managing thousands and thousands of trucks on the road. They're delivering cement all over the country, and they use Rockset because if coolant is leaking or a truck is getting delayed, they need to react very quickly.

So the data pipelines for this kind of real-time fleet management are built differently compared to the batch systems that we had earlier. This is what I mean by real-time use cases. Similarly, take real-time gaming analytics, where people are playing an online game and your moves are being recorded in a transactional database. But then you need an analytical system where you can build leaderboards, for example, right? Or you can do matchmaking: when a new user joins, you have to match them to a game that is currently being played by players of a similar skill level. So there's a lot of matchmaking that happens in the analytical system, which in this case is Rockset. But the key part is that the data latency between when you make the transactions in your transactional database and when your analytics can give you results is essentially a few milliseconds or a few seconds.

So this is why the data pipeline there needs to be really real time. You don't want to see a leaderboard which is two minutes old, right? If you're playing an online game and you make a great move, it needs to get reflected on the leaderboard within a few seconds at the most. So these are applications which need real-time analytics and need real-time pipelines in the middle to power them. Another one is real-time computer vision. For example, take a retail store with automatic checkout, where people are picking things from the shelves. There are video cameras which automatically understand what the person is picking and record it in the database. But the analytical system also needs to get updated very quickly, so that you can do upselling to the person at the time of checkout. So again, the latency between where the transactions are happening and your analytical system needs to be a few seconds at the most.

And that's where the real-time data pipeline becomes important. Building these real-time systems is hard, because you can't afford to have data latency. Real-time data pipelines have this concept of data latency, which is essentially the gap between the time when the event actually occurred and the time when that event is ready to be analyzed in your analytical system. Reducing this is the challenge in building a real-time data pipeline, and it's a hard problem to solve. Today, what I'm going to do is talk about a few of these open source software systems; some of them may not really be open source, but they are widely used as well. All of these systems claim that they're very real time, which means that you can process data as soon as it is created. And I would like to show a set of metrics, or a set of technical requirements, that you can use to compare and contrast these backends. So if you're using any of these backends, you can use the checklist that I have to see how they compare against one another.
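The notion of data latency defined above can be made concrete with a small calculation. This is only an illustrative sketch: the event timestamps, the two-second budget, and the names `data_latency` and `BUDGET` are assumptions made for the example, not figures or APIs from the talk.

```python
from datetime import datetime, timedelta, timezone

def data_latency(event_time: datetime, ready_time: datetime) -> timedelta:
    """Gap between when an event occurred and when it became queryable."""
    return ready_time - event_time

# Hypothetical events: (time the event occurred, time it was ready for analysis).
t0 = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
events = [
    (t0, t0 + timedelta(milliseconds=800)),
    (t0 + timedelta(seconds=1), t0 + timedelta(seconds=1, milliseconds=300)),
    (t0 + timedelta(seconds=2), t0 + timedelta(seconds=7)),  # a slow delivery
]

# Assumed freshness budget for a "real-time" pipeline in this example.
BUDGET = timedelta(seconds=2)

latencies = [data_latency(occurred, ready) for occurred, ready in events]
worst = max(latencies)
within_budget = [lat <= BUDGET for lat in latencies]

print("worst data latency:", worst)
print("within budget:", within_budget)
```

Tracking the worst case rather than the average is a deliberate choice here: a leaderboard or fraud check is judged by its slowest update, not its typical one.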

So what are the technical requirements for making these data pipelines and analytical systems real time? I have a checklist of eight items, and you should compare all the data systems that you use on the basis of this list and see whether your data system is meeting these requirements. Again, these eight requirements are something that I've come up with based on all my years of work related to analytics in general. The first four are what I'm going to cover today: bursty traffic, complex queries, flexible schemas, and low-latency queries. I'm going to talk in great detail about each one of them. The last four I will skip, but I have a write-up that I can share with you later if you contact me at my email address. So the first one is bursty traffic: your analytical system needs to support bursty traffic if you want it to be real time.

So here's an example: Amazon launching the first Prime Day in 2015. Just imagine the workload that shot up to maybe twice or thrice the normal level on the day this launched. Now, if your analytical system cannot handle such a spike automatically, then you need to spend a lot of people resources to provision these things up front. And if you cannot serve the spike, then your users are going to suffer a loss of service or degradation of service, which is why the ability to support bursty traffic is super important for real-time analytics. You can't batch it and keep it to be processed a few seconds or a few minutes later, because people want their website or their app to be refreshed with the right checkout carts and the right metrics as soon as they interact with Amazon.

So what is the technical thing that I, as a data engineer, should look for when I'm trying to figure out whether my pipeline or my analytical system supports bursty traffic? One of them is that it needs to be flexible in nature: it needs to have a disaggregated architecture, where you can disaggregate different hardware resources, compute resources, or storage resources from one another, so you can scale them up independently. This is what I mean by a cloud-native disaggregated architecture; this is the key for elasticity. And this is why some of the cloud vendors can do this well: there's a lot of hardware lying around, but you don't need to use it or pay for it till a spike happens. But then your system needs to react to that quickly and start using more hardware resources as and when needed.

Take, for example, the architecture that Rockset uses, called the Aggregator-Leaf-Tailer (ALT) architecture, which provides this disaggregation of compute and storage. I'm going to explain why this kind of architecture is useful for real-time pipeline systems and real-time analytical systems. The ALT architecture is not something that we invented at Rockset; it is an architecture that is widely used in lots of Facebook backends, for example the Facebook News Feed and those kinds of analytical systems, which have high-volume writes and high-volume queries at the same time. The interesting part of this architecture is that there are three things in it which are disaggregated. On the left, you see data coming in from different sorts of pipelines, from data lakes or maybe transactional databases. Then you see a set of software entities called tailers, which transform the data in some shape or form.

And then they store data in a storage system, which is the leaves. Leaves are the places where the data gets stored. And then when queries come into your system, there are aggregators, on the right side of your screen, which scale up or down based on the volume of queries. So the three things that are interesting here are tailers, leaves, and aggregators. When there's bursty incoming data, the tailers scale up, and you don't have to scale up the leaves or the aggregators. And the other thing is that even when there is bursty write traffic, it doesn't impact your query latencies, because the aggregators continue to work as they are. Similarly, if your data volume increases, you increase the leaf nodes, so you have more storage where you can store all your data. And if your query volume increases, if suddenly there's a spike in query volume, then you can scale up your aggregators. So this very much follows the CQRS pattern, where all the writes happen on the left side of that vertical line in the middle, and all the reads happen on the right side, so they're completely isolated from one another. The interesting difference between this system and, let's say, a warehouse or some other pipelining system that you might be using is that there's a three-way disaggregation here.
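The independent scaling described above can be sketched as a toy capacity planner in which each tier is sized from its own load signal only. This is not Rockset's or Facebook's autoscaler; the `Load` type, the `plan` function, and the per-node capacity constants are all hypothetical values chosen for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class Load:
    writes_per_sec: float   # ingest rate, handled by tailers
    stored_gb: float        # data volume, held by leaves
    queries_per_sec: float  # query rate, handled by aggregators

# Hypothetical per-node capacities, assumed only for this example.
TAILER_WRITES_PER_SEC = 10_000
LEAF_GB = 500
AGGREGATOR_QPS = 200

def plan(load: Load) -> dict:
    """Size each tier from its own signal; no tier depends on another's load."""
    return {
        "tailers": max(1, math.ceil(load.writes_per_sec / TAILER_WRITES_PER_SEC)),
        "leaves": max(1, math.ceil(load.stored_gb / LEAF_GB)),
        "aggregators": max(1, math.ceil(load.queries_per_sec / AGGREGATOR_QPS)),
    }

normal = plan(Load(writes_per_sec=20_000, stored_gb=2_000, queries_per_sec=400))
# A write burst triples ingest while data volume and query rate stay the same.
burst = plan(Load(writes_per_sec=60_000, stored_gb=2_000, queries_per_sec=400))

print("normal:", normal)
print("burst: ", burst)
```

In this sketch a write burst changes only the tailer count, which mirrors the point made in the talk: the query path (aggregators) is untouched, so query latency is insulated from ingest spikes.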

This is the disaggregation of storage, of the compute needed for queries, and of the compute needed for writing into the system. This three-way disaggregation is what is needed for real-time analytics. So again, if you're using a data system and you're trying to evaluate whether it is really good for your real-time analytical needs, you can see if it has an architecture similar to the Aggregator-Leaf-Tailer architecture. Now, like I said, this is an architecture used by Facebook to build the Facebook News Feed, or the multifeed, where the multifeed is all the posts and comments from your friends. And that's very much an analytical system, right? You don't see all your friends' posts; the backend system is going to use a lot of relevance matching, sorting, and aggregations, and then show you only the top posts that the system thinks are relevant to you. So it's a very complex analytical system.

But on the other hand, it's very real time, because the moment somebody makes a comment or posts an article, it needs to be part of the analytical insights that are computed to show you the feed. So the ALT architecture is widely used there as well. This is the inspiration behind Rockset, where we use the ALT architecture for doing real-time analytics. Similarly, LinkedIn used the ALT architecture for its FollowFeed. Again, there's a good blog on it if somebody's interested in reading about it. So the first point that I covered is one basic requirement for a real-time data pipeline, where these two or three things need to be disaggregated from one another. The second point on which I evaluate data systems is whether they support complex queries. So what is a complex query, and why is it needed for real-time analytics? Take a mortgage application: let's say you're applying for a mortgage. This used to run on Hadoop. I remember back in 2008 or 2009, big finance companies started to adopt Hadoop, and this was one of the very first use cases in the financial world. It's about loan data: you create an application, and then the company figures out what credit history you have. So it has to join all these things.

This used to happen in Hadoop. It looks at data from many different sources, it looks at your credit history and loan data, and then gives you a result for your application. Now, joins are super important in these kinds of analytical queries, and aggregations are super important. And joins are usually hard, because if you have two tables and you join them blindly, without actually understanding, how should I say, the resulting set, or the transient set of data that gets created, you might land in trouble. A lot of data systems don't really support SQL joins, but they are super useful for a user. For example, if you have two tables and you do a join blindly, the system might just run out of memory, because the intermediate data set is so big. So the query optimizer and things like that in these analytical systems have to be powerful enough to figure out how to do the join well and how to make sure that the transient data sets are smaller in size, for example by pushing filters down into the query.
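To illustrate why the size of the transient data set matters, here is a minimal sketch comparing a join followed by a filter with the same filter applied before the join. The tables, field names, and the `hash_join` helper are made up for the example; a real query optimizer makes this decision automatically and with far more sophistication.

```python
def hash_join(left, right, key):
    """Join two lists of dicts on `key`, using a hash table built on `right`."""
    index = {}
    for rrow in right:
        index.setdefault(rrow[key], []).append(rrow)
    return [{**lrow, **rrow} for lrow in left for rrow in index.get(lrow[key], [])]

# Hypothetical tables: 1,000 loan applications and 5,000 credit events.
applications = [
    {"applicant_id": i, "state": "CA" if i % 10 == 0 else "NY"} for i in range(1000)
]
credit_events = [
    {"applicant_id": i % 1000, "score": 600 + (i % 200)} for i in range(5000)
]

# Blind join: materialize every joined row, then filter.
joined_all = hash_join(applications, credit_events, "applicant_id")
naive = [row for row in joined_all if row["state"] == "CA"]

# Filter pushed down: shrink one input first, then join.
ca_only = [app for app in applications if app["state"] == "CA"]
pushed = hash_join(ca_only, credit_events, "applicant_id")

print("intermediate rows, blind join:    ", len(joined_all))
print("intermediate rows, filter pushed: ", len(pushed))
print("same result:", naive == pushed)
```

Both plans return the same rows, but the pushed-down plan never materializes the rows that the filter would later discard, which is the memory saving the talk refers to.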

And so this is why you need to pick a SQL system, or a system that has support for joins, and that really powers your analytics. It makes life very easy for your application developers if the system does this automatically for you. Again, sometimes there are two systems that look the same from far away, but if you go closer, you'd see that some systems have SQL without joins and some systems have SQL with joins, and they're really different. If you actually start playing around with them, you'd find that to build real-time analytical applications, developers find it really useful to be able to use SQL with full-featured joins across datasets. Again, this has been happening in the industry. I remember that when Hadoop first came out, we were just writing MapReduce code. And then Hive came along, another project that we open sourced when I was at Facebook. Kafka has something called ksqlDB, but then again, the joins are maybe not as powerful as the Hive joins. And MongoDB is also trying to do something with SQL. So SQL, or some similarly complex query language, is super important for building real-time analytics in general. The third thing that I will cover today is flexible schemas.

So why are flexible schemas useful for a real-time analytical application? And how do flexible schemas show up in all the data that comes through your data pipeline into an analytical system for analysis? Take a very early data project at Facebook; it was called actor logging. It was a data pipeline where applications write data in a certain log format, and then the logs go to a Hadoop system to get analyzed. After we shipped it, the first question that came from application developers was: oh, this time-spent column, you're recording it in seconds, but actually I want it in milliseconds. It's the time spent on a web page. There was no web app at that time; it was just the desktop app. So, how much time was spent per page? This is a schema change request, right? These types of requests come very frequently in the data world, and how the system handles them is something you can use to evaluate data systems when you're comparing which ones to pick for your real-time analytical needs. Take, for example, two records: some records have additional fields, some records have fewer fields. How your real-time data system handles these kinds of workloads is something that you have to consider when building this pipeline.
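One way a system can cope with records like these is to infer the schema from the documents themselves rather than requiring fixed columns up front. The sketch below is an assumption-laden illustration of that idea, not how any particular product implements it; `infer_schema` and the sample field names (`time_spent`, `time_spent_ms`, `page`) are hypothetical.

```python
from collections import defaultdict

def infer_schema(docs):
    """Map each field name to the set of Python type names observed for it."""
    schema = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            schema[field].add(type(value).__name__)
    return dict(schema)

# Hypothetical log records: fields come and go, and one field changes type.
docs = [
    {"user": "a", "time_spent": 12},                         # seconds
    {"user": "b", "time_spent_ms": 12500, "page": "/home"},  # new fields appear
    {"user": "c", "time_spent": "n/a"},                      # same field, new type
]

schema = infer_schema(docs)
for field in sorted(schema):
    print(field, sorted(schema[field]))
```

Recording the set of types per field, rather than a single type, lets old and new records coexist without a blocking schema migration; queries can then decide how to treat each type.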

So flexible schemas are sometimes difficult, because if you have a flexible schema, developers might find it hard to use. The challenge for an application developer is that you have to combine the NoSQL world and the SQL world, and this is where the difficulty usually starts. So take, for example, what some data systems, including Rockset, do: we take all the data that's coming in and create the schema for you. You don't have to create a table with fixed columns; you can just dump JSON. And then you can call describe on the table and it will show you exactly what fields there are, and with what types, because even the type is indexed in these kinds of systems. So handling flexible schemas smartly is super useful for real-time analytics. PostgreSQL is sometimes too much of a fixed schema. CockroachDB has some support where you can change the schema, though it may need some downtime, and you can do one change at a time. Hive is super flexible, but again, not the best performance, because you have to serialize and deserialize at every query. And the last point that I wanted to cover is low-latency queries. This is really useful for real-time analytics, because most of these things are powering live applications.

It's not a BI dashboard that it's powering, right? So low-latency queries are best served by an index. Take, for example, a book: if you are looking for a string, you can scan all the pages to find it, or you can use the index of the book. So using an index is a great way to support real-time queries in your analytical system. Most of us are familiar with indexes, so I don't need to go into the detail, but if you have two JSON docs, there are ways to create an index from both of them. And now when queries arrive, you can query this data and get results very quickly. I've seen this happen when we were working at Facebook: Facebook uses the social graph index system to be able to give you query results very quickly. So indexing every field is a great option. Rockset does this: if you have data coming in as JSON, Rockset indexes all the fields in the JSON. And so now no queries are slow, because you don't have to create an index to make a query faster, right? It's fast by default.
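The book-index analogy above can be sketched as a tiny inverted index that covers every field of every JSON document. This is a simplified illustration only, assuming flat documents with hashable values; the helper names `build_inverted_index` and `lookup` and the sample truck records are invented for the example and do not describe Rockset's actual index layout.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map every (field, value) pair to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for field, value in doc.items():  # values must be hashable in this sketch
            index[(field, value)].add(doc_id)
    return index

def lookup(index, **conditions):
    """Ids of documents matching all field=value conditions, without scanning docs."""
    postings = [index.get((field, value), set()) for field, value in conditions.items()]
    return set.intersection(*postings) if postings else set()

# Hypothetical sensor records from a truck fleet.
docs = [
    {"truck": "T1", "status": "ok", "region": "west"},
    {"truck": "T2", "status": "coolant_leak", "region": "west"},
    {"truck": "T3", "status": "coolant_leak", "region": "east"},
]

index = build_inverted_index(docs)
hits = lookup(index, status="coolant_leak", region="west")
print([docs[i]["truck"] for i in sorted(hits)])
```

Because every field is indexed at write time, any combination of equality filters is answered by intersecting small posting sets instead of scanning all documents; the trade-off is extra work and storage on the write path.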

And you don't have to worry about the index, because when you make a query, the query automatically picks which index to use, whether it should use the inverted index, like Elasticsearch, or the column index, like Redshift or Snowflake. Rockset builds these things together. And this is very useful for building any real-time analytical application, because as an application developer, you don't have to worry about which indexes to build yourself. So these are the two approaches: the traditional batch approach is very much MapReduce, whereas for real-time analytics, indexing is the key to getting good query results when you're processing large data sets. So again, a quick wrap-up: we talked about how analytics has changed from batch to real time, with the real-time news feed at Facebook, ad bidding, and these kinds of real-time applications where the data latency is super important. And to get from batch to real time, sometimes you have to evaluate your data stack.

And my suggestion, based on my experience, is this: I usually look at the list of things shown here, evaluate the data systems that I'm using, and see whether the data system is meeting these needs. If it is, then I can be confident that I can use it for my real-time needs. Also, SQL compatibility is sometimes really useful for making application developers move fast. And these systems typically are very much focused on human efficiency, like total cost of operations, versus hardware efficiency or compute efficiency. So again, I covered only the first four, but there are more items in my list, and I'd love to interact with more of you. So please do send me email, and I can tell you more about the other features that I did not cover in my talk. So again, that's all I have. Hey, Cameron, how are you?

Cameron: That's great. I'm doing great. It's a fascinating area, real-time analytics. It's a whole new world for some of us. I have a question here from the audience; I'm trying to paraphrase this a little bit. They're basically asking: is real-time analytics like this practically restricted to recent business activity, like showing the recent transactions or just showing current aggregated metrics? I guess, as opposed to doing time series analysis like you would do with batch analytics?

Dhruba: Absolutely. I think that's a great question. So real time essentially means that there is a metric that needs to be shortened, and this is what is called data freshness, or data latency, right? Typically, people are used to query latencies, that queries have to be fast. But when you talk about real time, it's the data latency or data freshness which also plays an important part. Take, for example, financial companies: they're trying to find fraud, right? So when somebody makes a transaction, within a few seconds or milliseconds they need to detect if it's fraud. So they need to move this data from their transactional systems to their analytical systems, where the fraud detection algorithm runs. This is what I mean by a really real-time analytical system. And the other part is very much customer-facing applications, right? Let's say you are playing a game, or you are jogging on the street using an app, and the app needs to show, minute to minute, whether you are doing better than the run you did yesterday, for example, instead of showing it one hour later or 15 minutes later. So yeah, real time essentially means: how can I react and make decisions as quickly as events get produced?

Cameron: Perfect. Okay, that makes sense. There was a question about the list of technical requirements for data systems for real-time analytics. The question was: have these been validated with customers? Do you have any use cases to share that confirm this list of requirements?

Dhruba: Yes, absolutely. This is a great question. That's why I keep repeating as part of my talk that these are based on observations and my experience over the last 15 years, right? We have good use cases at Rockset, and I can also point you to some use cases from Facebook. Take, for example, the trucks that I talked about. We have a large customer who is the largest cement manufacturer in the US, for example. They have thousands of trucks on the road, and they need to react immediately when a problem happens on a truck. There are five different sensors, or 50 different sensors, per truck, and they can't afford to miss deliveries, because the cement hardens into concrete if you don't deliver it in time. So this is what I mean: they're very time sensitive. Then there is a hedge fund which is looking at all these search queries coming from the likes of Google and Twitter to figure out how to run its machine learning model to place financial bets. So they're looking at trends in real time. In the early days, people used to do trend analytics in Hadoop, right, and that gives you yesterday's trend. But now somebody needs to detect the trend as soon as it happens, and then react to it. So yeah, I have great examples; please email me and I can send you more.

Cameron: Last question, I guess. So we have the world of analytics that looks at historical data over time, and then this real-time analytics. Do you have any thoughts about frameworks or techniques for integrating these two worlds together at all?

Dhruba: Yeah, absolutely. I think these two worlds are actually one world, right? They're not two separate worlds. You want everything in real time today. It's like watching movies, right? You don't go to Blockbuster Video to pick up a video; you switch on your device and you watch it on streaming. But you can still go to the movies once in a while. So this is how I think about the relationship between batch and streaming. I think they're very much together. They're all SQL, or some complex-query-based analytics. The only difference is the data latency, where batch can afford to wait for many hours or days, whereas real time needs to happen immediately.

Cameron: Makes sense. Makes sense. Well, thanks, Dhruba. We really appreciate you coming on; really fascinating topic.

Dhruba: Thank you. It's great to be here. Thanks for the invite.

Speaker:

Dhruba Borthakur