Apache Kafka forms the backbone of the modern data pipeline, and its stream processing capabilities provide insights on events as they arrive. But what if we want to go further than this and execute analytical queries on this real-time data? The OLAP databases traditionally used for analytical workloads executed queries on yesterday's data, with query latency in the tens of seconds.
The emergence of real-time analytics has changed all this and the expectation is that we should now be able to run thousands of queries per second on fresh data with query latencies typically seen on OLTP databases.
This is where Apache Pinot comes into the picture. Apache Pinot is a real-time distributed OLAP datastore, which is used to deliver scalable real-time analytics with low latency. It can ingest data from streaming sources like Kafka, as well as from batch data sources (S3, HDFS, Azure Data Lake, Google Cloud Storage), and provides a layer of indexing techniques that can be used to maximize the performance of queries.
Watch and learn how you can add real-time analytics capability to your data pipeline.
Cameron: So our last two track speakers are Mark Needham and Karin Wolok. Both currently work at StarTree, a startup founded by the original creators of Apache Pinot. Mark is an advocate and developer relations engineer at StarTree, and Karin currently leads developer community programming in the Developer Relations team. Please welcome both Mark and Karin.
Karin: Thank you, Cameron, for introducing us, and thank you all for joining. I'm Mark. I'm Karin, and this is...
Mark: You can't be me. This is off to a good start.
Karin: Hilarious. Okay, so we'll just get right into it, because our time is pretty limited and we have a lot of content to cover. We'll start off by going over some of the foundations, the basics; we'll try to make this quick. Basically, we're going to start by talking about the different types of analytics use cases. Generally speaking, there are three types. The first, the most popular and most well known, is dashboards and BI tools. This is mostly used for internal purposes by BI analysts, and it's what most people picture when they think of analytics. The second one is user-facing analytics, which is when you take your analytics capabilities and expose them to your end users, the customers or end users of your applications. We'll dig into some examples of this shortly.
And then the third one is machine learning. This is when you expose your real-time analytics to a system of sorts, and it's the system that's actually doing the processing. So things that are more automated, like fraud detection or any kind of anomaly detection. You're feeding the analytics into, essentially, a machine that can take action upon receiving those insights. So those are, generally speaking, the three overarching types of analytics use cases. For this talk, we're going to dig a little further into user-facing analytics. Mark is my slide changer, so I'm going to give him a nod. Okay, so why should you care? I think this is pretty important. Over time, real-time analytics has actually evolved businesses, and I mean that in a very realistic sense: companies have literally productized the ability to give real-time analytics to their end users, and I'll share some examples.
So initially it started with internal uses: dashboards and BI tools. Then came the ability to expose these analytics to your end users. And then there's this business metamorphosis, where companies are actually productizing this capability, enabling and empowering their end users by giving them real-time analytics and insights. Digging into that a little deeper, I want to go over the topic of actionable insights. One of the powerful things a lot of organizations have taken advantage of is not just offering end users real-time analytics. You can offer your end users dashboards and things like that, but when your end users have the ability to take action, or make some kind of decision, based on the real-time insights they're getting, that's when you're empowering them. It also increases engagement inside your application, because they're getting fresh insights and they're able to take action almost immediately, and that's where you have this market advantage, I guess you could say. So actionable insights, I think, is a good thing to take away.
So, a couple of examples of real-time analytics systems on the user-facing side. LinkedIn has a bunch of different user-facing real-time analytics use cases, one being "Who viewed my profile?", which most people are probably pretty familiar with. LinkedIn actually has more than what's written here at 700 million; I think it's 800 million plus users now. And they're providing their end users the ability to see who viewed their profile, with dashboard capabilities. Users can also slice and dice this data by company or geolocation and things like that, so these queries can be a little bit complex. But that's an example of user-facing analytics on LinkedIn's side. Another example...
Mark: I guess the other thing to point out is the data on that chart is live. So if I viewed Karin's profile, it would say 129 and it would go up on the next load. It hasn't been precomputed.
Karin: Thanks, Mark. I'm trying to speed through everything, because we only have 30 minutes. Okay, another example, a kind of offshoot of user-facing real-time analytics, is a user's news feed. Every time you go to your homepage, you have a refreshed, real-time news feed. Now, this news feed has to have the freshest data, which means if something was posted a second ago, you need to be able to see it, and it also has to be relevant for you. The LinkedIn engineering team has to make sure you haven't seen this content before, so there's a lot that goes into this. They have to be able to show you this fresh data. And essentially, what it does for the end users is increase engagement, because if you can make sure they're seeing fresh information and a relevant news feed, then they're able to take action and do things like like or comment.
And that is where the power of actionable insights, which we were talking about before, really comes into play, in terms of user engagement, addiction to the application, and all that fun stuff. And this is a bit more of that product metamorphosis: LinkedIn offers premium features for recruiters, where they have access to Talent Insights, which is real-time information about what's going on inside their talent pools. This is a premium option, so they have literally productized this capability of providing real-time analytics to their end users. Another example of user-facing real-time analytics is Uber Eats. Uber created a kind of restaurant manager dashboard for the restaurant owners, so they can have a high-level view of everything that's happening inside their restaurant. So this is an example from the dashboard.
So you can see there are missed orders in there, inaccurate orders, item feedback, things like that, and all this information is provided to the restaurant owners in real time. Going back to actionable insights, the restaurant owners have the ability to take action if they see something is out of whack. It could be good or bad: maybe all of a sudden the delivery times are extremely long, and they have to be able to act on that kind of thing. They can't find that out tomorrow; they need to know it now, as soon as it happens, right? Providing them actionable insights with fresh, real-time information is where the big value is. I really like this next use case. It's an example of real-time analytics that is user facing, but used internally. So, Stripe, if you don't know them, they're a payment processing system, and they're a pretty big company, so I wouldn't be surprised if most people know them.
Mark: Just a small payment processor.
Karin: Like, you know the company. LinkedIn, it's professional social, whatever. Okay, so Stripe, their use case is really interesting. For one, they're focused on financial data, right? So there are a lot of really complex, logistical things that they have to be really careful of. Click again. So what Stripe is trying to solve with real-time, user-facing analytics is basically this: Stripe has dozens of engineering teams, all working on different facets of the product, and they're all working with some capacity of real-time and batch data, or real-time and historical data. Stripe needs to be able to consolidate all this real-time and batch data from all these different engineering teams and then make it accessible to all the other departments that need access to it.
So things like the accounting teams, the data science or fraud detection teams, and I think the finance teams. All these teams on the other end need access to this fresh, real-time data. So they're trying to pull in and consolidate all this data and then make it accessible in real time. In terms of the process, it's very similar to user-facing analytics; in this case, the users are internal. But I think this is a pretty cool use case; I really like it. And these are some of the challenges and opportunities that Stripe faces. Again, it is a financial data platform, so they don't have a lot of flexibility: it has to be really precise, really accurate, and super compliant. There are a lot of security requirements and things like that. So, a slightly complex problem.
Properties of real-time analytics systems. So, there are three basic things you need when you're building a real-time analytics system. The first is speed of ingestion: you need to be able to pull in fresh data as it happens; you have to be able to ingest this event streaming data in real time, very fast. The second is speed of queries: you not only have to be able to pull this data in in real time, you have to be able to get it out, and that's the importance of query speed in a real-time system. And the third part is that you have to be able to do this at scale. We're talking about, in some cases, hundreds of thousands of queries per second being run across the variety of use cases we were talking about. So those are the three things. I'll let you take it from here, Mark.
Mark: Yeah, so this one's expanding on the previous slide. We're looking at the components that make up a real-time analytics system, and on the outside you can see some of the properties you would expect. But if we go into the middle, this is the ecosystem that Pinot interacts with at the moment. So you generally need some sort of stream processing engine. The most popular one is Kafka, but we also see users working with Kinesis and Pulsar, and there's also support for Google Pub/Sub. That data will then be consumed by Pinot, and Pinot is what's serving the queries. So let's just pick one of them, Kafka, just in case you've not come across it, although I guess a lot of people are reasonably familiar with it.
But this is the model, and most of the stream processors follow a pretty similar one. We'll have some sort of producer, or producers, putting messages or events onto a topic. A topic typically has a name, something like "events", and it's capturing events that could be in JSON, could be in Avro format, could be just plain text. Often those topics will be split up into partitions, so that we can put in more data concurrently. And then over on the other side, we'll have a consumer taking those messages off. I've got consumer groups on here; that's a way of keeping track of which offset you're up to when you're consuming. Pinot actually keeps track of that itself. It will have multiple threads consuming: however many partitions you have, that's how many threads it will consume with, and each one keeps track of which offset it's got up to. Karin, you want to do this one?
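To make that model concrete, here's a stdlib-only Python sketch of the ideas above: a topic split into partitions, a producer appending events, and a consumer tracking its own offset per partition, the way Pinot does internally. The class and method names are illustrative, not Kafka's API; a real pipeline would use a Kafka client library.

```python
# Toy model of the Kafka concepts described above: a partitioned topic,
# a producer appending events, and a consumer tracking one offset per
# partition (mirroring Pinot's one-consuming-thread-per-partition).
import json

class Topic:
    def __init__(self, name, num_partitions=2):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Hash the key to pick a partition, like Kafka's default partitioner.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(json.dumps(value))

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        # One offset per partition; consuming never removes data from the log.
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        records = []
        for p, log in enumerate(self.topic.partitions):
            while self.offsets[p] < len(log):
                records.append(json.loads(log[self.offsets[p]]))
                self.offsets[p] += 1
        return records

topic = Topic("wiki_events", num_partitions=2)
topic.produce("user1", {"user": "user1", "bot": False})
topic.produce("user2", {"user": "user2", "bot": True})

consumer = Consumer(topic)
print(len(consumer.poll()))  # 2 -- both events consumed
print(len(consumer.poll()))  # 0 -- offsets already advanced past them
```

The point of the per-partition offsets is exactly what Mark describes: the log itself is shared and append-only, and each consumer just remembers how far it has read.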
Karin: Sure. Okay, so Apache Pinot. I guess I could mention this later, but it was initially created at LinkedIn, similar to Apache Kafka, actually. It is a distributed OLAP data store specifically built for low-latency, high-throughput queries. Where its power lies is in its ability to ingest data from a variety of different sources, streaming or batch. It can merge these pieces together, it has pre-aggregation and lots of really cool special indexing, and then it makes that data accessible for querying at scale.
Mark: Yeah, I guess what you just pointed out, which was also mentioned in Dhurba's talk just before this, is that in these real-time analytics systems, can you also process batch data? Yes, absolutely you can do that in Pinot, and actually the majority of people using it are doing that, because otherwise you only have the data that's just come in. So we often see people building what they call a hybrid table: a batch load of all the data from previous years, with the live events on top of it, and Pinot will combine those two sets of data together in the query results. It's quite a nice way of combining the two ways of collecting data. So let's quickly run over what the architecture looks like. It's quite a modular architecture, with lots of components that generally each take care of one thing. Sitting on the top, we've got the Pinot controller, which is in charge of the management of the cluster. In a cluster you'd usually have maybe three of these, and one of them will be acting as the leader.
And that's the one in control of the metadata and of making sure the incoming data goes to the right server. For example, when we have data coming in from Kafka, one partition goes to a segment, and then the segments down the bottom will be assigned to a server, and the server is where the data lives. The actual data coming in from Kafka, or being loaded in from Spark, ends up on the Pinot servers. The Pinot servers host the data, and if you have a replication factor, those segments will be spread out across the servers. And then the final main component, I suppose, is the broker. That's where the queries come in; they're sent out by the broker to the servers using a scatter-gather approach, and it then aggregates the results and sends them back to the client. So I want to run through three quick demos. We've probably got about 10 minutes, so I'll try and keep this reasonably quick.
But just to give you an idea of what sort of things you can build with it. We're going to build a little application using a tool called Streamlit. This is a Python web framework for building data-intensive applications: you can write your whole web app in Python, and it will generate an almost React-style app for you. For the other bits of the architecture, we're going to process data from a streaming API, take that data, put it onto Kafka using the Python Kafka client, and then Pinot is going to ingest it from Kafka. That's the idea. There's a nice dataset for this: I'm sure most of you will have come across Wikimedia, and they have quite a cool thing called the recent changes feed. This is a continuous stream of structured events. They basically have an HTTP endpoint returning what are called server-sent events; it just sends loads and loads of events, and you can create a wrapper around it and get all the data.
This is what it looks like. So you've got three keys here: an event telling you it's a message, an ID for the message, and then the data itself. The data is basically JSON, which is quite nice: you can take the JSON, stick it on Kafka, and then take it from Kafka into Pinot. That's the plan. So let me quickly show you; I've already got it loaded, but I'll show you what it looks like. Let me zoom in a little bit and minimize this. This is how we can go about processing the feed; maybe I'll just copy that for a minute. If I come over here, this is what the data looks like; you can see the data coming in, lots and lots of messages. And then we've got this wiki-to-Kafka script. You don't need to remember how all this code works; I'll share the link to the code afterwards. But we're effectively processing the messages here.
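Since the feed is plain server-sent events, each message is just a few `field: value` lines with a JSON payload in the `data` field. Here's a minimal stdlib sketch of parsing one such message; it's deliberately simplified (the SSE spec also allows multi-line `data` fields, which this ignores), and the sample payload fields are illustrative of the Wikimedia feed rather than an exact record.

```python
import json

def parse_sse_message(raw):
    """Parse one server-sent-events message into its event, id, and data fields."""
    fields = {}
    for line in raw.strip().splitlines():
        if ":" in line:
            # Split on the first colon only; values may themselves contain colons.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    # The Wikimedia feed sends the payload as a JSON string in the data field.
    if "data" in fields:
        fields["data"] = json.loads(fields["data"])
    return fields

raw = (
    'event: message\n'
    'id: [{"topic":"eqiad.mediawiki.recentchange","offset":123}]\n'
    'data: {"wiki": "enwiki", "user": "ExampleBot", "bot": true}\n'
)
msg = parse_sse_message(raw)
print(msg["event"])          # message
print(msg["data"]["wiki"])   # enwiki
```

In practice you'd let a server-sent-events client library do this framing for you, as the demo does; the sketch just shows why the `event`/`id`/`data` keys appear in every message.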
So we're using the Python server-sent events client, which creates a basically infinite stream of events. We've got a bit of code to handle disconnecting from the wiki endpoint, which is what this while loop is doing, and then we're pretty much just putting the messages into a Kafka topic called wiki_events. That picks up all the events, and every 100 messages we flush them to Kafka. That's pretty much it: the messages make their way into Kafka. You could then do whatever you wanted with them in Kafka; what we're doing is loading them into Pinot. Pinot works similarly to a relational database: you create a table, and you have to define a schema for your table.
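The wiki-to-Kafka script described above boils down to: read events forever, send each one to the wiki_events topic, and flush every 100 messages. Here is a stdlib-only sketch of that loop, with a stand-in producer class so it runs without a broker; the real script uses the Python Kafka client, and the names here are illustrative.

```python
class FakeProducer:
    """Stand-in for a Kafka producer: buffers sends, moves them on flush."""
    def __init__(self):
        self.buffer = []
        self.flushed = []

    def send(self, topic, value):
        self.buffer.append((topic, value))

    def flush(self):
        self.flushed.extend(self.buffer)
        self.buffer = []

def pipe_events(events, producer, batch_size=100):
    # Mirrors the wiki-to-Kafka loop: send every event, flush every batch_size.
    for count, event in enumerate(events, start=1):
        producer.send("wiki_events", event)
        if count % batch_size == 0:
            producer.flush()
    producer.flush()  # flush any trailing partial batch

producer = FakeProducer()
pipe_events(({"id": i} for i in range(250)), producer)
print(len(producer.flushed))  # 250
```

Batching the flushes like this trades a little latency for much better producer throughput, which is why the script flushes every 100 messages rather than after each one.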
And by default, for whatever fields you name in here, like columns that you name, it will look for fields with the same names in your data source, i.e. the Kafka message, and map them directly. So if you use exactly the same names, it works perfectly; we don't even have to write any mapping configuration. So id, wiki, user, title, and so on. Those are all dimension fields, which are generally used for filtering or grouping. There's another type called metric, which you'd use for numeric values; if it were a movie dataset, that might be the budget, or the amount of money the movie made, anything numeric. These types are basically hints to Pinot for how it can optimize that data, in terms of storage and in terms of how it goes about querying it.
And then finally, we've got a date-time field. If you're working with a streaming source, you need to specify a timestamp. If you don't want to use the names of the fields as they are in the source, you can apply transformation functions, but for the purposes of this demo we're not going to do that. On the other side, we then define a table. You'll notice here that we've called the table a real-time table. Just to link back to what I was saying before about being able to combine batch data and real-time data: if we were loading a batch of data, say I had a big file of messages and I just wanted to load that straight in, I would set this table to be offline, and I could use the same schema. So you can have the same schema, a real-time table, and an offline table, and Pinot will work out exactly where the boundary is when it needs to search each one. If you've got a real-time table, you need to specify a time column, in this case called ts, and a replication factor; we're doing this on a laptop, so we've just got one, plus the name of the schema. And then the other important thing is down here: we need to tell it where Kafka is running. We're running in Docker, so we point it at the Kafka container, and we tell it "low level", which means it processes each partition individually.
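Putting the two configs Mark describes side by side, here are trimmed-down sketches of a Pinot schema and a real-time table config, written as Python dicts mirroring Pinot's JSON format. The key names follow Pinot's schema and table config conventions, but the field names, broker address, and values are illustrative placeholders for this demo, not a complete production config.

```python
# Sketch of a Pinot schema: dimensions for filtering/grouping, and the
# timestamp column that every streaming table needs.
schema = {
    "schemaName": "wikievents",
    "dimensionFieldSpecs": [
        {"name": "id", "dataType": "STRING"},
        {"name": "wiki", "dataType": "STRING"},
        {"name": "user", "dataType": "STRING"},
        {"name": "title", "dataType": "STRING"},
        {"name": "bot", "dataType": "BOOLEAN"},
    ],
    "dateTimeFieldSpecs": [
        {"name": "ts", "dataType": "TIMESTAMP",
         "format": "1:MILLISECONDS:EPOCH", "granularity": "1:MILLISECONDS"},
    ],
}

# Sketch of the matching real-time table config.
table = {
    "tableName": "wikievents",
    "tableType": "REALTIME",            # would be OFFLINE for a batch table
    "segmentsConfig": {
        "timeColumnName": "ts",
        "replication": "1",             # one copy -- it's a laptop demo
        "schemaName": "wikievents",
    },
    "tableIndexConfig": {
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "wiki_events",
            "stream.kafka.broker.list": "kafka:9092",   # the Docker container
            "stream.kafka.consumer.type": "lowlevel",   # one consumer per partition
        }
    },
}
```

The hybrid setup Mark mentions is exactly this pair plus a second table config with `"tableType": "OFFLINE"` against the same schema.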
And we tell it to consume from the wiki_events topic. There's other stuff you can do: Pinot will apply some basic indexes for you to try and make your queries quicker, and if you know your data and you want to apply particular indexes, you can do that too, but again, for this demo, we're not going to. I'll just close this. And if I come across to Apache Pinot, this is what the UI looks like. Let me make it a little bit bigger. We can basically run a query, and it shows you how many events we've loaded. We can see how many are bots by doing a group by on the bot column, so you can do basic queries like this. At the moment you can only do fairly basic joins, but there's an in-progress PR for doing joins between tables in the next version, which is quite a requested feature, as you might imagine.
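The queries Mark runs in the UI are plain SQL: count the events, then group by whether the author is a bot. Here are the two queries as you might type them into the query console (table and column names taken from the demo), plus a tiny in-memory version of the group-by so the logic is checkable without a running cluster.

```python
from collections import Counter

# The queries as you'd run them against the demo table.
count_query = "SELECT count(*) FROM wikievents"
group_query = "SELECT bot, count(*) FROM wikievents GROUP BY bot"

# The same group-by, executed in memory over a few sample events,
# just to show what the query computes.
events = [
    {"user": "ExampleBot", "bot": True},
    {"user": "alice", "bot": False},
    {"user": "OtherBot", "bot": True},
]
by_bot = Counter(e["bot"] for e in events)
print(by_bot[True], by_bot[False])  # 2 1
```

Against the live table, the broker fans this query out to the servers holding the segments and merges the per-segment counts, which is the scatter-gather Mark described earlier.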
And then the last bit of the demo: I just wanted to show you what a Streamlit dashboard on top of this would look like. So this is a dashboard using Streamlit and the Plotly plotting library. We've got a query that keeps track, per minute, of how many messages came in, how many users made those changes, and how many domains those changes were made on. I've been running it for about an hour, so you can see it changing over time. And down here we can see the latest changes being made. Usually what we see people do is these top-level aggregates that keep refreshing, and then they want to do some drill-down.
So, like, I want to drill down and do some sort of filtering: show me which bots are doing the most things, which users are making the changes, where the changes are being made. It's mostly on wikidata, which is the structured data store. And finally, we can do interactive drill-downs: hey, I want to pick one of these users and see exactly what they've been doing. So those are the sorts of things we see people doing. And, as Karin was pointing out in all the examples, we want some sort of insight on these dashboards where there's something we can actually go and do, or a way to make the application more interactive for the users.
I guess we might just skip through some of this, but we'll share the slides afterwards; we've probably only got about a minute left. This is just talking through how the data gets stored: basically, Pinot partitions the data into segments, and then multiple segments make up a table. And then there are all sorts of different indexes. Most of these are fairly standard; the only unusual one, and the one the company name comes from, is the star-tree index, which gives you the ability to tune the space/time trade-off. It's basically a way of controlling the latency of your queries.
So you can say, hey, I want to make sure that for every query, the maximum number of records that's going to be scanned is, say, 10,000. And you can indicate which aggregations you'd like it to pre-compute. So you can choose where you sit along the spectrum: I know there are going to be these queries, so I'm going to pre-compute these bits, but if somebody runs those other queries, I'm happy for them to take a little bit longer. It's trying to fall somewhere between only indexing and pre-aggregating everything. And I guess the overarching idea is that most people are pairing Pinot with Kafka, and you can get those two pieces of software working together really nicely. I'll let Karin finish up.
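The trade-off Mark describes can be sketched in a few lines: pre-aggregate along chosen dimensions at ingest time, so that a query answers from the small aggregate table instead of scanning every raw record. This is only the idea behind the star-tree index, not its actual tree structure, and the dimensions and values here are made up for illustration.

```python
from collections import defaultdict

events = [
    {"wiki": "enwiki",   "bot": True,  "changes": 1},
    {"wiki": "enwiki",   "bot": False, "changes": 1},
    {"wiki": "wikidata", "bot": True,  "changes": 1},
    {"wiki": "wikidata", "bot": True,  "changes": 1},
]

# Pre-compute SUM(changes) for every (wiki, bot) combination at ingest time.
pre_agg = defaultdict(int)
for e in events:
    pre_agg[(e["wiki"], e["bot"])] += e["changes"]

def sum_changes(wiki=None, bot=None):
    # Fast path: answer from the (much smaller) pre-aggregated table,
    # so the records scanned is bounded by the number of dimension combos,
    # not the number of raw events.
    return sum(v for (w, b), v in pre_agg.items()
               if (wiki is None or w == wiki) and (bot is None or b == bot))

print(sum_changes(wiki="wikidata"))  # 2
print(sum_changes(bot=True))         # 3
```

The space/time knob is which dimensions and aggregations you pre-compute: more combinations means more storage but a bounded scan for more queries, which is the "somewhere between only indexing and pre-aggregating everything" spectrum.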
Karin: Yeah, I mean, this slide kind of speaks for itself. Pinot is a pretty mature product. It was originally built at LinkedIn, and it's growing really, really quickly; it's like on fire. It's used by over 100 different companies. And in terms of performance, I think the largest Pinot cluster, or at least one that we know of, is ingesting over a million events per second, serving over 200,000 queries per second, while still maintaining millisecond latency. So it's pretty powerful. And then, yeah, the takeaways.
Mark: So these are the things to remember.
Karin: Yes. You guys can read this now or review it afterwards. Yeah. And probably...
Mark: So, yeah: actionable insights, fresh data, fast queries at scale. And if you want to get started, I guess the most interesting thing is to come and chat to us: stree.ai/slack is where you can join us. And if you want to see some guides or material to help you do stuff, that's dev.startree.ai. Yeah, I guess we've got a few minutes.
Cameron: Yeah, that's great, guys. We do have a few questions that have come in, and some of them revolve around the essential differences between Apache Pinot and StarTree. Folks are asking, what's the difference there?
Mark: Oh, I see. Yeah, so StarTree is the company, which employs many of the contributors to Apache Pinot. Apache Pinot is the open source product; StarTree is the company with people who work on it. And then, we didn't cover it here, but there is a StarTree Cloud product, which is a hosted SaaS version of Pinot.
Cameron: Okay, so it's the same Apache Pinot, but hosted and managed with some services on top. Okay, perfect. Yeah. And then...
Mark: I think there's some extra features in that version as well.
Cameron: Of course, that makes sense. And another question we have here, I'm paraphrasing, is about how you do integration: how do you connect to data sources that aren't necessarily streaming, Kafka or whatever? Is there a framework for that?
Mark: So you mean, like, a framework for general streaming? Or do you mean any data source?
Cameron: Well, they didn't say, but I would imagine things like application systems and other types of systems. I think that's what folks are wondering, yes.
Mark: I mean, yeah. There are kind of two ways of getting data into Pinot: a batch way and a streaming way. They're all implemented in Java code, but it is possible to write your own connector if you have a system that's not covered; many of the popular ones are already there, and I think I mentioned a few of them. On the batch side, there are already connectors for Spark, for Flink, and obviously for standalone loading of stuff straight from a server. I'm sure there are other ones I just can't remember, but it's designed to be like that. We didn't actually have a slide on it, but there's another slide which shows that the architecture is very pluggable. There are basically SPIs, so you can, well, easily if you know how the code works, go and implement another data source for it.
Cameron: I've got one last question; we're almost out of time. Somebody asked, does Pinot support stream ingestion with deletes?
Mark: It doesn't do deletes. The way it works is, when the data is coming in from the stream processor, it's building up the segment in memory, and at some point it'll reach what's called a segment threshold, and that segment will then effectively be flushed to disk. We didn't talk about this because it would be out of scope for this sort of talk, but you can then choose your source-of-truth backing store; you can use S3, or Google Cloud, or Hadoop, or whatever. And then if you see, oh, actually, I need to go and delete a record, the level at which you could do it would be: okay, I need to go and replace a segment. So it's a little bit tricky at the moment. There is a way to do it, but the segment would need to move from the real-time table into an offline table, and once the segment's in an offline table, you can then say, hey, I want to replace it. As long as you load in data with the correct timestamps, with the same start and end, it'll be able to replace that segment. It's not quite as simple as, hey, I want to go in and just delete a record. But I guess that's partly how it's able to get the speed: because it assumes it's always append-only.
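The ingestion behaviour Mark describes, building a segment in memory, sealing it once a threshold is hit, and never mutating sealed data, can be sketched in a few lines. The class name and the row-count threshold are illustrative (Pinot's actual thresholds can be time- or size-based); the point is only the append-only shape.

```python
class RealtimeSegmentWriter:
    """Append-only ingestion sketch: buffer rows in memory, seal an
    immutable segment when the segment threshold is reached."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.in_memory = []
        self.sealed_segments = []   # flushed, immutable segments

    def ingest(self, row):
        self.in_memory.append(row)
        if len(self.in_memory) >= self.threshold:
            # Seal the segment as an immutable tuple. After this point,
            # "deleting" a row means replacing the whole segment.
            self.sealed_segments.append(tuple(self.in_memory))
            self.in_memory = []

w = RealtimeSegmentWriter(threshold=3)
for i in range(7):
    w.ingest({"id": i})
print(len(w.sealed_segments), len(w.in_memory))  # 2 1
```

Because sealed segments are immutable, writes never contend with reads at the row level, which is the speed advantage Mark mentions; the cost is that row-level deletes turn into whole-segment replacement.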
Cameron: That makes sense. All right, any other questions? I assume they'll be able to reach out to you in the chat. Yeah, let's wrap up. Thanks so much, guys. Thanks, Mark. Stay here.
Mark: Thanks for having us
Cameron: Interesting topic. Thank you.
Karin Wolok, Head of Developer Marketing and Community
Mark Needham, Developer Relations Engineer