Companies today are operating in a world where urgency and uncertainty are both at all-time highs, creating a surge in demand for real-time access to data for analysis. Delivering data to meet this demand is no trivial matter — particularly when you consider the variety of data types and workloads. For instance, you can build an ETL data pipeline to get 300+ petabytes of clickstream and impression data ready for analysis. However, billions of rows of operational data with hundreds of table joins will crush that same pipeline.
Watch this session to discover whether it’s possible to build a general-purpose data pipeline that meets all the analytics needs of the business. You’ll learn:
- The pros and cons of general-purpose vs. purpose-built data pipelines
- How organizational structure and culture can impact data architecture
- The common characteristics of every data pipeline
- How to build a pipeline for real-time data access
Shane: So, my name is Shane Collins, as Nick mentioned, and I'm currently at Meta. I'd like to take a moment to talk about some of the things I've gone through at the companies I've worked with in the past. MCI was where I pretty much grew up. One of the interesting things about MCI is that we built a real-time OLAP system running on extremely old technology by today's standards: we had maybe 150 GB of total drive space, compared to the petabytes we deal with today in my role at Meta. At MCI I managed a group that built a product called the rep-level database. It did real-time OLAP processing, and it required huge amounts of very specific coding to make sure all the right elements were being processed and updated at exactly the right time in the process, so that people could see information in a real-time way. A recurring theme everywhere I went, though, was that the management I worked with all wanted their information yesterday. They wanted it with their breakfast and with their dinner, whenever they wanted it, at exactly that point in time, and they didn't want to have to wait on it.
Back then, processes sometimes took well over a day to complete all of the transformations needed for the more complex querying, and I've had the chance to watch this evolve over the last 30 years or so. Rolling the clock forward a little, I found myself at Gateway Computers, for any of you who remember Gateway. Gateway had the same need, and we developed a product we called the Advanced Monitoring Platform that literally screen-scraped various systems and stored all of this information in a database in a way that made sense, while still delivering a kind of real-time reporting. There's a recurring theme there. At NCO Financial Systems, they had a very similar business need, but it was around understanding their employees' progress through whatever progressive-action or performance-management processes they needed. So we built a system at NCO that was very similar in nature: every time something came into the system, it was immediately available to end users, the end users in that case being the executives and management of the call center organizations. When I went to Ticketmaster, they had another very similar problem. Ticketmaster was doing a lot of work with their interactive voice response system.
If you've ever called Ticketmaster or Live Nation and tried to purchase tickets, I used to work with the team that did most of that original collection of data. They needed to understand: is the interactive voice response system working in real time the same way the agents are? Is there some massive gap, or can we actually make sure the interactive voice response system is capable of dealing with massive change in real time? Once again, there was this need for real-time information in a reporting format people could easily understand. At Meta, when I first came on board, one of the biggest problems they had was very similar, so I'm going to spend some time talking about my personal experience at Meta as we go forward. But before we do that, I'll take a moment to calmly reflect on my home and where my heart is. I'm originally from Kansas, but I've spent most of my time, and raised my children, in Colorado. This is from the rooftop of my house in Colorado that we Airbnb; we call it the Sky Shrine Sanctuary for some obvious reasons. I spend time up here contemplating how to make real-time data processes move faster. Next up, I'd like to talk a little about the Meta data engineering journey.
So initially, circa 2007, we have about 15 terabytes of data in MySQL. I'm not sure of the exact number, but the MySQL implementation is really designed to be OLTP. For any of you who have worked significantly with MySQL, it's not necessarily the fastest or greatest engine for delivering, say, an enterprise data warehouse, and it certainly wouldn't be the go-to choice for something like a big data platform, although it's a great OLTP collection repository.
So then we end up with 18,000 servers dedicated to MySQL and 4,000 partitions. Hive was then stacked onto that to answer the original question: how do we deal with all this OLTP information when we need to get it into a big data environment and make it work? From there we moved to Presto, because Presto adds a compute layer to the Hive stack that gives you the opportunity to bring in datasets from other places, as well as leverage some really amazing advanced features for different kinds of analytic processing. Stacking that on top of Hive, you get the best of both worlds: the power of Hive to get at really large data sets, and then Presto as another layer on top for more advanced analytics and the ability to consolidate and create that information in one place. It also scales much better; in total, the scalability of the combined Presto-Hive stack is amazing. In 2013 came the open source release of Presto that everybody's aware of, and that release brought many other members of the community into helping advance Presto and make it what it is today. Netflix announced they were using it for 10 petabytes of data stored in Amazon S3.
And then by the time we got through there, we had performance issues with Oracle Applications on Oracle Exadata. So we asked ourselves: if we have this Oracle Exadata platform and it's not delivering the results we want, not updating as fast as we'd like it to, what then? For any of you who have used Exadata, Exadata is extremely fast and very, very good. But the problem we ran into was that the amount of data we were beginning to process in Exadata, even simple ERP financial data, was taking more than the allotted time: the two-hour window for the ETL to even finish. As that started to happen, the reports basically just became next-day once there was enough data in the system. So the Oracle Exadata implementation didn't fit for us, and we needed to find some other kind of solution. We started looking at various RFPs, and the Incorta POC happened roughly in 2016. We worked with the Incorta implementation team and created some really cool product. Around that same time in 2016, the team also said, hey, we solved this 10-plus-petabyte issue with ADW and Presto; let's look for a business application where we can implement our business application data into this environment, which would have included the data we were originally having issues with from Oracle Exadata. What came of that was we eventually converted our accounts receivable ERP data into Incorta, and this required some interesting engineering.
We'll talk in a couple of minutes about why we made that choice and why, among the different platforms, this was a really good one. In 2017, I built the accounts receivable ERP conversion in Incorta. Next up, we have our third normal form data. The business problem with third normal form data is that it's low volume: it's a relational data model, it's single node, and it doesn't scale very well in comparison. It really doesn't matter which third normal form system you're using; you're going to run into scalability issues when you start dealing with billions of records. Those systems won't process the joins very well, and they'll also have a hard time filtering a really massive amount of data, especially on anything that isn't indexed, though they do have indexes. On the other side, when you're dealing with impression and click data, which is much larger and much less volatile, you end up with this massive stream of information that's extremely large.
But it's also very hard to get the impression and click data to join across many dimensions. If you have several ancillary dimensions added to your impression and click data, you're going to have a hard time getting the joins to process in any reasonable amount of time. Looking at this a little deeper, the real problem we're running into with third normal form is that the scalability isn't there, and the distributed query engine isn't there. The issue we were experiencing with big data, on the other hand, is that you have a hard time getting joins across highly normalized datasets. For those of you who may not know what normalized means: if you have one really big table with all kinds of fields in it, something like a name attribute or a street address might be associated with you as an individual. If that data lived in a single record set, it would be highly redundant, because every record involving you would have to carry that information so that somebody could cross-reference it, say, to find all of the transactions that occurred in Vermont. By moving the address into a separate table, the address record is held only once, as opposed to being repeated for every single click you ever made inside whatever application you're working through. It's very useful to normalize that information so it takes up less space in the system.
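The trade-off described above can be sketched in a few lines. This is a minimal illustration with hypothetical table and column names (not Meta's actual schema): the denormalized form repeats the address on every click event, while the normalized form stores it once in a dimension and keeps only a key on each click.

```python
# Denormalized: the address is repeated on every click event.
clicks_denormalized = [
    {"user": "alice", "street": "12 Elm St", "state": "VT", "page": "/home"},
    {"user": "alice", "street": "12 Elm St", "state": "VT", "page": "/cart"},
    {"user": "alice", "street": "12 Elm St", "state": "VT", "page": "/pay"},
]

# Normalized (third normal form): the address lives once in a
# dimension table; each click stores only the foreign key.
users = {1: {"user": "alice", "street": "12 Elm St", "state": "VT"}}
clicks = [
    {"user_id": 1, "page": "/home"},
    {"user_id": 1, "page": "/cart"},
    {"user_id": 1, "page": "/pay"},
]

def transactions_in_state(state):
    """'All the transactions that occurred in Vermont': resolve the
    filter on the small dimension, then scan the slim click table."""
    ids = {uid for uid, u in users.items() if u["state"] == state}
    return [c for c in clicks if c["user_id"] in ids]

print(len(transactions_in_state("VT")))  # 3
```

The normalized form reads the address once per user instead of once per click, which is exactly the space and read savings the speaker describes.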
And it processes much faster if you're dealing with a single node or a single machine. But when you're dealing with a really large-scale environment, such as what Facebook has with its impression and click data, it's very good to have something capable of dealing with it at that scale. So, all is cool in the metaverse. What we're really looking at here is a couple of different architectural setups, and we talked a little about the data problem versus the business problem. The data problem is how much data you can actually process, and something like Presto is very good at dealing with large amounts of data. Third normal form data, the normalized information we were talking about a moment ago, where everything is stored with as little redundancy as possible, is optimally stored in an RDBMS, a relational database management system. To speak a little about RDBMS versus what you might experience with a big data platform: big data platforms usually have some ability to do bucketing, of course, and also partitions, so you do have some tools to minimize the amount of data being read before you actually start processing the joins. But beyond that, anything you do that joins two tables together is going to require nested loops in a big data environment. In a nested-loop situation, if you have one record on each side, that's great: it's one to one, and the system knows right away what they are. Once you have two records on each side, you now have four comparisons, because you have to compare the first to the first, the first to the second, the second to the first, and the second to the second.
The bigger that gets, the worse it gets: you have an N-squared problem, where you need more and more processing cycles to accomplish the same thing. The system will get through it, because it's going to brute force it, scale it out, and spread it across as many places as it has to in order to finish. When you join two tables where one table has been partitioned and broken out, the system takes the smaller of the two tables and distributes it across all of the nodes it's going to do the join processing on, and it takes the larger of the two and breaks it into however many nodes' worth of pieces you have. So if you have 10 nodes, it breaks the larger table into 10 pieces, ships each piece to a different place, and compares each tenth to the entire second table. The join processing at that point is done through nested loops. In contrast, third normal form data lends itself better to business problems. The reason is that in third normal form you have that compacted information, where your address only occupies one line, as we talked about earlier. Having the information compacted at that level means you only have to read it once for that one ID, and it will be in memory at that point; every other time that ID shows up in the original data, the system just looks in memory and says, oh, that's the value I need.
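The two join strategies just described can be sketched directly. This is a simplified illustration, not Presto's actual executor: a naive nested-loop join whose cost is the product of the two table sizes, and the distribution scheme the speaker describes, where the larger table is split into one piece per node and the smaller table is shipped whole to each piece.

```python
def nested_loop_join(left, right, key):
    """Naive nested-loop join: every left row is compared to every
    right row, so the comparison count is len(left) * len(right)."""
    comparisons = 0
    out = []
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, comparisons

def distributed_join(big, small, key, nodes=10):
    """Sketch of the strategy above: break the larger table into
    `nodes` pieces and compare each piece against a full copy of the
    smaller table (conceptually, one piece per node)."""
    chunk = max(1, len(big) // nodes)
    pieces = [big[i:i + chunk] for i in range(0, len(big), chunk)]
    results, comparisons = [], 0
    for piece in pieces:  # each iteration stands in for one node
        joined, c = nested_loop_join(piece, small, key)
        results.extend(joined)
        comparisons += c
    return results, comparisons
```

Note that distribution doesn't shrink the total N-squared work; it divides the same comparisons across nodes so the wall-clock time drops, which is the "brute force it and scale it out" behavior described above.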
So third normal form data is very fast at doing joins. It's also indexed, and without going into a big discussion of how indexes work, indexes mean the system usually only has to do two, three, or four reads; you can get at a billion records with something less than seven reads and find any specific individual record you need in most cases. So the indexing ability of third normal form data is superior to using bucketing and that sort of thing in a big data environment. If you have highly redundant information in your data set, third normal form is absolutely faster at getting at what your users want to see. On to modernizing our legacy analytics stack: for the accounts receivable portion of our finance implementation, we now have Incorta pretty much fully operating in place of what we used to use, which was OBIA, Vertica, OBIEE, and Tableau. There are still some legacy Tableau reports involved in that. But one of the really great things about the product is that it's interchangeable: you can stack Tableau on top of Incorta and pull out whatever you need from either location. So if you prefer the Tableau user experience, with visualizations that are extremely robust and flexible, you can do that. One of the huge advantages to Incorta in that comparison, and one of the reasons it's become so popular, is that we actually cross-trained a bunch of our employees on how to build reports.
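The "billion records in a handful of reads" claim follows from how B-tree indexes branch. Assuming a B-tree where each page holds on the order of a couple hundred keys (a typical fanout; the exact number depends on key size and page size), the number of page reads is roughly one per tree level:

```python
import math

def btree_reads(rows, fanout):
    """Approximate page reads to locate one row via a B-tree index:
    one read per level, depth ~ ceil(log base `fanout` of `rows`)."""
    return math.ceil(math.log(rows, fanout))

# With ~200 keys per page, a billion rows is reachable in about
# 4 reads; even a modest fanout of 100 stays under seven.
print(btree_reads(1_000_000_000, 200))  # 4
print(btree_reads(1_000_000_000, 100))  # 5
```

This is why a point lookup through an index beats scanning even a well-bucketed big data table: the work grows logarithmically with row count rather than linearly.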
And this is something that can't be overemphasized. End users are constantly wishing they could have the report just slightly different from how you built it, and you know this if you do this kind of work, because you get tasks like "can you just move this column three columns to the left?" If you have a tool with a really easy-to-use, flexible front end, it's very easy for end users to do that themselves. There's a handful of things that are very hard to replicate if you're trying to do something like that in a big data platform: it takes a tremendous number of skills, you have to understand a lot about how the big data environment works, and then you have to have a visualization tool capable of handling it. Whereas with a product that has a solid business model layer, or business schema, you can hand it to your end user and say, hey, here's your data, here are all the columns you're used to, here's the report; you can copy and paste it from my folder into your folder if you want and make whatever changes you want. We did that with several hundred users at Meta, and they started making significant modifications. In fact, the last time I checked, my initial report had something like 180 different versions of it created by a variety of other people in the organization so they could have exactly what they wanted. Otherwise, that would have been more than 180 additional tasks assigned to me to create and maintain different reports, generating massive amounts of tech debt to maintain.
So every time something changes, if you have some kind of schema or way of controlling the accuracy of the data that comes out of it, and a user interface that allows for that as well, then you can meet a business need from a user interface perspective, which is really important to most end users. On to our next-level pipeline engineering: we have a use case for migrating from PL/SQL to PySpark. Originally, we rewrote a bunch of our PL/SQL code that was doing the processing for accounts receivable aging, using PySpark inside of Incorta. This is where we'll talk about how we make the two systems work together. In that original use case, PL/SQL was relatively performant, especially on Exadata. PySpark was also relatively performant, because the data set was small enough that it could crunch through it. But we figured we could scale it better if we weren't limited to the number of nodes beneath the Incorta stack, and this is where we said, you know, it probably makes sense to take a more hybrid approach and look at using Presto for its advantages with some of this bigger data processing. The original PySpark implementation also involved taking all of the keys from all of the relational dimensions, pulling them into the primary fact table, and holding just the changes in balances for accounts receivable in the fact table alongside the keys for those transactions. This is significant: reducing that table to the smallest possible size made it very lightweight, so you could get high levels of performance out of the user interface whenever users queried this information.
So if the user went in and requested something relational, Incorta's direct data mapping process was able to go from the dimensions to the facts quite quickly. Everything is a one-to-one relationship there from the fact to the dimension anyway, and that one-to-one direct data mapping and filtering capability allowed us to produce results for our end users quickly, usually in less than 10 seconds before you start seeing the graphs pop in. And that's without anything even being pre-aggregated; it really boils down to having a really good, solid center to the star schema.
So we created that initial star schema central table in PySpark. Then, once we decided we needed to bring it up a notch, we moved the code that builds it back into Presto, which gave us the leverage of a much larger system. This was useful for the really large datasets being compared at that point, because the data got much larger when dealing with all the possible entries: you can have thousands of applications and/or adjustments to a single transaction, or vice versa. You could have one application involving one cash receipt, for example, that pays 4,000 invoices. So you have lots of many-to-many relationships there, and it was the right point to take advantage of our big data environment to create a product that really worked. The result was that we reduced the amount of time it takes to process that central star schema to a much smaller number, bringing it back in and having everything loaded in near real time. Now we loop back to the original concept: real-time reporting. Executives and managers everywhere want their information yesterday, and they want it as soon as it happens. To deliver that, you have to be able to process in real time, on the fly, or pretty close to it. So having a process that could chew through the data in five minutes or less on the Presto side, then feed that back into our Incorta environment, where the relational mapping could pull in all of the ancillary attributes and give you all of the feature-rich added text detail you needed, and then display it in a robust graphical model that people could easily copy, paste, and make their own copies of, was extremely powerful for us. And the end game isn't always dashboards, either.
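The many-to-many step above, collapsing receipt applications into per-invoice balance changes, can be sketched as a small aggregation. Field names here are hypothetical illustrations of the pattern, not the actual AR schema; in production this is the kind of reduction the talk describes running in Presto over billions of rows.

```python
from collections import defaultdict

# One cash receipt can pay many invoices, and one invoice can be
# touched by many receipts: a many-to-many application table.
applications = [
    {"receipt_id": 900, "invoice_id": 1, "amount": 40.0},
    {"receipt_id": 900, "invoice_id": 2, "amount": 60.0},
    {"receipt_id": 901, "invoice_id": 1, "amount": 10.0},
]

# Collapse the applications into one net balance change per invoice,
# which is all the slim central fact table needs to hold.
balance_changes = defaultdict(float)
for app in applications:
    # A payment application reduces the invoice's open balance.
    balance_changes[app["invoice_id"]] -= app["amount"]

print(dict(balance_changes))  # {1: -50.0, 2: -60.0}
```

Doing this reduction upstream is what keeps the central star schema table small enough for sub-10-second interactive queries downstream.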
One of the things you can see from the way the process works is that if you can get the data back out of Incorta into another system, you then have the ability to do other processing on it.
As everybody knows, Facebook is pretty big into machine learning, and the things we do around that are not limited to just the impression and click data. They also want to run machine learning processes on finance information, things like accounts receivable detail. Setting up feature sets that run off this information can be done by simply exporting it from Incorta back into whatever system you choose. So whether you get it back into a big data environment, Presto, Redshift, Google's Cloud Platform, or any other platform you choose, you can port that information back and forth, end up with stellar results, and also support your other downstream bigger-data applications that are not specifically dashboard related.
So, how to be a data engineering hero? I'm not sure I have this one totally figured out yet. I'm not sure if I'm the hero or the villain; I think it depends on who you ask. But when you look at it as a whole, there's a handful of questions everybody has to ask themselves as they figure out what they're going to work on. Are you going to pick whatever is on the business-critical path? That's probably the safest, but it's also the case where the squeaky wheel gets the grease. With the business-critical path, where you have deliberate engineering, you have end users who have requested something extremely important to them, and you have to build what they asked for. And we all know, if you've worked in this industry for a while, that what they ask for isn't always what they really want. Being able to see around corners a little bit helps, but that can take you into a really large list of things that aren't necessarily easy, and you might not get a whole lot of credit for them initially.
But the truth is, if you have other things you need to do, and it's not exactly what the end user asked for, or even if it is, you have to look at how to prioritize. Are you going to shoot for the quick win? I think the tendency in most environments is that as people get close to the end of their performance evaluation period, they start looking for how to get more ticks on the chart, so the quick win, the low-hanging fruit, becomes important to them. Or you might take on the most onerous of tasks, the swallow-the-frog concept, where you pick up the most difficult issue and work at it. Or you might have a new use case, which is greenfield. Or you might pick whatever is most problematic, meaning the one creating the most downstream maintenance; that would be driven more by your department than by the business-critical path, because you might be trying to eliminate tech debt and thereby reduce your highest-maintenance issues. But regardless of which path you pick, or why, it's generally going to lend itself to one of three things: it's a solid big data problem and you can solve it completely with big data tools; it's an RDBMS, third normal form problem and you need to solve it with third normal form tools; or, in some cases, a hybrid approach can generate a stellar result where you use multiple tools. If your only tool is a hammer, everything looks like a nail, but if we want to put a box together with both nails and screws, we need multiple tools. That's all I have. Thanks, Nick.

Nick: Fantastic.