Data mesh is a fast-emerging approach to managing data in organizations where autonomy is critical. It advocates that companies organize around data domains, with each department or business unit owning its data and productizing high-quality, fresh data sets for sharing with data consumers across the company. However, to deliver data as a product, companies need infrastructure that helps them build, scale, and maintain data pipelines, and that can handle modern challenges such as nested data, stateful transformations on streaming data, schema evolution, pipeline orchestration, incremental ETL, and more. In this session, you'll learn how to productize your data to deliver high-quality data sets in minutes by automating the manual tasks of pipeline development, maintenance, scaling, and orchestration.


Transcript:

Joe: Next we have Roy Hasson presenting "From Data Mess to Data Mesh: Building Data as a Product." Just a little bit about Roy. Roy Hasson is the head of product at Upsolver. He works with customers globally to simplify how they build, manage, and deploy data pipelines to deliver high-quality data as a product. Previously, Roy was a product manager for AWS Glue and AWS Lake Formation. Roy, I want to welcome you, we're excited to hear about your experience with the data mesh, the floor is yours.

Roy: Hey Joe, how are you? Cool, thanks a lot. That was actually a good intro, it maybe shortens one of my slides. But anyway, I just want to say thank you, I think this is a really great opportunity to kind of teach and to learn about some of these technologies. So in my session today, I really want to focus on the idea of building data as a product as part of the overall design patterns or principles of data mesh. I'll kind of go through some of those things and you'll see that. So Joe did a pretty good job with my introduction. At a high level, again, I'm the head of product at Upsolver. I came from AWS, was there for a while, worked on many of the analytics services there. So I spoke to many, many different customers, helped them build their data platforms with a data lake, started working on data mesh, a lot of different design patterns and best practices. So I want to be able to kind of share some of that with you today.

And see how we can advance the technology and advance the way that we use data in a more efficient and self-service way. And you'll kind of see what that means. So starting at the top, I wanted to start by talking about some of the challenges that we see. And we hear this a lot from our customers, we hear this from users in the industry, on LinkedIn, on mailing lists, etc. So you're gonna hear both from the business side of the house and the engineering side of the house. On the business side, you hear things like, you know, long lead times for new data sets, I'm getting stopped, I'm getting hampered or bottlenecked by some central IT/data team, how do I become more self-service so I can act on my data by myself, and also find ways to easily access my data.

That's one of the things that we hear at a high level from businesses. On the engineering side, you hear similar conversations: there are just too many tools and technologies for me to choose from, it takes me a long time to test and to evaluate performance and security, so how do I get to picking a tool very quickly and be able to just get going? And then the integration, right? There are many different tools, and they don't all integrate the same way. And they don't all integrate fully. So that's a challenge for engineers. But if you listen to these questions and you kind of boil them down, at the end of the day, it's really about the difficulty of delivering insights quickly to the business, right? That's kind of what we want to do at the end of the day. So how do we solve these problems? How do we think about them?

So there are different ways that the industry, and us as vendors and technology providers, are thinking about it. First, we're kind of looking at how do we address that challenge with new design patterns. So there was the data lake, right? A data lake, you can think about it as a big hardware store. You come in, there's a lot of stuff for you to choose from, you have lots of flexibility, you have lots of control, you can pick which tools you want to use, but you've got to build everything. At the end of the day, you may end up with a beautiful house that has all the features and capabilities that you want. Or you may end up with a shack that's kind of falling apart and not quite working. And that's where we see some of the challenges. So again, it's an open platform and a way for you to get started and do it on your own, but there are some challenges if you don't know how to do everything.

Then we have this kind of emerging idea of the data cloud, right? You hear it as the lakehouse platform, or, you know, the data lake platform, but it's kind of like a one-stop shop for your data needs. You can think about it as a prefabricated house that comes in different types of configurations, different sizes, different square footage, maybe different colors, different features. But for the most part, it's contained, right? You can choose from a number of things, but you don't have a lot of choices. So it makes it a bit easier to pick something and go build it, so you can get started. It works quite well. But there are some things that you have to think about, right? Is my data truly open? Can I access it from anywhere, or can I only access it from within this cloud? So there are some things you've got to think about, but for the most part, it makes it easy for you to get started and use the system.

And then on top of that you have this new, I would say emerging, design pattern called data mesh, and it's really about organizing your company or your organization around your data. You can think about it as kind of like the planning board in your town, saying, hey, you want to build a house? You want to build a house over here? Feel free to build it however you want: you can buy a prefabricated house, you can go to Home Depot or your hardware store and buy all the parts and build it on your own. You can do whatever you want. But there are certain controls and certain best practices that we want you to follow, so that the rest of the houses kind of look the same, you know, have the same capabilities; everybody's got to have power, everybody's got to have septic, right? So there are certain things that you've got to do there. So data mesh helps put structure around that, but it doesn't say only one company must build the houses for everybody. So maybe I kind of went a little too far with that analogy.

But again, with a data lake, you can kind of do whatever you want; with a data cloud, you have flexibility, but it's a bit more contained, a little bit more of a walled garden; whereas a data mesh is just a way for you to organize those things and give each of your business units a bit more autonomy and control over how they solve their data problem. So again, these are some design patterns that the industry has come up with that try to alleviate the challenges of getting data and getting insights to the business quicker. Another way that we're seeing these challenges being addressed is through new tools and technology. And I know it's a little bit of an eye chart, but this is a diagram from Andreessen Horowitz, their unified data infrastructure (2.0). And all you really need to take out of this is that for each component in the pipeline, whether it's storage, query, processing, transformation, or analysis, there's a slew of different tools available in the market today. And there are probably a lot more that aren't listed here that are in development, that are coming up, you know, that we're not talking about.

But all this says is that there are a lot of different tools trying to solve these problems. And it becomes really hard for your team, whether these are data engineers or even business people who are trying to stand up some data platform so they can run some queries; they've got to think about it, they've got to make decisions about it, and it's not that easy. So it's not just a matter of introducing new tools; now we have a lot of tools. And one of the things that we've noticed in the market as well is that all these different tools are trying to solve the same problem, but are coming at it from different angles. And what that means is that you end up with different tools that sound different and look different, but in reality, from a functionality perspective, they're very similar. So how do I decide between one tool and the other when they're both kind of telling me they're solving the problem differently, but feature- and functionality-wise they look the same, or very similar? So it's a challenge, and it slows organizations down because they need to spend a lot of time thinking about it, and testing it, and evaluating it.

So another way that the industry has tried to solve some of these challenges, especially the time to insight, is through repurposing existing solutions. So when we look at the broad market, we say, okay, some companies have a lot of data engineers; they know how to operate a data engineering organization. And for that, we should have some solutions that are tailored towards the data engineer persona. Commonly we see this with ETL, right, extract, transform, and load. And these are represented as engines that are typically backed by Apache Spark or some other data processing engine. And they're focused around the decoupling of components, giving you the flexibility to plug and play based on what your exact needs are. And this is tailored, again, towards data engineers, people who understand data, data platforms, and distributed data solutions.

So it's more tailored to them; it requires more heavy lifting, but gives you a lot of flexibility. On the other side of that, we have tools that are focused towards the database admins, right, tools that are basically designed to help those users do more with the business. So for that, we've taken the data warehouses that exist today and actually provided more capability. So we have faster data warehouses, data warehouses with separation of compute and storage, warehouses that are more autonomous or automated. So those are kind of the things that give the existing database admins the ability to do more with the platforms that they're already familiar with. So again, we extract the data from the sources, load it into the central place, and then we can use the same technology, same tools, same experiences and skill sets that we've had in the past to transform the data and deliver value to our business.

And then on the far end of that, if I'm a company that doesn't have DBAs, a company that doesn't have data engineers and doesn't want to hire them, but does have data analysts, then there are solutions that say, well, maybe I'm going to leave my data where it is, right, and then I'm going to query that data in place. And by doing that, I'm giving control to the analysts to say: you don't need anybody to help you get this data, it's already inside of our operational databases, just go and query it where it is, and join multiple data sets across these data sources. So again, it's a solution that works. It's tailored towards one specific type of persona, trying, again, to alleviate the bottleneck of data engineering and having other people help you design your data models and things like that. But it introduces its own challenges. So these are kind of the repurposed solutions that are trying to solve the problem in slightly different ways, targeting different personas.

So with all of that, right, we have a number of different ways to solve the problem - design patterns, tools, solutions - how do we put it all together? One approach that I'm kind of recommending, and it's not the only one, but it's one that I've evaluated with customers and gotten some good feedback on, is bringing all this stuff together and saying, let's break it down into three layers. The first layer is the layer of data as a product. And what that does is basically enable me to build consistently reliable data sets that can be delivered to the business quickly, by ensuring that they're accurate and fresh and always available, so that the business can drive value out of them as quickly as possible. So data as a product, and I'll talk in a second about what that means, really focuses on the reliability and the accuracy of the data. And the ability to add new data sets through this data as a product solution reduces the dependence on other systems or other users to help you get more data into the business so you can do more.

Now that we've built this kind of idea of data as a product, how do we offer it to the business? This is where data mesh comes in. So each line of business, each team within the business, may have different requirements and different needs. We don't want to say, well, okay, data as a product is great, so we're going to bring it into a central IT team and they're going to be in control of it; we'd be going back to the same problem of autonomy. And I want to be able to have more self-service, more control over my data, how quickly I get the data moving, or how I change my data. So data mesh gives us the design pattern that says, you know, if I'm, let's say, one of the microservices, and I have a bunch of data that I want to share with my marketing team or my ad team, I want to be able to create my own data as a product through my own tooling, but then share it with the rest of the organization using some of the best practices defined through the data mesh pattern. And then underneath all of that, you still layer the data lake, right, that open source of truth or landing zone for all of your raw and even your semi-raw, transformed data for the organization.

So we're not duplicating data, we're not recreating it; we have a single place where we can store the data, so anybody can access it, and actually also discover it. Now each microservice, each kind of data mesh domain, may have its own landing zone. And that's okay, right? But the data that we want to expose to the organization should be accessible through this data lake. So what does that look like? I tried to do my best to have some colors and some diagramming of my idea of how this looks. So you can see these dotted boxes; those are our data mesh data domains, right? So we have a data domain A, we have a data domain B, and we actually have a data domain IT. This is the data domain that holds our corporate-wide data lake, maybe the data sets that are not owned by any one team per se. These might be data sets coming from our partners, or maybe there's some corporate data that comes in that no one team is responsible for, which is probably not ideal, but could happen. And then you bring this into the central data lake.

And it's just a data lake, it's just storage, right? We're not putting any kind of transformation layer on it per se; it's primarily storage. Now, each of our domains, in this case A and B, has the flexibility to build its data platform however it likes. Now, one of the things that's important with this data as a product concept is that you want to build these data pipelines, right, these transformations that take raw data from its source systems, transform it, and expose it or publish it to the rest of the organization through a set of approved or preferred interfaces. So if you don't have any applications that consume this data, you only have users, you may choose to publish the data to the data lake and expose it that way. You may choose to publish a dataset into a data warehouse and expose it through the data warehouse interface to your users. Or if you do have applications that need access to the data, you may expose it through a set of APIs, right? These could be APIs that you built as part of your microservice.

These could be APIs that your data as a product tool is offering, right? So this gives you the flexibility to say, if any user wants to come in and say, hey, we've got new data, we want to publish it as quickly as possible, I go to my tool, do the data processing, and publish it to the rest of the organization in a reliable and consistent way. Now, what is data as a product? There are a number of ways to answer this question. One way that I chose to answer it, and feel free to add your thoughts on what this means, is that it's a way to deliver reliable, accurate, fresh, and also measurable data to the business based on the business needs. Just because I have a data set does not mean the business wants it, does not mean the business needs it; maybe I need it for my own purposes.

But that's okay, it can stay in my landing zone, I don't need to publish it. But if it's a set of data that the organization needs, I need to be able to deliver it. And it's not enough to say here's a table, go use it, and then not have any SLAs behind it, not have any rigor and control behind the data. If, for example, something happens and the data is no longer being updated, because, you know, the Kafka stream stopped working for some reason and somebody forgets about it and nobody updates it, now we get dashboards and reports being built on top of stale, old data. So there has to be rigor, there have to be SLAs; the quality and accuracy of the data is really, really important, which is why it needs to be measurable. So that's kind of what the idea of data as a product is. But again, it's a concept.
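To make the "measurable" part of that idea concrete, here is a minimal, hypothetical sketch of the kind of freshness and completeness checks a data product owner might run against a published table. The table and column names (orders_published, event_time, order_id) and the two-hour SLA are invented for illustration, and the exact SQL varies by engine; this is only one way to show what putting a measurable SLA behind a data set could look like, rather than just handing over a table.

```sql
-- Hypothetical freshness check: flag the data product as stale if no new
-- events have landed within the agreed SLA window (two hours in this sketch).
SELECT
  MAX(event_time) AS latest_event,
  CASE
    WHEN MAX(event_time) < CURRENT_TIMESTAMP - INTERVAL '2' HOUR
      THEN 'STALE - SLA BREACHED'
    ELSE 'FRESH'
  END AS freshness_status
FROM orders_published;

-- Hypothetical completeness check: measure how complete a key column is,
-- so quality becomes a tracked metric rather than a one-time eyeball check.
SELECT
  COUNT(*)                                    AS total_rows,
  COUNT(*) - COUNT(order_id)                  AS missing_order_ids,
  1.0 * COUNT(order_id) / NULLIF(COUNT(*), 0) AS order_id_completeness
FROM orders_published;
```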

So bringing it back to how Upsolver solves this. At Upsolver, we are a service that allows you to build data pipelines using SQL. And here, in these couple of screenshots that I'm showing you, we take that idea of a data pipeline and break it up into two parts. The first part is how do you connect to a data source, so you can understand that source and do some observability: hey, how many events per second am I getting? Do I have a drop in events, do I have a spike in events? I also want to understand the schema, right? I want to auto-detect the schema, and then I want to see it, I want to look at it. If it changed, you know, something happened, somebody added an event, the schema has now changed, I want to be aware of that before I'm pushing those changes downstream and potentially impacting my downstream systems.

So again, be able to connect to a source, Kafka, Kinesis, you know, S3, whatever that may be, observe that data, bring the data in and learn its schema, and then move to the next step. And the next step is, now that I know what my source data looks like, I can go build my transformation. And there's no need to, you know, write Python code or Scala code or set up my Spark environment. I simply provide the SQL, the business logic transformation that I want, defined in SQL inside of Upsolver. And that's where the primary focus is. There's no orchestration here, there's no scheduling here, there's no, you know, hey, I've got to plug this one into that one; I don't have to deal with external orchestration systems like Airflow or things like that. I simply put in my business logic transformation and then I run this pipeline and it just runs, right? And that's extremely powerful from a self-service perspective: any user who understands SQL can come into Upsolver, put in their business logic transformation, and say, I want to take the data from Kinesis.

And I want to push it into Redshift. And with one simple pipeline, they can do it; there's no need to talk to data engineers or anybody like that. There's nothing to schedule outside of the system. It just runs. If anything happens to the pipeline, you have the ability to look at it, right? Hey, did I have a spike in events that I didn't account for, or a drop in events that I didn't account for? Is my data getting delayed, right, is it taking me longer and longer to process it for whatever reason, which results in the output being delayed as well? Do I have some data quality issues in the system? Because I can see data quality metrics on my data as it's coming in, right? So this is extremely powerful. I'm not sending the user to different tools to look at the observability, quality, or state of the data. It's all in one place. So that makes that data as a product feel like I actually have reliable and responsible data that I can produce.
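As an illustration of that declarative, SQL-only pipeline idea, here is a simplified, hypothetical sketch of a Kinesis-to-Redshift pipeline. This is not actual Upsolver syntax; the source and target names (kinesis_raw_events, redshift_marketing.daily_campaign_stats) and columns are invented. The point is only that the user supplies the business-logic SQL while the platform handles schema detection, orchestration, and incremental processing.

```sql
-- Hypothetical sketch, not actual Upsolver syntax.
-- 1. A source is declared once; events are continuously ingested from the
--    stream and the schema is auto-detected into a staging table, e.g.:
--    CREATE SOURCE kinesis_raw_events FROM KINESIS STREAM 'clickstream';

-- 2. The business-logic transformation is plain SQL over the staged events;
--    the platform runs it incrementally and keeps the target table up to date.
INSERT INTO redshift_marketing.daily_campaign_stats
SELECT
  campaign_id,
  CAST(event_time AS DATE)                              AS event_date,
  COUNT(*)                                              AS impressions,
  COUNT(DISTINCT user_id)                               AS unique_users,
  SUM(CASE WHEN event_type = 'click' THEN 1 ELSE 0 END) AS clicks
FROM kinesis_raw_events
GROUP BY campaign_id, CAST(event_time AS DATE);
```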

Now I can write that data into Redshift and offer it to my users that way. I can write it to a data lake and offer it through Amazon Athena or whatever tool you want. Or I can actually, with Upsolver, expose it as a materialized view, which gives you real-time, almost millisecond-level access to the data via APIs, right? So it's a serving layer built into Upsolver that you can plug your applications into directly, so you don't need an extra system to get access to the data. So again, it's about an easy, fast, and reliable way of delivering data to the business within the context of data mesh, or an open data lake, because all the data is stored in open formats and the metadata is exposed to the Glue Data Catalog or a Hive metastore. So it's open and accessible.
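The serving-layer point can be sketched the same way: a continuously maintained materialized view keyed by some identifier, which an application reads through an API instead of querying the warehouse or the lake directly. Again, this is a generic, hypothetical illustration (user_activity_summary and the lookup endpoint are invented), not Upsolver's actual syntax or API.

```sql
-- Hypothetical sketch: a materialized view maintained as events arrive,
-- so an application can fetch the latest aggregate for a key with low latency.
CREATE MATERIALIZED VIEW user_activity_summary AS
SELECT
  user_id,
  COUNT(*)        AS event_count,
  MAX(event_time) AS last_seen
FROM kinesis_raw_events
GROUP BY user_id;

-- An application-facing API could then serve point lookups such as:
--   GET /views/user_activity_summary?user_id=12345
```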

Just a quick example: ironSource is a customer of ours. The way they're kind of looking at it is that they had a ton of data, well, they still have a ton of data, and it grows and grows every day. And they explored doing this via Spark, and it took them a long time and a lot of data engineers to kind of put this together. And they were able to bring the amount of work that would have taken a whole year down to about a month, which is an extremely powerful testament that if you want to deliver value to your business quickly, this is an approach to do that. They don't have to know how to operate Spark, they don't have to know how to troubleshoot it, right? As you scale with the volume of data, using Spark becomes harder and harder, and you need to know more about it. And then the other piece that I think is really important here is that it's not just data engineers; there are over 100 data domain users, data analysts, data scientists, data engineers, different folks that are accessing the tool and building different data sets for themselves in a much more self-service way than what you'd have otherwise if they went through the central IT team.

Alright, quick summary, and then I'm gonna jump into questions here. So again, starting at the top with data mesh: data mesh gives you that organizational structure of how to organize your company around these data domains, and who should have control over what, who should have ownership and accountability over the data. The data lake is that open, scalable, ubiquitous source of truth for your data. So it's not locked into one platform, but is available for you to access whichever way you want. And then the data as a product platform, in this case I'm referring to Upsolver here, gives you the flexibility to have self-service through declarative SQL, so it's easy for most users to use the data; and the infrastructure is automatically managed and the data is optimized, so you don't really have to worry about best practices and things like that, it's just done for you. And it's flexible in the way that you serve the data.

So you can actually offer the data to your users in whichever way they need it. So that's kind of the overall gist of what I'm trying to get to: how do we simplify, and how do we get closer to reducing the time to insight, which has been a challenge for us for a very long time? And this is one way to do it. Okay, so I just want to say thank you very quickly. If you have any questions for me, or you want to learn more about Upsolver, data mesh, data as a product, you know, go join our Upsolver community on Slack. I'll be there, and several other folks will also be there. We'd love to have you and continue this conversation. So I think there are a few questions here. Oh, Joe, nice to see you.

Joe: I've gathered a few of the questions, and I'll read a few of them off, if that sounds good. Okay, perfect. So one participant asked: one of the main challenges of data mesh I see is that data mesh complements a decentralized data architecture, which on one side would improve data consumption by enhancing domain autonomy, and on the other side becomes very difficult to govern the data. How do you ensure that the data is being documented, ownership established, and quality maintained, when the governance is not centralized but federated?

Roy: Yeah, I mean, it's a great question. And to be honest, I don't think we have a really good, solid answer on exactly how to solve it. There are different design patterns for how I've seen companies solve this. One design pattern is, you know, simply distributed governance, right? Each domain has accountability and responsibility for managing their data, both on a security level, you know, publishing those data sets according to certain standard guidelines of the company, and documenting and exposing them through a central catalog. So that's one pattern. Another pattern is basically, kind of like what I showed a little bit earlier: we have a central data lake that these data domains are reading and writing data from, and that's how we publish that information.

We can also expose governance controls on top of that. So there are different ways to solve this problem. Again, depending on your organizational structure, it may be different. Some companies I've worked with, you know, kind of grew through acquisitions and things like that, so having central governance is gonna be really, really hard. So you have to do distributed governance, where each entity is kind of responsible for their own thing, but you do have a corporate mandate that says these are the things that you must do, and then you have periodic audits. And I've seen, you know, governance boards, data privacy boards, that are the ones responsible for maintaining that high standard and working with the data owners, or the governance leaders in each of the data domains, to make sure that they implement those best practices.

Joe: Right, thank you. Another question from a participant. They said: that sounds awesome, writing business logic in SQL that gets deployed in Upsolver. How is the performance on larger-scale datasets with this approach?

Roy: So actually, the performance is great. In the example I showed you with ironSource, you know, we're talking about four, I think even more than that, 4 million events per second. That's massive, that's massive scale, to be able to take that, run those business logic transformations, and output it; we've proved that, we've shown this with many of our customers. So I would say the performance is really, really good with this. The thing I think you've also got to remember, that I maybe didn't do a good job of presenting, is that when you move the business logic transformation into Upsolver, into that middle step, right, you potentially cut out the data transformation step in your data warehouse.

Now, I'm actually generating my business metrics inside of the ETL step, right, inside of Upsolver, and I'm pushing those results into my warehouse, so they're queryable right away. So you cut out a piece of that extra step, right? Now my data is available to query right away, as opposed to saying, well, here's semi-raw data in my data warehouse, now I've got to run another step that transforms it, materializes it, and makes it queryable. So I would ultimately say that, end to end, the performance is actually better following the approach that I described with Upsolver than it is with some of the common patterns that we've seen.

Joe: Great. One of the points that you brought up in the conversation that I thought was kind of interesting is, from a data as a product perspective, wanting versus needing, right? Like, you don't need to put everything into your storefront for people to serve themselves from, right? What can you do to help differentiate what the organization wants or needs?

Roy: So I think the gist of that is, you've got to have a conversation with your business stakeholders. Just like building an application or building a product, you don't just build a product and hope somebody will buy it, right? You talk to your customers, you talk to your potential customers, you understand the market; you need to understand what you need to offer and how you should be offering it before you build it. So I think it's really important to talk to your business stakeholders, your consumers, internal or external, about what they need and how they want to consume it. And then you come back into your organization and say, okay, this is how I'm going to offer this data.

And we've seen customers do this. Actually, I did a webinar just a couple of days ago with Proofpoint, one of our customers that uses Upsolver to output two things: first, into the data lake for reporting, ML, and analysis, and second, to Elasticsearch, which they're actually using inside of their application to drive their dashboards and their log searching. So I think that's a really great example of, you know, I've got two consumers of the same data. They spoke to those consumers and understood what the best way to deliver that data to them is. But what they know is that it's the same raw data coming in, it's the same exact transformation, the data is reliable because they can see it, and then they're able to output the same thing into two places. Right? So it's not like they're monitoring or dealing with two separate systems. It's one system, two different outputs. Great. Hopefully that answers your question.

Joe: It does. And the one additional thing I'll add on top of that, as a curiosity, is adding context to the self-service, right? We've built our storefront of data, but if someone stumbles into our domain, looking at it for the first time, what can we do to set ourselves up to make sure that people really understand what they're ingesting for their analytics?

Roy: Yeah, so I think that's a very important step. I didn't really talk about it here, but there are, you know, patterns for that as well, right? As a data as a product or a data domain, however you are exposing the data, that data needs to be discoverable, right, and understandable. In the approach that I described, because Upsolver is that self-service mechanism for you to be able to produce data sets, you can also go into Upsolver and say, okay, this particular job is generating these data sets; let me understand what it is, let me look at the lineage, let me look at the schemas, etc. And then, you know, you're able to publish this stuff through data catalogs and things like that, which will give you more information. And there are plenty of tools around that as well.

Joe: Great. Well, thank you, Roy. I really appreciate the discussion and you leading us through this new world of data mesh, and kind of bringing that thought leadership into the Zero Gravity conference.


Speaker:


Roy Hasson

Head of Product

Upsolver

Interested in partnering with us for next year's event? E-mail sponsors@incorta.com.