Data mesh is the big new paradigm in data architecture, like microservices for data. Let's break through the hype and see why it's catching on, what's involved in building a mesh implementation, and what tech is available. We review the range of mesh use cases, how cloud providers and commercial data platforms can help, and which specific types of tools are available for fulfilling data mesh capabilities.

 

Transcript:

Kaila: To talk about data mesh, we have Ryan Dawson, Principal Data Consultant at ThoughtWorks, joining us today. Ryan is a technologist passionate about data who works with clients on large-scale data and AI initiatives, helping organizations get more value from data. His work includes strategies to productionize machine learning, organizing the way data is captured and shared, and selecting the right data technologies and optimal team structures. We are excited to hear from Ryan. So with that being said, let's all give a warm welcome to Ryan as he joins us on stage.

Ryan: Thanks, Kaila. Thanks for that introduction. I'm super excited to be here too. And thanks, everyone, for joining this session, where we're going to be doing a deep dive on data mesh. Data mesh is the big new paradigm in data architecture. It's kind of like microservices for data. In this presentation, we're going to see what that means and why data mesh is catching on. We'll also look at what tech is available, and try to get a feel for what a data mesh can look like in practice, and some of the challenges involved in building a data mesh. This is a lot to cover in a short time, so let's get right into it. You've probably heard a bit about data mesh, as it's the next big thing in data architecture, and there's a lot of hype. I've counted at least two talks at this conference about it.

And as a consultant, I'm seeing a lot of organizations that are looking to adopt data mesh. But there's a lot of confusion about what the most important parts are, and how to make it work in practice. So let's understand the essence of data mesh, and what's involved in applying the ideas. First, let's talk about what problems data mesh solves. We can understand data mesh as a response to certain problems in current data architectures. Popular data architectures tend to be centralized and monolithic, with one team looking after lots of different types of data, often in just one place like a data lake or warehouse. The pain points we see with these architectures are generally connected with the central team becoming a bottleneck, or the central tech becoming a constraint. The team can't keep up with the details of all the data and doesn't have enough time to properly collaborate with the operational teams who maintain the systems the data comes from. So what you get is bad data and broken pipelines.

The central lake or warehouse becomes a place where data goes to die. Data mesh diagnoses the problem with monolithic data architectures as aligning data architecture around technology, not business needs. So instead of having a data warehouse team or a BigQuery team, in a data mesh we see teams aligned to business domains, not technology. This enables teams to really know and own the data, make decisions with more autonomy, and ultimately deliver more value. The big problem with monolithic architectures is that they don't scale well at the people level. Centralization either creates bottlenecks or creates too much distance from the data to really get full value from it. Data mesh offers a way to scale both the technical data ecosystem and the people that create that ecosystem. Maybe these themes sound familiar to you: centralized, monolithic technology and tech-aligned teams on one side, and a move to decentralization, domain alignment, and empowerment on the other.

Those are key themes of microservices, right? Well, there are certainly parallels. Just to make the microservices parallel super clear: we're talking about moving from centralized, monolithic architectures, like a data warehouse or data lake, to a distributed ecosystem. The distributed ecosystem can be polyglot, meaning that teams can choose whatever technology fits their use case. Think of how, in a microservices architecture, teams are able to choose the most suitable language and libraries. Likewise, data mesh lets teams choose the best type of data storage for their use case, rather than having to use the lake. And domain alignment means that you don't have a data warehouse team, just as with microservices you don't have a UI team or a backend team. Instead, teams are devoted to a specific business area, like customer onboarding, or stock ordering, or marketing.

Perhaps the most important thing that microservices and data mesh have in common is that they are both architectural styles with wide-ranging implications for the organization. Zhamak Dehghani, the creator of data mesh, describes it as a socio-technical approach. It's not just about technology or systems design; it's also about organizational structures, governance, and culture. Actually, it's often the softer sides of data mesh that prove more difficult for organizations to get to grips with. Likewise with microservices. Microservices are a useful comparison for data mesh, but there are also differences. All architectural decisions involve tradeoffs. Here's a great way to think of the core tradeoff of microservices, courtesy of my ThoughtWorks colleague Chris Ford: if we break things into small pieces, we can change them independently, so long as we don't mind the added integration complexity.

The "so long as we don't mind" bit is important. If you break things up, then there's a bit more cost to making the pieces work together. The data mesh tradeoff is a little different: if we devolve responsibility for data to the people who produce it, we can rapidly incorporate new data sources and use cases, so long as we don't mind distributing skills and investing in self-service infrastructure. Again, the "so long as we don't mind" bit is important. We're breaking things up again, but what we're breaking up now is mostly people. So you're going to have data engineers in lots of different teams rather than one big team. That means you need ways for them to connect with each other and be aligned with each other. You might also have a larger demand for data engineering, so you want to try to lower the bar for what skills are needed to be effective as a data engineer. That's the point of the self-service infrastructure.

Here, you're going to be spinning up lots of different data products, so you need to make that easy to do, and easy to do in a standardized way. So let's get into more detail now about what it means to implement a data mesh. Let's start with the data product. A data product is not simply a dashboard, or a report, or a dataset, though these things can be parts of data products. The data product isn't just a specific analytical output, but also everything that is necessary to make that output happen. It's the data storage and processing code, the transformation logic, the code that handles reading in the data, the documentation that people use to find out what the data product offers, and any metadata that the data product exposes so that it can be discovered on the mesh. A data product should be focused on a specific business function and should have a core use case.

And the data product should be implemented in a way that is consistent with the rest of the mesh. Here's another view of what a data product is and what it contains, one that shows how much variety there can be to data products. There are common elements shown here: the input layer, the output layer, the processing layer, and governance. But there can be a lot of variety within each layer. The inputs to the input layer could be an operational system, or another data product, or even an external data source like a public dataset. It could be that the data product calls an API to get its data, or it might read from a stream, or read from a file as a batch job. The processing layer can have a lot of variety too. The computation might be in Spark or Airflow, or in a SQL database, or a different kind of database. And there might even be some machine learning in there. The output layer can have a similar variety to the input layer.

There might be streaming or batch or an API, or there might be multiple outputs. Governance features in a data product are easy to overlook. The data product might expose metadata for discovery and for showing its listing in a catalog of data products. It might also expose quality and performance checks, as well as logs to indicate the data product's health. Here's a view showing the arrangement of data products, this one from the original data mesh article by Zhamak Dehghani on martinfowler.com. It shows how data products can be divided up and how they can relate to each other. Here we see three different domains for a music streaming company: the artists domain, podcasts, and users. In the top left, we have operational systems to onboard and pay artists.

These two systems together feed the artists data product, which can offer insights on which artists have been earning the most in different time periods. At the bottom we have the users domain, with a single operational system for user registration. The user profile data product is fed by that one operational system. It has two output ports: one for monthly summaries of user profiles, and one for feeding user profile updates, say as a live stream. So we see here that data product output ports can be live feeds at the record level, or they can aggregate the data on a schedule. The stream of user profile updates is consumed by a data product in the podcasts domain, up in the top right. The data product up here in the top right takes inputs from two operational systems: one in the podcasts domain and the other in the users domain. So we see here that data products can belong to a certain domain but still take data from a different domain.
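
To make that anatomy concrete, here's a minimal sketch of how the user profile data product from this example could be captured as a structured descriptor, with its input ports, output ports, processing, and governance metadata as fields. It's purely illustrative; the field names and port details are hypothetical rather than part of any standard or platform.

```python
from dataclasses import dataclass, field

@dataclass
class Port:
    name: str    # e.g. "profile-updates"
    kind: str    # "stream", "batch", or "api"
    source: str  # where the data comes from or goes to

@dataclass
class DataProduct:
    name: str
    domain: str
    input_ports: list[Port] = field(default_factory=list)
    output_ports: list[Port] = field(default_factory=list)
    processing: str = ""    # the transformation logic, e.g. a Spark job
    owner: str = ""         # governance: the accountable team
    docs_url: str = ""      # documentation for consumers
    quality_checks: list[str] = field(default_factory=list)

# The user profile data product from the music streaming example above.
user_profiles = DataProduct(
    name="user-profiles",
    domain="users",
    input_ports=[Port("registrations", "stream", "user-registration-system")],
    output_ports=[
        Port("monthly-summary", "batch", "s3://mesh/users/profiles/monthly/"),
        Port("profile-updates", "stream", "kafka://users.profile-updates"),
    ],
    processing="spark://jobs/build-user-profiles",
    owner="users-team@example.com",
    docs_url="https://wiki.example.com/data/user-profiles",
    quality_checks=["no_null_user_ids", "freshness_under_24h"],
)
```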

We're now going to move towards what capabilities you need for a mesh and for the data products on the mesh. But to get there, we first need to recap the four principles of data mesh, as the capabilities that we're going to discuss flow from those principles. The first is domain ownership. We've talked about this quite a bit already, and we've seen it in the previous example for the music streaming company. This one is mostly about how you organize your teams and how you identify and carve up your data products. It's a super important principle. But actually, for our purposes right now, it's the other three principles that have more implications, in particular for the capabilities that we build into the data products. So let's move on to the next principle. Data as a product works hand in hand with the domain ownership principle: teams need to be given autonomy and allowed to get close enough to the data to really own it.

This enables the team to handle the data product as a true product, and not just as data filed away and forgotten. The autonomy side of this means you need a variety of tech to be available for implementing data products. You can't just impose the same database type on every data product, for example. And treating data products as products means teams are empowered to delight their consumers. So data products will have capabilities to expose data in various ways, report on their quality, and expose high-quality documentation that explains the meaning and origin of the data as well as its format. The self-service platform principle is about empowering data product teams, and also enabling some standardization. This provides a foundation for the mesh implementation, like a soil that can help all the data products to grow. The platform should provide ways to bootstrap new data products. This could mean templates or starter kits, or, for an advanced mesh, there might be a dev portal with a wizard that provisions resources using templating.

We've mentioned in other principles that data products should be easy to find, and that there should be standards and policies holding them together. The federated governance principle makes this explicit. Part of this principle is also about people and culture. Federated governance is best coordinated using a governance board with representatives from a range of data products and other interests. So it's quite different from the centralized, top-down model of governance that many of us are familiar with. Federated governance also has implications for the capabilities in play in data products and in the mesh. You want all data products to be findable in the same way and from the same search facility. You want them to use a common format for reporting their quality, their monitoring, and any audit trails.
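
As a rough illustration of what a mesh-wide common format could look like, here's a tiny sketch of a standardized quality report that every data product might emit. The shape and field names are invented for illustration, not taken from any real governance standard.

```python
from datetime import datetime, timezone

def quality_report(product: str, checks: dict[str, bool]) -> dict:
    """Build a quality report in a hypothetical mesh-wide standard shape.

    One shared shape means governance tooling can aggregate reports from
    every data product without per-product adapters.
    """
    return {
        "product": product,
        "reported_at": datetime.now(timezone.utc).isoformat(),
        "checks": [{"name": n, "passed": ok} for n, ok in checks.items()],
        "passed_all": all(checks.values()),
    }

print(quality_report("user-profiles", {
    "no_null_user_ids": True,
    "freshness_under_24h": True,
    "row_count_within_bounds": False,
}))
```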

So now, this slide brings together the capabilities that we've just talked about and breaks them down by principle. At the platform level, there's provisioning of resources using templates or wizards in a dev portal. For good governance, we want a catalog that can make all the data products visible to potential consumers. We also want standards and policies and monitoring tools. For data product teams to have autonomy, we want a variety of data storage tools and technologies, and formats for input and output ports. There may also sometimes be specific integrations that apply to data product interactions, or which cut across multiple data products, such as federated query or data virtualization. Now let's think about how these capabilities fit into the user experience of the mesh. There are different user experiences for the different personas interacting with and working on the mesh. It's helpful to think of the mesh as having different planes aimed at different user personas. For data product developers, there's the data product experience plane. This is a kind of platform where data product developers can find other data products to connect to, and tools for bootstrapping new data products.

This experience plane has to be built out of some underlying tools, and somebody has to build it. The somebody that builds it is the data platform developer. They build the data product experience plane using tools from the platform infrastructure utility plane. If you don't have a data product experience plane, or it doesn't have everything that data product developers need, then there's an alternative: data product developers can go straight to the platform infrastructure utility plane, shortcutting the smoother experience. So let's get a bit clearer on what a data product experience plane can be like. Imagine you're a developer building a data product. The platform would give you a way to say you want a graph database for storage, a Spark instance for ETL, and also permissions to load specific files from S3 as an input from another data product.

Then the platform would bootstrap the resources for you and apply further templating to give you a skeleton of endpoints for discovery of the data product and for exposing metadata. Or maybe you choose a different set of options, like a MongoDB instance instead of a graph database, or different kinds of output ports. Well, if we're lucky data product developers, then maybe we'll get a UI with a wizard to help us out with this. There are developer experience portals for microservices which are a bit like this. Actually, a lot like this. Spotify has one, and they've open-sourced the foundation of it as Backstage. Actually, Backstage isn't the portal so much as a starting point for shaping your own in-house portal. Developer portals like the ones you can build with Backstage enable developers to use templates to bootstrap new microservices and pick from a curated set of technologies.
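
Just to sketch what that kind of templated bootstrapping could look like from the developer's point of view, here's a hypothetical example. The bootstrap_data_product function and the resource types are made up for illustration; a real platform would drive infrastructure-as-code tooling or cloud APIs underneath.

```python
# Hypothetical self-service bootstrap call: the developer declares what the
# data product needs, and the platform provisions resources from templates.
def bootstrap_data_product(spec: dict) -> None:
    # A real platform would drive Terraform or cloud APIs here; printing the
    # plan is enough to show the shape of the interaction.
    for resource in spec["resources"]:
        print(f"provisioning {resource} for {spec['name']}")
    for path in spec.get("inputs", []):
        print(f"granting {spec['name']} read access to {path}")
    print(f"templating discovery/metadata endpoints for {spec['name']}")

bootstrap_data_product({
    "name": "artist-earnings",
    "domain": "artists",
    "resources": ["graph-database", "spark-cluster"],  # the team's own choices
    "inputs": ["s3://mesh/users/profiles/monthly/"],   # from another data product
})
```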

Unfortunately, there's no off-the-shelf equivalent available for data mesh right now. So to build a data mesh experience plane with a smooth UI experience like this, you have to build a lot of it yourself. So we've talked about the two planes aimed at data product developers and data platform developers. There's one more plane left: the top-level plane is an experience plane for consumers of the mesh, meaning data analysts, data scientists, and business analysts. This is a very different audience with a different set of needs. They want to find data products, understand the data, understand the format of the data, request permissions on data products, and ultimately use data products to answer questions and solve business problems. You might now be wondering whether you need all these capabilities. The answer is, basically, it depends. Building all these capabilities can be a lot of work, and not every mesh needs every capability.

But there's no definitive "if you don't do this, then you're not a mesh." So there's a lot of flexibility in what different projects might do. On the other hand, if you don't have any of these capabilities, then it really isn't a mesh. So how do we tackle this work? Well, there are at least some off-the-shelf tools and platforms that can ease some of the burden. Unfortunately, it's not exactly easy to pick which tools. There are a lot of them, and it takes time and expertise to understand what they all do and how they relate to each other. The logos in this slide are just a selection of platforms and tools. If you look at the ____ landscape that's quite famous on the internet, there are just way, way more tools. I've attempted to put some order on the chaos and have set up some tables in a GitHub repository. The link to the repository is at the bottom of this slide, and I'll share it at the end.

The table shown here looks at features of commercial data platforms, but the feature categories roughly correspond to the categories of standalone tools. We don't have time to go through all of the details right now. What we will do is go through some of the key categories and understand how tools in these categories can be useful for data mesh implementations. Let's start with a look at the data catalog, or data discovery, category. Basically, a data catalog is an index of what data the organization holds, where all of it is, and metadata about the data. This makes the data findable and usable. The catalog may have discovery or crawler services for finding data.

But typically, data can also be explicitly registered. Having data products register themselves fits better with the mesh philosophy than using crawlers, as this way the ownership of the catalog data is on the data product. Once data is registered, then users can search for it. If a user of the catalog finds data that they want to access, then they need to go through some permission request flow. Setting permissions for datasets, and even tracking the lineage of datasets, can be a data catalog feature. Or sometimes platforms have permissions and lineage as a distinct concern, separate from the catalog. Either way, these are capabilities you're going to be looking for, as allowing users to discover and request access to data is a key feature of the mesh experience plane, and a key capability of the mesh under the federated governance principle.
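
Going back to self-registration for a moment, here's a minimal sketch of what it might look like for a data product to push its own metadata to a catalog over HTTP. The endpoint and payload shape are entirely hypothetical; real catalogs such as DataHub have their own ingestion APIs and SDKs.

```python
import json
import urllib.request

# Hypothetical catalog endpoint; a real catalog (DataHub, for example) has
# its own ingestion API and SDK.
CATALOG_URL = "https://catalog.example.com/api/data-products"

def register_data_product(descriptor: dict) -> None:
    """POST the product's own metadata so consumers can find it in the catalog."""
    req = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(descriptor).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        print(f"catalog responded with HTTP {resp.status}")

register_data_product({
    "name": "user-profiles",
    "domain": "users",
    "owner": "users-team@example.com",
    "output_ports": ["monthly-summary", "profile-updates"],
    "documentation": "https://wiki.example.com/data/user-profiles",
})
```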

A data platform naturally needs to provide data storage, and a way to query data in the storage. Most platforms provide a lake or warehouse option, and some provide both. On most commercial platforms there are also hybrid lake-and-warehouse types of storage available. Pretty much all the platforms in the GitHub repo that I mentioned provide ways to store, transform, and query both structured and unstructured data. Warehouse and lake storage are both likely to be needed by data products, and polyglot storage is a key capability flowing from the data as a product principle. It's worth noting that most commercial platforms provide options for this, but not necessarily in a data mesh way that lets you easily define particular transformations as belonging to a particular dataset. The polyglot storage aspect of data mesh also means that document, graph, and time series databases might find their way into a data mesh, even though these are less common in analytical data stacks.

Typically, data platforms have some sort of NoSQL option, but how many options and how mature they are can vary quite a lot. The cloud providers have them because the cloud providers are also catering to operational teams, who want to use these kinds of databases more often. Data platforms specializing in analytics are less likely to have these. This can be an area where picking a data platform to support your mesh might not be quite enough, and you have to bring in some custom tools for particular cases. We've seen that the data product input and output ports may involve streaming of data. Actually, this is a very common way to do the input and output ports, as it allows for an event-driven architecture. So it's important for data product developers to have this available as an option, ideally provided by the platform. If your mesh platform is using an underlying commercial data platform, then you'll want at least one streaming technology option to be provided there. The platforms that I've included in the GitHub repo do have a streaming option, though they can vary quite a lot in the details of how they work.
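
For a concrete feel of a streaming output port, here's a minimal sketch that publishes record-level user profile updates to a Kafka topic using the kafka-python library. The broker address, topic name, and event shape are placeholders chosen for illustration, not from the talk.

```python
import json
from kafka import KafkaProducer  # pip install kafka-python

# Placeholder broker and topic; on a mesh, the platform would provision these.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_profile_update(update: dict) -> None:
    """Emit one record-level event on the product's streaming output port."""
    producer.send("users.profile-updates", value=update)

publish_profile_update({"user_id": 42, "field": "country", "new_value": "DE"})
producer.flush()  # make sure the event is delivered before exiting
```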

The reporting and BI tools are easy to overlook, as I think many people in the data space have a backend bias. But building reports and dashboards is super important for data products, and super important for getting value from data. This is a key capability flowing from the data as a product principle, namely that we should delight consumers. So those are some of the most relevant categories of data platform features and tools for data mesh implementations. There are more categories I'd love to discuss, but we don't quite have enough time for that. What we've seen is some of the most relevant and most useful to cover here. But we have to be cautious, because what data platforms provide right now is not the same as the data mesh platform plane that will empower data product developers. There's nothing like the wizard or templates to build a data product that we talked about as part of the data product experience plane, though some platforms are starting to provide some templates. The fundamental gap is that these platforms have been designed around an idea of providing specific services, not data products.

So to build a data mesh platform on top of a commercial data platform takes some work and can be awkward in places. There is an alternative to starting from a commercial data platform: you can instead start by selecting a range of individual tools, maybe a mix of open source and specialist commercial tools. These might then be curated and assembled into a platform layer. Naturally, it's a lot of work to get the platform experience that delights developers that we talked about earlier. The integration work can be tricky, as the tools probably weren't originally designed to work together in a mesh, much like the services in commercial data platforms weren't originally designed to be used in a mesh. Just to give a small taste of what the integration work for a data mesh platform can involve, I'll highlight that I work for a consulting firm, ThoughtWorks, which partnered with Saxo Bank to build a self-service data infrastructure. DataHub was chosen as the data catalog, but at the time there were limitations on being able to show quality metrics on datasets in DataHub. So ThoughtWorks and Saxo contributed to an open-source plugin that integrated a data quality tool, namely Great Expectations, to make it work with DataHub.

So now, to start wrapping up: we've seen that there can be a lot of work involved in setting up a fully featured platform plane to support data product developers. In some situations, putting in that effort upfront can be overkill. Maybe you won't have that many data products; maybe they'll all be pretty similar. It can also be challenging to justify a lot of investment in the platform upfront, when you want to get early data products live to demonstrate value. In these cases, it can be okay to be pragmatic and provide a lesser experience to developers and users. Remember the data mesh promise: the investment required for self-service infrastructure is part of the tradeoff. If we make the investment, then we can reliably devolve responsibility for data to a distributed set of teams who are close to the details. The more teams and the more data products, the more we need to invest in self-service infrastructure. If we don't make that investment, then teams may struggle to get going, or they may build things that diverge more than we want them to.

That's the tradeoff. So before deciding to invest lots in self-service infrastructure, it makes sense to assess where you're going to get value from your data mesh initiative. At ThoughtWorks, we recommend undertaking an exercise to map out the key areas where business value can be unlocked and identifying the highest-value data products. Then you can focus your initial efforts to ensure that your initiative can get some big wins early on. In this presentation, I've mostly focused on the technical side of data mesh: platform capabilities and the tech to implement them. But it's important not to let tech questions become a distraction from the biggest questions of your data mesh implementation, which don't tend to be on the technical side. There are lots of organizational pain points, and data mesh requires organizational change. That isn't easy to achieve, and you'll have to break down barriers to do it. Teams will meet resistance and challenges, and at certain points they'll be asking where they're meant to be going. You want to be as clear as possible about why you're going data mesh, what pain points it's going to solve, and which data products you'll get the most value from. So thanks for joining me in this session. Here's that GitHub link again; I'll paste it in the chat in just a sec. And yeah, hopefully we can discuss a little bit now. And also feel free to get in touch with me afterwards by email or LinkedIn.

Kaila: Thank you so much for that amazing chat, Ryan. We have a couple of questions from our participants here. The first one is: you make a great point about the higher cost of making microservices work together. With data mesh, though, don't you need a comprehensive, high-quality data catalog?

Ryan: I certainly recommend having some way to bring all the data products together and make them visible, yeah. In the presentation I covered the relevance of data catalogs to that. So yes, there's certainly value in catalogs for data mesh.

Kaila: Gotcha. Thank you. Here's another question too. So data mesh and its principles are a significant shift, and I know you talked a little bit about this at the very end of your presentation. It's a significant shift and a lot of work to set up. But what kind of business case and justification supports this shift?

Ryan: Yeah, there's a diagram that I didn't include in my slides, which maybe I should have borrowed. Zhamak Dehghani, the creator, has this diagram that basically shows business value flattening off as the organization gets bigger, because the data team isn't able to scale. That's where the business case comes in. Basically, large organizations are plowing lots of money into data, and they're not getting value back in proportion to the increased investment.

Kaila: Gotcha. Well, thank you so much. I think it is time to head right into our lightning talks. So thank you so much for your time, Ryan. Thank you for joining us today at Zero Gravity, and have a great one.

Ryan: Thanks for having me.

 

Speaker:


Ryan Dawson

Principal Data Consultant

thoughtworks
