Today, metadata is everywhere. Every component of the modern data stack, and every user interaction with it, generates metadata. Apart from traditional forms like technical metadata (e.g. schemas) and business metadata (e.g. taxonomy, glossary), our data systems now create entirely new forms of metadata.

All these new forms of metadata are being created by living data systems, sometimes in real time. This has led to an explosion in the size and scale of metadata, which has made it difficult for centralized data engineering and reporting teams to fully leverage the metadata that holds the key to the elusive promised land: a single source of truth.

Teams today struggle to keep pace with “service requests” and to ramp up on the domain expertise needed. The net result is lost opportunity. But by effectively collecting metadata, a team can unify context across all of its tools, processes, and data.

This session covers how data teams can leverage metadata to power better collaboration, enabling organizations to accelerate onboarding and build data products outside the central team. We cover active metadata and how its applications, from embedded collaboration and an open API interface powered by knowledge graphs to programmable-intelligence bots, data process automation, and reverse metadata, can help meet enterprise governance standards.


Transcript:

Kaila: Joining us to lead this session is Prukalpa Sankar, co-founder of Atlan, a modern collaborative data workspace. She previously co-founded SocialCops, a leading data-for-good company, where she led data teams for large-scale projects with organizations like the UN, the World Bank, and the Gates Foundation, along with projects with several large governments, including building India's national data platform. So without further ado, I want to welcome Prukalpa to the stage. Prukalpa?

Prukalpa: Hi, everyone, super excited to be here. Thank you so much. Before we start, I thought it would be helpful to say who I am, why I'm pontificating about this topic of activating metadata, and what pipeline chaos really means to me on a personal basis. My name is Prukalpa, and the best way to introduce myself is as a lifelong data practitioner. I'm the co-founder of a company called Atlan. We're pioneering active metadata, what we think of as the real collaboration layer to the modern data stack, powering teams today like Postman, Plaid, WeWork, Unilever, and so on. But most importantly, the journey to get here came with tons of successes and failures in building data culture. We failed three times ourselves in building a data catalog before finally getting it right.

We built India's national data platform, which is used by the Prime Minister himself, ran over 200 data projects, and dealt with all the challenges that come with that. A lot of those learnings are what brought me to where we are today. We started as a data team ourselves, using data science for social good. We were dealing with a wide variety and scale of data: we were processing data for 500 million Indian citizens and billions of pixels of satellite imagery, working with organizations like the World Bank, the United Nations, the Gates Foundation, and so on. And while from the outside a lot of these projects seemed like incredibly cool dream projects for a data practitioner...

...the reality internally was that this was the story of my life. What you see on the screen are actual Slack messages that I would see on a daily basis: What does this column name mean? Which dataset should I use for this analysis? Is this PII data, and is it accessible to the right people or the wrong people? And honestly, the dreaded question that all of us as data practitioners always wonder about: "that number doesn't look right." I still remember this one time when the cabinet minister called me at eight in the morning and said, "Prukalpa, the number on this dashboard doesn't look right." I opened up my laptop and frantically pulled up the dashboard, hoping that nothing was wrong.

And there it was: a 2x spike in a day. So clearly something was wrong. But there was nothing I could do at that moment; I had no answer to give to a stakeholder I had built trust with over a couple of years. So I said, "I'll call you back." Then I called my project manager, who called my analyst, who called my engineer, who pulled out audit logs. And he couldn't troubleshoot it, because he didn't know what that number actually meant. It took us four people and eight hours to even figure out what went wrong. So obviously we lost agility. But most importantly, at some level, trust broke in our team. When I wasn't able to answer to my stakeholder, in this case the cabinet minister, why something didn't look the way it should, trust broke for me. At the same time, trust broke between me and my data engineer; I was even questioning, did my pipeline break, or did my data engineer do something wrong?

And that really started the journey of what we call the assembly line project. We got to a point where our team was spending 50 to 60% of its time dealing with this kind of chaos on a daily basis, a large part of it the pipeline chaos and collaboration chaos that we would see every day. We said we just can't scale like this anymore; we have to find a better way. So we started building internal tooling and running cultural experiments to see how our team could work together better and eliminate this chaos from our lives.

Over two years, our team became six times more agile. We went on to do things like build India's national data platform, which the Prime Minister himself uses. It was the fastest-built and largest public sector data lake of its kind, built by an eight-member team in literally 12 months. We worked with the United Nations on the SDG agenda and how to incorporate data science into that. We were even able to do things like predict affluence down to an individual building level, all with an incredibly small team, at a speed that just would not have been possible without the tools that we built for ourselves. And the solution that we finally iterated our way towards is what I think of today as activating metadata. Metadata, for a very long time in the modern data stack, has been the uncool child in some ways.

But used the right way in a modern data stack, it can lead to an incredible amount of efficiency gains in the way a team operates. So before we get into what activating metadata actually means, I think it's helpful to take a step back and ask: what is metadata? The simplest way to think about metadata today is just data about data. If you think about your data itself as a record, say a customer in a customer 360, you think about all these properties of that customer: the account size, the rep working with that customer, and a whole host of other data points. Similarly, think about every single data asset: your tables, your dashboards, your pipelines, your models. The data surrounding these data assets is what I think of as metadata.

Traditionally, metadata has been about two kinds of metadata. First, technical metadata; this is really where the origin of metadata comes from over the last couple of decades: schemas, data types, models, the metadata that we get from our database systems themselves. About a decade ago, we started layering business metadata on top of that: things like data tags and classifications, business terms, and the things that data stewardship teams are driving in the ecosystem. From there, I think the most recent, and more important, forms of metadata are social metadata. Every time we're having a conversation on Slack about which dataset to use for an analysis, or what a column name means, or whether this is the right data, how do you bring that social context back into your overall metadata layer? And then finally, operational metadata. This is the most recent form of metadata, and what I'll focus a large part of today on: the things that our operational processes and systems are generating on a regular basis, right from pipeline status and process metadata to lineage and so on.

Now, the traditional way that we use metadata is like this: you take a bunch of metadata from a bunch of different tools and put it in one tool, typically called a data catalog or a data governance tool. The data catalog broadly tends to be an organized inventory of the data assets inside an organization. The challenge with this approach is that it's fundamentally broken. We failed three times ourselves in getting a data catalog up and running inside an organization. The reason is that it tends to solve a problem of silos by creating one more siloed tool in the ecosystem. Think about it from an end user's perspective: when I'm in a dashboard and ask, "Can I trust this dashboard?", the last thing I want to do at that point is jump from the dashboard into some other tool, search for that dashboard, and trace through the lineage to see whether I can trust it. In today's world, users are used to experiences that give them context where they are, when they need it. Second, this typical approach is incredibly generic. It treats every consumer the same, with absolutely no domain context.

But if you think about it, context means extremely different things to different people. Context for a data engineer typically means pipeline status: did the data quality run fail or not, and a whole different kind of metadata you would want as context. For that same table, a business user is likely going to ask what metrics are associated with it, while the business analyst is going to ask about the frequency distribution of columns and which analyses have been derived from these tables. So how do we create experiences where end users can start leveraging this universal context to make their daily lives better? And then finally, automation. Traditionally there's been minimal automation, creating very manual systems, lots of manual overhead for users, and ultimately poorly adopted systems. So what could the future really look like? When I think about the future of B2B or data tools, I actually like to look toward experiences in the consumer world.

Because in the consumer world, you typically get those experiences earlier on. Let's take the example of the generic experience that data engineers and analysts have today. You and I actually have extremely different experiences on Netflix: it's curated to a personal level of taste, using, again, a whole bunch of data to do that. So why can't I create personalized experiences for the data analyst that are very different from the data engineer's, which are very different from the marketing analyst's, which are different from the financial analyst's? Similarly, another element of this, as we talked about siloed context: how do you go from siloed context to truly embedded context, where context is not available in one individual tool but is embedded into the day-to-day workflows and tools that people are already using? This could be, for example, directly available in Slack as a Slack bot: when people are having a conversation about a particular data asset, right then and there the Slack bot can add the whole universe of context to that conversation. Or, say I'm in a dashboard and I see a number that doesn't look right; the context from the pipeline, about whether the pipeline was updated on time, is available right then and there in the dashboard itself.
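To make that embedded-context pattern concrete, here is a minimal sketch of such a Slack bot, assuming the slack_bolt library. The lookup_asset_context helper is hypothetical, standing in for a call to whatever metadata platform you use.

```python
import os
import re

from slack_bolt import App  # pip install slack_bolt

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])

def lookup_asset_context(asset_name: str) -> dict:
    """Hypothetical call into a metadata platform's API."""
    # In practice this would hit your catalog / active metadata store.
    return {"owner": "@data-eng", "freshness": "updated 2h ago",
            "certified": True}

# Reply whenever someone mentions an asset like `orders_daily` in a message.
@app.message(re.compile(r"`(\w+)`"))
def add_context(context, say):
    asset = context["matches"][0]          # first regex capture group
    meta = lookup_asset_context(asset)
    say(f"Context for `{asset}`: owner {meta['owner']}, "
        f"{meta['freshness']}, certified: {meta['certified']}")

if __name__ == "__main__":
    app.start(port=3000)
```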

And I think that brings us to the next phase of what metadata can do: going from passive, traditional ecosystems to truly active systems, where metadata acts as a layer driving action on a daily basis as part of the day-to-day workflows and tools people are using. To understand this, again going back to analogies and just the English dictionary: what does it mean to be passive? If you describe someone as passive, you mean that they do not take action, but instead just let things happen to them. And that's what the current system of metadata really is. You basically take metadata from a bunch of different places and put it into this one other tool, but it just sits there. It's not actually taking action; it's just letting things happen.

Being truly active, on the other hand, is about being engaged in action characterized by energetic work and participation. So you take that same metadata and ask: how do you actually activate this into the day-to-day workflow? Just understanding what active means becomes really important. And I want to take a moment here, as we think about what it means to be an active system, to answer the question of agility: how does being a truly active data team drive agility, and how do you even measure agility in the first place? So, thanks for those questions. What we ended up doing was measuring a bunch of different things, and Lena from our team actually shared a blog post about this. But most importantly, we measured a few things.

First, cycle time, or velocity: how long does it take for us to get something done? Say I'm pushing out a new dashboard or a view; how long does it take me from day zero to actually publishing? And against that, we would measure how much time, and how many people, went into it. One other interesting parameter we started measuring, because agility is a double-edged sword (when you go after agility, sometimes quality suffers), was the number of iterations required to get to the final output. Typically, if an analyst working on an analysis project takes three iterations between the business and the analyst to get to the final output, versus one iteration, then in the second case you've actually done a better job of building an agile ecosystem. So we took these as our core metrics, started tracking a lot of this on a daily basis, and that became the basis for measuring how we improved agility over time across the projects we were running.
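To make those metrics concrete, here is one way a team might compute them from simple project records. The record fields are illustrative assumptions, not anything from the talk.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean

@dataclass
class ProjectRecord:
    started: date        # day zero for the request
    delivered: date      # when the dashboard/analysis shipped
    iterations: int      # round-trips between business and analyst
    person_days: float   # total people-time invested

def agility_report(records: list[ProjectRecord]) -> dict:
    return {
        "avg_cycle_time_days": mean((r.delivered - r.started).days for r in records),
        "avg_iterations": mean(r.iterations for r in records),
        "avg_person_days": mean(r.person_days for r in records),
    }

# Example: two analysis requests tracked over a sprint.
print(agility_report([
    ProjectRecord(date(2022, 3, 1), date(2022, 3, 8), iterations=3, person_days=6),
    ProjectRecord(date(2022, 3, 10), date(2022, 3, 14), iterations=1, person_days=2),
]))
```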

So going back to what it means to be a truly active metadata platform: the way I see this is, you take a bunch of metadata, and it comes not just from your tables but from your lakes and your warehouses, your BI tools, your pipelines, your code. You bring it all into one central platform, which stitches it together and makes sense of this metadata, and then feeds it back into downstream experiences. These could be in-flow, embedded experiences, which I'll talk about in just a bit, but also automation workflows: things like data quality, data observability alerts, cost optimization, integrating into CI/CD pipelines, and a whole host of use cases that can start getting automated by leveraging metadata in the right way inside ecosystems.
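As a rough sketch of that shape, not any particular product's API, here is the "stitch it together, then fan it out" pattern in a few lines; the event shape and handler names are assumptions.

```python
from typing import Callable

MetadataEvent = dict  # e.g. {"asset": "orders_daily", "kind": "pipeline_status", ...}

class ActiveMetadataHub:
    """Central platform: ingests metadata events, pushes them downstream."""

    def __init__(self):
        self.handlers: list[Callable[[MetadataEvent], None]] = []

    def subscribe(self, handler: Callable[[MetadataEvent], None]):
        """Register a downstream experience or automation workflow."""
        self.handlers.append(handler)

    def publish(self, event: MetadataEvent):
        """Stitch a new metadata event in, then fan it out downstream."""
        for handler in self.handlers:
            handler(event)

hub = ActiveMetadataHub()
hub.subscribe(lambda e: print("observability alert:", e))  # alerting workflow
hub.subscribe(lambda e: print("ci/cd hook:", e))           # block deploys on bad metadata
hub.publish({"asset": "orders_daily", "kind": "pipeline_status", "status": "failed"})
```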

So going back to pipeline chaos: how do you actually help the humans of data work better together? Let's focus on that one core problem, pipeline chaos. Pipelines inside organizations are typically owned by just one type of person; it's typically your data engineer who owns the pipeline. But the challenge is that the people affected by the pipeline are a whole host of end users: your analytics engineers, your analysts, all the way to the business user consuming the final analytics products that that pipeline powers. So let's take something as simple as finding a pipeline break.

If you're a data engineer, a typical metadata platform will be able to show you things like lineage, literally down to a column level, layered with things like: did my pipeline break or not, and what is the status of my pipeline? And from there, right there in flow, you could create an announcement flagging the pipeline failure, and that can get propagated downstream, literally down to your Slack, and made available directly in your BI dashboards and so on. Second, creating, for example, a JIRA ticket when you find a break. There's this whole workflow where you find a break in your system, then jump from that into an issue management tool like JIRA.

Then you jump from there into Slack. An active metadata platform will actually bring this all together in one place. What you could do is bring JIRA directly into your workflows, so that when you find an issue, you create an associated bug, and it's available right there in JIRA for your end users and data engineers to start working on. The same goes for the workflow of sharing issues on Slack as they happen. So one real core element of activating metadata is saying: how do I take what is available in a single siloed system and make it available in downstream systems, be it Slack, JIRA, my BI tools, and so on?
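A hedged sketch of that "one place" idea: on a detected break, file the JIRA bug and post the Slack announcement from a single workflow instead of hopping between tools. The URLs, project key, channel, and tokens below are placeholders; the endpoints themselves are the standard JIRA and Slack REST APIs.

```python
import requests

def announce_pipeline_break(asset: str, detail: str):
    """On a detected break, create a JIRA bug and a Slack announcement in one step."""
    # 1. File a JIRA issue (standard JIRA REST v2 create-issue endpoint).
    requests.post(
        "https://your-org.atlassian.net/rest/api/2/issue",
        auth=("bot@your-org.com", "API_TOKEN"),
        json={"fields": {
            "project": {"key": "DATA"},
            "summary": f"Pipeline break upstream of {asset}",
            "description": detail,
            "issuetype": {"name": "Bug"},
        }},
    )
    # 2. Post the announcement where consumers already are (Slack).
    requests.post(
        "https://slack.com/api/chat.postMessage",
        headers={"Authorization": "Bearer SLACK_BOT_TOKEN"},
        json={"channel": "#data-consumers",
              "text": f":warning: `{asset}` may be stale: {detail}"},
    )

announce_pipeline_break("orders_daily", "Upstream load failed at 06:00 UTC")
```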

The second element of this is really about the tools of data: how do you have the tools of data work better together to conquer, again, that challenge of pipeline chaos? Now, I want us to take a moment and envision this. Let's assume that a pipeline broke somewhere in your ecosystem. What you could do is bring in metadata from a bunch of sources, refresh it, and detect the changes that happened. From there, you could identify your dependencies; this is where something like end-to-end, column-level lineage plays a role. And directly from there, you could notify consumers. If you know these are the dependencies, these are the dashboards powered by this particular pipeline, you can easily start notifying your end consumers, whether they live in Slack or Teams or JIRA, whichever makes sense for that particular kind of consumer. Now, if you just step back and look at this flow, it's just a series of steps.

It's an orchestration workflow of sorts. Gartner said something really interesting in their recent market guide for active metadata: the standalone metadata management platform will go from being an augmented data catalog to a metadata "anywhere" orchestration platform. So take that workflow and bring it into context: something as simple as notifying on downstream impact is actually a series of orchestration steps. If you have metadata that is unified and comes together from your pipelining tools, your data warehouse, your BI tools, as well as your social context tools, you can bring this together as an orchestration layer and run almost what Zapier does for SaaS: a series of steps to notify consumers downstream of impact. And that takes us one step further into what the future of metadata, and of your data stack itself, looks like. You could truly achieve the intelligent data management platform dream by bringing together a bunch of best-in-class tools in the ecosystem, unifying them through this common metadata layer, and using that to make these tools work together extremely well.
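Seen that way, the flow really is a short list of composable steps. Here is a minimal, Zapier-style sketch of the "notify downstream on impact" workflow; every function body is a stub standing in for a real integration.

```python
# The "notify downstream on impact" flow as a plain orchestration of steps.

def refresh_metadata() -> dict:
    """Pull fresh metadata from warehouse, pipelines, and BI tools (stubbed)."""
    return {"orders_daily": {"status": "failed"}}

def detect_changes(snapshot: dict) -> list[str]:
    """Find assets whose pipelines broke since the last refresh."""
    return [asset for asset, meta in snapshot.items() if meta["status"] == "failed"]

def downstream_dependencies(asset: str) -> list[str]:
    """Column-level lineage lookup (stubbed with a fixed mapping)."""
    return {"orders_daily": ["exec_revenue_dashboard"]}.get(asset, [])

def notify(consumer: str, asset: str):
    """Slack/Teams/JIRA delivery would go here."""
    print(f"notify {consumer}: upstream {asset} failed")

def run():
    for asset in detect_changes(refresh_metadata()):
        for dep in downstream_dependencies(asset):
            notify(dep, asset)

run()
```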

And if you take this a couple of steps further, there are so many use cases this can power. Take dynamic data pipeline optimization: if I know that 90% of my users log into a BI tool at 10am on a Monday morning, I should be able to use that to update my data pipeline at 9:45 and auto-scale up compute in Snowflake, so that my users have the best experience. And when users log out, say about two hours later, everything auto-scales down. If you know that, you can really start optimizing data pipelines. Say you have a dashboard that's used once a month; you shouldn't be updating that dashboard every single day. You can start building dynamic data pipeline optimization systems, and it's basically a series of steps, an automated orchestration workflow, in some ways.
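A minimal sketch of what usage-driven optimization like that could look like. The thresholds, warehouse name, and schedule are assumptions; the ALTER WAREHOUSE statement is standard Snowflake SQL, though its execution is stubbed here.

```python
from datetime import datetime, timedelta

def should_refresh(dashboard: dict, now: datetime) -> bool:
    """Skip refreshing assets that are rarely viewed (preset 7-day rule)."""
    return now - dashboard["last_viewed"] < timedelta(days=7)

def prescale_warehouse(peak: datetime, now: datetime):
    """Fifteen minutes before the usage peak, scale up the BI warehouse."""
    if peak - timedelta(minutes=15) <= now < peak:
        # Execution stubbed; in practice, run via a Snowflake connector.
        print("ALTER WAREHOUSE bi_wh SET WAREHOUSE_SIZE = 'LARGE'")

now = datetime(2022, 3, 7, 9, 45)  # Monday morning, just before the 10am peak
prescale_warehouse(peak=datetime(2022, 3, 7, 10, 0), now=now)
print(should_refresh({"last_viewed": datetime(2022, 2, 1)}, now))  # -> False
```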

Or take automated quality control of data pipelines. Say you have a series of tests in your data warehouse as data moves through bronze, silver, and gold layers. In each of those, you run the tests against predefined thresholds, and you connect that metadata signal back to your pipeline itself: only if your data passes the checks in the bronze layer does it get promoted to the next tier, and only then does the pipeline run to push the data into the next phase. If it doesn't pass, it goes into a human-in-the-loop flow: you automatically create a JIRA ticket, and a human reviews that table to see what went wrong. Through that process you can prevent bad data from ever ending up on your business layer or your business dashboards.
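A minimal sketch of that promotion gate, with the tests and ticketing stubbed out as assumptions:

```python
def run_tests(table: str, layer: str) -> bool:
    """Run the layer's checks against predefined thresholds (stubbed)."""
    return layer != "bronze" or table != "orders_raw"  # pretend one bronze check fails

def open_jira_ticket(table: str, layer: str):
    """Human-in-the-loop step: a person reviews what went wrong."""
    print(f"JIRA: review {table}, {layer} checks failed")

def promote(table: str):
    """Promote through bronze -> silver -> gold only while checks pass."""
    for layer, next_layer in [("bronze", "silver"), ("silver", "gold")]:
        if not run_tests(table, layer):
            open_jira_ticket(table, layer)
            return  # bad data never reaches the business layer
        print(f"promoting {table}: {layer} -> {next_layer}")

promote("orders_raw")    # stops at bronze and files a ticket
promote("orders_clean")  # promoted through to gold
```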

Another example is purging stale or unused assets. If you know that you have data assets in your ecosystem that haven't been used over a significant period of time, you could automatically run a purging algorithm that calculates asset usage, and that activation workflow goes into managing your lifecycle and automatically archiving data assets on the basis of a certain set of preset rules. I could go on forever about the kinds of use cases that metadata could power in the ecosystem: from cost optimization, which in the current environment is more important than ever, to security, defining access control policies at scale in one place, to continuous validation of metrics, a whole host of use cases. And so, while we don't fully have the answers to what data infrastructure is really going to look like in the next decade, what is increasingly clear is that metadata is going to be the glue that brings together and powers the modern data stack.
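And a sketch of that purging rule, assuming usage metadata (for instance, last-used timestamps derived from query logs) is already available; the asset names and the six-month cutoff are illustrative.

```python
from datetime import datetime, timedelta

ASSETS = {  # asset -> last-used timestamp, e.g. from warehouse query logs
    "orders_daily": datetime(2022, 3, 1),
    "legacy_report_2019": datetime(2021, 1, 15),
}

def stale_assets(now: datetime, max_idle_days: int = 180) -> list[str]:
    """Preset rule: anything unused for six months is a purge candidate."""
    cutoff = now - timedelta(days=max_idle_days)
    return [asset for asset, last_used in ASSETS.items() if last_used < cutoff]

for asset in stale_assets(datetime(2022, 3, 7)):
    print(f"archiving {asset}")  # the lifecycle/archival action would run here
```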

And I hope this gives you a very high-level sense of what it takes to actually activate metadata as part of your daily workflows, and how that can go back into improving agility on a daily basis and reducing the chaos your team experiences. Personally, for me, what that meant as a data leader was better sleep. I remember there was a period of almost six months when I did not leave the office, because something would break every day, and I would be woken up with a crisis call almost every second day. As we systematically worked towards improving this, towards making our team better and leveraging metadata to slowly optimize and automate every aspect of our data stack, we went from being an extremely reactive data team to a very proactive data team, and we were able to take control of our destiny and the roadmap we were working towards.

So with that, thank you very much for having me here. If you're interested in talking more, I'm accessible on Twitter @prukalpa, and I write a weekly Substack called Metadata Weekly. I would love to continue the conversation.

Kaila: Thank you so much Prukalpa, for that really insightful chat.


Speaker:

Prukalpa Sankar

Co-founder

Atlan
