This session looks at the ways that you can easily generate metadata and then use that same metadata to drive data engineering, analysis, governance, and security in your day-to-day workloads.

Tagging data can solve a myriad of problems when using the data (especially in large databases with a wide range and variation of data).

With all of the regulatory requirements around data today, tagging data can introduce an easier methodology for dealing with governance. You can use tags to identify data that is either PII or constrained under regulatory requirements. You can then use those tags to manage user access to that data based on their roles. This data in turn can then be used to demonstrate compliance to auditors and other regulatory bodies.

Fortunately, tagging goes well beyond the access compliance use case. You can use tags to delineate lineage as well as data transformations and even downstream processing. In addition, tagging is also useful in helping to build and maintain data catalogs, which is critical as modern data volumes continue to grow in both size and criticality.

Invoking a proper tagging methodology can deliver many benefits and this session is designed to show you how to get started on that journey.

 

Transcript:

Cameron: Our next speaker began his journey in data and analytics 25 years ago. He spent 13 years leading the data and analytics efforts at Alaska's largest telecommunications company, and also worked extensively as a consultant. He's currently a Senior Solutions Architect at Snowflake where he works directly with Snowflake customers, helping them to discover and facilitate the newest capabilities in cloud computing. Please welcome Matt Childs, Matt, the stage is yours.

Matt: Thank you, Cameron. And I am super excited to be here. All right, let's jump in. So that was a great introduction, Cameron, I am a Senior Solutions Architect with snowflake. I have been doing this for a long, long time. I've worked in a lot of different industries. And as you can see here, this is my contact matt.childs@snowflake.com. And there's a question that doesn't get answered something that comes up after the presentation. Something else you want to know about? About anything, feel free to reach out and let me know. What we're going to do today is we're going to talk about why we would tag a little bit about the magic of metadata, creating a tagging structure, looking at maybe some potential tagging categories, and then how to track those tags. I'm going to demo a little bit about what we have, and then show you a little bit of analytics as well around those tags. So why do we tag and why do we care, right? One is one of the bigger arguments, and not necessarily specific to the data pipeline or workflow, but sort of a peripheral part of that.

And one of that I think, is very important is governance, right? And I heard a great quote about governance once I totally stole this, but I'm gonna pass it on to you. Governance, you know, when a lot of us, we hear the word governance, and we think we're being told something that we can't do, right? Something that we shouldn't, you know, we're getting constraints put around us. But the way that governance has been best described to me in the way that I look at it now is governance are like the brakes on your car, How fast would you be willing to go? If you didn't have brakes, right, they let you go fast, safely, right. And tags can play it play a large part in that because most of us have some oversight around our data that we are responsible for putting in place, obviously, automation, right, one of the great things about tags, and it's one of the things I'm going to demonstrate today, one of the things I'm super excited about in this space is I can use those tags to automate things like workflows, or meeting requirements, building out peripheral data to my additional data, so I can see what's going on, right? And then obviously, regulation, tags allow us to create an auditable trail so that when that SOX auditor comes, and they will come and says, show me this, you can pull it up pretty quickly, you can show them where you've marked, you know, where you've tagged data, how you've got lineage in place, how, where it's being calculated, what those calculations are all of those things.

And then there's a huge growing demand in the cataloging space, right, we've seen the rise of companies like Alation, and Collibra. And these amazing cataloging tools, because data is no longer the purview of just IT, or a handful of analysts, right? We expect almost everyone throughout our business organizations to be working with this data. And in order for them to be able to do that and do that successfully, they've got to know where that data lives, they've got to have a place like a data catalog where they can go and understand what they're looking, you know, define what they're looking for find it? No, you know that they're looking at the right data, or how to get access to it if they don't have the access to it, who to ask all of those things. And those are, that's another thing that can be accomplished through tagging. So when I say the magic of metadata, right, it seems like a fancy term. It's a little bit more common now than it used to be. But what we're really talking about is scaffolding for data, right? That's what the metadata is. It's the data that tells us about our data.

And that's what makes tagging so important, right? I can use tags to do things like tracking, right, I can follow data through our business lifecycle, from the time it comes in, through a data entry system, to the various systems that it gets processed through out into our data warehouses and data mart's and down the stream to our analytics, folks, right? I have transparency. If I am tracking data at that level, then I know its lineage. I also know when it's being transformed, right? I can basically follow data through that stream and see how it's affecting other data or how it's affected by that flow during the entire time. And one that we don't talk about enough that I think is important is understanding the timeliness of data, right? We are in absorbing and generating so much data now. But it's important to know, you know, what was in the past? Where are we currently? What's, you know? How is that? How is that flowing? What does that look like? What is the velocity of that data? Right, getting the idea of how timely that data is. So that I know that I'm looking at the right things, I'm understanding it in the proper context. Because the same data might mean something different six months to a year ago than it does today, because of a fundamental change in the nature of the way that we've done business or the way that we've approached some aspect that touches that data.

All right. So creating a tagging structure, right? When you do this, you got to you there's, this is the probably the hardest, heaviest lifting of this process, right? Because you really need to sit down and think, right, I want to put together a tagging structure or a strategy, that's going to make the most sense, what do I want to achieve? What am I trying to do here? Right? Is this purely a regulatory exercise? Is this purely a workflow exercise? Right? Or is it both? And then it are there are some aspects of that that are going to overlap? Right? How am I going to get that data in place? What do I need to know about the data? Do I have the proper knowledge, right, that the upstream systems, how is the data getting consumed? And then from a technical standpoint, I need to look at my third party considerations as well. Right? How is this data getting consumed by third party tools like those cataloging tools that we talked about, or even the querying tools or the front end or the analysis tools, right, the those downstream tools that our end users may be using to get it out from Excel to Tableau to Power BI, you know, where are the hooks for those in place? Can I meet that requirement?

Can I put that together? You know, and then when I approach this tagging structure, does it make sense? Have I built it in such a way that I haven't made it complicated enough to meet my requirements, but I haven't overcomplicated it to the point where I can't really look at it and understand what's going on with it, right? It's a real balance, it's really going to be the, it's really going to be the part where you're going to want to put the most thought in because this is the thing, that will be the hardest to change later on. And the part that you want to get right up front, right, so this is the part where you sit down and say, Okay, we're really want to understand this and build this in a way so that it's flexible and malleable. So that is my organization and my data and my business processes change, right? This tagging structure can change along with it, right? I don't want to build anything that's brittle, or on elastic. So a lot of different ways that you can use when you sit down to approach this, right, you're gonna think about tagging maybe as categories, right? I'm gonna show you some today that I built out in a demo database just for this presentation.

But I built out ones that are, you know, that I call workflow, ones that I call lineage ones, one that I call PHSI, right? Because I need to know that privacy information. And all of those have some keys underneath it, right? They're really more I took more of a nested array approach, you can go with a key value approach, you could have a tag and a single tag value, right? You can have tags that evolve as a tree, as long as you understand where those relationships are, and that you might need to, you might need to augment the internal tagging structure to allow for something like that. But you can build those out, right? So you want to sit down and really think through again, what does that going to look like? And how does that make the most sense, given my environment, my business meet needs, and what it is I'm trying to derive from putting this work in place. Because there really is, you know, you're going to have an upfront amount of work that you want to get, you know, you want to understand and then bring that to bear as you're putting the project together because this isn't something you do after the fact you can, but I wouldn't advise that right? This is something I think knowing that you know, for processes that go forward. I mean, obviously most of us have some system we want to go back and add this on and we can't go back but as you go forward as you build out new data assets as you build out those new those new data Mart's or data warehouses or data lakes, and you want to tag those, right, this should be a first class citizen in that project.

When I first started working in BI way back when it was mostly just reporting, there's what like one of the crazy things that would happen is there'd be this big development project. And I started life as a developer. So I saw this happen on both sides of the fence, there'd be this massive development project. And then at the very end, like the last two weeks of the project, right, no matter how long the project was, whether it was a six month or a three month project, there'd be like this two week chunk, where they'd be like, Oh, that's reporting, we'll bring the report writers and then No, right. And it never worked. We never got successful or valuable reporting out of that, because it was too late in the process. Today, we want to we treat analytics, like a first class citizen, we want to bring them in, throughout the process.

And I think tagging is something that we need to look at in the same light, we want to bring that in, as soon as we start to think about building out a new data asset or the business comes to us and says, Hey, we want to put this new warehouse together this new data mart, or we need this new, you know, functionality for all we want to add this functionality to our existing processes. How do we, you know, we want to incorporate it from the get go. So, one of the important parts of that, then is you have to have the ability to track the tags, and you have some trying to be funny, meta-meta data, right? So am I getting to that point, right? Have we gone down the rabbit hole? Because I've got data about data about data about data? Not not quite right. But it is, you do want to understand that's that there is a structure to your tags, you want to document that, right? You want to have a good clear understanding of that, it should be something that you can explain to the business users, so they know what they're seeing, right, even when it's sometimes tags that won't involve them, because we built them out for workflow purposes, or data engineering workloads, those types of things, right. So most databases these days have this capability. This is why I'm so excited about this.

This is why I'm sitting in front of you today, because 25 years ago, I wanted this capability, right? The very first data warehouse I built was an enterprise data warehouse that had 36 data sources. And we were exposing that data via web based tooling, very early web based tooling, it was the late 90s. And the tool that we were using had the ability to expose metadata, right. But we couldn't figure out how to make it work. And it wasn't something that was being driven heavily at the time. But I looked at that. And I thought, how cool would that be as a user? If I were putting together analysis, and I can even just hover over a field and get a little blurb about what that field was maybe where it came from? Right? Or when the last time it was loaded? Right? Those are the things that, you know, it's been sort of that panacea. I mean, because I'm a data nerd, right. So this is the stuff that keeps me up at night. Most modern databases today, understand the concept of metadata, most of the tools that we use have it, the biggest problem was that for a long time, it expected you to put it in there manually, right.

And nobody was going to fund that part of the project, or it was going to end up getting shoved into that last magical two weeks with the report writing, right? And then who was going to keep it up, because there was no automated way to do it. I think most modern databases today, the one I'm going to show you obviously, which is snowflake has that ability to track those tags for you and give you some important information because you want to be able to have that. And then the very last thing that I'm going to show you after the demo of looking at those tags is I'm going to show you some analytics that we put together around those. All right, so and then yeah, because you want to be able to, while it might not be crucial to the business, right, tag reporting is going to become that the tracking and the tag reporting is going to be that thing that assists you when somebody does come and say, Hey, where did this data come from? Right? Or this doesn't seem right, or why is this changing this way, right? And you can go back in and look at lineage and understand, hey, there was an upstream change to one of our data systems. And the fundamental nature of this data has, in fact changed.

But it's easier to find that out rather than trying to go find the ETL developer who did that work, hope that they wrote some documentation if they're not around, and get them to remember what it was that they did, right? And then just that, you know, the idea that you could do tag reporting, so when the SOX auditors or what other kind of auditors show up the financial auditors, whoever your regular regulated by and I've worked in businesses where that was the FCC, the FTC, right? A ton of regulators, when they say, Hey, where's this data coming from? Right? Or show me how this is protected? You have that ability, because you've got not only have you tag the data, but you've got a repository of those tags, to be able to tie that back. And then we have the ability to even, you know, look at our query history and see using the tags and the queries see which of those fields are getting pulled, right? So if someone said, Hey, who's looking at this data, we'd even have the ability to do that.

All right, so let's move into the demo, which I have cleverly called 'tagonomy'. And I'm going to escape out here and move over to my worksheet. And what I've done here is I've built out a demo database, right. And there's a lot of stuff here that you don't generally care about, I built this as a script. And if any of you are interested in having this, again, email me, this is definitely something I'm more than happy to share. But this is where it gets most interesting here, right? So here's where I've created tags. I am able to create tags with a relatively simple straightforward SQL statement, I'm going to blow that up just a teeny bit more for you guys. And hopefully, that's legible. So I created replace a tag and this one PHI, I'm not going to have any values, I'm just going to have that marked as a privacy tag, right? I'm going to create a tag called lineage, and then I'm going to say, Hey, I'm going to create some values.

So once lineage is created, it can only ever be represented by these values. So you have some control over it, right. And this is where you could build out that tree structure, or the key value pairs, that type of thing. So here I've done, you know, on premise Oracle, on premise SQL, on premise ERP billing, you know, I've created a calculated tag, right? And I've just basically given it some general information is this an aggregate and average, I mean, I could also get more in depth with the types of calculations there if I wanted to, or knew about them. And then if I'm obfuscating data, I'm going to create some tags. And here I've used shifted tokenized and masked and later on, we're going to use shifted to do some actual actionable things with the tags, and then workflow, right? So if I'm doing data engineering tags, where's the state of going? And then maybe I can go and find that data, you know, and, and do things with it right? So am I sending data out to Incorta? Am I sending this data over to the warehouse to the Data Mart, is this finance data? Is this data related to a specific project? And then here, in this database, I'm able to tag columns, tables, schemas, databases, in this instance, I'm only tagging columns, but I go through the process.

And I just alter the table and modify the column and I set the tag and I give it a proper value, right? And relatively straightforward. And these are things that can be built into workflows, and obviously can be set up with environment variables. And, you know, so that this work can also be automated. Once I've done that, we have a database as sort of a metadata database called snowflake that all of our customers get, it will propagate into that, and you'll get a table called Tag references. And when I run that, what it's going to do is it's going to bring me back all the tags I've created and some interesting information about them. One of the great things I love is that you'll see here, even after a tag gets deleted, or the object that was tagged gets deleted, I still have a record of it, right. And I'll still know that that happened. But I can see that I've got my schema, my database, my the tag, name, the tag, value, all that good stuff, right. So this gives me an actionable concept. And then I did create a view with these, this data can take a couple of hours to propagate. So I created a view, because I have a couple of different ways to get at that data, the information schema has that data almost immediately available, but I have to do it table by table.

So I created a quick view to just pull those in. Normally, that propagation down to this snowflake database wouldn't be an issue, likely or not having a tag related emergency where I have to have tags immediately. So then I create a stored procedure. And inside of the stored procedure I've created I had at his is a use case that came from one of my customers, they have dates and the dates need to be date shifted per one of their regulators, right? So when you look at the dates, they want to aggregate that even when they aggregate the data they want the date shifted 10 days, right? So my initial pushback was hey, we need to probably randomize that to like minus 10 or plus 10. Because if it's always just 10 days, you're not really shifting the dates because you just take the date and subtract 10 So I created a little function that gets a little take a min and a max. So it has a little bit of elasticity to it and will return a number and in this case, we're calling it with negative 10 or 10.

And I was a little bit worried about zero, I taken zero out and if zero was the return value and added another value, but a coworker pointed out to me that since all the dates are shifting, no one will know when they're zeros, right? So the zero and so I'll go back and change that. And then it just goes through and it shifts these dates and creates a view for them. Right. And if I run this, it'll return it'll recreate that view. And we'll be able to see the date shifted value because I had a return. Right, so it's going to create the SQL statement for me. And here, it did a 10 day date shift. Right, if I ran that, again, I'd get another number. And then I joined this back to the original table. And you can see, my original discharge date was 12/18. But my shifted date now is 12/28. Right. And I can use those shifted dates to expose those views downstream to the users who will see the shifted dates, but this is a way that I can use that tag. Because what I did is I went in here, and I just said, Hey, find me all of the tags. Let's go over here where my tag value was shifted, and my domain was column, right? So I have this this incredible automation feature.

And this is like just scraping the surface of this, that once the customer saw this, we began work on a whole bunch of other processes where we're now pulling these tags and doing more automated things with them. Okay, so from that aspect, so let me jump out last quick, too. Let's see. There we go. All right. So this is my last Thank you, I want to thank the folks that count, for allowing me to use this, this is a tool that's in private beta right now, if you're interested in it, you can go to www.count.co , the website is still pretty, pretty quiet about what they're doing. But you can request more information directly from them. And you can, you can request to join the private beta. This is the next level of visualization tools, and it's going to be a game changer. You're seeing the work on this canvas, where multiples of a multiple people can come and work. And I can create cells, like a notebook, and then create visualizations off those cells and build them out. This is it. I've been using count now for a while. And it's pretty amazing.

And I wanted to use it as part of this demo. So here, what we've done is we've built out an initial tag query just to basically see, hey, out of all of our tags, how many do I have, and I've got 28 workflow tags and 23 lineage tags, right? And then I can scroll down and I can say, Hey, show me these by the table. Right? Show me which tables have which tags, right? So patients got 41 tags, and I break it down a little further. Well, it's got 11 workflow tags and eight PHI tags, right. So I can begin to see which tables have you know, a lot of tag data with. So I might be, you know, I can, again, I can narrow these queries down by looking at where my workflow tags, where's my data flowing down to.

And then I can take that down to a further level where I've got that broken out. So I can see the tag values. So patient has 11 fields that are going to the warehouse, it's got four fields that we're tokenizing, it's got eight PHI fields, it's got 11 fields that came from the on premise SQL database, right, one of those one field that has a calculated mean to it for that are masked, right, I can get to dig and dive down further into the values that are associated with those tags. And then this last visualization basically shows me Hey, for all of my columns, right, which have the most tags, so I can see a tag distribution across the database. And this way, if somebody comes to me and says, Hey, show me all the PHI tags, I can, somebody comes to me and says, Hey, show me what we're sending to Incorta. I can. I can do this very, very quickly. All right.

All right. So I'm gonna put us back here. And I think we are ready for questions. Will tell you that one of the questions I get pretty regularly is how much does this add to the overall project workload, right and there's not a is never a definitive answer around that. But I will tell you that it usually runs in the 10. You know, you're looking at about a 10 to 20% additional workload, but that work will typically fall in the line of business and the data stewards, along with your ETL, and your BI developers working together, right? As you build out the tagging structure, figure out what it is you want to where you want to trace, and then put that workflow into place. So it's not an..there's work that needs to be accounted for, but not an overwhelming amount of work.

Cameron: I see. In Sorry, man, I wasn't on for the first question there. They got me on back on live now. So, so who's the best person and responsible for implementing this? Like, who does the actual work? Like IT, Engineers, data owners? Who do you task this work with?

Matt: Yeah, it's gonna be a little bit of a group effort, right? Because you want to get your data stewards out in the business involved in this process. But obviously, and that's going to be for the anything, that's the tagging that's around PHI, or those types of things. But then your ETL developers for lineage, right for workflow, right. And then typically, you're going to have some sort of database administrator who's going to get involved and help get these things built out. And your database developers, your snowflake developers, in our case, or whoever, whatever database, you're using the developer on that side to implement the tags, however, that database puts those tags in place.

Cameron: And then interesting, so then how do you go about sharing, you know, the tagging that's done the tagging process? The, you know, the naming convention? How do you get out to the to your user base?

Matt: So I think one, you're going to want to put some good solid documentation around about that right outside of the data environment. But then, because you've got this repository, you can also even build a dashboard that says, hey, here are all the tags. And here's what they do. Right? You can build, you know, that one's going to be at a much higher level than a little bit of what I demonstrated, maybe that last one, right is a little bit more applicable for that, where you see the distribution of tags across the columns, but you could go step up one step and say, here are the tags we implemented. Here's why we implemented them. And then you know, and here's how they're put into the database. But I would also couple that with some external documentation as well, right, something out, like in a confluence or a SharePoint site, that gives somebody an overview of the tagging structure that you've put in place.

Cameron: Cool, we have an audience question here also, and they're asking, Can you tag the data as you bring in? Can you tag as you bring in new data sources? Or does it have to be done when the data is accessible in the snowflake platform?

Matt: No, you can do it as part of the as part of the ETL or ELT process, however you're bringing it in, right? So you're bringing in the new data source. This is why if you set up a tagging structure, unless you're you know, if you're building out new tags for that data, you want to have that in place before you do the ingestion. But as if you've got to, if you've got the tags in place before you do the ingestion, when you're doing the ingestion, you can apply them then. Right. So as you're bringing that as you're bringing that in, you know, and if they're if you're tagging them with already part of the existing, that's what that's what's so great about having that central repository, because I can use that as a place to to automate the process of getting those tags propagated throughout the data.

Cameron: Oh, perfect, Matt. We have a few more questions coming in, but we're just flat out of time. I want to really thank you. That was a super interesting talk. Thanks so much.

Matt: Thank you

 

Speaker:

Matt-Childs

Matt Childs

Senior Solutions Architect

Snowflake

Interested in partnering with us for next year's event? E-mail sponsors@incorta.com.