ZG Session | Pipelines in Practice: How Practitioners Operate using the Modern Data Stack

The Modern Data Stack has ushered in an unprecedented wave of modular, interconnected data tooling. An occasionally overwhelming amount, in fact. How are successful data practitioners navigating this new world? Margaret Francis, CPO at dbt Labs, hones in on best practices, offering practical strategies for teams looking to tackle data problems at scale while staying on the cutting edge.

Transcript:

Joe: Up next, we have Margaret Francis, the Chief Product Officer of dbt Labs. Prior to dbt Labs, Margaret was President, COO and Board director of Armory, GM and SVP of Heroku and Salesforce data platform at Salesforce, and VP of product at ExactTarget. She has also co-founded two venture-backed startups, and Margaret holds a BA from Yale University and an MFA from SFA. Today, Margaret is going to cover Pipelines in Practice: How Practitioners Operate Using the Modern Data Stack. Let's give Margaret a warm welcome to the stage. Margaret, welcome.

Margaret: Joe, I really appreciate the intro. I'm going to try and live up to that bio there. I'm pretty excited to be here today. I watched a number of the talks this morning. And one of the things that I loved was not only how interesting they were, but how rich some of the question and answer sessions were afterwards. So I'm going to try and keep my content interesting and short, so that we can all get to a Q&A session and exchange a little bit of knowledge. And the topic that I really chose for today was all about kind of real world examples of how people are operating pipelines within the context of their companies, the problems that they face, and some fixes that we have seen for how to address them. Now, I'm a vendor, there's a bunch of vendors here, we've all got our vendor shtick. I am not trying to sell you any DBT today even though I love it very much.

I more want to provide a little bit of context and observational bias around where I sit in this ecosystem of really fantastic products, and the views that I then have of what kinds of problems people have in their pipelines. So today, I would say that, on average, of the thousands of customers that we have at DBT, these are like the most common tools we see that they build data pipelines with. It is by no means an exhaustive list, it is by no means a recommended list. I'm telling you, this is kind of like the ecosystem we're operating in. And these are the products that we largely see people building and operating pipelines with. Now, as for us, you know, we're a relatively young company, we do have a great open source project that has a lot of community adoption and usage and love as well as a number of cloud customers that are kind of in the Fortune 500 range of things.

But I would say that the data teams that we actually see kind of range in size from maybe like one person on the team to hundreds on the team. And so the problems they encounter that I'm going to talk about today, these are kind of applicable, I would almost say no matter what size of company and what size of team is actually building and operating those pipelines. So one more short word about kind of where DBT fits in the stack. So you know where we're coming from, because this presentation is not solely my work product, I consulted with a bunch of our professional services consultants at DBT. Because I wanted to get a feel for where people are really feeling pain right now, from that practitioner perspective. So where DBT really comes in is for those companies that have really already consolidated quite a lot of data into one of the more modern data platforms, we execute, we compile SQL that people write, we apply it in the data warehouse, it never goes outside the data warehouse.

It's not a pipeline in the traditional sense of, you know, bits and bytes are moving. But we are creating transformation pipelines that people then use, often with their BI tools, or for other purposes inside their data warehouse. So it's not really data movement, it is data pipelining that DBT does, by virtue of providing this interface through SQL and application of that SQL to the data in the data warehouse. So with that backdrop, here are the problems that we very much see people having, practitioners having, in the modern world. It's that pipelines can really develop some kind of twisted plotlines. It's kind of like, you know, Season Eight, Episode 14 of, you know, Game of Thrones level plotline difficulty. It's that those pipelines often develop some tricky kinks. It's that incomplete data lineage can really lessen trust in the enterprise when it goes to use that data.

The data tools that people use don't always work the way the teams do. There are real mismatches conceptually sometimes that inhibit data teams' work. And that there's this crippling fear that a lot of the teams have around wrong data or being wrong in production. I mean, if we're measuring ourselves by how useful and accurate we are, then any kind of mistake really kind of strikes terror into the heart of a data practitioner. And the last but not least one is that we are really grappling with a whole new set of concepts in data pipelines right now that have to do with how we incorporate machine learning data, which is much more cyclical than the conventional DAG kind of model that we've been using in this industry for quite a long time. Okay, so, pipelines can have difficult plotlines. These are real life examples that, you know, have been sort of suitably blinded of what an actual kind of customer setup might look like. It can get really, really, really complicated in there.

I am already seeing some really fancy, fascinating questions actually come in via the chat. So thank you for that, Carl. Sorry, Kayla. And I'm going to come to that question of handling transformations with things other than SQL a little later in this talk, because it's a hugely important one for the future of DBT, in particular, but I'll note that even in our conventional DAG, we see some real complexity arising. And probably the thing that I see people using DBT for the most these days is really reorganizing that data into logical models that can really help the teams manage complexity, and improve their productivity. If you really start to take that directed acyclic graph and break it down into usable components that can be thoughtfully kind of pipelined, everything is just going to get easier over time. I like having this bank of questions to answer. So if you're on this session, please feel free to keep them coming.
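The idea of breaking a directed acyclic graph into components that can be pipelined in order can be sketched in a few lines. This is only an illustration using Python's standard library, not how DBT itself resolves dependencies; the model names (`stg_orders`, `fct_orders`, and so on) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical model names. Each model maps to the set of models it depends on.
models = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "fct_orders": {"int_orders_enriched"},
    "dim_customers": {"stg_customers"},
}


def build_order(dependencies):
    """Return one valid build order: every model appears after its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(models)
print(order)
```

Because each model only declares what it depends on, a component can be moved or split without rewriting the rest of the pipeline; the build order is derived rather than hand-maintained.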

The second thing that we see happen is that pipelines often develop some pretty significant kinks. Once you start getting into this sort of like triple join category, it's time to sort of step back and think about whether you really set this pipeline up for success long term. There's a fantastic blog post on the DBT Dev blog that a couple of the consultants produced. And I'll happily share some resources that are for the industry as well as DBT more specifically at the end of this talk, but it really shows what kind of productivity gains you can get when you start to unkink these hoses as you go. Really, everything makes so much more sense and overall flow of data through your pipelines can be pretty significantly increased. So here's a really interesting one. Incomplete lineage can really lessen every data consumer's trust in whatever it is that we produce.

The example that I often hear from the consultants that work inside of DBT is, if that data doesn't look right, and the farthest back that you can get is the table that it originated from, you don't really understand the complete lineage of where your data came from, and how it came to look the way that it does. What happens is people start creating their own sources, their own dashboards, they stop trusting the work of any kind of centralized data team or centralized data infrastructure. And you end up with silos, you end up with work duplication, you end up with conflicting numbers. And mostly we're all in this to provide data and insights that we need to run a business. And so that's a very distressful atmosphere to operate in. I heard the founder of Atlan talking about this a little earlier this morning, you know, in the event, and it was so true, like the minute the data feels not trustworthy, and people feel like they have to go do their own work from scratch, you've really kind of broken something that's hard to put back together inside your organization.
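Tracing lineage past "the table it originated from" amounts to walking a dependency graph all the way upstream. A minimal sketch follows, assuming the dependencies are already recorded somewhere as a simple mapping; the table names and the `upstream` helper are hypothetical and not part of any particular tool.

```python
def upstream(model, parents):
    """Return every model that `model` depends on, directly or indirectly."""
    seen = set()
    stack = [model]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical lineage: each entry lists the direct parents of a table or dashboard.
parents = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
}

lineage = upstream("revenue_dashboard", parents)
print(sorted(lineage))
```

With the full set of ancestors in hand, a number that looks wrong on the dashboard can be checked against each upstream step instead of being rebuilt from scratch in a separate silo.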

You know, if there is a fix here, I'm going to suggest that it's documentation. Okay, that sounds really deeply unsexy. But it's kind of a form of institutional magic, whereby others can retrace the work that has happened, they can go duplicate it themselves, they can recreate results, and really participate in a collaborative data infrastructure that accelerates the company over the medium and the long term. It feels good to build your own dashboard in the short term, it feels better in the longer term to trust your analyst. I pulled in a couple examples here of documentation, whether it's, you know, inserting a code block, whether it's a Jupyter Notebook, whether it's, you know, DBT autogenerated docs, there's a lot of ways to share this kind of institutional knowledge. I don't want to suggest that, you know, the product I work on is the only one. I am suggesting that it is a very real fix for doing collaborative work on data in big companies over time. You've got to get good at docs, got to get good docs.

All right. I'm blazing along, so I will pause on this one, which is less tool-related than some of the talks that I was listening to earlier. There's some pretty great technology out there that's going to help everyone in this industry, I think, continue to get better at what they do, and deliver real value to the business. However, sometimes the business doesn't cooperate. Here's a really simple example. You know, hey, I want to create a dimension out of a fact table, that's a pretty standard operation in data world. I was just listening to a really great talk, I think from a ThoughtWorks consultant, about, you know, data mesh, and how this may obviate the need for some of the work that we do today to create, you know, dimensions and facts. And I'd be pretty psyched to try it out myself.

But what I'm gonna note is that, in reality, even this simple of an action results in organizational workflow that often looks something a little bit more like this. You know, an analyst goes to actually create that table, they perform a transformation, they generate a value that doesn't fit within the schema of the warehouse they happen to be operating in. I'm not picking on Snowflake, I'm just, you know, this happens all over the shop. And that might be because the schema for that warehouse is independently controlled by some other team or person or system that doesn't know why anyone is even asking to make changes, and isn't even really sure that they want to make those changes. So now we get into the data wrangling and the organizational wrangling part of being a data practitioner, which is quite frankly the least fun, especially because the whole time, there's some other business stakeholder executive that probably person A and person B both report to who would really just like to have the data already, and does not really understand why this is a problem that is holding them up. Just, you know, this is the lived reality of a lot of the teams that we talk to every day. And this is probably the most fuzzy answer, the most prescriptively fuzzy thing in this entire talk.

But I don't know that there's a way to circumvent this other than to embed the analysts and the engineers within the business teams to really empower them to do their work in a streamlined way. And to elevate that function within the org because it has so much impact. I have a pretty deeply held belief that organizations that get good at data, that get good at using it to inform their decisions, to drive their analyses, have competitive advantage over the long term, they really do. Probably, if you're here at this event, you may share a similar opinion. But when the people who actually do the work are not empowered to do it in sensible ways, such as being able to modify a schema to accept a string that's 20 characters instead of 16, and the whole pipeline gets stuck, it's just not helping anyone. It's just not helping anyone. Especially because data teams, the ones that build and operate pipelines, have a pretty crippling and well-founded fear of mistakes in production. Again, the founder of Atlan had this wonderful anecdote she told this morning about, you know, having, I believe, the Prime Minister of India call her and tell her that the data looked wrong, you know, at eight o'clock in the morning, and how powerless you feel to do anything about it when you're not really sure what's gone wrong. This is a fear we 100% all live with. You know, I would like to think that I'm pretty good at this data thing.
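The schema example above, a string of 20 characters arriving at a column that accepts 16, is the kind of mismatch that can be surfaced before a load fails. A minimal sketch, assuming rows are available as plain dictionaries; the column name `customer_code` and the helper `oversized_values` are hypothetical.

```python
def oversized_values(rows, column, max_length):
    """Return the values in `column` that are longer than `max_length` characters."""
    return [row[column] for row in rows if len(row[column]) > max_length]


# Hypothetical transformed rows about to be written to the warehouse.
rows = [
    {"customer_code": "ACME-0001"},             # 9 characters
    {"customer_code": "NORTHWIND-EU-2024-01"},  # 20 characters
]

# Assumption: the target column is declared with a 16 character limit.
too_long = oversized_values(rows, "customer_code", max_length=16)
print(too_long)
```

A check like this does not resolve who owns the schema, but it gives the analyst and the schema owner a concrete list of offending values to talk about.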

And I will tell you right now, I went into a quarterly business review with my boss, the CEO, fully two weeks ago. And, you know, a Look timed out or, you know, something else was stuck upstream. And, you know, it's a source of fear, of stress, of shame. And what we are really beginning to see in practice is that the data teams that are able to apply best practices from software development, to improve kind of the safety of release to production, the level of automation they can bring to bear on their work, the number of people who can really participate in doing that work, because there is automation or checks and balances built in through use of Git or CI/CD controls, these are actually paving the way for people to work faster, excuse me, and better over time. That's what happens when you try to give speeches from hotel rooms in Denver, I suppose. But at any rate, it's a very meaningful and material improvement in the productivity and the safety of those teams. They're very much safer once they adopt these more modern app dev practices.
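The "checks and balances" borrowed from software development can be as simple as assertions about the data that run automatically before a change is released. The sketch below is a generic illustration, not any specific tool's test framework; the helper names and the sample rows are hypothetical.

```python
def not_null(rows, column):
    """Return the indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def unique(rows, column):
    """Return the indexes of rows whose `column` value was already seen."""
    seen, duplicates = set(), []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            duplicates.append(i)
        seen.add(value)
    return duplicates


def run_checks(rows):
    """Run every check and return only the ones that found problems."""
    results = {
        "order_id is not null": not_null(rows, "order_id"),
        "order_id is unique": unique(rows, "order_id"),
    }
    return {name: bad for name, bad in results.items() if bad}


# Hypothetical sample: the third row repeats an id and the fourth has none.
orders = [
    {"order_id": 1, "amount": 20.0},
    {"order_id": 2, "amount": 35.5},
    {"order_id": 2, "amount": 35.5},
    {"order_id": None, "amount": 12.0},
]

failures = run_checks(orders)
print(failures)
```

In a CI/CD setup, a non-empty `failures` result would typically block the release, so the mistake is caught in review rather than in the quarterly business review.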

Okay, so now, here's the one that's really interesting. Machine learning data is really kind of an ouroboros. It's not a DAG. The ouroboros is that mythological snake that sort of eats its own tail and represents almost the symbol of infinity in mythology. And it's very different from a DAG, a directed acyclic graph, that has like the sort of clearly articulated beginning and a clearly articulated end. You put raw data in on, you know, the left and you get like a dashboard out on the right. It's just a new way for all of us to be thinking about how we build and execute data pipelines, particularly when those machine learning cycles could have both sources and outputs in what is your like more conventional DAG. There may be tasks that reference these machine learning loops, building models, executing models, training them, getting results back. And I am here to tell you that it is time to think about how this works with your conventional tasks because we are seeing more and more evidence of people wanting to use these machine learning techniques inside even very, like, operational data pipelines today.
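The difference between a conventional DAG and a machine learning loop can be made concrete by checking a dependency graph for cycles. This is only a sketch using Python's standard library; the node names are hypothetical, and real pipelines usually break such loops by versioning or scheduling rather than by rejecting them.

```python
from graphlib import CycleError, TopologicalSorter


def has_cycle(dependencies):
    """Return True if the dependency graph contains at least one cycle."""
    try:
        TopologicalSorter(dependencies).prepare()
    except CycleError:
        return True
    return False


# Conventional pipeline: raw data in on the left, dashboard out on the right.
dag = {
    "features": {"raw_events"},
    "dashboard": {"features"},
}

# Hypothetical ML loop: predictions feed back into the features the model trains on.
ml_loop = {
    "features": {"raw_events", "predictions"},
    "model": {"features"},
    "predictions": {"model"},
}

print(has_cycle(dag), has_cycle(ml_loop))
```

A scheduler that assumes a DAG has no valid order for the second graph, which is why these loops need to be thought about explicitly alongside conventional tasks.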

So I don't care whether you're, you know, looking for spam or improving product recommendations, or, you know, it doesn't matter what people are actually doing in that pipeline. Symbolically, what they are doing is beginning to incorporate these loops. And now this is where we get to the question Carly asked, that came from one of the participants in the session audience, which is: can DBT support complex transformation that cannot be handled via SQL? Well, I'm here in Denver, I'm actually at a product and engineering off-site for DBT, where we are at the very earliest stages of thinking through how we bring Python into DBT Core, because we are hearing from our users that, you know, there are certain kinds of complex transformations that really do require a different kind of tool.

So we are hardly the only people in the industry who have had this revelation. And I'm, I'm pretty excited to finish up my session and go back to listening to everyone else. Because the impact that we're all trying to have here is one of elevating the practice of working with data, the results we get out of it to help the companies and the organizations that we are in whether they are for profit, whether they are not for profit, it doesn't really matter. So all advancements to the craft to me are quite welcome. Another really good question just came in here from a participant, which is, what type of data is best suited for DBT transformation? Is it more of relational structured data or more of unstructured machine data? I can, I would probably say that today the preponderance is relational and structured. But DBT is absolutely suited for doing more of the unstructured and machine data. I often get these questions from people that are around sort of like, you know, like max size, or like max volume, or like max speed.

And you know, what's really unique about the way that DBT helps people transform their data is, if it's all in your cloud data warehouse, and you're basically compiling SQL and applying it against that model in the warehouse, the performance of that transformation and the complexity of it has far more to do with what the warehouse itself can support than with DBT as your method of interfacing with it. I'm going to cover off a couple of these real world advice pieces as a summary and then come back to some of the questions because there's some really good ones in the chat. So I think we'll have a good couple of minutes here. But I just want to reiterate that, toolset aside, DBT aside, though, of course, you know, I'll be delighted if you all come try it on the website, reorganizing your data into logical models will really help you manage complexity and productivity, particularly if you can then reason about how to use those models to unkink the kinks that inevitably occur in anyone's data processing pipeline. It's really that you are making Legos so that you can make a building. And if it looks like sort of spaghetti or string and you can't reorder things or decompose them differently, it's going to keep the team in a place of suffering for too long. Documentation, wherever and however you do it, is going to be a powerful lever for getting your team and all of the pipelines that you build and execute on behalf of the business into their best possible shape and prevent sort of fragmentation and silos and duplicate work.

Empowering and validating that analytic function is incredibly important for kind of long term success. And it helps you do things like create a data council or think about your data as product, things that we're seeing in pretty sophisticated organizations that are very analytical in how they run their businesses. Applying those best practices from software development to data development, development and maintenance of pipelines, really unbinds the organization, makes it very, very effective, and gives the team a lot of safety, which, frankly, they're going to need in the new world of incorporating machine learning loops into what have conventionally been sort of, you know, DAG type data pipeline constructs. So, let me pause there and take a couple of questions from the audience. And while I do that, I'll flash up this. This is a very personal curation of links and things that I follow.

Some come from my company, some come from other people I follow in open source data, whether that's sort of like a, you know, a Sam Ramji, or an industry, you know, leader like Benn Stancil. I encourage you to sort of poke around and see where these links lead you. And I will pick up a couple of these questions, which is, let's see, can you share the most complex customer problem that had to be solved with a new feature in DBT? There's a fantastic talk that our CEO Tristan Handy gives around trying to find the price of, I should say, the margin on a bunch of spring onions. So when you think about like, like a grocery store, and the incredible assortment of SKUs that it carries, you know, how those SKUs exist in relationship to each other. Are you really gonna sell chips without salsa? Like, you know, very, very small, but important questions that you can best answer with data. That is a fantastic model. A fantastic talk to listen to, to kind of understand how you solve deeply complex questions using DBT. Joe, I hope that's a good answer for that participant's question. What do you think?

Joe: Think that's fantastic.

Margaret: Okay. Ask another, what else is in the hopper there?

Joe: So I'll ask a question. You talked about, you know, pushing out analytics into organizations, domains and enabling practitioners to facilitate analytics. You've got the product, you've got the process, right? Where do the people fit in? How do you put in programs to help enable these individuals to make sure that they're doing high quality analytics, because we know different organizations have different maturities? You know, different domains seem to have different maturities within an organization, we see. Finance tends to be very strong, they're working in Excel sheets every day, and they get a lot of the analytic concepts, whereas HR might not be as mature in other organizations. So how do you ensure that as you pass that out to other domains, that we're enabling people and empowering people to execute on those analytics?

Margaret: I think it goes a little bit back towards the embed, empower and elevate the analysts question. I would say that in a lot of the more sophisticated organizations I see, there is actually a very data-literate analyst and/or data engineering team that is very well partnered with all of the functions in the business: G&A, product, engineering, you know, business operations, it doesn't really matter where it is. And some of the centralized data teams that I see, they could have hundreds of people in them. And then people sort of like matrix out to the various functions. And what that means is that the most sophisticated kind of functions can help to elevate the rest when they share best practice, or when they share, frankly, common infrastructure. You know, so it's really kind of a rising tide lifts all boats situation in the good ones.

Joe: Great. Question from a participant just now. Basic question, does DBT integrate with Incorta?

Margaret: You know, I saw that question flash up, too. And, you know, I have to tell you, I don't know. The answer is that I'm relatively new to the Incorta space myself. I was listening to one of the talks this morning, and kind of pulling up tabs, so I can go do my own investigation there. It's entirely possible, since DBT is open source, someone has written that integration already. It's one of the magics of DBT that the organization supports a core open source project as well as a commercial cloud offering. So I mean, when I think about the number of database adapters that are out there in the wild, I think it's somewhere between 50 and 60 right now. And every day, you know, I go check, there's a new one written, and I'm like, oh, okay, we've got a new database we support, that's fantastic.

Joe: Great. Awesome. Another question from a participant is, this is a big broad question: what's an ideal modern data stack? We see there are so many versions out there.

Margaret: That is an excellent question. To some extent, I think it depends on the business you're in and what you've already got. I see a lot of legacy businesses that, you know, as they move to standardize more things on a Snowflake or a Databricks, or even a Redshift, which was, you know, more common maybe five years ago than it is today, it would look like Fivetran for the ingest, it would look like one of the top three or four data platforms in the middle, your BigQuery or your Databricks or Snowflake, it would look like DBT on top, it would look like a few workloads written in Python that could not be standardized on DBT, which is what I was, you know, talking about in the earlier questions, and then it would look like any one of a number of pretty best-in-class BI tools on top. There's a remarkably strong Power BI contingent out there. There's a remarkably strong Looker contingent, but it could be Tableau, could be ThoughtSpot. Even some of the more modern entrants like a Mode or a Heap, we see pretty good traction and good success in our current customer base with. So I'd say it's more about which of these interchangeable parts is best going to make a modern data stack for you in your specific business.

Joe: Thank you. One of the other questions that I had is, we talked about this theme, mentioned I think in every single session I've heard today, that crippling anxiety and fear of something being wrong, right? And people distrusting. The moment that one value is found wrong, the team's authority and trust has been degraded to some extent, right? What do you see the roadmap to be to building that back? Right? You talked about documentation, building these logical models. What other things can we be thinking about to help restore that trust and say that we are someone that you can keep coming back to time and time again?

Margaret: Data provenance is always, always going to be a concern for people in our industry. Where did the data come from? Was it good data? Was it somehow altered or redacted or only partially transmitted? Like, data provenance is always going to be a theme. So you know, there's tools that can help with that, there's documentation and lineage that can all help with that. I think that what really is going to make a difference is things like the use of Git to version how we interrogate or filter or clean or transform data. Because now we can provide this very clear chain, like logical chain of events, across all of our data that hopefully will give everybody sort of like confidence in the end product. When you can really trace it all the way back, it becomes clear where intervention or changes have affected it. And you begin to trust not just the dashboard you're looking at today, but your process for making them and your process for getting that data right in the first place. So that's going to be an area I think we're all working on in the industry for quite a long time to give people better solutions.

Joe: And the thing that you mentioned, too, is everyone's running fast with this, right? Everyone's sprinting towards solutions. The question then becomes like, how do we take the scissors out of everybody's hands while they're running, and give them a blunt spoon, right, to work with? What can we do to empower that? So you talked about adopting software methodologies to move faster, move more safely, right? And bring those two principles together. Do you see challenges in organizations adopting those software principles today?

Margaret: Unquestionably, the data team often has not been as formally trained as the app dev team. And even in app dev space, like, it's not like everybody has like a master's in CS. So getting to kind of, you know, basic fluency with some of the principles of change control and environments is going to unblock everybody. It is a great starting point for everyone. Whether they're an analyst who knows, like, every Excel shortcut, and, like, not at all, you know, how to write code, or they're a pretty sophisticated, you know, data engineer, just having the same conceptual familiarity with those concepts can help us all play well together in the same industry.

Joe: Great, thank you, Margaret. We have now reached the end of this session, and I appreciate you fielding so many questions. And if you have more, I'm sure it sounds like Margaret's been in chat throughout the sessions. I'm sure you can keep asking and she'll be poking around there a little bit.

Speaker:

Margaret Francis