Learn how to incorporate machine learning solutions into existing data analytics workflows with Juan Malaver, Data Scientist at eCapital Advisors. In this 5-min lightning talk, he shares eCapital's approach to delivering ML insights by leveraging a unified analytics platform to streamline data science workflows.
Hi everybody, and welcome to the five-minute lightning talk: SPARK Your Organization's Data Strategy with Scalable Machine Learning. My name is Juan Malaver. I'm a data science consultant at eCapital Advisors, and my work mainly consists of developing and deploying data science solutions for my clients. So let's talk about why Apache Spark is a really great technology for developing machine learning solutions. Apache Spark is a unified analytics engine with functionality that makes it work really well with large volumes of data, and today we'll be talking about some of the machine learning functionality that's built into Spark itself. Now, this is what a typical machine learning workflow might look like: you're pulling in data from different sources and passing that through some ETL.
And then you take your clean and structured data, pass it through your models, tune your models with cross-validation, evaluate them with the appropriate metrics, and last but not least, report your results to your end users. Now, if you've done machine learning before, you'll notice that this is pretty similar to, frankly, any machine learning pipeline, and that's intentional. Spark is not really meant to substitute for other machine learning workflows, but rather to complement them and help you scale them to larger volumes of data. The building blocks of machine learning in Spark come from the MLlib library, which is embedded in every Spark instance, and some of the structured data types that are built on top of it. If you work with tools like scikit-learn, you'll notice that the logic here is pretty similar.
We have transformers that allow you to do data pre-processing and feature engineering. So, you know, if you are one-hot encoding your categorical variables or scaling and normalizing your numerical variables, transformers will help you out there. Estimators are all about applying your models to your data. So if you're training your supervised or unsupervised machine learning models, there's support for that as well. And then evaluators allow you to observe model performance. So, you know, in the classification world, your ROC curve, confusion matrix, precision and recall, or if you're doing regression and want to analyze your residuals, there's support for that as well. And then pipelines are really the glue that puts all of these pieces together. So you can queue your transformers, estimators, and evaluators into a single pipeline using the Spark syntax, and then automate the stages in your machine learning workflow.
Now, why would we want to use something like this instead of other machine learning technologies out there? There are two main reasons. Spark allows you to do parallel computing, and in the machine learning world, that means we can train machine learning models across multiple worker nodes and really leverage the full compute resources of the production environment. The other reason is lazy evaluation, which means that when you set up these pipelines, Spark is not collecting your data into memory at every stage. Rather, you put in your raw data at the beginning and only collect it at the very end, once you have your final output and it's ready for reporting. In terms of what a data science lifecycle might look like with Spark, well, we at eCapital like to think of all data science projects as software development projects. And when you bring Spark into the picture, there are a couple of things to keep in mind.
Spark itself is written in Scala, but it has API support for Python and R. So for data scientists who are already working in those languages, using Spark is not very difficult; it's just a matter of understanding the syntax. And then for data engineers, there's support for SQL as well. Configuring the Spark cluster itself can be a little bit intimidating at first, but with just a couple of tweaks to the settings, you can be parallelizing your machine learning models in no time. Now, I want to take a couple of seconds here to talk about why Incorta is a great tool for working with Spark and developing machine learning pipelines. Incorta really allows you to discover insights from your data very quickly, really at any level of granularity. Because it's an end-to-end platform, building rapid prototypes and getting those in front of your users is very quick as well.
A lot of the overhead that comes with configuring Spark is removed with Incorta, since you can configure the settings within the UI or your code. And then it also has a built-in job scheduler, so deployment is pretty straightforward, too. And, you know, most importantly, speed and performance, thanks to the direct data mapping and the Apache Spark engine itself. Now, we want to invite everybody in the audience to join us for a webinar on June 9, where I will be diving deeper into how we at eCapital approach machine learning solutions, talking about some success stories, and showing you how you can leverage the full power of tools like Spark and Incorta to develop very powerful machine learning solutions. So join us on June 9 if you're interested in diving deeper into that. Thank you, everybody.