ZG Session | Scaling Up Apache Airflow to Enterprise Level

Rabobank Netherlands is currently building an enterprise-grade data mesh. And within that mesh, our team at Avanade was tasked with implementing Apache Airflow as the de facto orchestration tool for hundreds of teams.

Looking back over the last two years, we can finally say we did it. Airflow now hosts over 50 teams, runs smoothly, and links data delivery streams with data consumption streams. As a result, our team has become known as a sort of Airflow Center of Excellence. However, we did not get to this point without making our fair share of mistakes.

This session covers every step of our journey, including dealing with user team proficiency (or the lack thereof), struggling with the Kubernetes executor, trying our hand at zero-downtime deployments, getting the scaling right, and wrestling with the PostgreSQL database backend.

Transcript:

Glenn: First, I'm going to start with a quick introduction to containers, Kubernetes and Apache Airflow. Next up, I'm going to tell you why you should care about designing for scale. And then the main course: everything we did wrong. Starting with a database being lost, overcrowding our database, issues with network scaling, some application settings you might want to look into, and an upgrade to Airflow 2.3.0. To top it off, we will do the Q&A session. So what are containers? A container is a piece of software which contains everything you need for your application to run.

Think of your web server, application server, maybe even a database server, but also the libraries and all of that. So what's Kubernetes? Kubernetes is a platform which provides you with a runtime: you can get some disks, some storage and some CPU cycles out of it. But it's more than that. It's also an orchestrator, which monitors the workloads on your platform. You specify that you want things to be run in a certain timeframe, and from that point Kubernetes can take over. You set the autoscaler running, and when the workload starts increasing really hard, Kubernetes will detect it, autoscale your workload and make sure that everything is still handled in time. So it's like temporary workers in your job. Then what's Apache Airflow? Apache Airflow is an orchestration tool for data. But it also allows you to design data flows. You can do multiple things with it, for example call ADF or Databricks.

But anything you can call with a REST API, you can use via Airflow. So why should you care about designing for scale? When you start to grow in number of users, your environment will start to behave very differently than it does with a few users. By default, the Airflow Helm chart will serve you well for a small number of users. But when scaling up, you might start to get availability or performance degradation. In addition to that, with a bigger user base, your downtime becomes way more expensive than it was before. If thousands of developers are waiting for you to get your system up and running or responsive again, it costs a lot of money. Not to mention all the downstream systems that might depend on you. Think of the report your manager really needs for that business deal. It can cost a lot of money that way. So, on to our main course. First, I'm going to show you a bit of the Apache Airflow which is now running in my local cluster. Here I have a scheduler, a StatsD pod, a triggerer and a web server. Let me increase that for you so you can also read it. And in here is the actual web server of Apache Airflow.

This is all written in Python, with three different steps. All three of them are Bash operators with a simple sleep for demo purposes. But this can be anything. Like I said before, you can start with dumping data from a database, after that manipulate it into a storage account, and with another operator create a report out of it and send it to your manager. In this case, we will run T1 and T2 parallel to each other and then start T3. So I'm going to start this now and, as you can see, Airflow creates multiple containers in the Kubernetes cluster.

Imagine that you have a cluster with two different nodes, and a couple of pods like the web server, scheduler and triggerer, and the database in your cluster. This is all fine and well until you start experiencing workloads. Like I said before, Kubernetes can scale not only on pod level, but also on node level. If your cluster keeps getting busier, you might even get a fourth node. And you start to see things like rebalancing or redistributing your cluster. When this happens with your database, you get a short unavailability. Most of the time, this is all fine and well, but keep in mind that all your pods are communicating through your centralized database to keep everything in sync, including the web server.

Now, in this case, the migration of the database went okay. But at a certain point in time, your workload will decrease and Kubernetes might decide to start scaling down again. If it happens to scale down the node which contains your database, your database will be terminated. If that happens, we have a phenomenon we call bye-bye database. All your worker pods will start failing. And if that happens, a lot of users have to do some manual work to start them again, and you really hope they made their data flows by the .. So how can we fix this? We found out that autoscaling and databases are not good friends. So we decided to implement a non-scalable database pool. We did this by implementing a taint and a toleration. A taint is something you can set to repel pods from a special node.

And a toleration is something that you will accept. It is like a peanut allergy: some people are allergic to peanuts, others are not. If you are, you stay out of the way. With that cleared up, we have a different phenomenon. As you can see, in my cluster I don't have the in-cluster database anymore. That is because our development team was wasting a lot of time trying to manage the PostgreSQL database. We thought at first it would be easier, because we only had to learn one technology, namely Kubernetes. As it turned out, we had some issues when upgrading it and with automatic backups. So we decided to go with a PaaS database. Now, our PaaS database has 32 cores, which allows for 1500 concurrent connections and 14,195 concurrent users. You would think that this is sufficient. The only problem is that Apache Airflow implements its database connections in a special way. I only have one DAG in my environment now.

And as you can see, we scaled up from four connections in the database to 16 by only running that one DAG. Imagine that with 1000 or more DAGs running concurrently. So how did we solve this issue? We decided to go with another non-scalable node: we reintroduced our database node, and we installed PgBouncer on it. With PgBouncer installed, you change the connection string of your application, and PgBouncer will make sure that it grabs all your connections together and puts them in one single session to your database. You can specify over how many sessions you want to parallelize this. But at least it's using way fewer connections than Airflow itself. The default Helm chart from Apache Airflow comes shipped with PgBouncer, but it's not enabled by default. It took us about one and a half days to implement this, so it's really worth doing. Not only if you have a big cluster, but also if you have a small cluster.
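To get a feel for why the connection limit is reached so quickly, here is a rough back-of-the-envelope sketch in Python. It uses the figures from the demo (4 connections at rest, 16 with one DAG running) and the 1500-connection limit mentioned for the PaaS database. It assumes, purely for illustration, that every concurrently running DAG costs the same number of connections as the demo DAG; real usage will vary with the number of tasks per DAG. The function name `max_concurrent_dags` is hypothetical.

```python
def max_concurrent_dags(max_connections: int, idle_connections: int,
                        connections_per_dag: int) -> int:
    """Estimate how many DAGs can run at once before the limit is hit.

    Assumes connection usage grows linearly with the number of running DAGs,
    which is a simplification for illustration only.
    """
    available = max_connections - idle_connections
    return available // connections_per_dag


# Figures taken from the demo in the talk.
IDLE_CONNECTIONS = 4        # connections with nothing running
WITH_ONE_DAG = 16           # connections while the demo DAG runs
PER_DAG = WITH_ONE_DAG - IDLE_CONNECTIONS  # 12 extra connections for one DAG
CONNECTION_LIMIT = 1500     # limit of the 32-core PaaS database

print(max_concurrent_dags(CONNECTION_LIMIT, IDLE_CONNECTIONS, PER_DAG))  # 124
```

Under these assumptions the database would run out of connections well before 1000 concurrent DAGs, which is the gap a connection pooler is meant to close.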

With small clusters you can go for smaller database sizes. I'm currently running at one gigabyte of memory in my database and one vCore, and it is barely keeping up. But it should have been sufficient. And this was really due to the concurrent connections in Airflow. In addition to that, the Azure PostgreSQL Flexible Server already includes PgBouncer for you. So you don't have to introduce it within your cluster, and you can get rid of that non-scalable node pool, which was another single point of failure. Now on to the fun stuff: no more room in the network. In our environments, we deployed Kubernetes within a subnet. Every piece of compute needs a subnet and IP addresses. And we decided to go with a subnet of 512 IP addresses, which at the time we thought was more than enough. Oh boy, were we wrong about that. We have a couple of pods running here, six at this point in time. But we have other things, like the node, which also require IP addresses.

We have some things in kube-system. kube-system is the mandatory thing for Kubernetes to run properly. It contains things like CoreDNS, etcd, which is a key-value store, the API server, and all other things regarding autoscaling as well. In Azure, it will also contain the drivers to connect to your storage. So you can imagine that you're already losing about five or six IP addresses per node. Now imagine having the autoscaler back up and running again, and a lot of user workloads running at the same time. For us, around midnight is the time that most of our jobs are getting scheduled. Now every pod gets an IP address, and the entire subnet is filled up. Then it really happens: one of our pods was unable to be scheduled due to not getting an IP address. How to solve this issue? There are multiple ways to approach this.

You can either think ahead if you're using Azure CNI, which stands for container network interface: you can calculate how many IP addresses you would need for your environment. For Azure, the maximum number of nodes within an AKS cluster is currently 1000. We decided to go with 90 pods per node. So we can calculate that that would be 90,000 IP addresses required. But don't forget the IP addresses that are required for your nodes as well. With 1000 nodes, that would be 91,000 IP addresses. If you want to upgrade your node pools, you need to scale down some things first. So if you have an exactly fitting subnet, you get another issue. So go with at least double of what you expect to need, because then you can spin up an entirely new node pool, or an entirely new cluster, in the same subnet and upgrade seamlessly.
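The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the helper names and the `10.0.0.0` address range are hypothetical, and it ignores the handful of addresses a cloud provider reserves in each subnet.

```python
import ipaddress


def required_addresses(max_nodes: int, pods_per_node: int) -> int:
    """One IP per pod plus one IP per node, as in the calculation above."""
    return max_nodes * pods_per_node + max_nodes


def smallest_prefix(addresses: int) -> int:
    """Smallest IPv4 prefix length whose block holds `addresses` IPs."""
    host_bits = (addresses - 1).bit_length()
    return 32 - host_bits


needed = required_addresses(max_nodes=1000, pods_per_node=90)  # 91,000
with_headroom = 2 * needed                                     # 182,000
prefix = smallest_prefix(with_headroom)                        # 14

# Hypothetical address range, only used to show the resulting block size.
subnet = ipaddress.ip_network(f"10.0.0.0/{prefix}")
print(needed, with_headroom, prefix, subnet.num_addresses)
# prints: 91000 182000 14 262144
```

For comparison, `smallest_prefix(512)` gives 23, so the original 512-address subnet corresponds to a /23, while the doubled estimate needs a /14.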

The other solution would be kubenet. This uses a kind of NAT, which stands for network address translation, and it gets one IP address per node. With one IP address, it will create more IP addresses in its internal network. It's just like your home network: you get one IP address from your ISP, and internally in your network you have multiple IP addresses, and everything works properly. I'd really recommend going with kubenet if your developers already have experience with it, or if you might decide to go with another cloud vendor. That's mainly because Azure CNI is not available for AWS or Google Cloud, and kubenet is platform independent. If you want to go fast and you are going to stay with Azure, Azure CNI is the easier one to implement. But it does come at the cost of having to plan ahead. Now, your application settings. At a certain point in time, we experienced scaling issues with our platform. Around midnight, all of our pods were getting started, but only a couple were getting run and a lot were having delays in starting. After investigating, we found out that our worker pods creation batch size was set at one. That means that in every cycle of your scheduler, you can only create one pod for your DAG or job to run, which works fine if you only have one or maybe a couple of users. But with over 200 users to support and more than 1000 DAGs, that's not enough. We went with 20 at this point in time, but we might increase it in the future to 50 or even 100. The same goes for the max DAG runs to create per loop. This allows creating multiple DAG runs in one scheduler loop, and was set to 20 by default due to the Celery executor. We decided to go with the Kubernetes executor because of the scalability from Kubernetes.
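To illustrate why a batch size of one becomes a bottleneck, here is a small Python sketch. It assumes a hypothetical backlog of 1000 pods waiting to be created at midnight (roughly one per DAG) and counts how many scheduler cycles are needed to create them all at each batch size discussed above. It ignores every other limit in the scheduler, so treat it as an illustration of the ratio rather than a prediction. The function name `scheduler_loops_needed` is made up for this example.

```python
import math


def scheduler_loops_needed(pending_pods: int, batch_size: int) -> int:
    """Scheduler cycles needed to create all pending worker pods,
    if each cycle creates at most `batch_size` pods."""
    return math.ceil(pending_pods / batch_size)


PENDING_PODS = 1000  # hypothetical midnight backlog

for batch_size in (1, 20, 50, 100):
    print(batch_size, scheduler_loops_needed(PENDING_PODS, batch_size))
# prints:
# 1 1000
# 20 50
# 50 20
# 100 10
```

Going from 1 to 20 cuts the number of cycles by a factor of 20, which matches the experience of pods being started much sooner after the change.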

So try to keep those two in sync. If you have a higher number for worker pods creation batch size, also increase your max DAG runs to create per loop variable with it. Otherwise, you will still have a bottleneck in your settings. The last thing I really advise you to change is your parallelism. By default, this is set to 32, which allows 32 DAGs or jobs running at the same time. Now for our environment, this was really a bottleneck. And as of yesterday, I upgraded this to 512 parallel jobs at the same time. Which brings me to our next point. Last week, we were trying to upgrade our cluster to Airflow 2.3.0. We were really looking forward to it, because we wanted to use a Kubernetes version higher than 1.21 due to ----------.

We also had another issue, where Airflow didn't support the latest Kubernetes version for AKS, and the version that Airflow supported was getting sunset. We thought this would be an easy upgrade, like all the other upgrades were. And we were really wrong about that. We had two days of downtime on our development environment, due to failed migrations that would normally go right. We had to do multiple rollbacks of our cluster to get the database back into a state that we knew, and then we had to do manual migrations, which would normally have been automatic. During the migrations we found out that we had some null records in tables that should not have null records. So we had to drop those. After all of that, the upgrade went through successfully. But we found out that we had a huge performance impact on our web server: it was taking more than two minutes to load the main page of DAGs.

And that's ridiculous. Normally, it would only take two or three seconds for us to load our main page with more than 1000 DAGs. After a lot of digging around and debugging, we finally found out that the webserver exposure check introduced in Airflow 2.3.0 does a select all on our logs table. Our logs table was about 21 gigabytes on our DEV environment, because it contains all the logs from the web server. So all the requests that come in, all the logs from the DAGs, but also those from the scheduler and the triggerer. So it's really big. To be honest, I think it's still really nice that the database managed to keep up and do a select all from our logs table within two minutes. But as you can see, we have I/O spiking at 100% most of the time, and our CPU percentage was also right around that. Now, if you are having issues with a non-performing web server, please make sure that you disable this exposure check. Even better, if you know that your environment is exposed, disable it already. And if you are in an intranet where you do not expect it to be exposed, disable it as well, because you are contained within your environment.

Now, what are our key takeaways? Don't autoscale your database and connection pool nodes. It can cause you real pain to debug this issue, because your logs are not getting through to your web server either, and everything is failing. Regarding your database capacity, go with a PaaS database to help your development team get up to speed and let them focus on the things they want to focus on. Use a connection pooler no matter the size of your cluster, because you can save a lot of money with it. Plan and design your network ahead and make sure that you can accommodate at least 200,000 pods running at the same time. It might be overkill for now, but you will thank me later on, because enlarging subnets is not fun.

Next, update your application settings, and try to keep up with the latest changes that Apache Airflow makes for you. There are a lot of settings that keep getting updated every release, and more are added every time. And last but not least, if you are having issues with your database server or with your web server, try disabling the exposure check, because it's really hard on your database. Next to that, I would really like to thank my colleagues Martijn, Jaminu and Krijn, who helped make this all possible. And of course Rabobank, the Netherlands, who allow me to talk about our failures instead of our successes. Well, who am I? I'm Glenn Schuurman. I'm a senior platform engineer from Quintor. I started as an application developer early in my career, but I quickly came back to being a platform engineer. And I firmly believe that using experiments to test a technology is the way to learn, because you can only tell if you know what you're doing if you experiment with it. Now, thanks for your attention. And I think we can go to the Q&A session.

Ardeshir: Great presentation, Glenn, thank you so much for that. So for the data teams that are watching your session, Glenn, a couple of questions here. What would you consider was the most costly mistake to avoid? This can be in the context of solutions or technical debt or delays in implementation when they're trying to configure or reconfigure a data pipeline. And what should they do instead? I guess pick the biggest one or two or three?

Glenn: Yeah, I think the IP address subnet would be our biggest mistake, because it required a full redeployment of our cluster. Within Azure, you are not allowed to resize a virtual network. So if you specify your virtual network wrong the first time, you cannot increase it any further than what you already have, which in our case was 512 addresses.

Ardeshir: Wow, I see. Okay, that's great. And another question here: surely you're using a DevOps framework in your environment. Were you able to leverage that in any way to identify and pinpoint the issues around the implementation?

Glenn: A bit. It's really hard to test the Airflow environment, because you have to have a running instance, and you have to get your authorization up and running. We wrote a custom authorization with Azure Active Directory to make single sign-on possible. And that makes it really hard to authenticate from any pipelines to test. But we do test our Docker images beforehand, to make sure that a lot of things are working. But of course, you can never guarantee that everything's working.

Ardeshir: Sure, yeah. That's understandable. Are there other scenarios or potential workarounds that you can conceptually think of, that folks can think about, that might be a good fit to get around the issue?

Glenn: What issue are you referring to?

Ardeshir: I was talking about testing of the pipeline.

Glenn: Yeah, I think it's always good to have multiple environments. Not only the environments that are user facing, but also environments that you test on yourself. In some cases, you can try and do some trick based things. But it might be harder to implement that.

Ardeshir: I gotcha. And, you know, it's such a complex project that you spelled out here. So how much human capital did it take to complete this? Is there anything that you would do differently in terms of managing people resources?

Glenn: Um, I'm not really sure. I think we've learned a lot about our database. I think that mistake was mandatory for us to make. By doing that, we found out that the autoscaling was creating some issues for us. But if I had to do anything over, yeah, it would be the IP addresses, which we could have learned about beforehand, or that database in the cluster, because that really had some user impact, with a lot of users complaining that logs were not there, pipelines were not being scheduled correctly, and things were not getting run at the correct time.

Ardeshir: Yep. Okay. And then I really liked the note that you mentioned about leveraging community for support. I know you guys have done that throughout this process and through this project. How was that able to help you the most? Because we have an amazing community of Incorta users that are sharing and exchanging ideas and solutions at community.incorta.com. I wanted to know, what were some of the things that you did, using the community around you, to be able to solve some of these issues?

Glenn: You know, what we did in our environment is we created a channel for users to communicate with us, to report issues and to ask for any support that they might require, be it some frequently asked questions or something like that. We also created an internal Stack Overflow-like site, and we answered the questions on that. We're really trying to make sure that people are able to ask any question that they want, and we moderate and answer those questions. And at a certain point in time, your users will start to answer the questions themselves. We've seen that happen over the past week, a couple of times now. And it was really satisfying to watch, seeing that our community is growing, really maturing, and seeing that they can help themselves.

Ardeshir: Yeah, that is definitely always very exciting to see. And it's great to have that kind of a community environment, to be able to rely on your peers and other technical experts to help you through some of the challenges that you're facing. Glenn, thank you so much for being here today at Zero Gravity. I hope you've really enjoyed the event so far today. We don't have any other questions, so we're going to start wrapping it up here. Glenn Schuurman, senior platform engineer at Quintor. He told us today about how you can scale up Apache Airflow to enterprise level and what mistakes you should avoid along the way. Folks, that is our last live session of track one on use cases. Please stay tuned as we go towards finalizing and wrapping up this incredible day at Zero Gravity by Incorta. We look forward to seeing you at future events. Please move on to the next item on your Zero Gravity agenda. We should be wrapping things up just about the bottom of the hour, so we'll see you there. Thank you again, and hope you're having a great, great session. Take care.

Scaling Up Apache Airflow to Enterprise Level – And How to Avoid Mistakes

Speaker:

Glenn Schuurman