In file systems, large sequential writes perform better than small random writes, which is why many storage systems implement a log-structured file system. In the same way, the cloud favors large objects over small objects. Cloud providers place throttling limits on PUTs and GETs, so it takes significantly longer to upload a batch of small objects than a single large object of the aggregate size. Moreover, there are per-PUT costs associated with uploading smaller objects.
At Netflix, a lot of media assets and their related metadata are generated and pushed to the cloud. Most of these files range from tens of bytes to tens of kilobytes and are saved as small objects in the cloud.
This talk proposes a strategy to compact these small objects into larger blobs before uploading them to the cloud. We discuss the policies for selecting relevant smaller objects and how to manage the indexing of these objects within the blob. We also discuss how different cloud storage operations, such as reads and deletes, would be implemented for such objects. This includes recycling blobs that contain dead small objects resulting from overwrites and deletes.
Finally, Tejas showcases the potential impact of such a strategy on Netflix assets in terms of cost and performance.
Ashleigh: Up next, we have Tejas Chopra, who joins us for his talk, Object Compaction in Cloud for High Yield. Tejas is a senior software engineer on the data storage platform team at Netflix, where he is responsible for architecting storage solutions to support Netflix studios and Netflix's streaming platform. Tejas, we really appreciate the opportunity to hear from you today. The floor is yours.
Tejas: Thank you so much, Ashleigh. And hi, everyone. It's an honor to be here at Zero Gravity today, and I'm very excited to share my learnings with you. Today, I will be discussing compacting small objects in the cloud, and how that leads to an improvement in how data is stored in and retrieved from the cloud. The agenda for the talk is as follows. I'll go over a brief introduction about myself. We'll go over the problem statement and an overview of what we're discussing. We'll discuss the term compaction, what it means, and what data structures have to be designed to do it. Then we'll cover how typical file system and cloud operations, such as reads, writes, and deletes, work in this scenario, and how we have ensured fault tolerance. And finally, what was the impact of the changes that we made. So I'm a senior software engineer at Netflix, my name is Tejas, and I am also a keynote speaker; I talk on distributed systems, cloud, and blockchain.
At Netflix, my team is responsible for building infrastructure solutions to support building the Netflix studios in the cloud, so that artists can collaborate from any part of the world on a shared movie asset. Prior to Netflix, I have experience working at companies such as Box, Apple, Samsung, Cadence, and Datrium. When you think about the cloud workload, the nature of cloud file systems is that they provide a file system interface to objects. When you think about the files and folders that you work on locally, and about the cloud with its object stores, you would imagine that the simplest way to map files and folders to objects is to store each file as an object in the cloud. However, that may actually be detrimental to the performance you get when you move to the cloud, because when you look at the workloads that run in the cloud, their nature is that many, many small files are created in bursts. Small files are not very beneficial for cloud storage, and there are multiple reasons for that.
The first reason is the request-level costs. If you are adding a file to the cloud, or inserting an object into the cloud, there are costs associated with sending the request as well as with retrieving it. Imagine uploading a one-gigabyte file with four-kilobyte PUTs: the cost of actually doing this operation is 57 times the cost of storing the file in the cloud for a month. The other important aspect to note is throttling. Amazon S3's best practices guide says that if an application issues more than 300 PUTs, DELETEs, or LISTs per second, or more than 800 GET requests per second, S3 may rate limit it. Throttling is sometimes so aggressive that a single four-megabyte PUT is 1,000 times faster than 1,000 four-kilobyte PUTs. This graph compares object size against throughput, and it shows the average throughput rising as the object size increases.
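The 57x figure can be checked with simple arithmetic, using the approximate S3 prices quoted later in the talk ($0.000005 per PUT, $0.023 per gigabyte-month; exact prices vary by region and tier):

```python
# Cost of uploading 1 GiB as 4 KiB PUTs vs. storing it for one month.
# Prices are the approximate S3 figures quoted in this talk.
PUT_COST = 0.000005        # dollars per PUT request
STORAGE_COST = 0.023       # dollars per GiB-month

num_puts = (1024**3) // (4 * 1024)       # 262,144 requests
request_cost = num_puts * PUT_COST       # ~$1.31 just in request fees
ratio = request_cost / STORAGE_COST      # ~57x the monthly storage cost

print(f"{num_puts} PUTs cost ${request_cost:.2f}, "
      f"{ratio:.0f}x the monthly storage cost")
```

Storing the same gigabyte costs about $0.023 for the month, so the upload requests alone dwarf the storage bill.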
Now, if you think about alternatives to the scheme I mentioned, where a file is mapped to an object in S3, you can also use the file systems that exist in the cloud today. For example, you can use EFS: it's a generic file system, and it charges only for storage, not for making requests. However, it charges around $0.30/gigabyte/month. You can also explore storing your data in DynamoDB, a NoSQL database, for kilobyte-sized objects. The cost model is very different there: it has a storage cost of $0.25/gigabyte/month, but it also charges $0.47 per write/second/month and $0.09 per read/second/month. Compare that with S3, where your storage cost is $0.023/gigabyte/month and your PUT cost is 0.0005 cents/PUT. So now let's compare and contrast these options.
To store one terabyte's worth of four-kilobyte files in the cloud, with your application performing 100 writes/second, using S3 as is, without any layer on top of it, will cost you $1,366 per month. The interesting point to note here is that the operation cost, the cost of actually sending the data, is 98% of that, and it would take five years of storage to match the operation cost. In comparison, if you use the Elastic File System, EFS, your cost is $307, and your operation cost is actually 0%, because EFS does not charge for requests. For DynamoDB, it's $303 per month, and your operation cost is 16%, because you're still charged for writes and reads per month; you would need five days of storage to match the operation cost. With our scheme on S3, it will actually cost you just $24. What's interesting to note is that the operation cost will be about 0.001% of the overall cost, and it will take just 27 seconds of storage to match the operation cost. So what exactly is compaction, you may ask? It's nothing but aggregating the data and then making single large writes. It's a very similar scheme to what has traditionally been applied to file systems; it's something that has been there forever. We just tried to mold it appropriately so that it can work with the cloud. You can think of compaction as a pluggable, client-side packing and indexing module. The module will pack your data, your small files, into gigabyte-sized blobs, and will maintain an index of where each file is within a blob. What we also wanted to ensure is that doing this does not compromise our fault tolerance and redundancy.
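The direct-vs-packed comparison can be sketched with a back-of-the-envelope model; the talk's exact dollar figures depend on the month length, price tier, and request mix assumed, but the shape of the result comes out the same:

```python
# Roughly one month of 100 x 4 KiB writes/second: about 1 TiB of data.
# Prices are the approximate S3 figures quoted in this talk.
PUT_COST = 0.000005            # dollars per S3 PUT
STORAGE_COST = 0.023           # dollars per GiB-month
SECONDS_PER_MONTH = 30 * 24 * 3600

writes = 100 * SECONDS_PER_MONTH              # 4 KiB files written
data_gib = writes * 4 * 1024 / 1024**3        # ~989 GiB stored
storage = data_gib * STORAGE_COST

direct_ops = writes * PUT_COST                # one PUT per small file
packed_ops = (data_gib / 1) * PUT_COST        # one PUT per 1 GiB blob

print(f"direct: ${storage + direct_ops:,.0f} "
      f"({direct_ops / (storage + direct_ops):.0%} operations)")
print(f"packed: ${storage + packed_ops:,.2f} "
      f"({packed_ops / (storage + packed_ops):.2%} operations)")
```

Direct lands in the $1,300+ range with operations around 98% of the bill, while packed lands near $24 with operations a tiny fraction of a percent, matching the comparison above.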
So along with packing the data into a blob, we also pack some metadata into the blob. That metadata is at a fixed location within the blob, and it contains the index of all the files that have been packed into that blob. Along with that, we also maintain a global index, a global database of which files are in which blobs at what index. These two together ensure that we have coverage when it comes to fault tolerance as well. Now, when we think about the data structures used to design such a system, the first important concept is the blob. A blob is a single, immutable, large-sized object in the cloud; data from multiple files can be packed into one blob. And each blob, like I said, will contain a footer. The footer is the metadata for all the files contained in that blob: it holds the byte ranges of each file within the blob. So in some sense, if you read all the footers of all the packed blobs, you will get all the files that are present in the cloud. We could have used SSTables, but they require sorting of key-value pairs before persisting. The second important concept is the global table that I mentioned, which contains all the files. So we have a blob descriptor table, which is an optimization over reading all the blob footers to recreate the world. It's a global database containing information about all the files and all the blobs. It is redundant in nature, because you can recreate it. The consistency of this BDT is a challenge, because blob creation is very distributed: you can have multiple clients collecting files and putting them into blobs, and so you need to update the blob descriptor table in the right order. Maintaining consistency there was always a challenge for us in the implementation.
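A minimal sketch of the two structures described here; all names and shapes are illustrative, not Netflix's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Extent:
    """Byte range of one packed file within a blob."""
    offset: int
    length: int

@dataclass
class BlobFooter:
    """Embedded index at a fixed location in each immutable blob."""
    files: dict  # file path -> Extent

@dataclass
class BlobDescriptorTable:
    """Global index; redundant, because it can be rebuilt from footers."""
    locations: dict = field(default_factory=dict)  # path -> (blob_id, Extent)

    def rebuild(self, blobs):
        """Recreate the world by reading every blob's footer."""
        self.locations.clear()
        for blob_id, footer in blobs.items():
            for path, extent in footer.files.items():
                self.locations[path] = (blob_id, extent)
```

The BDT exists purely as an optimization over scanning all footers, which is what makes it safe to treat as rebuildable state.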
Now, let us discuss the next important part of this scheme: conceptually, packing the data and pushing it. What you need to do is accumulate a sufficient amount of data so that you can add it to a blob and then push it to the cloud. This requires some form of buffering. Data is buffered on the client node, and until it is persisted in the cloud, it is not considered durable. That is one caveat of this scheme: unlike other schemes, where a file sent to the cloud is durable right away, here you are buffering the file on a client machine before pushing it to the cloud. The other important aspect to consider is how you pack the data, that is, which files you pack together. The packing policy needs to be specified at mount time. Once the data is packed, the embedded indices within the blob are updated. And when we put the blob to the cloud, we name it in an interesting way: the object ID for the blob is the client IP (the IP of the machine on which the data was buffered and from which the blob is pushed), the offset of the footer within the blob, and the date.
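The packing step and the naming scheme can be sketched like this; the field order and separators in the object ID are assumptions, since the talk only says the ID combines client IP, footer offset, and date:

```python
import time

def make_blob_id(client_ip: str, footer_offset: int, ts: float) -> str:
    """Blob object ID encodes where the footer is and when it was written."""
    date = time.strftime("%Y%m%dT%H%M%S", time.gmtime(ts))
    return f"{client_ip}_{footer_offset}_{date}"

def pack(files: dict) -> tuple:
    """Concatenate buffered files into one blob body; return body + index."""
    body, index, offset = bytearray(), {}, 0
    for path, data in files.items():
        body.extend(data)
        index[path] = (offset, len(data))   # byte range for the footer
        offset += len(data)
    return bytes(body), index

blob, index = pack({"a.txt": b"hello", "b.txt": b"world"})
blob_id = make_blob_id("10.0.0.5", len(blob), 0)
# index: {'a.txt': (0, 5), 'b.txt': (5, 5)}
```

In a real blob, the serialized index would itself be appended as the footer, which is why the footer offset is known at naming time.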
This is very important, because reading just the blob name will tell you exactly where the footer lies in the blob and will also tell you the time of the update. So if the same file appears in multiple blobs, but one of the blobs has a more recent date, that means the file was overwritten, and you will pick that information when recreating the world. The global blob descriptor table will always maintain the updated location of a file. While the file is still being buffered on the client, the global BDT will contain the client IP of where the file is present. Once you've packed and pushed the file to the cloud successfully, you update the global BDT to contain the object ID and the file's extent in the blob. That is how logically packing and pushing work. Now, if we think about the impact on operations, let's say that you want to read a file: it depends on where the data is buffered.
At that point in time, the read request will be directed to the blob descriptor table, which contains the location of the file. If the file is still being buffered on a client node, the BDT will point to that node, and you can read the data directly from the client; your read request will be forwarded to that client. But if the data has already been pushed to the cloud, you know the object ID and the extent information from your BDT, so you can forward a read request to the cloud and issue a range read against that big object. You will just read the appropriate range from that big blob stored in the cloud; there is no need to fetch the entire blob. However, there are use cases in which fetching the entire blob may be useful for subsequent read requests. You can imagine prefetching the entire blob's content so that for the next read requests, there is a high probability that they will read a range in the blob that you've prefetched, and you will then be able to serve those read requests quickly.
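The read path can be sketched as a BDT lookup followed by either a forwarded client read or a ranged read; here the in-memory dictionaries stand in for the BDT, client buffers, and an S3 ranged GET, so this is a sketch of the control flow rather than real storage calls:

```python
def read_file(path, bdt, client_buffers, cloud):
    """Resolve a file via the BDT, then read from client buffer or cloud."""
    location = bdt[path]
    if location["state"] == "buffered":
        # Still on a client node: forward the read there.
        return client_buffers[location["client_ip"]][path]
    # Already packed: ranged read of just the file's extent, not the blob.
    blob = cloud[location["blob_id"]]
    offset, length = location["extent"]
    return blob[offset:offset + length]

bdt = {"a.txt": {"state": "packed", "blob_id": "b1", "extent": (0, 5)}}
cloud = {"b1": b"helloworld"}
assert read_file("a.txt", bdt, {}, cloud) == b"hello"
```

Against real S3, the slice would become a GET with a `Range: bytes=offset-(offset+length-1)` header, and a prefetch variant would simply fetch and cache the whole blob instead.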
Deletes and renames complicate our scheme a bit. Deleting a file is simple in the existing model, where the delete is issued directly to the cloud and you can delete the object there. But in our case, because we are packing files into a blob, a delete has to place a tombstone of sorts within the blob, and over time there will be multiple sections within a blob that have been deleted. The first thing you do when a delete is issued is remove the file from the global table, to prevent requests to the file. Periodically, you also check the liveness of a blob: if the blob has crossed a threshold of deleted extents within it, you want to repack it. So you read the remaining live entries from the blob, pack them together into a new blob, and update your blob descriptor table. That is how you can handle deletes. Renames, again, atomically update a file name in the blob descriptor table; what we also do is store the context of a rename and piggyback it on future blobs' embedded indices.
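The delete-and-repack flow described above can be sketched as follows; the liveness threshold and data shapes are assumptions for illustration:

```python
def delete_file(path, bdt, dead_extents):
    """Drop the file from the global table; tombstone its extent."""
    blob_id, extent = bdt.pop(path)
    dead_extents.setdefault(blob_id, []).append(extent)

def maybe_repack(blob_id, blob, footer, dead_extents, threshold=0.5):
    """If too much of a blob is dead, collect the live extents so they
    can be rewritten into a new blob (and the BDT updated)."""
    dead = sum(length for _, length in dead_extents.get(blob_id, []))
    if dead / len(blob) < threshold:
        return None  # blob is still mostly live; leave it alone
    return {path: blob[off:off + length]
            for path, (off, length) in footer.items()
            if (off, length) not in dead_extents.get(blob_id, [])}
```

Because blobs are immutable, the repack writes a brand-new blob and the old one is garbage once no live extent points at it.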
So the next blob that is written will hold the rename information in its embedded index. When we have to recover from the embedded indices (which is why we maintain them), blobs will be read in the order of creation. This is the reason we have the date as a field in our object ID. So blobs will be read in the order in which they were created, and the latest blobs will always contain the rename information, which is applied on top of the older blobs' information. That is how you can handle renames. All of this maintaining of indices within a blob, as well as a global table, is for fault tolerance: we want to ensure that at every point in time, you can always find your file in the blob at the appropriate location. In case the master dies, we periodically back up our blob descriptor table to the cloud. On a crash, we read the last checkpointed blob descriptor table, and since we know its timestamp, we read all the blobs written after that time to recreate the world. Now, let's say one of the clients where files are being buffered is lost.
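The recovery replay can be sketched as applying footers in creation order, so later blobs (including their piggybacked rename records) win over earlier ones; the tuple layout here is an illustrative assumption:

```python
def recover(checkpoint, blobs_after):
    """Rebuild the BDT from a checkpoint plus blobs written after it.

    blobs_after: iterable of (date, blob_id, footer, renames);
    sorting by date gives creation order, as encoded in the object ID.
    """
    bdt = dict(checkpoint)
    for date, blob_id, footer, renames in sorted(blobs_after):
        for path, extent in footer.items():
            bdt[path] = (blob_id, extent)      # newer blob wins
        for old, new in renames:               # piggybacked rename records
            if old in bdt:
                bdt[new] = bdt.pop(old)
    return bdt
```

This is why the date lives in the blob name: it makes the replay order recoverable from the object listing alone, without any extra metadata store.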
The assumption we make is that these files are not durable until they are pushed to the cloud. The files can still be recovered if they were persisted on local disk, but we make no such guarantees. If the files were packed and pushed, your BDT will contain all the information needed for recovery. If the files were packed and pushed but your BDT was not updated, then your embedded indices will contain the information for recovery. Now, when we developed and applied this, we saw these results. In terms of price alone, we saw a 25,000x reduction in the cost of storing these small files in the cloud. And for the throughput comparison, we wrote 24 gigabytes' worth of 8-kilobyte files to Amazon S3 and saw that packed gave us a 60x improvement in throughput over direct. So, in conclusion, cloud file systems should aggressively pack files into cloud objects, because the nature of cloud workloads is a bunch of small files created in bursts.
When you explore cloud object stores, there are significant per-operation costs, and there are rate limits as well when using direct S3, for example. But using our scheme of packing, compacting, and pushing the objects to the cloud, you can actually increase your throughput by 60x and reduce your costs by 25,000x. A very good example of where this can be useful is backups: backups of smaller files can be archived using this strategy, and it will improve both backup times and costs. Another important aspect to note here is that the packing policies need to be intelligent. You can have a packing policy that packs your objects by create time, so that files created at the same time get packed into the same blob, or so that files belonging to the same application get packed together. And you could apply learning to choose the objects that you want to pack together. In general, objects that have the same read profiles, the objects that live and die together, can benefit from being packed together. And prefetching the entire blob can help with reducing your read latencies as well. That's my time. Thank you so much for joining me today, and please feel free to reach out to me if you have more questions. I'm open to discussing some questions right now. Thank you.
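One way to sketch such a packing policy; the grouping keys (application and a create-time bucket) are assumptions standing in for whatever signals a real policy would use:

```python
from collections import defaultdict

def group_for_packing(files, bucket_seconds=300):
    """Group files by application and create-time bucket, so files that
    are likely to live and die together land in the same blob."""
    groups = defaultdict(list)
    for path, app, ctime in files:
        groups[(app, int(ctime // bucket_seconds))].append(path)
    return dict(groups)

files = [("a", "appA", 10), ("b", "appA", 20), ("c", "appB", 15)]
groups = group_for_packing(files)
# {('appA', 0): ['a', 'b'], ('appB', 0): ['c']}
```

A learned policy would replace the grouping key with a predicted read-profile or lifetime cluster, but the blob-building machinery stays the same.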
Ashleigh: Awesome, thank you so much, Tejas. That was a great presentation. We do have a few questions coming up here in the Q&A panel. Question number one: is there a milestone to hit in the number of files when you should start to think about compacting and moving files to the cloud?
Tejas: That's a good question. We did not have any such milestones, because a lot of the files that we were uploading to the cloud were actually backups of database entries. We realized that there were tens of thousands of entries that could be packed into a blob before we chose to send it to the cloud. Also, the number of files was not a criterion for us; it was always the size of the blob. But I imagine that as you have many, many files, your footer size in a given blob will also increase; we haven't done any study so far on whether there is a maximum limit to that.
Ashleigh: Thank you, Tejas. And as a reminder, please drop your questions in the Q&A panel; we have a couple here to go. So, Tejas, do your costs double with high availability and redundancy?
Tejas: That's a good question. We actually have not seen the costs double. When we store the data in the database as well as in S3, the way we are doing it, the cost benefits of S3 overshadow whatever extra cost we are paying for storing it in the database. So we more than make up for that extra cost. And as you saw in the initial motivation, we can save more than 90% of the cost compared to just putting the data as is on S3, and those are request-level costs, so we do not see an increase there.
Ashleigh: Thank you. Next question: what level of people resource investment does it take to implement this type of file management strategy?
Tejas: In terms of people resources, meaning how many developers were needed, it took two developers to develop this. And it's an ongoing thing, because we want to use it not just for backups but also for other use cases that can benefit from it. The one thing to note is that we store hundreds of petabytes, even exabytes, of data in the cloud, so the savings we get are in the multiple millions of dollars. It's worth the investment to explore these strategies for an organization that has petabytes of data in the cloud.
Ashleigh: Our next question is, where can people find more information on this approach to file management?
Tejas: We have not yet openly shared our source code, because a lot of it is proprietary. But I imagine that it's not very complicated to implement this using C++ or other languages. We may explore open sourcing it in the future, but at this point there are no such plans.
Ashleigh: Thank you, Tejas. Next question here: where, or when, in your data management strategy do you start to think about compacting and moving files to the cloud?
Tejas: When we looked at our workloads, we realized that we have a lot of huge files, when you think about movies, and an even larger number of smaller files, when you think about the metadata for those movies. And we realized that the storage cost of these small files is very, very low compared to the request-level costs: we were just storing a bunch of them, and the request costs were overshadowing the storage costs. That's when we decided we needed to evaluate the impact of such a strategy. And as was highlighted earlier, it's not a durable store until the data makes it to the cloud; because it's a backup, because it's considered secondary storage, we are okay taking that bet. That is what informed our initial design choices and our strategy.
Ashleigh: Got it. Thank you, Tejas. Another question just popped in: any future plans for extending it to open source?
Tejas: No plans at this point, but that is something we can definitely explore internally. Like I said, we have a lot of proprietary stuff inside this, so we may have to abstract it out to allow people to build on top of it as well. One of the goals of open source is for developers in the community to add more value, so we need the right abstractions in place for developers to feel that this will benefit them and their organizations. Also, this is something that helps only at scale; if you have smaller amounts of data, this may not give you that much benefit. All of that is very important before we make the decision to open source it.
Ashleigh: Of course, thank you so much.