You may already know this – parallel processing is very important in data analytics. It increases the performance of an analytics engine by processing several data streams in parallel, thereby reducing processing time severalfold.
Architecting software with optimized parallel processing is the kind of work I do all the time, but mostly on AWS (Amazon Web Services) infrastructure. In a majority of AI and Big Data projects, we have to work with GZIP data – compressed versions of very large files (typically in terabytes). In AWS, there is an adapter to stream GZIP data that allows for parallel data processing and download. But not so long ago, I came across a new challenge – streaming GZIP data from Azure Blob Storage.
The task at hand was to build an analytics engine for a given data set. The data was available in GZIP format and, unfortunately, there was no option to stream GZIP data from Azure Blob Storage. As a result, the program had to wait until the entire file was downloaded before data processing could begin. This caused a huge delay in completing the data processing – anywhere from several hours to a couple of days. Well, no one is willing to wait that long, certainly not me. That's a situation worth exploring and neutralizing.
When it comes to IaaS, a majority of enterprises prefer Azure over AWS, despite the fact that AWS provides more services. One reason is that Azure lets enterprises keep using the tried and trusted technologies they have relied on in the past and still use today, including Windows and Linux, Active Directory, and virtual machines. Moreover, Azure is designed around the Security Development Lifecycle (SDL), a leading industry assurance process that puts security at its core. Private data and services stay secured and protected while they are on the Azure cloud. Azure's compatibility with the .NET framework is one of its most useful benefits, giving Microsoft a clear upper hand over AWS and the rest of the competition.
A fair share of AI and Big Data projects involve processes such as sentiment analysis of text data – chat logs, user reviews, feedback, and so on. Usually, each project has around 1 TB of GZIP data to be analyzed before useful insights can be derived.
In AWS S3, thanks to RaRe Technologies, there is an adapter to stream GZIP data that allows for parallel data processing and download. Using their smart_open adapter, we have the option to process bulk data in chunks on AWS. Here the process does not have to wait until the entire file is downloaded: smart_open can work asynchronously, processing chunks of 10 MB while the rest of the file is being downloaded simultaneously. This freedom to process in parallel saves a lot of time. For Azure Blob Storage, however, no such adapter is available, and time is wasted every time data processing happens.
Waiting for the data download to complete is not an option. So, I built a Python adapter to stream GZIP data from Azure Blob Storage. It is built on get_blob_to_stream, the default functionality for streaming blob data in the Azure cloud.
The high-level view of the solution goes like this – first, we define start and stop indexes, with the start index initialized to 0. Then we initiate a process that streams one chunk of the file, its size defined by the two indexes. The process repeats, advancing the indexes, until the remaining data in Blob storage is smaller than the chunk size – that is, until the end of the blob is reached.
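The loop above can be sketched as follows. This is a minimal illustration of the control flow, not the actual adapter code: the `fetch_range` callable is a hypothetical stand-in for a ranged blob download (with the real Azure SDK it would wrap `get_blob_to_stream` with its `start_range`/`end_range` parameters), and the names and chunk size are illustrative.

```python
# Minimal sketch of the chunked-streaming loop. fetch_range(start, stop) is a
# hypothetical callable that returns the bytes in [start, stop); in the real
# adapter it would wrap a ranged download from Azure Blob storage.

CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, mirroring smart_open's chunking

def stream_chunks(fetch_range, total_size, chunk_size=CHUNK_SIZE):
    """Yield the blob as successive byte ranges until the end is reached."""
    start = 0
    while start < total_size:
        stop = min(start + chunk_size, total_size)
        # Blocks only for this one range; the caller can process the chunk
        # while the next range is being fetched.
        yield fetch_range(start, stop)
        start = stop
```

Because the generator only ever waits for one range at a time, the consumer can decompress and process each chunk while the next range is in flight.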
Once the blob data is streamed as chunks, zlib in Python can be used to decompress each GZIP chunk to fetch the data and process it. While this happens, the blob is still being streamed in parallel. Essentially, the adapter makes it possible to download the GZIP data and process it at the same time.
With this adapter for Azure Blob storage, a process that initially took around 10 hours can now be completed in less than an hour – a 10x performance improvement! If you'd like to use this adapter, you can fork the code from our GitHub here.