

Streaming of GZIP data: Solving one of the basic shortcomings in Azure Blob storage

by Charush S Nair on Mon Sep 9

You may already know this: parallel processing is very important in data analytics. It improves the performance of an analytics engine by processing several data streams in parallel, thereby reducing processing time severalfold.

Architecting software with optimized parallel processing is the kind of work I do all the time, but mostly on AWS (Amazon Web Services) infrastructure. In a majority of AI and Big Data projects, we have to work with GZIP data, a compressed version of very large files (typically in terabytes). In AWS, there is an adapter to stream GZIP data that allows for parallel data processing and data download. But not so long ago, I came across a new challenge: to stream GZIP data from Azure Blob storage.

We’ve got a situation here!

The task at hand was to build an analytics engine for a given data set. The data was available in GZIP format and, unfortunately, there was no option to stream GZIP data from Azure Blob storage. As a result, the program had to wait until the entire file was downloaded before data processing could start. This caused a huge delay in completing the data processing, anywhere from several hours to a couple of days. No one is willing to wait that long, certainly not me. That made it a situation worth exploring and neutralizing.

When it comes to IaaS, a majority of enterprises prefer Azure over AWS despite the fact that AWS provides more services. One of the reasons is that Azure allows the utilization of the same tried and trusted technologies that many enterprises have used in the past and still use today, including Windows and Linux, Active Directory, and virtual machines. Moreover, Azure is designed around the Security Development Lifecycle (SDL), a leading industry assurance process that puts security at its core, so private data and services stay secure and protected on the Azure cloud. Azure's compatibility with the .NET programming language is one of its most useful benefits, giving Microsoft a clear upper hand over AWS and the rest of the competition.

The Shortcoming of Azure

A fair share of AI and Big Data projects involve processes such as sentiment analysis of text data like chat logs, user reviews, and feedback. Usually, each project has around 1 TB of GZIP data to be analyzed before deriving useful insights.

In AWS S3, thanks to RaRe Technologies, there is an adapter to stream GZIP data that allows for parallel data processing and data download. Using their smart_open adapter, we have the option to process bulk data in chunks on AWS. The process does not have to wait until the entire file is downloaded: smart_open can work asynchronously, processing chunks of 10 MB while the rest of the file is being downloaded simultaneously. This freedom to do parallel processing saves a lot of time. For Azure Blob, however, no such adapter is available, and time is wasted every time data processing happens.

The Solution

Waiting for the data download to complete is not an option. So I built a Python adapter to stream GZIP data from Azure Blob storage. Using get_blob_to_stream, the default functionality for streaming data in the Azure cloud, we can stream data from Blob storage.

The high-level view of the solution goes like this: first, we define start and stop indexes, setting the start index to 0. We then initiate a process that streams the data in chunks whose size is defined by the indexes. The process runs until the file size remaining in Blob storage is less than the chunk size defined by the indexes.
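The index logic above can be sketched in plain Python. The names here (`byte_ranges`, `CHUNK_SIZE`) are illustrative, not the adapter's actual API; in real use, each pair would become the start and end range arguments of a ranged get_blob_to_stream call:

```python
CHUNK_SIZE = 10 * 1024 * 1024  # 10 MB, matching the smart_open comparison above

def byte_ranges(blob_size, chunk_size=CHUNK_SIZE):
    """Yield (start, stop) byte-index pairs covering a blob of blob_size bytes.

    The last pair is shorter when the remaining data is less than chunk_size,
    which is the stopping condition described above.
    """
    start = 0
    while start < blob_size:
        stop = min(start + chunk_size, blob_size)
        yield start, stop
        start = stop

# A 25 MB blob splits into two full 10 MB ranges plus a final 5 MB range.
ranges = list(byte_ranges(25 * 1024 * 1024))
```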

Once the blob data is streamed as chunks, Python's zlib can be used to decompress each GZIP chunk to fetch the data and process it. While this happens, the blob is still being streamed in parallel. Essentially, the adapter makes it possible to stream the GZIP data in parallel while processing it.
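Here is a minimal, self-contained sketch of that decompression step, simulating the blob with an in-memory GZIP buffer; the slicing stands in for the ranged downloads, and `zlib.decompressobj` with `wbits=zlib.MAX_WBITS | 16` tells zlib to expect a GZIP header:

```python
import gzip
import zlib

# Simulated blob: in production these bytes would live in Azure Blob storage.
payload = b"line of data\n" * 10000
blob = gzip.compress(payload)

CHUNK_SIZE = 64 * 1024  # small chunks for the demo; the article uses ~10 MB

# Incremental GZIP decompressor: wbits=MAX_WBITS | 16 selects the GZIP format.
decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)

out = bytearray()
for start in range(0, len(blob), CHUNK_SIZE):
    chunk = blob[start:start + CHUNK_SIZE]   # stand-in for one ranged download
    out.extend(decomp.decompress(chunk))     # process this piece immediately
out.extend(decomp.flush())

assert bytes(out) == payload  # the streamed chunks reassemble the original data
```

Because the decompressor keeps its own state between calls, each chunk can be processed the moment it arrives, which is what lets download and processing overlap.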

With this adapter for Azure Blob storage, the process that initially took around 10 hours can now be completed in less than an hour: 10x better performance! If you'd like to use this adapter, you can fork the code from our GitHub here.


Author

Charush S Nair

Charush is a technologist and AI evangelist who specializes in NLP and AI algorithms. He heads HPC at Accubits Technologies and is currently focusing on state-of-the-art NLP algorithms using GAN networks. He is an active speaker, has conducted several talk sessions on AI and HPC, and heads several developer and enthusiast communities around the world.


All Rights Reserved. Accubits INC 2020