Inevitable muscle memory for your Machine Learning Applications

by Jubin Jose on Tue Feb 18

This is the story of AquilaDB: a latent vector and document database built for data scientists and machine learning engineers. In this story, I will explain how and why I developed AquilaDB to solve one of the major problems developers face when building information retrieval systems.

A couple of years ago, I was frantically working on a project to develop an AI chatbot. The purpose of this chatbot was to engage hundreds or even thousands of website visitors simultaneously and serve them the information they were looking for. Only when I was nearing the end of the project did I realize that the chatbot I had developed wasn't very scalable under the intent-entity paradigm: it got confused once the number of intents exceeded a certain limit. Most of the AI-powered chatbots on the market today use an intent-based information retrieval mechanism, in which the application tries to identify the intent behind a query and then parses related documents and databases to deliver relevant information to the user. In an enterprise-level application, however, the volume of information at its disposal is very high. It can include documents, images, videos, and more, and in that scenario an intent-based information retrieval mechanism does not work efficiently.

Despite the disappointment that came with this newfound challenge, I pulled up my socks and started looking for a way to solve the problem. The main obstacle was finding a solution that could fetch information from a large volume of data and handle a large number of queries while keeping retrieval fast and efficient. After a good dose of sleepless nights and caffeine, I came across the concept of Neural Information Retrieval, which uses a vector space model to represent information and addresses the scalability issue. I ditched the standard boolean model and moved to a vector space model.

A vector space model is an algebraic model that represents objects such as documents, images, and videos as vectors of identifiers, for example index terms. Because the objects are represented in a numerical format, we can apply data mining techniques to them, such as information retrieval, information extraction, and information filtering.

Replacing the standard boolean model with a vector space model fixed the scalability issues. It allowed me to compute a continuous degree of similarity between queries and documents, and it made it straightforward to rank documents by their likely relevance. The partial-matching capability of the vector space model also let the application retrieve relevant information much faster. In a nutshell, using a vector space model increased the application's performance by at least 10X.
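To make the shift from boolean matching to vector similarity concrete, here is a minimal sketch (not code from the chatbot itself) that represents a few example documents and a query as TF-IDF vectors and ranks the documents by cosine similarity. The texts and the choice of scikit-learn are my own illustration.

# Sketch: documents and a query as vectors, ranked by cosine similarity.
# The example texts are made up; any vectorizer could be used instead of TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "how to reset my account password",
    "pricing plans for enterprise customers",
    "troubleshooting login and authentication errors",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)          # one vector per document

query_vector = vectorizer.transform(["I cannot log in"])   # same vector space
scores = cosine_similarity(query_vector, doc_vectors)[0]   # continuous similarity, not a boolean match

# Rank documents by similarity to the query
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.3f}  {doc}")

Unlike a boolean model, a document that only partially matches the query still receives a nonzero score, which is what makes ranking by relevance possible.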

How to access vectorized data?

Vectorizing data and using it for retrieval is clearly the best way to do it. However, to my surprise, there was no database or open source project available at the time, to my knowledge, to help a developer who wants to index vectorized data. That's when I decided to build one, and in a few months I open-sourced the first-ever tool for indexing vectorized data: AquilaDB.

The driving idea behind AquilaDB was to build something that could act as a type of muscle memory for all machine learning applications. My goal was to provide data scientists and machine learning engineers with a sustainable solution for effective Neural Information Retrieval.  It is the Redis for Machine Learning.

In a bird's-eye view, AquilaDB is a latent vector and document database. It helps developers prototype an idea in minutes and then scale it at their own pace. It facilitates efficient similarity search on dense vectors, even on collections that do not fit in RAM. AquilaDB is built on two core libraries: FAISS and PouchDB. FAISS is a library for similarity search over sets of dense vectors, regardless of their size. PouchDB is a JavaScript database modeled after CouchDB, originally designed to run in the browser. AquilaDB offers everything the Couch protocol can offer and connects natively to anything that understands the Couch protocol.
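To give a concrete feel for what FAISS does under the hood, the short sketch below indexes a set of random dense vectors and searches for the nearest neighbours of a query vector. This is plain FAISS usage with arbitrary data, not AquilaDB's internal code.

# Minimal FAISS example: index dense vectors and find nearest neighbours.
# Dimension and data are arbitrary and only for illustration.
import numpy as np
import faiss

dim = 64                                    # size of each feature vector
vectors = np.random.random((1000, dim)).astype("float32")

index = faiss.IndexFlatL2(dim)              # exact L2 (Euclidean) search
index.add(vectors)                          # index the vectors

query = np.random.random((1, dim)).astype("float32")
distances, ids = index.search(query, 5)     # 5 nearest neighbours
print(ids[0], distances[0])

FAISS also provides approximate indexes and on-disk variants for collections that are too large to hold in memory, which is what makes it suitable as AquilaDB's search backend.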

The video below explains AquilaDB in detail.

AquilaDB is extremely easy to set up, and you can start indexing your feature vectors within minutes. All you need to do is vectorize the data for your application and send it to AquilaDB along with the corresponding metadata. Once you have indexed enough data, you can send a query vector to the database and retrieve similar vectors together with their metadata. This information retrieval method gives you a versatile system for all kinds of data, including sound, video, and images.
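As a rough sketch of that index-then-query workflow, the snippet below shows the shape of the interaction. The VectorDBClient class and its add/search methods are hypothetical stand-ins I wrote purely for illustration, not AquilaDB's documented client API; consult the AquilaDB repository for the actual client calls.

# Hypothetical, in-memory stand-in for the workflow described above.
# NOT AquilaDB's real API: the class and method names are illustrative only.
import numpy as np

class VectorDBClient:
    """Stand-in for a vector database client (AquilaDB listens on port 50051)."""
    def __init__(self):
        self.vectors, self.metadata = [], []

    def add(self, vector, metadata):
        # Index a feature vector together with its metadata
        self.vectors.append(np.asarray(vector, dtype="float32"))
        self.metadata.append(metadata)

    def search(self, query, k=5):
        # Return the metadata of the k most similar vectors (smallest L2 distance)
        query = np.asarray(query, dtype="float32")
        dists = [float(np.linalg.norm(v - query)) for v in self.vectors]
        order = np.argsort(dists)[:k]
        return [(self.metadata[i], dists[i]) for i in order]

db = VectorDBClient()
db.add(np.random.random(300), {"doc_id": "faq-17", "title": "Password reset"})
db.add(np.random.random(300), {"doc_id": "faq-42", "title": "Enterprise pricing"})
print(db.search(np.random.random(300), k=1))

The key point is the division of labour: your application owns the vectorization step, while the database owns indexing, similarity search, and returning the attached metadata.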

Below is a simple step-by-step process for using AquilaDB for efficient information retrieval. AquilaDB is quick to set up and runs as a Docker container. All you need to do is either build it from the source code or pull it from Docker Hub.

To build from source, clone the AquilaDB repository and build the image:

docker build -t ammaorg/aquiladb:latest .

Or pull the image from Docker Hub:

docker pull ammaorg/aquiladb:latest

Finally, deploy the AquilaDB container:

docker run -d -i -p 50051:50051 -v "<local data persist directory>:/data" -t ammaorg/aquiladb:latest

 

This project is still under active development (pre-release). It can already be used as a standalone database. The peer manager is a work in progress, so networking capabilities are not available yet. With release v1.0, we will ship a pre-optimized version of AquilaDB. If you have any questions about AquilaDB, please drop an email to [email protected]

Author

Jubin Jose

Jubin Jose is a technology enthusiast. He works as a Senior NLP Developer at Accubits. When he's not watching sci-fi, he might be coding or listening to rock music. Jubin is interested in Decentralized Machine Learning and Information Retrieval. He believes in the importance of giving back to the community and is an active FLOSS contributor.
