This is the story of AquilaDB – a latent vector and document database built for data scientists and machine learning engineers. In this story, I will explain how and why I developed AquilaDB to solve one of the major problems developers face when building information retrieval systems.
A couple of years ago, I was frantically working on a project to develop an AI chatbot. The purpose of this chatbot was to engage hundreds or even thousands of website visitors simultaneously and serve them the information they seek. Only when I was nearing the end of the project did I realize that the chatbot I had developed did not scale well with the intent-entity paradigm: it got confused once the number of intents exceeded a certain limit. Most of the AI-powered chatbots on the market today use an intent-based information retrieval mechanism. The application tries to identify the intent behind a query and parses information from related documents and databases to deliver relevant information to the user. However, in an enterprise-level application, the volume of information at its disposal will be very high: documents, images, videos, and so on. In that scenario, an intent-based information retrieval mechanism does not work efficiently.
Despite the disappointment that came with this newfound challenge, I pulled up my socks and started looking for a way to solve the problem. The main obstacle was finding a solution that could fetch information from a large volume of data and handle a large number of queries while keeping retrieval fast and efficient. After a good dose of sleepless nights and caffeine, I came across the concept of Neural Information Retrieval, which uses a vector space model to represent information and sidestep the scalability issue. I ditched the standard boolean model and moved to a vector space model.
A vector space model is an algebraic model for representing objects such as documents, images, and videos as vectors of identifiers, for example index terms. Each object is represented in a numerical format so that we can apply data mining techniques such as information retrieval, information extraction, and information filtering.
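To make this concrete, here is a small illustrative sketch (toy documents, scikit-learn for convenience; none of this is AquilaDB code) that turns a handful of documents into vectors of index terms:

```python
# Minimal sketch: representing documents as vectors of index terms.
# The toy documents below are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "deep learning for image retrieval",
    "chatbots answer visitor questions",
    "vector databases index dense vectors",
]

vectorizer = CountVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # shape: (3, vocabulary size)

print(vectorizer.get_feature_names_out())  # the index terms (vocabulary)
print(doc_vectors.toarray())               # each row is one document as a vector
```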
I replaced the standard boolean model with a vector space model to fix the scalability issues. It allowed me to compute a continuous degree of similarity between queries and documents and made it easy to rank documents by their likely relevance. The partial matching feature of the vector space model also let the application retrieve relevant information much faster. In a nutshell, moving the application to a vector space model improved its performance by at least 10x.
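The sketch below illustrates that continuous degree of similarity: the same kind of toy documents are ranked against a query by cosine similarity, and even a query that only partially matches a document still receives a score. Again, this is a generic vector space example, not AquilaDB code:

```python
# Minimal sketch: ranking documents against a query by cosine similarity.
# Toy data for illustration; not AquilaDB code.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "deep learning for image retrieval",
    "chatbots answer visitor questions",
    "vector databases index dense vectors",
]
query = "index vectors in a database"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors)[0]
ranking = scores.argsort()[::-1]              # best match first
for idx in ranking:
    print(f"{scores[idx]:.3f}  {documents[idx]}")
```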
Vectorizing data and using it for retrieval was clearly the right approach for this kind of application. However, to my surprise, there was, to my knowledge, no database or open source project available at the time to help a developer index vectorized data. That's when I decided to build one, and in a few months I open-sourced AquilaDB – the first tool of its kind that I knew of for indexing vectorized data.
The driving idea behind AquilaDB was to build something that could act as a type of muscle memory for all machine learning applications. My goal was to provide data scientists and machine learning engineers with a sustainable solution for effective Neural Information Retrieval. It is the Redis for Machine Learning.
In a bird's-eye view, AquilaDB is a latent vector and document database. It helps developers prototype an idea in minutes and then scale it at their own pace. It facilitates efficient similarity search on sets of dense vectors of any size, including ones that do not fit in RAM. In AquilaDB, I used two core libraries: FAISS and PouchDB. FAISS is a library for similarity search over dense vectors; it provides algorithms that search in sets of vectors, irrespective of size. PouchDB is an in-browser database written in JavaScript and modeled after CouchDB. Because of this, AquilaDB offers everything the Couch protocol can offer and connects natively to anything that understands that protocol.
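To give a feel for the FAISS side of this, here is a small standalone sketch of the kind of nearest-neighbour search FAISS performs on dense vectors (random toy vectors; this is not AquilaDB's internal code):

```python
# Minimal standalone FAISS sketch: index random dense vectors and query them.
# Illustrates the kind of similarity search FAISS provides; not AquilaDB code.
import numpy as np
import faiss

dim = 64                                   # dimensionality of the feature vectors
rng = np.random.default_rng(42)
database_vectors = rng.random((10_000, dim), dtype=np.float32)
query_vectors = rng.random((5, dim), dtype=np.float32)

index = faiss.IndexFlatL2(dim)             # exact L2 (Euclidean) search
index.add(database_vectors)                # index the stored vectors

k = 3                                      # number of nearest neighbours to return
distances, ids = index.search(query_vectors, k)
print(ids)        # row i holds the ids of the k vectors closest to query i
print(distances)  # and their corresponding L2 distances
```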
The video below explains AquilaDB in detail.
AquilaDB is extremely easy to set up, and you can start indexing your feature vectors within minutes. All you need to do is vectorize the data for your application and send it to AquilaDB along with the respective metadata. After you have indexed enough data, you can send a query vector to the database and retrieve similar vectors along with their metadata. Because any kind of data can be vectorized, this retrieval method gives you a versatile system for sound, video, and image formats alike.
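A minimal sketch of that flow with the Python client could look like the following; the client calls shown (AquilaClient, convertDocument, addDocuments, convertMatrix, getNearest) are illustrative and may differ from the released client, and the vector values and metadata are just toy examples:

```python
# Sketch of the index-and-query flow. The client API names below are
# illustrative assumptions; vector values and metadata are toy examples.
from aquiladb import AquilaClient

db = AquilaClient('localhost', 50051)      # the container exposes gRPC on 50051

# 1. Vectorize your data with any model of your choice, then index each
#    vector together with its metadata.
document = db.convertDocument([0.1, 0.2, 0.3, 0.4], {"url": "page_1", "tag": "demo"})
db.addDocuments([document])

# 2. Later, send a query vector and retrieve the k most similar documents
#    along with their metadata.
query = db.convertMatrix([0.1, 0.2, 0.3, 0.4])
result = db.getNearest(query, 10)
print(result)
```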
Below is a simple step-by-step process for using AquilaDB for efficient information retrieval. AquilaDB is quick to set up and runs as a Docker container. All you need to do is either build it from the source code or pull it from Docker Hub.
To build from the source code, clone this repository and build the image:
docker build -t ammaorg/aquiladb:latest .
To pull from Docker Hub:
docker pull ammaorg/aquiladb:latest
Finally, deploy the AquilaDB container:
docker run -d -i -p 50051:50051 -v "<local data persist directory>:/data" -t ammaorg/aquiladb:latest
This project is still under active development (pre-release). It can already be used as a standalone database. The peer manager is a work in progress, so networking capabilities are not available yet. With the v1.0 release, we will ship the optimized version of AquilaDB. If you have any questions about AquilaDB, please drop an email to [email protected]