If you’ve ever used Google Photos or the Photos app on your iPhone, you’ve probably noticed how it recognizes different faces and groups them under specific profiles. This is made possible by face clustering technology. In essence, face clustering is the task of grouping unlabeled face images according to individual identities.
Face clustering models are mostly built using Convolutional Neural Networks. Today, face clustering has advanced to the point where faces can be recognized despite extreme poses, illumination changes, and resolution variations. Suffice it to say that we already have a host of proven clustering methodologies for grouping photos.
But what about videos? Would these clustering methodologies work well for face tagging and face clustering in videos?
Recently, I came across this question while developing a new feature for Searchyf – a multimedia data retrieval platform built by Accubits. The feature had to tag the faces of people appearing in live streaming videos, cluster those faces, and build a database from the resulting clusters – similar to what you’ve seen in Google Photos or Apple Photos, but for videos instead of photos.
Since we were dealing with a live video stream and not a group of still images, it was evident from the get-go that we would need a unique solution to this problem. Existing face clustering models are built using Convolutional Neural Networks, and the clustering itself is managed in RAM. As the load increases – here, the number of faces appearing in the video – this results in RAM overload. Such a model is fine for small-scale face clustering, but for video data the number of faces that need to be clustered is far greater than these methodologies can handle.
This is when my team and I began brainstorming. We came up with several ideas that could provide significant throughput and zero downtime. While researching, we realized that conventional unsupervised clustering wouldn’t work, since we don’t have the data up front. Additionally, the usual train/test split wouldn’t apply either, as new instances of data arrive continuously from the live video stream. So we decided to build an online system that can dynamically form a cluster whenever it sees a new instance (a face) it has never seen before. The diagram below shows, at a high level, the methodology we developed for face tagging and clustering in live stream video data.
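The online idea described above can be sketched in a few lines of plain Python. This is a minimal in-memory illustration, not the production system: the function names and the 0.5 similarity threshold are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def assign_cluster(embedding, clusters, threshold=0.5):
    """Return the id of the best-matching cluster, creating a new
    cluster when no stored vector is similar enough (a new face)."""
    best_id, best_sim = None, threshold
    for cluster_id, vectors in clusters.items():
        for v in vectors:
            sim = cosine_similarity(embedding, v)
            if sim > best_sim:
                best_id, best_sim = cluster_id, sim
    if best_id is None:
        best_id = len(clusters)  # unseen face -> open a new cluster
        clusters[best_id] = []
    clusters[best_id].append(embedding)
    return best_id
```

In the real architecture, the stored vectors live in Elasticsearch rather than a Python dict, but the control flow – compare, vote, and either join an existing cluster or open a new one – is the same.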
On many occasions, we have turned to Elasticsearch for such development work. It invariably ends up in our stack because it is an incredible tool for storing and managing large amounts of textual and geospatial information. We were pleasantly surprised to find that it now also supports semantic similarity search over vectors, or embeddings.
After weeks of slogging, we developed the above architecture on top of Elasticsearch, which can dynamically group incoming vectors into similar clusters. Our intuition was to have as many clusters as there are unique people in the stream. For every new, unmatched vector, a new cluster is formed and stored. Each incoming vector is evaluated against the vectors already stored in Elasticsearch to see whether there are close matches.
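To make this concrete, here is a hedged sketch of how the Elasticsearch side could be wired up. The index layout, field names, and the 128-dimension embedding size are illustrative assumptions; the `dense_vector` field type and the `cosineSimilarity` function inside a `script_score` query are standard Elasticsearch features.

```python
# Index mapping: each stored document holds one face embedding
# plus the id of the cluster it was assigned to.
FACE_INDEX_MAPPING = {
    "mappings": {
        "properties": {
            "cluster_id": {"type": "keyword"},
            "embedding": {"type": "dense_vector", "dims": 128},
        }
    }
}

def knn_query(embedding, top_k=10):
    """Build a script_score query body that ranks stored faces by
    cosine similarity to the incoming embedding. The +1.0 offset is
    the usual trick to keep scores non-negative."""
    return {
        "size": top_k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "cosineSimilarity(params.v, 'embedding') + 1.0",
                    "params": {"v": embedding},
                },
            }
        },
    }
```

The query body would be sent to the index’s `_search` endpoint; each hit comes back with the `cluster_id` of the stored face it matched.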
Finally, we run a voting test over the responses to find the correct cluster to which the new vector should be mapped. If we don’t find any such match, we assume it belongs to a new cluster. The metric we used to estimate similarity among clusters is cosine similarity. Thanks to its design, Elasticsearch can perform huge numbers of comparisons in a split second. In our evaluation, it gave us the expected results, despite some shortcomings.
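The voting step itself is simple. Here is a minimal sketch, assuming each search hit carries the `cluster_id` of the stored face it matched; the 0.5 score threshold and plain majority rule are illustrative assumptions, not the exact values from our system.

```python
from collections import Counter

def vote(hits, min_score=0.5):
    """Pick the cluster that the most sufficiently-similar hits belong
    to. Returns None when no hit clears the threshold, signalling that
    the incoming face should start a new cluster."""
    candidates = [h["cluster_id"] for h in hits if h["score"] >= min_score]
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]
```

Majority voting over the top matches makes the assignment robust to the occasional stray high-similarity hit from the wrong cluster.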
The HTTP interface exposed by Elasticsearch was extremely useful, since we could interact with the clusters through REST APIs. We really wanted to see how well this architecture scales, so we benchmarked it on the well-known CelebA dataset, which gave us considerably better results.
To make things easier, Elasticsearch offers a thorough set of REST APIs for performing tasks such as checking cluster health, performing CRUD (Create, Read, Update, and Delete) and search operations against indices, and executing advanced search operations such as filtering and aggregations.
In a nutshell, this new methodology turned out to be quite successful: we were able to accurately cluster faces into different profiles from every live video stream we tested the algorithm on. However, the model still has plenty of room for improvement. Right now, the vector searches are executed on a 1-to-1 basis, which is not very efficient, and the methodology as a whole could be scaled further using big data technologies.