Practicing Deep Learning (DL) is computationally expensive because of the sheer volume of matrix operations involved. We can throw the latest GPUs and memory modules at model training and fine-tuning, but the problem doesn't stop there. We also need a production-ready version of the solution that can go live at minimal cost, in terms of the specifications of the hosting environment. Customers expect high throughput for a low investment, and deploying multiple models at scale is always messy.
Recently we were working on a requirement to detect a wide range of anomalous events, including thefts, explosions, and accidents. Governments and private institutions invest humongous amounts in surveillance to secure their premises, yet they still rely on manual intervention to make sense of the footage. This shortcoming motivated us to design a solution that leverages the capabilities of Deep Learning and Computer Vision. We developed a sophisticated hybrid neural network that detects anomalous events and raises alerts on time, and at maturity it delivered state-of-the-art performance.
We wanted a data structure that could accommodate the results of the workflow, so we thought of the old-school stack; hardly anyone leaves school without hearing the term 'stack' at least once. The vanilla stack follows a Last In, First Out (LIFO) discipline, which was exactly what we wanted. We also added a couple of features to the conventional design, such as flushing the stack at regular time intervals, and we call the result an Extended Stack (stackX).
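To make the idea concrete, here is a minimal Python sketch of such an extended stack. The names (StackX, flush_interval, on_flush) are our own illustrative choices, not the exact production implementation: a plain LIFO stack plus a background timer that drains it at a fixed interval.

```python
import threading
import time

class StackX:
    """LIFO stack that also flushes itself at a fixed time interval.

    Minimal illustrative sketch; the class and parameter names are
    our own, not from any library.
    """

    def __init__(self, flush_interval=5.0, on_flush=None):
        self._items = []
        self._lock = threading.Lock()
        self._on_flush = on_flush  # optional callback for drained items
        # Background timer that empties the stack every flush_interval seconds.
        worker = threading.Thread(target=self._flush_loop,
                                  args=(flush_interval,), daemon=True)
        worker.start()

    def push(self, item):
        with self._lock:
            self._items.append(item)

    def pop(self):
        # Last In, First Out: the newest result is served first.
        with self._lock:
            return self._items.pop() if self._items else None

    def flush(self):
        # Drain everything in one sweep so stale results never pile up.
        with self._lock:
            drained, self._items = self._items, []
        if self._on_flush and drained:
            self._on_flush(drained)

    def _flush_loop(self, interval):
        while True:
            time.sleep(interval)
            self.flush()


# Example usage (values are illustrative):
results = StackX(flush_interval=10.0,
                 on_flush=lambda stale: print(f"dropped {len(stale)} stale results"))
results.push({"event": "explosion", "score": 0.97})
latest = results.pop()  # the most recent detection comes out first
```

The periodic flush is what makes it 'extended': anything that has not been popped for alerting within the interval is drained in one sweep, so the stack's memory footprint stays bounded.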
Then the real problem surfaced: deployment. Customers wanted the system to run in real time and raise alerts promptly, so saving incoming videos first and processing them later was not an option. We planned to go live with live streams as inputs and started scratching our heads for a design that could scale to multiple concurrent streams. Processing each incoming frame individually would not scale, so we introduced a First In, First Out (FIFO) queue to manage the frames and keep them in order. Instead of analyzing separate frames, we started processing batches of images, so feature extraction happens at the batch level rather than the frame level. Still, keeping huge batches in memory is not always a good idea (we were asked to process 4K video streams), so we opted for a file-system queue over an in-memory queue to relieve the memory pressure. With that, the queuing part was also done.
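Here is a minimal sketch of what such a disk-backed FIFO batch queue could look like. The class name FrameFileQueue, the spool-directory layout, and the JPEG extension are our illustrative assumptions rather than the exact production code.

```python
import os
import uuid
from collections import deque

class FrameFileQueue:
    """FIFO frame queue that spills payloads to disk instead of RAM.

    Minimal illustrative sketch; names and layout are assumptions,
    not the production implementation.
    """

    def __init__(self, spool_dir="frame_spool"):
        os.makedirs(spool_dir, exist_ok=True)
        self._dir = spool_dir
        self._order = deque()  # only file paths stay in memory

    def enqueue(self, frame_bytes):
        # Persist the encoded frame; memory holds just the path.
        path = os.path.join(self._dir, f"{uuid.uuid4().hex}.jpg")
        with open(path, "wb") as f:
            f.write(frame_bytes)
        self._order.append(path)

    def dequeue_batch(self, batch_size=32):
        # Pop up to batch_size frames in arrival order (FIFO),
        # deleting each file once it has been read back.
        batch = []
        while self._order and len(batch) < batch_size:
            path = self._order.popleft()
            with open(path, "rb") as f:
                batch.append(f.read())
            os.remove(path)
        return batch
```

Only the paths live in memory; the frames themselves wait on disk until an inference worker calls dequeue_batch and runs feature extraction on the whole batch at once. The small I/O cost buys a bounded memory footprint, which matters when a single 4K frame can run to several megabytes; a production version would also add locking once several streams feed the same queue.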
We also put together a short demo of what we've developed. Our system is now live and running successfully across the globe. The key takeaway from the exercise is that fundamental data structures like stacks and queues can be extended for robust inference, but only with added flavors like the ones described above, which drastically reduce the overall workload.