In this article, we’ll see how to build a highly scalable and optimized system with assembly language, with the help of a case study. I’ve always seen programming as a two-step process. The first step is to get the code you wrote to compile and run correctly: no bugs, no errors, and everything works like a charm. The second step is even more important. If you want the program to be scalable, you need to optimize it so that its execution time and resource consumption are as low as possible. Today, we have a wide variety of programming languages, each with its own merits and demerits.
I started my career with JavaScript (Node.js, to be precise), and I had some C programming experience from college. Although JavaScript is still widely used, I personally feel that it shouldn’t be the first language a beginner learns. Scripting languages impose very little structure, so you can write code almost any way you want, and you can declare variables on the fly without even knowing their types. These features are convenient when you are a newbie programmer, but once you start building high-performing, scalable architectures, these languages are not up to the mark. The next two languages I learned were Go and Rust, both powerful and statically typed. They eliminated the majority of my runtime errors and gave me a different perspective on programming. I experimented with them to see how far I could optimize a program. Despite trying out several programming languages, one question still lingered in my mind: can I optimize my code further?
A couple of months ago, I developed an interest in retro computing. Two projects that really inspired me were the Apollo Guidance Computer (the computer used in the moon landing) and the Nintendo Entertainment System (NES). The average smartphone is orders of magnitude more powerful than the hardware that helped humanity land on the moon. It was shocking to me that the memory of an NES could not even hold a single JPEG file of one Super Mario screen. The amount of innovative coding that went into NES games is worth appreciating. The software for both machines was written in assembly language, and I wondered whether it was still worth trying to write my own program in assembly language.
The lowest-level language I knew was C++, which is hard to beat when it comes to performance. So, I decided to write the program in C++, optimize some parts with assembly, and compare the processing times. When benchmarking, I usually like to use the Fibonacci series, primarily because it can be written with either iteration or recursion and so covers both kinds of benchmark. This time, however, I wanted to solve a more modern problem rather than a classic, well-known algorithm.
After a little research for my assembly program, I found a tutorial in which the author manipulated an image. I found this extremely intriguing, so I tried to pull off the same thing with my own unique twist.
The project’s input is the path to an image and a brightness factor, and its output is a new image with increased brightness. Each image is converted into a matrix of values ranging from 0 to 255, and the factor is added to every cell in the matrix. This is a basic scalar matrix addition, but the scale is relatively large.
Testing Environment
Assembly language varies with the assembler, the architecture, the operating system, and so on. The cross-platform assembler I chose was NASM. Its syntax is not as powerful as that of MASM (the Microsoft assembler), but it runs on the system where the benchmark was conducted, which is described below. For the complete code, you can check out my repo image-manipulation.
OS | macOS Mojave |
Memory | 16 GB 1600 MHz DDR3 |
Processor | 2.5 GHz Intel Core i7 |
The C++ compiler was G++, and the assembler was NASM targeting x86-64. For the C++ part, I used the OpenCV library for image manipulation. The code is given below:
    #include "opencv2/imgcodecs.hpp"
    #include "opencv2/highgui.hpp"
    #include <ctime>
    #include <iostream>

    using std::cin;
    using std::cout;
    using std::endl;
    using namespace cv;

    int main( int argc, char** argv )
    {
        CommandLineParser parser( argc, argv, "{@input | lena.jpg | input image}" );
        Mat image = imread( samples::findFile( parser.get<String>( "@input" ) ) );

        // Start timing once the image has been read from disk.
        clock_t time_req;
        time_req = clock();

        if( image.empty() )
        {
            cout << "Could not open or find the image!\n" << endl;
            cout << "Usage: " << argv[0] << " <Input image>" << endl;
            return -1;
        }

        Mat new_image = Mat::zeros( image.size(), image.type() );
        double alpha = 1.0; // contrast gain; 1.0 leaves the contrast unchanged
        int beta = 45;      // brightness offset added to every channel

        // Visit every channel of every pixel and add the brightness offset.
        for( int y = 0; y < image.rows; y++ ) {
            for( int x = 0; x < image.cols; x++ ) {
                for( int c = 0; c < image.channels(); c++ ) {
                    new_image.at<Vec3b>(y,x)[c] =
                        saturate_cast<uchar>( alpha*image.at<Vec3b>(y,x)[c] + beta );
                }
            }
        }

        // Raw pixel buffers; unused in this pure C++ version.
        uchar* image_data = image.data;
        uchar* new_image_data = new_image.data;

        // Elapsed time in clock() ticks.
        cout << "Running Time " << clock() - time_req << endl;
        // cout << "The image is" << image_data <<endl;
        imwrite( "orginal.jpeg", image );
        imwrite( "newImage1.jpeg", new_image );
        waitKey();
        return 0;
    }
This is the complete C++ program that performs the task described above. The most time-consuming part of the code is:
    for( int y = 0; y < image.rows; y++ ) {
        for( int x = 0; x < image.cols; x++ ) {
            for( int c = 0; c < image.channels(); c++ ) {
                new_image.at<Vec3b>(y,x)[c] =
                    saturate_cast<uchar>( alpha*image.at<Vec3b>(y,x)[c] + beta );
            }
        }
    }
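These three nested loops visit every channel of every pixel, so the work grows with rows × cols × channels, which is large for a full-size image. To optimize the code, I moved this part to assembly, while the image-loading section was still in C++.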
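    section .text
    global __start

    ; RDI 1st argument
    ; RSI 2nd argument
    ; RDX 3rd argument
    ; RCX 4th argument

    ; uchar* new_image,  RDI (destination buffer)
    ; uchar* old_image,  RSI (source buffer)
    ; short brit,        RDX (brightness delta, assumed to be within -255..255)
    ; Size_<int> size    RCX (used below as the number of bytes to process)

    default rel

    __start:
        mov     r11d, 0ffh          ; saturation value for the brighten path
        cmp     dx, 0               ; only the low 16 bits hold the short
        jl      ReduceBright

    mainLoop:
        movzx   eax, byte [rsi]     ; load one 8-bit channel value
        add     al, dl              ; add the delta; carry is set on overflow
        cmovc   eax, r11d           ; saturate at 255 on overflow
        mov     byte [rdi], al      ; store one byte
        inc     rsi
        inc     rdi
        dec     rcx
        jnz     mainLoop
        ret

    ReduceBright:
        xor     r11d, r11d          ; saturation value for the darken path
        neg     dx                  ; work with the magnitude of the delta

    MainLoopSubtract:
        movzx   eax, byte [rsi]     ; load one 8-bit channel value
        sub     al, dl              ; subtract; carry is set on underflow
        cmovc   eax, r11d           ; saturate at 0 on underflow
        mov     byte [rdi], al      ; store one byte
        inc     rsi
        inc     rdi
        dec     rcx
        jnz     MainLoopSubtract
        ret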
The arguments passed to the function arrive in the registers listed in the comments, following the System V AMD64 calling convention used on macOS:
    ; RDI 1st argument
    ; RSI 2nd argument
    ; RDX 3rd argument
    ; RCX 4th argument
The result was astounding. The assembly code was roughly 17 times faster than the C++ version (4378 versus 261 in the measurements below). The C++ code could be optimized further to narrow the gap, but a speedup of this size is still a major improvement.
Implementation | Running time (clock() ticks) |
C++ | 4378 |
Assembly | 261 |
My objective for this exercise was to check how far we can optimize a piece of code. So, the million-dollar question here is: ‘Should we start programming in assembly language?’
The short answer is NO. It isn’t a good idea, even though assembly gives you direct control over the CPU registers. With great power comes great responsibility, and assembly code is hard to port and hard to maintain. However, rewriting selected parts of a program in assembly is definitely a technique that developers should add to their arsenal for building highly scalable, high-performing applications. Most compilers translate code into assembly more efficiently than one would expect, so hand-written assembly pays off only in carefully chosen hot spots. With proper optimization, it is possible to gain considerable performance, although it does take a good amount of effort.