A few months ago we made a style transfer application to demo the capabilities of the Genius supercomputer cluster. Now we have got our hands on a PowerAI, which IBM generously loaned to us for a couple of months. Time to see how it stacks up against Genius in painting!
IBM PowerAI
PowerAI is a GPU-accelerated server developed by IBM. Just like a node of the Genius supercomputer, it uses NVIDIA Tesla P100 graphics cards with 16GB of memory. It also has a 128 core CPU and 128GB RAM. But the thing that makes the PowerAI so great and powerful is NVLink. NVLink provides ultra-fast communication between the GPUs, the CPU and the RAM.
The specs don’t lie: even though PowerAI only compares to a single node of Genius, it won’t have any problem being a great painter. Perhaps even on par with Genius.
Porting Genius to PowerAI
To run our style transfer demo on the PowerAI we need to make some changes to the architecture. And while we’re at it, we’re gonna update the frontend because, let’s be honest, the one we built for Genius was, well, uhm, yeah… very ugly…
First up: changing the system architecture to accommodate PowerAI. On Genius we had multiple nodes, each with 4 GPUs that each ran a single style model. Now we need to run all of our style models on the 4 GPUs of our PowerAI. Good thing that the P100 has 16GB of memory! We decided to run 2 models per GPU, so each model can use 8GB of VRAM, which should be plenty. In theory it should even be possible to run 4 style models on a single GPU, but since we only have 6 models we can easily get away with running only 2 models per GPU. Another thing that changes from Genius (for the better) is that there is only 1 Flask server. The Genius architecture had a login node that received all requests and forwarded each request to the compute node running the requested model. The entire setup used Flask servers, and all the communication between nodes was done with REST.
So for the PowerAI the compute nodes will actually disappear and everything will happen on what was the login node on Genius. Instead of routing the request to the correct compute node, the correct model is loaded from memory. Using the correct model the frame is styled and is then returned in the response. Eliminating the latency of the internal communication between the login and compute nodes should also mean that PowerAI is going to have a higher frame rate.
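To make the new layout concrete, here is a minimal sketch of how 6 models could be spread over 4 GPUs at 2 per card, and how the old routing step turns into a plain dictionary lookup. The style names, the function names and the "fill one GPU before moving to the next" order are all assumptions for illustration, not the actual code from the demo.

```python
from collections import defaultdict

N_GPUS = 4           # PowerAI has 4 Tesla P100 cards
MODELS_PER_GPU = 2   # 2 models per 16GB card, so roughly 8GB each

# Hypothetical style names; the post only says there are 6 models.
STYLE_NAMES = ["style_a", "style_b", "style_c", "style_d", "style_e", "style_f"]


def assign_models_to_gpus(names, n_gpus=N_GPUS, per_gpu=MODELS_PER_GPU):
    """Fill GPUs one at a time, placing at most `per_gpu` models on each."""
    if len(names) > n_gpus * per_gpu:
        raise ValueError("not enough GPU slots for all models")
    return {name: index // per_gpu for index, name in enumerate(names)}


def models_by_gpu(placement):
    """Group model names by the GPU index they were assigned to."""
    grouped = defaultdict(list)
    for name, gpu in placement.items():
        grouped[gpu].append(name)
    return dict(grouped)


# Built once at startup; on Genius this was a hop to a compute node.
PLACEMENT = assign_models_to_gpus(STYLE_NAMES)


def pick_gpu(style):
    """Look up which GPU holds the model for the requested style."""
    return PLACEMENT[style]
```

In a real server the dictionary values would be the loaded models themselves rather than GPU indices, but the idea is the same: the request never leaves the process.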
Now let’s fix that hideous frontend that we used for the Genius demo. The Genius frontend was built with Tkinter, the standard GUI toolkit that comes with Python. GUI frameworks like Tkinter feel horribly outdated; all the cool kids build their frontends using fancy web frameworks.
The initial idea for the frontend was to build everything using HTML, CSS and JavaScript, with the JavaScript retrieving the camera feed and communicating with the REST API. But modern browsers impose security restrictions on JavaScript: you cannot use a user’s webcam if your website is not served over HTTPS. So sadly this wasn’t possible and we were back at square one.
Back to Python it is! Just like before, we are using OpenCV to retrieve the camera stream (no HTTPS required, the user just needs to give access to the camera once). Using the requests library we send the frames of the camera stream to the API running the styling models, in this case on PowerAI. But unlike before, we expose the styled frames through a local Flask API, which makes it possible to build the final frontend with HTML and CSS. I’m no designer, so I downloaded a simple Bootstrap theme I liked, added some missing elements and boom, the frontend was done and not too shabby looking, if I say so myself.
Optimising the frame rate
Everything is looking great! Well, maybe not everything: the frame rate is terrible. Around 2.5 FPS… I have to note that the communication with the PowerAI goes over a wireless connection. A wired connection will probably improve it, but still, 2.5 FPS is unacceptable! Time to find out the cause of this madness!
The bottleneck is probably going to be the bandwidth, but on the off chance that PowerAI can’t handle all the style transfer models at once, we are going to check that first. No surprise here: PowerAI has no problem at all running all the models simultaneously. In fact, on each of the graphics cards only 8GB out of 16GB is used. The GPU utilisation is between 30 and 50 percent, so that’s not going to be the bottleneck either. Styling a single frame takes between 50 and 80 milliseconds, so on average we should get about 15 FPS. That’s way more than we are getting in reality… The good news is that PowerAI isn’t the bottleneck. The bad news is that the bandwidth is, and we cannot simply up the bandwidth. We can, however, limit the strain on it.
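The arithmetic behind that estimate is short enough to write down. This sketch only uses the numbers measured above and assumes frames are handled one at a time, so that frame rate is simply the inverse of the time spent per frame; the function names are made up for illustration.

```python
def fps_from_latency(ms_per_frame):
    """Frames per second if every frame takes `ms_per_frame` milliseconds."""
    return 1000.0 / ms_per_frame


def ms_per_frame_from_fps(fps):
    """Time budget per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Styling alone takes 50 to 80 ms per frame on PowerAI.
best_case_fps = fps_from_latency(50)             # fastest measured styling time
worst_case_fps = fps_from_latency(80)            # slowest measured styling time
typical_fps = fps_from_latency((50 + 80) / 2)    # midpoint of the range, 65 ms

# At the observed 2.5 FPS a full round trip takes 400 ms per frame,
# so at least 400 - 80 = 320 ms is spent outside the styling step.
observed_ms = ms_per_frame_from_fps(2.5)
overhead_ms = observed_ms - 80

print(f"styling only: {worst_case_fps:.1f} to {best_case_fps:.1f} FPS, "
      f"about {typical_fps:.1f} FPS at the midpoint")
print(f"observed: {observed_ms:.0f} ms per frame, "
      f"at least {overhead_ms:.0f} ms of it is not styling")
```

In other words, styling accounts for at most a fifth of the time each frame takes, which is why the connection is the place to look.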
The backend is a RESTful API. Yes, this isn’t the best choice for streaming data and sockets probably would have performed way better, but that’s beside the point. The problem with the way we were using REST is that the client sends a request, the server handles the request, does its magic, and sends back a response, after which the connection between client and server is closed. Opening a connection takes time, and to make things worse we had to enable HTTPS for GDPR reasons, so every time a connection is opened an SSL handshake is done, adding to the time it takes to open a connection. Keeping the connection open will probably give a pretty good performance boost. And just like most things in Python, doing this is as simple as it can be: the requests library has support for sessions, and sessions keep the connection alive by default. Gotta love the Python community for all the awesome things they make. Using sessions, let’s see how much of an improvement it is… Wait, what, for real? From 2.5 FPS to 8 FPS on average. That’s a huge performance boost for such a simple fix!
An average of 8 FPS still doesn’t make it look smooth by a long shot, but it’s good enough for now, especially since there is a second easy fix: a wired connection.
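For reference, the session change looks roughly like this. It is a sketch and not the demo’s actual client: the URL, the function name and the choice to send each encoded frame as the raw request body are assumptions. The only part taken from the requests library itself is that a `requests.Session` reuses the underlying connection between calls.

```python
import requests

# Hypothetical endpoint; the post does not give the real URL or route.
STYLE_URL = "https://powerai.example/style"


def style_frames(frames, url=STYLE_URL, session=None):
    """POST each encoded frame and return the styled bytes of each response.

    Reusing one Session lets requests keep the connection open, so the
    TCP and TLS handshakes are paid once instead of once per frame.
    """
    owns_session = session is None
    if owns_session:
        session = requests.Session()
    try:
        styled = []
        for frame in frames:
            # Assumption: the API accepts the encoded frame as the request body.
            response = session.post(url, data=frame, timeout=5)
            response.raise_for_status()
            styled.append(response.content)
        return styled
    finally:
        # Only close a session this function created itself.
        if owns_session:
            session.close()
```

The slow version is the same loop with `requests.post(...)` instead of `session.post(...)`, which sets up a fresh connection for every frame.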
Conclusion
PowerAI is not on the same power level as Genius: it only has 4 Tesla P100 cards compared to the 80 cards Genius has at its disposal (4 cards per node, 20 nodes). Still, its power level is over 9000, and it punches well above its weight class. This proves that brute force isn’t the only way to get better results; optimising communication at the hardware level matters too.
A big shoutout and thanks to IBM for loaning us a PowerAI! It was a blast to work with such a powerful machine.