Whilst looking over various internship opportunities, one particular challenge caught my attention: developing an AI application capable of classifying Dutch Sign Language videos. After a brief interview, Brainjar offered me the opportunity of a lifetime: helping me tackle this challenge… and a challenge it was. This summer I had an amazing experience trying to handle this beast. Let me tell you all about it.
What’s an algorithm without any data?
First things first, we needed data… lots of data. Whilst scouring the internet, I found that the main source of existing data was an online dictionary of Dutch signs, which gave me one video example per sign. Now, as most of you probably know, deep learning algorithms are quite data-hungry and will not be pleased with a single appetizer: they want the full five-course meal. So, in order to create the data we needed, we set up a handy website allowing people to easily record video clips of Dutch signs. Fortunately, more widely known sign languages, such as American or Indian Sign Language, do come with existing datasets. So whilst the website was up and running and gathering data, I decided to have a peek at the second challenge looming over the horizon: going from pixels to predictions.
To feature engineer or not to feature engineer?
My first task was deciding how to use the WLASL dataset. Considering an average of 17 videos per sign, I decided an end-to-end approach would be plan B. Plan A was to feed the videos through a keypoint detector called OpenPose, which spits out the coordinates of various points on the body. Reasoning that these coordinates are fairly low-level, yet much more abstract than raw pixel values, I expected this new form of data to retain most of the important content of the video whilst drastically reducing the amount of work the algorithm would have to do. In a sense, I was spoon-feeding my baby algorithm pre-chewed data, which was easier to digest.
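To make this concrete, here is a minimal sketch of how the per-frame JSON that OpenPose writes (when run with its `--write_json` option and hand keypoints enabled) could be turned into one keypoint sequence per video. The directory layout and the choice to keep only the first detected person are my own assumptions, not necessarily how the project handled it:

```python
import glob
import json
import numpy as np

def load_keypoint_sequence(json_dir):
    """Collect OpenPose's per-frame JSON output into a (frames, features) array."""
    frames = []
    for path in sorted(glob.glob(f"{json_dir}/*_keypoints.json")):
        with open(path) as f:
            data = json.load(f)
        if not data["people"]:
            continue  # no person detected in this frame
        person = data["people"][0]  # assume the signer is the first detection
        # Concatenate body and both hands into one flat [x, y, confidence] vector
        features = (
            person["pose_keypoints_2d"]
            + person["hand_left_keypoints_2d"]
            + person["hand_right_keypoints_2d"]
        )
        frames.append(features)
    return np.array(frames, dtype=np.float32)
```

With the 25-point body model plus both 21-point hand models, this yields 201 values per frame, a far cry from the hundreds of thousands of pixel values in a raw video frame.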
Plan A: RNNs to the rescue
After implementing my very first RNN with TensorFlow, I was ready to train my first algorithm. Surely everything would go smoothly and accuracies would rise to the heavens! Yeah… that didn’t happen. A measly 20% accuracy was quite the reality check. But, good news: it turns out that in my excitement, I had skipped a couple of chapters: preprocessing, data augmentation and regularization. So, I went back to work. After filtering the data, trying out different data augmentation techniques and adding some regularization to the network, the accuracy climbed to 70%! However, no amount of effort and fine-tuning seemed able to break through this ceiling. So I decided to move on to Plan B.
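For illustration, here is a minimal Keras sketch of the kind of recurrent classifier described above, with dropout and L2 weight decay standing in for the regularization I mention. The layer sizes and the `NUM_SIGNS` and `NUM_FEATURES` values are placeholders, not the exact architecture I ended up with:

```python
import tensorflow as tf

NUM_SIGNS = 100      # placeholder: number of sign classes
NUM_FEATURES = 201   # body + two hands keypoints, as in the sketch above

model = tf.keras.Sequential([
    # Zero-padded frames are masked so clips can have different lengths
    tf.keras.layers.Masking(mask_value=0.0, input_shape=(None, NUM_FEATURES)),
    tf.keras.layers.LSTM(128, return_sequences=True, dropout=0.3),
    tf.keras.layers.LSTM(64, dropout=0.3),
    tf.keras.layers.Dense(
        64, activation="relu",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(NUM_SIGNS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```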
Plan B: End-to-end with limited data
Now, some of you might have spotted the problem already… training a CNN with this limited amount of data: “tsss, no bueno”. And indeed, even when transfer-learning with a DenseNet base model, the results were underwhelming. However, this part of the project did teach me a lot about the challenges of working with video data. CNNs are huge networks with millions of parameters that take ages to train. Combined with the immense size of a video dataset, video classification turns out to be quite a bit more challenging than image classification.
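To give an idea of what that transfer-learning setup can look like, here is a sketch that freezes an ImageNet-pretrained DenseNet121 as a per-frame feature extractor and averages the frame embeddings over time. The clip length, frame size and classification head are placeholder choices, not the exact model from the project:

```python
import tensorflow as tf

NUM_SIGNS = 100              # placeholder: number of sign classes
FRAMES, H, W = 32, 224, 224  # placeholder clip length and frame size

# Frozen DenseNet121 backbone reused as a per-frame feature extractor
base = tf.keras.applications.DenseNet121(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=(H, W, 3))
base.trainable = False

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(FRAMES, H, W, 3)),
    tf.keras.layers.TimeDistributed(base),     # embed every frame
    tf.keras.layers.GlobalAveragePooling1D(),  # pool embeddings over time
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_SIGNS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Even this frozen-backbone variant has to push every single frame through a large CNN, which is exactly why video classification eats so much more compute and memory than image classification.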
Plan C: ‘And… Another one’
During the last week of my internship, I decided to make one last push for the sacred halls of higher accuracy. Since most of the faulty data in the OpenPose coordinates came from fingers occluding other fingers, adding a separate feature vector capturing the hand shape could be worth a shot. Fortunately, an AI model that does exactly that already exists. Unfortunately, it didn’t really work on my dataset: the existing model was trained on a fairly unrealistic dataset with the same clothing and background in every frame and no overlapping hands. A solution would be to use the existing model as the base model for a transfer-learning approach… which would only require me to manually label 25,000 frames. Given the constrained time frame, this wasn’t really an option, but it is definitely something to keep in mind for any future follow-ups on this project.
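For completeness, here is a sketch of how such a hand-shape feature could be bolted onto the existing keypoint features. `hand_shape_model` is a hypothetical embedding model (for example the existing hand-shape network after fine-tuning), and the cropped hand images are assumed to be prepared elsewhere:

```python
import numpy as np

def fuse_features(keypoints, left_crops, right_crops, hand_shape_model):
    """Append per-frame hand-shape embeddings to the keypoint vectors.

    keypoints:        (frames, 201) array from the OpenPose sketch above
    left/right_crops: (frames, H, W, 3) arrays of cropped hand images
    hand_shape_model: hypothetical model returning one embedding per crop
    """
    left_emb = hand_shape_model.predict(left_crops)
    right_emb = hand_shape_model.predict(right_crops)
    # One richer feature vector per frame: keypoints plus both hand embeddings
    return np.concatenate([keypoints, left_emb, right_emb], axis=-1)
```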
So how about that website?
While I was testing and implementing different algorithms on the WLASL dataset, I also made some improvements to the data collection website and added some gamification features. Still, the assumption that we could collect thousands of videos in a few weeks’ time was quite unrealistic. A more promising alternative for the future would be to extend the website into a more learning-oriented platform. This way, a regular group of users would provide data, and the platform could be extended to sentences when we are ready for the next step in our machine translation application.
My ever-supportive family
Now of course, none of this would have been possible without the Brainjar family, and by extension the Raccoons family, backing me up with amazing support. No question went unanswered and help was always right around the corner. Even though the internship was done from home due to the COVID-19 situation, daily stand-up Zoom sessions and an active Slack community made sure I still got to experience being part of the family. Whilst my initial plans and optimism did require somewhat of a reality check, my time with the Brainjar family has by far been the most educational period of my studies. I learned and grew enormously during my internship, both technically and personally. Really, the value of an internship cannot be overstated, especially not one at Brainjar. Do you want to be an intern at Brainjar? Leave a message here!