While looking for a possible internship, one project immediately got my attention: building a ‘face animation tool’. I couldn’t let this opportunity slip by. The goal of the internship was to start from a face and make its mouth move in sync with a given audio sample.
During my first week, I found the Wav2Lip paper, which did exactly what I had initially wanted to build. As I did not want to reinvent the wheel, my mentors and I decided to build some extra functionality on top of the code that was released with the paper. As the original model was trained on English BBC data, we decided to fine-tune the model on Dutch data as our first challenge.
In addition, when you use a group picture with the original code, it generates the mouth animation for the face it detects first, instead of letting you pick the correct face. We thought we could improve this, so we chose it as our second challenge. The goal was to be able to choose which person’s mouth is animated when a picture contains multiple people.
The importance of data
Picking out the data
The first task was deciding which data I was going to use. Since there isn’t a lot of pre-processed data available, I had to scrape and pre-process it myself. The data used to train the original model features many different speakers. Since it’s hard to collect enough evenly distributed data of different people, I decided to focus on only one person and deliberately over-fit the model on that particular person.
I used videos of Lubach because a lot of them were available on YouTube. This made it easy to download a bunch of them. Once I got enough clips, it was time to start pre-processing the data.
Pre-processing the data
The videos were approximately 5 to 20 minutes long, so they needed to be cut into smaller clips. For this, I used two libraries: Pydub, to find the silences in the audio (it returns a list of timestamps), and MoviePy, to cut the videos into smaller pieces at those silences, which were themselves cut out.
Because the stretches between silences were still too long in some instances, I wrote a loop: whenever a stretch was longer than ten seconds, it split that part into smaller pieces of five seconds each. This turned the 80 downloaded videos into ~5800 video clips.
Since these are videos of a talk show, there are parts without Lubach in them. To solve this, I used a face recognition library to check whether he was in the frame. If he wasn’t, I simply deleted that video clip, since I had enough data to afford losing some. This left me with ~4700 video clips, so around 1100 clips that would have hurt the fine-tuning were removed.
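The splitting loop itself is not shown in this post, so the following is only a minimal sketch of the idea. The function name, the `(start, end)` tuples in seconds, and the handling of the shorter leftover piece at the end are all assumptions made for illustration.

```python
def split_long_segments(segments, max_len=10.0, piece_len=5.0):
    """Split (start, end) segments longer than max_len seconds.

    Hypothetical helper: segments at most max_len long are kept as-is,
    longer ones are cut into pieces of piece_len seconds. The final piece
    of a long segment may be shorter than piece_len (an assumption).
    """
    result = []
    for start, end in segments:
        if end - start <= max_len:
            result.append((start, end))
            continue
        cursor = start
        while cursor < end:
            # Clamp the last piece so it never runs past the segment end.
            result.append((cursor, min(cursor + piece_len, end)))
            cursor += piece_len
    return result


if __name__ == "__main__":
    # One short stretch of speech and one 12-second stretch.
    print(split_long_segments([(0.0, 4.0), (10.0, 22.0)]))
```

Each resulting pair could then be handed to the video-cutting step as the start and end time of one clip.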
Can we start fine-tuning already? Before that, I only needed to run the pre-processing script that’s provided with the Wav2Lip code. This script detects the face in every frame of the videos and saves every face to its corresponding directory. It also saves the audio of each video as a WAV file.
Once the data was ready for fine-tuning the original model, I had to make a choice. There were two ways to train the model: with or without a visual quality discriminator. The goal of this discriminator is to make sure that the output of the generator (which generates the mouth animation) is not blurry. Unfortunately, training with the extra discriminator comes at a cost: a longer training time.
First, I trained without it to see whether it was worth the extra training time. This resulted in really blurry mouth animations, so in the end I decided to train with the discriminator. Sadly, fine-tuning the model didn’t give any significant improvements. But that didn’t stop me!
Choosing the correct person
In the original solution, when a picture with multiple people was uploaded, the model selected the first person it detected. I changed this so you can choose for which person the model should generate mouth movement. To make this more user-friendly, I built it into a Flask web app.
One downside of the web application was that, especially with videos, it could take some time to generate mouth movements, because it needs to detect every face in every frame. So the longer the input video, the longer the face detection takes. To minimize this, I added caching that saves the face detection results of every input. This way, it only needs to calculate these once; the next time, it just reads them from a saved CSV file.
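How the chosen face is passed to the model is not described here, so the following is just one possible sketch of the selection step for a single picture. It assumes the detector returns bounding boxes as `(x1, y1, x2, y2)` tuples and that the user picks a face by its position from left to right; both the box format and the function names are hypothetical.

```python
def order_faces_left_to_right(boxes):
    """Sort (x1, y1, x2, y2) boxes by their left edge so indices are stable."""
    return sorted(boxes, key=lambda box: box[0])


def select_face(boxes, choice):
    """Return the box the user picked (0 = leftmost face).

    Hypothetical helper: raises ValueError when the choice does not
    correspond to a detected face.
    """
    ordered = order_faces_left_to_right(boxes)
    if not 0 <= choice < len(ordered):
        raise ValueError(f"choice must be between 0 and {len(ordered) - 1}")
    return ordered[choice]


if __name__ == "__main__":
    # Three detected faces, in whatever order the detector returned them.
    detected = [(300, 40, 380, 130), (20, 50, 100, 140), (160, 45, 240, 135)]
    print(select_face(detected, 0))  # the leftmost face
```

Sorting before indexing means the numbering shown to the user does not depend on the order in which the detector happens to find the faces.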
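The caching idea can be sketched with the standard library alone. This is not the code from the web app: the cache key (a hash of the input file), the row layout `(frame_index, x1, y1, x2, y2)`, and the `detect` callable standing in for the real face detector are all assumptions.

```python
import csv
import hashlib
from pathlib import Path


def file_digest(path):
    """Hash the file contents so the same upload always maps to the same cache file."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_face_boxes(input_path, cache_dir, detect):
    """Return face boxes for input_path, running detect only on a cache miss.

    detect is a hypothetical callable returning rows of integers shaped like
    (frame_index, x1, y1, x2, y2).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(input_path)}.csv"

    if cache_file.exists():
        # Cache hit: read the saved rows back and convert them to integers.
        with open(cache_file, newline="") as handle:
            return [tuple(int(value) for value in row) for row in csv.reader(handle)]

    # Cache miss: run the slow detection once and store the result.
    rows = [tuple(row) for row in detect(input_path)]
    with open(cache_file, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows
```

Keying the cache on the file contents rather than the file name means a re-upload of the same video under a different name still hits the cache.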
I want to thank the Brainjar team and especially my mentors Jens, Dries, and Deevid. No question went unanswered and help was always one click away, as I worked remotely because of the pandemic. Brainjar has an active Slack community, which gave it some kind of office experience. 😉 This was probably one of the best places to finish my internship, as I learned more during this time than I had initially expected.