Project date: 14/06/2021
Difficulty: ★ ★ ☆ ☆ ☆
This project was created to test Nvidia's StyleGAN 2 ADA. I decided to jump into this project with minimal prior knowledge of both neural networks and coding, which was likely a mistake!
cwRsync (Windows replacement for rsync)
Microsoft Visual Studio 2019
CUDA Toolkit 11
waifu2x-caffe - Upscaling results
WSL + Windows / Ubuntu
Bulk Rename Utility - Useful!
32GB 3200MHz RAM
RTX 2060 Super
Gwern - Dataset creation and cleaning
Towards Data Science - Training StyleGAN
Images for the training dataset were taken from Danbooru 2020 (SFW?, 512px): Gwern - Dataset
Note that not all images in this dataset are completely SFW. Be mindful of this if you decide to download the data for yourself.
To begin, I downloaded the Danbooru 2020 dataset.
The images used were the 512px downscaled versions - this meant many of the saved images were too small and unusable for StyleGAN (in total, 49,875 of the 1,000,000 images were used), but it saved on disk space and download time.
Since rsync wasn't available for Windows, I used cwRsync Client instead:
The next day, I realised that the download had failed overnight, so I used WSL with rsync --files-from= to download the remaining images (cwRsync had bugs accepting Windows file paths). The files list was obtained from the same rsync server. I also realised that it would be a lot easier to use WSL for the rest of the project.
Once the images were downloaded, I modified Gwern's script to crop all the images into square portraits:
And scaled them to 256x256:
img = img.resize((256,256))
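The crop-and-resize step can be sketched as follows. This is a minimal Pillow example, not Gwern's actual script (which crops around detected face boxes) - the function name and the simple center-crop logic are my own:

```python
from PIL import Image

def square_crop_resize(img, size=256):
    """Center-crop a PIL image to a square, then resize to size x size."""
    w, h = img.size
    side = min(w, h)                       # largest square that fits
    left = (w - side) // 2
    top = (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((size, size), Image.LANCZOS)
```

LANCZOS resampling is used here because it tends to preserve fine line-art detail better than the default filter when downscaling.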
These portraits were checked for grayscale using code I "borrowed" from Stack Overflow (flagged images were moved out for manual checking):
The value MSE_cutoff=200 could be changed, although higher numbers gave more false positives (which were fine since they were manually checked afterwards).
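A minimal reconstruction of that grayscale check (my own sketch of the Stack Overflow approach, not the exact borrowed code): it measures how far each pixel's R/G/B values stray from their per-pixel mean, and flags the image as grayscale when the mean squared difference falls below the cutoff.

```python
import numpy as np
from PIL import Image

def is_grayscale(img, mse_cutoff=200):
    """Flag near-grayscale images: if each pixel's R/G/B values barely
    differ from their per-pixel mean, the image is effectively monochrome."""
    arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    mean = arr.mean(axis=2, keepdims=True)     # per-pixel channel mean
    mse = ((arr - mean) ** 2).mean()
    return mse < mse_cutoff
```

A pure gray image scores an MSE of 0, while saturated colours score in the thousands, so a cutoff of 200 mostly catches monochrome and lightly-tinted images.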
I also used findimagedupes to find similar images, and exported these to a text file, where I could move them out:
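The move-out step can be sketched like this, assuming the exported text file holds one space-separated group of matching paths per line (the usual findimagedupes output shape); `move_duplicates` is a hypothetical helper of my own, and paths containing spaces would need a different format:

```python
import shutil
from pathlib import Path

def move_duplicates(list_file, dest_dir):
    """Keep the first file of each duplicate group and move the rest
    out of the dataset into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    for line in Path(list_file).read_text().splitlines():
        group = line.split()               # one group of matches per line
        for dup in group[1:]:              # keep the first image of each group
            src = Path(dup)
            if src.exists():
                shutil.move(str(src), dest / src.name)
                moved.append(src.name)
    return moved
```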
In total, of the 650,000 images downloaded from Danbooru, only ~35,000 were suitable for training.
Once these images were all 256x256, I upscaled them to 512x512 using waifu2x.
To test the dataset and make sure everything could run properly, I ran StyleGAN 2 ADA and set it to train on 20,000 images.
After cloning the GitHub repo, I ran train.py without installing the required dependencies. Whoops!
After installing the dependencies and starting training, my RTX 2060 Super quickly ran out of VRAM (as expected!). Changing the batch size fixed this, but it didn't fix the horribly long training time, so I downscaled all 512x512 images to 256x256 using Python, with the intent of upscaling the 256x256 outputs back to higher resolutions later.
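The batch downscale takes only a few lines of Pillow. A sketch with assumed directory paths and PNG filenames, not my exact script:

```python
from pathlib import Path
from PIL import Image

def downscale_dir(src_dir, dst_dir, size=256):
    """Resize every PNG in src_dir to size x size, writing copies to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for path in Path(src_dir).glob("*.png"):
        with Image.open(path) as img:
            img.resize((size, size), Image.LANCZOS).save(dst / path.name)
```

Writing to a separate directory keeps the 512x512 originals intact in case the higher resolution is needed again later.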
I also noticed a few non-faces appearing in the dataset (1 in 100?), which was below the 5% expected error rate for Nagadomi's anime face detector. While I removed some by hand, quite a few remained. I decided this was acceptable, mainly because I was not prepared to manually scan through all 20,000 images to confirm that they were all faces.
The training rate was ~55s/kimg and StyleGAN was trained for 6 hours.
An example of a non-face that managed to slip through the anime face detector.
Update: this image appeared frequently throughout the dataset, for whatever reason.
Images produced at the end of the test run. They were all similarly drawn and low quality - however, this showed that I had set everything up correctly, so I could continue to the final training.
With the test run successful, I added the remaining ~15,000 images and re-ran the duplicate and size checks, with a very brief manual look through the dataset. The dataset was 3.22GB in size, with 33,357 images in total.
Once everything was complete, the final dataset was copied and the StyleGAN dataset was created with dataset_tool.py.
Training started with the following command:
After 12 hours of training, I realised I could effectively double the available dataset by adding --mirror=1 to the training command, so I updated the command and resumed on the previous model:
Later, I tried moving training to Google Colab for a possible speed boost, but I found that the Tesla T4 assigned was still slower than my 2060 Super, averaging 75.8 sec/kimg compared to an average of 62 sec/kimg from my 2060 over the entire day (an average inflated by other GPU usage, mostly gaming).
There is a possibility that the P100s are faster than my 2060 Super, but they are very hard to obtain. Combining this with the fact that the VM resets every 12 hours, I decided it would be best to train on my local machine.
Adding additional data
Almost 5 days into training, I got bored and decided to download the remaining ~400,000 images. Including cleaning, this process only took a day and increased the dataset from 33,000 to almost 50,000 images. Luckily, I could simply resume training of the previous model with the new dataset. The combined dataset was also re-checked for non-faces by running the Anime Face Detector again.
After 10 days and 18 hours of training, the final model finished training. I decided to end training here as the results were barely improving, ending on a value of 9.26 for the fid50k_full metric.
Update: although the network stopped training at a value of 9.26, this was decently higher (worse, since lower FID is better) than the 7.3 seen 3 days before the end of training. Overall, it appears that the latest model leans more towards purple colours, differing from the more yellow (and natural?) tones of the older model:
However, I wanted to share the latest model I had available, which in this case was not necessarily the best. This serves as a notice for any future projects: more training does not always mean a better model. I noticed this after writing the results section and generating images, so all images seen here, as well as those on my results page, use the latest network snapshot.
The latest network snapshot does seem to handle non-modern artstyles better compared to the earlier model.
Impressively, the network was able to generate a wide range of different styles, recreating the many unique art styles seen in the Danbooru dataset:
While the images could fool someone on first glance, they usually fell apart on closer inspection:
The network generally tripped up on the same features, commonly ears, clothing, hair ties and eye distances.
At other times, the network missed completely:
Overall, however, the network did a decent job, with the majority of the images having only small defects. A hand-selected few were marginally better than the rest.
With my low expectations, I found the results from the trained network impressive. However, I thought that it might be possible to refine the generated art style by feeding the network a series of hand-selected images.
After upgrading to the Windows 11 insider build, training started to encounter errors on startup. While this was to be expected of an early, unsupported insider build, it was unfortunate: the refined-dataset model would have to be suspended until a fix was found.