Growing up in the nineties and the noughties, I loved Pokemon. Pokemon Red was my first and favourite game, and I easily spent hundreds of hours playing it as a kid despite being too dumb to actually beat it. When I wasn’t trying to catch ’em all, I would make up my own little monsters and draw them with all the artistic talent a five-year-old could muster. Despite this obsession, I had a few qualms with the games. Like, how does one Pokemon suddenly evolve into another? Shouldn’t it be a gradual process as they get stronger? And why is it that when you breed two completely different Pokemon together, the child is identical to the mother rather than a freakish combination of both parents? And beyond that, why were there only 151 Pokemon? Fast forward 20 years, and that Pokemon nerd found a solution to these problems in the form of Generative Adversarial Networks (GANs).
A Brief Background on GANs
Generative Adversarial Networks, for the uninitiated, are a type of neural network first proposed in 2014 that has revolutionized creative AI. Before their invention, neural network-based methods for image generation produced blurry, low-quality pictures, but with the advent of GANs, high-quality, high-resolution image generation was suddenly possible.
Very briefly: GANs consist of two networks, a generator which makes the fake images, and a discriminator which tries to discern real images from fake ones. The two compete in an arms race, both improving over time, with the end goal of a generator that makes images so real-looking that the discriminator (and you!) has no way to tell which are real and which aren’t.
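To make the arms race concrete, here’s a minimal sketch of one training step in PyTorch. This is a generic, illustrative GAN step (the network definitions, optimizer setup, and latent size are placeholders), not the exact code used for this project:

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, latent_dim=16):
    """One generic GAN training step (illustrative sketch). Assumes the
    discriminator outputs a sigmoid probability of shape (batch, 1)."""
    batch_size = real_images.size(0)
    real_labels = torch.ones(batch_size, 1)
    fake_labels = torch.zeros(batch_size, 1)

    # Discriminator step: learn to tell real sprites from generated ones
    z = torch.randn(batch_size, latent_dim)
    fake_images = generator(z).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_labels)
              + F.binary_cross_entropy(discriminator(fake_images), fake_labels))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: learn to produce images the discriminator calls "real"
    z = torch.randn(batch_size, latent_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), real_labels)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```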
In April, I proposed the AutoEncoding Generative Adversarial Network architecture (medium, arxiv), a generalization of CycleGAN, which combines two GANs and two autoencoders. Recently, I thought to myself, “what would happen if I used the AEGAN technique to generate video game sprites using Pokemon as training data?” Let me tell you, some questions are better left unasked.
Endless Forms Most Horrible
I’d always wondered what would happen if you left two dittos at the daycare and let them get to work. Unfortunately, now I know.
Here’s a handful of the unfortunate creatures my GAN spat into existence (if you’re on mobile and they look blurry, check them out here instead):
Despite their obvious deformities, many of these abominations have distinct body segments (arms, legs, heads) and they all have vibrant, diverse colours. Some even have eyes!
Evolution
In the games, most Pokemon evolve through two or three distinct forms. Going from one form to the next is an abrupt process, and I always found it disappointing that we can’t watch a Charmander gradually transform into a Charizard as it grows stronger. But, using this AEGAN, we can!
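One way to produce these gradual evolutions with an AEGAN is to encode two sprites and then decode points along the straight line between their latent codes. Here’s a rough sketch of that idea (function names and tensor shapes are illustrative, not the exact code used here):

```python
import torch

@torch.no_grad()
def interpolate_sprites(generator, encoder, sprite_a, sprite_b, steps=10):
    """Encode two sprites, then decode evenly spaced points along the straight
    line between their latent codes, giving a gradual morph from one to the other."""
    z_a = encoder(sprite_a.unsqueeze(0))    # (1, latent_dim)
    z_b = encoder(sprite_b.unsqueeze(0))
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1 - t) * z_a + t * z_b         # linear interpolation in latent space
        frames.append(generator(z).squeeze(0))
    return torch.stack(frames)              # (steps, 3, 96, 96)
```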
I’ll be the first to admit that these sprites aren’t exactly game-ready. Still, while working on this project I found a number of tricks that greatly improved the quality of the output, and I’d like to share them here in case you’re interested in applying them yourself (and have access to a bigger dataset…). But before we get to that, here are a few more of my favourite little freaks:
AEGAN
The AEGAN architecture (described here), although slower to train than a standard DCGAN, made for much faster convergence. After only 30 epochs across the ~1600 sprites, the output was distinctly Pokemon-shaped.
Training the same generator without the AEGAN scaffolding took over 100 epochs to reach similar quality (if it ever did). I recommend you try the AEGAN technique: it stabilizes training by preventing mode collapse and allows the generator to quickly beeline to a useful section of parameter space. I timeboxed this little project, but if I had world enough and time, I’m sure it would also allow the generator to reproduce Pokemon from the original dataset.
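For the curious, here’s roughly how the pieces fit together: an image generator, a latent encoder, and a discriminator for each of the two spaces, tied together by two reconstruction losses. This is a simplified sketch of the generator-side objective only (the loss choices and weights here are illustrative); see the linked write-up for the real details:

```python
import torch
import torch.nn.functional as F

def aegan_generator_objective(G, E, D_x, D_z, real_images, latent_dim=16):
    """Simplified sketch of an AEGAN-style generator/encoder objective:
    two adversarial terms (one over images, one over latent vectors) plus
    two autoencoding reconstruction terms. Details here are illustrative."""
    batch = real_images.size(0)
    z = torch.randn(batch, latent_dim)

    fake_images = G(z)            # Z -> X: generate sprites from noise
    encoded = E(real_images)      # X -> Z: encode real sprites into latent space

    # Adversarial terms: fool the image discriminator and the latent discriminator
    adv_x = F.binary_cross_entropy(D_x(fake_images), torch.ones(batch, 1))
    adv_z = F.binary_cross_entropy(D_z(encoded), torch.ones(batch, 1))

    # Autoencoding terms: X -> Z -> X and Z -> X -> Z should both reconstruct
    recon_x = F.l1_loss(G(E(real_images)), real_images)
    recon_z = F.l1_loss(E(G(z)), z)

    return adv_x + adv_z + recon_x + recon_z
```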
ColourPicker
The biggest increase in quality came from what I call the ColourPicker technique. Rather than ending the generator with the typical transposed convolutional layer of channel depth 3 (red, green, and blue), my final layer had a channel depth of 16, with a softmax activation applied along the channel dimension. This forced the generator to pick one of 16 indices for each pixel. Elsewhere in the network, the generator produced 16 colours (essentially a palette), and this palette was multiplied against the 16-deep index layer. This was done three times, separately, to create the red, green, and blue output channels, which allowed colours to be easily coordinated across the output image.
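In code, the idea looks something like this. It’s a simplified sketch in PyTorch with assumed tensor shapes and feature sources, not the exact layer from this project:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ColourPickerHead(nn.Module):
    """Sketch of a ColourPicker-style output head: each pixel softly selects one
    of n_colours palette entries, and the palette itself is predicted from a
    global feature vector produced elsewhere in the generator."""

    def __init__(self, feat_channels, global_dim, n_colours=16):
        super().__init__()
        self.index_conv = nn.Conv2d(feat_channels, n_colours, kernel_size=1)
        self.palette_fc = nn.Linear(global_dim, n_colours * 3)  # 16 RGB colours
        self.n_colours = n_colours

    def forward(self, spatial_feats, global_feats):
        # spatial_feats: (B, C, 96, 96); global_feats: (B, D)
        index = F.softmax(self.index_conv(spatial_feats), dim=1)   # (B, 16, 96, 96)
        palette = torch.sigmoid(self.palette_fc(global_feats))     # (B, 48)
        palette = palette.view(-1, self.n_colours, 3)              # (B, 16, 3)
        # Each output pixel is a convex combination of the 16 palette colours,
        # computed separately for the red, green, and blue channels.
        rgb = torch.einsum("bkhw,bkc->bchw", index, palette)       # (B, 3, 96, 96)
        return rgb
```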
Noisy Discriminators
GANs are notoriously unstable during training. Although the AEGAN scaffolding went a long way towards increasing stability, it wasn’t until I started adding multiplicative Gaussian noise to nearly every layer of the discriminators that I could stop worrying about waking up to a model that had completely broken during its overnight training.
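The noise layer itself is tiny. Here’s a sketch of the kind of layer I mean, slotted between a discriminator’s convolution and its activation (the sigma value here is illustrative):

```python
import torch
import torch.nn as nn

class MultiplicativeGaussianNoise(nn.Module):
    """Multiplies activations by (1 + sigma * standard normal noise) during
    training; acts as an identity at inference time."""

    def __init__(self, sigma=0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x):
        if not self.training:
            return x
        return x * (1.0 + self.sigma * torch.randn_like(x))

# Example: interleaving the noise between discriminator layers
discriminator_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    MultiplicativeGaussianNoise(sigma=0.1),
    nn.LeakyReLU(0.2),
)
```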
Data Augmentation
The dataset I was working with was tiny. All in all, there are about 600 unique images; add the shinies (glorified palette-swaps) and the occasional female variants (minor changes most people wouldn’t notice) and you’ve got about 1600 images. That’s way too few to properly train a GAN to produce 96x96x3 images. Still, I tried, and applied augmentation to artificially increase the dataset size (there’s a sketch of the pipeline after the list below). This included:
- randomly translating the image horizontally and/or vertically by up to 5 pixels
- randomly rotating the image’s hue
- randomly flipping the image in the horizontal dimension
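Here’s what that augmentation pipeline might look like with torchvision transforms, assuming 96x96 RGB sprites loaded as PIL images (the exact parameters are illustrative):

```python
from torchvision import transforms

augment = transforms.Compose([
    # shift horizontally and/or vertically by up to ~5 pixels (5/96 of the image size)
    transforms.RandomAffine(degrees=0, translate=(5 / 96, 5 / 96)),
    # rotate the hue by a random amount around the colour wheel
    transforms.ColorJitter(hue=0.5),
    # mirror the sprite left-to-right half the time
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```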
Use a Small Latent Dimension
You don’t know how big 100-dimensional space is. Heck, you probably don’t know how big 4-dimensional space is. Our brains just aren’t wired for that. But think about it this way: if you’re using a 100D latent space (as many tutorials suggest), and you want at least one image per “quadrant” of that space (i.e., binning each dimension into two bins, positive and negative), you would need 2^100 images (over a billion billion billion), and even then the space would still be incredibly sparse. Mapping such a huge latent space to the relatively few images you have is a big ask, and you’d likely do better with a much smaller latent space. For this project, I only used 16 dimensions.
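If you want to see just how fast that blows up, the arithmetic is easy to check:

```python
# Number of "quadrants" (sign patterns) in a d-dimensional latent space: 2 ** d
for d in (4, 16, 100):
    print(f"{d:>3} dims -> {2 ** d:,} quadrants")
#   4 dims -> 16 quadrants
#  16 dims -> 65,536 quadrants
# 100 dims -> 1,267,650,600,228,229,401,496,703,205,376 quadrants
```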