Subpixel upscaling/reshuffle layers are frequently used in image-enlargement networks, though their applications extend far beyond that. During development of v1 of my Neural Enlarge application, I spent a year experimenting with various approaches in search of better results. I had an impressive early model that implemented subpixel reshuffle, but it was slow in TensorFlow for Python, and in TensorFlow.js it was unusable: it crashed every time, and I could not find a solution. So I moved to standard deconvolution layers for upscaling.

In the development of v2, I decided to revisit subpixel upscaling to see if I could improve the algorithm. Let's take a look at how it works.

How it works

The idea for subpixel upscaling comes from Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network by Shi et al., in a process the paper calls periodic shuffling, or "phase shift". The phase shift operator PS rearranges an H × W × C·r² tensor T into an rH × rW × C tensor:

PS(T)[x, y, c] = T[⌊x/r⌋, ⌊y/r⌋, C·r·mod(y, r) + C·mod(x, r) + c]

where r is the upscale factor and C is the number of output channels.

Simple, right?


Yeah, I am more of a visual learner as well, so I threw together some illustrations that might make more sense for the non-mathematicians out there.

The basic principle is to take pixels from separate channels to expand the spatial size of the layer. For example, with a 2x upscale, a layer of shape [1, 128, 128, 12] becomes one of shape [1, 256, 256, 3]. Each output channel draws from 4 input channels, forming 4-pixel clusters in which every pixel comes from the same position in a different channel. Here is a tinted version of the above illustration that better shows the rearrangement of the pixels.

Tinted colors to demonstrate the final arrangement of the pixels.
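To make the mapping concrete, here is a minimal NumPy sketch of the rearrangement (the function name and layout are my own; it mirrors the NHWC behavior of tf.nn.depth_to_space, where each group of r·r input channels becomes an r-by-r block of output pixels):

```python
import numpy as np

def subpixel_reshuffle(x, r):
    """Rearrange a [N, H, W, C*r*r] tensor into [N, H*r, W*r, C].

    Output pixel (h*r + i, w*r + j) of channel c is copied from
    input pixel (h, w) of channel (i*r + j)*C + c, matching the
    NHWC layout of tf.nn.depth_to_space.
    """
    n, h, w, c_in = x.shape
    assert c_in % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_in // (r * r)
    out = np.empty((n, h * r, w * r, c), dtype=x.dtype)
    for i in range(r):          # row offset within each pixel cluster
        for j in range(r):      # column offset within each cluster
            src = (i * r + j) * c
            out[:, i::r, j::r, :] = x[:, :, :, src:src + c]
    return out

# A [1, 128, 128, 12] layer becomes [1, 256, 256, 3] at 2x, as above.
x = np.random.rand(1, 128, 128, 12).astype(np.float32)
print(subpixel_reshuffle(x, 2).shape)   # (1, 256, 256, 3)
```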

As you can probably imagine, splitting every pixel up and rearranging it this way takes a fair amount of computation, with many splits, concatenations, and transpositions nested inside loops.

For example, Tetrachrome lays out an implementation of this on GitHub in subpixel: A subpixel convolutional neural network implementation with Tensorflow. Their TensorFlow implementation is built from exactly those splits, transpositions, and concatenations.

Let’s Make it Better

I spent a few days working the problem, trying to find a way to avoid all of the splits and loops, which are the primary cause of the computational slowdown. Eventually, I reached the conclusion that we have been overthinking the problem. One can achieve identical results with a deconvolution layer with a kernel size of (upscale, upscale) and a stride of (upscale, upscale). The only thing left to do is to create a custom kernel that achieves the same effect and test it.


After some thinking, testing, hair pulling, debugging, and keyboard smashing, I eventually arrived at a kernel that does the job.
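The article's original kernel-building code is not reproduced here, but the idea as described can be sketched in NumPy (all names are my own). The kernel layout follows TF's conv2d_transpose filter convention of [height, width, out_channels, in_channels]; because the kernel size equals the stride, the output blocks never overlap, so a single einsum can stand in for the transposed convolution:

```python
import numpy as np

def subpixel_deconv_kernel(r, c_out, dtype=np.float32):
    """Constant kernel that makes an (r, r)-kernel, (r, r)-stride
    transposed convolution reproduce the subpixel reshuffle.
    Layout follows tf.nn.conv2d_transpose filters:
    [height, width, out_channels, in_channels]."""
    k = np.zeros((r, r, c_out, c_out * r * r), dtype=dtype)
    for i in range(r):
        for j in range(r):
            for c in range(c_out):
                # each pixel in the r-by-r output block copies
                # exactly one input channel
                k[i, j, c, (i * r + j) * c_out + c] = 1.0
    return k

def deconv_shuffle(x, r):
    """Apply the kernel as a stride-r transposed convolution.
    With kernel size equal to stride, blocks never overlap, so the
    whole op collapses to one einsum plus a reshape."""
    n, h, w, c_in = x.shape
    c_out = c_in // (r * r)
    k = subpixel_deconv_kernel(r, c_out, dtype=x.dtype)
    # out[n, h, i, w, j, c] = sum over input channels of x * k
    y = np.einsum('nhwq,ijcq->nhiwjc', x, k)
    return y.reshape(n, h * r, w * r, c_out)
```

In TensorFlow, the same constant kernel can be handed to a standard transposed-convolution op with strides equal to the upscale factor, or baked into the model's weights so it never has to be rebuilt.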

It outputs the same result as the other methods but works with a standard deconvolution layer. The weights can easily be added to a model to avoid having to rebuild the kernel each time.

Test It

I ran a rough test on an NVIDIA 1060 6GB. The test measures nothing but the time spent in the subpixel layers; it does not account for compilation time or GPU transfer time, so take it as the rough test it is. If anyone would like to run better tests, please let me know. Here is the code.
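The original benchmark script is not included here; this is a rough stand-in in the same spirit (average wall-clock time of the layer alone, warm-up excluded), comparing a slice-heavy shuffle against a single reshape/transpose in NumPy. The shapes and function names are mine, not the original test's:

```python
import time
import numpy as np

def time_fn(fn, x, repeats=20):
    """Average wall-clock time of fn(x), excluding one warm-up call.
    Like the original test, this ignores compilation and transfer."""
    fn(x)
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn(x)
    return (time.perf_counter() - t0) / repeats

def shuffle_slices(x, r=2):
    # many small slice copies, one per position in the r-by-r cluster
    n, h, w, c = x.shape
    co = c // (r * r)
    out = np.empty((n, h * r, w * r, co), dtype=x.dtype)
    for i in range(r):
        for j in range(r):
            src = (i * r + j) * co
            out[:, i::r, j::r, :] = x[:, :, :, src:src + co]
    return out

def shuffle_reshape(x, r=2):
    # one reshape/transpose, no per-cluster bookkeeping
    n, h, w, c = x.shape
    co = c // (r * r)
    return (x.reshape(n, h, w, r, r, co)
             .transpose(0, 1, 3, 2, 4, 5)
             .reshape(n, h * r, w * r, co))

x = np.random.rand(10, 32, 32, 64).astype(np.float32)
print(f"slices:  {time_fn(shuffle_slices, x) * 1e3:.3f} ms")
print(f"reshape: {time_fn(shuffle_reshape, x) * 1e3:.3f} ms")
```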

The Results

Input Shape          Scale   Original   New Method
(1, 8, 8, 16)        2x      0.0487s    0.0408s
(1, 8, 8, 16)        4x      0.0419s    0.0408s
(10, 8, 8, 16)       2x      0.0458s    0.0408s
(10, 8, 8, 16)       4x      0.0418s    0.0406s
(1, 32, 32, 64)      2x      0.0973s    0.0409s
(1, 32, 32, 64)      4x      0.0548s    0.0399s
(10, 32, 32, 64)     2x      0.0966s    0.0419s
(10, 32, 32, 64)     4x      0.0542s    0.0404s
(1, 96, 96, 512)     2x      1.4646s    0.0591s
(1, 96, 96, 512)     4x      0.3553s    0.0513s
(10, 96, 96, 512)    2x      1.6885s    0.2893s
(10, 96, 96, 512)    4x      0.5836s    0.2724s
(1, 256, 256, 16)    2x      0.1397s    0.0427s
(1, 256, 256, 16)    4x      0.0694s    0.0424s
(10, 256, 256, 16)   2x      0.1844s    0.0908s
(10, 256, 256, 16)   4x      0.1094s    0.0981s

As you can see, our deconvolution layer outperforms the standard subpixel layer in every test. These tests do not include back-propagation or compilation times, where there is also a noticeable improvement.

Why not just train it?

One thing this implementation demonstrates is that a deconvolution layer can achieve the same result with better performance. So why build the kernel at all? Why not just train it? It is a valid point. I am currently training two models. Both include my implementation of the subpixel upscaling layer, but one of them uses the kernel only as a constant initializer and trains from there. It is a little early to tell, but training it actually seems to work better, which, honestly, I sort of expected. However, I still believe there is some magic in forcing the subpixel upsample, though that may be more wishful thinking than anything, because I currently have no evidence to back the idea up.

Use it

If you find this code or information useful, please cite this article in any derivative work. Citation is not required, but it is appreciated. All code presented here is released under the MIT license and is free to use wherever and however you would like. The images are free to use as well but must retain the copyright information.
