Subpixel Upscaling/Reshuffle Layers are frequently used in image enlargement networks, though their applications extend far beyond that. During the development of v1 of my Neural Enlarge application, I spent a year experimenting, chasing better and better results. I had an early model that was impressive and used a subpixel reshuffle, but it was slow in TensorFlow for Python, and in TensorFlow.js it was unusable: it crashed every time, and I could not find a solution. So I moved to standard deconvolution layers for upscaling.

In the development of v2, I decided to revisit subpixel upscaling to see if I could improve the algorithm. Let's take a look at how it works.

## How it works

The idea for subpixel upscaling comes from *Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network* by Shi et al., in a process referred to as "phase shift". The equation for phase shifting (as given in the paper, with `r` the upscale factor and `C` the number of output channels) is:

$$\mathcal{PS}(T)_{x,y,c} = T_{\lfloor x/r \rfloor,\ \lfloor y/r \rfloor,\ C \cdot r \cdot \operatorname{mod}(y,r) + C \cdot \operatorname{mod}(x,r) + c}$$

Simple, right?

Yeah… I am more of a visual learner as well, so I threw together some illustrations that might make more sense for the non-mathematicians out there.

The basic principle is to take pixels from separate channels to expand the spatial size of the layer. For example, with a 2x upscale, a layer with a shape of [1, 128, 128, 12] becomes a shape of [1, 256, 256, 3]. Each channel of the output is built from 4 channels of the input, forming 2×2 pixel clusters in which every pixel comes from the same input pixel but a different channel. Here is a tinted version of the above illustration that better shows the rearrangement of the pixels.
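
To make the rearrangement concrete, here is a minimal NumPy sketch of the shuffle. This is an illustration, not the TensorFlow code used later; the channel-to-offset ordering shown matches the kernel built later in this article, but other implementations order the channels differently.

```python
import numpy as np

def pixel_shuffle(t, r):
    """Rearrange (b, h, w, C*r*r) -> (b, h*r, w*r, C).

    Ordering assumed here: channel c*r*r + x_off*r + y_off of an input
    pixel becomes the (y_off, x_off) pixel of the output cluster.
    """
    b, h, w, ch = t.shape
    C = ch // (r * r)
    t = t.reshape(b, h, w, C, r, r)    # axes: b, h, w, c, x_off, y_off
    t = t.transpose(0, 1, 5, 2, 4, 3)  # axes: b, h, y_off, w, x_off, c
    return t.reshape(b, h * r, w * r, C)

x = np.arange(16).reshape(1, 2, 2, 4)  # one output channel, r = 2
y = pixel_shuffle(x, 2)                # shape (1, 4, 4, 1)
```

Each input pixel's four channels land in a 2×2 cluster of the output, so the four channels of `x[0, 0, 0]` become the top-left 2×2 block of `y`.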

As you can probably imagine, splitting every pixel up and rearranging it in this fashion takes some computation: a lot of splits, concatenations, and transpositions nested inside loops.

For example, Tetrachrome lays out an implementation of this on GitHub in subpixel: A subpixel convolutional neural network implementation with Tensorflow. Their TensorFlow implementation (using the pre-1.0 `tf.split`/`tf.concat` argument order) looks like this:

```python
def _phase_shift(I, r):
    # Helper function with main phase shift operation
    bsize, a, b, c = I.get_shape().as_list()
    X = tf.reshape(I, (bsize, a, b, r, r))
    X = tf.transpose(X, (0, 1, 2, 4, 3))  # bsize, a, b, 1, 1
    X = tf.split(1, a, X)  # a, [bsize, b, r, r]
    X = tf.concat(2, [tf.squeeze(x) for x in X])  # bsize, b, a*r, r
    X = tf.split(1, b, X)  # b, [bsize, a*r, r]
    X = tf.concat(2, [tf.squeeze(x) for x in X])  # bsize, a*r, b*r
    return tf.reshape(X, (bsize, a*r, b*r, 1))


def PS(X, r, color=False):
    # Main OP that you can arbitrarily use in you tensorflow code
    if color:
        Xc = tf.split(3, 3, X)
        X = tf.concat(3, [_phase_shift(x, r) for x in Xc])
    else:
        X = _phase_shift(X, r)
    return X
```
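
Worth noting: modern TensorFlow ships this rearrangement as a built-in op, `tf.nn.depth_to_space`. Its channel ordering differs from the kernel built later in this article (it reads the depth as (row_offset, col_offset, channel)), but it performs the same class of shuffle. A quick sketch:

```python
import tensorflow as tf

# A single input pixel with four channels becomes a 2x2 output block.
x = tf.constant([[[[1., 2., 3., 4.]]]])    # shape (1, 1, 1, 4)
y = tf.nn.depth_to_space(x, block_size=2)  # shape (1, 2, 2, 1)
# Channels are read as (row_offset, col_offset) pairs, so the block is:
#   1 2
#   3 4
```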

## Let’s Make it Better

I spent a few days working the problem, trying to find a way to avoid all of the splits and loops, which are the primary cause of the computational slowdown. Eventually, I reached the conclusion that we have been overthinking the problem: one can achieve identical results with a deconvolution layer with a kernel size of *(upscale, upscale)* and a stride of *(upscale, upscale)*. The only thing left to do was to create a custom kernel that achieves the same effect and test it.

After some thinking, testing, hair pulling, debugging, and keyboard smashing, I eventually came up with this.

```python
def subpixel(X, r):
    batch_size, rows, cols, in_channels = X.get_shape().as_list()
    kernel_filter_size = r
    out_channels = int(in_channels // (r * r))

    kernel_shape = [kernel_filter_size, kernel_filter_size, out_channels, in_channels]
    kernel = np.zeros(kernel_shape, np.float32)

    # Build the kernel so that a 4 pixel cluster has each pixel come from a separate channel.
    for c in range(0, out_channels):
        i = 0
        for x, y in itertools.product(range(r), repeat=2):
            kernel[y, x, c, c * r * r + i] = 1
            i += 1

    new_rows, new_cols = int(rows * r), int(cols * r)
    new_shape = [batch_size, new_rows, new_cols, out_channels]
    tf_shape = tf.stack(new_shape)
    strides_shape = [1, r, r, 1]

    out = tf.nn.conv2d_transpose(X, kernel, tf_shape, strides_shape, padding='VALID')
    return out
```
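
To see why the one-hot kernel works, here is a pure-NumPy sketch of a stride-r, VALID-padding transposed convolution applied with that kernel (a hand-rolled check, not the TensorFlow code above): each input pixel is stamped through the r×r kernel into an r×r output block, and since `kernel[y, x, c, c*r*r + x*r + y] = 1`, output pixel (y, x) of the block simply copies one input channel.

```python
import itertools
import numpy as np

def build_subpixel_kernel(r, out_channels):
    # Same one-hot kernel as the subpixel() function above.
    in_channels = out_channels * r * r
    kernel = np.zeros((r, r, out_channels, in_channels), np.float32)
    for c in range(out_channels):
        for i, (x, y) in enumerate(itertools.product(range(r), repeat=2)):
            kernel[y, x, c, c * r * r + i] = 1.0
    return kernel

def naive_conv2d_transpose(X, kernel, r):
    # Minimal NHWC transposed convolution, stride r, VALID padding:
    # every input pixel contributes one kernel-sized block to the output.
    b, h, w, _ = X.shape
    kh, kw, out_ch, _ = kernel.shape
    out = np.zeros((b, h * r, w * r, out_ch), X.dtype)
    for y, x in itertools.product(range(h), range(w)):
        for ky, kx in itertools.product(range(kh), range(kw)):
            out[:, y * r + ky, x * r + kx, :] += X[:, y, x, :] @ kernel[ky, kx].T
    return out

r, C = 2, 2
X = np.random.rand(1, 3, 3, C * r * r).astype(np.float32)
out = naive_conv2d_transpose(X, build_subpixel_kernel(r, C), r)
# out[0, Y, X_, c] equals X[0, Y//r, X_//r, c*r*r + (X_%r)*r + (Y%r)]
```

Because the kernel is one-hot and the stride equals the kernel size, no output position is written twice, so the result is an exact rearrangement with no arithmetic mixing.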

It outputs the same result as the other methods, but uses a standard deconvolution layer. The kernel weights can easily be saved with a model to avoid rebuilding the kernel each time.

## Test It

I ran a rough test on an NVIDIA GTX 1060 6GB. The test does nothing more than time the subpixel layers themselves; it does not account for compilation time or GPU transfer time, so take it for the rough test it is. If anyone would like to run better tests, please let me know. Here is the code.

```python
import tensorflow as tf
import itertools
import numpy as np
import time

num_iterations = 10


# modified to support more channels
def _phase_shift(I, r):
    bsize, a, b, c = I.get_shape().as_list()
    bsize = tf.shape(I)[0]  # Handling Dimension(None) type for undefined batch dim
    X = tf.reshape(I, (bsize, a, b, r, r))
    X = tf.transpose(X, (0, 1, 2, 4, 3))  # bsize, a, b, 1, 1
    X = tf.split(X, a, 1)  # a, [bsize, b, r, r]
    X = tf.concat([tf.squeeze(x, axis=1) for x in X], 2)  # bsize, b, a*r, r
    X = tf.split(X, b, 1)  # b, [bsize, a*r, r]
    X = tf.concat([tf.squeeze(x, axis=1) for x in X], 2)  # bsize, a*r, b*r
    return tf.reshape(X, (bsize, a * r, b * r, 1))


def orig_sub_pixel(X, r):
    batch_size, rows, cols, in_channels = X.get_shape().as_list()
    channels = int(in_channels // (r * r))
    Xc = tf.split(X, num_or_size_splits=channels, axis=3)
    output = tf.concat([_phase_shift(x, r) for x in Xc], axis=3)
    return output


def new_subpixel(X, r):
    batch_size, rows, cols, in_channels = X.get_shape().as_list()
    kernel_filter_size = r
    out_channels = int(in_channels // (r * r))

    kernel_shape = [kernel_filter_size, kernel_filter_size, out_channels, in_channels]
    kernel = np.zeros(kernel_shape, np.float32)

    # Build the kernel so that a 4 pixel cluster has each pixel come from a separate channel.
    for c in range(0, out_channels):
        i = 0
        for x, y in itertools.product(range(r), repeat=2):
            kernel[y, x, c, c * r * r + i] = 1
            i += 1

    new_rows, new_cols = int(rows * r), int(cols * r)
    new_shape = [batch_size, new_rows, new_cols, out_channels]
    tf_shape = tf.stack(new_shape)
    strides_shape = [1, r, r, 1]

    out = tf.nn.conv2d_transpose(X, kernel, tf_shape, strides_shape, padding='VALID')
    return out


def check_match(original, new):
    d = np.max(np.abs(original - new))
    assert(not d > 0.0)


with tf.Graph().as_default(), tf.Session() as sess:
    # build the graph
    noise_shapes = [
        (1, 8, 8, 16),
        (10, 8, 8, 16),
        (1, 32, 32, 64),
        (10, 32, 32, 64),
        (1, 96, 96, 512),
        (10, 96, 96, 512),
        (1, 256, 256, 16),
        (10, 256, 256, 16),
    ]
    placeholders = []
    orig_outputs_2x = []
    new_outputs_2x = []
    orig_outputs_4x = []
    new_outputs_4x = []
    for shape in noise_shapes:
        inputs = tf.placeholder(tf.float32, shape)
        placeholders.append(inputs)
        orig_outputs_2x.append(orig_sub_pixel(inputs, 2))
        new_outputs_2x.append(new_subpixel(inputs, 2))
        orig_outputs_4x.append(orig_sub_pixel(inputs, 4))
        new_outputs_4x.append(new_subpixel(inputs, 4))

    # Generate some noise
    noise_list = []
    for shape in noise_shapes:
        noise_list.append(np.random.normal(0.0, 1.0, shape))

    # warmup
    _ = sess.run(
        [orig_outputs_2x[0], new_outputs_2x[0]],
        feed_dict={placeholders[0]: noise_list[0]})

    for i in range(0, len(noise_shapes)):
        orig_out, new_out = None, None

        # original 2x
        start_time = time.time()
        for r in range(0, num_iterations):
            orig_out = sess.run(
                orig_outputs_2x[i],
                feed_dict={placeholders[i]: noise_list[i]})
        duration = (time.time() - start_time) / num_iterations
        print('orig 2x: {} - avg: {:0.4f}s'.format(
            str(noise_shapes[i]).rjust(17), duration
        ))

        # new 2x
        start_time = time.time()
        for r in range(0, num_iterations):
            new_out = sess.run(
                new_outputs_2x[i],
                feed_dict={placeholders[i]: noise_list[i]})
        duration = (time.time() - start_time) / num_iterations
        check_match(orig_out, new_out)
        print('new 2x: {} - avg: {:0.4f}s\n'.format(
            str(noise_shapes[i]).rjust(17), duration
        ))

        # original 4x
        start_time = time.time()
        for r in range(0, num_iterations):
            orig_out = sess.run(
                orig_outputs_4x[i],
                feed_dict={placeholders[i]: noise_list[i]})
        duration = (time.time() - start_time) / num_iterations
        print('orig 4x: {} - avg: {:0.4f}s'.format(
            str(noise_shapes[i]).rjust(17), duration
        ))

        # new 4x
        start_time = time.time()
        for r in range(0, num_iterations):
            new_out = sess.run(
                new_outputs_4x[i],
                feed_dict={placeholders[i]: noise_list[i]})
        duration = (time.time() - start_time) / num_iterations
        print('new 4x: {} - avg: {:0.4f}s\n'.format(
            str(noise_shapes[i]).rjust(17), duration
        ))
        check_match(orig_out, new_out)
```

## The Results

Input Shape | Scale | Original | New Method |
---|---|---|---|
(1, 8, 8, 16) | 2x | 0.0487s | 0.0408s |
(1, 8, 8, 16) | 4x | 0.0419s | 0.0408s |
(10, 8, 8, 16) | 2x | 0.0458s | 0.0408s |
(10, 8, 8, 16) | 4x | 0.0418s | 0.0406s |
(1, 32, 32, 64) | 2x | 0.0973s | 0.0409s |
(1, 32, 32, 64) | 4x | 0.0548s | 0.0399s |
(10, 32, 32, 64) | 2x | 0.0966s | 0.0419s |
(10, 32, 32, 64) | 4x | 0.0542s | 0.0404s |
(1, 96, 96, 512) | 2x | 1.4646s | 0.0591s |
(1, 96, 96, 512) | 4x | 0.3553s | 0.0513s |
(10, 96, 96, 512) | 2x | 1.6885s | 0.2893s |
(10, 96, 96, 512) | 4x | 0.5836s | 0.2724s |
(1, 256, 256, 16) | 2x | 0.1397s | 0.0427s |
(1, 256, 256, 16) | 4x | 0.0694s | 0.0424s |
(10, 256, 256, 16) | 2x | 0.1844s | 0.0908s |
(10, 256, 256, 16) | 4x | 0.1094s | 0.0981s |

As you can see, the deconvolution layer outperforms the standard subpixel layer in every test. These tests do not include back-propagation or compilation times, where there is a noticeable improvement as well.

## Why not just train it?

One thing this implementation demonstrates is that a deconvolution layer can achieve the same result with better performance. So why build the kernel at all? Why not just train it? It is a valid point. I am currently training two models. Both include my implementation of the subpixel upscaling layer, but one of them uses the kernel only as a constant initializer and trains from there. It is a little early to tell, but training it actually seems to work better, which, honestly, I sort of expected. However, I still believe there is some magic in forcing the subpixel upsample, though that may be more wishful thinking than anything, because I currently have no evidence to back that idea up.
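
For anyone who wants to try the trainable variant, one way to set it up is sketched below using `tf.keras` (this is an illustration, not my actual training code): hand the one-hot kernel to a `Conv2DTranspose` layer as its initializer, so training starts from an exact pixel shuffle and can drift from there.

```python
import itertools
import numpy as np
import tensorflow as tf

r, out_channels = 2, 3
in_channels = out_channels * r * r

# The same one-hot subpixel kernel as above; Keras Conv2DTranspose stores
# its kernel as (kh, kw, filters, input_channels), matching this layout.
kernel = np.zeros((r, r, out_channels, in_channels), np.float32)
for c in range(out_channels):
    for i, (x, y) in enumerate(itertools.product(range(r), repeat=2)):
        kernel[y, x, c, c * r * r + i] = 1.0

layer = tf.keras.layers.Conv2DTranspose(
    out_channels, kernel_size=r, strides=r, padding='valid',
    use_bias=False, kernel_initializer=tf.constant_initializer(kernel))

x = tf.random.normal([1, 8, 8, in_channels])
y = layer(x)  # (1, 16, 16, 3); an exact subpixel shuffle until training moves it
```

The layer's weights remain trainable, so gradient descent is free to keep the shuffle or learn something better.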

## Use it

If you find this code or information useful, please cite this article in any derivative work. Citation is not required, but appreciated. All code presented here is released under the MIT license and is free to use wherever and however you would like. The images are free to use as well, but must retain the copyright information.