What are Fully-Connected Layers (FCN) in Convolutional Neural Networks (CNN)?


Recently, during a discussion with a colleague about his CNN architecture for a remote sensing image fusion task, he mentioned something interesting. Specifically, his network used the FCN implementations keras.layers.Dense and torch.nn.Linear, and the input to the FCN is a 2D image with many channels, of size (160, 160, channels). Traditionally, I thought that to pass through an FCN layer, the neuron number of the first FCN layer in this case should be 160 * 160 * channels, which basically means flattening the volume into a 1-D array and feeding it into a traditional neural network. Therefore, I argued that his network should be really hard to train, because with such a large number of neurons per layer, the whole FCN would have millions of parameters, just like VGG, whose FCN layers have the following parameter counts:

layer   # parameters
1       7*7*512*4096 + 4096 = 102,764,544
2       4096*4096 + 4096    =  16,781,312
3       4096*1000 + 1000    =   4,097,000
Total   approx. 123.64M
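
As a quick sanity check of this arithmetic (a sketch of mine, not part of the original discussion), the counts can be reproduced in a few lines of Python:

# Parameter counts of the three VGG FCN layers:
# each has in_features * out_features weights plus out_features biases.
fc_shapes = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]
counts = [n_in * n_out + n_out for n_in, n_out in fc_shapes]
for i, c in enumerate(counts, 1):
    print(f"layer {i}: {c:,}")     # 102,764,544 / 16,781,312 / 4,097,000
print(f"total: {sum(counts) / 1e6:.2f}M")  # approx. 123.64M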

However, my colleague told me that his FCN does not have such a large number of parameters. Specifically, in his code he used Dense(channel#1) and Dense(channel#2) instead of Dense(160*160*channel#1) and Dense(160*160*channel#2). Therefore, his network only has on the order of channel#1 * channel#2 parameters, which is significantly less than 160*160*160*160*channel#1*channel#2. This made me wonder what dense layers are actually computing.
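
A minimal sketch of what his code does (with illustrative channel counts of my own, not the real ones):

from keras.models import Sequential
from keras.layers import Dense

c1, c2 = 64, 32  # illustrative stand-ins for channel#1 and channel#2
model = Sequential()
model.add(Dense(c2, input_shape=(160, 160, c1)))  # Dense applied to a volume
model.summary()
# The layer reports only c1*c2 + c2 = 64*32 + 32 = 2,080 parameters,
# and its output shape is (None, 160, 160, 32): the spatial axes survive.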

By digging around on the internet, I found a quote from Yann LeCun:

In Convolutional Nets, there is no such thing as “fully-connected layers”. There are only convolution layers with 1x1 convolution kernels and a full connection table.

It’s a too-rarely-understood fact that ConvNets don’t need to have a fixed-size input. You can train them on inputs that happen to produce a single output vector (with no spatial extent), and then apply them to larger images. Instead of a single output vector, you then get a spatial map of output vectors. Each vector sees input windows at different locations on the input.

In that scenario, the “fully connected layers” really act as 1x1 convolutions.

This quote is not very explicit, but what LeCun is saying is that in a CNN, if the input to the FCN is a volume instead of a vector, the FCN really acts as a 1x1 convolution, which only convolves along the channel dimension and preserves the spatial extent. But this is not quite the picture in the VGG case: in VGG, the input to the FCN is a 7*7*512 volume, so what the first FCN layer does is equivalent to a 7x7 convolution that shrinks the volume to a 1x1 vector, and the remaining layers then do 1x1 convolutions on that vector.
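
This equivalence is easy to verify by counting parameters (a sketch of mine, assuming a channels-last (7, 7, 512) input):

from keras.models import Sequential
from keras.layers import Dense, Conv2D

# A dense layer on a volume and a 1x1 convolution hold identical weights:
dense_model = Sequential([Dense(4096, input_shape=(7, 7, 512))])
conv1x1_model = Sequential([Conv2D(4096, (1, 1), input_shape=(7, 7, 512))])
print(dense_model.count_params())    # 512*4096 + 4096 = 2,101,248
print(conv1x1_model.count_params())  # same: 2,101,248

# VGG's first FCN layer, by contrast, corresponds to a full 7x7 convolution:
conv7x7_model = Sequential([Conv2D(4096, (7, 7), input_shape=(7, 7, 512))])
print(conv7x7_model.count_params())  # 7*7*512*4096 + 4096 = 102,764,544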

To further explain what the dense layer does in CNN, let’s see an example in Keras:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(16, input_shape=(3, 2)))  # note: rank-2 input, no flattening
model.add(Activation('relu'))
model.add(Dense(4))
model.compile(loss='mean_squared_error', optimizer='SGD')
print(model.weights)

By running this example code, we can get the weights of the model as follows:

[<tf.Variable 'dense_13/kernel:0' shape=(2, 16) dtype=float32_ref>,
 <tf.Variable 'dense_13/bias:0' shape=(16,) dtype=float32_ref>,
 <tf.Variable 'dense_14/kernel:0' shape=(16, 4) dtype=float32_ref>,
 <tf.Variable 'dense_14/bias:0' shape=(4,) dtype=float32_ref>]

In this example, an input tensor of size (3, 2) is passed through a dense layer with 16 neurons, and then through another dense layer with 4 neurons. In a traditional neural network, we might expect the first layer to have 3 * 2 * 16 = 96 weights, as each neuron would be connected to all 3 * 2 = 6 inputs, and the next layer to have 16 * 4 = 64 weights. However, the printed kernel of the first dense layer has shape (2, 16), not (2*3, 16). That is because Dense only operates on the last axis of its input: the same (2, 16) kernel is applied to each of the 3 rows independently, producing an output of shape (3, 16).
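
In plain NumPy, what the first dense layer computes on a rank-2 input looks roughly like this (my illustration of the broadcasting, not the Keras source):

import numpy as np

x = np.random.rand(3, 2)   # rank-2 input, as in the model above
W = np.random.rand(2, 16)  # the (2, 16) kernel
b = np.random.rand(16)     # the (16,) bias
y = x @ W + b              # the same kernel is applied to every row
print(y.shape)             # (3, 16): the leading axis is untouched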

Let’s see another example:

from keras.models import Sequential
from keras.layers import Dense, Activation, Flatten
model = Sequential()
model.add(Flatten(input_shape=(3,2)))  # Flatten the input before feeding it into the dense layer
model.add(Dense(16))
model.add(Activation('relu'))
model.add(Dense(4))
model.compile(loss='mean_squared_error', optimizer='SGD')
print(model.weights)

This code snippet prints out:

[<tf.Variable 'dense_15/kernel:0' shape=(6, 16) dtype=float32_ref>,
 <tf.Variable 'dense_15/bias:0' shape=(16,) dtype=float32_ref>,
 <tf.Variable 'dense_16/kernel:0' shape=(16, 4) dtype=float32_ref>,
 <tf.Variable 'dense_16/bias:0' shape=(4,) dtype=float32_ref>]

Here we can see that by flattening the input to a 1-D array of length 6 before feeding it to the dense layers, the weight matrix of the first dense layer becomes (6, 16), matching the traditional fully-connected picture.
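
The difference also shows up in the output shapes (a quick check of my own, run right after the snippet above):

print(model.output_shape)  # (None, 4): the flattened model maps (3, 2) to (4,)
# Without Flatten, the earlier model mapped (3, 2) to (3, 4):
# its dense layers were applied to each of the 3 rows independently.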

Conclusion

What the FCN layers in a CNN really compute depends on the input shape:

  • If the input is a 1-D vector, such as the output of the first VGG FCN layer (a 1x1x4096 vector), the dense layers are the same as the hidden layers in a traditional neural network (multi-layer perceptron).
  • If the input rank is higher than 1, for example an image volume, the dense layer actually does something similar to a 1x1 convolution: the same weights are applied independently at each spatial location, as the PyTorch sketch below shows.
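
The same behavior holds for torch.nn.Linear, which my colleague used alongside Dense; a minimal PyTorch sketch (with illustrative sizes of my own):

import torch
import torch.nn as nn

fc = nn.Linear(64, 32)            # weight: (32, 64), bias: (32,)
x = torch.randn(1, 160, 160, 64)  # a channels-last image volume
y = fc(x)                         # applied along the last axis only
print(y.shape)                    # torch.Size([1, 160, 160, 32])
print(sum(p.numel() for p in fc.parameters()))  # 64*32 + 32 = 2080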