
Artificial neural networks are a class of algorithms that try to solve complex problems, starting from the idea that if the human brain can do something, then even an abstraction of the brain can probably solve the same problem.

The effectiveness of these techniques (all different but related) is remarkable, but it comes at a considerable computational cost.

A modern classic among neural networks is the convolutional neural network (CNN for short). This type of network is very effective at recognizing patterns, e.g. in images, and is therefore a good basis for building a classifier.

We don’t want to poorly repeat here what is already written in countless blogs on the internet; instead, we want to analyse the “convolutional” in CNN.

Why convolutional?

In the context of neural networks, the concept of convolution comes from an operation performed between a pair of layers of the network.

Let’s take for example a minimalist CNN:

As in classical neural networks (which resemble the last layer in the example, the “fully connected” one), the goal is to find a way to minimize a certain cost, given input-output pairs during training. This is achieved by adjusting the weights of the neurons using minimum-search algorithms.

Since the number of connections of a fully connected network grows intractably as the inputs become larger and more complex, there are various ways to add constraints and reduce the space to explore.

In the case of CNNs, the first step is to compute a convolution between the input and the filters.

The filters are the cores of the operations on the input: they highlight general features useful for discriminating the output categories. (A dramatic but effective simplification.)

But why convolutional? And what are these filters?

If the input is a matrix (for example, the pixel matrix of an image), the filter is also a matrix (or rather, a tensor).

The filter defines a function, because there is an algebraic operation between the input and the filter. The operation is the same for all images and filters, and it is a convolution!

The image shows a sharpen filter used to highlight the edges of an image. The image matrix (which contains the pixel intensity values) is convolved with the filter matrix.
The first value of the result (say, the first pixel at the top left) is the elementwise product of the first 3×3 square in the image and the filter, summed up. Moving the filter matrix n pixels to the right and m pixels down and repeating the computation gives the pixel at coordinates (n, m) of the output.
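The sliding-and-summing recipe above can be sketched in a few lines of NumPy (a minimal valid-mode convolution, with a made-up 5×5 image standing in for real pixel data):

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2D convolution: slide the kernel over the image,
    multiply elementwise and sum at each position."""
    k = np.flipud(np.fliplr(kernel))  # convolution flips the kernel
    kh, kw = k.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * k)
    return out

# The classic sharpen kernel (symmetric, so flipping does not change it)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]])

# A toy 5x5 "image" with a linear intensity ramp
image = np.arange(25, dtype=float).reshape(5, 5)
result = convolve2d(image, sharpen)  # shape (3, 3)
```

On this smooth ramp the sharpen kernel leaves each pixel unchanged (there are no edges to enhance), which is a quick sanity check of the implementation.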

Convolution is a mathematical operation, a product between similar objects, which can be defined in different ways. The most famous one is often encountered in an analysis course:
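Spelled out, the textbook definition (for real functions of a real variable) reads:

```latex
(f * g)(t) = \int_{-\infty}^{+\infty} f(\tau)\, g(t - \tau)\, d\tau
```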

It is therefore a sum (here, the integral sign) of products, with respect to a certain measure, between a fixed function and a translated one (the operation is symmetric, so the roles are interchangeable).

Implicitly, convolution has been familiar to everyone (or almost everyone) since primary school.

Convolution is hidden (behind the carry-over) in the multiplication algorithm:

In fact, each digit on the second line is multiplied by the digits on the first line; the resulting products are translated and summed (exactly the recipe of the integral definition). There are other definitions (concerning the pleasant relationship between convolution and operators), but this one is enough for us.
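A quick sketch of this idea: multiplying two integers by convolving their digit sequences and then propagating the carries (the helper `multiply_via_convolution` is our own name, not a library function):

```python
import numpy as np

def multiply_via_convolution(a, b):
    """Multiply two non-negative integers by convolving their digit
    sequences, then propagating carries -- the schoolbook algorithm."""
    da = [int(d) for d in str(a)]   # most-significant digit first
    db = [int(d) for d in str(b)]
    col = np.convolve(da, db)       # column sums, before carrying
    result, carry = [], 0
    for c in reversed(col):         # carries run right to left
        carry, digit = divmod(int(c) + carry, 10)
        result.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        result.append(digit)
    return int("".join(map(str, reversed(result))))
```

`np.convolve` does exactly the translate-and-sum step of the integral definition; the carry loop is the only extra bookkeeping the schoolbook algorithm adds.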

The convolution layer does exactly that: from the input matrix and the filter matrix, it builds a new matrix by translating products and summing them.

The filters in this type of neural network are “learned” through backpropagation. They are good filters for the training input precisely because they are modeled on the training data.

The filters are therefore the cores of the operations on images; they are a counterpart of the input (oriented towards the output), and they are also of the same kind as the input.

The last statement means that a filter can (more or less) be displayed as if it were an image.

So the operation on an image is defined by an image (thanks to algebra).

The convolution operation is symmetric, which means that you can reverse it and say that images are operations on filters!
Taking the arithmetic example, multiplication by 11 is equivalent to summing adjacent pairs of digits (i.e. a function).

If we take the number 31452 and pad it with two zeros, one on the right and one on the left, the last sentence becomes clearer: 0 31452 0.

The digits of the result are the sums of the adjacent pairs:

3 = 0+3; 4 = 3+1; 5 = 1+4; 9 = 4+5; 7 = 5+2; 2 = 2+0

But it can also be seen as the action of 31452 on the number 11. In this case the operation being carried out has no simple interpretation, but it is still a valid expression.
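Both readings can be checked directly with `np.convolve` (a sketch: the zero-padding of the worked example is exactly what “full” convolution does by default):

```python
import numpy as np

digits = [3, 1, 4, 5, 2]             # the digits of 31452
pairs = np.convolve(digits, [1, 1])  # sums of adjacent digits, zero-padded
# pairs holds the digits of 31452 * 11

# Convolution is symmetric: the "image" acting on the "filter"
# gives exactly the same result.
same = np.convolve([1, 1], digits)
assert np.array_equal(pairs, same)
```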

In deep neural networks the layers (here, input and convolution) are kept separate, and this seems almost a pity, since the objects they represent are so similar.

A YOLO network.

A change in this viewpoint would require an increase in the complexity of the network topology or, perhaps, a new paradigm.

Are you curious about the potential and nature of artificial intelligence? Contact us at