# Deep Learning = alchemy?

Machine Learning (ML) is the branch of Artificial Intelligence (AI), a wider field that also includes disciplines like Robotics, concerned with the ability of machines to learn without being explicitly programmed. The area is booming and is considered by some to be the fourth Industrial Revolution, one that will profoundly transform the way businesses operate, including within the financial industry. A great leap forward has come from the extensive use of ever cheaper computing power (Graphics Processing Units, or GPUs, in particular) which, coupled with vast amounts of data, has enabled advances in the field of Artificial Neural Networks (ANNs) over the past few years.

ANNs are loosely modelled on the way neurons work in the human brain. **One way to understand them is to imagine how information flows from a set of inputs through a series of (hidden) layers, whose neurons are activated or not, in order to produce a final output**. For instance, in Image Recognition, the output would be what the image represents. **The distinguishing feature of neural networks is the sequential nature of the learning process which, in its simplest form, maps an input to an output through a sequence of linear and non-linear transformations**. The term Deep Learning (DL) refers to ANNs with a large number of layers (currently in the hundreds), as opposed to shallow neural networks, which have just a handful of layers.
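To make the "sequence of linear and non-linear transformations" concrete, here is a minimal sketch of a forward pass in plain Python. The network shape, the hand-picked weights, and the choice of ReLU and sigmoid as non-linearities are assumptions made purely for illustration; a real network would learn its weights from data.

```python
import math

def relu(values):
    """Non-linear step: keep positive activations, zero out the rest."""
    return [max(0.0, v) for v in values]

def linear(weights, biases, inputs):
    """Linear step: one weighted sum plus a bias per neuron in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def forward(layers, inputs):
    """Pass the inputs through each layer in turn (linear, then non-linear)."""
    activations = inputs
    for weights, biases in layers[:-1]:
        activations = relu(linear(weights, biases, activations))
    # Final layer: a linear step followed by a sigmoid, giving a score in (0, 1).
    weights, biases = layers[-1]
    (score,) = linear(weights, biases, activations)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical, hand-picked weights: 2 inputs -> 2 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25]),  # hidden layer
    ([[1.0, 1.0]], [0.0]),                      # output layer
]
output = forward(layers, [1.0, 0.5])
print(f"network output: {output:.4f}")
```

A deep network repeats the hidden-layer step many times; nothing else changes in the structure of the computation.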

In this post, we review why DL has proven so effective in matching (and often exceeding) human performance, and the implications of choosing DL over other ML algorithms.

**Anecdote:** DL has frequently been compared to a form of alchemy due to both its complexity and its efficiency. However, the Google researcher at the origin of that analogy meant it in a very different way. His original claim referred to the lack of a framework, within the ML research community, for choosing one algorithm over another. It was not a reference to DL’s magical powers!

We draw on the latest research to attempt to explain why ANNs work so well in practice.

# The laws of Physics

There are two aspects to DL:

- The structure of Neural Networks (NNs) in itself, and
- Their depth.

The task of ML is to approximate complex functions. For instance, an image classifier takes as input a vector of pixels representing an image and produces, as output, what the image represents.

Mathematically, it can be proven that, for any (smooth) function and any level of accuracy, there exists a NN of sufficient size that approximates that function to that accuracy or better. However, the number of possible functions is enormous, exponentially larger than the number of possible NNs.

**Example:** An image classifier for greyscale megapixel images would involve estimating a function defined by 256^(1,000,000) probabilities, one for each possible image, i.e. far more than the number of atoms in the universe!
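The size of that number is easier to grasp through its logarithm. The sketch below is a rough back-of-the-envelope check, assuming about 10^80 atoms in the observable universe (a commonly quoted order of magnitude, not a figure taken from this post).

```python
import math

pixels = 1_000_000   # one megapixel
grey_levels = 256    # possible values per pixel

# Work with logarithms: the number itself is far too large to print usefully.
log10_images = pixels * math.log10(grey_levels)
digits_in_image_count = math.floor(log10_images) + 1

# Assumption: roughly 10**80 atoms in the observable universe.
log10_atoms = 80

print(f"256^(1,000,000) is about 10^{log10_images:,.0f}")
print(f"that is a number with {digits_in_image_count:,} digits")
print(f"atoms in the universe: about 10^{log10_atoms}")
```

The count of possible images has about 2.4 million digits, whereas the assumed number of atoms has 81.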

**Low-order polynomials.** In practice, a NN with no more than thousands or millions of parameters performs well at classifying such images. This dramatic simplification is due to the laws of physics. In fact, **real world data sets are drawn from a tiny fraction of all imaginable data sets**. One reason is that the laws of physics involve low-order polynomials, typically of order 2 to 4.

**Locality.** Another core principle in physics is locality: things only directly affect what is in their immediate vicinity, so only a limited number of parameters are non-zero.

**Symmetry.** Lastly, some form of symmetry (translation and rotation) is common, further reducing the number of parameters.
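One familiar way these two principles show up in network design is the convolutional layer, where each output looks only at a small neighbourhood (locality) and the same small set of weights is reused at every position (translation symmetry). The sketch below counts weights for a one-dimensional toy case; the input length and the 3-tap kernel are hypothetical values chosen for illustration, and biases are ignored.

```python
def dense_parameter_count(n_inputs, n_outputs):
    """Every output is connected to every input: one weight per pair."""
    return n_inputs * n_outputs

def local_parameter_count(n_outputs, kernel_size):
    """Locality: each output only sees `kernel_size` neighbouring inputs."""
    return n_outputs * kernel_size

def shared_local_parameter_count(kernel_size):
    """Locality plus translation symmetry: one kernel reused everywhere."""
    return kernel_size

def convolve_1d(signal, kernel):
    """Apply the same local kernel at every position of the signal."""
    k = len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

n = 1_000                  # hypothetical input length
kernel = [-1.0, 0.0, 1.0]  # hypothetical 3-tap kernel that responds to steps

dense = dense_parameter_count(n, n)
local = local_parameter_count(n, len(kernel))
shared = shared_local_parameter_count(len(kernel))
print(f"dense: {dense:,}  local: {local:,}  local + shared: {shared}")

# The same three weights respond to the step wherever it occurs in the signal.
edges = convolve_1d([0.0, 0.0, 1.0, 1.0, 1.0], kernel)
```

In this toy setting the weight count drops from one million to three thousand with locality alone, and to three once the kernel is shared across positions.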

Because of these properties, it can be proven that, if there are *n* inputs taking *v* values each, the number of parameters can be cut from *v*^*n* to a multiple of *v* × *n*.
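A quick numerical comparison shows how large the gap between the two counts becomes as *n* grows. This is only a sketch: the `multiple` argument stands in for the unspecified constant factor and is a hypothetical parameter, set to 1 here.

```python
def full_table_size(v, n):
    """Entries needed to describe an arbitrary function of n inputs
    taking v values each: v**n."""
    return v ** n

def reduced_size(v, n, multiple=1):
    """Parameter count after the simplifications: a multiple of v * n.
    `multiple` is a hypothetical placeholder for the constant factor."""
    return multiple * v * n

v = 2  # binary inputs, to keep the numbers printable
for n in (2, 10, 20, 40):
    print(f"n={n:>2}  v^n={full_table_size(v, n):>16,}  v*n={reduced_size(v, n):>4,}")
```

The first count grows exponentially with *n* while the second grows only linearly, which is why the simplification matters so much for inputs with millions of components.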

# The hierarchy of Nature

The second aspect of DL is its depth. ANNs with many layers somehow perform better than shallow ones. One explanation, taken from physics again, is that **the real world is highly hierarchical**. In other words, complex structures are the result of a sequence of simple steps. A deep neural network does just that: it learns sequentially and incrementally, for instance, the brightness/darkness of an image, then its edges and lines, then its shapes, as it progresses through its layers. It’s a generative process that we find everywhere in nature.

Lastly, representing this generative process through many steps seems optimal, as a number of theorems show that such networks cannot be efficiently “flattened”.

The pictures below offer some insight into the sequential and incremental learning process in the case of image recognition.

# Practical implications: DL vs. other ML algorithms

As we saw, due to their generative structure, deep NNs extract low-level features without the need to identify them explicitly. This is often considered an advantage of DL over other ML techniques, in the sense that feature extraction is automatically taken care of by the network. For non-ANN techniques, feature engineering (often involving domain expertise), i.e. identifying and combining relevant features for the task, is a prerequisite to applying an algorithm to the data.

We argue that the flip side of this automatic feature extraction is that DL algorithms are perceived as black boxes. In many applications, interpretability of the model is important, and DL networks are particularly opaque. In such cases, one might prefer to use non-ANN algorithms.

A frequent criticism of DL is the training time needed to estimate the thousands or millions of model parameters. Even on sophisticated architectures, training can take weeks for DL, compared with minutes or hours for other ML algorithms. However, this is less of a problem in production, as running a trained DL model is usually fast. The need for an advanced architecture to cope with the training load remains, though.

But above all else, **empirically, DL is only superior to non-ANN techniques when the amount of data available to train the model is very large**. This means that, for most financial applications on the trading floor (except for some high frequency trading uses), non-ANN algorithms appear superior, due to their other benefits. For our typical use cases at qbridge, we would also systematically recommend applying non-ANN ML algorithms first to establish a benchmark, before possibly moving to ANNs.