What is a recommender system?
A company thrives on producing products (or services) likely to fulfil its clients’ needs. This task of matching products with potential clients, historically entrusted to humans, becomes challenging when both the product set and the client universe are large. The difficulty is compounded by the emergence (and often dominance) of digital client acquisition channels and by fast-changing consumption patterns.
A recommender system is a Machine Learning (ML) algorithm which aims at predicting the most likely matches between a company’s products and its clients, at any given time.
In heavily digitized sectors (e.g. online content, e-commerce, search engines, etc.), recommender systems are at the centre of a company’s strategy. Prominent examples of such companies include Netflix (movies), Amazon (e-commerce), Yelp (search engine) and YouTube (streaming).
In sectors such as finance, where digitization holds the promise of massive productivity gains, recommender systems are among the lowest-hanging fruit for investment banks, commercial/retail banks and private banks alike.
In practice, an effective recommender system is a system capable of predicting, with a high level of accuracy, and at any time:
- The products most likely to be of interest to any given client, and
- The clients most likely to have an interest in any given product.
The two main families of algorithms which can be applied to the recommender problem are based on fundamentally different filtering strategies:
Content filtering. This approach consists in gathering extensive information about products (e.g. price, technical specifications) and clients (e.g. demographics) to create detailed profiles, which are then matched against each other. Its main drawback is that it requires extensive information about both products and clients;
Collaborative filtering. This approach consists in interpreting past client behavior to formulate future product recommendations. The power of collaborative filtering resides in its domain-agnostic nature: no a priori knowledge of either products or clients is required for the strategy to work.
Hybrid recommender systems can also be built to combine both filtering strategies.
At the heart of any recommender system is the concept of similarity (encoded below through latent “locations”). Learning hidden relationships (similarities) between products or clients is an unsupervised learning task.
We specifically focus on collaborative filtering techniques throughout this post.
Two classic techniques are used to solve collaborative filtering problems:
- Nonnegative Matrix Factorization (NMF), and
- Latent Dirichlet Allocation (LDA).
Built on fundamentally different principles, both approaches constitute powerful mathematical tools to learn hidden relationships in unstructured data sets.
Product x client matrix
Client preferences for products are best synthesized in the form of a product x client matrix where:
- Each row vector represents a product,
- Each column vector represents a client, and
- Each element of the matrix represents the level of interest of a client in a product.
The input to the model reflects historical feedback from clients on the products they purchased over the training period. Depending on the specific use case, this feedback can take the form of a discrete numerical rating (e.g. 1: bad to 5: excellent), a trade count (number of times a product was purchased by a client), or a more continuous measure reflecting nuanced indications of interest.
By construction, this input feedback matrix is highly sparse. Indeed, many clients only purchased a small subset of the available product universe. The main output of a collaborative filtering algorithm is a similar matrix, fully populated with probability measures for the interest any given client may have in any given product.
Techniques such as TF-IDF (Term Frequency – Inverse Document Frequency) can be used to smooth out the dominance of hyper-active clients and frequently traded products in a sparse input product x client matrix.
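This re-weighting step can be sketched with scikit-learn’s TfidfTransformer on a small, made-up interest matrix (the counts below are purely illustrative): rows play the role of documents (products) and columns the role of terms (clients), so the IDF term naturally dampens hyper-active clients who appear across many products.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical product x client matrix (rows: products, columns: clients).
# Entries are trade counts; most cells are zero, reflecting sparsity.
M = np.array([
    [5, 0, 0, 1],
    [3, 0, 0, 0],
    [0, 2, 4, 0],
    [0, 1, 3, 0],
])

# TF-IDF re-weighting: clients active across many products are down-weighted,
# and each product row is L2-normalized.
tfidf = TfidfTransformer()
M_weighted = tfidf.fit_transform(M).toarray()
```

Zero entries remain zero after the transformation, so the sparsity pattern of the input matrix is preserved.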
Consider the input product x client interest matrix M (dimensions Np x Nc) discussed above, as illustrated in the figure below:
The NMF model aims at learning an optimal low-rank approximation of matrix M as the product of two low-rank matrices, named U and V, as illustrated in the figure below:
Each row vector of matrix U (vector ui) represents the location of a given product (product i). Similarly, each column vector of matrix V (vector vj) represents the location of a given client (client j). Both vectors are expressed in a latent basis of dimension d << min(Np, Nc).
By postulating adequate distributional assumptions for location vectors ui and vj, it is possible to build a generative probabilistic NMF model and infer optimal parameter matrices U and V by maximizing the joint likelihood of the observed data.
Details of possible model inference techniques (full posterior inference, Expectation Maximization, Variational EM etc.) are beyond the scope of this post. It is, however, mathematically possible to optimize location matrices to achieve the desired level of precision in the approximation of M.
Once convergence has been reached, a prediction can be made for every entry of the matrix product UV: the likely level of interest of client j in product i is the dot product uiT · vj.
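The factorization and prediction steps can be sketched with scikit-learn’s NMF on a small, hypothetical interest matrix (the values and the rank d = 2 are illustrative assumptions, not prescriptions):

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical sparse product x client matrix (Np = 4 products, Nc = 4 clients).
M = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Learn a rank-2 factorization M ~ U V, with U (Np x d) holding product
# locations and V (d x Nc) holding client locations.
model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
U = model.fit_transform(M)   # rows of U are product location vectors u_i
V = model.components_        # columns of V are client location vectors v_j

# Predicted interest of client j in product i is the dot product u_i . v_j,
# i.e. entry (i, j) of the reconstructed matrix.
M_hat = U @ V
```

The reconstruction fills in the zero entries of M with nonnegative scores that can be ranked to produce recommendations.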
By approximating the initial product x client matrix M as a product of low rank matrices, NMF exposes a hidden factor structure linking product and client locations. The low rank of the resulting matrix product reflects the high degree of correlation between its expected product ratings.
The LDA model is a generative probabilistic model borrowed from the field of topic modelling. While NMF learns nondescript hidden location factors linking products and clients, LDA explicitly postulates a hierarchical structure for the generation of a corpus of products.
The LDA model manipulates 3 types of objects:
Clients. Analogous to a word in topic modelling, a client is the basic building block of an LDA model.
Products. Analogous to a document in topic modelling, a product is a set of clients with an interest to transact.
Client groups. Analogous to the concept of topic in topic modelling, a client group is a collection of clients with an interest in similar products. Whereas products and clients can be observed, client groups are latent variables learned by the model.
The power of the LDA model lies in its ability to learn the structure of hidden product x client relationships by assigning high probability to members of a corpus of products (clients) but also to members of other “similar” products (through client groups).
The purpose of an LDA model is to build a generation framework for products and clients. To that end, it defines:
- Each product as a multinomial distribution over client groups, and
- Each client group as a multinomial distribution over clients.
The parameters for each of the above multinomial distributions originate from the sampling of 2 master Dirichlet distributions:
- A product Dirichlet prior distribution is used to generate all multinomial product distributions over client groups, and
- A client group Dirichlet prior distribution is used to generate all multinomial client group distributions over clients.
The Dirichlet distribution is a highly flexible probability distribution over a support simplex. It allows very fine control of the resulting distributional mixes, as illustrated below.
The Dirichlet distribution:
The Dirichlet distribution is a multivariate generalization of the Beta distribution for a k-dimensional random variable. Its parametric probability density function makes it possible to generate highly customizable distributions over a k-dimensional support simplex.
This is best illustrated through the figures below, for k=3 (on a 3-dimensional support simplex), which show the variety of distribution samplings generated by such a Dirichlet distribution for different values of its k-dimensional parameter (Alpha).
A Dirichlet distribution makes it possible to generate, for a given product, a highly bespoke multinomial distribution over client groups: some products may be associated with only a few specific client groups, others with a wide variety of client groups.
Likewise, a Dirichlet distribution makes it possible to generate, for a given client group, a highly bespoke distribution over clients: some client groups may contain only a few specific clients, others a large number of clients.
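The influence of the concentration parameter Alpha can be illustrated with NumPy’s Dirichlet sampler (the alpha values below are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric Dirichlet over k = 3 client groups.
# alpha << 1 yields sparse mixes (mass concentrated on few groups);
# alpha >> 1 yields near-uniform mixes over all groups.
sparse_mixes = rng.dirichlet(alpha=[0.1, 0.1, 0.1], size=1000)
uniform_mixes = rng.dirichlet(alpha=[10.0, 10.0, 10.0], size=1000)

# Each sample lies on the probability simplex: entries sum to 1.
# With small alpha, the largest component of each sample tends to dominate.
sparse_max = sparse_mixes.max(axis=1).mean()
uniform_max = uniform_mixes.max(axis=1).mean()
```

On average, the largest component of a low-alpha sample is far closer to 1 than that of a high-alpha sample, which is exactly the “few specific client groups” versus “wide variety of client groups” contrast described above.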
Equipped with the above mathematical toolkit, we can now specify our probabilistic model more precisely, starting with some notations:
- D is the number of available products.
- N is the number of clients associated with any given product.
- There are T client groups and V clients in total.
The LDA probabilistic generation model aims at evaluating the probability of each client associated with a given product. This is done iteratively by considering every client associated with a given product, assigning a client group to it, and evaluating the probability of that client under the distribution of the assigned client group.
Practically, the generation algorithm works as follows:
For client group t = 1 to T:            # loop over all client groups
    Draw βt ~ Dir(η)                    # βt is a multinomial distribution over the V clients
                                        # (φt,w stores the PDF of client group t under βt)
For product d = 1 to D:                 # loop over all products
    Draw θd ~ Dir(α)                    # θd is a multinomial distribution over the T client groups
                                        # (ϒd,t stores the PDF of product d under θd)
    For client n = 1 to N:              # loop over all clients associated with the current product
        Draw zd,n from θd               # zd,n is the client group assigned to client n of product d
        Draw wd,n from p(wd,n | zd,n, β)  # wd,n is the client sampled from multinomial distribution βzd,n
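The generative process above can be sketched directly in Python (the dimensions and hyperparameter values below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, V, N = 3, 5, 10, 8      # client groups, products, clients, clients per product
eta, alpha = 0.5, 0.5         # Dirichlet hyperparameters (illustrative values)

# One multinomial distribution over the V clients per client group (beta_t).
beta = rng.dirichlet(np.full(V, eta), size=T)     # shape (T, V)

corpus = []
for d in range(D):
    # theta_d: product d's multinomial mix over the T client groups.
    theta_d = rng.dirichlet(np.full(T, alpha))
    product_clients = []
    for n in range(N):
        z = rng.choice(T, p=theta_d)              # assign a client group z_{d,n}
        w = rng.choice(V, p=beta[z])              # sample a client w_{d,n} from beta_z
        product_clients.append(w)
    corpus.append(product_clients)
```

Running this forward model produces a synthetic corpus of products, each represented as a list of sampled client identifiers; inference works in the opposite direction, recovering beta and theta from an observed corpus.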
Such a probabilistic model is classically represented as shown in the plate representation below. Stylized matrices (in green) have been added to illustrate the dimensionality of parameter and sampled variable matrices.
Here again, details of possible model inference techniques (e.g. variational EM) are beyond the scope of this post. It is, however, mathematically possible to prove convergence of the model by maximizing the posterior probability of parameters α, η, θ and β given the data.
Once convergence has been reached, a probability of interest can be estimated for any (product, client) pair.
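As a sketch of this estimation step, scikit-learn’s LatentDirichletAllocation can be fitted on a small, made-up product x client count matrix; multiplying each product’s mix over client groups by the per-group client distributions yields an interest probability for every (product, client) pair:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical product x client count matrix (rows: products, cols: clients).
M = np.array([
    [4, 3, 0, 0, 1],
    [5, 2, 0, 0, 0],
    [0, 0, 3, 4, 2],
    [0, 1, 4, 5, 0],
])

lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(M)   # per-product distribution over client groups
# Normalize components_ rows to obtain per-group distributions over clients.
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Estimated probability of interest for every (product, client) pair.
P = theta @ phi
```

Each row of P is a proper probability distribution over clients, so the top entries of row i directly rank the clients most likely to be interested in product i.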
Efficacy measures and possible enhancements
NMF and LDA are highly effective models for building high-dimensional collaborative filtering systems. Both models are able to learn complex product x client relationship structures from previously unseen, unstructured data sets, without being prone to overfitting and at reasonable computational cost.
The efficacy of a recommender system is typically measured on an unseen testing set, over a given testing period (e.g. over 1-day rolling periods for frequently traded products). Standard confusion matrix accuracy measures can be adapted to the recommender problem.
For instance, a Cumulative True Positives (CTP) measure will track, for any product, the number of clients who actually traded that product during the testing period and who were included in the n most highly recommended clients at the start of the period.
Other measures may be included to track various typologies of false positives, depending on bespoke needs. All measures should also be monitored through time to assess the stability of model predictions.
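A minimal sketch of such a CTP measure (the scores, trade records and helper name below are all made up for illustration):

```python
import numpy as np

# Hypothetical model scores (rows: products, cols: clients) and the trades
# actually observed over the testing period (product -> clients who traded it).
scores = np.array([
    [0.9, 0.1, 0.6, 0.2],
    [0.2, 0.8, 0.1, 0.7],
])
actual = {0: {0, 2}, 1: {1}}

def cumulative_true_positives(scores, actual, n=2):
    """Count, per product, the actual buyers found among the top-n recommended clients."""
    ctp = {}
    for product, buyers in actual.items():
        # Clients ranked by descending predicted interest; keep the top n.
        top_n = set(np.argsort(scores[product])[::-1][:n])
        ctp[product] = len(top_n & buyers)
    return ctp

print(cumulative_true_positives(scores, actual))   # prints {0: 2, 1: 1}
```

Tracking this count per product over rolling testing periods gives a simple, interpretable view of how often the model’s top-n client lists anticipate real trades.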
Numerous enhancements are possible by coupling collaborative filtering techniques such as the ones described in this post with content-driven recommender systems.
In the field of finance, a content-driven recommender system might offer more insight into risk factors and favorable market configurations for a given product to be traded. Combining this knowledge with a traditional collaborative filtering system might improve the relevance of client suggestions.
This completes our primer on recommender systems. Recommender systems are powerful ML algorithms with wide-ranging applications. In the financial sector, they constitute some of the lowest-hanging fruit in the quest to future-proof client services.
See D. M. Blei, A. Y. Ng and M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.