
Concepts behind Natural Language Processing (NLP)

Within Machine Learning (ML), Natural Language Processing (NLP) – think of any task related to language – is a vast domain and the focus of intense research by both leading universities and Big Tech. Ever-increasing processing power has made giant advances in the field possible in the course of just a few years. At qbridge, our prediction is that entirely new techniques will emerge over the coming years, making certain NLP tasks vastly easier to perform.

When it comes to language processing, impressive tasks can already be performed at human, and often super-human, level. The sheer mass of textual data that machines can process today is the most obvious illustration of that fact. Yet machines also struggle with, and sometimes fail at, some basic, child-level tasks. This paradox, combined with the astonishing pace of progress in the field, makes NLP one of the most exciting areas of data science.

We apologise in advance for the large number of acronyms and technical terms in this post; NLP is, unfortunately, a field with many techniques and technicalities. On the bright side, after reading this post you will have enriched your vocabulary with quite a few terms, including NLP, POS, NER, BOW, TF-IDF, Word2Vec, GloVe, CBOW and Skip-gram, but also tokenisation, n-gram, stemming, lemmatisation, word vectorisation, embedding, vector space, compositionality, and more…

NLP applications are not the main focus of this post. Some are ubiquitous in our day-to-day lives: text summarisation, question answering, machine translation, dialogue generation, image captioning and chatbots, to name a few.

Why does finance care about NLP?

In 2016, IBM stated that 90% of the world's data had been created over the previous two years alone. Indeed, what we commonly refer to as big data is predominantly textual data. At qbridge, we estimate that 80% of the data collected by financial agents is unstructured and mostly textual in nature. For financial institutions, use cases are plentiful, spanning trading activities and client services. Examples range from making sense of news feeds to offer clients valuable insights or to support trading activities, to conducting sentiment analysis on companies or markets, or classifying unstructured data collected by agents (not an exhaustive list).

"90 percent of the data in the world today has been created in the last two years alone – and with new devices, sensors and technologies emerging, the data growth rate will likely accelerate even more."
IBM_cloud
IBM Marketing Cloud
2016

NLP pipeline

A typical NLP pipeline involves a sequence of NLP steps followed by ML tasks (performed by either traditional ML algorithms or neural networks). In this post, we purposely ignore generic ML techniques (which can be used in a variety of contexts, not just NLP) to focus specifically on what it takes to understand a language. Certain ML techniques are particularly well suited to NLP tasks because they address key features of natural language, such as long-term memory within texts (how to model long-distance contextual information where it matters) or the hierarchical nature of language.

Language is hierarchical by nature. Words (with their associated meanings) form sub-phrases with a higher-level meaning, which themselves form sentences and groups of sentences, capturing an increasingly global context. Certain ML models are specifically designed to handle this hierarchical structure.

The complexity of language

To illustrate the complexity of language, let’s mention two basic NLP tasks which are far from trivial for a machine, even though a child can solve them easily (both are illustrated in the short code sketch after this list):

  • Parts-Of-Speech (POS) tagging aims to assign a part of speech to each word in a given text (noun, verb, adjective, and so on), based on the word’s relationship to adjacent words. In some cases, these parts of speech are further linked to higher-order units with grammatical meaning, creating a hierarchy or tree structure within the sentence.
  • Named-Entity Recognition (NER) aims to identify “named entities” and assign them to different categories, such as names of persons, companies or places.
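As a concrete illustration, here is a minimal sketch of both tasks, assuming the spaCy library and its small English model (en_core_web_sm) are installed; the example sentence is made up.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened a new office in Paris last year.")

# POS tagging: one grammatical tag per token
for token in doc:
    print(token.text, token.pos_)

# NER: spans labelled as organisations, places, dates, ...
for ent in doc.ents:
    print(ent.text, ent.label_)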

Data pre-processing

This first step is a bit different for text data than for other types of data. It typically involves most of the following (a small code sketch follows the list):

  • Tokenisation, i.e. the process of splitting text into smaller pieces called “tokens” (words, n-grams – i.e. sequences of n contiguous words – or even characters);

  • Lowercasing, removing numbers and special characters, expanding abbreviations;
  • Removing stop words, i.e. the most frequent words in a language, which bring little information (e.g. the);
  • Stemming: a crude heuristic process which chops off the ends of words, with the benefit of requiring little knowledge about the language (e.g. ponies -> poni with Porter’s algorithm, 1980);
  • Lemmatisation: morphological analysis of words to return the root (“lemma”) of the word (e.g. is -> be).
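A small sketch of these steps, assuming the NLTK library is installed (with its punkt, stopwords and wordnet resources downloaded); the example sentence is made up:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The ponies were running near the old stables."

# Tokenisation, lowercasing and removal of non-alphabetic tokens
tokens = [t.lower() for t in nltk.word_tokenize(text) if t.isalpha()]

# Stop-word removal
tokens = [t for t in tokens if t not in stopwords.words("english")]

# Stemming (crude suffix chopping) versus lemmatisation (morphological analysis)
stemmer, lemmatiser = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # e.g. 'ponies' -> 'poni'
print([lemmatiser.lemmatize(t) for t in tokens])  # e.g. 'ponies' -> 'pony'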

Vectorisation

Once the data has been pre-processed, the next step in text analysis is feature engineering, i.e. structuring the data so that algorithms can be applied to it. For text, this is called vectorisation: creating word vectors that machines can read.

The most basic technique for transforming text into vector representations is called Bag-Of-Words (BOW). It relies on a simple and popular technique used for non-numerical data in ML, called One-Hot Encoding. For a given document (within a corpus of documents), the vector encodes whether each word (or n-gram) in the vocabulary is present in the document (1 if present, 0 otherwise). We can easily see how this can be useful for tasks like topic modelling.
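A minimal BOW sketch with scikit-learn (assumed installed); the two-document corpus is made up, and binary=True gives the presence/absence encoding described above:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "markets rallied on strong earnings",
    "earnings disappointed and markets fell",
]

vectoriser = CountVectorizer(binary=True)
bow = vectoriser.fit_transform(corpus)

print(vectoriser.get_feature_names_out())  # the vocabulary
print(bow.toarray())                       # one row per document, 0/1 per word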

However, BOW potentially misses a crucial piece of information: it does not tell us whether a particular word is frequent across documents. Ideally, we would like to identify words that are frequent in a given document but reasonably rare across documents, so that we can assert a strong association between that word and that document. That is what TF-IDF (Term Frequency – Inverse Document Frequency) does: the weight is proportional to the frequency of the word within the document, but inversely related to the frequency of the word across documents. Note that TF-IDF naturally filters out stop words: even though they are frequent within documents, they are frequent indiscriminately across all documents, and therefore carry a low weight.
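The same toy corpus with TF-IDF weights instead of raw presence/absence (again a sketch with scikit-learn, not a tuned pipeline):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "markets rallied on strong earnings",
    "earnings disappointed and markets fell",
]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(corpus)

# Words common to both documents ("markets", "earnings") receive lower
# weights than words specific to a single document.
print(dict(zip(tfidf.get_feature_names_out(), weights.toarray()[0].round(2))))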

Word embeddings

Despite being useful for certain tasks, BOW and TF-IDF fail in two areas:

  • Semantic meaning. These techniques are merely statistical representations of individual words, irrespective of the surrounding words or context, and therefore cannot capture meaning.
  • Curse of dimensionality. These representations have very high dimensionality (the size of the chosen vocabulary). They are therefore sparse, which brings complications such as weak statistical relevance, difficulty in assessing similarity, etc.

Over the last few years, instead of relying on simple word frequencies, language representations have moved towards word embeddings, which are compact and denser vector encodings of words aiming to capture conceptual meanings. This is the field of vector space models.

"You shall know a word by the company it keeps."
John Rupert Firth (1957)
(A synopsis of linguistic theory, 1930-1955. Studies in linguistic analysis)

Technically, word embeddings (or distributional representations) rely on the distributional hypothesis: words appearing in a similar context have a similar meaning. These representations are also much denser, with typically 100-300 dimensions versus 10,000+ for a basic vocabulary.

TF-IDF vs. word embeddings

Simple techniques like TF-IDF shouldn’t be underestimated. On some tasks, they perform better than more advanced word representations, despite their inability to relate words to one another.

The key benefits of TF-IDF over word embeddings are:

  • They do not require a large external corpus; they can be fitted on the documents at hand.

  • For that reason, they are much less computationally expensive and memory intensive.

  • They perform well on large documents, as in text classification or topic modelling, whilst word embeddings, which focus on individual words, do well in tasks such as text generation or translation.

Word embedding: state of the art

We now turn to the latest word embedding algorithms developed over the past few years.

Word2Vec (2013, Google)[1]

An implementation of Word2Vec has been pre-trained on a Google News dataset of about 100 billion words, producing 300-dimensional vectors for 3 million words and phrases. Training may be (and for NLP, surely is) the computationally expensive part of an ML project, hence the value of pre-training. Note, however, that even when pre-training on external data, additional training might be required to capture a specific context.

Word2Vec comes in two implementations: CBOW and Skip-gram. CBOW uses a shallow neural network to predict (in the form of conditional probabilities) a target word given the context words in a window around it. Skip-gram does the reverse: it predicts the surrounding context words given a target word.
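A sketch of both flavours using the gensim library (assumed installed); the toy sentences are made up and far too small to train a meaningful model, but they show the sg flag with which gensim switches between CBOW (sg=0) and Skip-gram (sg=1):

from gensim.models import Word2Vec

sentences = [
    ["markets", "rallied", "on", "strong", "earnings"],
    ["earnings", "disappointed", "and", "markets", "fell"],
]

cbow = Word2Vec(sentences, vector_size=100, window=2, sg=0, min_count=1)
skipgram = Word2Vec(sentences, vector_size=100, window=2, sg=1, min_count=1)

print(cbow.wv["markets"].shape)                   # a dense 100-dimensional vector
print(skipgram.wv.most_similar("markets", topn=2))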

It is quite fascinating that adding and subtracting such vectors makes sense, which shows that some meaning is being captured by the algorithm. This property is called compositionality. For instance:

Vect(Paris) – Vect(France) + Vect(Italy) ≈ Vect(Rome)
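This can be checked directly with pre-trained vectors; a sketch using gensim’s downloader, which fetches a set of pre-trained GloVe vectors (around 100 MB) on first use – any pre-trained embedding would do:

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")

# vect(paris) - vect(france) + vect(italy) should land near vect(rome)
print(vectors.most_similar(positive=["paris", "italy"], negative=["france"], topn=3))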

GloVe (2014, Stanford)[2]

GloVe uses a different technique and works on word-word co-occurrence statistics.

For instance, from a 6 billion token corpus, consider the co-occurrence probabilities of the target words “ice” and “steam” with selected context words k: the ratio P(k | ice) / P(k | steam) is large for a word like ‘solid’, small for ‘gas’, and close to 1 for a non-discriminative word like ‘water’.

These ratios capture the strong association of ‘solid’ with ‘ice’ and of ‘gas’ with ‘steam’, whilst ‘water’ does not discriminate between the two – some form of meaning is being extracted. The co-occurrence matrix is then factorised into a much lower-dimensional space, in which each word receives a dense vector (its embedding) whose dot products reflect the co-occurrence statistics.
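To make the starting point concrete, here is a toy sketch of the word-word co-occurrence counting GloVe begins with (the corpus, window size and words are illustrative only):

from collections import Counter

corpus = "ice is solid and steam is gas but water goes with ice and with steam".split()
window = 2

cooc = Counter()
for i, target in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            cooc[(target, corpus[j])] += 1

# Ratios of co-occurrence probabilities such as P(solid | ice) / P(solid | steam)
# are what the GloVe embeddings are trained to reproduce.
print(cooc[("ice", "solid")], cooc[("steam", "gas")])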

GloVe has similar performance to Word2Vec but is quicker to train.

fastText (2016, Facebook)[3]

This open-source framework from Facebook is an improvement on Word2Vec. It is based on character n-gram embeddings (i.e. sub-words instead of whole words), which extract morphological information from words and help with rare or unseen words. A word embedding is then the sum of the vector representations of its sub-word n-grams.
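As an illustration of the sub-word idea, a short sketch of the character n-grams a word decomposes into (the < and > boundary symbols follow the convention of the fastText paper; n = 3 here):

def char_ngrams(word, n=3):
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>']
# The vector for "where" is then the sum of the vectors of these n-grams
# (plus the whole word itself), which helps with rare or unseen words.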

fastText has proven superior to both Word2Vec and GloVe on a number of tasks, but is more computationally expensive given the need to split words into n-grams.

 

This covers the latest techniques in the field of text processing, in particular the concept of word embeddings. It is a fast-moving field of research which we will revisit as and when new findings with interesting applications to the financial sector emerge.

References:

[1] Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[2] Pennington, J., Socher, R. and Manning, C., 2014, October. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).

[3] Bojanowski, P., Grave, E., Joulin, A. and Mikolov, T., 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, pp.135-146.

If you want to know more about our capabilities, reach out via LinkedIn, email or through our website.

Hashtags:
#AI/ML #algorithms #NLP #qbridge
