Using Semantic Fingerprinting In Finance

Feriha Ibriyamova

Leiden University

Samuel Kogan

Leiden University – Leiden University College

Galla Salganik-Shoshan

Ben-Gurion University of the Negev

David Stolin

Toulouse Business School – Economics and Finance

March 28, 2016


Researchers in finance and adjacent fields increasingly work with textual data, and a common challenge is analyzing the content of a text. Traditionally, this task has been approached through labor- and computation-intensive work with lists of words. In this paper we compare word list analysis with an easy-to-implement and computationally efficient alternative called semantic fingerprinting. Using the prediction of stock return correlations as an illustration, we show that semantic fingerprinting produces superior results. We argue that semantic fingerprinting significantly reduces the barrier to entry for research involving text content analysis, and we provide guidance on implementing this technique.

Using Semantic Fingerprinting In Finance

Finance and economics researchers are deluged with data, much of it non-quantitative in nature. Accordingly, there is much interest in how such data – usually in the form of text – can be used in explaining and predicting financial and economic phenomena. Our paper conducts a first assessment of the predictive power of an emerging development in textual analysis: semantic fingerprinting.

A series of papers co-authored by Hoberg and Phillips (2010, 2015a, 2015b) pioneered the use of textual data to measure similarity between firms’ products and therefore their proximity in the business space. These papers document that text-derived measures of competitive proximity outperform traditional industry classifications in a wide variety of predictive and explanatory specifications. Specifically, these papers use text sources such as 10-K business descriptions to create descriptive word lists for individual firms. Each firm is then represented by a vector of 1s and 0s indicating the presence and absence, respectively, of a given word in the text. The cosine similarity between the vectors captures the overlap between the word lists and is therefore a measure of proximity between firms.
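The word list measure described above can be illustrated with a minimal sketch. It assumes a deliberately naive tokenization (whitespace splitting and punctuation stripping), which is not the word selection procedure of the cited papers, and the function names `word_set` and `binary_cosine` are hypothetical. For 0/1 vectors, the cosine similarity reduces to the number of shared words divided by the geometric mean of the two word-list sizes.

```python
import math


def word_set(text):
    """Return the set of distinct lowercase words in a text (toy tokenizer)."""
    words = {w.strip(".,;:()'\"").lower() for w in text.split()}
    return words - {""}


def binary_cosine(text_a, text_b):
    """Cosine similarity between the 0/1 word-presence vectors of two texts.

    With binary vectors the dot product equals the number of shared words,
    and each vector's norm is the square root of its word count.
    """
    a, b = word_set(text_a), word_set(text_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Two of three words are shared, so the similarity is 2 / sqrt(3 * 3) = 2/3.
similarity = binary_cosine("online retail store", "internet retail store")
```

Note that ‘online’ and ‘internet’ contribute nothing to this measure, because the two words are treated as unrelated vector dimensions.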

Recent advances in the semantic analysis of texts make it possible, in principle, to improve on such ‘word list’ analyses. For example, if one text uses the term ‘online’ while another relies on the term ‘internet’, a word list process would not conclude that the two texts are talking about the same thing. Semantic fingerprinting, on the other hand, is ‘trained’ on a large body of text so that it can record the concepts that a given word is associated with, similarly to how the human neocortex processes and stores related concepts. These related concepts form the so-called ‘semantic fingerprint’ of a given word. For example, the terms ‘online’ and ‘internet’ will have similar semantic fingerprints. Accordingly, the semantic fingerprints of two texts, each of which relies on one but not both of two related words, will overlap. Therefore, the two related words’ presence will contribute to the two texts’ proximity measure if the measure is derived from their semantic fingerprints but not if it is derived from word lists. We thus expect that semantic fingerprinting has the potential to improve on word list methods, at least when it comes to measuring document similarity – and therefore the similarity of two firms when the documents in question describe the firms’ business activities.
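The ‘online’ versus ‘internet’ argument can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the fingerprints in `TOY_FINGERPRINTS` are invented by hand rather than learned from a corpus, a text's fingerprint is taken to be the simple union of its words' fingerprints, and overlap is scored with the same binary cosine used for word lists. Real semantic fingerprints are produced by a trained system and are much larger.

```python
import math

# Hypothetical fingerprints: each word maps to a set of "concept" positions.
# Related words ("online", "internet") are given mostly shared positions.
TOY_FINGERPRINTS = {
    "online": {1, 2, 3, 4},
    "internet": {2, 3, 4, 5},
    "retailer": {10, 11, 12},
    "bank": {20, 21, 22},
}


def text_fingerprint(words):
    """Union of the toy fingerprints of the words in a text (an assumption)."""
    fingerprint = set()
    for word in words:
        fingerprint |= TOY_FINGERPRINTS.get(word, set())
    return fingerprint


def cosine(a, b):
    """Cosine similarity between two sets viewed as 0/1 vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


text_a = ["online", "retailer"]
text_b = ["internet", "retailer"]

# Word lists share only "retailer": 1 / sqrt(2 * 2) = 0.5.
word_list_similarity = cosine(set(text_a), set(text_b))

# Fingerprints share 6 of 7 positions on each side: 6 / 7.
fingerprint_similarity = cosine(text_fingerprint(text_a), text_fingerprint(text_b))
```

In this toy setting the fingerprint-based score exceeds the word list score only because the two related words were assigned overlapping positions, which is the property a trained system is meant to supply.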


The premise underlying our tests is simple. If semantic fingerprinting improves on word list-based analyses when measuring firm relatedness, it should be useful for predicting future stock return correlations, as similar firms can be expected to have similar stock price responses to relevant news. We therefore conduct regressions of pairwise correlations for a group of firms from a variety of industries on textual proximity measures derived from semantic fingerprints and from word lists, with prior correlations and other similarity measures as controls. We find that similarities based on semantic fingerprints are better at predicting correlations than are those based on word lists. This finding, combined with the ease of use, computational advantage, and visual interpretation inherent to semantic fingerprinting, suggests that this method merits attention from finance and economics researchers. To this end, Appendix B of our paper provides detailed information about how to implement semantic fingerprinting.
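As a rough sketch of the data layout such a regression implies, the following code assembles one observation per firm pair: the future return correlation (the dependent variable), the prior correlation (a control), and a textual similarity score. The return series and similarity values are made-up placeholders rather than data from this paper, the names `pearson` and `pairwise_rows` are hypothetical, and the estimation step itself is left to any standard OLS routine.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length return series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pairwise_rows(prior_returns, future_returns, similarity):
    """Build one regression row per unordered pair of firms."""
    rows = []
    for i, j in combinations(sorted(prior_returns), 2):
        rows.append({
            "pair": (i, j),
            "future_corr": pearson(future_returns[i], future_returns[j]),
            "prior_corr": pearson(prior_returns[i], prior_returns[j]),
            "similarity": similarity[(i, j)],
        })
    return rows


# Placeholder inputs for three hypothetical firms (not real data).
prior = {
    "A": [0.01, -0.02, 0.03, 0.00],
    "B": [0.02, -0.04, 0.06, 0.00],
    "C": [-0.01, 0.02, -0.03, 0.00],
}
future = {
    "A": [0.02, 0.01, -0.01, 0.03],
    "B": [0.01, 0.02, -0.02, 0.02],
    "C": [-0.01, 0.00, 0.02, -0.02],
}
text_similarity = {("A", "B"): 0.8, ("A", "C"): 0.1, ("B", "C"): 0.2}

rows = pairwise_rows(prior, future, text_similarity)
```

With N firms this layout yields N(N-1)/2 rows, so the number of observations grows quadratically in the number of firms.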

The remainder of our paper is structured as follows. In Section 2 we give an overview of semantic fingerprinting and introduce our data. In Section 3 we use semantic fingerprinting to predict stock return correlations. Section 4 discusses possible applications of semantic fingerprinting in finance, and Section 5 concludes.

