Topic Modeling: Classifying Social Media Conversations

Effective social media listening for financial assets, such as cryptocurrencies, stocks, commodities, and foreign currencies, hinges on the ability of natural language processing (NLP) models to correctly identify discussion topics, themes, and contexts. This task, known as topic modeling, is among the most critical mechanisms employed by our social media listening platform, PUMP. Topic modeling and the key models used for it are the focus of our article today. We are going to discuss these models in a non-technical way so that our audience of traders, investors, and fund managers who haven’t had time to wrap up their PhDs in AI can still appreciate the inner workings of our platform.

What is Topic Modeling?

Topic modeling is a branch of NLP concerned with algorithms that identify the primary themes or topics in a large text collection. Classical topic modeling algorithms analyse how often words appear together in the same texts and cluster words with similar usage patterns into groups. Each resulting cluster represents a topic, which makes it easy to identify the main themes or concepts discussed in the texts.
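
For readers who want to peek under the hood, the idea of "words that appear together" can be sketched in a few lines of Python. This is only an illustration, not the code that runs inside PUMP: the example posts and the function name cooccurrence_counts are invented for this article.

```python
from collections import Counter
from itertools import combinations

# Hypothetical example posts, invented for illustration only.
posts = [
    "bitcoin price rally continues",
    "bitcoin price drops after rally",
    "gold futures climb on inflation fears",
    "inflation fears lift gold futures",
]

def cooccurrence_counts(texts):
    """Count how many texts contain each unordered pair of distinct words."""
    counts = Counter()
    for text in texts:
        # Use each word once per text and sort so every pair has one canonical order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts

counts = cooccurrence_counts(posts)
for pair, n in counts.most_common(5):
    print(pair, n)
```

Pairs such as "bitcoin" and "price" show up together in more than one post, while "bitcoin" and "gold" never do. Counts like these are the raw signal that classical topic modeling algorithms build on.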

Topic modeling is used extensively in social media listening to correctly identify the main topics discussed on social media platforms. PUMP uses a number of topic modeling algorithms and models. Among these, the top models we employ are Latent Dirichlet Allocation (LDA) and BERTopic.

Latent Dirichlet Allocation (LDA) – The Most Popular Topic Modeling Method

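LDA is arguably the most widely used topic modeling method. Strictly speaking, it is a statistical model rather than a language model in the modern sense: it treats every document as a mixture of topics and every topic as a distribution over words. In practice, LDA looks for words that tend to appear together in the same documents and groups those words into topics.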

The algorithm passes over all the documents in the collection many times, refining which words belong to which topic and estimating how strongly each topic is present in each document. This information can be used to understand the main themes present in the collection of text and to categorize new documents based on those themes.
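
The loop described above can be sketched in plain Python. What follows is a minimal toy version that fits LDA with collapsed Gibbs sampling, one common way of estimating the model; it is not the implementation used in PUMP. The function name lda_gibbs, the example posts, and the settings alpha, beta, n_iters, and seed are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA fitted by collapsed Gibbs sampling.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab):
    doc_topic[d][k] is the estimated share of topic k in document d, and
    topic_word[k][w] is the estimated probability of vocab[w] under topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    # Count tables: topics per document, words per topic, tokens per topic.
    n_dk = [[0] * n_topics for _ in docs]
    n_kw = [[0] * V for _ in range(n_topics)]
    n_k = [0] * n_topics

    # Start from a random topic assignment for every token.
    z = []
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][wid] -= 1
                n_k[k] -= 1
                # Favour topics already common in this document and for this word.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][wid] + beta) / (n_k[t] + V * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][wid] += 1
                n_k[k] += 1

    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + n_topics * alpha) for k in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        [(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
        for k in range(n_topics)
    ]
    return doc_topic, topic_word, vocab

# Hypothetical example posts, invented for illustration only.
posts = [
    "bitcoin price rally continues",
    "bitcoin price drops after rally",
    "gold futures climb on inflation fears",
    "inflation fears lift gold futures",
]
docs = [p.split() for p in posts]
doc_topic, topic_word, vocab = lda_gibbs(docs, n_topics=2)

for k, dist in enumerate(topic_word):
    top = sorted(range(len(vocab)), key=lambda w: -dist[w])[:3]
    print("topic", k, [vocab[w] for w in top])
for post, shares in zip(posts, doc_topic):
    print([round(s, 2) for s in shares], post)
```

Each row of doc_topic sums to 1 and can be read as the mix of topics in one post. On a collection this small the topics themselves are not reliable; real use involves far more text and a tuned library implementation.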

BERTopic – The Advanced Topic Modeling Method

In order to deliver you the best quality data, we employ more than one topic modeling algorithm within PUMP. While LDA is very popular and widely used, there is a newer and more advanced model – BERTopic.

BERTopic is a topic modeling technique that uses pre-trained language models to create numerical representations, or embeddings, of text data. It then clusters similar documents together based on their embeddings, so that each cluster corresponds to a topic present in the collection.

By default, BERTopic first compresses the embeddings into a small number of dimensions and then applies a density-based hierarchical clustering algorithm called HDBSCAN, which groups documents that lie close together and sets aside those that fit no group as outliers. The resulting topics can also be arranged into a hierarchy, which allows BERTopic to identify subtopics within broader topics and often makes its results more nuanced than those of earlier topic modeling algorithms.
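
The embed, cluster, and describe idea can be illustrated with a toy Python sketch. Everything in it is an assumption made for this article: the posts, the hand-written two-dimensional "embeddings", and the function names cosine, cluster, and top_words are invented. The greedy grouping is a deliberately simple stand-in rather than the clustering BERTopic performs, and the word scoring is a simplified version of the class-based TF-IDF weighting that BERTopic uses to label its topics.

```python
import math
from collections import Counter

# Hypothetical posts with made-up 2-D "embeddings" (invented for illustration).
# A real pipeline would get vectors with hundreds of dimensions from a
# pre-trained language model.
posts = [
    ("bitcoin rally lifts crypto market", (0.9, 0.1)),
    ("crypto traders cheer bitcoin rally", (0.8, 0.2)),
    ("gold futures climb on inflation fears", (0.1, 0.9)),
    ("inflation fears lift gold futures", (0.2, 0.8)),
]

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cluster(vectors, threshold=0.8):
    """Greedy stand-in clustering: join the first cluster whose first member is similar enough."""
    seeds, labels = [], []
    for v in vectors:
        for label, seed in enumerate(seeds):
            if cosine(v, seed) >= threshold:
                labels.append(label)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            seeds.append(v)
            labels.append(len(seeds) - 1)
    return labels

def top_words(texts, labels, n=3):
    """Rank the words of each cluster with a simplified class-based TF-IDF."""
    per_cluster = {}
    for text, label in zip(texts, labels):
        per_cluster.setdefault(label, Counter()).update(text.split())
    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(per_cluster)
    result = {}
    for label, counts in per_cluster.items():
        size = sum(counts.values())
        # Frequent within the cluster and rare overall gives a high score.
        score = {
            w: (c / size) * math.log(1 + avg_words / total[w])
            for w, c in counts.items()
        }
        result[label] = sorted(score, key=lambda w: (-score[w], w))[:n]
    return result

texts = [text for text, _ in posts]
labels = cluster([vec for _, vec in posts])
keywords = top_words(texts, labels)
print(labels)
print(keywords)
```

With these made-up vectors the two crypto posts land in one group and the two gold posts in another, and each group is then summarised by its most characteristic words. The important difference from LDA is that the grouping comes from the embeddings, not from word counts.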

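BERTopic has further advantages over other topic modeling methods. For example, because embeddings capture the meaning of a whole post rather than just counting its words, it copes well with large collections of short, informal texts such as social media posts, which can be challenging for earlier algorithms like LDA. Very long documents, by contrast, may need to be split into shorter passages first, since embedding models typically read only a limited amount of text at a time. Additionally, BERTopic can swap in newer pre-trained language models as they appear, allowing it to benefit from the latest advancements in natural language processing.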

As of now, BERTopic is arguably one of the most powerful tools for topic analysis of very large amounts of text. This makes it a strong choice for analysing social media signals and discussions.

There are many methods used for topic modeling within the social media listening niche. Among these, two stand out: LDA for its popularity and BERTopic for its effectiveness. While LDA has become the most popular, BERTopic is likely the better choice for analysing the vast amount of data generated daily on social media platforms. We use both of these models within PUMP to deliver you the best, most nuanced, and highly relevant insights for the assets we track – cryptocurrencies, stocks, market indices, bonds, commodities, and more.