A Social Media sentiment tracking product is only as good as the quality of its data. Zenpulsar makes the quality of Social Media signal measurement a top priority. We employ a number of AI-based data mining and analysis models — an area of our core strength and value proposition. Below, we briefly touch upon some of the methods we use for data selection and purification.
Selection of asset-relevant social media posts. This is done through iterative use of information retrieval methods such as keyword extraction and topic modelling (LDA, BERTopic, etc.).
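As a rough illustration of what iterative keyword-based selection could look like, the sketch below starts from a few seed keywords, selects the posts that match, then promotes terms that recur across the matched posts into the keyword set and repeats. The function names, the stopword list, and the thresholds are assumptions made for this example; it does not reproduce our production pipeline and leaves out topic modelling entirely.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "on", "for", "this"}

def tokenize(text):
    """Lowercase a post and split it into tokens, dropping stopwords."""
    return [t for t in re.findall(r"[a-z0-9$#]+", text.lower()) if t not in STOPWORDS]

def select_relevant(posts, seed_keywords, rounds=2, new_terms_per_round=2, min_posts=2):
    """Iteratively expand a keyword set and return (selected_posts, keywords)."""
    keywords = {k.lower() for k in seed_keywords}
    token_sets = [set(tokenize(p)) for p in posts]
    for _ in range(rounds):
        # Posts that share at least one token with the current keyword set.
        matched = [ts for ts in token_sets if ts & keywords]
        # Document frequency of each term within the matched posts.
        doc_freq = Counter(t for ts in matched for t in ts)
        # Promote terms that recur across matched posts (hypothetical rule).
        new_terms = [t for t, c in doc_freq.most_common()
                     if c >= min_posts and t not in keywords]
        keywords.update(new_terms[:new_terms_per_round])
    selected = [p for p, ts in zip(posts, token_sets) if ts & keywords]
    return selected, keywords
```

With seed keywords such as `{"bitcoin", "btc"}`, a term like "halving" that appears in several matched posts would be added in the first round, which in turn pulls in posts that mention the halving without naming the asset directly.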
Finance-related classification. To filter key samples from large volumes of posts and news, we employ a state-of-the-art NLP model (XLM-RoBERTa).
Bot detection. Some of the key techniques we use to identify if content originates from bots or humans include:
- NLP-based content analysis: we employ transformer models, such as Google mT5 and XLM-RoBERTa, trained on bot post datasets.
- Heuristics-based features (speed of posting, statistical characteristics based on NER analysis results, etc.). These features are fed to a support vector machine (SVM) classifier.
- The format of recent posts from the same user. Many bots generate their posts from templates, assembling and transforming pieces of text. The model can extract features from these patterns to improve detection.
- Analysis of network topology (bot accounts have a different topology from human accounts), specifically the centrality characteristics of an account within the account network (betweenness centrality, Katz centrality, PageRank).
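To make the heuristic and post-format signals more concrete, here is a minimal sketch of how such features might be computed for one account. It derives two posting-speed features from timestamps and one template-similarity feature from recent post texts. All function names and the choice of features are assumptions for illustration; the NER-based statistics are omitted, and the resulting vector is only the kind of input that could be passed to a classifier such as an SVM.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean, pstdev

def posting_features(timestamps):
    """Posting-speed features from Unix timestamps (in seconds)."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    if not gaps:
        return {"mean_gap": 0.0, "gap_cv": 0.0}
    avg = mean(gaps)
    # Coefficient of variation: near 0 means very regular posting intervals.
    cv = pstdev(gaps) / avg if avg > 0 else 0.0
    return {"mean_gap": avg, "gap_cv": cv}

def template_similarity(texts):
    """Mean pairwise character-level similarity of recent posts, in [0, 1]."""
    pairs = list(combinations(texts, 2))
    if not pairs:
        return 0.0
    return mean(SequenceMatcher(None, a, b).ratio() for a, b in pairs)

def feature_vector(timestamps, texts):
    """Hypothetical feature vector for a downstream classifier."""
    f = posting_features(timestamps)
    return [f["mean_gap"], f["gap_cv"], template_similarity(texts)]
```

An account that posts exactly once a minute with near-identical wording would produce a gap variation close to zero and a similarity close to one, whereas a typical human account tends to score higher on the former and lower on the latter.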
Identification of influencers, market analysts, and abnormal accounts. To identify specific account types, we use the following techniques:
- NLP-based content analysis: transformer models such as Google mT5 or XLM-RoBERTa trained on influencer post datasets.
- Analysis of the account-following network, specifically the centrality characteristics of an account within the account network (betweenness centrality, Katz centrality, PageRank, eigenvector centrality).
- Number of followers/Reddit karma thresholds.
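The network and threshold signals above can be sketched as follows. This example computes PageRank by power iteration over a list of (follower, followed) pairs and then keeps the top-ranked accounts that also pass a follower-count threshold. It shows only one of the centrality measures listed, and the function names, the `follower_counts` mapping, and the `min_followers` and `top_k` parameters are hypothetical choices for this illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """PageRank by power iteration. edges: (follower, followed) pairs."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by accounts that follow nobody is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out[src]:
                # Each account passes its rank equally to the accounts it follows.
                new[dst] += damping * rank[src] / len(out[src])
        rank = new
    return rank

def flag_influencers(edges, follower_counts, min_followers, top_k=1):
    """Top-k accounts by PageRank that also meet a follower threshold."""
    scores = pagerank(edges)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [a for a in ranked[:top_k] if follower_counts.get(a, 0) >= min_followers]
```

Combining a centrality score with a simple follower threshold is one way to avoid flagging accounts that are well connected inside a small cluster but have little real reach.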
Sentiment detection. We utilise transformer-based models (FinBERT, CryptoBERT and CryptoRoBERTa) fine-tuned on our internal datasets. These models were trained on cryptocurrency and stock data collected from Social Media, and the classifier outputs three classes: bearish, neutral, and bullish.
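As an illustration of the final classification step only, the sketch below turns three raw classifier scores (logits) into probabilities with a softmax and picks the most likely class. The logits are assumed to come from a fine-tuned model that is not shown here; the class order in `LABELS` and the `net_sentiment` aggregate are assumptions made for this example rather than a description of our models' actual outputs.

```python
import math

# Assumed class order; a real model defines its own label mapping.
LABELS = ("bearish", "neutral", "bullish")

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return (label, probability) for the most likely class."""
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=probs.__getitem__)
    return LABELS[best], probs[best]

def net_sentiment(batch_logits):
    """Hypothetical aggregate: mean of P(bullish) - P(bearish), in [-1, 1]."""
    scores = []
    for logits in batch_logits:
        p = softmax(logits)
        scores.append(p[2] - p[0])
    return sum(scores) / len(scores)
```

An aggregate of this kind is one possible way to roll per-post labels up into a single signal per asset and time window.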
Use Cases for the Datasets
All three of our datasets may be used for:
- Identifying assets for alpha generation in your portfolio
- Using sentiment signals to predict short-term asset price movements for day trading and other forms of active and frequent trading
- Identifying suitable assets for long-term investment, typically by using extended historical time intervals in the sentiment data analysis to spot established correlations between sentiments and prices
- Portfolio management and diversification for hedge funds and other forms of investment funds
- Quantitative investing
- Using the Social Media sentiment data within fundamental analysis of the target assets