Twitter Bots and Their Detection: Interview with ZENPULSAR’s Data Science Head

The recent content viewing restrictions imposed by Elon Musk at Twitter have led to a renewed interest in social media bots. ZENPULSAR has built its business and reputation to a significant degree on having a state-of-the-art bot detection mechanism. Our technical team, led by Pavel Dudko, is at the forefront of detecting and neutralising the effect of bot content on social sentiment measurement. Today, we are interviewing Pavel again, this time on the issue of Twitter bots and bot detection.

Q: Pavel, Twitter’s new viewing restrictions are said to be aimed at countering data scraping and bots. Specifically on the issue of bots, how big of a problem is it on Twitter?

PD: Bots, and particularly spam bots, are a massive problem on Twitter. The platform is really drowning in bot and spam content, at least in comparison to other social media networks we track. This is particularly noticeable in the crypto content domain. As you know, crypto discussions make up one of the three key financial areas we measure sentiment on, the other two being equity- and commodity-related content.

Just a few months ago, we could see that around 90% of crypto-related Twitter discussions were bot-based. However, Twitter has since started to get serious about fighting bots, and it started doing so well before these recent restrictions were announced. The restrictions generated considerable hype among the public, but in the background, Twitter’s technical team got serious about bot-related content around late 2022 to early 2023.

I think this “attitude change” was undoubtedly driven by Musk’s acquisition of Twitter. So, while Musk was accused of many things upon buying Twitter, one positive directional change he set at the technical level was the increase in anti-bot activity.

Twitter hasn’t been successful in defeating bots overnight; there are still plenty of them on the platform. However, I can tell you that, for example, the proportion of bot-generated content in the crypto space has now decreased to about 60-70%. That’s still a very high percentage, but at least it’s notably lower than the 90% we could observe earlier this year.

This decrease has little to do with the very recent viewing limits they announced. It’s been more a result of multi-month efforts by their technical teams.

Q: Speaking of these viewing restrictions, what immediate effects have you observed?

PD: The most immediate and noticeable effect is the overall decrease in content, both bot- and human-generated. In terms of the prevalence of bots, I think it’s too early to confidently claim any serious decrease. Let’s see how the situation develops over the coming weeks and months.

Q: You noted above that around 60-70% of crypto discussions on Twitter are bot-generated. How does Twitter compare to other social media platforms when it comes to bot incidence rates?

PD: Twitter has always had higher bot incidence rates than competing platforms. For example, on Reddit, bot incidence rates are around 45%. However, these rates are highly variable depending on the specific sub-reddit. While Twitter uses a platform-wide bot detection mechanism, Reddit handles detection at the sub-reddit level: each sub-reddit is responsible for fighting (or ignoring) bots. As a result, while some sub-reddits have virtually no bots, others might have incidence rates comparable to Twitter’s.

In contrast, Seeking Alpha has minimal bot incidence rates.

Q: What does ZENPULSAR do within our own systems to identify bots?

PD: A critical part of our data cleaning process is spam removal. Spam is nearly always generated by bots. Our AI algorithms are trained to identify nearly all spam content that floats around on social media. As we filter out spam, we therefore also filter out bot content. Specifically for Twitter, our bot detection rate stands at around 98%.

We also achieve high bot detection rates thanks to our rigorous overall data processing framework. It’s made up of three key stages – content analysis, network analysis, and heuristic ML-based analysis.

Q: Could you briefly describe these three stages?

PD: Certainly. At the first stage, content analysis, we use our trained AI models to identify content that is likely generated by bots. These models are trained on continually updated data that teaches them what kind of content bots tend to generate. A large proportion of bot content doesn’t pass this first-stage check.

At the second stage, network analysis, our algorithms look at social media accounts’ network topology. We analyse each account’s linkages with other accounts. Bots usually tend to follow other bots. While human users might also be linked to bot accounts, they are much more likely than bots to follow other humans.
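The network-analysis idea Pavel describes can be sketched as a simple neighbourhood check. This is a minimal illustration, not ZENPULSAR’s actual implementation: it assumes a hypothetical follow graph and a set of already-flagged bot accounts, and measures what fraction of an account’s followed accounts are known bots.

```python
from typing import Dict, Set

def bot_neighbour_ratio(account: str,
                        follows: Dict[str, Set[str]],
                        known_bots: Set[str]) -> float:
    """Fraction of the accounts this user follows that are flagged as bots.

    A high ratio is one possible network-level signal: bots tend to
    follow other bots far more often than humans do.
    """
    neighbours = follows.get(account, set())
    if not neighbours:
        return 0.0
    return len(neighbours & known_bots) / len(neighbours)

# Toy follow graph: "spam_42" mostly follows flagged bots, "alice" mostly humans.
follows = {
    "spam_42": {"bot_a", "bot_b", "carol"},
    "alice": {"bob", "carol", "bot_a"},
}
known_bots = {"bot_a", "bot_b"}

print(bot_neighbour_ratio("spam_42", follows, known_bots))  # 2 of 3 followed accounts are bots
print(bot_neighbour_ratio("alice", follows, known_bots))    # 1 of 3 followed accounts are bots
```

In practice such a signal would feed into a larger model rather than serve as a classifier on its own, since humans can also follow bot accounts.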

At the final stage, heuristic ML, we use a set of custom heuristic ML rules to spot a bot. There are about 100 such rules that help our models detect bots. For example, one key rule is based on the average speed of posting. Bots often give themselves away by posting at speeds impossible for any human.

Another rule is based on analysing the account’s profile picture. Human users are much more likely to use unique images, while bots often use profile pictures grabbed elsewhere on the internet. Bots also like using AI generated profile pictures.

There are also rules based on identifying a mismatch between an account’s profile language and posting language; the ratio of likes to posts; and many more. No single rule is used to classify an account as a bot. Used together, the entire set of about 100 rules helps us spot a bot with a very high degree of precision.
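The rule-ensemble approach described above can be sketched as follows. The rules, thresholds, and field names here are purely illustrative assumptions, not ZENPULSAR’s actual rules; the point is only that each rule contributes a signal and the combined score, not any single rule, drives the classification.

```python
from dataclasses import dataclass

@dataclass
class AccountStats:
    posts_per_hour: float    # average posting speed
    profile_lang: str        # language declared on the profile
    dominant_post_lang: str  # most common language across posts
    likes: int
    posts: int

# Each rule returns True when the account looks bot-like.
def rule_posting_speed(a: AccountStats) -> bool:
    return a.posts_per_hour > 30  # faster than any human could sustain

def rule_language_mismatch(a: AccountStats) -> bool:
    return a.profile_lang != a.dominant_post_lang

def rule_like_ratio(a: AccountStats) -> bool:
    return a.posts > 0 and a.likes / a.posts < 0.01  # posts a lot, engages little

RULES = [rule_posting_speed, rule_language_mismatch, rule_like_ratio]

def bot_score(a: AccountStats) -> float:
    """Fraction of rules that fire; no single rule decides on its own."""
    return sum(rule(a) for rule in RULES) / len(RULES)

suspicious = AccountStats(posts_per_hour=120, profile_lang="en",
                          dominant_post_lang="ru", likes=3, posts=5000)
print(bot_score(suspicious))  # all three rules fire -> 1.0
```

A real system with ~100 rules would likely weight the rules and learn the decision threshold from labelled data rather than use a plain average.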

Q: Thank you, Pavel. It’s been extremely informative. As a final quick question – do you think Twitter’s strategy of putting these rather severe limits on content viewing will help them drastically reduce bot incidence rates?

PD: As I said, they’ve been working on decreasing their bot rates behind the scenes for the last few months. As for these new limits, I am not sure they’ll deliver any significant bot rate reduction. We just have to give Twitter some time to see whether they work, or whether the restrictions were really about bots at all.

Frankly speaking, this is not a major concern for me as our bot detection rates are hovering close to 100%. Regardless of Twitter’s current or future bot incidence rates, we rely on our own framework to filter out nearly all bot content. As such, whether Twitter’s incidence rates are over 90% or under 5% has no substantial bearing on our social sentiment data.