Introduction
The global financial landscape underwent dramatic shifts during the onset of the COVID-19 pandemic, with cryptocurrencies like Bitcoin (BTC) experiencing heightened volatility. Numerous studies have explored how the pandemic influenced financial markets, revealing that Bitcoin did not act as a safe haven during economic turmoil and instead showed increasing correlation with traditional stock markets. Research has also examined the co-movement between cryptocurrency prices and global health indicators, such as daily reported deaths due to the virus.
While earlier works used machine learning models—like Multiple Linear Regression, Bayesian autoregression, and Multi-Layer Perceptron—to predict BTC prices based on social media sentiment, few focused specifically on the emotional tone of public discourse during the pandemic. This gap motivated a comprehensive analysis using Valence Aware Dictionary and sEntiment Reasoner (VADER), a rule-based sentiment analysis tool particularly effective for social media content.
This study evaluates 13 distinct preprocessing strategies applied to Bitcoin-related tweets collected during the height of the pandemic. The goal is to determine which text-cleaning methods best enhance the correlation between tweet sentiment scores and actual BTC price movements. By leveraging VADER’s ability to interpret slang, emojis, and informal syntax common on Twitter, this research offers insights into how emotional signals in digital conversations can reflect or even anticipate market trends.
Understanding Sentiment Analysis Techniques
Sentiment analysis enables machines to interpret human emotions in text, making it invaluable for financial forecasting, brand monitoring, and behavioral research. Several methodologies exist, each with unique strengths depending on context and data type.
Valence Aware Dictionary and sEntiment Reasoner (VADER)
VADER stands out for its effectiveness in analyzing short, informal texts like tweets. As a lexicon- and rule-based system, it doesn’t require training data, making it faster than most machine learning approaches. It assigns each text a compound sentiment score ranging from -1 (extremely negative) to +1 (extremely positive), along with normalized scores for negative, neutral, and positive sentiments.
Crucially, VADER recognizes intensifiers ("very," "incredibly"), punctuation emphasis ("!!!", "???"), negations ("not good"), and emoticons (":)", ":-("), all of which are common in social media discourse. This makes it ideal for extracting meaningful emotional signals from noisy, unstructured tweet data.
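The rule-based idea can be illustrated with a toy scorer. This is a minimal sketch, not VADER itself: the four-word lexicon, the rule constants, and the function name `toy_compound` are illustrative assumptions. Only the final squashing step, x / sqrt(x² + 15), follows the normalization used by the reference VADER implementation to map a raw sum into the range -1 to +1.

```python
import math
import re

# Toy lexicon and rule constants: illustrative assumptions, not VADER's real data.
LEXICON = {"good": 1.9, "great": 3.1, "bad": -2.5, "crash": -2.0}
INTENSIFIERS = {"very": 0.3, "incredibly": 0.3}
NEGATIONS = {"not", "never", "no"}


def toy_compound(text):
    """Score text with a few VADER-style rules and squash the sum into [-1, 1]."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in LEXICON:
            continue
        valence = LEXICON[tok]
        prev = tokens[i - 1] if i > 0 else ""
        prev2 = tokens[i - 2] if i > 1 else ""
        # An intensifier right before the word pushes the score away from zero.
        if prev in INTENSIFIERS:
            valence += math.copysign(INTENSIFIERS[prev], valence)
        # A negation within the two preceding tokens flips and dampens the score.
        if prev in NEGATIONS or prev2 in NEGATIONS:
            valence *= -0.74
        total += valence
    # Up to three exclamation marks add emphasis in the existing direction.
    if total != 0:
        total += math.copysign(min(text.count("!"), 3) * 0.3, total)
    # Normalize the raw sum into [-1, 1].
    return total / math.sqrt(total * total + 15)
```

With these toy values, "not good" scores below zero, while "very good" and "good!!!" both score higher than plain "good". A text with no lexicon words scores exactly 0.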
Word2Vec: Capturing Semantic Meaning
Word2Vec transforms words into dense vector representations that capture semantic relationships. For example, vector operations can reveal analogies like “king – man + woman = queen.” These embeddings help machine learning models understand contextual similarity and improve performance in tasks like classification and prediction.
While powerful, Word2Vec requires large corpora and computational resources for training. In contrast to VADER’s immediacy, it is better suited for deep learning pipelines where contextual nuance matters more than speed.
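The analogy arithmetic can be shown on hand-made vectors. This is a sketch under strong assumptions: the three-dimensional vectors in `VECTORS` are invented for illustration and were not produced by a trained Word2Vec model, which would typically learn hundreds of dimensions from a large corpus.

```python
import math

# Hand-made vectors (roughly: royalty, maleness, femaleness). Toy assumptions only.
VECTORS = {
    "king": (0.9, 0.8, 0.1),
    "queen": (0.9, 0.1, 0.8),
    "man": (0.1, 0.9, 0.1),
    "woman": (0.1, 0.1, 0.9),
    "coin": (0.0, 0.2, 0.2),
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = tuple(x - y + z for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    candidates = [w for w in VECTORS if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))
```

Calling `analogy("king", "man", "woman")` on these toy vectors selects "queen", because the target vector keeps the royalty component and swaps the gender components.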
TF-IDF: Identifying Key Terms
Term Frequency-Inverse Document Frequency (TF-IDF) quantifies the importance of a word within a document relative to a collection. Words frequent in one document but rare across others receive higher weights, helping identify distinctive themes.
In cryptocurrency sentiment analysis, TF-IDF can spotlight emerging topics—like “halving,” “ETF,” or “regulation”—that may influence market behavior. However, it treats words independently and misses emotional valence unless paired with a sentiment lexicon.
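A minimal sketch of the weighting, assuming the common tf × log(N / df) variant with whitespace tokenization (libraries often apply smoothing, so their numbers will differ). The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

For three short texts that all mention "bitcoin" but only one mentions "halving", "bitcoin" receives a weight of zero everywhere while "halving" receives a positive weight in its own document, which is the behavior described above.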
N-Gram Modeling: Context Through Word Sequences
N-grams capture sequences of n consecutive words (unigrams: single words; bigrams: word pairs; trigrams: triplets). This helps preserve context lost in isolated word analysis.
For instance, “not bullish” carries opposite meaning to “bullish,” but a unigram model might miss the negation. Using bigrams or trigrams allows models to detect such reversals, improving sentiment accuracy—especially when integrated with VADER’s negation handling rules.
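Extracting n-grams from a token list takes only a sliding window. The helper name `ngrams` below is an assumption for illustration.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

Applied to the tokens of "btc is not bullish" with n = 2, the result includes the pair ("not", "bullish"), which keeps the negation attached to the word it modifies.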
Related Work: Social Media Sentiment and Cryptocurrency Markets
Prior research confirms a measurable link between social media sentiment and cryptocurrency price movements. One study found that changes in Twitter sentiment could predict short-term BTC price fluctuations with up to 83% accuracy using a simple model. Other researchers demonstrated that both tweet volume and sentiment polarity correlate with price volatility across multiple digital assets.
Some studies enhanced VADER by weighting tweets based on user influence—factoring in follower counts, retweets, and likes—to create a more representative market sentiment index. These weighted scores were then used as inputs in predictive models alongside historical prices and moving averages.
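One way such a weighting might look is sketched below. The formula in `influence_weight` is hypothetical: the studies summarized here are not quoted with an exact formula, so the log-damped sum is only an assumption chosen so that very large accounts do not dominate the index.

```python
import math


def influence_weight(followers, retweets, likes):
    """Hypothetical weight: 1 plus log-damped follower, retweet, and like counts."""
    return 1.0 + math.log1p(followers) + math.log1p(retweets) + math.log1p(likes)


def weighted_sentiment(tweets):
    """Weighted mean of compound scores.

    tweets: iterable of (compound_score, followers, retweets, likes).
    Returns 0.0 for an empty input.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for score, followers, retweets, likes in tweets:
        weight = influence_weight(followers, retweets, likes)
        weighted_sum += weight * score
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

Under this assumed scheme, a positive tweet from a widely followed account outweighs an equally negative tweet from an account with almost no reach, so the index leans positive.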
Regression analyses on lesser-known cryptocurrencies revealed varying degrees of dependence on social sentiment. While stablecoins showed minimal correlation, highly speculative tokens like Dogecoin exhibited stronger links between online chatter and price swings.
Additionally, work by Sailunaz and Alhajj showed that full-text tweet analysis outperformed filtered grammatical categories (nouns, verbs, adjectives) in recommendation systems, reinforcing the value of preserving raw linguistic expression in sentiment modeling.
Comprehensive Sentiment Analysis of BTC Tweets During the Pandemic
This study aimed to refine the preprocessing pipeline for Twitter data to maximize the alignment between VADER-generated sentiment scores and Bitcoin’s closing prices. By testing various combinations of text-cleaning techniques, we identified optimal strategies for enhancing predictive relevance.
Data Collection Methodology
To ensure data integrity and compliance with platform policies, a custom Python scraper was developed using the Tweepy library and Twitter API v1.1. The collection period spanned from May 22 to July 10—capturing critical phases of pandemic-driven market uncertainty.
Tweets were filtered using keywords related to Bitcoin: “bitcoin,” “BTC,” “XBT,” “satoshi,” and associated hashtags (#BTC, $BTC). Both truncated (140-character) and extended full-length versions were captured where available. A total of 4,169,709 tweets were collected, each with precise timestamps down to the second.
Bitcoin price data was sourced from CryptoCompare API at one-minute intervals, yielding 71,472 data points of OHLCV (Open, High, Low, Close, Volume) information. This high-frequency dataset allowed granular comparison between sentiment trends and price action.
To address data volatility in recently updated records, a bi-daily refresh protocol replaced provisional prices with finalized values upon subsequent API calls.
Preprocessing Strategies for Optimal Sentiment Extraction
Thirteen preprocessing configurations were tested, combining three core functions:
- cleaned: Removes Twitter-specific elements (URLs, mentions @user, hashtags #tag, RT prefixes) while preserving emojis and emoticons.
- split: Breaks tweets into individual sentences to isolate distinct sentiments within longer posts.
- no sw: Eliminates stopwords not recognized by VADER’s lexicon after tokenization.
These functions were applied in different orders and combinations. For example:
- Clean → Split → Remove Stopwords
- Split → Clean → Remove Stopwords
- Clean only
- Split only
Special care was taken during tokenization to retain sentiment-relevant punctuation like exclamation marks and question marks, which VADER uses to adjust emotional intensity. HTML entities (e.g., &amp;amp;) were converted to their standard characters (here, &amp;) to maintain readability.
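The three functions and one of the orderings can be sketched as follows. This is not the study's code: the regular expressions, the naive punctuation-based sentence splitter, and the `stopwords` and `lexicon` arguments are assumptions made for illustration.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
RT_RE = re.compile(r"^\s*RT\s+@\w+:?\s*")      # leading "RT @user:" prefix
TAG_RE = re.compile(r"[@#]\w+")                 # @mentions and #hashtags
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")  # whitespace after . ! or ?


def clean(text):
    """Decode HTML entities and drop RT prefixes, URLs, mentions, and hashtags."""
    text = html.unescape(text)
    for pattern in (RT_RE, URL_RE, TAG_RE):
        text = pattern.sub("", text)
    # Emojis, emoticons, and punctuation are left untouched.
    return re.sub(r"\s+", " ", text).strip()


def split(text):
    """Split after sentence-ending punctuation, keeping the punctuation itself."""
    return [part for part in SENTENCE_END_RE.split(text) if part]


def remove_stopwords(tokens, stopwords, lexicon):
    """Drop stopwords unless the sentiment lexicon assigns them a score."""
    return [t for t in tokens if t.lower() not in stopwords or t.lower() in lexicon]


def clean_then_split(text):
    """The Clean -> Split ordering: one cleaned string per sentence."""
    return split(clean(text))
```

Because `split` relies on the punctuation that `clean` preserves, a retweet such as "RT @alice: Bitcoin &amp;amp; ETH are up!!! Not selling :)" becomes two sentences, each of which can be scored separately.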
After preprocessing, each tweet was scored using VADER’s compound metric. Minute-level aggregation produced average sentiment scores and polarity volumes aligned with BTC price timestamps.
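The aggregation step might be sketched as below. The ±0.05 cutoff for counting a tweet as positive or negative is the threshold conventionally recommended for VADER compound scores; whether this study used the same cutoff is an assumption, as are the function and field names.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def aggregate_by_minute(
    scored_tweets: Iterable[Tuple[datetime, float]],
    threshold: float = 0.05,
) -> Dict[datetime, dict]:
    """Group (timestamp, compound) pairs by minute and summarize each group."""
    buckets = defaultdict(list)
    for timestamp, score in scored_tweets:
        # Truncate to the start of the minute so tweets align with price rows.
        buckets[timestamp.replace(second=0, microsecond=0)].append(score)

    summary = {}
    for minute, scores in sorted(buckets.items()):
        positive = sum(s >= threshold for s in scores)
        negative = sum(s <= -threshold for s in scores)
        summary[minute] = {
            "mean": sum(scores) / len(scores),
            "positive": positive,
            "negative": negative,
            "neutral": len(scores) - positive - negative,
        }
    return summary
```

Each key of the returned dictionary is a minute-resolution timestamp, so it can be joined directly against one-minute price records.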
Findings and Performance Evaluation
To assess effectiveness, an Average Feature Correlation Magnitude (AFCM) was calculated for each strategy: the mean absolute correlation between that strategy's sentiment features and BTC closing prices.
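A minimal sketch of the metric for one strategy, assuming Pearson correlation (the text says only "correlation") and that every feature series is already aligned with the closing prices minute by minute. The names `pearson` and `afcm` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def afcm(features, close):
    """Mean of |r| between each feature series and the closing prices.

    features: {feature_name: series}, each the same length as close.
    """
    magnitudes = [abs(pearson(series, close)) for series in features.values()]
    return sum(magnitudes) / len(magnitudes)
```

Taking the absolute value means a strongly negative correlation counts as much as a strongly positive one, so the metric rewards strength of association regardless of direction.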
Results showed that sentence splitting, removal of Twitter tags, and their combination consistently improved correlation strength. Specifically:
- Strategies involving splitting tweets into sentences allowed VADER to assign more accurate scores per idea.
- Removing @mentions and #hashtags reduced noise without losing emotional context.
- The highest AFCM values came from pipelines that cleaned text first, then split sentences.
Interestingly, removing stopwords had minimal impact—suggesting that many so-called “stopwords” carry emotional weight in crypto discourse (e.g., “will,” “can,” “not”).
Frequently Asked Questions
Q: Why use VADER instead of machine learning models for sentiment analysis?
A: VADER is fast, requires no training, and excels at interpreting informal language, emojis, and punctuation—making it ideal for real-time social media analysis.
Q: How does tweet preprocessing affect sentiment accuracy?
A: Proper cleaning removes irrelevant noise (like URLs), while sentence splitting isolates distinct sentiments. Both steps significantly improve correlation with market data.
Q: Can social media sentiment reliably predict Bitcoin prices?
A: While not deterministic, strong correlations exist—especially during periods of high uncertainty like the pandemic. Sentiment acts as a behavioral indicator of market psychology.
Q: What makes this study different from prior research?
A: It systematically tests multiple preprocessing techniques using VADER during a specific crisis period (COVID-19), offering actionable insights for building better predictive models.
Q: Were bot-generated tweets included in the analysis?
A: The dataset includes all publicly available tweets matching keywords. While some bot activity may be present, influence-weighted scoring could mitigate their impact in future studies.
Q: Is real-time sentiment analysis feasible for traders?
A: Yes—when combined with APIs and automated pipelines, tools like VADER enable near real-time monitoring of market mood shifts.
Conclusion
This study demonstrates that thoughtful preprocessing of Twitter data significantly enhances the correlation between VADER-based sentiment scores and Bitcoin price movements during crisis periods like the pandemic. Sentence splitting and removal of platform-specific syntax emerged as key contributors to model performance.
By refining how raw social media text is transformed into quantifiable emotional signals, researchers and investors alike can build more responsive forecasting tools. Future work may integrate user influence weighting, multi-platform data aggregation, or hybrid models combining VADER with deep learning embeddings.
As digital discourse continues shaping financial markets, understanding the pulse of public sentiment remains a vital edge—one that starts with clean data and smart analysis.