A PREPRINT - JULY 28, 2020
Table 3: Example results of eight tweet processing cases. Stop word and Single char denote stop word and single character removal, respectively. The '#' character is removed while the keyword is kept, to retain the context of the hashtag. Case 3 appears to be the cleanest, with the most informative tokens.
Number and definition, followed by the tokenized list of strings:

Case 0: No processing (raw tweet)
    ['&', '#gofundme', '@united', 'the', 'refund', 'request?', "functionality's", 'w', 'on', 'http://t.co/5IMDckODfx', 'is', 'broken.i', 'get', 'a', 'time', 'out', 'error.']

Case 1: RegEx
    ['gofundme', 'the', 'refund', 'request', '?', 'functionality', 's', 'w', 'on', 'is', 'broken', 'i', 'get', 'a', 'time', 'out', 'error']

Case 2: RegEx + Stop word
    ['gofundme', 'refund', 'request', '?', 'functionality', 's', 'w', 'broken', 'i', 'get', 'a', 'time', 'error']

Case 3: RegEx + Stop word + Single char
    ['gofundme', 'refund', 'request', 'functionality', 'broken', 'get', 'time', 'error']

Case 4: RegEx + Single char + Encode emoji
    ['gofundme', 'the', 'refund', 'request', 'functionality', 'on', 'is', 'broken', 'get', 'time', 'out', 'error', b'\xf0\x9f\x98\x8a']

Case 5: RegEx + Stop word + Single char + Encode emoji
    ['gofundme', 'refund', 'request', 'functionality', 'broken', 'get', 'time', 'error', b'\xf0\x9f\x98\x8a']

Case 6: RegEx + Encode emoji
    ['gofundme', 'the', 'refund', 'request', '?', 'functionality', 's', 'w', 'on', 'is', 'broken', 'i', 'get', 'a', 'time', 'out', 'error', b'\xf0\x9f\x98\x8a']

Case 7: RegEx + Stop word + Encode emoji
    ['gofundme', 'refund', 'request', '?', 'functionality', 's', 'w', 'broken', 'i', 'get', 'a', 'time', 'error', b'\xf0\x9f\x98\x8a']
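The eight cases can be seen as toggling three optional steps on top of a regex pass. The sketch below is a minimal Python approximation: the paper's exact regular expressions and stop-word list are not given, so the patterns, the small STOP_WORDS set, and the emoji handling here are assumptions chosen to reproduce the tokens shown in Table 3.

```python
import re

# Assumed stop-word list; the paper's exact list is not specified.
STOP_WORDS = {"the", "a", "an", "is", "on", "in", "at", "of", "and", "out", "i"}

def regex_clean(tweet):
    """RegEx step (assumed patterns): drop URLs and @mentions, strip '#'
    but keep the keyword, then tokenize words, '?', and emojis."""
    tweet = re.sub(r"http\S+", " ", tweet)   # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)      # remove @mentions
    tweet = re.sub(r"#(\w+)", r"\1", tweet)  # '#gofundme' -> 'gofundme'
    return re.findall(r"[a-z]+|[?]|[\U0001F300-\U0001FAFF]", tweet.lower())

def process(tweet, stop_word=False, single_char=False, encode_emoji=False):
    tokens = regex_clean(tweet)
    if stop_word:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if single_char:
        # Remove single ASCII characters; keep emojis for the encoding step.
        tokens = [t for t in tokens if len(t) > 1 or not t.isascii()]
    out = []
    for t in tokens:
        if t.isascii():
            out.append(t)
        elif encode_emoji:
            out.append(t.encode("utf-8"))  # e.g. b'\xf0\x9f\x98\x8a'
        # non-ASCII tokens are dropped when emojis are not encoded
    return out

tweet = ("& #gofundme @united the refund request? functionality's w on "
         "http://t.co/5IMDckODfx is broken.i get a time out error. \U0001F60A")
print(process(tweet, stop_word=True, single_char=True, encode_emoji=True))
```

With all three toggles on, this reproduces Case 5; turning off `encode_emoji` reproduces Case 3.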
performance. A window size of seven appears to be the best choice for our particular Twitter classification task. The importance of individual words is calculated from the variance and the L2 norm of their word vectors. The words with the highest and lowest word-vector variance yield mean AUCs of 70 and 60, respectively, suggesting that word vectors with higher variance are more informative than those with lower variance. The cumulative performance gains after including the top N most important words and the N least important words are shown in Figure 3. The 10 most important words in a tweet yield 98.33% of the maximum sentiment classification accuracy obtained using all available words. The L2 norm and the variance of word vectors result in similar accuracy trends, as shown in Figure 3.
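The word-importance ranking described above can be sketched as follows. The embedding dictionary, its 300-dimension size, and the reading of "variance" as the variance of a vector's components are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical embeddings {word: 300-dim vector}; in the paper these
# would come from the trained word2vec model.
vocab = ["refund", "broken", "error", "the", "a", "on"]
embeddings = {w: rng.standard_normal(300) for w in vocab}

def importance(vec, metric="variance"):
    # Score a word vector by the variance of its components or by its L2 norm.
    return np.var(vec) if metric == "variance" else np.linalg.norm(vec)

def top_n_words(words, n=10, metric="variance"):
    """Keep the N most important in-vocabulary words of a tweet."""
    scored = [(importance(embeddings[w], metric), w)
              for w in words if w in embeddings]
    return [w for _, w in sorted(scored, reverse=True)[:n]]

print(top_n_words(["the", "refund", "on", "broken", "a", "error"], n=3))
```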
3.3 Effect of tweet embedding
We compare the performance of three proposed tweet embeddings with that of the baseline vector averaging method
in Twitter sentiment classification. Table 4 shows 10×2 nested cross-validation AUC-ROC for four tweet embeddings
and eight tweet processing cases. The baseline averaging of word vectors performs the worst among all tweet embeddings. Case 0, without any text processing, yields low accuracy with the highest standard deviation. The sum of the word vectors yields the most accurate sentiment classification. However, the weighted average slightly outperforms the summation method in several cases and is a superior choice to baseline averaging. The best classification performance is obtained using text processing Case 6, which is regular-expression-based processing followed by encoding of emojis. This suggests the value of encoding and incorporating emojis in Twitter SA. The worst classification performance is obtained using averaging of word vectors in Case 3, which excludes all stop words, single characters, and emojis to yield the cleanest tokens in Table 3. The Case 3 text processing scheme invariably yields the worst performance for all types of tweet embedding. This observation suggests that aggressive removal of stop words and characters has negative effects on word embedding, classifier training, and classification performance. Using only the 10 most important words (ranked by word-vector variance) and summing their word vectors yields classification performance comparable to that of the other tweet embedding methods. We further compare classification performance for the individual sentiment classes.
For the best (sum + Case 5) and worst (average + Case 3) scenarios, we show the corresponding confusion matrices in Table 5. The average accuracies of the best and worst scenarios are 73.33% and 69%, respectively. The confusion matrices show that negative tweets are misclassified as positive far less often than positive tweets are misclassified as negative. The worst model heavily confuses neutral tweets with negative tweets. Classifying neutral tweets is the most difficult task, as they occupy a gray zone between the positive and negative sentiment classes. Hence, neutral tweet classification accounts for most of the difference between the best scenario (71%) and the worst scenario (60%).
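The three aggregation schemes compared above can be sketched as below. This is a minimal illustration, assuming the "weighted average" weights each word vector by the variance of its components (the importance score discussed earlier); that weighting choice and the 300-dimension vectors are assumptions, not the paper's stated formula.

```python
import numpy as np

def tweet_embedding(word_vecs, method="sum"):
    """Combine a tweet's word vectors (shape: num_words x dim) into one
    fixed-length tweet vector."""
    V = np.asarray(word_vecs)
    if method == "sum":
        return V.sum(axis=0)
    if method == "average":  # baseline vector averaging
        return V.mean(axis=0)
    if method == "weighted_average":
        # Assumed weighting: per-word component variance as importance.
        w = V.var(axis=1)
        return (w[:, None] * V).sum(axis=0) / w.sum()
    raise ValueError(f"unknown method: {method}")

rng = np.random.default_rng(1)
vecs = rng.standard_normal((8, 300))  # 8 tokens, 300-dim word vectors
emb_sum = tweet_embedding(vecs, "sum")
emb_avg = tweet_embedding(vecs, "average")
emb_wav = tweet_embedding(vecs, "weighted_average")
```

Note that sum and average differ only by a constant factor per tweet, so any difference in downstream accuracy comes from how the classifier handles the resulting vector scale varying with tweet length.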