Machine Learning Project: Sentiment Analysis of Tweets

Mihir Ravishankar
May 3, 2021

Sentiment Analysis is the study of a user or customer’s views or attitude towards something. E-commerce websites like Amazon and eBay have pioneered the use of big data to better understand their customers’ wants and needs. Now, with improved technology and easier access to big data, sentiment analysis has become much more feasible for an individual with a basic knowledge of Python. This article discusses the approach taken in creating a model for analyzing the sentiment of tweets. The source code for this project can be found on GitHub:

https://github.com/xp55mihir/summer-project-2019

Gathering Data

The raw Twitter tweets were obtained from the following website:

https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/

Raw tweets contain special characters and symbols that the model will not be able to process, so the data needs to be cleaned before it can be passed to our model. This task, and much of the subsequent work, was undertaken with significant guidance from the following sentiment analysis project report:

https://github.com/abdulfatir/twitter-sentiment-analysis/blob/master/docs/report.pdf

Pre-processing Tweets

The basic rules we used to pre-process the tweets were:

  1. Remove all punctuation.
  2. Collapse repeated letters, e.g., convert ‘happppy’ to ‘happy’.
  3. Replace emoticons with a placeholder string.
  4. Replace hashtags and retweet markers with a space.

Most of the pre-processing steps were done with the help of Python’s built-in regular expression library, re.

Pre-processing the tweets
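A minimal sketch of what such a cleaning function might look like; the exact regular expressions and the emoticon placeholder tokens (EMO_POS/EMO_NEG) are illustrative assumptions, not the project’s exact code:

```python
import re

# Hypothetical emoticon table; the placeholder tokens are an assumption.
EMOTICONS = {':)': 'EMO_POS', ':-)': 'EMO_POS', ':(': 'EMO_NEG', ':-(': 'EMO_NEG'}

def preprocess_tweet(tweet):
    tweet = tweet.lower()
    # Rule 4: drop retweet markers and keep hashtag words without the '#'.
    tweet = re.sub(r'\brt\b', ' ', tweet)
    tweet = re.sub(r'#(\S+)', r' \1 ', tweet)
    # Rule 3: replace emoticons with a placeholder string.
    for emoticon, token in EMOTICONS.items():
        tweet = tweet.replace(emoticon, ' ' + token + ' ')
    # Rule 2: collapse letters repeated 3+ times, so 'happppy' -> 'happy'.
    tweet = re.sub(r'(.)\1{2,}', r'\1\1', tweet)
    # Rule 1: remove punctuation (keep word characters and whitespace).
    tweet = re.sub(r'[^\w\s]', ' ', tweet)
    return ' '.join(tweet.split())

print(preprocess_tweet('RT I am sooooo happppy!!! :) #blessed'))
# -> i am soo happy EMO_POS blessed
```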

Feature Extraction

Two types of features are extracted from the tweets:

  1. Unigrams: These are single words such as “have”, “beautiful”, and “house”. 41,435 unique unigrams are extracted from the dataset, and the frequency of the top twenty unigrams is plotted.
Frequency Distribution of top twenty unigrams.

A vocabulary list of the top 3430 unigrams is created.

  2. Bigrams: These are two adjacent words in a sentence, such as “I have” and “beautiful house”. 161,278 unique bigrams are extracted from the tweets using the ngrams library, and the frequency of the top twenty bigrams is plotted.

Frequency Distribution of top twenty bigrams.

A vocabulary list of the top 830 bigrams is created.
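As a sketch, unigram and bigram counting can be done with collections.Counter and NLTK’s ngrams utility; the toy corpus below stands in for the cleaned tweets:

```python
from collections import Counter
from nltk.util import ngrams  # NLTK's n-gram utility

# Toy corpus standing in for the cleaned tweets.
tweets = [
    'i have a beautiful house',
    'i have a beautiful dog',
    'the house is beautiful',
]

unigram_counts = Counter()
bigram_counts = Counter()
for tweet in tweets:
    tokens = tweet.split()
    unigram_counts.update(tokens)
    bigram_counts.update(ngrams(tokens, 2))

# The project keeps the top 3430 unigrams and top 830 bigrams as the vocabulary;
# most_common(n) returns the n most frequent terms.
top_unigrams = [w for w, _ in unigram_counts.most_common(3430)]
top_bigrams = [b for b, _ in bigram_counts.most_common(830)]
print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```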

Sparse Vector Representation

Following feature extraction, the vocabulary lists of top unigrams and bigrams are used to formulate sparse vectors that represent the tweets. The indices of each of these vectors correspond to terms in the same order as the top 3430 unigrams and the top 830 bigrams. Two types of sparse vectors are formulated:

  1. Presence vectors: In these vectors, a value of 1 is assigned to the indices of terms that are present in the tweet and a value of 0 is assigned to all remaining indices.
  2. Frequency vectors: In these vectors, the index of each term that is present in the tweet is assigned a value equal to the frequency of occurrence of that term in the tweet, and a value of 0 is assigned to all remaining indices. Each value of each vector is then multiplied by the inverse document frequency (idf) of the corresponding term, which gives higher weight to important terms. The inverse document frequency of a term, idf(t), is determined using the following formula:

idf(t) = log((1 + nd)/(1 + df(d,t))) + 1

where nd is the total number of tweets and df(d,t) is the number of tweets in which the term t occurs.
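A minimal sketch of building both vector types with this idf weighting; for clarity it uses dense NumPy arrays (in practice these vectors would be stored in a sparse format), and the tiny corpus and vocabulary are placeholders:

```python
import numpy as np

# Toy corpus and vocabulary standing in for the tweets and the top-term lists.
tweets = ['i have a beautiful house', 'i have a dog', 'the house is beautiful']
vocab = sorted({w for t in tweets for w in t.split()})
index = {term: i for i, term in enumerate(vocab)}

n_d = len(tweets)
# Document frequency df(d, t): number of tweets containing each term.
df = np.zeros(len(vocab))
for t in tweets:
    for term in set(t.split()):
        df[index[term]] += 1
# Smoothed inverse document frequency, matching the formula above.
idf = np.log((1 + n_d) / (1 + df)) + 1

presence = np.zeros((n_d, len(vocab)))
frequency = np.zeros((n_d, len(vocab)))
for row, t in enumerate(tweets):
    for term in t.split():
        presence[row, index[term]] = 1    # 1 if the term occurs at all
        frequency[row, index[term]] += 1  # raw occurrence count
frequency *= idf  # weight counts by idf

print(presence[0])
print(np.round(frequency[0], 2))
```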

Classification Techniques

Some of the classification techniques used were Naive Bayes, Decision Tree, Random Forest, and Sequential Logistic Regression. The training tweets were split 80:20, with 80 percent of the tweets used to train the model and the remaining 20 percent used as a validation set to measure the accuracy of the model, defined as the percentage of validation tweets whose sentiment labels are correctly predicted.
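A quick sketch of this split with scikit-learn; the feature matrix and labels here are random stand-ins for the real tweet vectors:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-ins for the tweet feature vectors and sentiment labels.
X = np.random.rand(1000, 4260)     # 3430 unigram + 830 bigram features
y = np.random.randint(0, 2, 1000)  # 0 = negative sentiment, 1 = positive

# 80:20 split: 80% of the tweets train the model, 20% are held out for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=100
)
# Validation accuracy can then be computed with
# sklearn.metrics.accuracy_score(y_val, model.predict(X_val)).
```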

1. Naive Bayes

Tests were run on the sparse vectors for both presence and frequency feature types using GaussianNB, MultinomialNB, and BernoulliNB from the sklearn.naive_bayes package of scikit-learn. The results of the tests are shown in the table below:

Accuracy results for Naive Bayes classification.
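A minimal sketch of this comparison on random stand-in data (so the printed accuracies are placeholders, not the results in the table):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

# Random stand-ins for the presence/frequency vectors and labels.
X = np.random.rand(500, 50)
y = np.random.randint(0, 2, 500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=100)

# Fit each Naive Bayes variant and report its validation accuracy.
for Model in (GaussianNB, MultinomialNB, BernoulliNB):
    clf = Model().fit(X_train, y_train)
    print(Model.__name__, accuracy_score(y_val, clf.predict(X_val)))
```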

2. Decision Tree

The DecisionTreeClassifier (DTC) was imported from the scikit-learn library. The DTC was initialized with the ‘gini’ criterion, a random state of 100, and a max depth of 5. The random state is the seed used by the random number generator and can be any integer value. Max depth is the maximum depth to which the tree will expand; the value of 5 in this model was obtained after testing the performance of the model for different integers. After initialization, the DTC model was fit to the independent variables (the feature vectors) and the dependent variable (the sentiment labels). This model was tested on the presence and frequency vectors, giving essentially the same accuracy of around 93.95% for both vectors.
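A sketch of this configuration (the data setup is again a random stand-in for the real vectors):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 50)
y = np.random.randint(0, 2, 500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=100)

# Hyperparameters as described: gini impurity, seed 100, depth capped at 5.
dtc = DecisionTreeClassifier(criterion='gini', random_state=100, max_depth=5)
dtc.fit(X_train, y_train)  # fit to the features (X) and labels (y)
print(accuracy_score(y_val, dtc.predict(X_val)))
```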

3. Random Forest

The RandomForestClassifier (RFC) was imported from the scikit-learn library. The RFC was initialized with n_estimators set to 30, which is the number of random trees used by the model. This number was chosen to be 30 after testing the performance of the model for different values. This model was tested on the presence and frequency vectors, giving an accuracy of 95.34% with presence vectors and 95.17% with frequency vectors.
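A similar sketch for the random forest; fixing random_state here is an addition for reproducibility, since the article only specifies the number of trees:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(500, 50)
y = np.random.randint(0, 2, 500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=100)

# 30 random trees as described; random_state fixed only for repeatable runs.
rfc = RandomForestClassifier(n_estimators=30, random_state=100)
rfc.fit(X_train, y_train)
print(accuracy_score(y_val, rfc.predict(X_val)))
```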

4. Sequential Logistic Regression

The Keras library was used to perform Sequential Logistic Regression. A sigmoid activation function, binary cross-entropy loss, and the Adam optimizer were used as part of this classification system. When tested on both presence and frequency vectors, the model gives essentially the same accuracy of around 93.07%.
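A minimal Keras sketch of logistic regression as a one-unit Sequential model; the epoch count and batch size are assumptions, and the data are again random stand-ins:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

X = np.random.rand(500, 50).astype('float32')     # stand-in feature vectors
y = np.random.randint(0, 2, 500).astype('float32')

# Logistic regression as a network: a single Dense unit with sigmoid output.
model = Sequential([Dense(1, activation='sigmoid', input_shape=(X.shape[1],))])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# validation_split=0.2 mirrors the 80:20 split used for the other models.
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy]
```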

Follow-up Testing

The four models above were tested on new tweets as well as on movie and restaurant reviews, using both presence and frequency vectors. The results were evaluated by looking at the accuracy on the new data and by inspecting the texts from each data set whose sentiments were incorrectly predicted. The training tweets were also analyzed to find similar patterns among incorrectly predicted texts. Conclusions about the best model for classification were drawn from this analysis. Finally, based on the tasks performed earlier and on these conclusions, a generalized model was created to perform sentiment analysis on any set of tweets.

Conclusion

The Random Forest, Decision Tree, Sequential Logistic Regression, and BernoulliNB (Naive Bayes) models achieve high and comparable accuracies on the training tweets used for testing. When tested on new tweets, BernoulliNB and Random Forest give the highest accuracy, although only a limited number of new tweets were available. Given the comparable accuracies that all of these models achieve on the training tweets, they would very likely give comparable accuracies on a much larger set of new tweets as well, so there is likely no single best model among them. Keeping this in mind, Random Forest was used to build the generalized model for sentiment analysis, aimed at figuring out the overall sentiment of a group of tweets directed towards a particular topic.

Improvements

The raw training data contained a significant number of incorrectly labelled tweets. Had this been realised at an earlier stage, the model could have been trained more accurately, so it is advisable to spend a good amount of time vetting training data found on the internet.

The trained model works very well with the Twitter dataset; however, it fails to do its job when the input is a text of a different length or from a different writing style. Using a training dataset that mixes various kinds of text (for example, tweets, reviews, and extracts) might help in building a more generalized model.

At the moment, the model rates the sentiment based on individual words or groups of words. Training the model to understand the context of the whole sentence is the next challenge.

Acknowledgments

My friend, Vaibhav Kumar, and I jointly carried out this project in the summer of 2019. We planned an outline for the structure of the article, divided the writing tasks for its sections between us, wrote our sections as per the plan, and edited the whole article together. We appreciate any feedback and would love to connect over email.
