DSC180B Project

Introduction

Machine learning is a core task in modern data science: it uses observed data to build models that predict new values. It takes advantage of having observed data available, but what should be done when observed data cannot be obtained? A common practice is to substitute predicted values when observed values are unavailable, but without any correction we inevitably run into issues such as deflated standard errors, biased estimates, and inflated false positive rates.

Wang et al. propose a method to correct inference performed on predicted outcomes, which they name post-prediction inference (postpi), in Methods for correcting inference based on outcomes predicted by machine learning. This statistical technique takes advantage of the standard structure of machine learning pipelines and uses bootstrapping to correct the statistical error introduced by using predicted values in place of observed values.

We explore the applicability of Wang et al.'s postpi bootstrapping technique on political data, specifically political Twitter posts. Our project investigates what kinds of phrases or words in a tweet strongly indicate a person's political alignment, in the context of US politics. By doing so, we can simultaneously test how the bootstrap post-prediction inference approach interacts with natural language processing models and how generally applicable the method is to analyses in political science.



Methodology

The postpi bootstrap approach by Wang et al. aims to correct inference in studies that use predicted outcomes in lieu of observed outcomes. Its appeal lies in its simplicity: the approach does not require deriving a correction from first principles for each prediction model, so we are free to focus on predictive accuracy without worrying about how the complexity of the model affects the bootstrap correction. The approach achieves this independence by exploiting a simple, low-dimensional relationship between observed and predicted outcomes.



The postpi bootstrap approach relies on four assumptions about our data:

  1. There must be training and testing datasets with observed outcomes to train the prediction model and the relationship model.
  2. Observed and predicted outcomes should have a simple, low-dimensional relationship.
  3. The relationship model that describes the relationship between observed and predicted outcomes should be consistent for future data.
  4. The features used for inference should be present within the training and testing data, and used in the prediction model.


An implementation of this algorithm is provided below:
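The following is a minimal Python sketch of the correction as we understand it from Wang et al.; it is not their reference implementation. The relationship model here is a k-nearest-neighbors regressor, the inference model is an ordinary least squares regression on a single feature, and all variable names ('y_test', 'yhat_test', 'yhat_inf', 'x_inf') are illustrative.

import numpy as np
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsRegressor

def postpi_bootstrap(y_test, yhat_test, yhat_inf, x_inf, B=1000, seed=0):
    """Bootstrap correction for inference on predicted outcomes (sketch).

    y_test, yhat_test : observed and predicted outcomes on the testing set (numpy arrays)
    yhat_inf, x_inf   : predicted outcomes and the inference feature on the inference set
    """
    rng = np.random.default_rng(seed)

    # 1. Fit the low-dimensional relationship model: observed ~ predicted (testing set).
    rel = KNeighborsRegressor(n_neighbors=10).fit(yhat_test.reshape(-1, 1), y_test)
    resid_sd = np.std(y_test - rel.predict(yhat_test.reshape(-1, 1)))

    n = len(yhat_inf)
    coefs, ses = [], []
    for _ in range(B):
        # 2. Resample the inference set with replacement.
        idx = rng.integers(0, n, n)
        # 3. Simulate "observed" outcomes from the relationship model plus residual noise.
        y_sim = rel.predict(yhat_inf[idx].reshape(-1, 1)) + rng.normal(0, resid_sd, n)
        # 4. Refit the inference model on the simulated outcomes and record its estimates.
        fit = sm.OLS(y_sim, sm.add_constant(x_inf[idx])).fit()
        coefs.append(fit.params[1])
        ses.append(fit.bse[1])

    # 5. Corrected estimates: median bootstrap coefficient; the parametric SE is the median
    #    of the bootstrap SEs, while the non-parametric SE is the standard deviation of the
    #    bootstrap coefficients.
    return np.median(coefs), np.median(ses), np.std(coefs)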




Data



Data Collection and Cleaning


We collected our data by scraping tweets posted by US politicians on Twitter. Specifically, we took the Twitter handles of the President, the Vice President, and all members of the US Congress except Representatives Chris Smith (R-NJ) and Jefferson Van Drew (R-NJ), both of whom have deleted their Twitter accounts. These Twitter handles were compiled and provided by the UCSD Library, and outdated names or handles were updated manually. Additionally, the two Independent members of Congress, Senators Bernie Sanders (I-VT) and Angus King (I-ME), are counted as Democratic politicians for our purposes, as they caucus with the Democrats.

To prepare our data for prediction and feature selection, we cleaned the tweets by expanding all contractions, converting all text to lowercase, and removing URLs, punctuation, and non-ASCII unicode characters. We also removed stopwords such as 'the', 'an', and 'are' using the stopword list provided by the NLTK package.
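As a rough illustration, the cleaning step could look like the sketch below. The 'contractions' package and the variable name 'tweets' are assumptions, and the exact regular expressions we used may differ.

import re
import string
import contractions                       # assumed package for expanding contractions
from nltk.corpus import stopwords

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    text = contractions.fix(text)                                       # expand contractions
    text = text.lower()                                                 # lowercase
    text = re.sub(r"https?://\S+", "", text)                            # remove URLs
    text = text.encode("ascii", "ignore").decode()                      # drop unicode characters
    text = text.translate(str.maketrans("", "", string.punctuation))    # remove punctuation
    return " ".join(w for w in text.split() if w not in STOPWORDS)      # remove stopwords

cleaned = [clean_tweet(t) for t in tweets]   # `tweets` is the list of raw tweet strings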



Exploratory Data Analysis


Our data consists of a roughly equal number of tweets from each party. As noted earlier, with Independent politicians counted as Democrats, the table below gives a brief overview of our data.


Party Number of Tweets
Democrats 22,850
Republicans 21,478
Total 44,328

Taking a deeper dive into the data, we look at the distribution of tweet lengths for each party. Figure 1 is an overlaid histogram of the number of words in tweets from Democrats and Republicans. While both histograms are clearly skewed to the left, the distribution of tweet lengths for Democrats peaks higher than the distribution for Republicans, indicating that tweets from Democrats average more words than those from their counterparts across the aisle.

This could mean that the prediction model draws on more vocabulary from Democrat-classified tweets than from Republican-classified ones, which may in turn affect the prediction model and thus the bootstrap algorithm and the resulting inference.



Figure 1: A histogram depicting the number of words in a tweet by party. We can see that Democrats generally have longer tweets compared to Republicans.


We take a deeper dive into each party in Figure 2 below, which lists the 10 most frequent words used by Democrats and Republicans, excluding stopwords. There is very little overlap between the two parties: only two words, 'today' and 'year', appear on both lists.

Democrats seem to focus on policy issues, as suggested by 'act' and 'infrastructure', but otherwise their attention is spread across a multitude of topics; no single unifying issue groups together their most frequently used words. Republicans, on the other hand, seem to focus more on their political opponents, as words such as 'biden', 'democrats', and 'president' suggest, and on the American people. The reference to 'biden' is especially notable: the President's name is used approximately 3,500 times, almost double the frequency of the second most frequent word.

As such, Figure 2 shows us that Republican-classified tweets may revolve more strongly around certain themes, such as their opponents, compared to Democrat-classified tweets. Again, this may influence the prediction model and in turn the inference conducted on our features.


Figure 2: Bar plots depicting the most frequent words used by each party. The two lists differ substantially: only 'today' and 'year' are used frequently by both parties.
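For reference, the per-party word counts behind Figure 2 can be produced with a simple frequency count over the cleaned tweets. The DataFrame 'df' with columns 'clean_text' and 'party' is an assumed, illustrative structure.

from collections import Counter
import pandas as pd

def top_words(df: pd.DataFrame, party: str, k: int = 10):
    """Return the k most frequent words in cleaned tweets from the given party."""
    words = (w for text in df.loc[df["party"] == party, "clean_text"] for w in text.split())
    return Counter(words).most_common(k)

print(top_words(df, "Democrat"))
print(top_words(df, "Republican"))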


Methods



Prediction and Relationship Model


During this stage of our project, we worked on maximizing the accuracy of our prediction model. We compared several prediction models in the process of arriving at our final model, including other classification algorithms such as logistic regression and regularized (ridge) regression.

In the end, we used a TF-IDF vectorizer with 200,000 features and n-grams of 1-3 words per feature, and a support vector classifier (SVC) with a linear kernel and C=1.5 for prediction. For the relationship model that relates the predicted and observed outcomes, we used a k-nearest neighbors (k-NN) model.
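A sketch of this prediction pipeline in scikit-learn is shown below. The TF-IDF and SVC hyperparameters follow the description above, while the split proportion and the variable names 'cleaned' and 'party' are illustrative.

from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hold out a testing set with observed party labels for fitting the relationship model.
X_train, X_test, y_train, y_test = train_test_split(cleaned, party, test_size=0.5, random_state=0)

model = make_pipeline(
    TfidfVectorizer(max_features=200_000, ngram_range=(1, 3)),  # 200,000 features, 1-3 word n-grams
    SVC(kernel="linear", C=1.5),                                 # linear-kernel SVC, C = 1.5
)
model.fit(X_train, y_train)
yhat_test = model.predict(X_test)   # predicted outcomes fed to the k-NN relationship model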



Feature Selection for Inference

We reviewed relevant literature in political science to develop criteria for choosing our features.

In Twitter Language Use Reflects Psychological Differences between Democrats and Republicans, Sylwester and Purver discuss the differences between Democrats and Republicans in the context of previous findings and their own discoveries. For example, Haidt's Moral Foundations model, which identifies "harm, fairness, liberty, ingroup, authority, and purity" as the pillars of morality, has been used to distinguish between liberals and conservatives: liberals were found to prioritize the harm and fairness aspects of morality, while conservatives focused more on liberty, ingroup, authority, and purity. Sylwester and Purver also found differences between Democratic- and Republican-aligned people in the topics they discussed and the emotions they expressed. Republicans focused more on topics such as "religion…, national identity…, government and law…, and their opponents", while Democrats emphasized their uniqueness and generally expressed more anxiety and emotion.

These findings are broadly in line with our own observations from the data: as noted before, we found that Republican tweets referenced their opponents on a much larger scale than Democratic tweets did, and also frequently mentioned the American people, that is, their national identity.

We also reviewed Chen et al.’s study, #Election2020: the first public Twitter dataset on the 2020 US Presidential election. Chen et al. found that more conservative Twitter users tended to share more topics related to conspiracy theories and “public health and voting misinformation” compared to liberal Twitter users.

Taking these two sources into consideration, our criterion for selecting features was whether they fell into the liberal or conservative tendencies identified by either source. We selected five features on which to conduct inference:


'border', 'illegal', 'god', 'defund', 'happy'

We hypothesized that the first three would be strong indicators of a Republican-classified tweet, as they allude to national identity and religion, while the last two would indicate a Democratic-classified tweet, as they allude to concepts of harm and fairness as well as emotion.


Results


After conducting inference using the bootstrap postpi algorithm, we found that the parametric bootstrap method corrected inference best. As such, the interpretations below consider only the corrections made with the parametric method, not the non-parametric bootstrap method.
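As an illustration of how the corrected columns in the tables below could be produced with the sketch from the Methodology section, consider the word "border". The indicator 'has_border' and the variables 'inference_texts', 'y_test', 'yhat_test', and 'yhat_inf' are assumptions for the purpose of the example.

import numpy as np

# 0/1 indicator for whether each tweet in the inference set contains "border" (illustrative).
has_border = np.array(["border" in text.split() for text in inference_texts], dtype=float)

coef, se_param, se_nonparam = postpi_bootstrap(y_test, yhat_test, yhat_inf, has_border)
print(coef, se_param, coef / se_param)        # parametric-corrected coefficient, SE, t-statistic
print(coef, se_nonparam, coef / se_nonparam)  # non-parametric-corrected counterparts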



Inference on "border"


Feature: border Actual Values No Correction Non-Parametric Parametric
Coefficient 7.49 8.27 7.50 7.50
SE 0.85 0.87 0.39 0.84
T-Stat 8.84 9.56 19.28 8.97


The table shows the results of conducting inference on the word "border". The bootstrap postpi algorithm corrects coefficients, SEs, and t-statistics as mentioned above and the results below shows that the algorithm works as intended.



These results are meaningful because the parametric corrections closely match the estimates computed from the observed outcomes, suggesting that the bootstrap model successfully corrects inference conducted with predicted values in place of observed values.



A positive coefficient for the word "border" implies that this feature is a good predictor for the Republican party. To test whether the feature is a statistically significant predictor we must evaluate the t-statistic. If the null hypothesis was true-that there is no significant difference between Republicans and Democrats in their use of the word "border"—then we would expect a sample with no difference. Since the corrected t-statistic of ~ 8.966 is greater than 2, we have 95% confidence that there is a positive difference between our sample data and the null hypothesis. This implies that the word "border" is a good predictor for the Republican party.



Inference on "illegal"


Feature: illegal Actual Values No Correction Non-Parametric Parametric
Coefficient 5.79 6.39 5.83 5.83
SE 1.10 1.10 0.37 1.09
T-Stat 5.28 5.81 15.81 5.34


The table shows the results of conducting inference on the word "illegal".



As with "border", the parametric corrections closely match the estimates computed from the observed outcomes, again suggesting that the bootstrap model corrects inference conducted with predicted values.


A positive coefficient for the word "illegal" implies that this feature is a good predictor for the Republican party. Since the corrected t-statistic of ~ 5.335 is greater than 2, we have 95% confidence that this feature is a statistically significant predictor.



Inference on "god"


Feature: god Actual Values No Correction Non-Parametric Parametric
Coefficient 4.90 5.45 4.78 4.78
SE 1.03 1.04 0.37 1.01
T-Stat 4.76 5.25 12.79 4.72


The table shows the results of conducting inference on the word "god".



Again, the parametric corrections closely match the estimates computed from the observed outcomes, suggesting that the bootstrap correction is working as intended.


A positive coefficient for the word "god" implies that this feature is a good predictor for the Republican party. To test whether the feature is a statistically significant predictor we must evaluate the t-statistic. Since the corrected t-statistic of ~ 4.720 is greater than 2, we have 95% confidence that this feature is a statistically significant predictor.



Inference on "defund"


Feature: defund Actual Values No Correction Non-Parametric Parametric
Coefficient 1.18 1.51 1.08 1.08
SE 1.92 1.92 0.39 1.92
T-Stat 0.62 0.79 2.75 0.56


The table shows the results of conducting inference on the word "defund".



Interestingly, conducting inference on the feature "defund" yielded a positive coefficient, which would associate this feature with the Republican party rather than the Democratic party, contrary to our hypothesis.


To test whether the feature is a statistically significant predictor, we evaluate the t-statistic. Since the corrected t-statistic of approximately 0.56 lies between -2 and 2, we fail to reject the null hypothesis at the 95% confidence level: "defund" is not a statistically significant predictor of either party.



Inference on "happy"


Feature: happy Actual Values No Correction Non-Parametric Parametric
Coefficient 0.93 1.14 0.96 0.96
SE 0.50 0.50 0.43 0.50
T-Stat 1.89 2.29 2.21 1.92


The table shows the results of conducting inference on the word "happy".



Here, too, the parametric corrections closely match the estimates computed from the observed outcomes.


Once again, inference on the feature "happy" yielded a positive coefficient, associating this feature with the Republican party rather than the Democratic party. This is, again, contrary to what we hypothesized.


To test whether the feature is a statistically significant predictor, we evaluate the t-statistic. Since the corrected t-statistic of approximately 1.92 lies between -2 and 2, we fail to reject the null hypothesis at the 95% confidence level: "happy" is not a statistically significant predictor of either party.



Conclusion


In conclusion, we have demonstrated on political data that the bootstrap postpi algorithm developed by Wang et al. corrects inference conducted on predicted outcomes when observed outcomes are not available. In a field where collecting observed outcomes can be exceedingly time-consuming and expensive, this is a significant finding that may open the door to studies that would otherwise be too difficult to conduct.



References

Chen, E., Deb, A., & Ferrara, E. (2021). #Election2020: the first public Twitter dataset on the 2020 US Presidential election. Journal of Computational Social Science. https://doi.org/10.1007/s42001-021-00117-9

Sylwester, K., & Purver, M. (2015). Twitter Language Use Reflects Psychological Differences between Democrats and Republicans. PLOS ONE, 10(9), e0137422. https://doi.org/10.1371/journal.pone.0137422

Wang, S., McCormick, T. H., & Leek, J. T. (2020). Methods for correcting inference based on outcomes predicted by machine learning. Proceedings of the National Academy of Sciences, 117(48), 30266-30275.