NLP - Climate Change Misinformation Detection


In this age of the Internet, information sharing is at its peak. We can easily read and share content on different platforms without verifying its authenticity, and as a result misinformation and fake content are on the rise. Machine learning algorithms can help detect and flag such content. In this project, we present a misinformation detection system for climate-change-related content that classifies text as either correct or misleading information. To implement such a system, we propose different methodologies along with their experimental results and error analysis.

Introduction

Climate change is one of the hottest topics of discussion in conferences and debates. One of the major factors fuelling this continuing debate is the widespread use of social media and news platforms to deliver and spread misinformation. To tackle this problem, we can use natural language processing to build language models that distinguish correct content from fake content.
In this paper, we propose several methodologies aimed at building a supervised binary classifier that labels a given text as positive or negative, where a positive label means the text is delivering some kind of misinformation. To train such a classifier with supervised learning methods, we need training data with both positive and negative labels. The provided training corpus contains only positive labels, so we enhance it with negative examples by scraping articles from several authentic websites that give correct information about climate change.
This is followed by a data preprocessing pipeline that cleans and processes the data before it is fed to machine learning models. With this preprocessed data, we explore several feature extraction techniques such as bag of words and frequency transformations with different importance weights as described by @Leopold2002. We also explore improved feature spaces such as Delta-TF-IDF as proposed by @Martineau2009, Word2Vec embeddings learned on the corpus, and pretrained [@glove] word vectors. These feature representations are trained with several models and tuned on the development data, followed by a detailed error analysis of the techniques.
As per our experimental results, averaged word2vec embeddings [@Elsaadawy2018] learned on the training corpus, combined with a deep learning model of 10 ReLU units, outperformed the other methodologies on the provided test data.

Data Extraction and Analysis

To enhance the training corpus with negative labels, data needs to be extracted from external sources. We use the article URLs provided by @climateurls2020, a contextual dataset for exploring climate change narratives with more than 6.3M news articles. We use the subset curated in 2020, which contains 1.08M article links.
To extract the information, the dataset is loaded with pandas and each row is filtered on the 'climate-change' keyword. The @newspaper3k Python package is then used to download and parse the articles from their URLs, and the extracted text is saved for each row. This process is repeated until 1500 articles are extracted. With this approach, the training corpus is enhanced with 1500 negative examples, making it suitable for building a binary classifier.
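The sketch below illustrates this extraction loop. The file name and the `url` column are assumptions, since the exact schema of the dataset is not shown here.

```python
import pandas as pd
from newspaper import Article

# Assumed file name and column name; adjust to the actual dataset schema.
df = pd.read_csv("climate_news_2020.csv")

negatives = []
for url in df["url"]:
    if "climate-change" not in str(url):
        continue  # keep only climate-change articles
    try:
        article = Article(url)
        article.download()
        article.parse()
        if article.text:
            negatives.append(article.text)
    except Exception:
        continue  # skip dead links or parse failures
    if len(negatives) >= 1500:
        break
```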
We then analysed the training data to understand basic patterns in the texts. First, we analysed the distribution of part-of-speech tags across positive and negative labels, as shown in the figure below.

POS distribution across labels

It can be deduced from the plot that content words dominate the training corpus. There is also a substantial count of punctuation and numbers, which can be eliminated in the preprocessing phase to reduce the feature space. Similarly, we can analyse the word counts of unigrams across positive and negative labels to understand their distribution.

Top 20 unigrams across labels

As per the above plot, most of the misinformation is based on words related to 'climate', 'energy' or 'temperature', and these words are countered by the unigrams in the negative label. The plot gives us an intuition that the distribution of unigrams is not highly skewed, so unigrams can provide a good feature space for building models. In contrast, the distributions of bigrams, trigrams and four-grams are extremely skewed in terms of word counts, which makes them inappropriate for feature representation.

Data Preprocessing

Data cleanup and preprocessing is one of the most important aspects of modelling the problem effectively. As shown in the POS figure, there is a very high count of function words, which include stop words, conjunctions and other uninformative words. These can be eliminated along with punctuation and numbers, which reduces the learning feature space. We also perform lemmatization and use NLTK WordNet synsets to further identify the words relevant for our analysis. This cleanup and transformation pipeline is shown in the figure below.

Data preprocessing pipeline

We run this preprocessing pipeline on each text in the training, development and test corpora to obtain either the cleaned text in lemma form or the lemma tokens for each cleaned text. This preprocessed data is then used for feature extraction and fed into machine learning models for label prediction.
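A minimal sketch of such a cleaning function with NLTK is shown below; the exact filtering rules (stop-word list, WordNet synset check) are assumptions based on the description above.

```python
import string
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

for pkg in ("punkt", "stopwords", "wordnet", "omw-1.4"):
    nltk.download(pkg)

STOPWORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text, as_tokens=False):
    """Clean a document: drop punctuation, numbers and stop words; lemmatize;
    keep only words that have a WordNet synset (assumed relevance filter)."""
    lemmas = []
    for tok in word_tokenize(text.lower()):
        if tok in string.punctuation or not tok.isalpha() or tok in STOPWORDS:
            continue
        lemma = lemmatizer.lemmatize(tok)
        if wordnet.synsets(lemma):          # keep words known to WordNet
            lemmas.append(lemma)
    return lemmas if as_tokens else " ".join(lemmas)
```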

Proposed Methodologies

In this section, we propose different approaches for extracting features from the preprocessed training corpus and training machine learning models that perform well on the development and test data.

Approach 4.1: Bag-of-words feature representation

We start with the bag-of-words feature extraction method to obtain our baseline model. We use unigrams for this representation, as their distribution across labels is not extremely skewed compared to bigrams, trigrams and four-grams. To build the representation, we use the raw frequency transformation scheme described by @Leopold2002, which uses the raw counts of words across documents. This yields a bag-of-words matrix on which several models are trained, tuned on the development data and tested on the test data.
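A minimal sketch of this baseline with scikit-learn is shown below; the variable names (`train_texts`, `dev_texts`, etc.) and the use of CountVectorizer with LinearSVC are assumptions consistent with the description.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# train_texts / train_labels: preprocessed corpus and its binary labels.
vectorizer = CountVectorizer(ngram_range=(1, 1))   # unigram raw counts
X_train = vectorizer.fit_transform(train_texts)
clf = LinearSVC(C=3.25).fit(X_train, train_labels)

dev_pred = clf.predict(vectorizer.transform(dev_texts))
```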

Approach 4.2: TF-IDF feature representation

In this approach, we use importance weights to boost words that matter for classifying labels. We use the inverse document frequency (IDF) weighting scheme described by @Leopold2002, which calculates the weight of each word $w$ in a text collection of $N$ documents as:

\[idf_w = \log\frac{N}{df_w}\]

where $df_w$ is the number of documents in the collection that contain the word $w$. This yields a weight vector which is multiplied with the raw frequencies from the previous approach to generate the TF-IDF matrix. The matrix is then L2-normalized to improve convergence of the machine learning algorithms.
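A brief sketch of this weighting with scikit-learn follows; using TfidfTransformer on top of the raw count matrix is an assumption (its smoothed idf differs slightly from the plain formula above), but it performs the same idf multiplication and L2 normalization.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

counts = CountVectorizer().fit_transform(train_texts)   # raw term frequencies
tfidf = TfidfTransformer(norm="l2")                      # idf weights + L2 norm
X_train = tfidf.fit_transform(counts)
clf = LinearSVC(C=1.5).fit(X_train, train_labels)
```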

Approach 4.3: TF-Redundancy feature representation

In this approach, we use a different importance weighting scheme called redundancy, as proposed by @Leopold2002, with the expectation that it might provide better weight adjustments than the previous IDF approach. We calculate the redundancy weight for each word $w$ in a text collection of $N$ documents as:

\[r_w = \log N + \sum_{i=1}^{N}\frac{f(w,d_i)}{f(w)}\log\frac{f(w,d_i)}{f(w)}\]

where $f(w,d_i)$ is the frequency of word $w$ in document $d_i$ and $f(w)$ is its frequency in the entire collection. This yields a weight vector which is again multiplied with the raw frequencies and scaled by its maximum absolute value (MaxAbsScaler) to generate the TF-redundancy matrix. The weights measure how far the distribution of the term $w$ deviates from the uniform distribution, thereby quantifying the skewness of its probability distribution.
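Since there is no off-the-shelf implementation of this weighting, the sketch below computes the redundancy weights directly from the count matrix; the variable names are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MaxAbsScaler

counts = CountVectorizer().fit_transform(train_texts).toarray()  # f(w, d_i)
N = counts.shape[0]
f_w = counts.sum(axis=0)                                         # f(w) over the collection

# r_w = log N + sum_i p_i * log p_i, where p_i = f(w, d_i) / f(w)
p = counts / np.maximum(f_w, 1)
with np.errstate(divide="ignore", invalid="ignore"):
    plogp = np.where(p > 0, p * np.log(p), 0.0)
r_w = np.log(N) + plogp.sum(axis=0)

X_train = MaxAbsScaler().fit_transform(counts * r_w)             # TF-redundancy matrix
```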

Approach 4.4: TF-Delta-IDF feature representation

In this approach, we use another weighting scheme called Delta-IDF as proposed by @Martineau2009. In this scheme, the weight of the term $w$ is calculated as:

\[\Delta idf_w = \log_2 \left(\frac{N_w}{P_w}\right)\]

where $N_w$ is the number of documents containing term $w$ in the negatively labeled training set and $P_w$ is the number of documents containing term $w$ in the positively labeled training set. This yields a weight vector which is multiplied with the raw frequencies and scaled by its maximum absolute value (MaxAbsScaler) to generate the TF-Delta-IDF matrix. The idea of this scheme is to capture the true importance of words within the documents, helping the algorithm classify them more effectively.
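A short sketch of this weighting, implemented directly from the formula above, is shown below; the add-one smoothing to avoid division by zero and the variable names are assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MaxAbsScaler

counts = CountVectorizer().fit_transform(train_texts).toarray()
labels = np.asarray(train_labels)                 # 1 = misinformation, 0 = correct

present = counts > 0
N_w = present[labels == 0].sum(axis=0)            # document frequency in negative set
P_w = present[labels == 1].sum(axis=0)            # document frequency in positive set

delta_idf = np.log2((N_w + 1) / (P_w + 1))        # +1 smoothing (assumption)
X_train = MaxAbsScaler().fit_transform(counts * delta_idf)
```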

Approach 4.5: Average Word2Vec Embeddings

In this approach, we create our own scheme for generating importance weights by averaging word2vec embeddings, as proposed by @Elsaadawy2018. We use the Word2Vec model provided by gensim to learn word vectors of size 100 on the training corpus, trained on unigrams with the skip-gram approach, a window size of 2 and negative sampling of 20. The embedding produced for each word $w$ in the corpus is averaged and used as the word's weight in the documents, creating a weight matrix which is then scaled with MaxAbsScaler. This weight matrix is used as the feature representation for training machine learning models.
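A sketch of this scheme with gensim follows; collapsing each 100-dimensional vector into a single scalar weight by taking its mean follows the description above, and the variable names are assumptions.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MaxAbsScaler

# train_tokens: list of lemma-token lists from the preprocessing pipeline.
w2v = Word2Vec(sentences=train_tokens, vector_size=100, sg=1,
               window=2, negative=20, min_count=1)

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train_texts).toarray()
vocab = vectorizer.get_feature_names_out()

# Scalar importance weight per word: mean of its 100-d embedding.
weights = np.array([w2v.wv[w].mean() if w in w2v.wv else 0.0 for w in vocab])
X_train = MaxAbsScaler().fit_transform(counts * weights)
```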

Approach 4.6: Pretrained Word2Vec Embeddings

In this approach, we use the pretrained @glove model, which was trained on 6B tokens and covers a 400K-word vocabulary. The GloVe vectors of size 300 initialise the embedding layer of our deep learning models and are fine-tuned while the model trains. This layer is combined with several deep learning architectures such as CNNs and LSTMs and tuned on the development data to get the most effective precision and recall for our analysis.
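A minimal Keras sketch of one such model is shown below; the LSTM layer of 10 units matches the model reported in the results, while the `embedding_matrix` lookup built from glove.6B.300d is assumed and its construction is omitted here.

```python
from tensorflow.keras import layers, models, initializers, metrics

EMB_DIM = 300  # GloVe 300-dimensional vectors

# embedding_matrix: (vocab_size, 300) array built by looking up each corpus
# word in glove.6B.300d.txt (helper omitted; assumption).
model = models.Sequential([
    layers.Embedding(input_dim=embedding_matrix.shape[0],
                     output_dim=EMB_DIM,
                     embeddings_initializer=initializers.Constant(embedding_matrix),
                     trainable=True),          # fine-tuned while the model trains
    layers.LSTM(10),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[metrics.Precision(), metrics.Recall()])
```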

Experimental Results

In this section, we present the results of our proposed methodologies using precision, recall and F1-score for the positive class. We also show the best-performing model for each methodology and present results on both the development and the test data.

| Method | Model | Precision | Recall | F1-score |
|---|---|---|---|---|
| Approach 4.1 | LSVC (C=3.25) | 0.93 | 0.78 | 0.85 |
| Approach 4.2 | LSVC (C=1.5) | 0.93 | 0.74 | 0.82 |
| Approach 4.3 | LSVC (C=0.015) | 0.81 | 0.84 | 0.82 |
| Approach 4.4 | LSVC (C=0.025) | 0.80 | 0.86 | 0.83 |
| Approach 4.5 | 10 ReLU | 0.95 | 0.72 | 0.82 |
| Approach 4.6 | 10 LSTM | 0.88 | 0.74 | 0.80 |

As per the above table, Approach 4.1 produces the best F1-score on the development set, while the highest recall is achieved by Approach 4.4 and the best precision by Approach 4.5. The results of these approaches on the test data, as reported by CodaLab, are shown below.

| Method | Model | Precision | Recall | F1-score |
|---|---|---|---|---|
| Approach 4.1 | LSVC (C=3.25) | 0.5062 | 0.82 | 0.6260 |
| Approach 4.2 | LSVC (C=1.5) | 0.5556 | 0.80 | 0.6557 |
| Approach 4.3 | LSVC (C=0.015) | 0.4828 | 0.84 | 0.6131 |
| Approach 4.4 | LSVC (C=0.025) | 0.5000 | 0.84 | 0.6269 |
| Approach 4.5 | 10 ReLU | 0.5714 | 0.88 | 0.6929 |
| Approach 4.6 | 10 LSTM | 0.5634 | 0.80 | 0.6612 |

Clearly, Approach 4.5 outperforms every other approach on the CodaLab results. It generalizes very well to unseen data, with comparatively better precision and recall. The final score of this approach is shown in the table below.

| Method | Model | Precision | Recall | F1-score |
|---|---|---|---|---|
| Approach 4.5 | 10 ReLU | 0.5743 | 0.85 | 0.6855 |

Error Analysis

In this section, we explain why some approaches worked better than others on the development data and generalised well to the test data.

Approach 4.1: The bag-of-words approach with LinearSVC fails to classify 16 documents from the dev set correctly. The distribution of words for the misclassified documents across the positive and negative corpora is shown in the bar charts below.

Word distributions of misclassified documents across the positive and negative label corpora

We can clearly see from the left chart that certain words are more prominent in the negative label corpus than in the positive one, which is why these documents are misclassified as correct information. The same pattern appears for the words 'coal', 'find' and 'power' in the right chart, where 'say' seems to have less impact on reducing the classification error.

Approach 4.2: We can try to correct the problem shown above with a weighting scheme like IDF, which weights words according to their document frequency. However, this approach still takes term frequency into account, so those 16 documents are still not classified correctly.

Approach 4.3: This approach tries to quantify the skewness of each term to improve the weights, but it fails to overcome the effect of term frequency and gives almost the same result as the IDF scheme.

Approach 4.4: This approach tries to improve the weighting by assigning more negative weights to words characteristic of the positive corpus and more positive weights to words characteristic of the negative corpus, amplified by term frequency. Again, it suffers from ambiguous words that are almost equally distributed across both corpora.

Approach 4.5: This approach eliminates term frequency and calculates word weights by averaging the learned Word2Vec vectors, which capture semantic similarity among words. We get the best precision with this approach, but it fails to provide good recall and F1-score for the positive class. One reason is that a large number of words are shared between the positive and negative labels, as shown in the unigrams figure, so word2vec cannot capture the correct context of a word with a window size of 2. Moreover, a large amount of information is lost when taking the mean of the vectors.

Approach 4.6: This approach tries to avoid that loss of information by keeping the entire 300-dimensional vector learned for each word in an embedding layer. These word vectors are taken from the pretrained GloVe model, but they do not improve the classification results on the development set, due to inconsistent cosine similarities for words that are not valid in the given context and a coverage of only around 46% of the training corpus vocabulary.

Conclusion

We proposed several methodologies for representing features and used machine learning models to identify climate change misinformation. Comparing results on the development and test sets, the averaged word2vec embeddings approach with 10 ReLU units and a sigmoid output layer provides the best precision on the development set and generalizes best to the test set.
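For reference, a minimal Keras sketch of this winning architecture (10 ReLU units followed by a sigmoid output) is shown below; the optimizer, loss, epoch count and batch size are assumptions not stated above.

```python
from tensorflow.keras import layers, models

# X_train: MaxAbs-scaled TF x averaged-word2vec weight matrix from Approach 4.5.
model = models.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(10, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_train, train_labels, epochs=20, batch_size=32,
          validation_data=(X_dev, dev_labels))
```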

Future Work

We can enhance the proposed methodologies by exploring contextual representations such as BERT or ELMo, which might solve the context ambiguity problem with better word embeddings. We can also explore topic modelling with LDA, which could help filter out content not related to climate change. Moreover, transfer learning approaches such as ULMFiT can also be explored for this problem.

Cheers!
