Abusive Language Detection

Original paper - https://www.aclweb.org/anthology/N19-1221
Here is a quick summary:

In a paper from Facebook AI, London, presented at the recent NAACL (North American Chapter of the Association for Computational Linguistics) conference held in Minneapolis, the researchers proposed a novel approach that uses Graph Convolutional Networks to outperform some of the best existing methods for detecting abusive language on the internet.
The approach builds a heterogeneous graph that contains both the authors' community network and their tweets. The graph is then used to predict each tweet's class and to generate embeddings. In the paper's experiments, the researchers used embeddings from node2vec (a sample implementation is available at https://snap.stanford.edu/node2vec/) and from a 2-layer Graph Convolutional Network, as sketched below.
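As a rough illustration, here is a minimal sketch of how such a heterogeneous graph could be assembled and embedded. The edge scheme (author-author community edges plus author-tweet edges) and all node names are assumptions for illustration only, and it uses the pip-installable node2vec package (https://github.com/eliorc/node2vec) rather than the SNAP implementation linked above:

    # Minimal sketch: a heterogeneous author/tweet graph embedded with node2vec.
    # The edge scheme and node names are illustrative assumptions, not the
    # paper's exact construction.
    import networkx as nx
    from node2vec import Node2Vec  # pip install node2vec

    graph = nx.Graph()
    graph.add_edge("author_1", "author_2")  # community (e.g. follower) edge
    graph.add_edge("author_1", "tweet_42")  # tweet attached to its author
    graph.add_edge("author_2", "tweet_43")

    # dimensions=200 matches the embedding size mentioned below.
    node2vec = Node2Vec(graph, dimensions=200, walk_length=30,
                        num_walks=200, workers=4)
    model = node2vec.fit(window=10, min_count=1)
    tweet_vector = model.wv["tweet_42"]  # 200-d embedding for one tweet node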

The Graph Convolutional Network, which represents the authors' profiles and tweets, was used to classify each author's tweets into one of three classes, with a softmax layer as the output layer of the network. To extract embeddings from the Graph Convolutional Network, the researchers set the size of the hidden units to 200 (to be comparable to the node2vec method) and trained the network with a cross-entropy loss. The representation for each input is then the output of the first layer (the layer before the output layer), taken without the activation.
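The following is a minimal PyTorch sketch of such a 2-layer GCN, assuming the standard Kipf-and-Welling formulation with a symmetrically normalized adjacency matrix; the random adjacency, features, and labels are placeholders, not the paper's data or exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoLayerGCN(nn.Module):
        def __init__(self, in_dim, hidden_dim=200, num_classes=3):
            super().__init__()
            self.w0 = nn.Linear(in_dim, hidden_dim, bias=False)
            self.w1 = nn.Linear(hidden_dim, num_classes, bias=False)

        def embed(self, a_hat, x):
            # First-layer output *before* the nonlinearity: the
            # representation described above as the extracted embedding.
            return a_hat @ self.w0(x)

        def forward(self, a_hat, x):
            h = F.relu(self.embed(a_hat, x))
            return a_hat @ self.w1(h)  # logits; softmax lives in the loss

    # Placeholder graph: random symmetric adjacency with self-loops,
    # normalized as D^(-1/2) (A + I) D^(-1/2).
    N, in_dim = 10, 300
    x = torch.randn(N, in_dim)
    adj = (torch.rand(N, N) > 0.8).float()
    adj = ((adj + adj.t()) > 0).float()
    adj.fill_diagonal_(0.0)
    a = adj + torch.eye(N)
    d_inv_sqrt = a.sum(1).pow(-0.5)
    a_hat = d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]

    model = TwoLayerGCN(in_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
    labels = torch.randint(0, 3, (N,))  # placeholder class labels
    for _ in range(100):
        optimizer.zero_grad()
        # cross_entropy combines log-softmax and NLL, so the softmax
        # output layer is folded into the loss during training.
        F.cross_entropy(model(a_hat, x), labels).backward()
        optimizer.step()

    embeddings = model.embed(a_hat, x)  # 200-d pre-activation representations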

The classification performance of the GCN (Graph Convolutional Network) modestly improved on the current best method for detecting abusive language in two of the three predicted classes (based on F1 score).

The authors also combined the node2vec and GCN-generated embeddings with their best Logistic Regression (LR) classifier. LR + node2vec (labeled LR + EXTD in the paper, for the extended heterogeneous graph it is built on) performed slightly better in two of the three predicted classes (based on F1 score).
LR + GCN (embedding) outperformed the plain LR in all three classes (based on F1 score).
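Here is a sketch of that combination using scikit-learn, with random placeholder arrays standing in for the paper's actual LR features and graph embeddings (the names and shapes are illustrative assumptions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    rng = np.random.default_rng(0)
    lr_features = rng.random((100, 50))        # stand-in for the baseline LR features
    graph_embeddings = rng.random((100, 200))  # stand-in for node2vec/GCN vectors
    labels = rng.integers(0, 3, size=100)      # three classes, as in the paper

    # Concatenate the two feature views and train a single LR classifier.
    combined = np.hstack([lr_features, graph_embeddings])
    clf = LogisticRegression(max_iter=1000).fit(combined, labels)

    # Per-class F1, the metric the comparisons above are based on.
    print(f1_score(labels, clf.predict(combined), average=None))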

Overall, an author community network combined with the authors' tweets helps to detect abusive language on the internet. The direction the researchers took makes sense, as people with similar views tend to form communities. What if the researchers had derived echo chamber measurements, or tried features related to echo chamber measurements?
