TY - JOUR
T1 - Classification of Hate, Offensive and Profane content from Tweets using an Ensemble of Deep Contextualized and Domain Specific Representations
AU - Chinagundi, Basavraj
AU - Singh, Muskaan
AU - Ghosal, Tirthankar
AU - Rana, Prashant Singh
AU - Kohli, Guneet Singh
N1 - Publisher Copyright:
© 2021 Copyright for this paper by its authors.
PY - 2021
Y1 - 2021
N2 - The explosive growth of social media has also led to the unfortunate emergence of hate, offensive, and profane content on the web. A conversational thread can contain hate, offensive, or profane content that is not apparent from a standalone tweet or reply but can be identified given the context of the parent content. Such content spreads in many languages, including code-mixed languages like Hinglish (English code-mixed with Hindi). Social media sites therefore bear a significant responsibility to identify such hate content before it is disseminated to the general population, where it may trigger havoc. The Hate Speech and Offensive Content Identification (HASOC) track [1] at FIRE 2021 provides a forum and a data challenge for multilingual research on identifying such problematic content. In this paper, we describe our submission to English Subtask A of this track. Our proposed approach uses a transformer-based embedding with HateBERT and achieves a macro F1 score of 79% on the test data, which is 3.96% behind the best-performing system. We make our system run available at https://github.com/basavraj-chinagundi/HASOC_2021.
AB - The explosive growth of social media has also led to the unfortunate emergence of hate, offensive, and profane content on the web. A conversational thread can contain hate, offensive, or profane content that is not apparent from a standalone tweet or reply but can be identified given the context of the parent content. Such content spreads in many languages, including code-mixed languages like Hinglish (English code-mixed with Hindi). Social media sites therefore bear a significant responsibility to identify such hate content before it is disseminated to the general population, where it may trigger havoc. The Hate Speech and Offensive Content Identification (HASOC) track [1] at FIRE 2021 provides a forum and a data challenge for multilingual research on identifying such problematic content. In this paper, we describe our submission to English Subtask A of this track. Our proposed approach uses a transformer-based embedding with HateBERT and achieves a macro F1 score of 79% on the test data, which is 3.96% behind the best-performing system. We make our system run available at https://github.com/basavraj-chinagundi/HASOC_2021.
KW - HateBERT
KW - Profane Content
KW - Text Classification
KW - Hate Speech
UR - http://www.scopus.com/inward/record.url?scp=85134242509&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85134242509
SN - 1613-0073
VL - 3159
SP - 491
EP - 500
JO - CEUR Workshop Proceedings
JF - CEUR Workshop Proceedings
T2 - Working Notes of FIRE - 13th Forum for Information Retrieval Evaluation, FIRE-WN 2021
Y2 - 13 December 2021 through 17 December 2021
ER -