L-HSAB, the Levantine Hate Speech and Abusive Language dataset, is the first Arabic Levantine Twitter dataset for hate speech and abusive language. It was presented at the 3rd Workshop on Abusive Language Online (ALW-2019), co-located with ACL-2019 in Florence, Italy.
The volatile political and social atmosphere in Levantine-speaking countries, particularly Syria and Lebanon, has always been associated with intense online debates containing toxic content: hate speech and abusive language. L-HSAB comprises 5,846 Syrian/Lebanese political tweets labeled as normal, abusive or hate. To capture heated political debates, the collection covers tweets posted between March 2018 and February 2019.
The main classification tasks are:
1- Binary classification (Normal, Abusive)
2- Multi-class classification (Normal, Abusive, Hate)
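As a minimal sketch of how the two tasks could share one set of annotations, the snippet below maps label strings to integer ids. The function names are hypothetical, and folding "hate" into the abusive class for the binary task is an assumption made here for illustration; the description above does not say how hate tweets are treated in the binary setting.

```python
# Label names as listed in the dataset description.
MULTI_CLASS_LABELS = ("normal", "abusive", "hate")


def to_multiclass_id(label: str) -> int:
    """Map a label string to an id for the three-way task.

    Raises ValueError for a label outside the three known classes.
    """
    return MULTI_CLASS_LABELS.index(label.strip().lower())


def to_binary_id(label: str) -> int:
    """Map a label string to 0 (normal) or 1 (abusive).

    Assumption: 'hate' is merged into the abusive class. Adjust this
    if the binary task should instead drop hate tweets.
    """
    return 0 if to_multiclass_id(label) == 0 else 1


if __name__ == "__main__":
    for raw in ("Normal", "Abusive", "Hate"):
        print(raw, to_multiclass_id(raw), to_binary_id(raw))
```

Validating through `to_multiclass_id` first means a misspelled label fails loudly instead of being counted silently as abusive.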
The dataset is split into train and test sets. Each record has two features: the tweet and its annotation (Normal, Abusive or Hate). The annotation process was conducted by three Levantine-speaking annotators. The annotation instructions defined the three label categories as:
- Normal tweets are those instances with no offensive, aggressive, insulting or profane content.
- Abusive tweets are those instances that contain offensive, aggressive, insulting or profane content.
- Hate tweets are those instances that: (a) contain abusive language, (b) direct the abusive language at a specific person or a group of people and (c) demean or dehumanize that person or that group of people based on their descriptive identity (race, gender, religion, disability, skin color, belief).
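The three definitions above can be read as a small decision rule. The sketch below is only an illustration of how criteria (a), (b) and (c) combine; the class and function names are hypothetical, and the actual labels were assigned by human annotators, not by code.

```python
from dataclasses import dataclass


@dataclass
class TweetJudgement:
    """An annotator's yes/no answers for one tweet (hypothetical structure)."""
    has_abusive_language: bool      # criterion (a)
    targets_person_or_group: bool   # criterion (b)
    demeans_identity: bool          # criterion (c)


def assign_label(judgement: TweetJudgement) -> str:
    """Combine the three criteria into Normal, Abusive or Hate."""
    if not judgement.has_abusive_language:
        # No offensive, aggressive, insulting or profane content.
        return "Normal"
    if judgement.targets_person_or_group and judgement.demeans_identity:
        # All of (a), (b) and (c) hold.
        return "Hate"
    # Abusive content that does not meet both (b) and (c).
    return "Abusive"


if __name__ == "__main__":
    print(assign_label(TweetJudgement(True, True, True)))
    print(assign_label(TweetJudgement(True, True, False)))
    print(assign_label(TweetJudgement(False, False, False)))
```

Under this reading Hate is a strict subset of abusive content: a tweet that is abusive and targeted but does not demean an identity stays in the Abusive class.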
We would like to acknowledge Hala Mulki for sharing this data.
Hala Mulki, Hatem Haddad, Chedi Bechikh Ali, Halima Alshabani (2019), "L-HSAB: A Levantine Twitter Dataset for Hate Speech and Abusive Language".
