Research Methodology Book In Urdu Pdf


Olowookere Devost

Aug 4, 2024, 9:15:15 PM

to gravheumasce

Social media have become a viable medium for communication, collaboration, and the exchange of information, knowledge, and ideas. However, because users can remain anonymous, incidents of hate speech and cyberbullying have spread across the globe. This problem has recently drawn the attention of researchers worldwide, and studies have been undertaken to formulate strategies for automatic detection of cyberaggression and hate speech, ranging from machine learning models with large feature sets to more complex deep neural network models, across different social networking (SN) platforms. However, existing research is directed towards mature languages, leaving a large gap for newly adopted, resource-poor languages. One such language, recently adopted worldwide and especially in South Asian countries for communication on social media, is Roman Urdu, i.e., Urdu written in Roman script. To address this research gap, we performed extensive preprocessing on Roman Urdu microtext. This involves building a Roman Urdu slang-phrase dictionary and mapping slang terms after tokenization. We also eliminated cyberbullying domain-specific stop words to reduce the dimensionality of the corpus. The unstructured data were further processed to handle encoded text formats and metadata/non-linguistic features. Furthermore, we performed extensive experiments by implementing RNN-LSTM, RNN-BiLSTM, and CNN models, varying the number of epochs and model layers and tuning hyperparameters, to analyze and uncover cyberbullying textual patterns in Roman Urdu. The performance of the models was evaluated using different metrics to present a comparative analysis. Results show that RNN-LSTM and RNN-BiLSTM performed best, achieving validation accuracies of 85.5% and 85%, whereas the F1 scores were 0.7 and 0.67, respectively, over the aggression class.
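The slang mapping and stop-word removal steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the dictionary entries, the stop-word list, and the function names `tokenize` and `preprocess` are all hypothetical, and the sketch handles only single-token slang, whereas the paper describes a slang-phrase dictionary.

```python
import re

# Hypothetical entries for illustration only; the paper's actual
# Roman Urdu slang lexicon is not reproduced here.
SLANG_MAP = {"nhi": "nahi", "kia": "kya", "plz": "please"}

# Hypothetical stop words; the paper uses a domain-specific list.
STOP_WORDS = {"hai", "ka", "ki", "ko"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def preprocess(text):
    """Tokenize, map slang tokens to a canonical form, then drop stop words."""
    tokens = tokenize(text)
    mapped = [SLANG_MAP.get(token, token) for token in tokens]
    return [token for token in mapped if token not in STOP_WORDS]


print(preprocess("Tum ko kia hua hai, plz batao"))
```

Mapping slang before removing stop words means that a variant spelling of a stop word could, in principle, be normalized first and then filtered out; whether the paper applies the steps in this order beyond "mapping after tokenization" is an assumption.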

Recently, Roman Urdu has become a popular and viable language for communication on different social networking platforms. Urdu is the national and official language of Pakistan and is predominant among most communities across different regions. A survey statistic in [6] states that 300 million people speak Urdu and that approximately 11 million speakers are in Pakistan, most of whom have switched to Roman Urdu for textual communication, typically on social media. Urdu is a linguistically rich and morphologically complex language [7]. Roman Urdu varies widely in word structure, writing style, irregularities, and grammatical composition. It lacks a standard lexicon and available resources, which makes NLP tasks extremely challenging.

This paper addresses the problem of toxicity/cyberbullying detection in Roman Urdu using deep learning techniques and advanced preprocessing methods, including lexicons and resources developed specifically for this work. Analyzing the structure and patterns behind these aggressive behaviors, particularly in a newly adopted language, and framing the analysis as a comprehensive computational task is very complicated. The major contributions of this study are a slang and contraction mapping procedure, together with a slang lexicon for Roman Urdu, and the development of hybrid deep neural network models to capture complex aggression and bullying patterns.

The rest of the paper is organized as follows. A review of existing literature is presented in the "Related Work" section. The "Problem statement" section states the research gap and gives a formal definition of the addressed problem. The "Methodology" section describes the steps of the research methodology and the techniques and models used in the experiments. The advanced preprocessing steps applied to Roman Urdu data are elaborated in the "Data Preprocessing on Roman Urdu microtext" section. The implementation of the proposed model architecture and the hyperparameter settings are discussed in the "Experimental Setup" section. The "Results and Discussion" section presents and discusses the study results, and finally the "Conclusion" section concludes the research work and provides future research directions.

Due to the growth of social media communication and the adverse effects of its darker side on users, automatic cyberbullying detection has become an emerging and evolving research area [8]. The work in [9] presents a cyberbullying detection algorithm for textual data in English and is considered a pioneering and highly cited study. The authors divided the task into text-classification subproblems related to sensitive topics and collected 4500 textual comments on controversial YouTube videos. The study implemented Naive Bayes, SVM, and J48 binary and multiclass classifiers using general and specific feature sets. The study in [10] applied deep learning architectures to a Kaggle dataset and conducted an experimental analysis to determine the effectiveness and performance of the deep learning algorithms LSTM, BiLSTM, RNN, and GRU in detecting antisocial behavior. The authors in [11] extracted data from four platforms, i.e., Twitter, YouTube, Wikipedia, and Reddit, to develop an online hate classifier for English using different classification techniques. The research in [12] developed an automated approach to detect toxicity and unethical behavior in online communication using word embeddings and varying neural network layers. The authors suggested that LSTM layers and mimicked word embeddings can uncover such behavior with a good level of accuracy.

In recent years, a few studies have addressed languages other than English. The work in [13] gathered textual data in Turkish from Instagram and Twitter. The authors implemented Multinomial Naive Bayes, SVM, KNN, and decision trees for cyberbullying detection, along with chi-square and information gain (IG) for feature selection. The work in [14] also addresses cyberaggression in Turkish. It compares different machine learning algorithms and found the best results using the Light Gradient Boosting Model. Van Hee, Cynthia, et al. in [15] proposed a cyberbullying detection scheme for Dutch. This is the first study on Dutch social media. Data was collected from ASKfm, where users can ask and answer questions. The research uses default parameter settings for an un-optimized linear-kernel SVM based on n-grams and a keyword system to identify bullying traces. The F1 score for Dutch was 61%. Cyberbullying detection in Arabic was addressed in [16]. This study used Dataiku DSS and WEKA for the ML tasks. The data was scraped from Facebook and Twitter. The study concluded that, although the detection approach was not comparable with studies in English, Naive Bayes and SVM overall yielded reasonable performance. The work in [17] by Gomez-Adorno, Helena, et al. proposed automatic aggression detection for Spanish tweets. Several types of n-grams and linguistically motivated patterns were used, but the best run achieved an F1 score of only 42.85%. The studies presented in [18,19,20] address automatic detection of cyberbullying content in German. The research in [18] proposed an approach based on SVM, CNN, and an ensemble model using unigrams, bigrams, and character n-grams to categorize offensive tweets in German.
The research presented in [21] was the first attempt to identify bullying traces in Indonesian. Association rule mining and FP-growth text mining were used to identify trends in bullying patterns in the cities of Jakarta and Surabaya using social media text. This baseline study on Indonesian was further extended by Nurrahmi, Hani et al. in [22]. The study in [23] made the first attempt to develop a corpus of code-mixed Hindi and English data. The authors proposed a scheme for hate speech detection using n-grams and lexical features. An ensemble approach combining the predictions of Convolutional Neural Network (CNN) and SVM algorithms was used to identify such patterns. The weighted F1 score for Hindi ranged between 0.37 and 0.55 across different experiments [24]. In 2019, the Association for Computational Linguistics initiated a project for automatic detection of cyberbullying in Polish [25]. The research in [26] attempted to uncover cyberbullying patterns in Bengali by implementing passive aggressive, SVM, and logistic regression classifiers. The best accuracy achieved was 78.1%. Recently, the work in [27] presented the first study in Roman Urdu, using a lexicon-based approach. The dataset was highly skewed, comprising only 2.2% toxic data. According to [28], biased sampling and measurement errors make classification highly error-prone when working with such datasets. Moreover, the reliance on predefined bullying and non-bullying lexicons for pattern detection was a shortcoming of this study.
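To see why a 2.2% toxic share is a concern, and why F1 over the minority class is reported alongside accuracy, consider a worked example. The sketch below uses synthetic labels (1000 samples, 22 of them toxic, chosen only to match the 2.2% ratio) and a trivial classifier that always predicts "non-toxic". The data and the helper name `precision_recall_f1` are illustrative assumptions and are not taken from [27] or [28].

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Synthetic labels: 22 toxic (1) and 978 non-toxic (0), i.e. 2.2% toxic.
y_true = [1] * 22 + [0] * 978
# A trivial classifier that never flags anything as toxic.
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}")
print(f"toxic-class F1 = {f1:.3f}")
```

The trivial classifier is right on every non-toxic sample and wrong on every toxic one, so its accuracy is high while its recall and F1 on the toxic class are zero. On skewed data, accuracy alone says little about how well bullying content is actually detected.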

For automated detection of complex cyberbullying patterns, existing studies employ supervised, unsupervised, hybrid, and deep learning models, a wide range of feature engineering techniques, and various corpora and social media platforms. However, the existing literature is mainly oriented towards unstructured data in English. Some recent studies and projects have been initiated in other languages, as discussed previously. To the best of our knowledge, and based on our literature review, no detailed work in Roman Urdu has systematically analyzed cyberbullying detection using advanced preprocessing techniques (involving Roman Urdu resources) and deep learning approaches under different configurations.

The growing use of social networking sites, together with freedom of speech, has given individuals across all demographics fertile ground for cyberbullying and cyberaggression. This has drastic and noticeable impacts on a victim's behavior, ranging from disturbed emotional wellbeing and isolation from society to more severe and even deadly consequences [29]. Automatic cyberbullying detection remains a very challenging task because social media content is written in natural language and is usually posted as unstructured free text that disregards language norms, rules, and standards. A substantial number of research studies focus primarily on discovering cyberbullying textual patterns across diverse social media platforms, as discussed in the literature review section. However, most of the detection schemes and automated approaches formulated are for resource-rich and mature languages spoken worldwide. Roman Urdu is used primarily in South Asia and is a highly resource-deficient language. Hence, this research proposes data preprocessing techniques for Roman Urdu text and develops deep learning-based hybrid models for automated cyberbullying detection in Roman Urdu. The outcomes of this study, if implemented, will assist cybercrime centers and investigation agencies in monitoring social media content and in making cyberspace a safer and more secure place for all segments of society.