Measuring the amount of shared information between two documents is key to addressing a number of Natural Language Processing (NLP) challenges such as Information Retrieval (IR), Semantic Textual Similarity (STS), Sentiment Analysis (SA), and Plagiarism Detection (PD). In this paper, we report a plagiarism detection system based on two layers of assessment: 1) fingerprinting, which simply compares the documents' fingerprints to detect verbatim reproduction; 2) word embedding, which uses the semantic and syntactic properties of words to detect more sophisticated reproductions. Moreover, Word Alignment (WA), Inverse Document Frequency (IDF), and Part-of-Speech (POS) weighting are applied to the examined documents to support the identification of the words that are most descriptive of each textual unit. In the present work, we focused on Arabic documents and evaluated the performance of the system on a data set holding three types of plagiarism: 1) simple reproduction (copy and paste); 2) word and phrase shuffling; 3) intelligent plagiarism, including synonym substitution, diacritics insertion, and paraphrasing. The results show a recall of 88% and a precision of 86%. Compared with the systems participating in the Arabic Plagiarism Detection Shared Task 2015, our system outperforms all of them with a plagiarism detection score (Plagdet) of 83%.
Research misconduct in Arabic scholarship is no exception. Unfortunately, most plagiarism detection tools operate on ASCII (American Standard Code for Information Interchange) data, and very few support Unicode data for plagiarism comparison. Plagiarism detection for scholarly research written in the Arabic language is therefore poorly supported. The scarcity of Arabic literature and resources on the Internet, together with the limited commitment to research in Arabic NLP (Natural Language Processing), are the main reasons behind the absence of efficient plagiarism tools for a language spoken and written by around 423 million people.
The main contribution of this ongoing project is twofold. The first part, at its preliminary stage, is to construct a plagiarism corpus from defended dissertations in the thesis repository of the University of Jordan library. The second is to develop a plagiarism detection system dedicated to the Arabic language that is capable of detecting verbatim plagiarism and some intelligent plagiarism, including word-order changes, paraphrasing, and synonym replacement. Hereafter, we refer to the corpus as JUPlag and to the plagiarism detection system as the PD system.
The remainder of the paper is organized as follows. Section 2 provides background and discusses related literature. Section 3 introduces the research methodology. Section 4 discusses the experiments and findings. Finally, Section 5 presents the conclusion of this paper and future work.
The lack of fundamental research skills may be the most common reason why university students and researchers plagiarize (Devlin & Gray, 2007). Academic writing is not an easy task: it requires clarity, conciseness, focus, structure, and evidence, as well as extensive reading, appropriate use of words and grammar, and learning how to express ideas and thoughts. Several studies have pointed to other reasons for plagiarism: lack of author confidence, shortage of time, fear of failure, pressure from parents and scholarship committees to maintain high grades, lack of punishment by the institution, ease of appropriation, and the absence of good plagiarism detection systems (Devlin & Gray, 2007; Eret & Ok, 2014; Franklyn-Stokes & Newstead, 1995).
Plagiarism detection software (PDS) can be content-based (extrinsic) or stylometry-based (intrinsic) (Rahman, 2015). Extrinsic plagiarism detection (EPD) discovers instances of appropriation by comparing a suspicious document with reference documents (a database or a corpus). Intrinsic plagiarism detection (IPD), on the other hand, discovers instances of appropriation in the suspicious document without using any reference corpus. Figure 1 depicts the common types of text plagiarism and the classification of plagiarism detection software tools.
A plagiarism detection system should ideally handle most types of plagiarism, including text modifications by word shifting, translation, and summarization that bypass string-matching tools. At this preliminary stage, our work handles string-matching-based plagiarism detection; we plan to enhance it with NLP techniques such as stemming and part-of-speech tagging, and with lexical resources such as the work of Baras, Sawalha, and Yagi (A more extensive wordnet for Arabic, submitted), the Arabic-WordNet, dictionaries, and thesauri.
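The string-matching layer mentioned above can be illustrated with a minimal fingerprinting sketch: word n-grams are hashed into a fingerprint set, and two documents are scored by the Jaccard overlap of their sets. The function names, the MD5 hash, and n = 3 are illustrative assumptions, not our system's actual implementation.

```python
import hashlib

def fingerprints(text, n=3):
    """Hash every word n-gram of the text into a set of fingerprints."""
    words = text.split()
    grams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return {hashlib.md5(g.encode("utf-8")).hexdigest() for g in grams}

def overlap(suspicious, source, n=3):
    """Jaccard overlap between the fingerprint sets of two documents."""
    a, b = fingerprints(suspicious, n), fingerprints(source, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)
```

An exact copy scores 1.0, an unrelated text scores 0.0, and partially copied passages fall in between, which is why fingerprinting catches verbatim reproduction but not rephrased text.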
Zaher, Shehab, Elhoseny, and Osman (2017) developed a web-based plagiarism detection system for Arabic documents, called APDS. The system operates in three phases: preparation, preprocessing, and similarity detection. After preprocessing, the query document is represented as n-gram chunks for similarity detection. The system was tested on a dataset of 10 Arabic documents and evaluated in terms of precision and recall; the authors claimed an average precision of 82% and an average recall of 92.5%. However, the paper does not specify what kind of plagiarism was detected, how the documents were represented, or how the precision and recall measures were obtained.
Mahmoud and Zrigui (2017) proposed a machine-learning-based system for detecting semantic plagiarism in Arabic documents. In the preprocessing phase, the suspicious and source documents were split into sentences and then into words, without removing stopwords. In the feature extraction phase, the TF*IDF (Term Frequency-Inverse Document Frequency) measure was calculated to weight words by importance. The word2vec algorithm was then used to learn word embeddings, with the skip-gram model employed to predict the context of a word given its current vector. For similarity calculation, they used the cosine and Euclidean distance measures, and the degrees of similarity between sentences were compared against a predefined threshold. Experiments were conducted on an open-source Arabic corpus, and the authors claimed a precision of 85% and a recall of 84%.
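The pipeline just described — TF*IDF weighting combined with word embeddings and a cosine comparison — can be sketched roughly as follows. The toy corpus-level IDF, the weighted-average pooling of word vectors, and all names are illustrative assumptions; the authors' exact formulation is not given in the paper.

```python
import math
import numpy as np

def tfidf_weights(sentences):
    """TF*IDF weight for each word in each sentence of a toy corpus."""
    n_docs = len(sentences)
    df = {}                                   # document frequency per word
    for s in sentences:
        for w in set(s.split()):
            df[w] = df.get(w, 0) + 1
    weights = []
    for s in sentences:
        words = s.split()
        tf = {w: words.count(w) / len(words) for w in set(words)}
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

def sentence_vector(sentence, weights, embeddings, dim):
    """TF*IDF-weighted average of the word vectors of a sentence."""
    vec, total = np.zeros(dim), 0.0
    for w in sentence.split():
        if w in embeddings and w in weights:
            vec += weights[w] * embeddings[w]
            total += weights[w]
    return vec / total if total > 0 else vec

def cosine(u, v):
    """Cosine similarity, with a zero-vector guard."""
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    return float(u @ v / (nu * nv)) if nu and nv else 0.0
```

Sentence pairs whose cosine similarity exceeds a predefined threshold would then be flagged as potentially plagiarized.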
Mahmoud, Zrigui, and Zrigui (2017) used a Convolutional Neural Network (CNN) approach for detecting paraphrasing plagiarism in Arabic documents. The method is said to detect paraphrasing plagiarism through the measurement of semantic relatedness between the suspicious and the original documents. Their approach has three phases: preprocessing, feature extraction, and paraphrase detection. After preprocessing, the feature extraction phase employed a skip-gram model for word-to-vector representation, so that each document is represented by a vector in a multidimensional space. The paraphrase detection phase applied the cosine similarity measure to the vectors of the suspicious and the original documents. Finally, the softmax function was used for paraphrase detection according to a predefined threshold. Experiments showed a precision rate of 88%.
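The softmax decision step can be illustrated with a minimal sketch. The two-class reading (plain vs. paraphrase) and the threshold value are assumptions; the paper does not detail the classifier's output layer.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a vector of scores."""
    z = np.asarray(scores, dtype=float)
    e = np.exp(z - z.max())      # subtract the max to avoid overflow
    return e / e.sum()

def is_paraphrase(class_scores, threshold=0.5):
    """Flag a pair as a paraphrase when the probability of the
    (hypothetical) 'paraphrase' class exceeds the threshold."""
    return float(softmax(class_scores)[1]) >= threshold
```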
However, Mahmoud et al. (2017) and Mahmoud and Zrigui (2017) conducted their experiments on an open-source Arabic corpus named OSAC (Saad & Ashour, 2010). The corpus is organized into ten categories of articles collected from multiple websites, chiefly news channels and social and commercial sites, which clearly makes it inappropriate for academic plagiarism detection. A PD corpus ought to consist of specialized content, because academics do not normally plagiarize the news or social media.
Abdelrahman, Khalid, and Osman (2017) presented a framework for content-based PD in Arabic documents. Their framework has two phases: preprocessing and document representation. They used a tree-structured model with the document at the root, the paragraphs at the second level, and the sentences at the third level of the tree. A Longest Common Substring (LCS) matching algorithm was used for comparing hashed text chunks (words, in their case). No experiments were conducted to evaluate the framework or demonstrate its effectiveness, and consequently no plagiarism detection corpus was built.
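LCS matching over hashed word sequences can be sketched with the standard dynamic-programming recurrence, here with a rolling one-row table; this is a generic illustration, not the authors' implementation.

```python
def lcs_length(a_words, b_words):
    """Length of the longest common contiguous run of (hashed) words."""
    a = [hash(w) for w in a_words]   # compare cheap hashes, not strings
    b = [hash(w) for w in b_words]
    best = 0
    dp = [0] * (len(b) + 1)          # dp[j]: run length ending at a[i], b[j]
    for i in range(1, len(a) + 1):
        prev = 0                     # holds dp[j-1] from the previous row
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = prev + 1 if a[i - 1] == b[j - 1] else 0
            best = max(best, dp[j])
            prev = cur
    return best
```

A long common run of word hashes signals a verbatim-copied chunk between a paragraph of the suspicious document and one of the source documents.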
Ghanem, Arafeh, Rosso, and Sánchez-Vega (2018) presented a system, Hybrid Plagiarism (HYPLAG), for detecting extrinsic plagiarism in Arabic texts. It follows a hybrid detection approach, adopting both corpus-based and knowledge-based techniques to detect the verbatim and rephrasing types of plagiarism. The system chunks the query (suspicious) document and the source documents into n-term sentences, and the synonyms of the query document's words are extracted from the Arabic-WordNet. The source sentences are ranked with respect to the suspicious sentences, and those with the highest scores are extracted as potentially plagiarized. Finally, the candidate and suspicious sentences are compared for similarity using the vector space model and the TF*IDF weighting measure. A similarity value that exceeds a predefined maximum threshold indicates plagiarism, while a value between the minimum and maximum thresholds triggers a further phase of feature-based semantic similarity measurement based on the synonyms extracted from the Arabic-WordNet. The system was compared with others that participated in the Arabic Plagiarism Detection PAN-Forum for Information Retrieval Evaluation (AraPlagDet PAN@FIRE) competition and was tested on the External Arabic Plagiarism Detection (ExAraPlagDet-2015) corpus; the authors reported that HYPLAG outperformed the rest with a success rate of 89%.
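The two-threshold routing just described can be sketched as follows; the threshold values are illustrative assumptions, since the paper does not report those actually used by HYPLAG.

```python
def classify_pair(similarity, t_min=0.4, t_max=0.8):
    """Route a sentence pair based on its similarity score.

    t_min and t_max are illustrative values, not those used by HYPLAG.
    """
    if similarity >= t_max:
        return "plagiarism"          # exceeds the maximum threshold
    if similarity >= t_min:
        return "semantic-check"      # defer to the WordNet-based phase
    return "clean"                   # below the minimum threshold
```

This tiered design keeps the cheap vector-space comparison as a first filter and reserves the more expensive synonym-based semantic measurement for the ambiguous middle band.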