DETECTING SIMILAR DOCUMENT USING WORD-LENGTH N-GRAMS

  • Fatma Indriani, Universitas Lambung Mangkurat
  • Irwan Budiman, Universitas Lambung Mangkurat
Keywords: text similarity, plagiarism detection, word length n-gram, Dice-Coefficient, Indonesian language

Abstract

Various methods for detecting plagiarism in text documents have been researched and
developed. One family of methods for detecting literal plagiarism is lexical, relying
mostly on character-based and word-based n-grams. N-grams based on word length need less
storage and computing time, because they encode only the length of each word (its
number of letters). This paper applies word-length n-grams to represent documents and
uses the Dice coefficient to measure the similarity between the n-gram representations.
The system was tested on a corpus of Indonesian-language documents containing literal
plagiarism, and it detected all pairs of documents that contain plagiarism. From the
test results, the optimal value of n is n ≥ 6 and the limit for the Dice coefficient is
t < 0.3.
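
The two steps named in the abstract, encoding a document as word-length n-grams and comparing two documents with the Dice coefficient, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the regular-expression tokenizer, and the choice to compare sets of distinct n-grams are assumptions made for illustration. The Dice coefficient is computed in its usual set form, 2|A ∩ B| / (|A| + |B|).

```python
import re


def word_lengths(text):
    """Encode a text as the sequence of its word lengths (letters per word).

    Assumption: a word is a maximal run of letters; digits and punctuation
    are treated as separators.
    """
    return [len(word) for word in re.findall(r"[^\W\d_]+", text)]


def word_length_ngrams(text, n):
    """Return the set of distinct word-length n-grams found in a text."""
    lengths = word_lengths(text)
    return {tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)}


def dice_coefficient(a, b):
    """Dice coefficient of two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0  # assumption: two empty profiles count as dissimilar
    return 2 * len(a & b) / (len(a) + len(b))


# Usage: compare two short Indonesian sentences with n = 2.
doc_a = word_length_ngrams("Saya pergi ke pasar.", 2)
doc_b = word_length_ngrams("Ibu pergi ke pasar.", 2)
similarity = dice_coefficient(doc_a, doc_b)
```

Because only word lengths are kept, different words of the same length produce identical n-grams, so short n-grams collide easily. This fits the abstract's finding that larger values of n work best. In this sketch, n and the decision limit t are left as parameters chosen by the caller.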


Published
2017-11-20