Developing Machine Learning Coding Similarity Indicators for C and C++ Corpuses
AUTHORS
Ajinkya Kunjir, Student, Department of Computer Science, Lakehead University, Thunder Bay, Ontario, Canada
Jinan Fiaidhi, Professor, Department of Computer Science, Lakehead University, Thunder Bay, Ontario, Canada
ABSTRACT
In the digital era of advanced automation, data and information are vulnerable to being copied, altered, and claimed as someone else's own work. Source code theft, or e-plagiarism, is challenging to track across hundreds of assignments submitted by students. Despite years of effort, currently available digital plagiarism detection software performs well only at detecting literal plagiarism of the kind a naïve programmer would commit. The available source code similarity detectors provide insufficient results when a student uses more complex strategies such as word substitution or reordering of programming constructs. To overcome these challenges, this research aims to deliver an assistive forensic engine that helps professors and teaching assistants evaluate the similarities among students' assignments. Its primary objective is to help evaluators identify sophisticated code thieves and uphold the university's academic dishonesty regulations. The methodology of the proposed forensic similarity detection engine is designed specifically for courses in which C and C++ are the main programming languages used in academic assignments. After the ATM (attribute counting metrics) are selected, the system implementation is divided into two phases. Phase one consists of lexical analysis and tokenizer customization. Phase two mostly consists of applying a supervised learning algorithm to the generated data to classify the comparison of two files as a truth value. The recorded similarity elements and observations can be presented to evaluators as visualizations for ease of understanding and efficient decision making. The paper also compares the proposed system with previous and existing systems and, in its latter half, addresses the issues noted in them.
KEYWORDS
Similarity detection, E-Plagiarism, Tokenization, Lexical analysis, Distance algorithm, Euclidean distance
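To make the idea in the abstract concrete, the following is a minimal Python sketch of attribute counting over a tokenized C/C++ file, followed by a Euclidean distance between the resulting count vectors. It is an illustration under stated assumptions, not the authors' implementation: the keyword subset, the regular-expression tokenizer, the five attribute classes, and the names `attribute_counts` and `euclidean_distance` are all hypothetical, and the supervised classification step of phase two is not shown.

```python
import math
import re
from collections import Counter

# Small illustrative subset of C/C++ keywords (an assumption, not the paper's list).
KEYWORDS = {"int", "float", "double", "char", "void", "if", "else", "for",
            "while", "do", "return", "switch", "case", "break", "continue",
            "struct", "class"}

# One named alternative per token class; comments come first so they can be dropped.
TOKEN_RE = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<string>"(?:\\.|[^"\\])*")
  | (?P<number>\d+(?:\.\d+)?)
  | (?P<word>[A-Za-z_]\w*)
  | (?P<operator>[-+*/%=<>!&|^~]+)
""", re.VERBOSE | re.DOTALL)

# Hypothetical attribute set used to build the count vector.
ATTRIBUTES = ("keyword", "identifier", "number", "string", "operator")


def attribute_counts(source):
    """Tokenize C/C++ source text and count tokens per attribute class."""
    counts = Counter()
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "comment":
            continue  # comments do not contribute to any attribute
        if kind == "word":
            kind = "keyword" if match.group() in KEYWORDS else "identifier"
        counts[kind] += 1
    return [counts[name] for name in ATTRIBUTES]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x, int y) { /* same */ return x + y; }"
different = "int main() { for (int i = 0; i < 10; i++) x = x * 2; return 0; }"

print(euclidean_distance(attribute_counts(original), attribute_counts(renamed)))
print(euclidean_distance(attribute_counts(original), attribute_counts(different)))
```

Because the counts depend on token classes rather than on the identifier names themselves, the renamed copy produces the same vector as the original and the distance between them is 0.0, while the unrelated snippet lies farther away. Counts of this kind are invariant to word substitution, but they are also coarse, which is presumably one reason to feed them to a classifier rather than rely on a fixed distance threshold.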