Developing Machine Learning Coding Similarity Indicators for C and C++ Corpuses

DOI:10.21742/AJEMR.2020.5.3.01

AUTHORS

Ajinkya Kunjir, Student, Department of Computer Science, Lakehead University, Thunder Bay, Ontario, Canada

Jinan Fiaidhi, Professor, Department of Computer Science, Lakehead University, Thunder Bay, Ontario, Canada

ABSTRACT

In the digital era of technology and advanced automation, data and information are vulnerable to copying and alteration, and to others claiming the work as their own. Source code theft, or e-plagiarism, is challenging to track across the hundreds of assignments submitted by students. Despite years of effort, the digital plagiarism detection software currently available performs well only on the literal plagiarism a naïve programmer would attempt. The available source code similarity detectors give insufficient results when a student uses more complex strategies such as word substitution or reordering of programming constructs. To overcome these challenges, this research aims to deliver an assistive forensic engine with which professors and teaching assistants can evaluate the similarities among students' assignments. The primary objective is to help evaluators get closer to identifying sophisticated code thieves while abiding by the university's academic dishonesty regulations. The methodology of the proposed forensic similarity detection engine is designed specifically for settings where C and C++ are the primary programming languages used in academic assignments. After the ATM (attribute counting metrics) are selected, the system implementation is divided into two phases. Phase one consists of lexical analysis and tokenizer customization. Phase two consists mostly of running a supervised learning algorithm on the generated data to classify each comparison of two files as a truth value. The recorded similarity elements and observations can be presented to evaluators as visualizations for ease of understanding and efficient decision making. The paper also relates the proposed system to previous and existing systems and addresses their past issues in the latter half of the paper.

KEYWORDS

Similarity detection, E-Plagiarism, Tokenization, Lexical analysis, Distance algorithm, Euclidean distance
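
As a rough illustration of the ideas named in the abstract and keywords (tokenization, attribute counting, Euclidean distance), the following minimal Python sketch counts a few token categories in two C/C++ snippets and measures the distance between the resulting count vectors. The token categories, the keyword subset, and all function names here are assumptions chosen for illustration; they are not the paper's actual feature set or tokenizer, and the supervised classification phase is not shown.

```python
import math
import re
from collections import Counter

# Illustrative subset of C/C++ keywords (an assumption, not the paper's list).
KEYWORDS = {"int", "float", "double", "char", "void", "if", "else", "for",
            "while", "do", "return", "switch", "case", "break", "struct", "class"}

# Hypothetical token categories; comments are listed first so "//" is not
# mistaken for an operator.
TOKEN_SPEC = [
    ("comment", r"//[^\n]*|/\*.*?\*/"),
    ("string", r'"(?:\\.|[^"\\])*"'),
    ("number", r"\d+(?:\.\d+)?"),
    ("word", r"[A-Za-z_]\w*"),
    ("operator", r"[-+*/%=<>!&|^~]+"),
    ("punct", r"[{}()\[\];,]"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

# Order of the attributes in the count vector.
ATTRIBUTES = ("keyword", "identifier", "number", "string", "operator", "punct")


def attribute_counts(source):
    """Count tokens per category, ignoring comments and identifier spelling."""
    counts = Counter()
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup
        if kind == "comment":
            continue
        if kind == "word":
            kind = "keyword" if match.group() in KEYWORDS else "identifier"
        counts[kind] += 1
    return counts


def to_vector(counts):
    """Turn the counts into a fixed-order numeric vector."""
    return [counts[name] for name in ATTRIBUTES]


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def file_distance(source_a, source_b):
    """Distance between the attribute-count vectors of two sources."""
    return euclidean_distance(to_vector(attribute_counts(source_a)),
                              to_vector(attribute_counts(source_b)))


original = "int sum(int a, int b) { return a + b; }"
renamed = "int add(int x, int y) { return x + y; } // renamed copy"
different = "void f() { for (int i = 0; i < 10; i++) { g(i); } }"

print(file_distance(original, renamed))
print(file_distance(original, different))
```

Because identifiers are counted rather than compared by name, the renamed copy yields the same vector as the original, which is the kind of word substitution the abstract says literal-match detectors miss. In a fuller pipeline, such distances (or the vectors themselves) would presumably serve as inputs to the supervised classifier mentioned in phase two.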