VOCS: A Versatile Online Clustering System for Source Code Plagiarism Detection

Computer Sciences | Physical Sciences and Mathematics


James Schnepf, Computer Science


The pervasive connectivity of the Internet has increased both productivity and plagiarism among students. Detecting plagiarism is important for academic institutions, but because of the enormous amount of student work available on the Internet, manual detection is difficult and tedious. Automated plagiarism detection systems that scale to massive submission corpora exist for natural language essays, but current systems for detecting source code plagiarism are limited to small submission sets, such as a single class.

Current source code plagiarism detection systems are limited because they depend on time-inefficient pairwise comparisons, in which each submission is compared to every other submission. Consequently, for n submissions, the number of comparisons is $(n^2 - n)/2$, which is large even for moderately sized n.
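To make the quadratic growth concrete, the following minimal sketch evaluates the formula above and contrasts it with the case where comparisons are restricted to groups. The function names and the cluster sizes used here are illustrative assumptions, not figures from the project.

```python
from itertools import combinations


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n submissions: (n^2 - n) / 2."""
    return (n * n - n) // 2


def clustered_comparisons(cluster_sizes) -> int:
    """Comparisons needed when only submissions in the same cluster are compared.

    cluster_sizes is a hypothetical list of group sizes; real clusters
    would rarely be this evenly balanced.
    """
    return sum(pairwise_comparisons(size) for size in cluster_sizes)


# Sanity check: the closed form matches a brute-force enumeration of pairs.
assert pairwise_comparisons(5) == len(list(combinations(range(5), 2)))

if __name__ == "__main__":
    print(pairwise_comparisons(100))           # a single class
    print(pairwise_comparisons(10_000))        # a larger repository
    print(clustered_comparisons([100] * 100))  # same 10,000 submissions, 100 equal clusters
```

Under the assumed even split, 10,000 submissions require 49,995,000 comparisons when compared exhaustively but 495,000 when compared only within 100 clusters of 100, roughly a hundredfold reduction.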

This project introduces an approach to detecting source code plagiarism that uses clustering to reduce the number of pairwise comparisons needed to identify likely plagiarism. The approach uses tokens to create a normalized version of each submission, and then uses an adaptation of the k-gram plagiarism detection technique to create a summary of each tokenized submission. Next, the k-gram sets are clustered by grouping submissions that share certain code characteristics. Within a group, the similarity of each pair of k-gram sets is quantified using three distance metrics, and any potential plagiarism is highlighted. Because only submissions within the same cluster are compared, we expect the total number of comparisons to drop dramatically. This approach should allow the system to scale to massive repositories while maintaining the accuracy of available source code plagiarism detection systems.
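The pipeline described above (tokenize, summarize with k-grams, cluster, compare within clusters) can be sketched roughly as follows. This is an illustration under stated assumptions, not the VOCS implementation: the token classes, the clustering characteristic (the set of control keywords a submission uses), the single Jaccard distance standing in for the three unspecified metrics, and the threshold are all hypothetical choices.

```python
import re
from collections import defaultdict
from itertools import combinations

# Assumed keyword list; a real system would use the target language's grammar.
KEYWORDS = {"if", "else", "for", "while", "return", "def", "class"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Normalize source text: identifiers become ID, numbers become NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")
        elif tok[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)  # operators and punctuation are kept as-is
    return tokens


def kgrams(tokens, k=4):
    """Summarize a token sequence as the set of its length-k windows."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def cluster_key(tokens):
    """Hypothetical code characteristic: which keywords the submission uses."""
    return frozenset(t for t in tokens if t in KEYWORDS)


def jaccard_distance(a, b):
    """One possible distance between two k-gram sets (0 means identical)."""
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)


def suspicious_pairs(submissions, k=4, threshold=0.3):
    """Compare only submissions that fall into the same cluster."""
    clusters = defaultdict(list)
    for name, source in submissions.items():
        tokens = tokenize(source)
        clusters[cluster_key(tokens)].append((name, kgrams(tokens, k)))
    flagged = []
    for members in clusters.values():
        for (name_a, grams_a), (name_b, grams_b) in combinations(members, 2):
            distance = jaccard_distance(grams_a, grams_b)
            if distance <= threshold:
                flagged.append((name_a, name_b, distance))
    return flagged
```

Because identifiers are normalized before the k-grams are built, two submissions that differ only in variable names produce identical k-gram sets and a distance of zero. An exact-match cluster key like the one assumed here is deliberately crude; a copied submission with one added keyword would land in a different cluster, so a practical system would need a more tolerant grouping criterion.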