Inspiration
The inspiration for this project was my computer science teacher, who said that he needed some form of plagiarism checker between students, as he had started to see many similar code files. He said that students were able to change the variable names and the order of some of the lines; however, it was still clearly not their own work, and he would have to go through everyone's code from now on.
What it does
The program takes in two text files, containing either writing or code, and returns a "percent match".
How we built it
The program reads two specified files and compares each line of the first file to every line of the second one, updating the percent match as it goes. The checking is done by breaking each line down into comparable words, usually items separated by spaces.
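To make the line-by-line idea concrete, here is a minimal sketch in Python. It is not the project's actual code: the function names (`tokenize`, `line_overlap`, `percent_match`) and the scoring rule (each line of the first file scores its best word overlap with any line of the second, and the scores are averaged) are assumptions chosen for illustration.

```python
def tokenize(line):
    """Split a line into lowercase words on whitespace."""
    return line.lower().split()


def line_overlap(words_a, words_b):
    """Fraction of the words in words_a that also appear in words_b."""
    if not words_a:
        return 0.0
    vocab_b = set(words_b)
    return sum(1 for w in words_a if w in vocab_b) / len(words_a)


def percent_match(lines_a, lines_b):
    """Average, over the non-empty lines of A, of the best overlap with any line of B.

    Assumed scoring rule for illustration; the original program may differ.
    """
    tokenized_b = [tokenize(line) for line in lines_b]
    scores = []
    for line in lines_a:
        words = tokenize(line)
        if not words:
            continue  # blank lines carry no evidence either way
        best = max((line_overlap(words, other) for other in tokenized_b),
                   default=0.0)
        scores.append(best)
    if not scores:
        return 0.0
    return 100.0 * sum(scores) / len(scores)
```

Because every line of the first file is checked against every line of the second, reordering lines does not lower the score in this sketch. Renamed variables would still lower it, so catching those would need extra handling.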
Challenges we ran into
One of the challenges I ran into was overly high matches on two files that were fairly distinct in content. In other words, the program would often return a strikingly high percent for files that were most likely not plagiarized. I realized that this was because common words such as "and", "the", and "a" would increase the percent. The program was simply looking instead of understanding.
Accomplishments that we're proud of
I was proud of how I was able to combat the challenge I mentioned above and make the program look at the content of the two files as opposed to simply the words. In other words, I started to make the program understand meaning. To do this, I created a "most common words" file. This file contains around 100 words that don't really add meaning to writing and are found in almost every text. When the code came across these words, it would ignore them and base the match on the more meaningful words instead. This greatly increased the reliability and accuracy of the program, so it wouldn't increase the match percentage over items that are most likely not plagiarism.
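The filtering step could look something like the sketch below. The `COMMON_WORDS` set is a short hypothetical stand-in for the roughly 100-word file, and the helper names (`meaningful_words`, `shared_fraction`) are assumptions for illustration rather than the project's real functions.

```python
# Hypothetical stand-in for the "most common words" file (the real one has ~100 words).
COMMON_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def meaningful_words(line, common_words=COMMON_WORDS):
    """Lowercase and split a line on whitespace, dropping the common words."""
    return [w for w in line.lower().split() if w not in common_words]


def shared_fraction(line_a, line_b, common_words=COMMON_WORDS):
    """Fraction of line_a's meaningful words that also appear in line_b."""
    words_a = meaningful_words(line_a, common_words)
    if not words_a:
        return 0.0
    words_b = set(meaningful_words(line_b, common_words))
    return sum(1 for w in words_a if w in words_b) / len(words_a)
```

In this sketch, comparing "the cat and the dog" with "the car and the bus" gives 0.6 when no words are filtered (pass `common_words=set()`), because "the" and "and" account for three of the five words. With the filter in place the same pair scores 0.0, which is the behavior described above.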
What we learned
During the course of this project, I learned that there is really no defined match percent that separates copied material from original material. The decision is always up to the teacher or whoever is viewing the writing, but the program is still useful for flagging work that may have been copied.
What's next for Original Work Detection
Next, I would like to create an app or webpage for the file comparison, as not everyone may understand how to run it as a program. It would also be beneficial if the program could match across multiple files as opposed to comparing just two at a time.