The software and text similarity tester SIM
Dick Grune

SIM tests lexical similarity in natural language texts and in programs in C, C++, Java, Pascal, Modula-2, Miranda, Lisp, and 8086 assembler code. It has been used
Similarities are reported in two parallel columns, with chunks of similar text side by side; in diff-format; or in percentages.
The similarity tester consists of nine separate programs: eight for the programming languages mentioned above, and one for text.
The text similarity tester, sim_text, is very efficient and can analyse documentation for duplicate sections, or text-compare large numbers of publications. Sim_text is language-independent. It assumes the input to be in UTF-8, and will therefore work on any language presented in that way: Korean, English, Japanese, Icelandic, Hindi, etc.
The computer language versions are very useful for finding duplicate code in software projects. They can also efficiently compare large bodies of students' code with code collected in previous years, to find signs of cheating.
Obtaining reliable similarity percentages efficiently is not easy. The paper Similarity_Percentage_Computation details the problems, and the solutions sim uses.
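To give a feel for what a similarity percentage can mean, here is a minimal Python sketch. It is not sim's algorithm and is not taken from the paper: it assumes whitespace-separated tokens, uses Python's standard `difflib.SequenceMatcher` to find common runs, and the names `similarity_percentage` and `min_run` are hypothetical, introduced only for this illustration.

```python
from difflib import SequenceMatcher


def tokens(text):
    # Assumption: tokens are whitespace-separated words.
    # A real tester would use a proper lexical analyser per language.
    return text.split()


def similarity_percentage(a, b, min_run=3):
    """Percentage of the tokens of `a` that lie in runs of at least
    `min_run` consecutive tokens which also occur in `b`.

    `min_run` is a hypothetical threshold that ignores very short,
    probably accidental, matches.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta:
        return 0
    matcher = SequenceMatcher(None, ta, tb, autojunk=False)
    # Matching blocks are non-overlapping, so their sizes can be summed.
    covered = sum(
        m.size for m in matcher.get_matching_blocks() if m.size >= min_run
    )
    return 100 * covered // len(ta)


a = "the quick brown fox jumps over the lazy dog"
b = "a quick brown fox jumps high"
print(similarity_percentage(a, b))  # 44: 4 of the 9 tokens of a
print(similarity_percentage(b, a))  # 66: 4 of the 6 tokens of b
```

Even in this toy version the percentage is asymmetric: the same shared run covers a different fraction of each text, so the result depends on which text is taken as the reference.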
Since this piece of handicraft did not qualify as research, there are no international papers on it. The work was described in Dutch in Dick Grune, Matty Huntjens, Het detecteren van kopieën bij informatica-practica, Informatie, 31, 11, Nov 1989, pp. 864-867 (lit. ref.). An English translation of the paper is also available.