The software and text similarity tester SIM

Dick Grune

SIM tests lexical similarity in natural language texts and in programs written in C, C++, Java, Pascal, Modula-2, Miranda, Lisp, and 8086 assembler. It has been used

to detect duplicated code in large software projects, in program text, in shell scripts and in documentation;
to detect plagiarism in (software) projects, educational and otherwise, and in scientific papers on the Web.

Similarities are reported in one of three forms: in two parallel columns, with chunks of similar text side by side; in diff format; or as percentages.

The similarity tester consists of nine separate programs, eight for the programming languages mentioned above, and one for text.
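As a rough illustration of what comparing texts lexically (token by token, not character by character) can look like, here is a minimal Python sketch. It is not SIM's actual algorithm: the toy tokenizer, the quadratic dynamic-programming search, and the names `tokenize` and `longest_common_run` are all assumptions made for this example. The assumption that a per-language program differs mainly in its tokenizer is likewise not stated on this page.

```python
import re

def tokenize(text):
    """Toy tokenizer (an assumption, not SIM's): words/numbers, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def longest_common_run(a, b):
    """Return (length, start_in_a, start_in_b) of the longest run of equal tokens."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)          # run lengths ending at the previous token of a
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the run ending one token earlier
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

# Two fragments that differ in spacing and in one identifier.
left = tokenize("for (i = 0; i < n; i++) sum += a[i];")
right = tokenize("for (i = 0; i < n; i++)   total += a[i];")
length, start_left, start_right = longest_common_run(left, right)
print(length, left[start_left:start_left + length])
```

In this example the two loop headers match as a single run of 14 tokens even though the spacing differs, because tokens and not characters are compared; the renamed variable ends the run.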

Download the Manual (pdf)

Download the sources (in C) and test material

Download the MSDOS .exe

The text similarity tester, sim_text, is very efficient and can analyse documentation for duplicate sections, or text-compare large numbers of publications. Sim_text is language-independent. It assumes the input to be in UTF-8, and will therefore work on any language presented in that way: Korean, English, Japanese, Icelandic, Hindi, etc.

The computer language versions are very useful for finding duplicate code in software projects. They can also efficiently compare large bodies of students' code with code collected over many previous years, to find signs of cheating.

Obtaining reliable similarity percentages efficiently is not easy. The paper Similarity_Percentage_Computation details the problems, and the solutions sim uses.
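To give a feel for why a percentage is less straightforward than it sounds, the following Python sketch uses one possible definition: the share of tokens in one text that lie inside a run of at least `min_run` tokens also occurring in the other text. This definition, the function name `covered_percentage`, and the default `min_run=3` are assumptions chosen for illustration; they are not taken from the paper and do not describe how sim computes its figures.

```python
def covered_percentage(a, b, min_run=3):
    """Percentage of tokens in `a` inside some run of `min_run` tokens that also occurs in `b`.

    Illustrative definition only; not the method used by sim.
    """
    if not a:
        return 0.0
    # All runs of min_run consecutive tokens that occur in b.
    runs_in_b = {tuple(b[k:k + min_run]) for k in range(len(b) - min_run + 1)}
    covered = [False] * len(a)
    for k in range(len(a) - min_run + 1):
        if tuple(a[k:k + min_run]) in runs_in_b:
            for m in range(k, k + min_run):
                covered[m] = True          # mark every token of the matching run
    return 100.0 * sum(covered) / len(a)

short_text = "the quick brown fox".split()
long_text = "the quick brown fox jumps over the lazy dog today".split()

print(covered_percentage(short_text, long_text))  # share of the short text found in the long one
print(covered_percentage(long_text, short_text))  # share of the long text found in the short one
```

Under this definition the result is directional: the short text is found entirely in the long one (100.0), while only 40.0 percent of the long text is found in the short one. Which direction to report, and which minimum run length to use, are choices any such measure has to make.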

Since this piece of handicraft did not qualify as research, there are no international papers on it. The work was described in Dutch in: Dick Grune, Matty Huntjens, Het detecteren van kopieën bij informatica-practica, Informatie, 31, 11, Nov 1989, pp. 864-867 (lit. ref.). An English translation of the paper is also available.


The software and text similarity tester SIM / Dick Grune / dick@dickgrune.com