Vector comparison of documents
CSCI 8570 - M. Covington - 2008 Feb. 12
The following documents can be used for testing:
doc1.txt
doc2.txt
doc3.txt
doc4.txt
Values of cos θ should be:
Documents 1 and 2: 1.000
Documents 2 and 3: 0.707
Documents 3 and 4: 0.000
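To check a program against the values above, it may help to see the computation written out. The following is a minimal sketch, not the assignment's reference solution: it assumes each document is represented as a vector of raw word counts and that words are split on whitespace and lowercased. The function name `cosine` is my own.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    # Assumption: whitespace tokenization and lowercasing; a real tokenizer may differ.
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Dot product over the words the two texts share.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    # Euclidean length of each count vector.
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)
```

With made-up inputs (not the contents of the test files), `cosine("cat dog", "dog cat")` is 1.0 up to rounding, `cosine("cat dog", "cat")` is about 0.707, and `cosine("cat", "dog")` is 0.0. Same words in the same proportions give 1, no shared words give 0.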
Here are additional documents for experiments: the whole and the first half of the Gospel of Mark, in both the King James Version and the Douay-Rheims translation (the two translations are very similar to each other). All are from Project Gutenberg. Each file name contains the word count according to one tokenizer, but different tokenizers will give different word counts. I also suggest that you modify your tokenizer to ignore numerals and punctuation marks.
KJV-Mark-22823words.txt
KJV-Mark-firsthalf-10798words.txt
Douay-Rheims-Mark-22324words.txt
Douay-Rheims-Mark-firsthalf-10645words.txt
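One possible way to make a tokenizer ignore numerals and punctuation marks is sketched below. It is only an illustration under stated assumptions: the function name `tokenize` is hypothetical, words are lowercased, only the letters a-z count as word characters, and an apostrophe inside a word is kept. Word counts produced this way need not match the counts in the file names.

```python
import re

def tokenize(text):
    """Return lowercase alphabetic words, skipping numerals and punctuation."""
    # Assumption: a word is a run of letters a-z, optionally joined by
    # internal apostrophes (as in "john's"). Digits and all other
    # punctuation act as separators and are discarded.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
```

For example, on the made-up line `"1:1 The beginning, of the gospel."` this returns `['the', 'beginning', 'of', 'the', 'gospel']`, so a verse number such as `1:1` contributes nothing to the word counts.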