Abstract
This paper presents a ranking mechanism that ranks documents by their Semantic Association Similarity, defined as the closeness (measured in degrees of separation) of the associations between the entities found in each document. A large semantic knowledge base with over 1.6 million entities and 24 million associations serves as the backend dataset for comparison. Multiple ranking techniques are evaluated and speed concerns are addressed: Bloom filters improve ranking speed at the cost of a small percentage of false positives. A real-world example, spam page identification, is investigated.
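
To make the role of the Bloom filter concrete, the following is a minimal sketch and not the paper's implementation. It assumes that direct (one degree of separation) associations are stored as order-independent entity pairs in a Bloom filter, and that two documents are scored by the fraction of their cross-document entity pairs that appear in the filter. The names `BloomFilter`, `pair_key`, and `association_score`, the filter size, the hash construction, and the scoring formula are all hypothetical choices made for illustration.

```python
import hashlib
from itertools import product


class BloomFilter:
    """Fixed-size Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 hash of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def pair_key(a, b):
    # Order-independent key so (a, b) and (b, a) map to the same entry.
    # Assumes entity names do not contain the "|" separator.
    return "|".join(sorted((a, b)))


def association_score(doc_a, doc_b, associated):
    """Fraction of cross-document entity pairs that test as directly associated.

    Hypothetical scoring rule: only one degree of separation is considered,
    and Bloom filter false positives can only inflate the score.
    """
    pairs = [(a, b) for a, b in product(doc_a, doc_b) if a != b]
    if not pairs:
        return 0.0
    hits = sum(1 for a, b in pairs if pair_key(a, b) in associated)
    return hits / len(pairs)


# Toy knowledge base of direct associations (illustrative data only).
kb = BloomFilter()
kb.add(pair_key("Alice", "Acme"))
kb.add(pair_key("Bob", "Acme"))

score = association_score({"Alice"}, {"Acme", "Zed"}, kb)
```

Because a Bloom filter never yields false negatives, every association that was added is always found; the trade-off is that an unrelated pair can occasionally test positive, which is the small false positive rate mentioned above. Extending the sketch to larger degrees of separation would need either one filter per degree or a graph traversal, neither of which is shown here.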