Honours Project: Matthew Diener (April 20, 2017 - 7:01pm)

Carleton University - School of Computer Science Honours Project

Winter 2017

Applying Document Clustering to Wikipedia Articles

Matthew Diener

ABSTRACT

Document clustering methods provide insight into how the documents in a large corpus relate to each other. One common approach is to apply the k-means clustering algorithm to documents represented as vectors of tf-idf values. This project applies that approach to a large portion of the articles on Wikipedia and uses the results to demonstrate a realistic way to apply document clustering when building recommender systems. It also investigates potential methods for evaluating the effectiveness of this clustering approach through comparisons with user contribution history and crawl graphs.
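
As a rough illustration of the approach named above, the following minimal sketch builds tf-idf vectors for a handful of toy documents and clusters them with k-means. It is not the project's implementation. The whitespace tokeniser, the idf variant (log(N/df) + 1), the choice to seed the centroids with the first k vectors, the toy sentences, and the function names tfidf_vectors, squared_distance and kmeans are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn raw strings into L2-normalised tf-idf vectors over a shared vocabulary."""
    tokenised = [doc.lower().split() for doc in docs]  # naive whitespace tokeniser (assumption)
    vocab = sorted({term for tokens in tokenised for term in tokens})
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    # One common idf variant (assumption); the +1 keeps terms found in every document.
    idf = {term: math.log(n_docs / doc_freq[term]) + 1.0 for term in vocab}
    vectors = []
    for tokens in tokenised:
        term_freq = Counter(tokens)
        vec = [term_freq[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # guard against empty documents
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iterations=50):
    """Lloyd's algorithm, seeded with the first k vectors (a simplification)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach every vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical toy corpus: two geography sentences and two programming sentences.
docs = [
    "ottawa canada capital city",
    "python programming language code",
    "toronto canada city ontario",
    "java programming language code",
]
vocab, vectors = tfidf_vectors(docs)
labels = kmeans(vectors, k=2)
print(labels)
```

On real data the centroids would normally be seeded randomly or with k-means++, and a corpus the size of Wikipedia would call for sparse vectors rather than the dense lists used here.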