machine learning - Fast k-NN search over bag-of-words models -

April 15, 2014

i have large amount of documents of equal size. each of documents i'm building bag of words model (bow). number of possible words in documents limited , large (2^16 example). speaking, have n histograms of size k, n number of documents , k histogram width. can calculate distance between 2 histograms.

first optimization opportunity. documents uses small subset of words (usually less 5%, of them less 0.5%).

second optimization opportunity subset of used words varying document document can use bits instead of word counts.

query content

query document well. need find k similar documents.

naive approach

calculate bow model query.
for each document in dataset:
- calculate it's bow model.
- find distance between query , document.

obviously, data structure should used track top-ranked documents (priority queue example).

i need sort of index rid of full database scan. kd-tree comes mind dimensionality , size of dataset high. 1 can suggest use subset of possible words features don't have separate training phase , can't extract features beforehand.

i've thought using minhash algorithm prune search space can't design appropriate hash functions task.

k-d-tree , similar indexes dense , continuous data.

your data sparse.

a index finding nearest neighbors on sparse data inverted lists. same way search engines google work.

Search This Blog

JVParth