machine learning - Fast k-NN search over bag-of-words models -
i have large amount of documents of equal size. each of documents i'm building bag of words model (bow). number of possible words in documents limited , large (2^16 example). speaking, have n histograms of size k, n number of documents , k histogram width. can calculate distance between 2 histograms.
first optimization opportunity. documents uses small subset of words (usually less 5%, of them less 0.5%).
second optimization opportunity subset of used words varying document document can use bits instead of word counts.
query content
query document well. need find k similar documents.
naive approach
- calculate bow model query.
- for each document in dataset:
- calculate it's bow model.
- find distance between query , document.
obviously, data structure should used track top-ranked documents (priority queue example).
i need sort of index rid of full database scan. kd-tree comes mind dimensionality , size of dataset high. 1 can suggest use subset of possible words features don't have separate training phase , can't extract features beforehand.
i've thought using minhash algorithm prune search space can't design appropriate hash functions task.
k-d-tree , similar indexes dense , continuous data.
your data sparse.
a index finding nearest neighbors on sparse data inverted lists. same way search engines google work.
Comments
Post a Comment