machine learning - Fast k-NN search over bag-of-words models -


i have large amount of documents of equal size. each of documents i'm building bag of words model (bow). number of possible words in documents limited , large (2^16 example). speaking, have n histograms of size k, n number of documents , k histogram width. can calculate distance between 2 histograms.

first optimization opportunity. documents uses small subset of words (usually less 5%, of them less 0.5%).

second optimization opportunity subset of used words varying document document can use bits instead of word counts.

query content

query document well. need find k similar documents.

naive approach

  • calculate bow model query.
  • for each document in dataset:
    • calculate it's bow model.
    • find distance between query , document.

obviously, data structure should used track top-ranked documents (priority queue example).

i need sort of index rid of full database scan. kd-tree comes mind dimensionality , size of dataset high. 1 can suggest use subset of possible words features don't have separate training phase , can't extract features beforehand.

i've thought using minhash algorithm prune search space can't design appropriate hash functions task.

k-d-tree , similar indexes dense , continuous data.

your data sparse.

a index finding nearest neighbors on sparse data inverted lists. same way search engines google work.


Comments

Popular posts from this blog

How to provide Authorization & Authentication using Asp.net, C#? -

toolbar - How to add link to user registration inside toobar in admin joomla 3 custom component -

How to use Authorization & Authentication in Asp.net, C#? -