apache spark - How to use Mahout classifiers in action?


I want to classify a bunch of documents using Apache Mahout's naive Bayes classifier. I do the pre-processing, convert the training data set into feature vectors, and train the classifier. Now I want to pass a bunch of new (to-be-classified) instances to the model in order to classify them.

However, I'm under the impression that the pre-processing must be done on the to-be-classified instances and the training data set together. If that is so, how can I use the classifier in real-world scenarios where I don't have the to-be-classified instances at the time I'm building the model?

What about Apache Spark? How does this work there? Can I build a classification model and use it to classify unseen instances later?

As of Mahout 0.10.0, Mahout provides a Spark-backed naive Bayes implementation which can be run from the CLI, the Mahout shell, or embedded in an application:

http://mahout.apache.org/users/algorithms/spark-naive-bayes.html

Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:

http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html

which explains how to tokenize (using trivial Java native String methods), vectorize, and classify unseen text using the dictionary and the df-counts from the training/testing sets.
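To make that step concrete, here is a minimal sketch in plain Scala of the vectorization idea: tokenize an unseen document with native String methods, then build a TF-IDF vector against a dictionary and document-frequency counts saved from training. The names dictionary, dfCounts and numDocs are placeholders for whatever you persist, not the Mahout API, and the exact weighting formula used in the tutorial may differ slightly.

    // Sketch only: map one unseen document into the training feature space.
    object VectorizeUnseen {

      def tfidfVector(text: String,
                      dictionary: Map[String, Int],   // term -> feature index (from training)
                      dfCounts: Map[Int, Long],       // feature index -> document frequency
                      numDocs: Long): Map[Int, Double] = {
        // Trivial tokenization with native String methods, as in the linked tutorial.
        val tokens = text.toLowerCase.split("\\W+").filter(_.nonEmpty)

        // Term frequencies, keeping only terms known to the training dictionary.
        val termFreqs: Map[Int, Int] = tokens
          .flatMap(dictionary.get)
          .groupBy(identity)
          .map { case (idx, occurrences) => idx -> occurrences.length }

        // Standard TF-IDF weighting; the tutorial's exact formula may differ.
        termFreqs.map { case (idx, tf) =>
          val df = dfCounts.getOrElse(idx, 0L)
          idx -> tf * math.log((numDocs + 1.0) / (df + 1.0))
        }
      }

      def main(args: Array[String]): Unit = {
        // Toy dictionary and df-counts standing in for the ones saved at training time.
        val dictionary = Map("spark" -> 0, "mahout" -> 1, "classifier" -> 2)
        val dfCounts   = Map(0 -> 40L, 1 -> 25L, 2 -> 10L)
        val vector = tfidfVector("Train a Mahout classifier, then score new text later",
                                 dictionary, dfCounts, numDocs = 100L)
        println(vector) // sparse vector to hand to the trained naive Bayes model
      }
    }

The point is that the dictionary and df-counts are fixed at training time, so any later document can be mapped into the same feature space without re-running the pre-processing over the whole corpus.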

Please note that the tutorial is meant to be used with the Mahout-Samsara environment's spark-shell, but the basic idea can be adapted and embedded in an application.
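If you are open to Spark's own MLlib rather than Mahout, the train-once / classify-later workflow the question asks about is also straightforward there: put the tokenization and hashing into an ML Pipeline together with naive Bayes, save the fitted PipelineModel, and load it later to score unseen documents. A minimal sketch, with made-up data and paths:

    import org.apache.spark.ml.{Pipeline, PipelineModel}
    import org.apache.spark.ml.classification.NaiveBayes
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SparkSession

    object TrainThenClassifyLater {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("nb-train-then-classify").getOrCreate()
        import spark.implicits._

        // Toy training set: numeric labels plus raw text.
        val training = Seq(
          (0.0, "the team won the championship game last night"),
          (1.0, "the new phone ships with a faster processor")
        ).toDF("label", "text")

        // The pipeline owns the pre-processing, so the same tokenization and
        // hashing are reapplied automatically to any document scored later.
        val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
        val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
                                       .setNumFeatures(1 << 16)
        val nb    = new NaiveBayes()
        val model = new Pipeline().setStages(Array(tokenizer, hashingTF, nb)).fit(training)

        // Persist the fitted pipeline; the path is just an example.
        model.write.overwrite().save("/tmp/nb-pipeline-model")

        // Later, possibly in a different job: load it and classify unseen text.
        val reloaded = PipelineModel.load("/tmp/nb-pipeline-model")
        val unseen   = Seq("a faster processor was announced today").toDF("text")
        reloaded.transform(unseen).select("text", "prediction").show(truncate = false)

        spark.stop()
      }
    }

Because HashingTF uses the hashing trick, there is no explicit dictionary to carry around; the fitted pipeline itself contains everything needed to reproduce the pre-processing on new text.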

