apache spark - How to use Mahout classifiers in action?
I am classifying a bunch of documents using Apache Mahout's naive Bayes classifier. After pre-processing, I convert the training data set into feature vectors and train the classifier. Now I want to pass a bunch of new instances (to-be-classified instances) to the model in order to classify them.
However, I'm under the impression that pre-processing must be done on the to-be-classified instances and the training data set together. If so, how can I use the classifier in real-world scenarios, where I don't have the to-be-classified instances at the time I'm building the model?
What about Apache Spark? How do things work there? Can I build a classification model and use it to classify unseen instances later?
As of Mahout 0.10.0, Mahout provides a Spark-backed naive Bayes implementation that can be run from the CLI, from the Mahout shell, or embedded in an application:
http://mahout.apache.org/users/algorithms/spark-naive-bayes.html
Regarding the classification of new documents outside of the training/testing sets, there is a tutorial here:
http://mahout.apache.org/users/environment/classify-a-doc-from-the-shell.html
which explains how to tokenize (using trivial Java native string methods), vectorize, and classify unseen text using the dictionary and the df-count from the training/testing sets.
Please note that the tutorial is meant to be used with the Mahout-Samsara environment's spark-shell, but the basic idea can be adapted and embedded in an application.
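The key idea, independent of Mahout's actual API, is that the dictionary (term-to-index mapping) and document-frequency counts produced at training time are saved and reused to vectorize unseen documents, so the classifier always sees features in the same space it was trained on. A minimal Python sketch of that idea (all helper names here are hypothetical, for illustration only, and do not correspond to Mahout functions):

```python
import math

def build_dictionary(tokenized_docs):
    """Assign a stable index to every term seen during training."""
    dictionary = {}
    for doc in tokenized_docs:
        for term in doc:
            dictionary.setdefault(term, len(dictionary))
    return dictionary

def df_counts(tokenized_docs, dictionary):
    """Count how many training documents contain each term."""
    df = [0] * len(dictionary)
    for doc in tokenized_docs:
        for term in set(doc):
            df[dictionary[term]] += 1
    return df

def vectorize(doc, dictionary, df, num_docs):
    """TF-IDF vector for a (possibly unseen) document.
    Terms absent from the training dictionary are simply dropped."""
    tf = [0.0] * len(dictionary)
    for term in doc:
        idx = dictionary.get(term)
        if idx is not None:
            tf[idx] += 1.0
    # Smoothed IDF so unseen df values cannot divide by zero.
    return [t * math.log((num_docs + 1) / (df[i] + 1))
            for i, t in enumerate(tf)]

# Training corpus, tokenized with trivial string methods
# (mirroring the tutorial's approach):
train = [d.lower().split() for d in ["spark is fast",
                                     "mahout runs on spark"]]
dictionary = build_dictionary(train)
df = df_counts(train, dictionary)

# Later, a brand-new document is vectorized with the *training*
# dictionary and df-counts -- nothing is recomputed from the new data:
unseen = "mahout is great".lower().split()
vec = vectorize(unseen, dictionary, df, len(train))
```

Out-of-vocabulary terms ("great" above) are silently dropped, which is why the training-time dictionary must be persisted alongside the model: without it, there is no way to place new text into the feature space the classifier expects.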