Apache Tika tika() returning an empty string for PDF (Java)


I am trying to extract the content of documents using the Apache Tika tika() function. I am able to get the contents of .doc and .docx files, but it's not working on .pdf files. I haven't specified the document type anywhere in the code, so I don't know why it's not working for .pdf files.

Here is the code, in the extractDocument function:

    int indexedChars = -1;
    Metadata metadata = new Metadata();
    int experiance = 0;
    String parsedContent;

    parsedContent = tika().parseToString(new BytesStreamInput(
            Base64.decode(document.getContent().getBytes()), false), metadata, indexedChars);
    System.out.println("parsedcontent " + parsedContent);

Here I am getting parsedContent as an empty string. Here is the function that calls it:

    public Document push(Document document, String username, HttpServletRequest req) {
        // Check for null before the debug log dereferences the document.
        if (document == null)
            return null;
        if (logger.isDebugEnabled()) logger.debug("push({})", document.getContent());
        System.out.println("document.getcontent() " + document.getContent());
        /*
        if (document.getIndex() == null || document.getIndex().isEmpty()) {
            document.setIndex(SMDSearchProperties.INDEX_NAME);
        }
        if (document.getType() == null || document.getType().isEmpty()) {
            document.setType(SMDSearchProperties.INDEX_TYPE_DOC);
        }
        */
        getNodeClient(username);

        try {
            System.out.println("client " + username);
            IndexResponse response = client
                    .prepareIndex(username, document.getType(),
                            document.getId())
                    .setSource(extractDocument(document)).execute()
                    .actionGet();
            document.setId(response.getId());
        } catch (Exception e) {
            e.printStackTrace();
            logger.warn("can not index document {}", document.getName());
            System.out.println("can not index document {}" + document.getName() + " e.getmessage() " + e.getMessage());
            //throw new RestAPIException("can not index document : " + document.getName() + ": " + e.getMessage());
        }
        if (logger.isDebugEnabled()) logger.debug("/push()={}", document);
        return document;
    }

I got the solution here:

Error while parsing binary files... (mostly PDF)

Download these 3 jar files, copy them into the lib folder, and add them to the project:

    fontbox-1.5.0.jar
    jempbox-1.5.0.jar
    pdfbox-1.5.0.jar
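Since the fix was adding missing jars, a quick way to confirm they are actually visible at runtime is to try loading one class from each of them. The following is only a minimal sketch: the class name PdfClasspathCheck and its helper methods are made up for illustration, and the three probed class names are assumptions about what each jar contains (one well-known class per library), so adjust them if your versions differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika or PDFBox): reports which of the
// PDF-related classes cannot be loaded from the current classpath.
class PdfClasspathCheck {

    // Assumed representative classes, one per jar named in the answer.
    static final String[] PDF_CLASSES = {
        "org.apache.pdfbox.pdmodel.PDDocument", // pdfbox
        "org.apache.fontbox.ttf.TTFParser",     // fontbox
        "org.apache.jempbox.xmp.XMPMetadata"    // jempbox
    };

    // Returns true if the named class can be located, without initializing it.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, PdfClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the subset of the given class names that could not be loaded.
    static List<String> missing(String[] classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            if (!isOnClasspath(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(PDF_CLASSES);
        if (missing.isEmpty()) {
            System.out.println("All PDF-related classes were found.");
        } else {
            System.out.println("Missing from classpath: " + missing);
        }
    }
}
```

Running this inside the same application (or with the same lib folder on the classpath) before indexing should show whether the PDF libraries are really being picked up. If any class is listed as missing, the jar is probably not in the folder the application loads from.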
