Apache Tika returning an empty string for PDF (Java)
I am trying to extract the content of documents using Apache Tika's parseToString() function. I am able to get the contents of .doc and .docx files, but it's not working on .pdf files. I haven't specified the document type anywhere in the code, so I don't know why it's not working for .pdf files.
Here is my code.
In the extractDocument function:
    int indexedChars = -1;
    Metadata metadata = new Metadata();
    int experiance = 0;
    String parsedContent;
    parsedContent = tika().parseToString(
            new BytesStreamInput(Base64.decode(document.getContent().getBytes()), false),
            metadata, indexedChars);
    System.out.println("parsedContent " + parsedContent);

Here I am getting parsedContent as an empty string. Here is the function that calls it:
    public Document push(Document document, String username, HttpServletRequest req) {
        // Check for null first; the debug log below dereferences the document
        if (document == null) return null;
        if (logger.isDebugEnabled()) logger.debug("push({})", document.getContent());
        System.out.println("document.getContent() " + document.getContent());
        /*
        if (document.getIndex() == null || document.getIndex().isEmpty()) {
            document.setIndex(SMDSearchProperties.INDEX_NAME);
        }
        if (document.getType() == null || document.getType().isEmpty()) {
            document.setType(SMDSearchProperties.INDEX_TYPE_DOC);
        }
        */
        getNodeClient(username);
        try {
            System.out.println("client " + username);
            IndexResponse response = client
                    .prepareIndex(username, document.getType(), document.getId())
                    .setSource(extractDocument(document)).execute()
                    .actionGet();
            document.setId(response.getId());
        } catch (Exception e) {
            e.printStackTrace();
            logger.warn("Can not index document {}", document.getName());
            System.out.println("Can not index document " + document.getName()
                    + " e.getMessage() " + e.getMessage());
            // throw new RestAPIException("Can not index document : " + document.getName() + ": " + e.getMessage());
        }
        if (logger.isDebugEnabled()) logger.debug("/push()={}", document);
        return document;
    }
I got the solution here:
Error while parsing binary files... (mostly pdf)
Download these 3 JAR files, copy them to the lib folder, and add them to the project:
fontbox-1.5.0.jar jempbox-1.5.0.jar pdfbox-1.5.0.jar
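Since the fix amounts to putting the PDF parsing libraries on the classpath, a quick check like the one below can help confirm that the JARs were actually picked up after copying them. This is only a minimal sketch: the class name ClasspathCheck and its helper method are hypothetical, and the three class names probed are assumptions about what each JAR contains, so adjust them if your versions differ.

```java
public class ClasspathCheck {

    // Returns true if the named class can be located by this class's loader.
    // The class is not initialized, so no static initializers run.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per JAR (pdfbox, fontbox, jempbox).
        String[] probes = {
            "org.apache.pdfbox.pdmodel.PDDocument",
            "org.apache.fontbox.cmap.CMap",
            "org.apache.jempbox.xmp.XMPMetadata"
        };
        for (String name : probes) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If any line reports MISSING when this is run with the same classpath as the application, the corresponding JAR is probably not being loaded, which would be consistent with the PDF text coming back empty.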