python - How to open and read a PDF (originally .html) file using Python 3

April 15, 2011

I need to open this file in Python 3:

http://www.arch.gob.ec/index.php/descargas/doc_download/478-historial-de-produccion-nacional-de-crudo-2011.html

I have to read it and extract the data tables. I have searched for several hours, but nothing seems to work. I am new to scraping/parsing, and this is the first time I have looked at file handling for a PDF.

Thanks for any kind of help!

Obtaining the PDF from the Internet is called scraping. Trying to read the PDF to obtain its data is quite another problem!
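For the first step, a minimal sketch using only the standard library might look like the following. The function name download_pdf and the destination filename are made up for illustration. Because the link in the question ends in .html, the sketch checks that the downloaded bytes start with the %PDF- signature before saving them.

```python
import urllib.request
from pathlib import Path

# Every PDF file begins with this signature, whatever the URL's extension says.
PDF_MAGIC = b"%PDF-"


def download_pdf(url, dest):
    """Fetch url, verify the content looks like a PDF, and save it to dest.

    download_pdf is a hypothetical helper name, not part of any library.
    """
    with urllib.request.urlopen(url) as response:
        data = response.read()
    if not data.startswith(PDF_MAGIC):
        raise ValueError("downloaded content does not start with %PDF-")
    path = Path(dest)
    path.write_bytes(data)
    return path
```

Calling download_pdf(url, "crudo-2011.pdf") with the URL from the question would then leave a local file for the conversion step, assuming the server still returns a PDF at that address.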

There are many utilities available that try to convert a PDF to text, but they are not entirely successful. As this article explains, PDF files are nice to use (look at), but their internals aren't that elegant. The reason is that the visible text is usually not present directly inside the document and has to be reconstructed from tables. In some cases the PDF doesn't even contain the text, just an image of the text.

The article lists several tools that (try to) convert a PDF to text. Some have 'wrappers' in Python to access them. There are a few modules that sound interesting, such as PyPDF (which does not convert to text), but really aren't.

aTXT looks interesting for data mining, but I haven't tested it yet.

As mentioned above, most of these are wrappers (or GUIs) around existing command-line tools. E.g. a simple tool in Linux (which works with your PDF!) is pdftotext. If you want to stay in Python, you can call it with subprocess's call, or even with os.system.
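Calling pdftotext from Python could be sketched roughly as below. This assumes pdftotext (part of poppler-utils or xpdf) is installed and on the PATH. The helper names are invented for this example, and subprocess.run is used in place of subprocess.call so the output can be captured directly. The -layout option asks pdftotext to keep the physical layout, which tends to help with tables, and "-" as the output file sends the text to standard output.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext (hypothetical helper)."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # preserve column positions as far as possible
    cmd += [pdf_path, "-"]     # "-" writes the text to stdout
    return cmd


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if pdftotext fails
    )
    return result.stdout
```

Passing the command as a list rather than a single shell string avoids quoting problems with file names that contain spaces.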

After this, you have a text file, which you can process more easily with basic Python string functions, regular expressions, or more sophisticated things like PyParser.
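As a rough illustration of the regular-expression route, the sketch below splits each line on runs of two or more spaces and keeps only the rows whose cells after the first are numbers. The sample text, its column names, and the number format (comma as thousands separator, dot as decimal point) are all assumptions made for the example. The real output for the file in the question will look different and may well use the opposite convention.

```python
import re

# Assumed number format: optional sign, digits with comma separators, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

# Made-up text in the style of layout-preserving pdftotext output.
SAMPLE = """\
Month        Field A     Field B
January      1,234.5       987.0
February     1,100.0     1,002.3
Total        2,334.5     1,989.3
"""


def parse_rows(text):
    """Return (label, [values]) for every line that looks like a data row."""
    rows = []
    for line in text.splitlines():
        cells = re.split(r"\s{2,}", line.strip())
        if len(cells) < 2:
            continue  # blank lines and single-cell lines are not table rows
        label, values = cells[0], cells[1:]
        if all(NUMBER.fullmatch(v) for v in values):
            rows.append((label, [float(v.replace(",", "")) for v in values]))
    return rows


for label, values in parse_rows(SAMPLE):
    print(label, values)
```

The header line is skipped automatically because "Field A" and "Field B" are not numbers. Splitting on two or more spaces keeps multi-word labels together, but it depends on the columns being separated by at least two spaces.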
