python - How to correctly decode/encode file names when using `pdfrw` to add metadata?
I'm writing a script to add PDF metadata to a list of PDFs.
My problem is dealing with PDFs whose names have special characters in them. In the example I tried, the name had an en dash in it, but I'm sure that in the future (I don't control these file names) there will be other similar issues.
I'm using `pdfrw` and Python 2.7. Here is what I have:
    from pdfrw import PdfReader, PdfWriter
    from os import listdir

    def get_files(pwy):
        # Build the full path of every file in the folder.
        tr_files = listdir(pwy)
        tr_files2 = []
        for t in tr_files:
            tr_files2.append(pwy + '/' + t)
        return tr_files2

    def add_keywords(filename, keywords):
        # Read the PDF, set /Keywords in its /Info dictionary, write it back in place.
        writer = PdfWriter()
        trailer = PdfReader(filename)
        trailer.Info.Keywords = keywords
        writer.trailer = trailer
        writer.write(filename)

    file_list = get_files('c:/example_folder')
    for f in file_list:
        add_keywords(f, 'some exciting metadata!')
This works fine for files without an en dash. A file with an en dash shows as modified when I run this, but when I check the metadata in Adobe Acrobat, there's nothing there.
I'm pretty sure this is an encoding problem of some kind. Since the en dash shows up as `\x96`, I must be using cp1252. I'm using Spyder 2.3.1 and have `# -*- coding: utf-8 -*-` at the top of my script.
I read through "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" and "Pragmatic Unicode", and I know that, in general, I want to decode the input, run the rest of my code (not shown above; I use the file name to extract information from a database, format that information, and want to put the resulting string into the metadata), and then encode again. I just haven't been able to figure out how to make that work.
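For illustration, the decode, process, encode pattern described above might look roughly like this for a single file name. This is only a sketch: it assumes the names arrive as cp1252 byte strings (suggested by the `\x96` observation, not verified), and `to_text`, `to_bytes`, and the `keywords` step are hypothetical stand-ins rather than anything from `pdfrw` or the real script.

```python
# -*- coding: utf-8 -*-
# Sketch of decode -> process -> encode for one file name.
# Assumption: names arrive as cp1252 byte strings (suggested by the \x96 seen above).

FS_ENCODING = 'cp1252'  # assumed, not detected


def to_text(name, encoding=FS_ENCODING):
    """Decode a byte-string name; pass already-decoded text through unchanged."""
    if isinstance(name, bytes):
        return name.decode(encoding)
    return name


def to_bytes(text, encoding=FS_ENCODING):
    """Encode text back to the byte form used on disk."""
    return text.encode(encoding)


raw_name = b'report \x96 final.pdf'       # what listdir() might hand back
text_name = to_text(raw_name)              # u'report \u2013 final.pdf'
keywords = text_name.rsplit(u'.', 1)[0]    # hypothetical stand-in for the database lookup
round_trip = to_bytes(text_name)           # same bytes as raw_name
```

In Python 2.7, passing a unicode path to `os.listdir` (for example `u'c:/example_folder'`) makes it return unicode names, so the explicit decode step may not be needed at all.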
I think the solution is going to be one of the following:
1. [Best] Correctly deal with the encoding issue.
2. Run some sort of batch file on the subfolder, renaming the files to something the script can handle, then reverse the names at the end (they need to end up with their original file names).
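The renaming fallback could be sketched along these lines. The helper names `rename_to_safe` and `restore_names` and the `tmp_0000` naming scheme are made up for illustration, and the sketch assumes that no file in the folder already uses one of the temporary names and that file extensions are plain ASCII.

```python
import os


def rename_to_safe(folder):
    """Rename every file to an ASCII-only temporary name.

    Returns a dict mapping each temporary name to the original name.
    Assumption: no existing file is already called tmp_NNNN.<ext>.
    """
    mapping = {}
    for i, name in enumerate(sorted(os.listdir(folder))):
        ext = os.path.splitext(name)[1]
        safe = 'tmp_%04d%s' % (i, ext)
        os.rename(os.path.join(folder, name), os.path.join(folder, safe))
        mapping[safe] = name
    return mapping


def restore_names(folder, mapping):
    """Undo rename_to_safe using the mapping it returned."""
    for safe, original in mapping.items():
        os.rename(os.path.join(folder, safe), os.path.join(folder, original))
```

The metadata step would run between the two calls. If the script could be interrupted there, the mapping would have to be saved to disk first, otherwise the original names are lost.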
I'd appreciate any help! I haven't been able to find anything that's worked.
Your freshly acquired Unicode know-how does not apply to PDF. PDF came into being before there was Unicode.
You should look at "Annex D (normative): Character Sets and Encodings" in the official ISO 32000-1:2008 PDF 1.7 specification published by Adobe, page 651.
There you'll find the octal codes you should use for the en dash:
`\261` in StandardEncoding
`\320` in MacRomanEncoding
`\226` in WinAnsiEncoding
`\205` in PDFDocEncoding
For metadata (the `/Info` dictionary), use PDFDocEncoding.
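As a rough illustration of what that means for a metadata value, the sketch below turns a Unicode string into a PDF literal string, writing the en dash as the octal escape `\205`. The function name `to_pdf_literal` and the two-entry table are assumptions made for this example. Only the two dashes are covered, and how the result should be handed to `pdfrw` is left open.

```python
# -*- coding: utf-8 -*-
# Partial PDFDocEncoding table (assumption: only the two dashes are listed;
# other non-ASCII characters would need their codes looked up in Annex D).
PDFDOC_SPECIALS = {
    u'\u2014': 0o204,  # em dash
    u'\u2013': 0o205,  # en dash
}


def to_pdf_literal(text):
    """Return text as a PDF literal string, e.g. (some text)."""
    parts = []
    for ch in text:
        if ch in u'()\\':
            # Parentheses and backslash must be escaped inside a literal string.
            parts.append(u'\\' + ch)
        elif u' ' <= ch <= u'~':
            # Printable ASCII is written as-is.
            parts.append(ch)
        elif ch in PDFDOC_SPECIALS:
            # Everything else is written as a three-digit octal escape.
            parts.append(u'\\%03o' % PDFDOC_SPECIALS[ch])
        else:
            raise ValueError('no PDFDocEncoding code listed for %r' % ch)
    return u'(' + u''.join(parts) + u')'


literal = to_pdf_literal(u'reports 2010\u20132012 (draft)')
```

Raising on unknown characters is a deliberate choice in this sketch: silently dropping them would reproduce the "metadata is empty" symptom in a harder-to-spot form.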