python - How to correctly decode/encode file names when using `pdfrw` to add metadata?

May 15, 2011

I'm writing a script to add PDF metadata to a list of PDFs.

My problem is dealing with PDFs whose names have special characters in them. In the example I tried, the name had an "en dash" in it, but I'm sure that in the future (I don't control these file names) there will be other similar issues.

I'm using pdfrw and Python 2.7. I have:

    from pdfrw import PdfReader, PdfWriter
    from os import listdir

    def get_files(pwy):
        # Build the full path of every entry in the folder.
        tr_files = listdir(pwy)
        tr_files2 = []
        for t in tr_files:
            tr_files2.append(pwy + '/' + t)
        return tr_files2

    def add_keywords(filename, keywords):
        # Read the PDF, set the Keywords entry, and write it back in place.
        writer = PdfWriter()
        trailer = PdfReader(filename)
        trailer.Info.Keywords = keywords
        writer.trailer = trailer
        writer.write(filename)

    file_list = get_files('c:/example_folder')
    for f in file_list:
        add_keywords(f, 'some exciting metadata!')

This works fine for files without an "en dash". For files with an "en dash", the file shows as modified when I run this, but when I check the metadata in Adobe Acrobat, there's nothing there.

I'm pretty sure this is an encoding problem of some kind. Since the "en dash" shows up as \x96, it must be using cp1252. I'm using Spyder 2.3.1 and have # -*- coding: utf-8 -*- at the top of the script.

I read through "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" and "Pragmatic Unicode", and I know that, in general, I want to decode the input, run the rest of my code (not printed above: I use the file name to extract information from a database, format the information, and want to put the resulting string into the metadata), and then encode again. But I haven't been able to figure out what works.
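The cp1252 guess can be checked directly. This is a minimal sketch using only Python's built-in codecs (nothing from pdfrw), and it should run on both Python 2.7 and Python 3: the en dash, U+2013, becomes the single byte 0x96 in cp1252, whereas UTF-8 would produce three bytes.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

EN_DASH = u"\u2013"

cp1252_bytes = EN_DASH.encode("cp1252")  # a single byte, 0x96
utf8_bytes = EN_DASH.encode("utf-8")     # three bytes, 0xE2 0x80 0x93

print(repr(cp1252_bytes), repr(utf8_bytes))

# Decoding the byte seen in the file name with cp1252 recovers the en dash.
recovered = b"\x96".decode("cp1252")
print(recovered == EN_DASH)
```

Note that the coding line at the top of a script only tells Python how to read the source file itself; it does not change the encoding of file names returned by the operating system.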
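The decode, process, encode pattern described above might look roughly like this. It is only a sketch: `FS_ENCODING`, `decode_name`, `build_keywords`, and `encode_output` are hypothetical names, `build_keywords` stands in for the database lookup that is not shown in the question, and cp1252 is assumed as the encoding of the incoming file names.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

FS_ENCODING = "cp1252"  # assumption: the encoding the file names arrive in


def decode_name(raw_name, encoding=FS_ENCODING):
    """Turn a byte-string file name into text; pass text through unchanged."""
    if isinstance(raw_name, bytes):
        return raw_name.decode(encoding)
    return raw_name


def build_keywords(name_text):
    """Hypothetical stand-in for the database lookup and formatting step."""
    stem = name_text.rsplit(u".", 1)[0]
    return u"keywords for " + stem


def encode_output(text, encoding=FS_ENCODING):
    """Encode text to bytes only at the point where it leaves the program."""
    return text.encode(encoding)


raw = b"report 2010\x962011.pdf"   # bytes as they might come from a listing
name = decode_name(raw)            # text, with a real en dash (U+2013)
keywords = build_keywords(name)    # all intermediate work stays in Unicode
print(repr(encode_output(keywords)))
```

On Python 2.7, calling `os.listdir` with a unicode path (for example `u'c:/example_folder'`) returns unicode names, which would make the explicit decoding step unnecessary for the file names themselves.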

I think the solution is going to be one of the following:

[best] Correctly deal with the encoding issue.
Run some sort of batch file on the subfolder, renaming the files to something the script can handle, then reverse the names at the end (they need to end up with their original file names).
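The second option, renaming and then restoring, could be sketched as follows. The helper names `rename_to_safe` and `restore_names` and the `tmp_NNNN.pdf` placeholder pattern are made up for illustration. The sketch is written for Python 3, assumes no existing file already uses a placeholder name, and assumes the file system accepts an en dash in the demo file name.

```python
import os
import tempfile


def rename_to_safe(folder):
    """Rename every PDF in folder to an ASCII-only placeholder name.

    Returns a dict mapping each placeholder name to the original name.
    """
    mapping = {}
    for index, original in enumerate(sorted(os.listdir(folder))):
        if not original.lower().endswith(".pdf"):
            continue
        safe = "tmp_%04d.pdf" % index
        os.rename(os.path.join(folder, original), os.path.join(folder, safe))
        mapping[safe] = original
    return mapping


def restore_names(folder, mapping):
    """Undo rename_to_safe using the mapping it returned."""
    for safe, original in mapping.items():
        os.rename(os.path.join(folder, safe), os.path.join(folder, original))


# Demo in a throwaway folder containing one empty file with an en dash.
folder = tempfile.mkdtemp()
original_name = "report 2010\u20132011.pdf"
open(os.path.join(folder, original_name), "wb").close()

mapping = rename_to_safe(folder)
names_during = os.listdir(folder)   # placeholder names only
# ... the metadata script would run on the placeholder names here ...
restore_names(folder, mapping)
names_after = os.listdir(folder)    # original names are back
```

Keeping the mapping in memory is the simplest choice; if the metadata step can crash, writing the mapping to disk first would make the renames recoverable.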

I'd appreciate any help! I haven't been able to find anything that's worked.

Your freshly acquired Unicode know-how does not carry over directly to PDF. PDF's built-in string encodings date from before Unicode was in common use.

You should read "Annex D (normative): Character Sets and Encodings" in the official ISO 32000-1:2008 PDF 1.7 specification as published by Adobe, page 651.

There you'll find the character codes (in octal) that you should use for the en dash:

\261  StandardEncoding
\320  MacRomanEncoding
\226  WinAnsiEncoding
\205  PDFDocEncoding

For metadata (the /Info dictionary) you should use PDFDocEncoding.
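To make the PDFDocEncoding advice concrete, here is a rough sketch that turns a Unicode string into a PDF literal string, writing the en dash as the octal escape \205. The function name `to_pdf_literal` and the `PDFDOC_EXTRA` table are hypothetical, and the table is deliberately partial (bullet, em dash, en dash only), so this is not a complete PDFDocEncoding encoder. How a particular library such as pdfrw expects to receive the resulting string is not covered here.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Partial PDFDocEncoding table: only the code points this sketch needs.
PDFDOC_EXTRA = {
    u"\u2022": 0o200,  # bullet
    u"\u2014": 0o204,  # em dash
    u"\u2013": 0o205,  # en dash
}


def to_pdf_literal(text):
    """Return text as a PDF literal string with octal escapes."""
    out = []
    for ch in text:
        if ch in u"()\\":
            out.append("\\" + str(ch))       # escape delimiters and backslash
        elif u" " <= ch <= u"~":
            out.append(str(ch))              # printable ASCII passes through
        elif ch in PDFDOC_EXTRA:
            # Always three octal digits, so a following digit is not ambiguous.
            out.append("\\%03o" % PDFDOC_EXTRA[ch])
        else:
            raise ValueError("no mapping in this sketch for %r" % ch)
    return "(" + "".join(out) + ")"


literal = to_pdf_literal(u"2010\u20132011")
print(literal)
```

Characters outside the small table raise `ValueError` rather than being silently dropped, which makes gaps in the mapping visible early.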
