python - Making NLTK work for UTF8 punctuation?

March 15, 2013

I started using NLTK and noticed that it doesn't work with non-ASCII punctuation. For example, “ is being tagged as a noun. Also, having non-ASCII punctuation messes up the POS tagging for the rest of the words, because NLTK is interpreting “ as a word instead of as punctuation.

Is there a setting that allows NLTK to recognize non-ASCII punctuation? Since a single non-ASCII punctuation mark messes up the POS tagging for the entire document, I can't just replace every “ with ".

I'm not aware of any such setting.

But I have had similar issues when POS-tagging non-plain text (text augmented with XML-like tags in between). These XML tags are not POS-tagged correctly, so I take them out before I start POS-tagging, keep track of their indices, and re-insert them after tagging (assigning them the proper tag manually). Arguably, the presence or absence of punctuation won't change NLTK's POS-tagging output much, so you could try the same. I guess the set of 'problematic' punctuation characters is pretty limited?
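The take-out-and-re-insert idea described above could look roughly like the sketch below. It is only an illustration, not a built-in NLTK feature: the `PUNCT_TAGS` table and the `tag_around_punctuation` helper are hypothetical names, the table covers just the two curly double quotes, and the tags assigned to them are the Penn Treebank opening and closing quote tags. The tagger is passed in as a function that takes a list of tokens and returns `(token, tag)` pairs.

```python
# Hypothetical table of "problematic" punctuation and the tag to assign manually.
# The tags are the Penn Treebank opening/closing quote tags; extend as needed.
PUNCT_TAGS = {
    "\u201c": "``",  # left double quotation mark
    "\u201d": "''",  # right double quotation mark
}


def tag_around_punctuation(tokens, tagger):
    """Tag `tokens` with `tagger`, handling the marks in PUNCT_TAGS separately.

    `tagger` is any callable mapping a list of tokens to (token, tag) pairs.
    """
    # Remember the index of every problematic token before removing it.
    held_out = {i: tok for i, tok in enumerate(tokens) if tok in PUNCT_TAGS}
    plain = [tok for i, tok in enumerate(tokens) if i not in held_out]

    # Tag only the remaining tokens.
    tagged = iter(tagger(plain))

    # Rebuild the original order, re-inserting the held-out tokens
    # with their manually assigned tags.
    result = []
    for i in range(len(tokens)):
        if i in held_out:
            tok = held_out[i]
            result.append((tok, PUNCT_TAGS[tok]))
        else:
            result.append(next(tagged))
    return result
```

With NLTK installed and a tagger model downloaded, this could be called as `tag_around_punctuation(tokens, nltk.pos_tag)`. Whether removing the quotes changes the tags of the neighbouring words is something to check on your own data.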
