python - Making NLTK work for UTF8 punctuation?


I started using NLTK and noticed that it doesn't work with non-ASCII punctuation. For example, a non-ASCII punctuation mark is being tagged as a noun. Also, having non-ASCII punctuation messes up the POS tagging of the rest of the words, because NLTK is interpreting it as a word instead of as punctuation.

Is there a setting that allows NLTK to recognize non-ASCII punctuation? Since having a single non-ASCII punctuation mark messes up the POS tagging of the entire document, I can't just replace every one with ".

I'm not aware of such a setting.

But I have had similar issues when POS-tagging non-plain text (text augmented with XML-like tags in between). These XML tags are not POS-tagged correctly, so I take them out before I start POS-tagging, keep track of their indices, and re-insert them after tagging (and assign them a proper tag manually). Arguably, the presence or absence of punctuation won't change NLTK's POS-tagging output much, so you could try the same. I guess the set of 'problematic' punctuation characters is pretty limited?
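The take-out / re-insert idea could look roughly like the sketch below. It is only an illustration under a few assumptions: the text is already tokenized so that each punctuation mark is its own token, "problematic" means a token made up entirely of non-ASCII characters in a Unicode punctuation category, and the names `is_non_ascii_punct`, `tag_around_punct` and the `"PUNCT"` tag are made up for this example (they are not part of NLTK). The tagger is passed in as a function so the bookkeeping is independent of which tagger you use.

```python
import unicodedata


def is_non_ascii_punct(token):
    """True if every character is non-ASCII Unicode punctuation (category P*)."""
    return bool(token) and all(
        ord(ch) > 127 and unicodedata.category(ch).startswith("P")
        for ch in token
    )


def tag_around_punct(tokens, tagger, punct_tag="PUNCT"):
    """Tag tokens with `tagger`, hiding non-ASCII punctuation from it.

    `tagger` is any callable that maps a list of tokens to a list of
    (token, tag) pairs of the same length. `punct_tag` is a hypothetical
    tag name assigned manually to the tokens that were taken out.
    """
    # Remember the position of every problematic token.
    removed = {i: tok for i, tok in enumerate(tokens) if is_non_ascii_punct(tok)}
    # Tag only the remaining tokens.
    kept = [tok for i, tok in enumerate(tokens) if i not in removed]
    tagged = iter(tagger(kept))
    # Re-insert the removed tokens at their original indices.
    return [
        (removed[i], punct_tag) if i in removed else next(tagged)
        for i in range(len(tokens))
    ]
```

With NLTK installed and its tagger model downloaded, `nltk.pos_tag` can be passed as `tagger`, e.g. `tag_around_punct(tokens, nltk.pos_tag)`. Whether removing the punctuation changes the tags of the neighbouring words is something you would have to check on your own data.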

