python - Making NLTK work for UTF8 punctuation?
I started using NLTK and noticed that it doesn't work with non-ASCII punctuation. For example, “ is being tagged as a noun. Also, having non-ASCII punctuation messes up the POS tagging of the rest of the words, because NLTK is interpreting “ as a word instead of as punctuation.
Is there a setting that allows NLTK to recognize non-ASCII punctuation? Since a single non-ASCII punctuation mark messes up the POS tagging of the entire document, I can't just replace every “ with ".
I'm not aware of such a setting.
But I have had similar issues when POS-tagging non-plain text (text augmented with XML-like tags in between). These XML tags are not POS-tagged correctly, so I take them out before I start POS-tagging, keep track of their indices, and re-insert them after tagging (and assign them the proper tag manually). Arguably, the presence or absence of punctuation won't change NLTK's POS-tagging output much, so you could try the same. I guess the set of 'problematic' punctuation characters is pretty limited?
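The take-out, tag, and re-insert approach described above could be sketched roughly as follows. This is only an illustration under some assumptions: the function name `tag_around_punctuation` and the `MANUAL_TAGS` table are hypothetical, the tagger is passed in as any callable that maps a list of tokens to a list of `(token, tag)` pairs, and the manually assigned tags follow the Penn Treebank convention for opening and closing quotes.

```python
# Hypothetical table: problematic punctuation -> tag to assign by hand.
# The tags assume Penn Treebank conventions (`` opens a quote, '' closes it).
MANUAL_TAGS = {
    "\u201c": "``",  # left double quotation mark
    "\u201d": "''",  # right double quotation mark
}


def tag_around_punctuation(tokens, tagger, manual_tags=MANUAL_TAGS):
    """Tag `tokens` with `tagger`, bypassing tokens listed in `manual_tags`.

    `tagger` is any callable taking a list of tokens and returning a list
    of (token, tag) pairs of the same length.
    """
    held_out = {}  # original index -> (token, manually assigned tag)
    kept = []      # tokens that are actually sent to the tagger
    for index, token in enumerate(tokens):
        if token in manual_tags:
            held_out[index] = (token, manual_tags[token])
        else:
            kept.append(token)

    tagged_kept = iter(tagger(kept))

    # Rebuild the sequence in original order, re-inserting held-out tokens.
    return [
        held_out[index] if index in held_out else next(tagged_kept)
        for index in range(len(tokens))
    ]
```

With NLTK installed and its tagger model downloaded, `nltk.pos_tag` could be passed as `tagger`, for example `tag_around_punctuation(tokens, nltk.pos_tag)`. Further characters, such as curly single quotes, would need their own entries in the table.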