C# - Regex to match words but not HTML entities


I'm parsing HTML node text with a regex, looking for words to perform operations on.
I'm using (\w+).

I have situations like word&nbsp;word, where nbsp gets recognized as a word.

I can match an HTML entity with \&[a-z0-9A-Z]+\; but I don't know how to unmatch a word if it is part of an entity.

Is there a way to have a regex match a word, but not if it is part of an HTML entity like the following?

&nbsp; (non-breaking space)
&lt; <
&#253; ý
etc.

A negative lookbehind assertion might do the trick:

(?<!&#?)\b\w+ 

This matches a word only if it is not preceded by & or &#. It doesn't check for the trailing semicolon, though, since a semicolon might legitimately follow a normal word.
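
To see how the lookbehind behaves in practice, here is a minimal C# sketch. The class name `EntityAwareWords`, the method `FindWords`, and the sample string are hypothetical and only there for illustration; the pattern itself is the one given above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class EntityAwareWords
{
    // Pattern from the answer: a word that is not directly preceded by "&" or "&#".
    // .NET allows variable-length lookbehind, so the optional "#" is accepted.
    private static readonly Regex WordPattern = new Regex(@"(?<!&#?)\b\w+");

    // Hypothetical helper: returns every word the pattern finds in the text.
    public static string[] FindWords(string text)
    {
        return WordPattern.Matches(text)
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .ToArray();
    }

    public static void Main()
    {
        // Made-up sample mixing plain words, named entities and a numeric entity.
        string sample = "word&nbsp;word &lt;b&gt; &#253; AT&T";
        Console.WriteLine(string.Join(", ", FindWords(sample)));
    }
}
```

For this sample the program prints `word, word, b, AT`. The entity names nbsp, lt, gt and the number 253 are skipped, but so is the second T in AT&T, because the pattern only looks at what comes before the word and never checks for the closing semicolon. If bare ampersands like that can appear in your text, you may need an extra check on top of this.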

