python - Removing <wbr> tags and grabbing the info between -

August 15, 2015

I'm scraping data from a webpage, and I have already handled a section that contains a <br> tag.

    <div class="scrollwrapper">
        <h3>smiles</h3>
        cc=o<br>
        <button type="button" id="downloadsmiles">download</button>
    </div>

I solved that with the script below, which outputs cc=o.

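    import requests
    from lxml import html

    # substance holds the chemical name to look up (defined elsewhere)
    page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
    tree = html.fromstring(page.text)
    if "smiles" in page.text:
        # text node immediately before the first <br> under the heading's parent
        smiles = tree.xpath('normalize-space(//*[text()="smiles"]/..//br[1]/preceding-sibling::text()[1])')
    else:
        smiles = ""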

However, while browsing the pages of other chemicals, I encountered pages that have <wbr> tags in them. I have no idea how to get rid of those tags while still grabbing the text between them. An example is shown below; the desired output is c1(c2ccccc2)ccc(n)cc1.

    <div class="scrollwrapper">
        <h3>smiles</h3>
        c1(c2ccccc2)<wbr>ccc(n)<wbr>cc1<br>
        <button type="button" id="downloadsmiles">download</button>
    </div>

The easiest thing is to replace the <wbr> string in page.text with an empty string before you parse the HTML. Since it is enclosed in < and >, I doubt that any of the useful information you are looking for contains it.

Example -

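    import requests
    from lxml import html

    # substance holds the chemical name to look up (defined elsewhere)
    page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
    # drop every <wbr> before parsing so the SMILES string is one text node
    tree = html.fromstring(page.text.replace('<wbr>', ''))
    if "smiles" in page.text:
        smiles = tree.xpath('normalize-space(//*[text()="smiles"]/..//br[1]/preceding-sibling::text()[1])')
    else:
        smiles = ""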
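
A plain replace('<wbr>', '') only matches that exact spelling. As a minimal sketch, assuming the pages might also contain variants such as <wbr/> or <WBR> (the question only shows the plain <wbr> form), a regular expression can cover them. The helper name strip_wbr is hypothetical.

```python
import re

# Matches <wbr>, <wbr/>, <wbr /> and uppercase variants. This is an
# assumption: the pages shown in the question only use the plain <wbr> form.
WBR_PATTERN = re.compile(r"<wbr\s*/?>", re.IGNORECASE)


def strip_wbr(html_text):
    """Return html_text with every <wbr> tag removed (hypothetical helper)."""
    return WBR_PATTERN.sub("", html_text)


sample = "c1(c2ccccc2)<wbr>ccc(n)<WBR/>cc1<br>"
cleaned = strip_wbr(sample)
print(cleaned)
```

The result could then be passed to the parser in place of the plain replace call, for example html.fromstring(strip_wbr(page.text)).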

Otherwise, you can use @bun's solution based on BeautifulSoup, or write more complex XPaths.
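
For comparison, here is a rough sketch of a third route that neither edits the page text nor needs a complex XPath: walk the markup with the standard library's html.parser and join every text chunk between the smiles heading and the next <br>, ignoring any tag in between. This is not @bun's BeautifulSoup solution (which is not shown here); the class name SmilesExtractor and the inline sample are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SmilesExtractor(HTMLParser):
    """Collect the text between the 'smiles' <h3> heading and the next <br>."""

    def __init__(self):
        super().__init__()
        self.in_heading = False   # currently inside an <h3>
        self.capturing = False    # after the smiles heading, before <br>
        self.heading_text = ""
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_heading = True
            self.heading_text = ""
        elif tag == "br" and self.capturing:
            self.capturing = False  # stop at the first <br> after the heading
        # every other tag, including <wbr>, is ignored

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_heading = False
            if self.heading_text.strip() == "smiles":
                self.capturing = True

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text += data
        elif self.capturing:
            self.parts.append(data)

    def result(self):
        return "".join(self.parts).strip()


sample = """<div class="scrollwrapper">
    <h3>smiles</h3>
    c1(c2ccccc2)<wbr>ccc(n)<wbr>cc1<br>
    <button type="button" id="downloadsmiles">download</button>
</div>"""

parser = SmilesExtractor()
parser.feed(sample)
parser.close()
smiles = parser.result()
print(smiles)
```

The trade-off is more code than the one-line replace, in exchange for not depending on how a given HTML parser nests the <wbr> element.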

Also, an easier XPath for this case should be -

'normalize-space(//*[text()="smiles"]/following-sibling::text()[1])'

Rather than finding the smiles element, taking its parent, finding the first br descendant of that parent, and then taking the text of its preceding sibling, you can directly take the text that follows the smiles element as its sibling.
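
To check that the two expressions agree, here is a small offline sketch that runs both against the sample markup from the question, with <wbr> already stripped. It assumes lxml is installed and uses an inline string instead of fetching the real page.

```python
from lxml import html

# Sample markup copied from the question (no network request needed).
sample = """<div class="scrollwrapper">
    <h3>smiles</h3>
    c1(c2ccccc2)<wbr>ccc(n)<wbr>cc1<br>
    <button type="button" id="downloadsmiles">download</button>
</div>"""

# Remove <wbr> first so the SMILES string is a single text node.
tree = html.fromstring(sample.replace("<wbr>", ""))

# Original expression: parent -> first br -> preceding text node.
long_xpath = 'normalize-space(//*[text()="smiles"]/..//br[1]/preceding-sibling::text()[1])'
# Simpler expression: first text node following the heading.
short_xpath = 'normalize-space(//*[text()="smiles"]/following-sibling::text()[1])'

long_result = tree.xpath(long_xpath)
short_result = tree.xpath(short_xpath)
print(long_result)
print(short_result)
```

Note that the shorter expression still relies on the <wbr> tags being removed first: following-sibling::text()[1] selects only the first text node after the heading, which on the raw page would end at the first <wbr>.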

JVParth
