python - Removing <wbr> tags and grabbing the info between
I'm scraping data from a webpage, and I have already handled a section that has a <br> tag.

    <div class="scrollwrapper">
      <h3>smiles</h3>
      cc=o<br>
      <button type="button" id="downloadsmiles">download</button>
    </div>

I solved that case with the script below, which outputs cc=o.
    import requests
    from lxml import html

    # substance holds the chemical name to look up (defined elsewhere)
    page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
    tree = html.fromstring(page.text)
    if "smiles" in page.text:
        smiles = tree.xpath('normalize-space(//*[text()="smiles"]/..//br[1]/preceding-sibling::text()[1])')
    else:
        smiles = ""

However, while browsing the pages of other chemicals, I encountered some that had <wbr> tags in them. I have no idea how to get rid of them while grabbing the information between them. An example is shown below; the desired output is c1(c2ccccc2)ccc(n)cc1.
    <div class="scrollwrapper">
      <h3>smiles</h3>
      c1(c2ccccc2)<wbr>ccc(n)<wbr>cc1<br>
      <button type="button" id="downloadsmiles">download</button>
    </div>
The easiest thing to do is to replace the <wbr> string in page.text with an empty string before you parse the HTML. Since it is enclosed in < and >, I doubt that any of the useful information you are looking for would contain it.
Example:

    import requests
    from lxml import html

    # substance holds the chemical name to look up (defined elsewhere)
    page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
    # Drop every <wbr> before parsing so the text around it becomes one run
    tree = html.fromstring(page.text.replace('<wbr>', ''))
    if "smiles" in page.text:
        smiles = tree.xpath('normalize-space(//*[text()="smiles"]/..//br[1]/preceding-sibling::text()[1])')
    else:
        smiles = ""

Otherwise, you can use @bun's solution of using BeautifulSoup, or write more complex XPaths.
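If the pages might also spell the tag as <wbr/>, <wbr /> or <WBR>, the plain replace('<wbr>', '') would miss those. The question only shows the <wbr> form, so those variants are an assumption. The following is a minimal sketch of a more tolerant replacement using the standard re module. The name strip_wbr is made up for illustration.

```python
import re

# Matches <wbr>, <WBR>, <wbr/> and <wbr /> (assumed variants; the
# question itself only shows the plain <wbr> form).
WBR_PATTERN = re.compile(r"<wbr\s*/?>", re.IGNORECASE)


def strip_wbr(html_text):
    """Return html_text with every <wbr> line-break hint removed."""
    return WBR_PATTERN.sub("", html_text)


# Snippet modelled on the one in the question, with one variant spelling.
snippet = (
    '<div class="scrollwrapper"> <h3>smiles</h3> '
    'c1(c2ccccc2)<wbr>ccc(n)<wbr/>cc1<br> '
    '<button type="button" id="downloadsmiles">download</button> </div>'
)
cleaned = strip_wbr(snippet)
```

The cleaned string can then be passed to html.fromstring in place of page.text.replace('<wbr>', ''). The <br> tag is left untouched because the pattern requires the literal name wbr.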
Also, an easier XPath for your case would be:

    'normalize-space(//*[text()="smiles"]/following-sibling::text()[1])'

Rather than finding the smiles element, going up to its parent, finding the first br descendant, and then taking that element's preceding sibling text, you can directly take the first text node that follows the smiles element.
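As a rough check of the shorter XPath, the sketch below runs it against the two snippets from the question instead of a live page, so no network request is needed. The helper name extract_smiles is hypothetical. Because normalize-space() of an empty node-set is an empty string, the sketch drops the "smiles" in page.text test; that is a simplification on my part, not something the original code does.

```python
from lxml import html

# Text node immediately following the element whose text is "smiles".
SMILES_XPATH = 'normalize-space(//*[text()="smiles"]/following-sibling::text()[1])'


def extract_smiles(page_text):
    """Hypothetical helper: remove <wbr>, parse, and read the SMILES text."""
    tree = html.fromstring(page_text.replace('<wbr>', ''))
    # The XPath evaluates to "" when no matching heading exists.
    return tree.xpath(SMILES_XPATH)


# The two snippets shown in the question.
simple = (
    '<div class="scrollwrapper"> <h3>smiles</h3> cc=o<br> '
    '<button type="button" id="downloadsmiles">download</button> </div>'
)
with_wbr = (
    '<div class="scrollwrapper"> <h3>smiles</h3> '
    'c1(c2ccccc2)<wbr>ccc(n)<wbr>cc1<br> '
    '<button type="button" id="downloadsmiles">download</button> </div>'
)

print(extract_smiles(simple))
print(extract_smiles(with_wbr))
```

Note that the <wbr> removal still has to happen before parsing. The shorter XPath only reads the first text node after the heading, so it relies on the SMILES string being a single run of text.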