Parse XML Sitemap with Python
I have a sitemap like this: http://www.site.co.uk/sitemap.xml, which is structured like this:
    <sitemapindex>
      <sitemap>
        <loc>http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml</loc>
        <lastmod>2015-07-07</lastmod>
      </sitemap>
      <sitemap>
        <loc>http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml</loc>
        <lastmod>2015-07-07</lastmod>
      </sitemap>
      ...
I want to extract data from it. First of all, I need to count how many <sitemap> elements are in the XML and then, for each of them, extract the <loc> and <lastmod> data. Is there an easy way to do this in Python?
I've seen other questions like this, but all of them extract, for example, every <loc> element inside the XML. I need to extract the data individually from each <sitemap> element.
I've tried to use lxml with this code:
    import urllib2
    from lxml import etree

    u = urllib2.urlopen('http://www.site.co.uk/sitemap.xml')
    doc = etree.parse(u)

    element_list = doc.findall('sitemap')

    for element in element_list:
        url = element.findtext('loc')
        print(url)
but element_list is empty.
I chose to use the requests and BeautifulSoup libraries. I created a dictionary where the key is the URL and the value is the last modified date.
    from bs4 import BeautifulSoup
    import requests

    xmldict = {}

    r = requests.get("http://www.site.co.uk/sitemap.xml")
    xml = r.text

    # The "xml" parser requires lxml to be installed
    soup = BeautifulSoup(xml, "xml")
    sitemaptags = soup.find_all("sitemap")

    print("The number of sitemaps is {0}".format(len(sitemaptags)))

    # Map each sitemap URL to its last-modified date
    for sitemap in sitemaptags:
        xmldict[sitemap.find_next("loc").text.strip()] = sitemap.find_next("lastmod").text.strip()

    print(xmldict)
Or with lxml:
    from lxml import etree
    import requests

    xmldict = {}

    r = requests.get("http://www.site.co.uk/sitemap.xml")
    root = etree.fromstring(r.content)

    # len(root) counts the direct children of the root, i.e. the <sitemap> elements
    print("The number of sitemap tags is {0}".format(len(root)))

    for sitemap in root:
        # Assumes <loc> is the first child and <lastmod> is the second
        children = list(sitemap)
        xmldict[children[0].text] = children[1].text

    print(xmldict)
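A possible reason the findall('sitemap') call in the question returns an empty list is an XML namespace. The excerpt in the question does not show one, so this is an assumption: files that follow the sitemaps.org protocol normally declare xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" on the root element, and an unqualified tag name then matches nothing. Below is a minimal sketch using only the standard library's xml.etree.ElementTree. The SAMPLE string, the NS mapping and the parse_sitemap_index helper are hypothetical names made up for illustration, and the sample stands in for the real file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the real sitemap index. The xmlns
# declaration is an assumption based on the sitemaps.org protocol; the
# excerpt in the question does not show it.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml</loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml</loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
</sitemapindex>
"""

# Prefix-to-URI mapping used in the search paths below
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_index(xml_text):
    """Return a list of (loc, lastmod) tuples, one per <sitemap> element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        # Look up each field by name inside this <sitemap> only
        loc = sitemap.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = sitemap.findtext("sm:lastmod", default="", namespaces=NS).strip()
        entries.append((loc, lastmod))
    return entries


entries = parse_sitemap_index(SAMPLE)
print("Number of sitemaps: {0}".format(len(entries)))
for loc, lastmod in entries:
    print("{0} {1}".format(loc, lastmod))

# The unqualified name finds nothing when the namespace is declared
unqualified = ET.fromstring(SAMPLE.encode("utf-8")).findall("sitemap")
print("Unqualified matches: {0}".format(len(unqualified)))
```

Looking up loc and lastmod by name within each sitemap element avoids relying on child order. If the real file turns out to have no namespace, plain findall('sitemap') should work and the NS mapping can be dropped.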