Parse XML Sitemap with Python


I have a sitemap like this: http://www.site.co.uk/sitemap.xml, which is structured like this:

    <sitemapindex>
      <sitemap>
        <loc>
        http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
        </loc>
        <lastmod>2015-07-07</lastmod>
      </sitemap>
      <sitemap>
        <loc>
        http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
        </loc>
        <lastmod>2015-07-07</lastmod>
      </sitemap>
    ...

I want to extract data from it. First of all, I need to count how many <sitemap> elements are in the XML, and then, for each of them, extract the <loc> and <lastmod> data. Is there an easy way to do this in Python?

I've seen other questions, but all of them extract, for example, every <loc> element inside the XML; I need to extract the data individually from each element.

I've tried to use lxml with this code:

    import urllib2
    from lxml import etree

    # Download and parse the sitemap index
    u = urllib2.urlopen('http://www.site.co.uk/sitemap.xml')
    doc = etree.parse(u)

    # Look for the <sitemap> children of the root element
    element_list = doc.findall('sitemap')

    for element in element_list:
        url = element.findtext('loc')
        print url

But element_list is empty.

I chose to use the requests and BeautifulSoup libraries. I created a dictionary where the key is the URL and the value is the last modified date.

    from bs4 import BeautifulSoup
    import requests

    xmldict = {}

    r = requests.get("http://www.site.co.uk/sitemap.xml")
    xml = r.text

    soup = BeautifulSoup(xml)
    sitemaptags = soup.find_all("sitemap")

    print "The number of sitemaps is {0}".format(len(sitemaptags))

    # Map each sitemap's <loc> text to its <lastmod> text
    for sitemap in sitemaptags:
        xmldict[sitemap.find_next("loc").text] = sitemap.find_next("lastmod").text

    print xmldict

Or with lxml:

    from lxml import etree
    import requests

    xmldict = {}

    r = requests.get("http://www.site.co.uk/sitemap.xml")
    root = etree.fromstring(r.content)

    # Each direct child of the root is one <sitemap> element
    print "The number of sitemap tags is {0}".format(len(root))

    for sitemap in root:
        # Relies on <loc> being the first child and <lastmod> the second
        children = sitemap.getchildren()
        xmldict[children[0].text] = children[1].text

    print xmldict
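One possible reason the findall('sitemap') call in the question came back empty is an XML namespace. This is an assumption, since the excerpt shows no attributes on <sitemapindex>, but sitemap files normally declare xmlns="http://www.sitemaps.org/schemas/sitemap/0.9", and then a bare tag name matches nothing. The following is a minimal sketch of a namespace-aware lookup, written for Python 3 with only the standard library. The SAMPLE string, the "sm" prefix and the parse_sitemap_index helper are hypothetical names made up for this illustration, and the sample stands in for the downloaded file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the downloaded sitemap index. The xmlns
# attribute is an assumption; the excerpt in the question shows none.
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_static_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
  <sitemap>
    <loc>
    http://www.site.co.uk/drag_it/dragitsitemap_alpha_0.xml
    </loc>
    <lastmod>2015-07-07</lastmod>
  </sitemap>
</sitemapindex>"""

# Prefix-to-URI mapping used in the search paths; "sm" is an arbitrary prefix.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_index(xml_text):
    """Return one (loc, lastmod) tuple per <sitemap> element."""
    root = ET.fromstring(xml_text)
    entries = []
    for sitemap in root.findall("sm:sitemap", NS):
        # Look up each child by name, and strip the whitespace
        # that surrounds the URL inside <loc>.
        loc = sitemap.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = sitemap.findtext("sm:lastmod", default="", namespaces=NS).strip()
        entries.append((loc, lastmod))
    return entries


# A bare tag name finds nothing when the document declares a namespace.
unqualified = ET.fromstring(SAMPLE).findall("sitemap")

entries = parse_sitemap_index(SAMPLE)
print("Number of sitemaps:", len(entries))
for loc, lastmod in entries:
    print(loc, lastmod)
```

Looking up children by name rather than by position keeps the result correct if an entry lists <lastmod> before <loc> or omits it. If the real file turns out to have no namespace, the plain 'sitemap' and 'loc' paths would be the ones to use instead.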
