python - How to use a chain of callbacks in Scrapy
I am trying to build a crawler using Scrapy and Selenium webdriver. I am trying to get a set of urls in parse() and pass them to a callback function parse_url(), which again gets a different set of urls and passes them to parse_data().
The first callback, parse_url, works, but the second, parse_data, gives an AssertionError.
That is, if I run it without parse_data it prints the list of urls, but if I include it I get the assertion error.
I have this:
    class MySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com/url",
        ]

        def parse(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            urls = get_urls(driver.page_source)  # get_urls returns a list
            yield scrapy.Request(urls, callback=self.parse_url(urls, driver))

        def parse_url(self, urls, driver):
            url_list = []
            for i in urls:
                driver.get(i)
                url_list.append(get_urls(driver.page_source))  # gets more urls
            yield scrapy.Request(urls, callback=self.parse_data(url_list, driver))

        def parse_data(self, url_list, driver):
            data = get_data(driver.page_source)

This is the traceback:
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
        result = f(*args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/spidermw.py", line 48, in process_spider_input
        return scrape_func(response, request, spider)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 145, in call_spider
        dfd.addCallbacks(request.callback or spider.parse, request.errback)
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 299, in addCallbacks
        assert callable(callback)
    AssertionError
There are 2 problems:
You're not passing the function to the request; you're passing the return value of the function to the request.
A callback function for a request must have the signature (self, response).
There is a solution for dynamic content here: https://stackoverflow.com/a/24373576/2368836
It will eliminate the need to pass the driver into the function.
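Roughly, the idea is to let a Selenium-backed downloader middleware render each page and hand Scrapy a ready-made HtmlResponse, so the callbacks never need the driver at all. A minimal sketch of that pattern (the SeleniumMiddleware name and the settings line are illustrative, not taken from that answer):

    # settings.py (assumed): enable the middleware
    # DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.SeleniumMiddleware": 543}

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class SeleniumMiddleware(object):
        def __init__(self):
            self.driver = webdriver.Firefox()

        def process_request(self, request, spider):
            # render the page in the browser and return the resulting HTML,
            # so spider callbacks only ever see a normal Response
            self.driver.get(request.url)
            return HtmlResponse(
                url=request.url,
                body=self.driver.page_source,
                encoding="utf-8",
                request=request,
            )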
So when yielding the request it should be like so:

    scrapy.Request(urls, callback=self.parse_url)

If you want to include the driver in the function, read about closures.
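For illustration only, a rough sketch of what the whole chain could look like with plain (self, response) callbacks; it recreates the driver in each callback to keep the example simple (get_urls/get_data are the asker's helpers), whereas the middleware above or the closure in the edit below avoids the repeated drivers:

    import scrapy
    from selenium import webdriver

    class MySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/url"]

        def parse(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            urls = get_urls(driver.page_source)  # the asker's helper, returns a list
            driver.quit()
            for url in urls:
                # pass the function object itself, never the result of calling it
                yield scrapy.Request(url, callback=self.parse_url)

        def parse_url(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            urls = get_urls(driver.page_source)
            driver.quit()
            for url in urls:
                yield scrapy.Request(url, callback=self.parse_data)

        def parse_data(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            data = get_data(driver.page_source)
            driver.quit()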
Edit: Here is a closure solution, but I think you should use the link I shared above, because of the reasons ghajba pointed out.
    def parse_data(self, url_list, driver):
        def encapsulated(response):
            data = get_data(driver.page_source)
            .....
            .....
            yield item
        return encapsulated

Then the request looks like:

    yield scrapy.Request(url, callback=self.parse_data(url_list, driver))