python - How to use a chain of callbacks in Scrapy
I am trying to build a crawler using Scrapy and Selenium webdriver. I am trying to get a set of urls in parse() and pass them to a callback function parse_url(), which again gets a different set of urls and passes them to parse_data().
The first callback, parse_url, works, but the second, parse_data, gives an AssertionError.
That is, if I run it without parse_data it prints the list of urls, but if I include it I get the assertion error.
I have this:
    class MySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["example.com"]
        start_urls = [
            "http://www.example.com/url",
        ]

        def parse(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            urls = get_urls(driver.page_source)  # get_urls returns a list
            yield scrapy.Request(urls, callback=self.parse_url(urls, driver))

        def parse_url(self, urls, driver):
            url_list = []
            for i in urls:
                driver.get(i)
                url_list.append(get_urls(driver.page_source))  # gets more urls
            yield scrapy.Request(urls, callback=self.parse_data(url_list, driver))

        def parse_data(self, url_list, driver):
            data = get_data(driver.page_source)

This is the traceback:
    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
        result = f(*args, **kw)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/spidermw.py", line 48, in process_spider_input
        return scrape_func(response, request, spider)
      File "/usr/local/lib/python2.7/dist-packages/scrapy/core/scraper.py", line 145, in call_spider
        dfd.addCallbacks(request.callback or spider.parse, request.errback)
      File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 299, in addCallbacks
        assert callable(callback)
    AssertionError
There are 2 problems:
You're not passing the function to the request; you're passing the return value of the function to the request.
A callback function for a request must have the signature (self, response).
There is a solution for dynamic content here: https://stackoverflow.com/a/24373576/2368836
It will eliminate the need to pass the driver into the function.
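Roughly, the idea is to let a Selenium-backed downloader middleware render each page and hand Scrapy a ready-made HtmlResponse, so the callbacks never need the driver at all. A minimal sketch of that pattern (the SeleniumMiddleware name and the settings line are illustrative, not taken from that answer):

    # settings.py (assumed): enable the middleware
    # DOWNLOADER_MIDDLEWARES = {"myproject.middlewares.SeleniumMiddleware": 543}

    from scrapy.http import HtmlResponse
    from selenium import webdriver

    class SeleniumMiddleware(object):
        def __init__(self):
            self.driver = webdriver.Firefox()

        def process_request(self, request, spider):
            # render the page in the browser and return the resulting HTML,
            # so spider callbacks only ever see a normal Response
            self.driver.get(request.url)
            return HtmlResponse(
                url=request.url,
                body=self.driver.page_source,
                encoding="utf-8",
                request=request,
            )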
So when yielding the request it should be like so:

    scrapy.Request(urls, callback=self.parse_url)

If you want to include the driver in the function, read about closures.
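For illustration only, a rough sketch of what the whole chain could look like with plain (self, response) callbacks; it recreates the driver in each callback to keep the example simple (get_urls/get_data are the asker's helpers), whereas the middleware above or the closure in the edit below avoids the repeated drivers:

    import scrapy
    from selenium import webdriver

    class MySpider(scrapy.Spider):
        name = "myspider"
        allowed_domains = ["example.com"]
        start_urls = ["http://www.example.com/url"]

        def parse(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            urls = get_urls(driver.page_source)  # the asker's helper, returns a list
            driver.quit()
            for url in urls:
                # pass the function object itself, never the result of calling it
                yield scrapy.Request(url, callback=self.parse_url)

        def parse_url(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            urls = get_urls(driver.page_source)
            driver.quit()
            for url in urls:
                yield scrapy.Request(url, callback=self.parse_data)

        def parse_data(self, response):
            driver = webdriver.Firefox()
            driver.get(response.url)
            data = get_data(driver.page_source)
            driver.quit()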
Edit: Here is a closure solution, but I think you should use the link I shared above, because of the reasons ghajba pointed out.
    def parse_data(self, url_list, driver):
        def encapsulated(response):
            data = get_data(driver.page_source)
            .....
            .....
            yield item
        return encapsulated

Then the request looks like:

    yield scrapy.Request(url, callback=self.parse_data(url_list, driver))