Why is Scrapy not iterating over all the links on the page even though the XPaths are correct?
By : user3504043
Date : March 29 2020, 07:55 AM
Your for loop has nothing to iterate over on the given website, because the XPath selects only the container div, not the individual review entries inside it. Change your selector:
sites = response.xpath('//div[@id="allreviews"]')        # before: selects only the container div
sites = response.xpath('//div[@id="allreviews"]/ul/li')  # after: selects each review <li>
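With the corrected selector the loop gets one selector per review. A minimal sketch of the surrounding loop, assuming each <li> contains a link whose text you want (the './/a/text()' path and the 'title' field are hypothetical):

sites = response.xpath('//div[@id="allreviews"]/ul/li')
for site in sites:
    # the leading dot keeps the query relative to this <li>
    title = site.xpath('.//a/text()').extract_first()
    yield {'title': title}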
Scrapy (Python): Iterating over 'next' page without multiple functions
By : Nick Ballinger
Date : March 29 2020, 07:55 AM
A parse callback is just a function that takes a response and returns or yields Items, Requests, or both. There is no problem at all with reusing these callbacks, so you can pass the same callback for every request. You could pass the current page info using the Request meta, but instead I'd leverage CrawlSpider to crawl across every page. It's really easy; start by generating the spider from the command line:
scrapy genspider --template crawl finance finance.yahoo.com
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from yfinance.items import YfinanceItem


class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    # follow every [rel="next"] link and parse each page it leads to
    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )

    def parse_items(self, response):
        # skip the header and footer rows of the quotes table
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])
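For comparison, here is a minimal sketch of the Request-meta approach mentioned above, reusing a single callback on a plain scrapy.Spider; the FinanceMetaSpider name, the [rel="next"] link selector, and the page counter are assumptions for illustration:

import scrapy

from yfinance.items import YfinanceItem


class FinanceMetaSpider(scrapy.Spider):
    name = 'finance_meta'
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    def parse(self, response):
        page = response.meta.get('page', 1)
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract_first())
        next_href = response.css('[rel="next"]::attr(href)').extract_first()
        if next_href:
            # carry the page counter forward while reusing the same callback
            yield scrapy.Request(response.urljoin(next_href),
                                 callback=self.parse,
                                 meta={'page': page + 1})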
Scrapy: iterating over a selector yields n duplicated items, where n is the number of selectors found on the page
By : John King
Date : March 29 2020, 07:55 AM
Once you have your "sub-selector" reviewSelector, you need a . before your XPath to make the query relative to that sub-selector. That is, change this:
reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()   # before: searches the whole page
reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first()  # after: searches within reviewSelector
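In context, the fix looks like the sketch below; the //*[@itemprop="review"] selector for the sub-selectors is an assumption for illustration:

for reviewSelector in response.xpath('//*[@itemprop="review"]'):
    yield {
        # without the leading dot this matches every author on the page,
        # which is what produces n duplicated items per review
        'author': reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first(),
    }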
Why does the scrapy.Request class call the parse() method by default?
By : Mene
Date : March 29 2020, 07:55 AM
This is decided inside the Scrapy core; see the request.callback or spider.parse part of call_spider:
def call_spider(self, result, request, spider):
    result.request = request
    dfd = defer_result(result)
    # fall back to the spider's parse() when the request has no callback
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
    return dfd.addCallback(iterate_spider_output)
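In practice, the fallback only applies when a request is created without a callback. A small illustration from inside a spider method (parse_item is a hypothetical callback name):

yield scrapy.Request(url)                             # no callback given: Scrapy calls self.parse
yield scrapy.Request(url, callback=self.parse_item)   # explicit callback: parse_item runs instead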
Iterating through select items on AJAX page with Scrapy and Splash
By : Arash Sakhaee
Date : March 29 2020, 07:55 AM
You might try Splash's execute endpoint with a Lua script that fills the select with each option's value and returns the result. Something like:
...
script = """
function main(splash)
splash.resource_timeout = 10
splash:go(splash.args.url)
splash:wait(1)
splash:runjs('document.getElementsByClassName("foo")[0].value = "' .. splash.args.value .. '"')
splash:wait(1)
return {
html = splash:html(),
}
end
"""
# base_url refers to page with the select
values = response.xpath('//select[@class="foo"]/option/@value').extract()
for value in values:
yield scrapy_splash.SplashRequest(
base_url, self.parse_result, endpoint='execute',
args={'lua_source': script, 'value': value, 'timeout': 3600})
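The parse_result callback then receives the rendered page. With scrapy-splash's magic responses (enabled by default), the html key returned by the Lua script is used as the response body, so normal selectors work. A minimal sketch, where '#results li' is a hypothetical container updated by the AJAX call:

def parse_result(self, response):
    # the body is the HTML returned by splash:html() in the Lua script
    for text in response.css('#results li::text').extract():
        yield {'result': text}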