
Added iterating over page id in Scrapy, responses in parse method no longer run

By : Williglimes
Date : November 29 2020, 09:01 AM
Hope that helps. The information you are looking for occurs only once on each page, and the body tag is on every page, so the loop and the line below are unnecessary:
code :
grants = sel.xpath('.//html//body')
Removing the loop and yielding the item directly from the callback gives a working spider:
# -*- coding: utf-8 -*-

from scrapy.spiders import Spider
from scrapy.http import Request

from scraper_app.items import NSERCGrant

class NSERC_Spider(Spider):

    name = 'NSERCSpider'
    allowed_domains = ["www.nserc-crsng.gc.ca"]  # domains only, no http:// scheme
    # Maximum page id to use.
    max_id = 5

    def start_requests(self):
        for i in range(1, self.max_id + 1):  # include max_id itself
            yield Request("http://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=%d" % i,
                          callback=self.parse_grant)

    def parse_grant(self, response):

        print("Being called")

        item = NSERCGrant()

        # Row one
        item['Competition_Year'] = response.xpath('//tr[1]//td[2]//text()').extract()
        item['Fiscal_Year'] = response.xpath('//tr[1]//td[4]//text()').extract()

        # Row two
        item['Project_Lead_Name'] = response.xpath('//tr[2]//td[2]//text()').extract()
        item['Institution'] = response.xpath('//tr[2]//td[4]//text()').extract()

        # Row three
        item['Department'] = response.xpath('//tr[3]//td[2]//text()').extract()
        item['Province'] = response.xpath('//tr[3]//td[4]//text()').extract()

        # Row four
        item['Award_Amount'] = response.xpath('//tr[4]//td[2]//text()').extract()
        item['Installment'] = response.xpath('//tr[4]//td[4]//text()').extract()

        # Row five
        item['Program'] = response.xpath('//tr[5]//td[2]//text()').extract()
        item['Selection_Committee'] = response.xpath('//tr[5]//td[4]//text()').extract()

        # Row six
        item['Research_Subject'] = response.xpath('//tr[6]//td[2]//text()').extract()
        item['Area_of_Application'] = response.xpath('//tr[6]//td[4]//text()').extract()

        # Row seven
        item['Co_Researchers'] = response.xpath("//tr[7]//td[2]//text()").extract()
        item['Partners'] = response.xpath('//tr[7]//td[4]//text()').extract()

        # Award Summary
        item['Award_Summary'] = response.xpath('//p//text()').extract()

        yield item
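With those changes, running the spider from the project root should print "Being called" for each response and emit one item per page, e.g.:
code :
scrapy crawl NSERCSpider -o grants.json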


Why scrapy not iterating over all the links on the page even the xpaths are correct?


By : user3504043
Date : March 29 2020, 07:55 AM
Scrapy (Python): Iterating over 'next' page without multiple functions


By : Nick Ballinger
Date : March 29 2020, 07:55 AM
I hope this fixes the issue. A parse callback is just a function that takes the response and returns or yields Items, Requests, or both. There is no issue at all with reusing these callbacks, so you can simply pass the same callback for every request.
Now, you could pass the current page info through Request meta (a sketch of that alternative follows the CrawlSpider example below), but instead I'd leverage the CrawlSpider to crawl across every page. It's really easy; start by generating the spider from the command line:
code :
scrapy genspider --template crawl finance finance.yahoo.com
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from yfinance.items import YfinanceItem


class FinanceSpider(CrawlSpider):
    name = 'finance'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO&a=04&b=19&c=2005&d=04&e=19&f=2010&g=d&z=66&y=132']

    rules = (
        Rule(LinkExtractor(restrict_css='[rel="next"]'),
             callback='parse_items',
             follow=True),
    )

    def parse_items(self, response):
        for line in response.css('.yfnc_datamodoutline1 table tr')[1:-1]:
            yield YfinanceItem(date=line.css('td:first-child::text').extract()[0])
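For reference, the Request meta alternative mentioned above would look something like this (a minimal sketch; the spider name, start URL, and next-link selector are assumptions):
code :
import scrapy

class FinancePagesSpider(scrapy.Spider):
    name = 'finance_pages'
    start_urls = ['http://finance.yahoo.com/q/hp?s=PWF.TO']

    def parse(self, response):
        # Carry the current page number along in Request meta.
        page = response.meta.get('page', 1)
        self.logger.info('parsing page %d', page)
        # ... extract items from this page here ...
        next_url = response.css('[rel="next"]::attr(href)').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url),
                                 callback=self.parse,
                                 meta={'page': page + 1})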
Scrapy iterating over selector yields n duplicated items for number of selectors found on page


By : John King
Date : March 29 2020, 07:55 AM
I hope this helps you. Once you have your sub-selector reviewSelector, you need to prefix your XPath with . to make it relative to that sub-selector. Without the leading dot, the //... expression searches the whole document, so every sub-selector yields the same (first) match, which is why you get n duplicated items.
i.e. this:
code :
reviewSelector.xpath('//*[@itemprop="author"]/@content').extract_first()
should be:
reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first()
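In context, the loop then looks something like this (a minimal sketch; the review container selector and the second field are assumptions):
code :
for reviewSelector in response.xpath('//div[@itemprop="review"]'):
    yield {
        # The leading dot scopes each query to this review only.
        'author': reviewSelector.xpath('.//*[@itemprop="author"]/@content').extract_first(),
        'rating': reviewSelector.xpath('.//*[@itemprop="ratingValue"]/@content').extract_first(),
    }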
Scrapy: Why does the scrapy.Request class call the parse() method by default?


By : Mene
Date : March 29 2020, 07:55 AM
Hope that helps. This is decided inside the Scrapy core; see the request.callback or spider.parse part of call_spider:
code :
def call_spider(self, result, request, spider):
    result.request = request
    dfd = defer_result(result)
    # Falls back to spider.parse when the Request has no explicit callback
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
    return dfd.addCallback(iterate_spider_output)
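So, to route a response somewhere other than parse(), set an explicit callback on the Request (a minimal sketch; the spider name and URLs are illustrative):
code :
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # A Request yielded without a callback would come back here;
        # this one is routed to parse_item instead.
        yield scrapy.Request('http://example.com/item', callback=self.parse_item)

    def parse_item(self, response):
        yield {'url': response.url}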
Iterating through select items on AJAX page with Scrapy and Splash


By : Arash Sakhaee
Date : March 29 2020, 07:55 AM
This should help. You might try Splash's execute endpoint with a Lua script that fills the select with each option's value and returns the rendered result. Something like:
code :
...
script = """
function main(splash)
    splash.resource_timeout = 10
    splash:go(splash.args.url)
    splash:wait(1)
    splash:runjs('document.getElementsByClassName("foo")[0].value = "' .. splash.args.value .. '"')
    splash:wait(1)
    return {
        html = splash:html(),
    }
end
"""

# base_url refers to page with the select
values = response.xpath('//select[@class="foo"]/option/@value').extract()
for value in values:
    yield scrapy_splash.SplashRequest(
        base_url, self.parse_result, endpoint='execute',
        args={'lua_source': script, 'value': value, 'timeout': 3600})
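The parse_result callback can recover which option produced each rendering from the Splash args (a minimal sketch; the table selector and field names are assumptions):
code :
def parse_result(self, response):
    # scrapy_splash keeps the original request args under meta['splash']
    value = response.request.meta['splash']['args'].get('value')
    # Extract fields from the rendered HTML as usual.
    for row in response.css('table.results tr'):
        yield {'option': value, 'text': row.css('td::text').extract_first()}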
Related Posts :
  • How to parse HTML table with rows contaning both <th> and <td> tags under the <tr> tag?
  • Using recursion with map in python
  • Creating lists with loops in Python
  • I have a program to check if five numbers are prime but it is not working correctly
  • Returned answer wrong in pyschool Topic 2: Q 6
  • What are the centroid of k-means clusters with PCA decomposition?
  • How do mongoengine filter field not null?
  • Categorize results based on Model in haystack?
  • Error installing pycrypto on my mac
  • Can Django ORM has strip field?
  • Python pack / unpack converts to Objective C
  • Python - Selenium Locate elements by href
  • Couldn't iterate over a dictionary context variable in template, despite having all in place, as far as I know?
  • Test if Django ModelForm has instance on customized model
  • Reading excel column 1 into Python dictionary key, column 2 into value
  • AttributeError: 'module' object has no attribute 'timeit' while doing timeit a python function
  • Accessing button using selenium in Python
  • Removing White Spaces in a Python String
  • Sort timestamp in python dictionary
  • How to use Python 2 packages in Python 3 project?
  • retrieve links from web page using python and BeautifulSoup than select 3 link and run it 4 times
  • applying lambda to tz-aware timestamp
  • Having two Generic ListViews on the same page
  • Merging numpy array elements using join() in python
  • pythonic way to parse/split URLs in a pandas dataframe
  • wanting to add an age gate to my quiz
  • Removing top empty line when writing a text file Python
  • How to use a template html in different folder on Google App Engine python?
  • Access ndarray using list
  • unable to post file+data using python-requests
  • How to test aws lambda functions locally
  • inconsistent plot between matplotlib and seaborn in Python
  • How matplotlib show obvious changes?
  • Project in Python3, reading files, word data
  • Check for specific Item in list without Iteration or find()
  • Unicode encoding when reading from text file
  • Overloaded variables in python for loops?
  • All elements have same value after appending new element
  • Python Threading loop
  • `_pickle.UnpicklingError: the STRING opcode argument must be quoted`
  • Python: How to stop a variable from exceeding a value?
  • python textblob and text classification
  • Django - Context dictionary for attribute inside a class
  • Database is not updated in Celery task with Flask and SQLAlchemy
  • Shapely intersections vs shapely relationships - inexact?
  • How to extract a percentage column from a periodic column and the sum of the column?
  • Zombie ssh process using python subprocess.Popen
  • Python regex to capture a comma-delimited list of items
  • joining string and long in python
  • Value Error in python numpy
  • Check if any character of a string is uppercase Python
  • TensorFlow - why doesn't this sofmax regression learn anything?
  • Python Anaconda Proxy Setup via .condarc file on Windows
  • Creating django objects from emails
  • Get spotify currently playing track
  • Select multiple columns and remove values according to a list
  • Python - How to Subtract a Variable By 1 Every Second?
  • Tkinter unable to alloc 71867 bytes
  • How to add Variable to JSON Python Django
  • CSRF token missing or invalid Django