books
select elements
Find the elements below on this page so their selectors can be added to the script:
https://books.toscrape.com/
Quick way to get any attribute
books = response.css('article.product_pod')
books[0].css('div.image_container a img::attr(alt)').get()
This will return
'A Light in the Attic'
Get the following elements
fetch('https://books.toscrape.com/')
books = response.css('article.product_pod')
book = books[0]
book.css('h3 a::text').get()
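We can also get the title with attrib. It is a property, not a method: a dict holding the attributes of the first matched element. The link text is truncated on the listing page, while the title attribute holds the full title.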
book.css('h3 a').attrib['title']
book.css('.product_price .price_color::text').get()
book.css('h3 a').attrib['href']
Add the selectors above to the parse() method of the BookspiderSpider class
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        # Each book on the listing page is an <article class="product_pod">
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }
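The href values on the listing page appear to be relative (for example `catalogue/.../index.html`), so the `url` field is not directly usable on its own. Inside a spider, `response.urljoin(href)` resolves it against the page URL. The sketch below shows the same resolution with the standard library's `urljoin`; the helper name `absolute_url` and the sample href are illustrative assumptions.

```python
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"


def absolute_url(base: str, href: str) -> str:
    """Resolve a possibly relative href against the page it was found on."""
    return urljoin(base, href)


# Sample relative href, assumed to match what the listing page returns.
relative_href = "catalogue/a-light-in-the-attic_1000/index.html"
full_url = absolute_url(BASE_URL, relative_href)

# An href that is already absolute is returned unchanged.
already_absolute = absolute_url(BASE_URL, "https://example.com/x")

print(full_url)
print(already_absolute)
```

In the spider itself, the equivalent would be `'url': response.urljoin(book.css('h3 a').attrib['href'])`.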
Exit the Scrapy shell, change directory into the bookscraper folder, and run
scrapy crawl bookspider
You must be in the directory that contains the scrapy.cfg file
cd bookscraper
.
├── bookscraper
│   ├── __init__.py
│   ├── __pycache__
│   │   ├── __init__.cpython-310.pyc
│   │   └── settings.cpython-310.pyc
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-310.pyc
│       │   └── bookspider.cpython-310.pyc
│       └── bookspider.py
└── scrapy.cfg
Once the crawl finishes, the last items scraped from the page are visible in the console output