Did you know there are over 1.9 billion websites in the world, brimming with valuable data? Imagine the possibilities! Web scraping unlocks this potential by extracting information automatically, saving you countless hours of manual data collection.
This blog explores Scrapy, a powerful Python framework that automates data extraction from websites. We'll show you how to navigate the web like a pro and extract valuable information efficiently, and we'll walk through a case study to help you understand its practical implications.
Let's get started!
Why Web Scraping?
Web scraping automates the process of extracting data from websites. Consider automatically collecting product listings from an e-commerce store to track price changes, or gathering news articles on a specific topic for analysis. Web scraping can be a powerful tool for research, data collection, and automation tasks.
Introducing Scrapy
Scrapy is a free and open-source Python framework specifically designed for web scraping. It takes the complexity out of the process by offering a structured approach. Here's how Scrapy makes your life easier.
- Build efficient spiders - Scrapy uses "spiders" – programs that crawl websites and extract data. You don't need to worry about the low-level details of navigating web pages and parsing HTML.
- Focus on what matters - Scrapy handles the technical aspects, allowing you to concentrate on writing code to target the specific data you need.
- Handle complex structures - Scrapy can navigate websites with pagination (multiple pages of data), forms, and even some JavaScript-heavy content.
Setting Up Your Scrapy Environment
The first step is setting up your Scrapy environment. Luckily, installation is straightforward for all major operating systems. Once installed, you'll create a new Scrapy project using a simple command-line tool. This project serves as the foundation for your web scraping tasks.
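For reference, the typical setup boils down to a couple of commands; the project name bookscraper is just an example used in this post.
# Install Scrapy (ideally inside a virtual environment)
pip install scrapy

# Create a new project; "bookscraper" is an example name
scrapy startproject bookscraper
cd bookscraper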
Amazon Case Study
Web scraping offers a powerful tool for extracting valuable data from websites. This case study delves into how we leveraged web scraping techniques to unlock insights from the vast Amazon marketplace.
1. Site Analysis
Before writing a spider for any site, it's good practice to analyze the site's structure and divide it into sections based on the types of pages we'll iterate through to reach the target data.
The structure of a typical e-commerce site can be broken down into four sections:
Home Page > Categories > Listing > Product Page
In our case, we are only going to target the Books category of Amazon, so we can start right from the Listing Page.
2. Selecting the Spider Type
Scrapy has two main spider types - Spider and CrawlSpider.
I. Spider: The base spider class, suitable for sites whose structure is not well defined or where iterating through anchor links is not practical.
II. CrawlSpider: A subclass of Spider, suitable for sites with a straightforward flow whose pages are easy to reach through anchor links.
Looking at the Amazon Books pages, we can see that each page of the site can be reached through a direct request to an anchor link available on the page, so CrawlSpider is the better fit.
Let's create a custom amazon_spider.py file and import the relevant modules:
#file: spiders > amazon_spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
3. Defining Data Fields for Parsed Data
Let's define the structure of our parsed data using Scrapy Items before continuing with the spider.
#file: items.py
from scrapy import Item, Field
from typing import List, Optional


class BookDetailsItem(Item):
    # Nested item holding the product details section of a book page.
    pages: Optional[int] = Field()
    language: Optional[str] = Field()
    publisher: Optional[str] = Field()
    publication_date: Optional[str] = Field()
    isbn_10: Optional[str] = Field()
    isbn_13: Optional[str] = Field()


class AuthorItem(Item):
    # Nested item describing the book's author.
    name: Optional[str] = Field()
    about: Optional[str] = Field()
    other_books_link: Optional[str] = Field()


class ReviewItem(Item):
    # Nested item for a single customer review.
    name: Optional[str] = Field()
    rating: Optional[float] = Field()
    title: Optional[str] = Field()
    review: Optional[str] = Field()


class AmazonItem(Item):
    # Top-level item yielded by the spider for each book page.
    asin: Optional[str] = Field()
    title: Optional[str] = Field()
    subtitle: Optional[str] = Field()
    cover_image: Optional[str] = Field()
    author: Optional[AuthorItem] = Field()
    rating: Optional[float] = Field()
    review_count: Optional[int] = Field()
    price: Optional[float] = Field()
    availability: Optional[str] = Field()
    description: Optional[str] = Field()
    book_details: Optional[BookDetailsItem] = Field()
    page_url: Optional[str] = Field()
    reviews: Optional[List[ReviewItem]] = Field()
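Scrapy Items behave like dictionaries: declared fields are set and read with key access, and assigning a key that isn't declared raises a KeyError, which catches typos early. Here's a quick illustration with made-up values, not actual scraped data; the import path depends on your project layout.
from items import AmazonItem, AuthorItem  # adjust the import to your project package

book = AmazonItem()
book["asin"] = "B000000000"                        # placeholder ASIN, not real data
book["title"] = "Example Book Title"
book["price"] = 9.99
book["author"] = AuthorItem(name="Example Author")

print(dict(book))            # Items convert cleanly to plain dicts for export or debugging
# book["publisher"] = "..."  # would raise KeyError: publisher lives on BookDetailsItem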
4. Amazon Spider
Let’s continue with our custom Amazon spider and its different components.
Every site handles page requests differently; some require a specific type of token or session ID with each request before they will treat it as authentic and return the requested page.
Fortunately, in our case, Amazon serves its data as plain HTML pages, which only require a user agent to be returned properly. So let's add one via DEFAULT_REQUEST_HEADERS in the spider's custom settings, along with a couple of other settings to keep the spider running without errors.
#file: spiders > amazon_spider.py
class AmazonCrawler(CrawlSpider):
    name = "amazon_crawler"

    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "DOWNLOAD_DELAY": 2,
        "DEFAULT_REQUEST_HEADERS": {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,"
                      "image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        },
    }
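The class above only carries the settings; the crawling behavior itself comes from start_urls, the CrawlSpider rules that decide which anchor links to follow, and a callback that parses each book page. Here is a minimal sketch of how those pieces could fit together (repeating the class header so the snippet is self-contained). The start URL and the CSS selectors are illustrative assumptions rather than the exact values from this case study; Amazon's markup changes often, and the real selectors come out of the site analysis step.
#file: spiders > amazon_spider.py (sketch; URLs and selectors are illustrative)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import AmazonItem


class AmazonCrawler(CrawlSpider):
    name = "amazon_crawler"
    # custom_settings = {...}  # exactly as shown above

    # Hypothetical listing URL for the Books category; replace with the real one.
    start_urls = ["https://www.amazon.com/s?i=stripbooks"]

    rules = (
        # Follow pagination links on listing pages (selector is an assumption).
        Rule(LinkExtractor(restrict_css="a.s-pagination-item"), follow=True),
        # Visit each product page linked from a listing and hand it to parse_book.
        Rule(LinkExtractor(restrict_css="a.a-link-normal.s-no-outline"), callback="parse_book"),
    )

    def parse_book(self, response):
        # Populate a few representative fields; the remaining ones follow the same pattern.
        item = AmazonItem()
        item["page_url"] = response.url
        item["title"] = response.css("#productTitle::text").get(default="").strip()
        item["price"] = response.css("span.a-price span.a-offscreen::text").get()
        yield item
Running scrapy crawl amazon_crawler -O books.json from the project root would then write the yielded items to a JSON file.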
Avoiding Blocking
You might have noticed that we've made the spider quite slow by setting CONCURRENT_REQUESTS to 1 and adding a delay of 2 seconds between requests via DOWNLOAD_DELAY. This was done to reduce the chance of being blocked while scraping.
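If the target site still pushes back, Scrapy's built-in AutoThrottle extension and retry settings can help further. Below is a sketch of entries that could be merged into the spider's custom_settings; the numbers are reasonable starting points, not values used in this case study.
# Illustrative anti-blocking additions to custom_settings (values are starting points only)
custom_settings = {
    "CONCURRENT_REQUESTS": 1,
    "DOWNLOAD_DELAY": 2,
    "AUTOTHROTTLE_ENABLED": True,      # adapt the delay to how quickly the site responds
    "AUTOTHROTTLE_START_DELAY": 2,
    "AUTOTHROTTLE_MAX_DELAY": 10,
    "RETRY_TIMES": 3,                  # retry transient failures (timeouts, 5xx) a few times
    # ... plus the DEFAULT_REQUEST_HEADERS shown earlier
}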
Exploring Advanced Scrapy Features
Scrapy goes beyond the basics of web scraping. Dive into its treasure trove of functionalities to become a web scraping wizard! Here's a glimpse.
- Customizing Scrapy - Tailor settings, item pipelines, and the interactive Scrapy shell to your specific scraping needs instead of a one-size-fits-all setup.
- Mastering Middleware - Utilize Scrapy's middleware system to intercept requests and responses, fine-tune scraping behavior, and handle tasks like authentication or data transformation (see the sketch after this list).
- Extensions - Explore built-in and third-party extensions for tasks such as throttling requests, handling pagination, or scraping infinite-scrolling sites.
- Scalable Scrapers - Implement features like distributed scraping with Scrapy Cluster or integrate with scheduling tools to run your scrapes efficiently.
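As an example of the middleware point above, here is a minimal downloader middleware sketch that rotates the user agent on each request. The class name and the user-agent pool are placeholders, and it would need to be enabled through the DOWNLOADER_MIDDLEWARES setting (the module path assumes the example bookscraper project).
#file: middlewares.py (illustrative sketch, not part of the case study)
import random


class RotateUserAgentMiddleware:
    # Small placeholder pool; in practice you'd maintain a larger, up-to-date list.
    USER_AGENTS = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        # Pick a user agent per request; returning None lets Scrapy continue as normal.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None


# settings.py - enable the middleware (543 is just a typical priority value)
# DOWNLOADER_MIDDLEWARES = {"bookscraper.middlewares.RotateUserAgentMiddleware": 543}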
Leveling Up Your Skills
The world of web scraping is ever-evolving. Here's how you can stay ahead of the curve.
- Scrapy Documentation - Explore advanced functionalities, delve into customization options, and unlock Scrapy's full potential.
- Scrapy Community - The Scrapy community is a vibrant hub of developers and enthusiasts. Engage in discussions, learn from others' experiences, and contribute your knowledge.
- Practice - The more you scrape, the better you'll become. Experiment with different techniques and keep refining your scraping skills.
- Stay Updated - Keep yourself informed about the latest trends and advancements in web scraping technologies. Explore new libraries, tools, and best practices to stay on top of your game.
Conclusion
The world of web scraping is brimming with possibilities. By making use of the power of Scrapy's advanced features, committing to responsible practices, and keeping your skills sharp, you'll unlock a world of valuable data. The vast expanse of the web will become your domain, filled with information waiting to be discovered. So keep learning, embrace responsible practices, and happy scraping!