Did you know there are over 1.9 billion websites in the world, brimming with valuable data? Imagine the possibilities! Web scraping unlocks this potential by extracting information automatically, saving you countless hours of manual data collection.
This blog explores Scrapy, a powerful Python framework that automates data extraction from websites. We'll show you how to navigate the web like a pro and extract valuable information efficiently, and we'll walk through a case study to help you understand its practical implications.
Let's get started!
Why Web Scraping?
Web scraping automates the process of extracting data from websites. Consider automatically collecting product listings from an e-commerce store to track price changes, or gathering news articles on a specific topic for analysis. Web scraping can be a powerful tool for research, data collection, and automation tasks.
Introducing Scrapy
Scrapy is a free and open-source Python framework specifically designed for web scraping. It takes the complexity out of the process by offering a structured approach. Here's how Scrapy makes your life easier.
- Build efficient spiders - Scrapy uses "spiders" – programs that crawl websites and extract data. You don't need to worry about the low-level details of navigating web pages and parsing HTML.
- Focus on what matters - Scrapy handles the technical aspects, allowing you to concentrate on writing code to target the specific data you need.
- Handle complex structures - Scrapy can navigate websites with pagination (multiple pages of data), forms, and even some JavaScript-heavy content.
Setting Up Your Scrapy Environment
The first step is setting up your Scrapy environment. Luckily, installation is straightforward for all major operating systems. Once installed, you'll create a new Scrapy project using a simple command-line tool. This project serves as the foundation for your web scraping tasks.
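For reference, the typical setup boils down to a couple of commands; the project name bookscraper is just an example used in this post.
# Install Scrapy (ideally inside a virtual environment)
pip install scrapy

# Create a new project; "bookscraper" is an example name
scrapy startproject bookscraper
cd bookscraper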
Amazon Case Study
Web scraping offers a powerful tool for extracting valuable data from websites. This case study delves into how we leveraged web scraping techniques to unlock insights from the vast Amazon marketplace.
1. Site Analysis
Before writing a spider for any site, it's good practice to analyze the site's structure and divide it into sections based on the types of pages we'll iterate through to reach the target data.
The structure of a typical e-commerce site can be broken down into four sections:
Home Page > Categories > Listing > Product Page
In our case, we are only going to target the Books category of Amazon, so we can start right from the Listing Page.
2. Selecting the Spider Type
Scrapy has two main spider types - Spider and CrawlSpider.
I. Spider: The base spider class, suitable for sites whose structure is not well defined or where iterating through anchor links is not practical.
II. CrawlSpider: A subclass of Spider, suitable for sites with a straightforward flow whose pages are easy to reach through anchor links.
Looking at the Amazon Books pages, we can see that each page of the site can be reached through a direct request to an anchor link available on the page, so CrawlSpider is the better fit.
Let's create a custom amazon_spider.py file and import the relevant modules:
#file: spiders > amazon_spider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
3. Defining Data Fields for Parsed Data
Let's define the structure of our parsed data using Scrapy Items before continuing with the spider.
#file: items.py
from scrapy import Item, Field
from typing import List, Optional


class BookDetailsItem(Item):
    # Nested item holding the product details section of a book page.
    pages: Optional[int] = Field()
    language: Optional[str] = Field()
    publisher: Optional[str] = Field()
    publication_date: Optional[str] = Field()
    isbn_10: Optional[str] = Field()
    isbn_13: Optional[str] = Field()


class AuthorItem(Item):
    # Nested item describing the book's author.
    name: Optional[str] = Field()
    about: Optional[str] = Field()
    other_books_link: Optional[str] = Field()


class ReviewItem(Item):
    # Nested item for a single customer review.
    name: Optional[str] = Field()
    rating: Optional[float] = Field()
    title: Optional[str] = Field()
    review: Optional[str] = Field()


class AmazonItem(Item):
    # Top-level item yielded by the spider for each book page.
    asin: Optional[str] = Field()
    title: Optional[str] = Field()
    subtitle: Optional[str] = Field()
    cover_image: Optional[str] = Field()
    author: Optional[AuthorItem] = Field()
    rating: Optional[float] = Field()
    review_count: Optional[int] = Field()
    price: Optional[float] = Field()
    availability: Optional[str] = Field()
    description: Optional[str] = Field()
    book_details: Optional[BookDetailsItem] = Field()
    page_url: Optional[str] = Field()
    reviews: Optional[List[ReviewItem]] = Field()
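Scrapy Items behave like dictionaries: declared fields are set and read with key access, and assigning a key that isn't declared raises a KeyError, which catches typos early. Here's a quick illustration with made-up values, not actual scraped data; the import path depends on your project layout.
from items import AmazonItem, AuthorItem  # adjust the import to your project package

book = AmazonItem()
book["asin"] = "B000000000"                        # placeholder ASIN, not real data
book["title"] = "Example Book Title"
book["price"] = 9.99
book["author"] = AuthorItem(name="Example Author")

print(dict(book))            # Items convert cleanly to plain dicts for export or debugging
# book["publisher"] = "..."  # would raise KeyError: publisher lives on BookDetailsItem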
4. Amazon Spider
Let’s continue with our custom Amazon spider and its different components.
Every site handles page requests differently; some require a specific type of token or session ID with each request before they will treat it as authentic and return the requested page.
Fortunately, in our case, Amazon serves its data as plain HTML pages, which only require a user agent to be returned properly. So let's add one via DEFAULT_REQUEST_HEADERS in the spider's custom settings, along with a couple of other settings to keep the spider running without errors.
#file: spiders > amazon_spider.py
class AmazonCrawler(CrawlSpider):
    name = "amazon_crawler"

    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "DOWNLOAD_DELAY": 2,
        "DEFAULT_REQUEST_HEADERS": {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,"
                      "image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
            "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                          "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        },
    }
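The class above only carries the settings; the crawling behavior itself comes from start_urls, the CrawlSpider rules that decide which anchor links to follow, and a callback that parses each book page. Here is a minimal sketch of how those pieces could fit together (repeating the class header so the snippet is self-contained). The start URL and the CSS selectors are illustrative assumptions rather than the exact values from this case study; Amazon's markup changes often, and the real selectors come out of the site analysis step.
#file: spiders > amazon_spider.py (sketch; URLs and selectors are illustrative)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from ..items import AmazonItem


class AmazonCrawler(CrawlSpider):
    name = "amazon_crawler"
    # custom_settings = {...}  # exactly as shown above

    # Hypothetical listing URL for the Books category; replace with the real one.
    start_urls = ["https://www.amazon.com/s?i=stripbooks"]

    rules = (
        # Follow pagination links on listing pages (selector is an assumption).
        Rule(LinkExtractor(restrict_css="a.s-pagination-item"), follow=True),
        # Visit each product page linked from a listing and hand it to parse_book.
        Rule(LinkExtractor(restrict_css="a.a-link-normal.s-no-outline"), callback="parse_book"),
    )

    def parse_book(self, response):
        # Populate a few representative fields; the remaining ones follow the same pattern.
        item = AmazonItem()
        item["page_url"] = response.url
        item["title"] = response.css("#productTitle::text").get(default="").strip()
        item["price"] = response.css("span.a-price span.a-offscreen::text").get()
        yield item
Running scrapy crawl amazon_crawler -O books.json from the project root would then write the yielded items to a JSON file.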
Avoiding Blocking
You might have noticed that we've made the spider quite slow by setting CONCURRENT_REQUESTS to 1 and adding a delay of 2 seconds between requests via DOWNLOAD_DELAY. This was done to reduce the chance of being blocked while scraping.
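If the target site still pushes back, Scrapy's built-in AutoThrottle extension and retry settings can help further. Below is a sketch of entries that could be merged into the spider's custom_settings; the numbers are reasonable starting points, not values used in this case study.
# Illustrative anti-blocking additions to custom_settings (values are starting points only)
custom_settings = {
    "CONCURRENT_REQUESTS": 1,
    "DOWNLOAD_DELAY": 2,
    "AUTOTHROTTLE_ENABLED": True,      # adapt the delay to how quickly the site responds
    "AUTOTHROTTLE_START_DELAY": 2,
    "AUTOTHROTTLE_MAX_DELAY": 10,
    "RETRY_TIMES": 3,                  # retry transient failures (timeouts, 5xx) a few times
    # ... plus the DEFAULT_REQUEST_HEADERS shown earlier
}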
Exploring Advanced Scrapy Features
Scrapy goes beyond the basics of web scraping. Dive into its treasure trove of functionalities to become a web scraping wizard! Here's a glimpse.
- Customizing Scrapy - Tailor settings, item pipelines, and the interactive Scrapy shell to your specific scraping needs instead of a one-size-fits-all setup.
- Mastering Middleware - Utilize Scrapy's middleware system to intercept requests and responses, fine-tune scraping behavior, and handle tasks like authentication or data transformation (see the sketch after this list).
- Extensions - Explore built-in and third-party extensions for tasks such as throttling requests, handling pagination, or scraping infinite-scrolling sites.
- Scalable Scrapers - Implement features like distributed scraping with Scrapy Cluster or integrate with scheduling tools to run your scrapes efficiently.
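As an example of the middleware point above, here is a minimal downloader middleware sketch that rotates the user agent on each request. The class name and the user-agent pool are placeholders, and it would need to be enabled through the DOWNLOADER_MIDDLEWARES setting (the module path assumes the example bookscraper project).
#file: middlewares.py (illustrative sketch, not part of the case study)
import random


class RotateUserAgentMiddleware:
    # Small placeholder pool; in practice you'd maintain a larger, up-to-date list.
    USER_AGENTS = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    ]

    def process_request(self, request, spider):
        # Pick a user agent per request; returning None lets Scrapy continue as normal.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None


# settings.py - enable the middleware (543 is just a typical priority value)
# DOWNLOADER_MIDDLEWARES = {"bookscraper.middlewares.RotateUserAgentMiddleware": 543}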
Leveling Up Your Skills
The world of web scraping is ever-evolving. Here's how you can stay ahead of the curve.
- Scrapy Documentation - Explore advanced functionalities, delve into customization options, and unlock Scrapy's full potential.
- Scrapy Community - The Scrapy community is a vibrant hub of developers and enthusiasts. Engage in discussions, learn from others' experiences, and contribute your knowledge.
- Practice - The more you scrape, the better you'll become. Experiment with different techniques and keep refining your scraping skills.
- Stay Updated - Keep yourself informed about the latest trends and advancements in web scraping technologies. Explore new libraries, tools, and best practices to stay on top of your game.
Conclusion
The world of web scraping is brimming with possibilities. By making use of the power of Scrapy's advanced features, committing to responsible practices, and keeping your skills sharp, you'll unlock a world of valuable data. The vast expanse of the web will become your domain, filled with information waiting to be discovered. So keep learning, embrace responsible practices, and happy scraping!