    How to Use Scrapy for Web Scraping

    July 5, 2024

    Did you know there are over 1.9 billion websites in the world, brimming with valuable data? Imagine the possibilities! Web scraping unlocks this potential by extracting information automatically, saving you countless hours of manual data collection.

    This blog explores Scrapy, a powerful Python framework that automates data extraction from websites. We'll show you how to navigate the web like a pro, extract valuable information efficiently, and walk through a case study to help you understand its practical implications.

     

    Let's get started!

     

    Why Web Scraping?

    Web scraping automates the process of extracting data from websites. Consider automatically collecting product listings from an e-commerce store to track price changes, or gathering news articles on a specific topic for analysis. Web scraping can be a powerful tool for research, data collection, and automation tasks.

     

    Introducing Scrapy

    Scrapy is a free and open-source Python framework specifically designed for web scraping. It takes the complexity out of the process by offering a structured approach. Here's how Scrapy makes your life easier.

     

    • Build efficient spiders - Scrapy uses "spiders" – programs that crawl websites and extract data. You don't need to worry about the low-level details of navigating web pages and parsing HTML.
    • Focus on what matters - Scrapy handles the technical aspects, allowing you to concentrate on writing code to target the specific data you need.
    • Handle complex structures - Scrapy can navigate websites with pagination (multiple pages of data), forms, and even some JavaScript-heavy content.

     

    Setting Up Your Scrapy Environment

    The first step is setting up your Scrapy environment. Luckily, installation is straightforward for all major operating systems. Once installed, you'll create a new Scrapy project using a simple command-line tool. This project serves as the foundation for your web scraping tasks.
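
    Assuming Python and pip are already available, the setup boils down to two commands (the project name bookscraper is just an example):

```shell
# Install Scrapy from PyPI
pip install scrapy

# Generate a new project skeleton (settings.py, items.py, a spiders/ folder, etc.)
scrapy startproject bookscraper
```

    Spiders you write will live in bookscraper/spiders/, and `scrapy crawl <spider-name>` runs them from inside the project directory.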

    Amazon Case Study

    Web scraping offers a powerful tool for extracting valuable data from websites. This case study delves into how we leveraged web scraping techniques to unlock insights from the vast Amazon marketplace.

    1. Site Analysis

    Before writing a spider for any site, it’s a good approach to analyze the site structure and divide it into sections based on the type of pages we’re going to iterate on to get to the target data.

     

    The most common site structure for any e-commerce site can be categorized into 4 sections.

     

    Home Page > Categories > Listing > Product Page

     

    In our case, we are only going to target the Books category of Amazon so we can start right from the Listing Page.

    2. Selecting the Spider Type

    Scrapy has two main spider types - Spider and CrawlSpider.

     

    i. Spider: The base spider class, suitable for sites whose structure is not well defined or where iterating through anchor links is not optimal.

    ii. CrawlSpider: A subclass of Spider, suitable for sites with a straightforward flow that are easy to iterate through via anchor links.

     

    Looking at the Amazon Books page, we can determine that each page of the site can be reached through a direct request to an anchor link available on the page, so CrawlSpider is the better fit.

     

    Let’s create a custom amazon_spider.py file and import the relevant modules:

     

    #file: spiders > amazon_spider.py

    
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule



    3. Defining Data Fields for Parsed Data

    Let’s define the structure of our parsed data using Scrapy Items before continuing with the spider.

    #file: items.py

    from scrapy import Item, Field
    from typing import List, Optional
    
    
    class BookDetailsItem(Item):
        pages: Optional[int] = Field()
        language: Optional[str] = Field()
        publisher: Optional[str] = Field()
        publication_date: Optional[str] = Field()
        isbn_10: Optional[str] = Field()
        isbn_13: Optional[str] = Field()
        
    
    class AuthorItem(Item):
        name: Optional[str] = Field()
        about: Optional[str] = Field()
        other_books_link: Optional[str] = Field()
    
    
    class ReviewItem(Item):
        name: Optional[str] = Field()
        rating: Optional[float] = Field()
        title: Optional[str] = Field()
        review: Optional[str] = Field()
        
    
    class AmazonItem(Item):
        asin: Optional[str] = Field()
        title: Optional[str] = Field()
        subtitle: Optional[str] = Field()
        cover_image: Optional[str] = Field()
        author: Optional[AuthorItem] = Field()
        rating: Optional[float] = Field()
        review_count: Optional[int] = Field()
        price: Optional[float] = Field()
        availability: Optional[str] = Field()
        description: Optional[str] = Field()
        book_details: Optional[BookDetailsItem] = Field()
        page_url: Optional[str] = Field()
        reviews: Optional[List[ReviewItem]] = Field()

    4. Amazon Spider

    Let’s continue with our custom Amazon spider and its different components.

    • Spider Settings

    Each site handles page requests differently; some require a specific token or session ID with every request before the page is considered authentic and returned.

    Fortunately, in our case, Amazon serves its data as plain HTML pages and only requires a user-agent header. Let's add one as DEFAULT_REQUEST_HEADERS in the spider's custom settings, along with a couple of other settings to keep the spider running without errors.

    #file: spiders > amazon_spider.py

     

    class AmazonCrawler(CrawlSpider):
        name = "amazon_crawler"
        
        custom_settings = {
            "CONCURRENT_REQUESTS": 1,
            "DOWNLOAD_DELAY": 2,
            
            "DEFAULT_REQUEST_HEADERS": {
                "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp," \
                            "image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
                "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " \
                                "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
                    
            }
        }

    Avoiding Blocking

    You might have noticed that we’ve made the spider quite slow by setting CONCURRENT_REQUESTS to 1 and adding a 2-second delay between requests via DOWNLOAD_DELAY. This was done to avoid being blocked while scraping.
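
    Hard-coding a delay works, but Scrapy also ships an AutoThrottle extension that adapts the delay to the server's response latency. A sketch of politeness-oriented settings - the values here are illustrative, not tuned:

```python
# Politeness-oriented settings sketch for a Scrapy spider or settings.py.
POLITE_SETTINGS = {
    "CONCURRENT_REQUESTS": 1,          # one request in flight at a time
    "DOWNLOAD_DELAY": 2,               # base delay between requests (seconds)
    "RANDOMIZE_DOWNLOAD_DELAY": True,  # jitter the delay to look less robotic
    "AUTOTHROTTLE_ENABLED": True,      # let Scrapy adapt the delay to latency
    "AUTOTHROTTLE_START_DELAY": 2,
    "AUTOTHROTTLE_MAX_DELAY": 10,
}
```

    These can be merged into the spider's custom_settings dict or set project-wide in settings.py.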

     

    Exploring Advanced Scrapy Features 

    Scrapy goes beyond the basics of web scraping. Dive into its treasure trove of functionalities to become a web scraping wizard! Here's a glimpse.

     

    • Customizing Scrapy - Craft bespoke Scrapy shells tailored to your specific scraping needs. No more one-size-fits-all approaches!
    • Mastering Middleware - Utilize Scrapy's middleware system to intercept requests and responses, fine-tune scraping behavior, and handle tasks like authentication or data transformation.
    • Extensions - Explore functionalities like handling pagination or scraping infinite-scrolling websites.
    • Scalable Scrapers - Implement features like distributed scraping with Scrapy Cluster or integrate with scheduling tools to run your scrapes efficiently.
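
    The middleware bullet above can be sketched as a downloader middleware that rotates user agents - `process_request` is Scrapy's real hook for this, while the agent list and class name here are assumptions for illustration:

```python
import random

# Illustrative pool; extend with real, current browser strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]


class RotateUserAgentMiddleware:
    """Downloader middleware: picks a random User-Agent for every request."""

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(USER_AGENTS)
        return None  # returning None lets Scrapy continue processing the request
```

    Enabling it is a matter of registering the class in the DOWNLOADER_MIDDLEWARES setting, e.g. {"myproject.middlewares.RotateUserAgentMiddleware": 400}.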

     

    Leveling Up Your Skills

    The world of web scraping is ever-evolving. Here's how you can stay ahead of the curve.

     

    • Scrapy Documentation - Explore advanced functionalities, delve into customization options, and unlock Scrapy's full potential.
    • Scrapy Community - The Scrapy community is a vibrant hub of developers and enthusiasts. Engage in discussions, learn from others' experiences, and contribute your knowledge.
    • Practice - The more you scrape, the better you'll become. Experiment with different techniques and refine your scraping skills.
    • Stay Updated - Keep yourself informed about the latest trends and advancements in web scraping technologies. Explore new libraries, tools, and best practices to stay on top of your game.

     

    Conclusion

    The world of web scraping is brimming with possibilities. By making use of the power of Scrapy's advanced features, committing to responsible practices, and keeping your skills sharp, you'll unlock a world of valuable data. The vast expanse of the web will become your domain, filled with information waiting to be discovered. So keep learning, embrace responsible practices, and happy scraping!


      Hijab e Fatima

      I’m a technical content writer with a passion for all things AI and ML. I love diving deep into complex topics and breaking them down into digestible information. When I’m not writing, you can find me exploring anything and everything trending.
