In 2011, a team of scientists led by Martin Hilbert estimated that the world’s capacity to store data in 2007 was 295 optimally compressed exabytes. If you stored all of that on standard 730-MB CD-ROM discs, the stack of CDs would reach the moon and extend about 96,100 km (59,713.7 miles) beyond it.
But of course that was the capacity 10 years ago, and CD-ROMs themselves are now on the way out. Meanwhile the Internet, that vast virtual behemoth sprung from human ingenuity, is bigger than ever before. With over a billion registered websites and the average web page clocking in at the same file size as a compressed copy of the video game Doom, Big Data is here to stay. There’s a lot of value locked up in those bytes. Whether it’s inventorying the goods sold on an infinitely expandable online marketplace or keeping up with updates from around the world, the Web takes the term “information overload” to a whole new level.
So what are the unique challenges posed by scraping meaningful data from massive, increasingly complex websites? And how do we deal with these effectively? That’s what we’re going to find out.
Someone once pointed out, “Facebook is like a giant castle where new rooms are constantly built, old ones destroyed, and about a billion people have moved in.” That chaos captures Web 2.0 very well. Websites don’t just get bigger; they get more diverse, branching out into new content, features, and structures. Modern websites belong to their users, and these users are unpredictable. Even where the platform restricts users to specific types of content, people find ways to stand out. And that’s before you consider how spread out most major websites are nowadays, with different versions for different locations, languages, devices, and markets.
None of this makes data scraping any easier, and variations are one of the biggest challenges when it comes to scraping large websites.
Along with all that content come more places to put it. Large aggregator websites typically have several hundred thousand URLs and many more data repositories than smaller sites, which means your spiders need to make a lot more hits to collect all the required information. That means more chances of raising red flags and a greater likelihood of getting slammed with bandwidth restrictions or other anti-bot measures. This is a problem when you have a legitimate need to access that data, especially when your other business processes depend on it.
This one is obvious: the larger a website is, the more data it contains. And the more data it contains, the longer it takes to scrape that site. This may be fine if your purpose for scanning the site isn’t time-sensitive, but that isn’t often the case. Stock prices don’t stay the same over hours, let alone days. Sales listings, currency exchange rates, media trends, and market prices are just a few examples of time-sensitive data critical to good business strategy and decision-making. When you don’t get this information in time, you lose opportunities, income, and ultimately credibility.
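To get a feel for how quickly crawl time adds up, here is a back-of-the-envelope sketch. The page count, the one-second cost per request, and the function name `estimated_crawl_hours` are all hypothetical, and the model assumes requests are spread evenly across parallel workers, which real crawls only approximate.

```python
def estimated_crawl_hours(num_pages, seconds_per_request, concurrent_requests=1):
    """Rough wall-clock estimate, assuming work is split evenly across workers."""
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    total_seconds = num_pages * seconds_per_request / concurrent_requests
    return total_seconds / 3600


# Hypothetical site: one million pages at one second per request.
sequential = estimated_crawl_hours(1_000_000, 1.0)
parallel = estimated_crawl_hours(1_000_000, 1.0, concurrent_requests=16)

print(f"one request at a time: {sequential:.1f} hours")
print(f"16 requests in parallel: {parallel:.1f} hours")
```

Under these assumptions, a strictly sequential crawl takes about 278 hours (more than 11 days), while 16 parallel requests bring it down to roughly 17 hours. That gap is why concurrency matters so much for time-sensitive data.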
At Arbisoft, we’ve scraped some very large websites, including a few with over 1 million product listings. According to Arbisoft engineer Jawad, a task like that could easily have taken several days or longer. So how did we make sure it finished quickly, efficiently, and without busting the budget?
Data scraping looks deceptively easy these days; there are a lot more free and open-source tools now, and infrastructure costs have gone down significantly in recent years with the mainstreaming of cloud computing. But scaling this for massive websites is another matter entirely.
We’ve been doing this for 6 years now and what we’ve learned is that success depends on three things:
#1 is easy: our preferred tool is Scrapy, an open-source framework written in Python that lets our engineering team do what they do best without getting in the way. Jawad says, “My favorite thing about Scrapy is how flexible it is--I can write crawlers exactly the way we need and Scrapy integrates well with that. It’s customizable by design.”
Scrapy also gives us access to two other important tools for tackling large-scale website scraping projects: Frontera and Scrapy Redis. Frontera lets you send out only one request per domain at a time, but it can hit multiple domains at once, making it great for parallel scraping. Scrapy Redis, which is made by the same people as Scrapy, lets you send out multiple requests to one domain. The right combination of these can result in a very powerful web spider that can handle both the bulk and the variation seen in large websites.
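The difference between the two approaches can be pictured with a small scheduling sketch in plain Python. This is not the Frontera or Scrapy Redis API; the function `next_batch`, its `per_domain_limit` parameter, and the example URLs are hypothetical, and the code only illustrates the idea of capping how many requests go to each domain in one parallel batch.

```python
from collections import defaultdict
from urllib.parse import urlparse


def next_batch(pending_urls, per_domain_limit=1):
    """Split pending URLs into a batch to fetch now and a list to fetch later.

    Illustrative only: at most `per_domain_limit` URLs per domain go into
    the batch, and everything else waits for a later round.
    """
    taken = defaultdict(int)  # domain -> number of URLs already in this batch
    batch, remaining = [], []
    for url in pending_urls:
        domain = urlparse(url).netloc
        if taken[domain] < per_domain_limit:
            taken[domain] += 1
            batch.append(url)
        else:
            remaining.append(url)
    return batch, remaining


pending = [
    "https://shop.example/p/1",
    "https://shop.example/p/2",
    "https://news.example/a/1",
    "https://shop.example/p/3",
]

# One request per domain, several domains at once.
polite_batch, polite_rest = next_batch(pending)

# Several requests to the same domain in one batch.
bulk_batch, bulk_rest = next_batch(pending, per_domain_limit=2)
```

With a limit of one, the first batch holds one URL from each domain and the other two shop URLs wait their turn. Raising the limit lets more requests reach a single large site in parallel, at the cost of putting more load on that site.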
This leaves you with the problem of bot detection and bandwidth. There’s a rule of thumb for that: play nice with other people’s servers, and they won’t have a reason to ban you. That’s not only good for avoiding throttling issues, it’s also the ethical way to do data scraping. We recommend pacing crawls by incorporating random intervals between individual requests, and switching up IPs with Crawlera.
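As a rough illustration of pacing, a Scrapy project’s `settings.py` might include something like the fragment below. The setting names are standard Scrapy settings, but the values are assumptions picked for the example rather than the configuration we use, and the helper `delay_bounds` is a hypothetical function added only to show the range of waits that randomization produces. Proxy rotation is not shown.

```python
# settings.py fragment for a Scrapy project (values are illustrative assumptions)

ROBOTSTXT_OBEY = True               # respect the site's robots.txt rules
DOWNLOAD_DELAY = 2.0                # base delay, in seconds, between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True     # wait 0.5x to 1.5x DOWNLOAD_DELAY instead of a fixed gap
CONCURRENT_REQUESTS_PER_DOMAIN = 2  # cap parallel requests against any one domain
AUTOTHROTTLE_ENABLED = True         # adjust the delay based on how the server is responding


def delay_bounds(base_delay):
    """Hypothetical helper: the range of waits when randomization is on."""
    return 0.5 * base_delay, 1.5 * base_delay


low, high = delay_bounds(DOWNLOAD_DELAY)
print(f"requests will be spaced between {low} and {high} seconds apart")
```

A longer base delay and a lower per-domain cap make the crawl slower but gentler on the target server, so the right values depend on how time-sensitive the data is and how much load the site can reasonably absorb.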
Ultimately, tackling the issues of scraping large websites comes down to finding the right balance. Large websites often serve large numbers of users and you have to work with that fact in mind; prioritize your processes so you don’t hog resources you don’t need or make the user experience worse for everyone. Data scraping large websites doesn’t have to be us versus them.
Authored by: Fatima A. Athar