Scrapy and Realtime Crawling: To Do or Not To Do?

Posted by: Nudrrat Khawaja 1 year, 5 months ago

If you read our recent blog post, you know we’re big on using Scrapy for a lot of our large-scale website crawling needs. But there’s another kind of website crawling scenario that also requires a well-crafted crawler and the right strategy: realtime crawling. Realtime crawling is the process of providing crawl results nearly instantaneously, as they are updated at the source. You don’t need to rebuild your entire inventory of data--just pull what’s requested, as it’s requested.

Realtime crawling becomes increasingly important and perhaps even unavoidable when you’re working with large amounts of data prone to rapid changes--especially in a scenario where users can’t wait to see the results, or you don’t want them to wait. Think: stock listings, trending hashtags, live feeds, and room availability. Because it is so situation-dependent, the best tools for the job vary from project to project depending on the exact requirements. But some things don’t change: if you’re doing realtime crawling, you need speed and you need a quick response time. It’s not realtime crawling if it’s not, you know, realtime.

So could Scrapy fit the bill? Time to bring in our engineers again.

Right away, our engineers want us to know there’s one big disadvantage: Scrapy has an increased overhead when starting up. That means in a situation where milliseconds matter, Scrapy will take a few whole seconds before it even sends out a request for the required data. That causes a delay. Does this mean we can’t use Scrapy for any realtime crawling projects?

Not necessarily. Scrapy has two things working in its favor.

Scrapy’s extensible design means there’s already a purpose-built HTTP API that lets Scrapy break out of its default behavior into something better suited to realtime crawling. It’s called ScrapyRT, and it gives businesses a pain-free way to enable realtime interactions between end users and Scrapy. Instead of running spiders on a fixed schedule, Scrapy can run them on demand. It’s great for crawling unit data, such as ticket counts for events or venues.
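To make that a little more concrete, here is a minimal sketch of what an on-demand request to ScrapyRT could look like from the client side. It assumes a ScrapyRT server running locally on its default port (9080) and a spider named "tickets"; the spider name and the target URL are made up for illustration and are not from any real project.

```python
import json
from urllib.parse import urlencode


def build_crawl_url(spider_name, target_url, base="http://localhost:9080"):
    """Build the URL for a GET request to ScrapyRT's crawl.json endpoint.

    The base address is an assumption (a local server on the default port).
    """
    query = urlencode({"spider_name": spider_name, "url": target_url})
    return f"{base}/crawl.json?{query}"


def extract_items(response_body):
    """Return the scraped items from a ScrapyRT JSON response body."""
    payload = json.loads(response_body)
    if payload.get("status") != "ok":
        # ScrapyRT reports failures with a status other than "ok".
        raise RuntimeError(payload.get("message", "crawl failed"))
    return payload.get("items", [])


# Hypothetical spider name and page, used only for illustration.
crawl_url = build_crawl_url("tickets", "https://example.com/event/42")
```

With a server actually running, the body returned by fetching crawl_url (for example with urllib.request.urlopen) could be handed straight to extract_items, so the end user only waits for one page to be crawled rather than a whole scheduled run.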

That sorts out the realtime part of realtime crawling. What’s the other advantage of using Scrapy? This one takes a bit of critical thinking: when you’re updating large volumes of data piecemeal, on demand and with a focus on speed, what’s one easily overlooked thing that could lead to big trouble?

Answer: quality of data. Because you’re crawling data in realtime to update a database rather than rebuild it, minor errors can be virtually undetectable and still damage the integrity of the entire data set. Not to mention the loss of credibility and potential legal problems such an error could create.

Fortunately, Scrapy has a solution for this: it has built-in mechanisms to let you implement data validation just as you would for other crawl scenarios. To Scrapy, it’s just another crawl, except it’s on demand. You don’t have to trade data quality for speed, as often happens in realtime crawling. That means it can save you a lot of headaches down the datastream.
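One common place to put that validation in Scrapy is an item pipeline, whose process_item method can raise DropItem to discard a bad record before it reaches your database. The sketch below shows the idea under some assumptions: the field names (event_id, tickets_left) and the rules are hypothetical, and the fallback DropItem class exists only so the snippet runs on a machine without Scrapy installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch runs without Scrapy; not part of Scrapy itself.
    class DropItem(Exception):
        pass


class TicketValidationPipeline:
    """Reject items that would silently corrupt the stored data."""

    # Hypothetical field names, chosen for illustration only.
    REQUIRED_FIELDS = ("event_id", "tickets_left")

    def process_item(self, item, spider=None):
        # Every required field must be present and non-empty.
        for field in self.REQUIRED_FIELDS:
            if item.get(field) in (None, ""):
                raise DropItem(f"missing field: {field}")

        # A ticket count has to be a non-negative whole number.
        count = item["tickets_left"]
        if isinstance(count, bool) or not isinstance(count, int) or count < 0:
            raise DropItem(f"invalid tickets_left: {count!r}")

        return item
```

In a real project a pipeline like this is switched on through the ITEM_PIPELINES setting, so it would apply to on-demand crawls in the same way as to scheduled ones.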

Our final verdict? Realtime crawling isn’t an amateur sport; there might be legitimate reasons to put ScrapyRT to work for you, but it takes a professional to understand the how, why, and (most importantly) when of it.

 

Authored by: Fatima A. Athar

 
