If you read our recent blog post, you know we’re big on using Scrapy for many of our large-scale website crawling needs. But another kind of crawling scenario also calls for a well-crafted crawler and the right strategy: realtime crawling. Realtime crawling means delivering crawl results nearly instantaneously, reflecting the source as it is updated. You don’t need to rebuild your entire inventory of data; you just pull what’s requested, as it’s requested.
In 2011, a team of scientists led by Martin Hilbert estimated that the total amount of data in the world in 2007 was 295 optimally compressed exabytes. If you stored all of this data on standard 730 MB CD-ROMs, the stack of discs would reach the moon and extend another 96,100 km (59,713.7 miles) beyond it.
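For readers who want to sanity-check the CD comparison, here is a rough back-of-the-envelope sketch. The disc thickness (1.2 mm), the average Earth-Moon distance (384,400 km), and the use of decimal units (1 MB = 10^6 bytes, 1 EB = 10^18 bytes) are assumptions made for this illustration, not figures taken from the study, so the result is only approximate.

```python
# Back-of-the-envelope check of the CD-stack comparison.
# Assumptions (not from the study): decimal units, a 1.2 mm thick disc,
# and an average Earth-Moon distance of 384,400 km.

TOTAL_BYTES = 295 * 10**18      # 295 exabytes
CD_BYTES = 730 * 10**6          # 730 MB per disc
CD_THICKNESS_MM = 1.2           # assumed thickness of one disc
MOON_DISTANCE_KM = 384_400      # assumed average Earth-Moon distance


def cd_stack_height_km(total_bytes: int = TOTAL_BYTES) -> float:
    """Return the height in km of a stack of CDs holding total_bytes."""
    discs = total_bytes / CD_BYTES
    return discs * CD_THICKNESS_MM / 1_000_000  # mm -> km


if __name__ == "__main__":
    height = cd_stack_height_km()
    beyond = height - MOON_DISTANCE_KM
    print(f"discs needed:    {TOTAL_BYTES / CD_BYTES:.3e}")
    print(f"stack height:    {height:,.0f} km")
    print(f"beyond the moon: {beyond:,.0f} km")
```

Under these assumptions the stack comes out at a little under 485,000 km, which overshoots the moon by roughly a quarter of the Earth-Moon distance. Different unit conventions or disc thicknesses would shift the exact number.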