Scrapy and Realtime Crawling: To Do or Not To Do?

If you read our recent blog post, you know we’re big on using Scrapy for many of our large-scale website crawling needs. But there’s another kind of crawling scenario that also requires a well-crafted crawler and the right strategy: realtime crawling. Realtime crawling means delivering crawl results nearly instantaneously, as the data is updated at the source. You don’t need to rebuild your entire inventory of data; you just pull what’s requested, as it’s requested.

Large-scale website scraping issues and the tools to resolve them

In 2011, a team of scientists led by Martin Hilbert estimated that the total amount of data in the world in 2007 was 295 optimally compressed exabytes. If you stored all this data on standard 730-MB CD-ROMs, the stack of discs would reach to the moon and 96,900 km (59,713.7 miles) beyond.
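The CD comparison above can be sanity-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate: the 295 exabytes and 730 MB figures come from the text, while the decimal units, the 1.2 mm disc thickness, and the 384,400 km average Earth-Moon distance are assumptions added here for illustration.

```python
# Back-of-the-envelope check of the CD-stack comparison.
# Assumptions (not from the post): decimal units, 1.2 mm per disc,
# 384,400 km average Earth-Moon distance.

TOTAL_BYTES = 295 * 10**18        # 295 exabytes (figure quoted in the post)
CD_CAPACITY_BYTES = 730 * 10**6   # 730 MB per disc (figure quoted in the post)
CD_THICKNESS_M = 1.2e-3           # assumed disc thickness in metres
MOON_DISTANCE_KM = 384_400        # assumed average Earth-Moon distance


def cd_stack_height_km(total_bytes, cd_capacity_bytes, thickness_m):
    """Return (number of discs, stack height in km) for a given data volume."""
    discs = -(-total_bytes // cd_capacity_bytes)  # ceiling division on ints
    height_km = discs * thickness_m / 1000        # metres -> kilometres
    return discs, height_km


discs, height_km = cd_stack_height_km(TOTAL_BYTES, CD_CAPACITY_BYTES, CD_THICKNESS_M)
beyond_km = height_km - MOON_DISTANCE_KM

print(f"discs needed: {discs:,}")
print(f"stack height: {height_km:,.0f} km")
print(f"past the Moon by: {beyond_km:,.0f} km")
```

Under these assumptions the stack comes out at roughly 485,000 km, which is in the same range as the figure quoted above but not identical to it. The exact overshoot depends on the disc thickness and lunar distance you plug in.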

Trust is a core value

Tags: Core values

When we started Arbisoft, we were very clear about a set of core values that we agreed to uphold in the face of all resistance and opposition. Here are the ones we started with:

Marc Andreessen agrees with us!

Tags: Training

This is something we have been saying and preaching at Arbisoft for at least a couple of years now: we have to take high-quality training and education into our own hands. We jumped on the Udacity bandwagon when very few people knew about it. I have even been encouraging our team members to take some Khan Academy courses to refresh their understanding of core subjects like Linear Algebra, and we are really excited about the possibilities created by other such ventures (edX, Coursera, Codecademy). If you know of any other great startups or ventures in this space, please share them in the comments section.

