Crawl Pipeline

The "Crawl Pipeline" is an open infrastructural project, which gives access to data which has been collected by web Crawlers: automated processes which collect the content of websites. Instead of crawling your own data for your own project, you can get your usefull data from us: cooperation to get better quality, a wider reach, using less resources.

All public websites together contain more than one petabyte of data. At the moment, many organizations run their own crawlers for their own purposes. When you switch to our infrastructure, you do not have to crawl yourself: you just collect the extract that your project requires. You may only need to handle a few megabytes per day: cheap and fast.

Let us explain the details.



The infrastructure

At the moment, the "Crawl Pipeline" runs on one server, with 9 batches in parallel. It is able to process the CommonCrawl published data-set, which is 350TB per month. It runs three Tasks.

Have a look at the provided filter rules and extraction options to see whether we offer enough to save you from crawling yourself. Can we run your Task? A sketch of what a Task could express follows below.
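
The following is a hedged sketch of such a Task: a hypothetical filter on the target URL and content type, combined with a simple title extraction. The field names, the Task structure, and the helper function are illustrative assumptions, not the Pipeline's actual configuration interface.

  # A hypothetical Task definition: filter rules plus one extraction step.
  # The structure is illustrative only; it is not the Pipeline's real interface.
  import re

  TASK = {
      "name": "dutch-page-titles",          # assumed example Task
      "filter": {
          "url_suffix": ".nl",              # keep only .nl hosts
          "content_type": "text/html",
      },
      "extract": "title",
  }

  TITLE_RE = re.compile(rb"<title[^>]*>(.*?)</title>", re.I | re.S)

  def run_task(task, url, content_type, body):
      """Apply the Task's filter to one record; return the extract or None."""
      host = url.split("/", 3)[2] if "://" in url else url
      if not host.endswith(task["filter"]["url_suffix"]):
          return None
      if task["filter"]["content_type"] not in content_type:
          return None
      match = TITLE_RE.search(body)
      return match.group(1).decode("utf-8", "replace").strip() if match else None

  # Example record:
  print(run_task(TASK, "https://example.nl/page", "text/html",
                 b"<html><title>Voorbeeld</title></html>"))

The point of the filter step is that records which do not match are dropped before extraction, so only the small matching fraction of the crawl ever reaches your project.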


Application

  • Use of our infrastructure is free of charge (fair use);
  • You may use the data commercially, but the main intended use is research and education;
  • You have to respect the GDPR (and other EU law) that applies to the source data, which means that each project starts with an action plan.

Sponsors