The core feature of the Pipeline is open cooperation: we sincerely hope that many projects will share their crawling efforts, both by reusing crawled data and by implementing quality improvements. Please join us!

Current set-up

At the moment, the "Crawl Pipeline" runs on one server, provided for free by the independent Dutch hosting provider ProcoliX. The server runs 6 pipes in parallel. It is able to process the data-set published by CommonCrawl, which is 350TB per month. It currently runs three Tasks.
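For a sense of scale, the figures above imply a substantial sustained processing rate. A rough back-of-the-envelope estimate (assuming a 30-day month and decimal terabytes; these assumptions are ours, not from the project):

```python
# Rough throughput needed to keep up with the CommonCrawl data-set:
# 350 TB per month, spread over 6 parallel pipes (figures from the text;
# the 30-day month and decimal TB are assumptions for this estimate).

TB = 10**12                        # one terabyte in bytes (decimal)
dataset_bytes = 350 * TB           # one monthly data-set
seconds_per_month = 30 * 24 * 3600
pipes = 6

aggregate_bps = dataset_bytes / seconds_per_month
per_pipe_bps = aggregate_bps / pipes

print(f"aggregate: {aggregate_bps / 1e6:.0f} MB/s")
print(f"per pipe:  {per_pipe_bps / 1e6:.1f} MB/s")
```

That works out to roughly 135 MB/s sustained across the server, or about 22.5 MB/s per pipe, before any decompression or analysis overhead.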

Joining the Pipeline

You can join in different ways:

  1. contribute (WARC) archives collected by your crawler
  2. become a data user: submit your Task
  3. host a server which runs a set of pipeline processes

The preferred contribution is number 2: please use our service instead of processing CommonCrawl data yourself. Let us reuse resources!
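Contributed archives (option 1) use the WARC format: each record starts with a plain-text header block, ending in a blank line, followed by the payload. As a minimal illustration, here is a sketch of parsing one record's headers with only the Python standard library; the sample record bytes are invented for illustration, and a production reader would also handle gzip compression and multi-record files.

```python
# Minimal sketch: parse the header block of a single WARC record.
# Real crawl archives are gzip-compressed and contain many records;
# this only shows the header layout. The sample bytes are made up.

def parse_warc_headers(block: bytes) -> dict:
    """Parse a WARC record header block into a dict.

    The version line is stored under the key 'WARC-Version'.
    """
    lines = block.decode("utf-8").split("\r\n")
    headers = {"WARC-Version": lines[0]}   # e.g. "WARC/1.0"
    for line in lines[1:]:
        if not line:
            break                          # blank line ends the header block
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

sample = (b"WARC/1.0\r\n"
          b"WARC-Type: response\r\n"
          b"WARC-Target-URI: http://example.org/\r\n"
          b"Content-Length: 0\r\n"
          b"\r\n")

print(parse_warc_headers(sample))
```

The `Content-Length` header tells a reader how many payload bytes follow the blank line, which is how records are delimited within one archive file.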

Contact: Mark Overmeer (in English, Dutch, or German)


The NLnet Foundation supports hundreds of internet-related development projects, currently with a focus on search and security. It also supports this project with a very generous donation.

ProcoliX is a no-nonsense hosting provider with its roots in the Open Source community. It has its own data-centers in The Netherlands, and is not owned by any international conglomerate. It provides the hardware and networking needed to run the pipeline.


The Open Search Foundation aims to create a fully open search engine (a competitor to Google and Bing) under European privacy law. Its openness will facilitate cooperation and research.

The American Common Crawl Foundation has been crawling web-pages since 2008. At the moment, they produce a new data-set about 9 times per year. Each set is about 350TB in size. Our Pipeline currently uses this data-set for all projects.