Cooperation

The core feature of the Pipeline is open cooperation: we sincerely hope that many projects will share their crawling efforts, both by reusing crawled data and by implementing quality improvements. Please join us!

Current set-up

At the moment, the "Crawl Pipeline" runs on one server, provided for free by the independent Dutch hosting provider ProcoliX. The server runs 6 pipes in parallel. It is able to process the data-set published by CommonCrawl, which is 350TB per month. It currently runs three Tasks.

Joining the Pipeline

You can join in different ways:

1. contribute (WARC) archives, collected by your crawler
2. become a data user: submit your Task
3. host a server which runs a set of pipeline processes

The preferred contribution is number 2: please use our service instead of processing CommonCrawl data yourself. Let us share resources!

Contact Mark Overmeer at mark@overmeer.net (in English, Dutch, or German).

Cooperating

The Open Search Foundation aims to create a fully open search engine (a competitor to Google and Bing) under European privacy law. Its openness will facilitate cooperation and research.

The American Common Crawl Foundation has been crawling web pages since 2008. At the moment, they produce a new data-set about 9 times per year. Each set is about 350TB in size. Our Pipeline currently uses this data-set for all projects.
