Internals

The Pipeline is an open infrastructure: you may contribute hardware, but you can also use the capacity that other people offer. At the moment, there is sufficient hardware capacity.

At the moment, the pipeline only processes crawl data which is published statically by third parties. A logical next step would be to integrate this Pipeline logic into the crawling efforts themselves.


The Pipeline process

The abstract concept is quite simple:

  • The "Crawl Pipeline" runs on a large number of servers;
  • Each server runs a number of pipeline batches in parallel;
  • Each batch process handles a large number of crawled Pages;
  • Interested parties define Tasks, which are applied to each collected page;
  • Each Task defines filter rules and extraction tools; the results are packaged for download (see the sketch after this list);
  • You MUST download the result packages at least once a day, otherwise they are removed;
  • Legal restrictions MAY apply to the data you have received, especially around long-term storage. You need to study the data license.
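
To make the batch/Task idea more concrete, here is a minimal sketch in Python. It is not the Pipeline's real API: all names (Page, CrawlTask, run_batch, the "demo" task) are illustrative assumptions. It only shows the shape of the concept: a Task couples a filter rule with an extraction tool, and a batch applies every Task to every collected page, collecting results per Task for packaging.

    # Minimal sketch of the batch/Task concept described above.
    # All class and function names are hypothetical, not the Pipeline's real API.
    from dataclasses import dataclass
    from typing import Callable, Iterable
    import json

    @dataclass
    class Page:
        url: str
        html: str

    @dataclass
    class CrawlTask:
        name: str
        # filter rule: decide whether a page is of interest to this Task
        accept: Callable[[Page], bool]
        # extraction tool: turn an accepted page into a result record
        extract: Callable[[Page], dict]

    def run_batch(pages: Iterable[Page], tasks: list[CrawlTask]) -> dict[str, list[dict]]:
        """Apply every Task to every page; collect the results per Task."""
        results: dict[str, list[dict]] = {t.name: [] for t in tasks}
        for page in pages:
            for task in tasks:
                if task.accept(page):
                    results[task.name].append(task.extract(page))
        return results

    # Example Task: keep only pages mentioning "pipeline", record URL and size.
    demo_task = CrawlTask(
        name="demo",
        accept=lambda p: "pipeline" in p.html.lower(),
        extract=lambda p: {"url": p.url, "bytes": len(p.html)},
    )

    pages = [Page("https://example.org/a", "<html>About the pipeline</html>"),
             Page("https://example.org/b", "<html>Unrelated</html>")]

    # In the real Pipeline, results like these would be packaged for download.
    print(json.dumps(run_batch(pages, [demo_task]), indent=2))

In the actual infrastructure, many such batches run in parallel on many servers, and the packaged results are what you must fetch within a day before they are removed.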

Explaining

  • The process
  • Terminology
  • Legal issues