Pipeline Tasks

If you want information from crawled websites, you submit a Task to the people running this project. A Task is a component that runs inside the processing pipelines.

Every Task should have exactly one purpose: write different Tasks when you have different needs.

Task components

Each Task has the following components:

Filter rules: which pages contain data that is useful to you;
Extraction tools: which data you would like to collect from those pages (the knowledge extraction effort is shared); and
Packaging options: how the extracted data is transported to you, usually as zip or WARC files downloaded via HTTP or FTP.
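
The three components above can be pictured as one small object plus a runner. What follows is only a minimal sketch under stated assumptions, not the project's actual interface: the names `Page`, `Task`, `accepts`, `extract` and `run_task` are hypothetical, and zip is used for packaging simply because the Python standard library supports it.

```python
import io
import json
import zipfile
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class Page:
    """Hypothetical stand-in for one crawled page."""
    url: str
    content_type: str
    body: str


@dataclass
class Task:
    """One Task, one purpose (all field names are assumptions)."""
    name: str
    accepts: Callable[[Page], bool]            # filter rule
    extract: Callable[[Page], Dict[str, str]]  # extraction tool


def run_task(task: Task, pages: Iterable[Page]) -> bytes:
    """Filter the pages, extract a record from each accepted page,
    and package all records as an in-memory zip archive."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        accepted = (page for page in pages if task.accepts(page))
        for index, page in enumerate(accepted):
            record = task.extract(page)
            archive.writestr(f"{task.name}/{index}.json", json.dumps(record))
    return buffer.getvalue()


# Example Task: collect the URL and body length of every HTML page.
html_sizes = Task(
    name="html-sizes",
    accepts=lambda page: page.content_type == "text/html",
    extract=lambda page: {"url": page.url, "length": str(len(page.body))},
)

pages = [
    Page("http://example.com/", "text/html", "<html>hello</html>"),
    Page("http://example.com/logo.png", "image/png", ""),
]
zip_bytes = run_task(html_sizes, pages)
```

Keeping the filter and the extractor as separate callables mirrors the one-purpose rule: a different need becomes a different `Task` value rather than an extra branch inside an existing one.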

Explanations

The process

Terminology:

Pipeline Tasks;
Pipeline Pages;

Legal issues:

Data License
SLA