Pipeline Tasks

If you are interested in information from the crawled websites, you submit a Task to the people running this project. A Task is a component that runs in the processing pipelines.

Every Task should have exactly one purpose: write different Tasks when you have different needs.

Task components

Each Task has the following components:

  • Filter rules: which pages contain data that is useful to you;
  • Extraction tools: which data you would like to collect from those pages (the knowledge-extraction effort is shared); and
  • Packaging options: how the extracted data is delivered to you, usually as zip or WARC files that you download via HTTP or FTP.
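
The three components above can be pictured as three small functions bundled into one object. The following is only a minimal sketch in Python, not the project's actual interface: the names Page, Task, zip_package and title_task, as well as the records.json file name and the example.org filter, are all hypothetical and chosen for illustration.

```python
import io
import json
import re
import zipfile
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Page:
    """Hypothetical stand-in for one crawled page."""
    url: str
    html: str


@dataclass
class Task:
    """Hypothetical Task: one purpose, three components."""
    name: str
    page_filter: Callable[[Page], bool]               # filter rules
    extract: Callable[[Page], Dict[str, str]]         # extraction tools
    package: Callable[[List[Dict[str, str]]], bytes]  # packaging options

    def run(self, pages: Iterable[Page]) -> bytes:
        # Keep only the pages accepted by the filter, extract one record
        # from each, then hand all records to the packager.
        records = [self.extract(p) for p in pages if self.page_filter(p)]
        return self.package(records)


def zip_package(records: List[Dict[str, str]]) -> bytes:
    """Pack the records as a single JSON file inside an in-memory zip."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("records.json", json.dumps(records))
    return buf.getvalue()


def extract_title(page: Page) -> Dict[str, str]:
    """Collect the URL and the <title> text (empty if there is none)."""
    match = re.search(r"<title>(.*?)</title>", page.html, re.I | re.S)
    title = match.group(1).strip() if match else ""
    return {"url": page.url, "title": title}


# A Task with exactly one purpose: collect page titles from one site.
title_task = Task(
    name="example-org-titles",
    page_filter=lambda page: "example.org" in page.url,
    extract=extract_title,
    package=zip_package,
)
```

Calling title_task.run(pages) on a list of Page objects would return the bytes of a zip archive. A second need, for example collecting links instead of titles, would become a second Task rather than an extra option on this one.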



