Result packagers

The sheer quantity of extracted data requires adaptation to the needs of processing it on your servers. Many Tasks will produce hundreds of gigabytes per day, so optimizing the output lowers storage and processing costs.

The Crawl Pipeline can generate nearly any format. At the moment, the following are foreseen:

  • WARC archives, about 1 GB each, accompanied by a metadata index
  • Zip files, about 1 GB each: data records and metadata per Product in separate directories
  • 7zip files: like zip files but much smaller, so better suited to long-term storage
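The zip packaging described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual implementation: the `ZipPackager` class, its file-naming scheme, and the `data/` and `meta/` directory names are all assumptions; only the general shape (one directory per Product, data and metadata as separate members, a new archive when a size cap is reached) comes from the list above.

```python
import json
import zipfile
from pathlib import Path


class ZipPackager:
    """Hypothetical packager: one directory per Product, data records and
    metadata as separate archive members, and a fresh archive whenever the
    accumulated payload exceeds a size cap (about 1 GB in the real setup)."""

    def __init__(self, outdir, max_bytes=1_000_000_000):
        self.outdir = Path(outdir)
        self.outdir.mkdir(parents=True, exist_ok=True)
        self.max_bytes = max_bytes
        self.seq = 0
        self.written = 0          # uncompressed bytes in the current archive
        self.zip = None
        self._open_next()

    def _open_next(self):
        # Close the current archive (if any) and start a numbered successor.
        if self.zip is not None:
            self.zip.close()
        self.seq += 1
        path = self.outdir / f"results-{self.seq:05d}.zip"
        self.zip = zipfile.ZipFile(path, "w", zipfile.ZIP_DEFLATED)
        self.written = 0

    def add_product(self, product_id, data, metadata):
        # Data record and metadata live in separate directories per Product.
        meta_bytes = json.dumps(metadata).encode()
        self.zip.writestr(f"{product_id}/data/record.bin", data)
        self.zip.writestr(f"{product_id}/meta/record.json", meta_bytes)
        self.written += len(data) + len(meta_bytes)
        if self.written > self.max_bytes:
            self._open_next()

    def close(self):
        self.zip.close()
```

The same rollover logic would apply to WARC or 7zip output; only the record-writing step changes. Tracking the uncompressed byte count keeps the cutoff cheap and deterministic, at the cost of archives landing somewhat under the cap once compression is applied.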

Writing Tasks

  1. Configuring filters
  2. Composing filters
  3. Extracting knowledge
  4. Packaging results