Pipeline filters

The "Crawl Pipeline" converts crawl results from third-party crawlers into a standardized "Product". Many kinds of Products are possible, but at the moment the only kind is a successful HTTP request/response/text result produced by CommonCrawl. The filters, however, are abstract enough to work with many other data providers and formats. What do you need? What data can you offer to other people?

Filter hit information

Each filter rule can report why it was triggered. You may want to collect these facts, for instance for debugging, or you may simply ignore them. The hit information has a standard format:

  { "rule": "<some name>", "<fact>": "<data>" }

Each filter rule documents the name it uses and the additional facts it provides.
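As an illustration only, a filter rule that reports hit information in this format could be sketched in Python as follows. The rule name "content-type", the fact name "matched", the Product fields, and the function itself are hypothetical; they are not taken from the Crawl Pipeline code.

```python
import json
from typing import Optional


def filter_content_type(product: dict, wanted: set) -> Optional[dict]:
    """Hypothetical filter rule.

    Returns hit information when the rule triggers, or None when the
    Product does not match.
    """
    ctype = product.get("content_type")
    if ctype in wanted:
        # Standard hit format: the rule name plus rule-specific facts.
        return {"rule": "content-type", "matched": ctype}
    return None


# A made-up Product; real Products carry far more data.
product = {"url": "https://example.com/", "content_type": "text/html"}

hit = filter_content_type(product, {"text/html"})
if hit is not None:
    # Collect the hit, for instance for debugging.
    print(json.dumps(hit))
```

A Task which is not interested in the facts only needs to check whether the result is None.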


A Task can use the following Filters to select Products:

More filters will be created when the need arises. Expensive filter actions are shared with other Tasks which run on the same Pipeline.
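
One way to picture the sharing of expensive filter actions is a per-Product cache: the first Task which needs a value computes it, and later Tasks reuse it. The sketch below is a guess at the idea, not the actual implementation; the class, method, and key names are all hypothetical.

```python
class Product:
    """Hypothetical Product wrapper which caches expensive derived values."""

    def __init__(self, text):
        self.text = text
        self._cache = {}
        self.computations = 0  # counts how often real work was done

    def shared(self, key, compute):
        # Compute the value only once; every later caller gets the cached copy.
        if key not in self._cache:
            self._cache[key] = compute(self.text)
            self.computations += 1
        return self._cache[key]


def word_count(text):
    # Stand-in for an expensive action, such as language detection.
    return len(text.split())


def task_a(product):
    # Hypothetical Task: selects Products with at least 3 words.
    return product.shared("word_count", word_count) >= 3


def task_b(product):
    # Hypothetical Task: selects Products with fewer than 100 words.
    return product.shared("word_count", word_count) < 100


product = Product("a small example page")
results = (task_a(product), task_b(product))
```

Both Tasks ask for the same value, but the counting runs only once for this Product.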