Pipeline filters

The "Crawl Pipeline" converts crawl results from third-party crawlers into a standardized "Product". Many kinds of products are possible, but at the moment the only kind is the successful HTTP request/response/text result produced by CommonCrawl. The filters, however, are abstract enough to work with many other data providers and formats. What do you need? What data can you offer to other people?

Composing filters

Filter hit information

Each filter rule can report the reason why it was triggered. Collecting these facts is optional; they are useful, for instance, for debugging. The hit information has a standard format:

  { "rule": "<some name>", "<fact>": "<data>" }

Each filter rule describes which name it uses, and which additional information it provides.
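
As an illustration of the format above, here is a minimal Python sketch of a rule that reports its hit. The function name `status_rule`, the rule name "status", and the fact key "code" are hypothetical, chosen only to show the shape of the hit information; they are not taken from the Pipeline itself.

```python
import json

def status_rule(response_code, accepted=(200,)):
    """Return hit information when the response code is accepted, else None.

    The rule name "status" and the fact key "code" are assumptions
    made for this example.
    """
    if response_code not in accepted:
        return None
    # Hit information: the rule name plus one fact explaining the match.
    return {"rule": "status", "code": str(response_code)}

hit = status_rule(200)
print(json.dumps(hit))
```

A caller that is not interested in the facts can simply test whether the result is None.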

Filters

A Task can use the following Filters to select Products:

filter Status, restrict to the response codes you want to see
filter Origin, restrict the source of the data
filter Language, restrict the results based on the detected language of the response
filter Text Size, restrict based on the size of the extracted text from the response
filter Content Type, restrict based on the content (mime) type of the response
filter Domain, restrict to a set of websites or TLDs
filter Full Words, at least one of the words must be present in the extracted text
filter Match Text, at least one of the patterns (regular expressions) must be present in the extracted text
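
The difference between the last two filters can be sketched in Python as follows. This is only an illustration under assumptions: the function names, the rule names, and the fact keys ("word", "pattern", "match") are hypothetical, and the real filters may treat word boundaries and case differently.

```python
import re

def full_words_rule(text, words):
    """Hit when at least one of the words occurs as a whole word in the text."""
    for word in words:
        # \b anchors keep "cat" from matching inside "concatenate".
        if re.search(r"\b" + re.escape(word) + r"\b", text):
            return {"rule": "full words", "word": word}
    return None

def match_text_rule(text, patterns):
    """Hit when at least one of the regular expressions matches the text."""
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return {"rule": "match text", "pattern": pattern, "match": found.group(0)}
    return None
```

Both functions stop at the first word or pattern that hits, because one hit is enough to select the Product.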

More filters will be created when the need arises. The results of expensive filter actions are shared between the Tasks that run on the same Pipeline.
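
One way to picture this sharing is a per-Product cache, sketched below in Python. The class `Product`, its `cached` method, and the helper names are assumptions made for this example; the Pipeline's actual mechanism is not described here.

```python
class Product:
    """Hypothetical Product that remembers the results of expensive actions."""

    def __init__(self, text):
        self.text = text
        self._cache = {}

    def cached(self, key, compute):
        # Run the expensive action only once per Product.
        if key not in self._cache:
            self._cache[key] = compute(self)
        return self._cache[key]

calls = []  # records how often the expensive action really runs

def measure_text(product):
    calls.append("measure_text")
    return len(product.text)

def text_size_rule(product, min_size):
    """Hit when the extracted text has at least min_size characters."""
    size = product.cached("text_size", measure_text)
    if size >= min_size:
        return {"rule": "text size", "size": size}
    return None

# Two Tasks with different thresholds look at the same Product.
product = Product("hello world")
hit_a = text_size_rule(product, 5)
hit_b = text_size_rule(product, 100)
```

Although both Tasks apply the rule, `measure_text` runs only once, so `calls` holds a single entry.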