Pipeline filters
The "Crawl Pipeline" converts crawl results from third-party crawlers into a standardized "Product". There can be many kinds of Products, but at the moment the only kind is a successful HTTP request/response/text result produced by CommonCrawl. The filters, however, are abstract enough to work with many other data providers and formats. What do you need? What data can you offer to other people?
Filter hit information
Each filter rule can report the reason why it was triggered. Collecting these facts is optional; they are useful, for instance, when debugging. The hit information has a standard format:
{ "rule": "<some name>", "<fact>": "<data>" }
Each filter rule describes which name it uses, and which additional information it provides.
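As a minimal sketch of this format, assume a hypothetical rule written in Python. The function name `status_rule`, the rule name "status", and the fact key "code" are illustrative assumptions, not names taken from the Crawl Pipeline itself. The rule returns nothing when it does not trigger, and a hit record otherwise:

```python
import json
from typing import Optional, Set


def status_rule(response_code: int, accepted: Set[int]) -> Optional[dict]:
    """Hypothetical rule: return hit information when the code is accepted."""
    if response_code not in accepted:
        return None  # rule not triggered, so there is nothing to report
    # "rule" names the rule; every other key is a fact the rule chooses to report.
    return {"rule": "status", "code": str(response_code)}


hit = status_rule(200, {200, 204})
print(json.dumps(hit))  # {"rule": "status", "code": "200"}
```

A caller that is not interested in the facts can simply test whether the result is `None`.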
Filters
A Task can use the following Filters to select Products:
- filter Status, restrict to the response codes you want to see
- filter Origin, restrict the source of the data
- filter Language, restrict the results based on the detected language of the response
- filter Text Size, restrict based on the size of the extracted text from the response
- filter Content Type, restrict based on the content (mime) type of the response
- filter Domain, restrict to a set of websites or TLDs
- filter Full Words, at least one of the words must be present in the extracted text
- filter Match Text, at least one of the patterns (regular expressions) must be present in the extracted text
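The difference between the last two filters can be illustrated with a rough Python sketch. The function names, the rule names, and the fact keys ("word", "pattern", "matched") are assumptions made for this example, as is the choice to match words case-insensitively; the real filters may behave differently.

```python
import re
from typing import Iterable, Optional


def full_words_rule(text: str, words: Iterable[str]) -> Optional[dict]:
    """Hit when any word occurs as a whole word (case-insensitive by assumption)."""
    for word in words:
        # \b anchors keep "cat" from matching inside "concatenate".
        if re.search(r"\b" + re.escape(word) + r"\b", text, flags=re.IGNORECASE):
            return {"rule": "full words", "word": word}
    return None


def match_text_rule(text: str, patterns: Iterable[str]) -> Optional[dict]:
    """Hit when any regular expression matches somewhere in the text."""
    for pattern in patterns:
        found = re.search(pattern, text)
        if found:
            return {"rule": "match text", "pattern": pattern, "matched": found.group(0)}
    return None


print(full_words_rule("The cat sat", ["cat"]))       # hit on the word "cat"
print(full_words_rule("concatenate", ["cat"]))       # None: not a whole word
print(match_text_rule("concatenate", [r"cat\w+"]))   # hit, matched "catenate"
```

Both sketches stop at the first hit, because one matching word or pattern is enough to select the Product.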
More filters will be created when the need arises. Expensive filter actions are shared with other Tasks which run on the same Pipeline.
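How the sharing is implemented is not described here. One plausible approach, shown purely as an assumption, is to compute an expensive fact (such as the detected language) at most once per Product and let every Task reuse the cached value. The `Product` class and `fake_detector` function below are hypothetical stand-ins for this sketch:

```python
from typing import Callable, Dict, List


class Product:
    """Hypothetical Product wrapper that computes expensive facts at most once."""

    def __init__(self, text: str, detect_language: Callable[[str], str]):
        self.text = text
        self._detect_language = detect_language
        self._cache: Dict[str, str] = {}

    def language(self) -> str:
        # Run the expensive detector only on first use, then serve the cached value.
        if "language" not in self._cache:
            self._cache["language"] = self._detect_language(self.text)
        return self._cache["language"]


calls: List[str] = []


def fake_detector(text: str) -> str:
    """Stand-in for a costly language detector; records each invocation."""
    calls.append(text)
    return "en"


product = Product("hello world", fake_detector)
# Two Tasks with different Language filters inspect the same Product.
task_a_selects = product.language() in {"en", "nl"}
task_b_selects = product.language() in {"de"}
print(task_a_selects, task_b_selects, len(calls))  # True False 1
```

The detector runs once even though two Tasks asked for the language, which is the effect the sharing is meant to achieve.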