Filter on text size

This filter returns a Hit only when the response has a text extract, and only when that extract reaches a certain minimum size (in UTF8).

The text extraction effort is shared. The extraction may have been done by the origin (as in the CommonCrawl case), which may impose its own restrictions as well.

You have to specify exactly one of the following:

  • minimum_size, counts all printable characters and blanks
  • minimum_chars, counts only printable characters, not blanks
  • minimum_words, counts "\w+" sequences
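
To make the difference between the three measures concrete, here is a minimal Python sketch. It is not the filter's own implementation: the function name text_counts is hypothetical, and it assumes that sizes are counted in characters of the decoded text (not in encoded bytes) and that "blanks" means any whitespace character.

```python
import re

def text_counts(text: str) -> dict:
    """Compute the three measures for one text extract (illustrative only)."""
    # size: every printable character plus every blank (assumed: any whitespace)
    size = sum(1 for ch in text if ch.isprintable() or ch.isspace())
    # chars: printable characters only, blanks excluded
    chars = sum(1 for ch in text if ch.isprintable() and not ch.isspace())
    # words: number of "\w+" sequences
    words = len(re.findall(r"\w+", text))
    return {"size": size, "chars": chars, "words": words}

print(text_counts("Hello, world!\n"))
# prints {'size': 14, 'chars': 12, 'words': 2}
```

In this example the space and the newline count towards size but not towards chars, and the punctuation counts towards both but is not part of any word.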

Hit information

If you are interested in the actual counts, the hit carries them in one of these three forms:

[ { "rule": "text size", "size": 2345 },
  { "rule": "text size", "size": 2345, "chars": 1234 },
  { "rule": "text size", "size": 2345, "words":  312 }
]

These forms correspond to minimum_size, minimum_chars and minimum_words, respectively.
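
The relation between the chosen parameter and the shape of the hit can be sketched as follows. This is an illustration under stated assumptions, not the filter's real code: the function name text_size_hit is hypothetical, a missing extract is modelled as None, counts are taken in characters, and the minimum is assumed to be inclusive.

```python
import re

ALLOWED = {"minimum_size", "minimum_chars", "minimum_words"}

def text_size_hit(text, **minimum):
    """Return a hit dict, or None when there is no extract or it is too small."""
    # Exactly one of the three parameters must be given.
    if len(minimum) != 1 or not set(minimum) <= ALLOWED:
        raise ValueError("specify exactly one of: " + ", ".join(sorted(ALLOWED)))
    if text is None:                      # no text extract for this response
        return None
    (name, threshold), = minimum.items()

    # Every hit form carries the size (printables plus blanks).
    size = sum(1 for ch in text if ch.isprintable() or ch.isspace())
    hit = {"rule": "text size", "size": size}
    if name == "minimum_size":
        measured = size
    elif name == "minimum_chars":
        measured = sum(1 for ch in text if ch.isprintable() and not ch.isspace())
        hit["chars"] = measured
    else:                                 # minimum_words
        measured = len(re.findall(r"\w+", text))
        hit["words"] = measured

    # Assumption: an extract exactly at the minimum still counts as a hit.
    return hit if measured >= threshold else None

print(text_size_hit("Hello, world!\n", minimum_words=2))
# prints {'rule': 'text size', 'size': 14, 'words': 2}
```

Calling it with minimum_words=3 on the same text returns None, because the extract contains only two words.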