Filter on Origin
The "Origin" of the crawled data is the (abstract) name of the organization which provided the data to the pipeline. The origin grants a license to use the data, which you MUST comply with. See the License page.
You may specify more than one source in a single rule. You may also give an abstract description, like "sources which allow me to keep the data for one month". At the moment, we do not know which kinds of restrictions sources impose or which ones users need.
Hit information
When this filter rule matches, you may collect this hit:
{ "rule": "origin", "origin": "CommonCrawl" }
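As a rough illustration of how a rule listing several origins could produce the hit above, here is a minimal Python sketch. The function name `match_origin`, the representation of a rule as a set of origin names, and the exact, case-sensitive comparison are all assumptions made for this example; the actual rule syntax of the pipeline is not described here.

```python
from typing import Optional


def match_origin(rule_origins: set[str], record_origin: str) -> Optional[dict]:
    """Return a hit if the record's origin is listed in the rule, else None.

    Hypothetical helper: the pipeline's real rule format may differ.
    The comparison is exact and case-sensitive (an assumption).
    """
    if record_origin in rule_origins:
        # Same shape as the hit shown above.
        return {"rule": "origin", "origin": record_origin}
    return None


# One rule naming two origins; "ExampleArchive" is a made-up name.
rule = {"CommonCrawl", "ExampleArchive"}
hit = match_origin(rule, "CommonCrawl")
miss = match_origin(rule, "SomewhereElse")
```

With these inputs, `hit` has the same fields as the hit shown above and `miss` is `None`. Abstract descriptions such as a retention period are left out of the sketch, since the available restrictions are not yet known.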