Filter on language

You can filter results by the language detected in the response. Languages are given as ISO 639-3 codes, which are three letters long (for example, nld for Dutch).

Hit information

When you are interested in the detected language, you may ask for the Hit, which has the form:

{ "rule": "language", "lang": "nld" }
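As an illustration, a consumer of such a Hit could read the language code as in the following minimal Python sketch. The function name hit_language is hypothetical and not part of any documented API; the sketch only assumes the Hit arrives as a JSON string with the two fields shown above.

```python
import json

def hit_language(hit_json):
    """Return the ISO 639-3 code from a language Hit, or None otherwise.

    Hypothetical helper: assumes the Hit is a JSON object with the
    fields "rule" and "lang", as in the example above.
    """
    hit = json.loads(hit_json)
    if hit.get("rule") != "language":
        return None
    lang = hit.get("lang")
    # ISO 639-3 codes are exactly three lowercase ASCII letters.
    if (isinstance(lang, str) and len(lang) == 3
            and lang.isascii() and lang.isalpha() and lang.islower()):
        return lang
    return None

print(hit_language('{ "rule": "language", "lang": "nld" }'))
```

Hits from other rules, or Hits with a malformed code, yield None instead of raising an error, so the caller can skip them.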

How the language is found

There are various ways to determine the language used.

For the CommonCrawl data-set, the language is determined by the CLD2 detector, which produces a detailed breakdown of the languages seen in each HTML page. The filter checks only the major language discovered per page.
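The idea of checking only the major language can be sketched as follows. This is not the actual filter implementation: the input shape (a mapping from ISO 639-3 code to the share of page text in that language) and the names major_language and passes_filter are assumptions made for illustration.

```python
def major_language(coverage):
    """Return the code with the largest share of page text, or None if empty.

    `coverage` is an assumed shape: {iso_639_3_code: share_of_text}.
    """
    if not coverage:
        return None
    return max(coverage, key=coverage.get)

def passes_filter(coverage, wanted):
    """True when the page's major language is one of the wanted codes."""
    return major_language(coverage) in wanted

# Hypothetical per-page result: mostly Dutch with some English and French.
page = {"nld": 0.82, "eng": 0.15, "fra": 0.03}
print(major_language(page))
print(passes_filter(page, {"eng"}))
```

With this approach a page that is mostly Dutch does not match a filter on eng, even though some English text was detected on it.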

Which information will be used to detect the language in other data-sets has not yet been decided. There are several candidate sources: the response headers, the request headers, and the HTML itself.