URL normalization

Various extracts contain (relative) URLs. As long as we know that an extract contains a URL, we normalize it into a canonical absolute URL. This is a complex process.

The main normalization rules are laid down in RFC 3986 and explained on the Wikipedia page about URI normalization. The normalizer also handles HTML quirks, as long as most browsers support them.

Interesting real-life HTML link statistics

Normalization rules

We use the Perl module HTML::Inspect (not on CPAN yet) to normalize (partial, relative) URLs found on web pages into absolute, RFC-compliant URLs.

Explicitly, our URIs are created according to the following rules:

As absolute base for relative URLs, it uses, in order of preference:
- the <base href> from the document,
- otherwise the Location: header of the response,
- otherwise the request URI
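
The fallback order above can be sketched in a few lines. This is an illustration in Python's standard library, not the HTML::Inspect interface; the names pick_base and _BaseHrefFinder are hypothetical, and resolving a relative <base href> or Location: value against the next fallback is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _BaseHrefFinder(HTMLParser):
    """Records the href of the first <base> element in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def pick_base(html, request_uri, location_header=None):
    """Return the absolute base URL (hypothetical helper).

    Order: <base href>, then the Location: header, then the request URI.
    """
    # Assumption: a relative Location: value is resolved against the request.
    fallback = urljoin(request_uri, location_header) if location_header else request_uri
    finder = _BaseHrefFinder()
    finder.feed(html)
    if finder.base_href:
        # Assumption: a relative <base href> is resolved against the fallback.
        return urljoin(fallback, finder.base_href)
    return fallback
```

A relative link is then resolved with urljoin(pick_base(html, request_uri), link) before any further normalization.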

HTML formatting issues are circumvented:
- leading and trailing blanks are stripped
- backslashes become forward slashes
- tab, CR, LF, and VT characters are removed, including any blanks that follow them
- blanks are recoded to %20
- UTF-8 characters in the auth, path, and query components are percent-encoded (hex) and verified
- UTF-8 in hostnames is converted into IDN
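
As a rough sketch of these clean-up steps, again in standard-library Python rather than the module's own code: the function name clean_html_url and the two "safe" character sets are assumptions, the verification step is left out, and IPv6 host literals are not handled.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding (an assumption). "%" is
# included so that escapes already present are not encoded a second time.
SAFE_PATH_QUERY = "%/:@!$&'()*+,;=?"
SAFE_USERINFO = "%:!$&'()*+,;="


def clean_html_url(raw):
    """Apply the HTML clean-up steps to one raw attribute value (sketch)."""
    url = raw.strip()                          # leading and trailing blanks
    url = url.replace("\\", "/")               # backslashes become slashes
    url = re.sub(r"[\t\r\n\x0b] *", "", url)   # tab, CR, LF, VT plus following blanks
    url = url.replace(" ", "%20")              # remaining blanks

    parts = urlsplit(url)
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")  # note: breaks on IPv6 literals
    if not host.isascii():
        # Python's built-in "idna" codec produces the ASCII (xn--) form.
        host = host.encode("idna").decode("ascii")
    netloc = quote(userinfo, safe=SAFE_USERINFO) + at + host + colon + port
    path = quote(parts.path, safe=SAFE_PATH_QUERY)      # UTF-8 bytes as %XX
    query = quote(parts.query, safe=SAFE_PATH_QUERY)
    fragment = quote(parts.fragment, safe=SAFE_PATH_QUERY)
    return urlunsplit((parts.scheme, netloc, path, query, fragment))
```

The function works on relative URLs as well, so it can run before the URL is resolved against the base.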

RFC3986 normalizations:
- "../" and "./" removal
- remove repeating slashes
- "+" is recoded to "%20"
- lower-casing scheme
- upper-casing percent-encoded characters
- unneeded percent encodings are removed
- hostnames are lower-cased
- remove trailing dot from hostname
- superfluous port numbers are removed
- superfluous port number digits are removed
- the fragment is removed

Probably most important: all URL components are validated, so the resulting URLs do not need additional checks.