Extract the <link> elements

Link elements serve many different purposes, and new ones keep being added. They also vary widely in their attributes, as can be seen on the MDN page about 'link'.

The links are normalized into canonical URLs, so you do not need to clean them up any further.
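To give a feel for what such a normalization might involve, here is a minimal Python sketch. It is not the extractor's actual implementation: the function name `normalize_url` and the chosen steps (resolve against a base URL, lower-case scheme and host, drop the default port and the fragment) are assumptions made for illustration only.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Default ports that can be dropped without changing the target (assumed rule).
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(href: str, base: str) -> str:
    """Hypothetical helper: resolve href against base and tidy the result."""
    absolute = urljoin(base, href.strip())
    parts = urlsplit(absolute)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With these assumed rules, a relative href such as `/assets/favicon.ico#x` found on a page at `https://Boat.com:443/en/` would come out as an absolute, lower-cased URL without port or fragment.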

Extracted structure

Sometimes web pages have a huge number of links. The URIs themselves (when an href attribute is present) are also listed by the references extractor, but this link extractor collects every attribute field it finds. The lower-cased attribute name is used as the key.

Example output:

{  "canonical": [ {
          "href": "https://boat.com/en_us/rental-of/yacht/puerto-portals"
       }
    ],
    "shortcut icon": [ {
          "type": "image/x-icon",
          "href": "https://boat.com/assets/favicon/favicon-783e85a14e.ico"
       }
    ],
    "alternate": [ {
          "href": "https://boat.com/es/alquiler-de/yate/puerto-portals",
          "hreflang": "es"
       },
       {  "href": "https://boat.com/en/rental-of/yacht/puerto-portals",
          "hreflang": "en"
       }
    ]
 }
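A structure like the one above could be produced roughly as follows. This is a sketch under assumptions, not the extractor's real code: the grouping by the lower-cased rel value is inferred from the example output, the names `LinkExtractor` and `extract_links` are hypothetical, and the href is only resolved against a base URL with the standard library rather than fully canonicalized.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical collector for <link> elements, grouped by their rel value."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        record = {}
        rel = ""
        for name, value in attrs:
            name = name.lower()    # attribute names are used lower-cased as keys
            value = value or ""    # attributes without a value arrive as None
            if name == "rel":
                rel = value.strip().lower()  # assumption: rel is the grouping key
            elif name == "href":
                record["href"] = urljoin(self.base_url, value)
            else:
                record[name] = value
        self.links[rel].append(record)


def extract_links(html: str, base_url: str) -> dict:
    """Parse html and return {rel: [attribute records]}."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return dict(parser.links)
```

Calling `extract_links(page_source, "https://boat.com/")` on a page with one canonical link and two alternate links would yield a dictionary with the keys "canonical" and "alternate", each holding a list of attribute records.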
