Extract the <link> elements
Links have many different purposes, and more and more are added. They also have a wide variety in attributes, as can be seen on the MDN page about 'link'
The links will be normalized into a canonical URL: you do not need to clean it up any further.
Extracted structure
Sometime webpages have a huge number of links. The URIs themselves (when a href
attribute is present) will also be listed in the references extractor, but this link extractor will take all attribute fields it finds. The lower-case version of the attribute name is used as key.
Example output:
{ "canonical": [ { "href": "https://boat.com/en_us/rental-of/yacht/puerto-portals" } ], "shortcut icon": [ { "type": "image/x-icon", "href": "https://boat.com/assets/favicon/favicon-783e85a14e.ico" } ], "alternate": [ { "href": "https://boat.com/es/alquiler-de/yate/puerto-portals", "hreflang": "es" }, { "href": "https://boat.com/en/rental-of/yacht/puerto-portals", "hreflang": "en" } ] }