Extraction of References

Many HTML elements carry attributes whose values are URIs pointing to other resources.

When you request all references, the following attributes are collected:

Elements and their reference attributes
  a        href
  area     href
  base     href
  embed    src
  form     action
  iframe   src
  img      src
  link     href
  script   src
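
As a rough illustration of how the table above can drive a collector, here is a minimal sketch using Python's standard html.parser module. The names REFERENCE_ATTRIBUTES, ReferenceCollector and collect_references are hypothetical and are not part of this tool; the sketch only gathers raw attribute values, keyed as element_attribute, without any normalization.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Element/attribute pairs taken from the table above.
REFERENCE_ATTRIBUTES = {
    "a": "href",
    "area": "href",
    "base": "href",
    "embed": "src",
    "form": "action",
    "iframe": "src",
    "img": "src",
    "link": "href",
    "script": "src",
}


class ReferenceCollector(HTMLParser):
    """Collect raw attribute values, keyed as '<element>_<attribute>'."""

    def __init__(self):
        super().__init__()
        self.references = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        wanted = REFERENCE_ATTRIBUTES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.references[f"{tag}_{wanted}"].append(value)


def collect_references(html):
    """Return a dict such as {'a_href': [...], 'img_src': [...]}."""
    collector = ReferenceCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.references)
```

Sets that have no matches are simply absent from the result, which mirrors the sample output where only some keys appear.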

Extracted data structure

All links are collected and converted to the canonical (normalized) form, then deduplicated.
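
The exact canonical form is not spelled out here, so the following is only a sketch under assumptions: relative references are resolved against the page URL, the scheme and host are lowercased, an empty path on a host becomes "/", and fragments are kept. The function names normalize and normalize_and_deduplicate and the base_url parameter are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url, base_url):
    """Resolve url against base_url and apply a few assumed normalizations."""
    absolute = urljoin(base_url, url.strip())
    parts = urlsplit(absolute)  # urlsplit already lowercases the scheme
    # Lowercase only the host[:port] part; userinfo is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    path = parts.path
    if netloc and not path:
        path = "/"  # "https://host" and "https://host/" are equivalent
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


def normalize_and_deduplicate(references, base_url):
    """Normalize every URL per set and drop duplicates, keeping first-seen order."""
    result = {}
    for key, urls in references.items():
        normalized = (normalize(u, base_url) for u in urls)
        result[key] = list(dict.fromkeys(normalized))
    return result
```

Deduplication happens after normalization on purpose: two spellings of the same address only collapse into one entry once they share a canonical form.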

How and where the resulting data structure is transported is your decision, but the output looks like this:

{
     "form_action" : [
        "https://grupovilanova.es/#wpcf7-f399-p11-o1"
     ],
     "link_href" : [
        "https://grupovilanova.es/",
        "https://www.google.com/"
     ],
     "script_src" : [
        "https://my.es/wp-includes/js/jquery/jquery.min.js?ver=3.5.1"
     ],
     "a_href" : [
        "mailto:info@grupovilanova.es",
        "https://www.facebook.com/VilanovaInmo"
     ]
}

Custom subsetting

You may select a subset for your extraction (for instance only img_src), because the full list can be large: on average about 150 URLs per HTML file.

You may also restrict the returned data with the following limits:

  • maximum links per set
  • only http/https
  • only mailto
  • only URLs matching a regular expression, for instance on the top-level domain or file extension.
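
The subsetting and limits listed above could be applied to the extracted structure roughly as follows. This is a sketch, not the tool's interface: the function restrict and its parameters sets, max_per_set, schemes and pattern are hypothetical names chosen for illustration.

```python
import re
from urllib.parse import urlsplit


def restrict(references, sets=None, max_per_set=None, schemes=None, pattern=None):
    """Filter a {'a_href': [...], ...} structure.

    sets        -- keep only these keys, e.g. {"img_src"}
    max_per_set -- keep at most this many URLs per key
    schemes     -- keep only these URL schemes, e.g. {"http", "https"}
    pattern     -- keep only URLs where this regular expression matches
    """
    regex = re.compile(pattern) if pattern else None
    result = {}
    for key, urls in references.items():
        if sets is not None and key not in sets:
            continue
        kept = []
        for url in urls:
            if schemes is not None and urlsplit(url).scheme not in schemes:
                continue
            if regex is not None and not regex.search(url):
                continue
            kept.append(url)
        # Apply the size limit last, so it counts only URLs that passed the filters.
        if max_per_set is not None:
            kept = kept[:max_per_set]
        result[key] = kept
    return result
```

For example, restrict(refs, schemes={"mailto"}) would keep only mail addresses, and restrict(refs, pattern=r"\.es(/|$)") would keep URLs ending in .es or continuing with a path after it.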