Extraction of References

Many HTML elements contain URIs pointing to other internet locations.

When you ask for all references, the following attributes are collected:

Elements and their attributes which contain links
  a        href
  area     href
  base     href
  embed    src
  form     action
  iframe   src
  img      src
  link     href
  script   src
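
As an illustration only, the table above can be turned into a small collector. This is a minimal sketch in Python using the standard html.parser module; the names REFERENCE_ATTRIBUTES, ReferenceCollector and collect_references are hypothetical and not part of any documented interface. The set names (for instance a_href) follow the element_attribute pattern used in the output.

```python
from html.parser import HTMLParser

# Element -> attribute pairs, copied from the table above.
REFERENCE_ATTRIBUTES = {
    "a": "href", "area": "href", "base": "href", "embed": "src",
    "form": "action", "iframe": "src", "img": "src", "link": "href",
    "script": "src",
}

class ReferenceCollector(HTMLParser):
    """Collects raw attribute values into sets named element_attribute."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        wanted = REFERENCE_ATTRIBUTES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.found.setdefault(f"{tag}_{wanted}", []).append(value)

def collect_references(html):
    """Hypothetical helper: returns e.g. {"a_href": [...], "img_src": [...]}."""
    collector = ReferenceCollector()
    collector.feed(html)
    collector.close()
    return collector.found
```

The values returned here are still raw: they are neither normalized nor deduplicated.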

Extracted data structure

All links are collected and converted to the canonical (normalized) form, then deduplicated.
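
The exact normalization rules are not spelled out here, so the following Python sketch only shows the general idea under stated assumptions: relative links are resolved against a base URL, the host is lowercased, default ports for http and https are dropped, and an empty path becomes "/". The function names normalize and canonical_set are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

# Assumption: only these default ports are stripped.
DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url, base):
    """Resolve url against base and rewrite it to one canonical spelling."""
    joined = urljoin(base, url.strip())
    parts = urlsplit(joined)          # urlsplit already lowercases the scheme
    try:
        port = parts.port
    except ValueError:                # malformed port: leave the URL untouched
        return joined
    host = parts.hostname or ""       # hostname is returned in lowercase
    if ":" in host:                   # IPv6 literal: restore the brackets
        host = f"[{host}]"
    netloc = host
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        netloc = f"{host}:{port}"
    if "@" in parts.netloc:           # keep user information unchanged
        netloc = parts.netloc.rpartition("@")[0] + "@" + netloc
    path = parts.path or ("/" if netloc else "")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))

def canonical_set(urls, base):
    """Normalize every URL, then deduplicate while keeping first-seen order."""
    return list(dict.fromkeys(normalize(u, base) for u in urls))
```

Deduplication happens after normalization on purpose: "/a" and "http://Example.COM:80/a" only turn out to be the same link once both are in canonical form.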

How and where the data structure with these facts is transported is your decision, but the output looks like this:

     "form_action" : [
     "link_href" : [
     "script_src" : [
     "a_href" : [

Custom subsetting

You may select a subset for your extraction (for instance, only img_src), because the full list can be large: about 150 URLs per HTML file on average.

You may also restrict the returned data with the following limits:

  • maximum links per set
  • only http/https
  • only mailto
  • only links matching some regular expression, for instance on top-level domain or file extension.
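
To make the subsetting and the limits concrete, here is a minimal Python sketch. It assumes the extracted data is a mapping from set name (such as a_href) to a list of normalized URLs; the function restrict and its parameters sets, max_per_set, schemes and pattern are hypothetical names chosen for this illustration.

```python
import re
from urllib.parse import urlsplit

def restrict(links, sets=None, max_per_set=None, schemes=None, pattern=None):
    """Return a filtered copy of links; every limit is optional.

    sets        -- names of the sets to keep, e.g. ["img_src"]
    max_per_set -- maximum number of links returned per set
    schemes     -- allowed URL schemes, e.g. {"http", "https"} or {"mailto"}
    pattern     -- regular expression a link must match (searched, not anchored)
    """
    regex = re.compile(pattern) if pattern else None
    result = {}
    for name, urls in links.items():
        if sets is not None and name not in sets:
            continue
        kept = [
            u for u in urls
            if (schemes is None or urlsplit(u).scheme in schemes)
            and (regex is None or regex.search(u))
        ]
        result[name] = kept[:max_per_set]   # a slice up to None keeps everything
    return result
```

For example, restrict(links, sets=["a_href"], schemes={"http", "https"}, pattern=r"\.pdf$", max_per_set=10) would keep at most ten http/https links ending in .pdf from the a_href set. The maximum is applied last, so it counts only links which passed the other limits.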