Extraction of References
Many HTML elements contain URIs pointing to other internet locations.
When you need every reference, the following attributes are collected:
Reference attributes with links
- a href
- area href
- base href
- embed src
- form action
- iframe src
- img src
- link href
- script src
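The collection step for the element/attribute pairs listed above can be sketched with Python's standard `html.parser` module. This is only an illustration, not the tool's actual implementation: the names `REFERENCE_ATTRIBUTES`, `ReferenceCollector` and `collect_references` are hypothetical, and the `tag_attribute` key format is assumed from the sample output shown in this section.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Element -> attribute pairs from the list above.
REFERENCE_ATTRIBUTES = {
    "a": "href", "area": "href", "base": "href", "embed": "src",
    "form": "action", "iframe": "src", "img": "src", "link": "href",
    "script": "src",
}


class ReferenceCollector(HTMLParser):
    """Collects raw attribute values, grouped under keys such as 'img_src'."""

    def __init__(self):
        super().__init__()
        self.references = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        wanted = REFERENCE_ATTRIBUTES.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.references[f"{tag}_{wanted}"].append(value)


def collect_references(html):
    """Return {'tag_attribute': [raw values, in document order]}."""
    collector = ReferenceCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.references)
```

The values returned here are still raw, so relative paths and duplicates remain until a separate normalization pass is applied.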
Extracted data-structure
All links are collected and converted to the canonical (normalized) form, then deduplicated.
How and where this data structure is transported is your decision, but the output looks like this:
{
  "form_action" : [ "https://grupovilanova.es/#wpcf7-f399-p11-o1" ],
  "link_href" : [ "https://grupovilanova.es/", "https://www.google.com/" ],
  "script_src" : [ "https://my.es/wp-includes/js/jquery/jquery.min.js?ver=3.5.1" ],
  "a_href" : [ "mailto:info@grupovilanova.es", "https://www.facebook.com/VilanovaInmo" ]
}
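The exact canonicalization rules are not spelled out here, so the following is a minimal sketch under stated assumptions: relative references are resolved against the page URL, the host is lowercased, an empty path becomes `/`, and query strings and fragments are kept (as in the sample output). The function names `normalize` and `normalize_and_deduplicate` are hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalize(url, base_url):
    """Assumed canonical form: absolute URL, lowercase host, non-empty path."""
    absolute = urljoin(base_url, url.strip())
    parts = urlsplit(absolute)  # urlsplit lowercases the scheme itself
    # Lowercase only host[:port]; any user:password part is case-sensitive.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport.lower()
    # Give host-based URLs an explicit root path; leave mailto: and the like alone.
    path = parts.path or ("/" if netloc else "")
    return urlunsplit((parts.scheme, netloc, path, parts.query, parts.fragment))


def normalize_and_deduplicate(references, base_url):
    """Normalize every URL, then drop duplicates while keeping first-seen order."""
    result = {}
    for key, urls in references.items():
        normalized = (normalize(url, base_url) for url in urls)
        result[key] = list(dict.fromkeys(normalized))
    return result
```

Deduplicating after normalization matters: `/` and `https://grupovilanova.es/` only collapse into one entry once both are in the same absolute form.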
Custom subsetting
You may select a subset for your extraction (for instance only img_src) because the full list is large: about 150 URLs per HTML file on average.
You may also restrict the returned data with the following limits:
- maximum links per set
- only http/https
- only mailto
- matching some regular expression, for instance top-level domain or extension.
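The subsetting and limits above could be applied as a post-processing step on the extracted structure. The sketch below is an assumption about how such a filter might look, not the tool's interface: the function `restrict` and its parameters `sets`, `max_per_set`, `schemes` and `pattern` are hypothetical names.

```python
import re
from urllib.parse import urlsplit


def restrict(references, sets=None, max_per_set=None, schemes=None, pattern=None):
    """Filter {'tag_attribute': [urls]}; every argument left as None is ignored."""
    regex = re.compile(pattern) if pattern else None
    result = {}
    for key, urls in references.items():
        if sets is not None and key not in sets:
            continue  # this set was not selected
        kept = []
        for url in urls:
            if schemes is not None and urlsplit(url).scheme not in schemes:
                continue
            if regex is not None and not regex.search(url):
                continue
            kept.append(url)
        if max_per_set is not None:
            kept = kept[:max_per_set]
        if kept:  # omit sets that end up empty
            result[key] = kept
    return result
```

For example, `restrict(data, schemes={"mailto"})` keeps only mail addresses, while `restrict(data, schemes={"http", "https"}, pattern=r"\.js(\?|$)")` keeps only web URLs whose path ends in `.js`.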