"Classic" <meta> elements
Many extensions have been made to the <meta>
elements in HTML pages. Not only have people been adding new values
for name, but various attributes were also added,
like property. Let's call the official short list of
possibilities "classic".
The classic <meta> elements come in three forms:
- with attribute name, with a restricted set of names;
- with attribute charset, maximum one; and
- with attribute http-equiv, all of them.
This extractor takes all http-equiv records, because
there are few of them and only old extensions have been made to the
list reported by W3Schools.
There SHOULD be only one meta element with a charset.
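
The three forms above can be told apart by the attribute a <meta> element carries. The following is a minimal Python sketch, not the extractor's own code: MetaFormCounter and count_meta_forms are hypothetical names, and the order in which the attributes are checked is an assumption. It counts the elements per form, which also makes it easy to spot a page with more than one charset.

```python
from html.parser import HTMLParser


class MetaFormCounter(HTMLParser):
    """Hypothetical helper: counts <meta> elements per classic form."""

    def __init__(self):
        super().__init__()
        self.counts = {"name": 0, "charset": 0, "http-equiv": 0, "other": 0}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        present = {key for key, _ in attrs}
        # Assumption: when several of these attributes appear on one
        # element, the first match in this order decides the form.
        for form in ("name", "charset", "http-equiv"):
            if form in present:
                self.counts[form] += 1
                return
        # For example <meta property="og:title" ...> is not a classic form.
        self.counts["other"] += 1


def count_meta_forms(html):
    """Return the number of <meta> elements found for each form."""
    parser = MetaFormCounter()
    parser.feed(html)
    parser.close()
    return parser.counts
```

A "charset" count above 1 signals a page that ignores the recommendation above; how to react to that is left open in this sketch.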
The classic name attributes
The classic set of names is listed by MDN under "standard metadata names". Given convincing arguments, a few names MAY be added.
This extractor currently takes name values for:
application-name author creator color-scheme description generator googlebot keywords publisher referrer robots theme-color viewport
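
The restricted set can be expressed as a simple allow-list. This is a small Python sketch under stated assumptions: CLASSIC_NAMES and is_classic_name are hypothetical names, and matching is done case-insensitively after trimming whitespace, which the text above does not specify.

```python
# Allow-list copied from the names listed above.
CLASSIC_NAMES = frozenset(
    "application-name author creator color-scheme description generator "
    "googlebot keywords publisher referrer robots theme-color viewport".split()
)


def is_classic_name(name):
    """True when a <meta name=...> value belongs to the classic set.

    Assumption: names are compared case-insensitively.
    """
    return name.strip().lower() in CLASSIC_NAMES
```

With this check, is_classic_name("Description") is true, while an extension such as "twitter:card" is rejected.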
Produced data-structure
How and where the data-structure with the facts is transported is your decision, but the output looks like this:
  {
    "name" : {
      "description" : "The Open Graph protocol enables...",
      "generator" : "Хей, гиди Ванчо"
    },
    "charset" : "utf-8",
    "http-equiv" : {
      "content-type" : "text/html;charset=utf-8",
      "refresh" : "3;url=https://www.mozilla.org",
      "content-disposition" : ""
    }
  }
Both name and http-equiv can appear in
multiple <meta> elements, each with a unique label.
Therefore, they are produced as simple associative arrays (HASH) with
only simple values. Values are always UTF-8 and entity-decoded.
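
To illustrate the shape of this structure, here is a minimal Python sketch built on the standard library html.parser. It is not the extractor's implementation: ClassicMetaParser and extract_classic_meta are hypothetical names, and the handling of duplicates (first charset wins, a later duplicate label overwrites an earlier one) and the lower-casing of labels are assumptions.

```python
from html.parser import HTMLParser

# Allow-list copied from the names listed in this document.
CLASSIC_NAMES = frozenset(
    "application-name author creator color-scheme description generator "
    "googlebot keywords publisher referrer robots theme-color viewport".split()
)


class ClassicMetaParser(HTMLParser):
    """Hypothetical collector for the three classic <meta> forms."""

    def __init__(self):
        super().__init__()
        self.result = {"name": {}, "http-equiv": {}}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # html.parser has already decoded entities in attribute values.
        attr = {key: (value or "") for key, value in attrs}
        content = attr.get("content", "")
        if "charset" in attr:
            # Assumption: the first charset wins when a page has several.
            self.result.setdefault("charset", attr["charset"])
        elif "http-equiv" in attr:
            # Assumption: labels are lower-cased, later duplicates overwrite.
            self.result["http-equiv"][attr["http-equiv"].lower()] = content
        elif attr.get("name", "").lower() in CLASSIC_NAMES:
            self.result["name"][attr["name"].lower()] = content


def extract_classic_meta(html):
    """Return a dict with "name", "http-equiv" and, if present, "charset"."""
    parser = ClassicMetaParser()
    parser.feed(html)
    parser.close()
    return parser.result
```

Serializing the result with json.dumps(result, ensure_ascii=False).encode("utf-8") would yield UTF-8 JSON in which values such as "Хей, гиди Ванчо" stay readable.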
