Real-life HTML link statistics

MarkOv collected some data from one CommonCrawl WARC file, published in April 2021. The WARC file is named CC-MAIN-20210410105831-20210410135831-00000.

Some facts are about the _normalized_ URL (the URL in its _canonical_ form), as created via URI::Fast v0.52. Most facts, however, are about the raw string as discovered in any attribute that HTML has to offer which MAY contain a link.

    #     50_025  100.0%  successful responses in the WARC archive
    #     41_912   83.8%  parseable HTML, used to extract links

About the contained relative links:

    #  9_525_806  100.0%  links
    #     29_140    0.3%  blank links
    #      2_746    0.0%  empty query (trailing '?')
    #    165_393    1.7%  empty fragment (trailing '#')
    #     56_843    0.6%  contains blank(s)
    #    151_208    1.6%  contains '+'
    #    114_258    1.2%  contains utf8 (1_347_928 characters)
    #      1_326    0.0%  normalized to ipv4 hosts
    #      6_753    0.1%  normalized with default port number
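The checks on the raw link strings could be reproduced roughly as in the following Python sketch. It is not the code that produced the numbers above. The function names `classify_link` and `tally` and the category labels are hypothetical, and the exact definitions are assumptions: "blank" is read as empty or whitespace-only, and "contains utf8" as containing at least one non-ASCII character.

```python
from collections import Counter


def classify_link(raw):
    """Return the set of category labels that apply to one raw link string.

    The labels and their definitions are assumptions made for this sketch.
    """
    flags = set()
    if raw.strip() == "":
        # Empty or whitespace-only attribute value: nothing else to check.
        flags.add("blank")
        return flags
    if raw.endswith("?"):
        flags.add("empty_query")
    if raw.endswith("#"):
        flags.add("empty_fragment")
    if any(ch.isspace() for ch in raw):
        flags.add("contains_blank")
    if "+" in raw:
        flags.add("contains_plus")
    if not raw.isascii():
        flags.add("contains_non_ascii")
    return flags


def tally(links):
    """Count how many links fall into each category."""
    counts = Counter()
    for raw in links:
        counts.update(classify_link(raw))
    return counts
```

A link may fall into several categories at once, so the counts are not expected to add up to the total number of links.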

About the schemes used (in the normalized results):

    #  7_613_385   80.0%  https:
    #  1_695_666   17.8%  http:
    #     60_295    0.6%  data:
    #        550    0.0%  about:
    #    112_975    1.2%  javascript:
    #          4    0.0%  ftp:
    #     18_771    0.2%  mailto:
    #     16_697    0.2%  tel:
    #      7_463    0.1%  other schemes and broken links
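A per-scheme tally like this one could be computed along the following lines. This is a minimal sketch in Python using the standard `urllib.parse` module, not the URI::Fast based code behind the figures above. The names `KNOWN`, `scheme_of` and `scheme_table` are hypothetical, and it assumes the input URLs are already normalized to absolute form, so anything without a recognized scheme is counted as "other".

```python
from collections import Counter
from urllib.parse import urlsplit

# Schemes reported separately; everything else is grouped as "other".
KNOWN = ("https", "http", "data", "about", "javascript", "ftp", "mailto", "tel")


def scheme_of(url):
    """Return the lower-cased scheme of a URL, or "other" if unknown or broken."""
    try:
        scheme = urlsplit(url).scheme
    except ValueError:
        # urlsplit rejects some malformed URLs, e.g. an unterminated IPv6 host.
        return "other"
    return scheme if scheme in KNOWN else "other"


def scheme_table(urls):
    """Return report lines: count, percentage of all URLs, and scheme."""
    counts = Counter(scheme_of(u) for u in urls)
    total = sum(counts.values())
    lines = []
    for scheme, n in counts.most_common():
        lines.append(f"# {n:>10_} {100 * n / total:5.1f}% {scheme}")
    return lines
```

Since every URL is assigned exactly one scheme label, the counts in this table do add up to the total number of links.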