Extract Product information

After the filters have selected the products you want to receive, narrow the data down to what you are really interested in. The Pipeline's input is more than 10 terabytes per day; you probably want to reduce that to a few gigabytes per day to process yourself. The extracted data can be formatted and packaged in many ways; here we focus only on the content.

At the moment, we only process CommonCrawl data. We do not yet know what sources added in the future will be able to offer.


Available data

Standardized data:

  • (raw) HTTP request (bytes sent)
  • (raw) HTTP response (bytes received)

In standard WARC format:

  • request as WARC record; raw HTTP request wrapped in WARC headers
  • response as WARC record; raw HTTP response wrapped in WARC headers
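The wrapping described above can be sketched in a few lines: a WARC record is a small block of WARC headers, a blank line, and then the raw HTTP bytes as payload. The record fields and URI below are illustrative examples, and the parser is a minimal stdlib-only sketch, not a full WARC reader.

```python
# Sketch of a WARC "response" record: WARC headers wrap the raw HTTP
# response bytes. Field values below are illustrative, not pipeline output.
raw_http = (b"HTTP/1.1 200 OK\r\n"
            b"Content-Type: text/html\r\n\r\n"
            b"<html><body>Hello</body></html>")

warc_headers = (b"WARC/1.1\r\n"
                b"WARC-Type: response\r\n"
                b"WARC-Target-URI: http://example.org/\r\n"
                b"Content-Type: application/http;msgtype=response\r\n"
                b"Content-Length: " + str(len(raw_http)).encode() + b"\r\n\r\n")

record = warc_headers + raw_http + b"\r\n\r\n"

def parse_warc_headers(data: bytes) -> dict:
    """Split off the WARC header block and return its fields as a dict."""
    head, _, _body = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    return dict(line.split(": ", 1) for line in lines[1:])  # skip version line

fields = parse_warc_headers(record)
print(fields["WARC-Type"])        # response
```

A request record looks the same, except that the payload is the raw HTTP request and WARC-Type is "request".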

CommonCrawl-specific content


Pipeline:

  • Hits produced by the filter rules; JSON or other serialization
  • HTML-inspection output; JSON or other serialization, implemented via our Perl module HTML::Inspect
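As a sketch of what a serialized hit might look like: one JSON object per hit, carrying the matched URL, the rule that fired, and an excerpt of the inspection output. All field names below are hypothetical, chosen for illustration; they are not the pipeline's actual schema.

```python
import json

# Hypothetical shape of a filter "hit"; every field name here is
# illustrative, not the pipeline's real output format.
hit = {
    "url": "http://example.org/page.html",
    "rule": "contains-product-schema",   # the filter rule that matched
    "crawl": "CC-MAIN-2023-50",          # example CommonCrawl segment label
    "inspect": {                         # excerpt of HTML-inspection output
        "meta": {"generator": "ExampleCMS 4.2"},
        "links": 17,
    },
}

line = json.dumps(hit, sort_keys=True)   # one JSON object per line (JSONL)
decoded = json.loads(line)
print(decoded["rule"])
```

Line-oriented JSON keeps the extract streamable, which matters when the daily volume is still gigabytes after reduction.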


Ideas for other extracts

Ask for these to be implemented if you have a use for them:

  • Various microformats, like vCard, hMedia, ...
  • Twitter Cards
  • AGLS Metadata Standard, an Australian government standard that appears as <meta name="AGLSTERMS.*">
  • RDFa content in HTML
  • Dublin Core <meta name="dc.*"> and <meta name="dcterms.*">
  • Cookies set by the response
  • Which CMS framework is used
  • Response time
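To make one of the ideas above concrete, extracting Dublin Core metadata amounts to collecting <meta name="dc.*"> tags from each page. The following is a minimal stdlib-only sketch using Python's html.parser; the sample HTML is invented for illustration.

```python
from html.parser import HTMLParser

class DcMetaCollector(HTMLParser):
    """Collect <meta name="dc.*" content="..."> pairs from an HTML page."""
    def __init__(self):
        super().__init__()
        self.dc = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)                      # attribute names arrive lowercased
        name = (a.get("name") or "").lower() # values keep their case, so lower()
        if name.startswith("dc.") and "content" in a:
            self.dc[name] = a["content"]

# Invented sample page, for illustration only.
html = """<html><head>
<meta name="DC.title" content="Example product page">
<meta name="DC.creator" content="Example Corp">
<meta name="viewport" content="width=device-width">
</head><body></body></html>"""

p = DcMetaCollector()
p.feed(html)
print(p.dc)   # {'dc.title': 'Example product page', 'dc.creator': 'Example Corp'}
```

The same pattern, with a different name prefix, would cover AGLSTERMS.* and dcterms.* tags as well.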

More ideas? Please report them, even if you do not need them (yet).