Extract Product information

Once the filters have selected the products you want to receive, narrow the data down to what you are really interested in. The input of the Pipeline is more than 10 terabytes per day, and you probably want to reduce that to a few gigabytes per day to process yourself. The extracted data can be formatted and packaged in many ways; this page focuses only on the content.

At the moment, we only process CommonCrawl data. For sources to be added in the future, we do not yet know what can be offered.

Writing Tasks

Configure filters
Composing filters
Extracting knowledge:

URL normalization
classic <meta>
<meta> names
all <meta>
references
<link>
OpenGraph
Packaging Results

Available data

Standardized data:

(raw) HTTP request (bytes sent)
(raw) HTTP response (bytes received)

In standard WARC format:

request as WARC record; raw HTTP request wrapped in WARC headers
response as WARC record; raw HTTP response wrapped in WARC headers
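
To illustrate what "wrapped in WARC headers" means, here is a minimal Python sketch that splits one uncompressed WARC record into its version line, its headers, and the wrapped HTTP message. The function name `parse_warc_record` and the sample record (including its record ID and date) are made up for this illustration; this is not the code the Pipeline uses.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, block)."""
    # The WARC headers end at the first empty line (CRLF CRLF).
    head, separator, rest = raw.partition(b"\r\n\r\n")
    if not separator:
        raise ValueError("no empty line found after the WARC headers")

    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]                      # e.g. "WARC/1.1"

    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    # Content-Length counts the bytes of the wrapped block only.
    length = int(headers["content-length"])
    block = rest[:length]
    return version, headers, block


# Hand-written sample: a raw HTTP response wrapped in WARC headers.
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"<html></html>"
)
record = (
    b"WARC/1.1\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000000>\r\n"
    b"WARC-Date: 2024-01-01T00:00:00Z\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Type: application/http; msgtype=response\r\n"
    b"Content-Length: " + str(len(http_response)).encode("ascii") + b"\r\n"
    b"\r\n" + http_response + b"\r\n\r\n"
)

version, headers, block = parse_warc_record(record)
```

After this call, `block` holds the raw HTTP response exactly as it was received, and `headers["warc-target-uri"]` holds the URL it was fetched from. CommonCrawl files are gzip-compressed and contain many records, so real data needs decompression and record iteration on top of this sketch.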

CommonCrawl-specific content:

CommonCrawl-specific metadata (WAT); JSON string
CommonCrawl-specific metadata; JSON wrapped in WARC headers
CommonCrawl extracted text (WET); multi-line file without any mark-up
CommonCrawl extracted text; wrapped in WARC headers
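
Since the WAT metadata is a deeply nested JSON structure, a small helper for walking it can be useful. The following Python sketch is only an illustration: the helper `dig` is hypothetical, the sample JSON is hand-written, and the key names reflect our understanding of the WAT layout, so check them against real records before relying on them.

```python
import json


def dig(data, *keys, default=None):
    """Follow a chain of keys through nested dicts; return default if one is missing."""
    for key in keys:
        if not isinstance(data, dict) or key not in data:
            return default
        data = data[key]
    return data


# Hand-written stand-in for one WAT JSON string; real records carry many more fields.
wat_json = """
{
  "Envelope": {
    "WARC-Header-Metadata": {"WARC-Target-URI": "http://example.com/"},
    "Payload-Metadata": {
      "HTTP-Response-Metadata": {
        "HTML-Metadata": {"Head": {"Title": "Example"}}
      }
    }
  }
}
"""

wat = json.loads(wat_json)
target = dig(wat, "Envelope", "WARC-Header-Metadata", "WARC-Target-URI")
title = dig(wat, "Envelope", "Payload-Metadata", "HTTP-Response-Metadata",
            "HTML-Metadata", "Head", "Title")
# A missing key gives the default instead of raising an exception.
missing = dig(wat, "Envelope", "Payload-Metadata", "No-Such-Key", default="")
```

Returning a default for absent keys matters here, because not every crawled response is HTML and therefore not every record has the HTML part.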

Pipeline:

Hits produced by the filter rules; JSON or other serialization

HTML inspection; produces JSON or other serialization. It is implemented via our Perl module HTML::Inspect:

URL normalization: all produced URLs are normalized with respect to the page they were found in. Because there are so many links, you have various options to reduce the list.
classic <meta>: the relatively small subset of traditional meta fields from <meta> elements.
<meta> names: a table of simple key-value pairs from <meta> elements.
all <meta>: a list of all <meta> elements, including all their attributes.
references: all kinds of references, like href and src attributes.
<link>: facts from <link> elements.
OpenGraph: the OpenGraph data of the page.
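
To give an impression of what these extracts contain, the following Python sketch collects a table of meta names, the OpenGraph properties, and the references of a page, with each reference resolved against the page URL. It uses only the Python standard library; it is not HTML::Inspect, and the class name `PageInspector` and the shape of its output are assumptions made for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageInspector(HTMLParser):
    """Collect meta names, OpenGraph properties and references from one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.meta_names = {}   # <meta name=... content=...> as key-value pairs
        self.opengraph = {}    # <meta property="og:..." content=...>
        self.references = []   # (tag, attribute, absolute URL)

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)     # tag and attribute names arrive lower-cased

        if tag == "meta" and attr.get("content") is not None:
            name = attr.get("name")
            prop = attr.get("property")
            if name:
                self.meta_names[name] = attr["content"]
            if prop and prop.startswith("og:"):
                self.opengraph[prop] = attr["content"]

        for key in ("href", "src"):
            value = attr.get(key)
            if value:
                # Resolve relative to the page, then drop the #fragment.
                absolute, _fragment = urldefrag(urljoin(self.page_url, value))
                self.references.append((tag, key, absolute))


html = """<html><head>
<meta name="description" content="Example page">
<meta property="og:title" content="Example">
<link rel="canonical" href="/index.html">
</head><body>
<a href="../about#team">About</a>
<img src="logo.png">
</body></html>"""

inspector = PageInspector("http://example.com/docs/page.html")
inspector.feed(html)
```

Here `inspector.references` contains absolute URLs only, for instance `http://example.com/about` for the `<a>` element. The sketch ignores `<base href>` and applies no further normalization rules, which a real implementation would need to handle.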

Ideas for other extracts

Ask for them to be implemented if you have a use for them:

Various Microformats like vcard, hMedia, ...
TwitterCards
AGLS Metadata Standard, an Australian government standard which shows up as <meta name="AGLSTERMS.*">
RDFa content in HTML
Dublin Core, which shows up as <meta name="dc.*"> and <meta name="dcterms.*">
Cookies set by the response
Which CMS framework is used
Response time

Useful information:

WARC specification v1.1

More ideas? Please report them, even if you do not need them (yet).