Extract Product information
After the filters have selected the products you want to receive, you should narrow the data down to what you are really interested in. The input of the Pipeline is more than 10 terabytes per day, and you probably want to reduce that to a few gigabytes per day to process yourself. The extraction can be formatted and packaged in many ways. Here we focus only on the content.
At the moment, we only process CommonCrawl data. For sources to be added in the future, we do not yet know what can be offered.
Writing Tasks
Available data
Standardized data:
- (raw) HTTP request (bytes sent)
- (raw) HTTP response (bytes received)
In standard WARC format:
- request as WARC record; raw HTTP request wrapped in WARC headers
- response as WARC record; raw HTTP response wrapped in WARC headers
CommonCrawl-specific content:
- CommonCrawl specific metadata (WAT), JSON string
- CommonCrawl specific metadata; JSON wrapped in WARC headers
- CommonCrawl extracted text (WET); multi-line file without any mark-up
- CommonCrawl extracted text; wrapped in WARC headers
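To make "wrapped in WARC headers" concrete, here is a minimal sketch of how one such record is laid out: a version line, named header fields, a blank line, and then a content block whose size is given by Content-Length. For a response record, that block is the raw HTTP response. The function names parse_warc_record and split_http_message are hypothetical and not part of the Pipeline. The sketch assumes a single, uncompressed record held in memory; CommonCrawl files hold many gzip-compressed records, so real processing needs more than this.

```python
def parse_warc_record(raw: bytes):
    """Split one uncompressed WARC record into (version, headers, block)."""
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line after the WARC headers")
    lines = head.decode("utf-8").split("\r\n")
    version, header_lines = lines[0], lines[1:]
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    # Content-Length counts the bytes of the content block only.
    length = int(headers["Content-Length"])
    return version, headers, rest[:length]


def split_http_message(block: bytes):
    """Split a raw HTTP message into (start line, header lines, body)."""
    head, _, body = block.partition(b"\r\n\r\n")
    start_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    return start_line, header_lines, body


# A hand-made response record, for illustration only.
http = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
record = b"".join([
    b"WARC/1.0\r\n",
    b"WARC-Type: response\r\n",
    b"WARC-Target-URI: http://example.com/\r\n",
    b"Content-Length: " + str(len(http)).encode("ascii") + b"\r\n",
    b"\r\n",
    http,
    b"\r\n\r\n",
])

version, warc_headers, block = parse_warc_record(record)
status_line, http_headers, body = split_http_message(block)
print(version, warc_headers["WARC-Type"], status_line, body)
```

The same outer layout applies to the WAT and WET items: only the content block differs (JSON or plain text instead of an HTTP message).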
Pipeline:
- Hits produced by the filter rules; JSON or other serialization
HTML inspection, which produces JSON or another serialization. It is implemented via our Perl module HTML::Inspect:
- all produced URLs are normalized (see the URL normalization rules) with respect to the page they were found in. You have various options to reduce the list of links, because there are so many!
- the classic meta elements: the relatively small subset of traditional meta fields from <meta> elements.
- a table with meta names from <meta> elements: simple key/value pairs.
- a list of all <meta> elements, including all their attributes.
- all kinds of references, like href and src.
- facts from <link> elements.
- OpenGraph data.
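As an illustration of what these extracts contain, the sketch below collects similar facts with Python's standard library. It is not the HTML::Inspect implementation and does not mirror its API: the class name PageFacts and its attribute names are hypothetical. It also assumes that resolving a reference against the page URL with urljoin is enough; the Pipeline's URL normalization rules go further than that, and the sketch ignores a <base> element.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageFacts(HTMLParser):
    """Collect meta names, OpenGraph data, <link> facts and references."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.meta_names = {}   # <meta name=...> as simple key/value pairs
        self.opengraph = {}    # <meta property="og:..."> values
        self.links = []        # all attributes of each <link> element
        self.references = []   # href/src values, resolved against the page

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta":
            content = attr.get("content")
            if attr.get("name") and content is not None:
                self.meta_names[attr["name"]] = content
            prop = attr.get("property") or ""
            if prop.startswith("og:") and content is not None:
                self.opengraph[prop] = content
        if tag == "link":
            self.links.append(attr)
        for key in ("href", "src"):
            if attr.get(key):
                # Relative references become absolute URLs.
                self.references.append(urljoin(self.page_url, attr[key]))


html = (
    '<html><head>'
    '<meta name="description" content="Demo page">'
    '<meta property="og:title" content="Hello">'
    '<link rel="canonical" href="/page">'
    '</head><body>'
    '<a href="../about">About</a><img src="logo.png">'
    '</body></html>'
)

facts = PageFacts("http://example.com/shop/item.html")
facts.feed(html)
print(facts.meta_names)
print(facts.opengraph)
print(facts.links)
print(facts.references)
```

Even this tiny page yields three references, which shows why reducing the list of links matters on real pages.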
Ideas for other extracts
Ask for them to be implemented if you have a use for them:
- Various Microformats like vcard, hMedia, ...
- TwitterCards
- AGLS Metadata Standard, the Australian governmental standard which shows as <meta name="AGLSTERMS.*">
- RDFa content in HTML
- DublinCore, <meta name="dc.*"> and <meta name="dcterm.*">
- Cookies set by the Response
- Which CMS framework is used
- Response time
Useful information:
More ideas? Please report them, even if you do not need them (yet).