Projects

The Pipeline needs your project: the more projects, the more efficient our infrastructure becomes.

2021-04 WARC statistics, counting link types.

Example project: Bavarian Food

One of the first Tasks for the Pipeline was contributed by the University of Passau (Germany). It aims to detect (and extract) websites in Bayern (the German state of Bavaria) which relate to local food. In the first step, their Task selects all webpages written in German (under any domain) or in any language under the TLD .bayern, provided the extracted text contains at least 300 words. It also reports discovered city names.
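
The first-step selection described above can be sketched as a small predicate. This is only an illustration, not the actual Task code: the function name, the `language` argument (assumed to be an ISO 639-1 code produced by an earlier language-detection step), and counting words by splitting on whitespace are all assumptions.

```python
from urllib.parse import urlsplit

MIN_WORDS = 300  # threshold named in the project description


def matches_bavarian_food_filter(url: str, language: str, text: str) -> bool:
    """Return True when a page passes the first-step selection.

    Hypothetical helper: `language` is assumed to be an ISO 639-1 code,
    and `text` the text already extracted from the page.
    """
    # Host name of the page, lower-cased and without a trailing dot.
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    in_bayern_tld = host.endswith(".bayern")
    is_german = language.lower() == "de"
    # Assumption: a "word" is any whitespace-separated token.
    long_enough = len(text.split()) >= MIN_WORDS
    return (is_german or in_bayern_tld) and long_enough
```

A page is kept when it is long enough and is either German or hosted under .bayern; detecting the food topic and the city names would be later steps, which are not sketched here.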

Example project: link extraction

For use by other infrastructural projects related to the Skrodon Project, there is a link extractor Task. For each HTML response, <meta>, <link>, and href/src link data is collected for further processing. Links and data are normalized.
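
As a rough illustration of this kind of extraction, the sketch below collects <meta> fields and href/src links from one HTML document using only the Python standard library. The class name `LinkCollector` is hypothetical, and the normalization shown (resolving relative links against the page URL and dropping fragments) is an assumption; the real Task may normalize differently.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect <meta> fields and href/src links from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.meta = []   # (name or property, content) pairs
        self.links = []  # (tag, attribute, absolute URL) triples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # "Classic" meta uses name=, OpenGraph uses property=.
            name = attrs.get("name") or attrs.get("property")
            if name and attrs.get("content") is not None:
                self.meta.append((name, attrs["content"]))
            return
        for attr in ("href", "src"):
            value = attrs.get(attr)
            if value:
                # Assumed normalization: make absolute, drop the fragment.
                absolute, _fragment = urldefrag(
                    urljoin(self.base_url, value.strip())
                )
                self.links.append((tag, attr, absolute))


collector = LinkCollector("https://example.org/dir/index.html")
collector.feed('<link rel="stylesheet" href="/s.css"><a href="page.html#top">x</a>')
print(collector.links)
```

Keeping the tag and attribute name next to each URL lets later processing tell a stylesheet <link> apart from an <a> or <img> reference.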

Ideas for student projects

Extracting useful facts from huge numbers of web pages via the Crawl Pipeline is not hard, but it poses enough challenges to make for educational student projects.

Count the distribution of web-server software;
Count the distribution of authoring tools;
Estimate the percentage of the internet which CommonCrawl collects monthly;
Build a detector for pages which need headless crawling;
Discover the languages available for a certain page, from response headers and OpenGraph;
Build a simple searchable index based on OpenGraph and "classic" meta fields only.
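
To give an impression of the scale of such a project, the first idea could start from a sketch like the one below. It assumes the value of the HTTP Server response header has already been pulled out of each response (how the Pipeline delivers this is not shown), and the helper names are hypothetical.

```python
from collections import Counter


def server_product(server_header):
    """Reduce a Server header such as 'Apache/2.4.41 (Ubuntu)' to 'apache'."""
    value = (server_header or "").strip()
    if not value:
        return "(unknown)"  # header missing or empty
    # Keep the first product token and drop its version part.
    return value.split()[0].split("/")[0].lower()


def count_servers(server_headers):
    """Count how often each server product occurs in an iterable of headers."""
    return Counter(server_product(header) for header in server_headers)


counts = count_servers(["nginx/1.18.0", "Apache/2.4.41 (Ubuntu)", "nginx", None])
for product, total in counts.most_common():
    print(product, total)
```

Much of the educational challenge lies in what this sketch ignores: proxies and CDNs that replace the header, sites that hide it, and the bias of the crawl sample itself.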

You may have ideas to add to this list. We will help students collect the right raw data.