The Pipeline needs your project: the more projects, the more efficient our infrastructure becomes.
- 2021-04 WARC statistics, counting link types.
Example project: Bavarian food
One of the first Tasks for the Pipeline was contributed by the
University of Passau (Germany). It wants to detect (and extract)
websites in Bavaria (German: Bayern) which relate to local food.
In the first step, their Task selects all webpages which are written in German
(under any domain) or in any language under the TLD,
and which have an extracted text of at least 300 words. It also reports
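This first-step filter can be sketched as a small predicate. This is a minimal illustration, not the Task's actual code: the `Page` record, its field names, and the `target_tld` parameter are assumptions, since the text does not specify the Task's data model or which TLD it matches.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    language: str   # detected language code, e.g. "de"
    tld: str        # top-level domain of the host, e.g. "de"
    text: str       # extracted plain text of the page

# Hypothetical first-step filter: keep pages written in German (any domain),
# or in any language under the target TLD, with at least 300 extracted words.
def first_step(page: Page, target_tld: str = "de") -> bool:
    enough_text = len(page.text.split()) >= 300
    return enough_text and (page.language == "de" or page.tld == target_tld)
```

In the Pipeline, such a predicate would run over every extracted page of a crawl, passing the survivors on to later, more expensive processing steps.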
Example project: link extraction
For use by other infrastructure projects related to the
Skrodon Project, there is a link extractor Task. For each HTML response,
<link> elements and href/src link
data are collected for further processing. Links and data are nicely
structured.
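The core of such a Task can be sketched with Python's standard-library HTML parser. This is a simplified stand-in, assuming the goal is to collect every href/src attribute with its tag; the real Task's parser and output format are not specified in the text.

```python
from html.parser import HTMLParser

# Minimal link extractor sketch: for each start tag in an HTML response,
# record any href or src attribute as a (tag, attribute, value) triple.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []   # list of (tag, attribute, value) triples

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append((tag, name, value))

html = '<a href="/a">x</a><img src="p.png"><link rel="stylesheet" href="s.css">'
parser = LinkExtractor()
parser.feed(html)
```

Collecting triples rather than bare URLs keeps the context (which tag and attribute produced the link), which downstream projects can use to distinguish, say, hyperlinks from embedded resources.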
Ideas for student projects
It is not hard to extract many useful facts from huge numbers of web pages via the Crawl Pipeline, but doing so still poses challenges that make these into educational student projects.
- Count distribution of web-server software;
- Count distribution of authoring tools;
- Estimate the percentage of internet which CommonCrawl collects monthly;
- Detector for pages which need headless crawling;
- Discover the languages available for a certain page, from response headers and OpenGraph;
- Build a simple searchable index based on OpenGraph and "classic" meta fields only;
You may have ideas to add to this list. We will help the students collect the right raw data.
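As a taste of the first idea in the list, the distribution of web-server software can be tallied from HTTP "Server" response headers. The sample headers below are hypothetical; in the Pipeline they would come from the crawl's WARC response records.

```python
from collections import Counter

# Sketch: count web-server software by product name from "Server" headers.
def server_distribution(server_headers):
    counts = Counter()
    for header in server_headers:
        if not header:
            counts["(unknown)"] += 1
            continue
        # Keep only the product name, dropping version and comments,
        # e.g. "Apache/2.4.57 (Ubuntu)" -> "apache"
        product = header.split("/")[0].split()[0].lower()
        counts[product] += 1
    return counts

sample = ["Apache/2.4.57 (Ubuntu)", "nginx/1.24.0", "Apache", None, "cloudflare"]
print(server_distribution(sample).most_common())
```

The same shape of project works for the authoring-tools idea, swapping the "Server" header for the page's generator meta field.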