Filter on full words

Matches when at least one of the configured words occurs as a whole word in the text extract of the page. To search for parts of words, use the Match Text filter instead.

This filter has one option: case_sensitive (default false, so matching is case-insensitive unless you enable it).

Hit information

The filter produces a Hit for every configured word that is found in the text. Duplicates are removed, so a word that occurs several times still yields a single Hit.

When matching is case-sensitive (not the default), each Hit looks like this:

{ "rule": "full word", "word": "Pizza" }

When your match is case-insensitive (the default behavior), the word is returned with the casing you provided: it does not show the detected capitalization. Each Hit looks like this:

{ "rule": "full word-i", "word": "Pizza" }
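The Hit rules above can be illustrated with a small Python sketch. It is not the filter's actual implementation: the function name `full_word_hits` and the use of `\b` word boundaries are assumptions made for illustration, and only the Hit layout and the `case_sensitive` option come from the description above.

```python
import re


def full_word_hits(text, words, case_sensitive=False):
    """Return one Hit per configured word that occurs as a whole word in text.

    Hypothetical helper: it mimics the documented Hit format only.
    """
    rule = "full word" if case_sensitive else "full word-i"
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = []
    seen = set()
    for word in words:
        # Case-insensitive matching treats "Pizza" and "pizza" as the same word.
        key = word if case_sensitive else word.casefold()
        if key in seen:
            continue
        # \b keeps "Pizza" from matching inside "Pizzas".
        if re.search(rf"\b{re.escape(word)}\b", text, flags):
            seen.add(key)
            # The word is reported with the casing that was configured,
            # not the casing found in the text.
            hits.append({"rule": rule, "word": word})
    return hits


if __name__ == "__main__":
    page = "We ordered pizza, and then more PIZZA."
    print(full_word_hits(page, ["Pizza"]))
    print(full_word_hits(page, ["Pizza"], case_sensitive=True))
```

With the default setting the sample text gives a single "full word-i" Hit for "Pizza", even though the word appears twice and with different capitalization. With `case_sensitive=True` there is no Hit, because the exact spelling "Pizza" is absent.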

Characters with various notations

Non-ASCII characters pose a big problem: there are usually several different ways to write them. For instance, when you would like to find the city of München in HTML, there are four options:

M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen
Mu\N{COMBINING DIAERESIS}nchen
Muenchen
Munchen           # lazy author

Some words have many such variants, so only a regular expression can cover them all. Use the Match Text filter with a pattern like this:

\b M (?:
     u |
     ue |
     \N{LATIN SMALL LETTER U WITH DIAERESIS} |
     u\N{COMBINING DIAERESIS}
     ) nchen \b
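To check such a pattern locally before putting it in a filter, it can be tried out in Python. The sketch below assumes Python 3.8 or later, where the `re` module understands `\N{...}` names, and uses `re.VERBOSE` so that the spaces in the pattern are ignored. It spells the ü with the Unicode names for the diaeresis. The helper name `mentions_muenchen` is hypothetical, and the regex dialect of the filter itself may differ from Python's.

```python
import re

# Same layout as the pattern above; re.VERBOSE ignores the whitespace.
# \N{...} inside a regular expression needs Python 3.8 or later.
MUENCHEN = re.compile(
    r"""
    \b M (?:
         u |
         ue |
         \N{LATIN SMALL LETTER U WITH DIAERESIS} |
         u\N{COMBINING DIAERESIS}
         ) nchen \b
    """,
    re.VERBOSE,
)


def mentions_muenchen(text):
    """Hypothetical helper: True when any spelling variant occurs in text."""
    return MUENCHEN.search(text) is not None


if __name__ == "__main__":
    variants = [
        "M\N{LATIN SMALL LETTER U WITH DIAERESIS}nchen",  # precomposed character
        "Mu\N{COMBINING DIAERESIS}nchen",                 # base letter plus combining mark
        "Muenchen",
        "Munchen",
    ]
    for variant in variants:
        print(ascii(variant), mentions_muenchen(variant))
```

All four spellings are accepted, while the `\b` anchors reject longer words such as "Münchener".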
