While doing some testing, I noticed that the tokenizer treats the guillemets « and » differently from the more common quote characters " and '.
Look at this string: «a sentence between guillemet». Your tokenizer produces these tokens: «a, sentence, between, guillemet».
It seems to me that the tokens «a and guillemet» are not identified for further processing, such as the lowercase transformation. So if the tokenizer also encounters the uppercase versions GUILLEMET» and «A in the text, it will create two more entries.
This results in spurious entries in the inverted index dictionary, and in the loss of the information associated with the term guillemet, which is never indexed on its own. All of this worsens the user experience in terms of accuracy and recall. Not least, it unnecessarily increases RAM usage with useless tokens.
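
To make the effect concrete, here is a minimal Python sketch. It is not the real tokenizer: `tokenize_reported` and `tokenize_expected` are hypothetical helpers, and the rule "tokens that still carry a guillemet skip lowercasing" is only my assumption, modeled on the behavior described above.

```python
# Hypothetical character sets; they are not taken from the real tokenizer.
ASCII_QUOTES = "\"'"
GUILLEMETS = "«»"


def tokenize_reported(text):
    """Model of the reported behavior (an assumption, not the real code):
    only ASCII quotes are stripped, and a token that still contains a
    non-alphanumeric character such as a guillemet is not lowercased."""
    tokens = []
    for raw in text.split():
        token = raw.strip(ASCII_QUOTES)
        if token.isalnum():
            token = token.lower()
        if token:
            tokens.append(token)
    return tokens


def tokenize_expected(text):
    """Behavior I would expect: guillemets are stripped like ASCII quotes,
    then every token is lowercased."""
    tokens = []
    for raw in text.split():
        token = raw.strip(ASCII_QUOTES + GUILLEMETS).lower()
        if token:
            tokens.append(token)
    return tokens


text = "«a sentence between guillemet» «A SENTENCE BETWEEN GUILLEMET»"

# The set of distinct tokens stands in for the inverted index dictionary.
reported_index = set(tokenize_reported(text))
expected_index = set(tokenize_expected(text))

print(len(reported_index))  # 6 distinct entries, including «a, «A, guillemet», GUILLEMET»
print(len(expected_index))  # 4 distinct entries: a, sentence, between, guillemet
print("guillemet" in reported_index)  # False: the plain term is never indexed
```

Under these assumptions, the same eight words yield six dictionary entries instead of four, and a search for guillemet matches nothing.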