Tokenization vs. guillemets

While doing some testing, I noticed that the tokenizer treats guillemets («, ») differently from the more common quotation marks (", ').

Look at this string: «a sentence between guillemet». Your tokenizer produces: «a, sentence, between, guillemet».

The tokens «a and guillemet» keep their guillemets attached, and it seems they are then not identified for further processing, such as the lower-case transformation. So if the tokenizer also encounters the upper-case versions GUILLEMET» and «A in the text, it will create two more entries.

This results in spurious entries in the inverted-index dictionary, and in lost information, since the term guillemet itself is never indexed. All of that hurts the user experience in terms of accuracy and recall. Not least, it unnecessarily increases RAM usage with useless tokens.
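To illustrate the expected behavior, here is a minimal sketch in Python (a hypothetical `tokenize` function, not the project's actual code), assuming the tokenizer should drop guillemets like any other punctuation and then lower-case each token:

```python
import re

# Hypothetical sketch, not the real tokenizer: keep only runs of
# Unicode word characters, so « and » are stripped just like " and ',
# then lower-case each token before it reaches the inverted index.
def tokenize(text):
    return [t.lower() for t in re.findall(r"\w+", text)]

# Guillemets no longer stick to the first and last tokens, and the
# upper-case variant folds into the same dictionary entries.
print(tokenize("«a sentence between guillemet»"))
print(tokenize("«A SENTENCE BETWEEN GUILLEMET»"))
```

With this approach both strings yield the same four terms, so no duplicate or guillemet-prefixed entries end up in the index.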

Hey @Luca,

Thanks for pointing this out. Would you mind opening an issue on GitHub for it?

Now I understand why documents are sometimes not returned by certain searches when those angle quotes are present. The same problem also exists in the enterprise version, whose licence is not exactly cheap! A problem of this nature is no longer justifiable these days, when Unicode is the de facto standard in every document.