Hello.
The shingle token filter seems to ignore single-token values.
```
{
  "tokenFilters": [
    {
      "maxShingleSize": 5,
      "minShingleSize": 2,
      "type": "shingle"
    },
    {
      "maxGram": 30,
      "minGram": 3,
      "type": "edgeGram"
    }
  ],
  ...
}
```
Strings like '*some value*' or '*some other value*' are tokenized into shingles and are searchable, but any single-token string (like '*somevalue*') is ignored and is not searchable. Atlas enforces a minimum value of 2 for **minShingleSize** (as far as I know, Lucene doesn't).
Are there options like ES's 'output_unigrams' or 'output_unigrams_if_no_shingles' for the shingle token filter? Or is there another way to combine shingles for multi-token strings with 'use-as-is' for single-token strings?
Both of those parameters of the shingle filter are set to true in Atlas Search - so you’ll get unigrams automatically, whether there’s a shingle generated or not.
You’re right - this is not working as I had expected. I’ve created this playground to demonstrate the issue. I’ll surface this internally and see if I can get some clarification.
Quick follow-up - I had misread the code earlier. We do indeed set outputUnigrams to false when using the ShingleFilter. Apologies for the initial mistake. And, alas, there’s no way to adjust this setting currently.
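
For anyone landing on this thread later, here is a rough Python sketch of why a single-token string produces nothing when unigram output is off. It is a simplified simulation, not Atlas Search or Lucene code: the function name `shingles` and its parameters are made up for illustration (loosely mirroring the ES option names mentioned above), and it ignores details such as token positions and filler tokens.

```python
def shingles(tokens, min_size=2, max_size=5,
             output_unigrams=False,
             output_unigrams_if_no_shingles=False):
    """Simplified shingle filter: joins runs of adjacent tokens with a space.

    Hypothetical helper for illustration only; real implementations also
    track positions and offsets.
    """
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            # Emit the single token itself alongside any shingles.
            out.append(tokens[start])
        for size in range(min_size, max_size + 1):
            if start + size > len(tokens):
                break  # not enough tokens left for a shingle of this size
            out.append(" ".join(tokens[start:start + size]))
    if not out and output_unigrams_if_no_shingles:
        # Fall back to the plain tokens only when no shingle was produced.
        out = list(tokens)
    return out


# Multi-token input yields shingles, so there is something to index.
print(shingles(["some", "other", "value"]))

# Single-token input with unigrams disabled yields nothing at all,
# which matches the "not searchable" behavior described in this thread.
print(shingles(["somevalue"]))

# With either option enabled, the single token would survive.
print(shingles(["somevalue"], output_unigrams_if_no_shingles=True))
```

With minShingleSize fixed at 2 or more and unigram output disabled, a one-token input can never form a shingle, so the downstream edgeGram filter receives no tokens to work with.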