Unigrams in shingle token filter

Timur_Iskhakov · 2024-10-10T20:09:27.094Z

Hello.
The shingle token filter seems to ignore single-token values.

{
  "tokenFilters": [
    {
      "maxShingleSize": 5,
      "minShingleSize": 2,
      "type": "shingle"
    },
    {
      "maxGram": 30,
      "minGram": 3,
      "type": "edgeGram"
    }
  ],
  ...
}
Strings like '*some value*' or '*some other value*' are tokenized into shingles and are searchable, but any single-token string (like '*somevalue*') is ignored and is not searchable. Atlas enforces a minimum value of 2 for **minShingleSize** (as far as I know, Lucene doesn't).

Are there options like Elasticsearch's 'output_unigrams' or 'output_unigrams_if_no_shingles' for the shingle token filter? Or is there another way to combine shingles for multi-token strings with 'use-as-is' for single-token strings?

Erik_Hatcher · 2024-10-11T13:38:30.521Z

Both of those parameters of the shingle filter are set to true in Atlas Search - so you’ll get unigrams automatically, whether there’s a shingle generated or not.

Timur_Iskhakov · 2024-10-15T09:58:28.930Z

This playground example shows (correct me if I’m doing something wrong) that unigrams are not emitted by the shingle filter.

Index:

{
  "analyzer": "my-analyzer",
  "searchAnalyzer": "lucene.keyword",
  "mappings": {
    "dynamic": true
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "my-analyzer",
      "tokenFilters": [
        { "type": "trim" },
        { "type": "lowercase" },
        {
          "minShingleSize": 2,
          "maxShingleSize": 5,
          "type": "shingle"
        }
      ],
      "tokenizer": { "maxTokenLength": 100, "type": "whitespace" }
    }
  ]
}

Data:

[
  {
    "value": "marry"
  },
  {
    "value": "marry had a little lamb"
  }
]

Searches:

[
  {
    $search: {
      index: "default",
      text: {
        query: "marry had",
        path: {
          wildcard: "*"
        }
      }
    }
  }
]

→ one document found

[
  {
    $search: {
      index: "default",
      text: {
        query: "had a little lamb",
        path: {
          wildcard: "*"
        }
      }
    }
  }
]

→ one document found

[
  {
    $search: {
      index: "default",
      text: {
        query: "marry",
        path: {
          wildcard: "*"
        }
      }
    }
  }
]

→ no documents found

So it seems that unigrams are not in the index.
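
These three results are consistent with a shingle filter that emits no unigrams. The sketch below is a rough plain-JavaScript simulation of the analyzer chain from the index definition (whitespace tokenizer, trim, lowercase, shingle). It is not Atlas or Lucene code: the `shingles` helper and its `outputUnigrams` parameter are hypothetical names chosen for illustration, and the order of the emitted terms is not meant to match Lucene's.

```javascript
// Hypothetical helper (not an Atlas or Lucene API): approximates
// whitespace tokenizer -> trim -> lowercase -> shingle.
function shingles(text, minSize, maxSize, outputUnigrams) {
  // Split on whitespace, lowercase, and drop empty tokens.
  const tokens = text.trim().toLowerCase().split(/\s+/).filter(Boolean);
  // Optionally keep the single tokens themselves.
  const terms = new Set(outputUnigrams ? tokens : []);
  for (let start = 0; start < tokens.length; start++) {
    for (let size = minSize; size <= maxSize && start + size <= tokens.length; size++) {
      terms.add(tokens.slice(start, start + size).join(" "));
    }
  }
  return [...terms];
}

// With unigrams disabled, a single-token value produces no terms at all.
const single = shingles("marry", 2, 5, false);
const multi = shingles("marry had a little lamb", 2, 5, false);
console.log(single.length);                      // number of terms for "marry"
console.log(multi.includes("marry had"));        // shingle present?
console.log(multi.includes("marry"));            // unigram present?
```

Because the search analyzer is `lucene.keyword`, each query string is kept as one term, so "marry had" and "had a little lamb" can match shingles, while "marry" would need a unigram in the index to match.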

Timur_Iskhakov · 2024-10-24T07:37:13.313Z

So, given that the unigram options are set to true and the example shows that unigrams are not in the index, is this a bug in Atlas?

Erik_Hatcher · 2024-10-28T10:51:56.486Z

You’re right - this is not working as I had expected. I’ve created this playground to demonstrate the issue. I’ll surface this internally and see if I can get some clarification.

Erik_Hatcher · 2024-10-28T10:59:18.431Z

Quick follow-up - I had misread the code earlier. We do indeed set outputUnigrams to false when using the ShingleFilter. Apologies for the initial mistake. And, alas, there’s no way to adjust this setting currently.
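
Since the flag itself cannot be changed, one possible workaround for the original question (shingles for multi-token strings, 'use-as-is' for single-token strings) might be to index the field a second time under a `multi` sub-field with a non-shingling analyzer and query both paths. The sketch below was not tested against Atlas and rests on several assumptions: the analyzer name `my-analyzer-as-is` and the multi name `asIs` are hypothetical, the dynamic mapping from the playground is replaced with an explicit mapping for the `value` field, and `buildSearchStage` is an illustrative helper, not part of any driver.

```javascript
// Assumption: explicit mapping for "value" with a second, as-is sub-field.
const indexDefinition = {
  analyzer: "my-analyzer",
  searchAnalyzer: "lucene.keyword",
  mappings: {
    dynamic: false,
    fields: {
      value: {
        type: "string",
        analyzer: "my-analyzer",
        searchAnalyzer: "lucene.keyword",
        multi: {
          // Hypothetical sub-field name: indexes the whole string as one token.
          asIs: {
            type: "string",
            analyzer: "my-analyzer-as-is",
            searchAnalyzer: "lucene.keyword"
          }
        }
      }
    }
  },
  analyzers: [
    {
      name: "my-analyzer",
      tokenizer: { type: "whitespace", maxTokenLength: 100 },
      tokenFilters: [
        { type: "trim" },
        { type: "lowercase" },
        { type: "shingle", minShingleSize: 2, maxShingleSize: 5 }
      ]
    },
    {
      // Hypothetical analyzer: same normalization, no shingling.
      name: "my-analyzer-as-is",
      tokenizer: { type: "keyword" },
      tokenFilters: [{ type: "trim" }, { type: "lowercase" }]
    }
  ]
};

// Illustrative helper: match either the shingled field or the as-is sub-field.
function buildSearchStage(query) {
  return {
    $search: {
      index: "default",
      compound: {
        should: [
          { text: { query, path: "value" } },
          { text: { query, path: { value: "value", multi: "asIs" } } }
        ],
        minimumShouldMatch: 1
      }
    }
  };
}

console.log(JSON.stringify(buildSearchStage("marry"), null, 2));
```

Under these assumptions, a query for "marry" would match the single-token document through the `asIs` sub-field, while multi-word queries would keep matching through the shingles.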
