edgar_chen
(Edgar Chen)
Hi Community!
I’m building a CAT tool (Computer-aided Translation) that involves doing fuzzy matches between an input file and a translation memory (database of previously translated strings). The input file is a list of String documents, and the translation memory contains a list of TranslationUnit documents. My tier is M10 (the basic one). My approach is as follows:
- Run an aggregation on String and use $lookup to cross-check the key and source.text of each String against TranslationUnit. Units with the same key and the same source.text are 101% matches; units with a different key but the same source.text are 100% matches. Through the "as" attribute of $lookup, these matches are added to the 101match and 100match fields respectively.
- Check the array lengths of 101match and 100match on the returned documents. If a length is greater than 0, add a new field matchScore and assign 101 or 100. If both lengths are 0, assign -1 to matchScore.
- Since $search doesn't support the $let keywords, I cannot perform fuzzy searches within the same aggregate command. Instead, I select all String documents with matchScore === -1 and pass each of them to a new aggregate command whose pipeline contains $search. This is where I have to send a very large number of concurrent queries, and the response always times out with 60K String documents (all fuzzy) and 220K TranslationUnit documents.
```javascript
// taskTargetLang and projectTranslationMemories come from the enclosing scope.
async function findFuzzy(string) {
  const pipeline = [
    {
      $search: {
        // find fuzzy strings in source text
        index: "tmSourceTextOnly",
        text: {
          query: string.source.text,
          path: "source.text",
          // fuzzy: { maxEdits: 1 }
        }
      }
    },
    // check if these fuzzy sources have the current target translation
    {
      $match: {
        translations: {
          $elemMatch: {
            lang: taskTargetLang
          }
        },
        parentTranslationMemory: {
          $in: projectTranslationMemories
        }
      }
    },
    {
      // 1 is enough for creating analyses
      $limit: 1
    }
  ];
  const result = await TranslationUnit.aggregate(pipeline);
  string['fuzzyMatch'] = result;
  console.log('fuzzy single string result', result);
  return string;
}

// one aggregate call per unmatched string, all started at once
let fuzzyPromises = [];
for (let string of allMatches.fuzzyMatch) {
  fuzzyPromises.push(findFuzzy(string));
}
console.log('finding fuzzy matches…', fuzzyPromises.length);
allMatches.fuzzyMatch = await Promise.all(fuzzyPromises);
```
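Since the loop above starts every aggregate call at once, one thing that could be tried is capping how many queries are in flight at a time. Below is a minimal sketch of that idea in plain JavaScript. The helper name `mapWithConcurrency` and the limit value are hypothetical (not part of any library), and this is not confirmed to resolve the timeout; it only bounds the number of concurrent calls.

```javascript
// Hypothetical helper: run `worker` over `items` with at most `limit` calls in flight.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // index of the next item nobody has picked up yet

  // Each runner keeps pulling the next unprocessed index until the list is exhausted.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  // Start up to `limit` runners; results keep the input order.
  const runners = Array.from({ length: Math.min(limit, items.length) }, runner);
  await Promise.all(runners);
  return results;
}
```

With this helper, the last lines of the snippet above could become something like `allMatches.fuzzyMatch = await mapWithConcurrency(allMatches.fuzzyMatch, 20, findFuzzy);`, where 20 is an arbitrary starting value to tune against the cluster.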
In comparison, if the file consists mostly of 101% and 100% matches, the whole process takes only 20 to 30 seconds.
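For context, the exact-match pass from the first two steps could be sketched roughly as the pipeline below. This is an illustration under assumptions, not the exact code in use: the function name `buildExactMatchPipeline` is hypothetical, the collection name `translationunits` is assumed (the usual Mongoose pluralization of `TranslationUnit`), and the `$limit: 1` inside each lookup is an assumption based on only the array length being checked afterwards.

```javascript
// Hypothetical builder for the exact-match pass (101% and 100% matches).
// `tmCollection` is an assumed collection name for TranslationUnit documents.
function buildExactMatchPipeline(tmCollection = "translationunits") {
  // keyOp is "$eq" for 101% matches (same key) and "$ne" for 100% matches (different key).
  const lookup = (as, keyOp) => ({
    $lookup: {
      from: tmCollection,
      let: { k: "$key", t: "$source.text" },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ["$source.text", "$$t"] },
                { [keyOp]: ["$key", "$$k"] },
              ],
            },
          },
        },
        { $limit: 1 }, // assumption: one hit is enough to decide the score
      ],
      as,
    },
  });

  return [
    lookup("101match", "$eq"),
    lookup("100match", "$ne"),
    {
      // 101 wins over 100; -1 marks strings that still need a fuzzy search
      $addFields: {
        matchScore: {
          $switch: {
            branches: [
              { case: { $gt: [{ $size: "$101match" }, 0] }, then: 101 },
              { case: { $gt: [{ $size: "$100match" }, 0] }, then: 100 },
            ],
            default: -1,
          },
        },
      },
    },
  ];
}
```

The result would be passed to something like `String.aggregate(buildExactMatchPipeline())`, after which only documents with `matchScore` of -1 go on to the `$search` step.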
Indexes and Search Indexes are in place for all related fields, so I assume I’m not missing anything important.
I think my key question here is: is this the right way to use $search? If not, is there a plausible solution for my use case I can look into? I really hope I can achieve this with Atlas Search because of how convenient it is, and if I have to implement fuzzy matching outside of Atlas Search, then the M10 subscription may no longer be worth it.
Thank you!