Aug 2024

Hi everyone!

I’ll start with what I’m trying to do and then get to the hurdles I’ve run into:

I made a Node.js mini app that:

  • Gets all the collections from my database
  • For each collection it spawns a worker that runs a mongoexport command with a query, exporting to a JSON file using the jsonArray flag

The aim of this app is to export all the information related to, let’s say, a customer, from all collections.
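Roughly, the setup looks like this (a simplified sketch; the URI, db name, query, and file names below are placeholders, not my real values):

const { MongoClient } = require('mongodb');
const { Worker, isMainThread, workerData } = require('worker_threads');
const { execSync } = require('child_process');

const uri = 'mongodb://localhost:27017';   // placeholder
const dbName = 'mydb';                     // placeholder
const customerId = 'XXX';                  // placeholder

if (isMainThread) {
  // main thread: list the collections, then spawn one worker per collection
  (async () => {
    const client = new MongoClient(uri);
    await client.connect();
    const collections = await client.db(dbName).listCollections().toArray();
    await client.close();
    for (const { name } of collections) {
      new Worker(__filename, { workerData: { collection: name } });
    }
  })().catch(console.error);
} else {
  // worker thread: run mongoexport for its collection with the customer query
  const { collection } = workerData;
  const query = JSON.stringify({ customerId });
  const command = `mongoexport --uri ${uri} -d ${dbName} -c ${collection} ` +
    `-q '${query}' --jsonArray -o ${collection}.json`;
  execSync(command, { stdio: 'inherit' });
}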

The main issue I’m getting is that the mongoexport commands work fine on their own, but when I try to run them in parallel using worker threads they all fail with:

could not connect to server: connection() : dial tcp: i/o timeout

For now I’m stuck exporting iteratively, but it would be great if I could do this asynchronously in some way, because I’m dealing with over 15 GB of data.

Another question would be whether I could parallelize AND divide the mongoexport commands into batches, for example: job1 exports the first 10000 results, job2 from 10001 to 20000, and so on…

I would be grateful if anyone has an idea on how I can manage this. Thanks in advance!

@Tudor_Palade, please modify the snippet below to meet your needs. It seems to work fine. Try not to execute all the promises at once; doing so might bring up connection issues.

const { MongoClient } = require('mongodb');
const { exec } = require("child_process");

// your connection uri
const uri = 'XXX';
const client = new MongoClient(uri);
const dbName = 'sample_vector_search';

async function main() {
  await client.connect();
  console.log('Connected successfully to server');
  const db = client.db(dbName);
  const collectionName = 'restaurant_reviews';
  const restaurant_reviews = db.collection(collectionName);
  const count = await restaurant_reviews.find({}).count();

  const tasks = [];
  const limit = 1000;
  for (let index = 0; index < count; index = index + limit) {
    tasks.push(spawnExport(uri, dbName, collectionName,
      `${dbName}_${collectionName}_batch${index}`, index, limit));
  }

  // Use any library of your choice to better handle the tasks.
  // Process the tasks array in chunks; ideally chunk size = number of cores.
  await Promise.allSettled(tasks)
    .then(() => { console.log('Done'); })
    .catch(() => { console.log('Error'); });

  return 'done.';
}

function spawnExport(uri, db, col, filename, skip, limit) {
  return new Promise((resolve, reject) => {
    const command = `mongoexport --uri ${uri} -d ${db} -c ${col} -o ${filename}.json --skip=${skip} --limit=${limit}`;
    console.log(command);
    exec(command, (error, stdout, stderr) => {
      if (error) {
        console.log(`error: ${error.message}`);
        return reject(error.message);
      }
      if (stderr) {
        console.log(`stderr: ${stderr}`);
        return reject(stderr);
      }
      return resolve();
    });
  });
}

main()
  .then(console.log)
  .catch(console.error)
  .finally(() => client.close());
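On the "don't execute everything at once" point, a minimal sketch of chunked execution might look like this (runInChunks is a hypothetical helper, and using the CPU core count as the chunk size is just an assumption, not a hard rule):

const os = require('os');

// Hypothetical helper: run the export tasks a few at a time instead of all at once.
async function runInChunks(taskFactories, chunkSize = os.cpus().length) {
  const results = [];
  for (let i = 0; i < taskFactories.length; i += chunkSize) {
    const chunk = taskFactories.slice(i, i + chunkSize);
    // only this chunk's exports are started; wait for them before moving on
    const settled = await Promise.allSettled(chunk.map(fn => fn()));
    results.push(...settled);
  }
  return results;
}

For this to actually limit concurrency, push functions instead of already-started promises into the tasks array, i.e. tasks.push(() => spawnExport(...)), and then call await runInChunks(tasks) in place of the single Promise.allSettled(tasks).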

Thank you so much, I’ll give this a try!
Dividing the tasks by the total count of documents is a good idea, although at the moment I am using countDocuments instead of count, and apparently it causes serious performance issues when querying big collections with billions of entries :confused:
But I’m pretty sure that this won’t exactly solve running 30+ exports at the same time.

I’ll definitely try this out and compare performance with the current setup though.
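
On the count side, one thing I might try is estimatedDocumentCount(), which reads the count from collection metadata instead of scanning the documents, so it should stay fast even on huge collections (at the cost of being approximate). A rough, untested sketch (approxBatchCount is just a name I made up):

// Rough idea (untested): an approximate total is good enough for skip/limit
// batching; the last batch simply exports fewer documents than expected.
async function approxBatchCount(db, collectionName, limit = 10000) {
  const approxCount = await db.collection(collectionName).estimatedDocumentCount();
  return Math.ceil(approxCount / limit);
}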