Nov 2024

The documentation says, to shard a GridFS collection:

To shard the chunks collection, use either { files_id : 1, n : 1 } or { files_id : 1 } as the shard key index.
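For concreteness, here is a minimal sketch of what either option looks like as admin commands. The database name `images`, the default bucket name `fs`, and both helper functions are assumptions for illustration; `enableSharding` and `shardCollection` are the standard admin commands, and `client` is assumed to be a `pymongo.MongoClient` connected to a `mongos`.

```python
def gridfs_shard_commands(db_name, bucket="fs", include_n=True):
    """Build the admin commands that shard <db_name>.<bucket>.chunks.

    include_n=True  -> shard key { files_id: 1, n: 1 }
    include_n=False -> shard key { files_id: 1 }
    """
    key = {"files_id": 1, "n": 1} if include_n else {"files_id": 1}
    namespace = f"{db_name}.{bucket}.chunks"
    return [
        {"enableSharding": db_name},
        {"shardCollection": namespace, "key": key},
    ]


def apply_commands(client, commands):
    """Send each command to the admin database.

    `client` is assumed to be a pymongo.MongoClient connected to a mongos.
    """
    for command in commands:
        client.admin.command(command)


# Hypothetical database name; nothing is sent to a server here.
for command in gridfs_shard_commands("images"):
    print(command)
```

Calling `apply_commands(client, gridfs_shard_commands("images"))` would then issue both commands; passing `include_n=False` switches to the shorter key.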

But there is no explanation of the advantages or disadvantages of either option. Where can I find the most recent information on this? Is there a knowledge base article somewhere? I have searched and found many old posts that mention the two options, but few that discuss which one is better under which conditions.

Our example: we are going to store a large quantity of images in a five-shard cluster (let’s say at least 60 TB in total). Each individual image is small (less than 16 MB). I’m not sure we should even use GridFS, but that decision has already been made. There will be many more writes than reads, so our main performance consideration is write speed. What is the best shard key?
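To get a feel for how many chunk documents each image produces, here is a small sketch. It assumes the GridFS default chunk size of 255 KiB has not been overridden; `chunk_count` is a hypothetical helper and the image sizes are made-up examples.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # GridFS default chunk size in bytes (255 KiB)


def chunk_count(file_size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Number of chunk documents needed for a file of `file_size` bytes."""
    if file_size <= 0:
        return 0
    # Integer ceiling division: the last chunk may be partially filled.
    return -(-file_size // chunk_size)


# Hypothetical image sizes for illustration.
for label, size in [("200 KiB image", 200 * 1024), ("16 MiB image", 16 * 1024 * 1024)]:
    print(label, "->", chunk_count(size), "chunk(s)")
```

Under that assumption, any image smaller than one chunk is stored as a single chunk with `n` equal to 0, so for such files the `n` component of the shard key would not distinguish anything beyond what `files_id` already does.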

5 months later

I’m wondering about this too, especially the converse case: I’ve got GridFS collections with large-ish files, around 1 GB each, and we usually read or write an entire file at once. We care more about the performance (primarily throughput) of one or two GridFS file queries at a time than about handling lots of concurrent requests from multiple clients. It’s a large data set at rest, so queries will often end up doing disk I/O instead of being served from cache.

The unsharded chunks collections are sitting on one shard, and thus, I think, are only using the I/O capacity of one node. If I sharded the chunks collections on {n: 1, files_id: 1} instead of the recommended {files_id: 1, n: 1} key, it seems like that would spread each GridFS “file” across all our shards in a mostly even manner. Queries on these GridFS files would then be parallelized across all our shards and could take advantage of their aggregate I/O capacity.
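Here is a toy model of the placement I have in mind. It is only a sketch under strong assumptions: the sorted key space is cut into equal contiguous ranges, one per shard, which is not how the real balancer or chunk splitting behaves, and it says nothing about actual throughput. The function name and the file, chunk, and shard counts are all made up.

```python
from collections import Counter


def chunks_per_shard_for_file(num_files, chunks_per_file, num_shards, key_order, file_id=0):
    """Count how many chunks of one file land on each shard in the toy model.

    key_order is "files_id,n" or "n,files_id" (the order of the shard key fields).
    """
    docs = [(f, n) for f in range(num_files) for n in range(chunks_per_file)]
    if key_order == "files_id,n":
        docs.sort(key=lambda d: (d[0], d[1]))
    elif key_order == "n,files_id":
        docs.sort(key=lambda d: (d[1], d[0]))
    else:
        raise ValueError("unknown key order: " + key_order)

    # Assumption: equal-sized contiguous ranges of the sorted keys, one per shard.
    per_range = -(-len(docs) // num_shards)
    placement = {doc: index // per_range for index, doc in enumerate(docs)}
    return Counter(placement[(file_id, n)] for n in range(chunks_per_file))


# Made-up sizes: 10 files of 100 chunks each on 5 shards.
for order in ("files_id,n", "n,files_id"):
    print(order, dict(chunks_per_shard_for_file(10, 100, 5, order)))
```

In this model the files_id-first key keeps all of one file’s chunks on a single shard, while the n-first key spreads them evenly across all five. One thing I am unsure about: a query that filters only on files_id does not include the leading field of an n-first shard key, so I would expect it to be broadcast to every shard rather than targeted.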

Is this a good idea? Am I understanding how the shard key works here? Does GridFS rely on any particular sharding arrangement, and will it actually break things if you use something other than the recommended shard keys with the leading files_id?