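How Automated Embedding Works

You can configure MongoDB Vector Search to automatically generate and manage vector embeddings for the text data in your cluster. You can create a one-click AI semantic search index in your cluster that uses a Voyage AI embedding model, which simplifies indexing, updating, and querying with vectors.

When you enable Automated Embedding, MongoDB Vector Search automatically generates embeddings in two places: at index time, for the specified text fields in your collection, using the embedding model that you specify; and at query time, for the text string in any query that targets a field indexed for Automated Embedding.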

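Resumable Initial Sync

When you create a MongoDB Vector Search index for Automated Embedding, MongoDB performs an initial sync to generate embeddings for all existing documents in the collection:

Scan documents: MongoDB Vector Search scans all documents in the collection that contain the indexed text field.
Generate embeddings: For each document, MongoDB Vector Search sends the text from the indexed field to the Voyage AI embedding model to generate vector embeddings.
Store embeddings: MongoDB Vector Search stores the generated embeddings in a separate internal system collection, in the __mdb_internal_search database on the same cluster. This keeps the embeddings isolated from your application data while preserving data locality.
Build the index: After it generates the embeddings, MongoDB Vector Search builds the index from them to enable vector search.

During the initial sync, MongoDB processes documents in batches and uses a special Flex inference processing tier to optimize throughput.

Note

The duration of the initial sync depends on the number of documents, the length of the text in the indexed fields, and your available rate limit quota. For large collections, the initial sync might take several hours to complete.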
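
To make the interaction of these three factors concrete, the following sketch computes a rough lower bound on initial sync time. The function name, the average tokens per document, and the tokens-per-minute quota are hypothetical inputs chosen for illustration. They are not values published by MongoDB, so substitute the figures for your own deployment.

```python
import math


def estimate_initial_sync_minutes(num_docs, avg_tokens_per_doc, tokens_per_minute):
    """Lower-bound estimate: total tokens to embed divided by the rate limit quota."""
    if tokens_per_minute <= 0:
        raise ValueError("tokens_per_minute must be positive")
    total_tokens = num_docs * avg_tokens_per_doc
    return math.ceil(total_tokens / tokens_per_minute)


# Hypothetical inputs: 1,250,000 documents, ~400 tokens each,
# and an assumed quota of 1,000,000 tokens per minute.
minutes = estimate_initial_sync_minutes(1_250_000, 400, 1_000_000)
print(f"at least {minutes} minutes (about {minutes / 60:.1f} hours)")
```

The estimate ignores retries, batching overhead, and index build time, so treat it as a floor rather than a prediction.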

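Ongoing Updates

After the initial sync completes, MongoDB Vector Search automatically keeps the embeddings in sync with your data as it changes.

Document Inserts

When you insert a new document that contains an indexed text field, MongoDB Vector Search automatically does the following:

Detects the new document through change streams.
Generates an embedding for the text field using the configured model.
Stores the embedding in the system collection.
Updates the MongoDB Vector Search index to include the new embedding.

Document Updates

When you update a document and an indexed text field changes, MongoDB Vector Search automatically does the following:

Detects the field change through change streams.
Generates a new embedding for the updated text.
Replaces the old embedding in the system collection.
Updates the MongoDB Vector Search index with the new embedding.

Note

Updates to fields that are not indexed for Automated Embedding do not trigger MongoDB Vector Search to regenerate embeddings.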
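
The rule in the note above can be illustrated with a small check against the shape of a change stream event. This is a sketch of the decision only, not MongoDB's internal implementation. The function names are hypothetical, and for inserts it considers only top-level fields.

```python
def _touches(path, field):
    """True if an updated path is the field, one of its subfields, or one of its parents."""
    return path == field or path.startswith(field + ".") or field.startswith(path + ".")


def needs_reembedding(change_event, auto_embed_fields):
    """Decide whether a change stream event affects any auto-embedded field."""
    op = change_event.get("operationType")
    if op in ("insert", "replace"):
        # Simplification: only top-level field names are checked here.
        doc = change_event.get("fullDocument", {})
        return any(field in doc for field in auto_embed_fields)
    if op == "update":
        desc = change_event.get("updateDescription", {})
        changed = list(desc.get("updatedFields", {})) + list(desc.get("removedFields", []))
        return any(_touches(path, field) for path in changed for field in auto_embed_fields)
    return False


# An update that changes only a non-indexed field does not require a new embedding.
rating_update = {
    "operationType": "update",
    "updateDescription": {"updatedFields": {"rating": 5}, "removedFields": []},
}
print(needs_reembedding(rating_update, ["fullplot"]))
```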

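Document Deletes

When you delete a document, MongoDB Vector Search automatically removes the corresponding embedding from the system collection and updates the index.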

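Model Hosting and Multi-Tenancy

Automated Embedding uses Voyage AI embedding models, which MongoDB hosts and manages in a multi-tenant environment:

Model Infrastructure

Managed service: MongoDB hosts and maintains all embedding models. The model inference platform runs on MongoDB infrastructure in a United States region of Google Cloud. You don't need to deploy, configure, or manage any model infrastructure.
API-based access: For self-managed deployments configured with a Voyage AI API key, MongoDB sends text to Voyage AI endpoints to generate embeddings. The embeddings are returned to MongoDB and stored in your cluster.
Multi-tenant architecture: Multiple users share the embedding service. This multi-tenant model provides:
- Cost efficiency through shared infrastructure
- Automatic model updates and improvements
- High availability and scalability

Data Privacy

Text sent to the Automated Embedding service is used only to generate embeddings. It is not stored or used for model training.
Embeddings are returned to your MongoDB cluster and stored in your own database.
All communication with the Automated Embedding service takes place over encrypted connections.

Rate Limits

Because the embedding service is multi-tenant, MongoDB enforces rate limits to ensure fair usage across all customers. To learn more about rate limits and how they affect Automated Embedding operations, see Rate Limits.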

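Query Processing

When you run a vector search query with Automated Embedding, MongoDB automatically handles embedding generation for the query text:

Query text submission: You provide a text string in the query field of the $vectorSearch stage instead of a pre-generated vector.
Embedding generation: MongoDB sends the query text to the Automated Embedding service, which generates an embedding using the same model specified in the index (or a compatible model, if you override the model with the model option).
Vector search: MongoDB uses the generated query embedding to search the indexed embeddings with the configured similarity function (cosine, dotProduct, or euclidean).
Results: MongoDB returns documents ranked by their similarity to the query.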
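
For intuition about the three similarity functions named above, the following sketch implements their textbook definitions and ranks a few toy vectors against a query vector. The vectors and document IDs are made up for illustration. The sketch shows only the underlying math. It does not reproduce how MongoDB computes or normalizes search scores.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Dot product divided by the product of the vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def euclidean(a, b):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy two-dimensional embeddings; real embeddings have many more dimensions.
query_embedding = [1.0, 0.0]
indexed = {"doc-a": [1.0, 0.0], "doc-b": [0.0, 1.0], "doc-c": [0.6, 0.8]}

# Rank documents from most to least similar by cosine similarity.
ranked = sorted(indexed, key=lambda doc_id: cosine(query_embedding, indexed[doc_id]), reverse=True)
print(ranked)
```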

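Query Rate Limits

Every query that uses Automated Embedding counts toward your Automated Embedding rate limit, because it requires an API call to generate the embedding. To learn more about managing query throughput and cost, see Rate Limits.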

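Operational Impact

Resumable Initial Sync

If you reach the rate limit, large collections might take a significant amount of time to complete the initial sync.
MongoDB automatically retries failed embedding requests with exponential backoff.
You can monitor sync progress through search monitoring.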
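
MongoDB performs these retries for you. For readers unfamiliar with the technique, the following is a generic sketch of retrying with exponential backoff. The RateLimitError class, the function name, and the delay values are assumptions made for illustration and do not describe MongoDB's actual retry parameters.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical error raised when a request is rejected by a rate limiter."""


def retry_with_backoff(request, max_attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call request(); on RateLimitError, wait base_delay * 2**attempt (capped) and retry."""
    for attempt in range(max_attempts):
        try:
            return request()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


# Demo with a stand-in request that fails twice and then succeeds.
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("rate limit exceeded")
    return [0.1, 0.2, 0.3]


delays = []  # Record the waits instead of actually sleeping.
result = retry_with_backoff(flaky_request, sleep=delays.append)
print(result, len(delays))
```

Passing the sleep function as a parameter keeps the sketch testable without real waiting.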

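Ongoing Updates

Document updates are processed as they occur, subject to rate limits.
If updates exceed the rate limit, they are queued and processed when capacity becomes available.
Your application continues to operate normally. Only embedding generation might be delayed.

Queries

Query rate limits affect the number of concurrent searches that you can run.
If you exceed the query rate limit, queries return an error indicating that the rate limit was exceeded.
Consider caching the results of frequently run queries or upgrading to a paid tier for higher throughput.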
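
One way to cache frequent query results on the application side is a small time-to-live cache keyed by the query text. The class below is a minimal sketch under stated assumptions: QueryResultCache and run_query are hypothetical names, and run_query stands in for whatever function in your application runs the $vectorSearch aggregation.

```python
import time


class QueryResultCache:
    """Cache search results per (query text, limit) for a fixed time-to-live."""

    def __init__(self, run_query, ttl_seconds=300, clock=time.monotonic):
        self._run_query = run_query  # Hypothetical callable: (text, limit) -> results
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}

    def search(self, text, limit=10):
        key = (text.strip(), limit)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # Fresh cached result: no embedding call is needed.
        results = self._run_query(text, limit)
        self._entries[key] = (now, results)
        return results


# Demo with a stand-in query function that counts how often it is called.
query_calls = []


def fake_vector_search(text, limit):
    query_calls.append(text)
    return [f"result for {text!r}"][:limit]


cache = QueryResultCache(fake_vector_search, ttl_seconds=60)
cache.search("space adventure")
cache.search("space adventure")  # Served from the cache.
print(len(query_calls))
```

Cached results can be stale if the underlying documents change, so choose a time-to-live that matches how fresh your results need to be.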

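Generated Embeddings Collection

Automated Embedding stores vector embeddings in a separate reserved database. You can find the generated embeddings collection for an index and retrieve embeddings from it.

Embedding Storage

MongoDB stores generated embeddings asynchronously and persists them in an internal generated embeddings collection. This collection resides in a dedicated internal database named __mdb_internal_search on the same cluster. Each Automated Embedding index in the cluster has a corresponding generated embeddings collection in this database. To learn more, see Generated Embeddings Collection.

Warning

The __mdb_internal_search database is a reserved internal namespace that MongoDB creates and manages. Do not modify this database or its collections. Modifying this reserved namespace can cause index failures and inconsistent search results.

Structure of the Generated Embeddings Collection

The generated embeddings collection contains one document for each document in the source collection. Each generated embeddings collection document has the same _id as its source document, a copy of the source document's filter fields, and the embedding vector generated for each Automated Embedding field.

You might see the following fields:

Field	Type	Description
`_id`	ObjectId	Same `_id` as the source document.
`<filter-field>`	Any	Copy of a filter field from the source document.
`_autoEmbed`	Object	Contains the embedding vector for each Automated Embedding field.
`_autoEmbed.<fieldPath>`	Array of floats or quantized vector	Embedding vector generated for the Automated Embedding field.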
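
To visualize this structure, the following sketch builds a plain Python dictionary shaped like a generated embeddings collection document. Everything in it is hypothetical: the helper name, the sample source document, the choice of year as a filter field, and the three-element vector, which is far shorter than a real embedding. It handles only top-level fields.

```python
def generated_embedding_doc(source_doc, filter_fields, embeddings):
    """Build a dict shaped like a generated embeddings collection document."""
    doc = {"_id": source_doc["_id"]}  # Same _id as the source document.
    for field in filter_fields:
        if field in source_doc:
            doc[field] = source_doc[field]  # Copy of each filter field.
    doc["_autoEmbed"] = dict(embeddings)  # One vector per auto-embedded field.
    return doc


# Hypothetical source document with one filter field and one auto-embedded field.
source_doc = {"_id": "movie-1", "title": "Example", "year": 1999, "fullplot": "A long plot summary."}
shape = generated_embedding_doc(source_doc, ["year"], {"fullplot": [0.12, -0.03, 0.44]})
print(shape)
```

Note that fields which are neither filter fields nor auto-embedded fields, such as title here, do not appear in the result.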

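Find the Generated Embeddings Collection

Find the Generated Embeddings Collection for an Automated Embedding Index

Use mongosh to find the generated embeddings collection for an Automated Embedding index.

Connect to your MongoDB deployment using `mongosh`.

Get the ID of the index.

Replace the following placeholders, and then run the query:

<database_name> - Name of the database that contains the Automated Embedding index.
<collection_name> - Name of the collection that contains the Automated Embedding index.
<index_name> - Name of the Automated Embedding index.

Example: Get the ID of the Automated Embedding Index

use <database_name>
db.<collection_name>.aggregate( [ { $listSearchIndexes: { name: "<index_name>" } } ] )

[
   {
      id: '69f382ecd6fa583100184fe7',
      name: 'auto-embed-index',
      type: 'vectorSearch',
      status: 'READY',
      numDocs: 0,
      latestDefinition: { ... },
      statusDetail: [ ... ]
   }
]

Get the generated embeddings collection.

Replace <index_id> with the Automated Embedding index ID that the previous step returned, and then run the following query.

Example: Get the Generated Embeddings Collection

use __mdb_internal_search
db.getCollectionNames().filter(n => n.startsWith("<index_id>"))

[ '69f382ecd6fa583100184fe7-96dad03b0a735a19fd9f1a22f9694efc-1-0' ]

The output is the name of the generated embeddings collection.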

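Check How Many Documents Have Been Created in the Generated Embeddings Collection

Use mongosh to find the number of documents in the generated embeddings collection for an Automated Embedding index.

Connect to your MongoDB deployment using `mongosh`.

Check how many documents have been created in the generated embeddings collection.

Replace the following placeholder, and then run the query:

<generated_embeddings_collection_name> - Name of the generated embeddings collection.

Example: Check How Many Documents Have Been Created

use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
db.getCollection(mvColl).countDocuments()

100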

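Check the Storage Size of a Generated Embeddings Collection

You can check the storage size of a generated embeddings collection to understand how much disk and index space the generated embeddings consume. This is useful for capacity planning, for debugging unexpected growth, and for verifying cleanup after you delete or redefine an index.

Important

Before you check the storage size, find the name of the generated embeddings collection. To learn more, see Find the Generated Embeddings Collection.

Check the Storage Size of a Generated Embeddings Collection

Use mongosh to find the storage size of a generated embeddings collection.

Connect to your MongoDB deployment using `mongosh`.

Check the storage size of the generated embeddings collection.

Replace <generated_embeddings_collection_name> with the name of the generated embeddings collection, and then run the following query:

Example: Check the Storage Size of a Generated Embeddings Collection

use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
db.getCollection(mvColl).stats()

The stats() helper wraps the collStats command, which provides detailed storage metrics for the generated embeddings collection. Use this method when you need scripted access, aggregation across a sharded cluster, or scheduled monitoring.

The following collStats fields provide storage information:

Field	Description
`count`	Number of documents in the generated embeddings collection. One document exists for each source document that has generated embeddings.
`size`	Uncompressed logical size of all documents, in bytes.
`storageSize`	On-disk size of the collection's data files after WiredTiger compression, in bytes.
`totalIndexSize`	On-disk size of all MongoDB indexes on the generated embeddings collection, in bytes.
`totalSize`	Sum of `storageSize` and `totalIndexSize`. Represents total disk usage.
`avgObjSize`	Average uncompressed document size. Useful for verifying the embedding size per document.

Note

storageSize and totalIndexSize reflect actual disk usage.
size is the uncompressed logical view and is usually larger.
These metrics show storage in the MongoDB cluster only. They don't include the disk used by the Lucene vector index on the mongot host.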
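
The relationships between these fields can be checked with simple arithmetic. The sketch below derives a compression ratio and the index share of total disk usage from a collStats-style dictionary. The function name is hypothetical, and the input reuses the sample values from the replica set example on this page, which are reported in MB.

```python
def storage_summary(stats):
    """Derive ratios from collStats values (all sizes must use the same scale)."""
    return {
        # Uncompressed logical size relative to on-disk data size.
        "compression_ratio": round(stats["size"] / stats["storageSize"], 2),
        # Fraction of total disk usage taken by MongoDB indexes.
        "index_share": round(stats["totalIndexSize"] / stats["totalSize"], 3),
        # totalSize should equal storageSize + totalIndexSize.
        "total_matches_sum": stats["totalSize"] == stats["storageSize"] + stats["totalIndexSize"],
    }


# Sample values from this page's replica set example (scale: MB).
sample = {"size": 5142, "storageSize": 1830, "totalIndexSize": 42, "totalSize": 1872}
print(storage_summary(sample))
```

A ratio well above 1 indicates that WiredTiger compression is shrinking the stored embeddings considerably relative to their logical size.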

Check Storage on a Replica Set

Use mongosh or PyMongo to check the storage size of a generated embeddings collection on a replica set.

In mongosh, run the following commands against the __mdb_internal_search database:

use __mdb_internal_search

const mvColl = "<generated_embeddings_collection_name>";

db.runCommand({ collStats: mvColl }).count
db.runCommand({ collStats: mvColl, scale: 1024 * 1024 })

{
   ns: '__mdb_internal_search.69f382ecd6fa583100184fe7-96dad03b0a735a19fd9f1a22f9694efc-1-0',
   size: 5142,
   count: 1250000,
   avgObjSize: 4312,
   numOrphanDocs: 0,
   storageSize: 1830,
   freeStorageSize: 7,
   capped: false,
   wiredTiger: { ... },
   nindexes: 1,
   indexDetails: { ... },
   indexBuilds: [],
   totalIndexSize: 42,
   indexSizes: { _id_: 0 },
   totalSize: 1872,
   scaleFactor: 1048576,
   ok: 1,
   '$clusterTime': {
      clusterTime: Timestamp({ t: 1777646199, i: 1 }),
      signature: {
         hash: Binary.createFromBase64('pomqluUIpiZzLro3VWhO4dt2LKE=', 0),
         keyId: Long('7634583163557117960')
      }
   },
   operationTime: Timestamp({ t: 1777646199, i: 1 })
}

To get a formatted summary, run:

// Sizes are reported in MB because of the scale option.
// avgObjSize stays in bytes, so convert it to KB separately.
const s = db.runCommand({ collStats: mvColl, scale: 1024 * 1024 });
({
  count:          s.count,
  avgObjSizeKB:   (s.avgObjSize / 1024).toFixed(2),
  dataMB:         s.size,
  storageMB:      s.storageSize,
  indexesMB:      s.totalIndexSize,
  totalMB:        s.totalSize,
})

{
  "count": 1250000,
  "avgObjSizeKB": "4.21",
  "dataMB": 5142,
  "storageMB": 1830,
  "indexesMB": 42,
  "totalMB": 1872
}

In PyMongo, the following function returns the same summary:

from pymongo import MongoClient

MV_DATABASE = "__mdb_internal_search"
MB = 1024 * 1024

def get_mv_storage_stats(client, mv_collection_name):
    """Return storage metrics for a generated embeddings collection."""
    db = client[MV_DATABASE]
    # scale=MB reports sizes in MB; avgObjSize stays in bytes.
    stats = db.command("collStats", mv_collection_name, scale=MB)
    return {
        "count":         stats["count"],
        "avg_obj_kb":    round(stats["avgObjSize"] / 1024, 2),
        "data_mb":       stats["size"],
        "storage_mb":    stats["storageSize"],
        "indexes_mb":    stats["totalIndexSize"],
        "total_mb":      stats["totalSize"],
    }

client = MongoClient("mongodb+srv://<user>:<pwd>@<cluster>/")
print(get_mv_storage_stats(client, "<generated_embeddings_collection_name>"))

Example output:

{'count': 1250000, 'avg_obj_kb': 4.21, 'data_mb': 5142, 'storage_mb': 1830, 'indexes_mb': 42, 'total_mb': 1872}

Check Storage on a Sharded Cluster

Query each shard to get cluster-wide storage metrics.

For a sharded source collection, each shard has its own generated embeddings collection in the __mdb_internal_search database. mongos can't see these collections, so you must query the mongod of each shard directly and sum the results.

Get the list of shards.

Run the following command to list all shards:

db.adminCommand({ listShards: 1 })

Connect to each shard.

For each shard, connect directly to its replica set.

Run the `collStats` command on each shard.

Run the collStats command against the generated embeddings collection on that shard.

Sum the results.

Add up count, storageSize, totalIndexSize, and totalSize across all shards to get cluster-wide totals.

The following script connects to each shard, queries the generated embeddings collection, and returns per-shard metrics and total metrics:

Example: Check Storage on a Sharded Cluster

from pymongo import MongoClient

MV_DATABASE = "__mdb_internal_search"
MB = 1024 * 1024

def _resolve_mv_name(client, source_db, source_collection, index_name):
    """Find the generated embeddings collection name for an index."""
    src = client[source_db][source_collection]
    indexes = list(src.aggregate([{"$listSearchIndexes": {"name": index_name}}]))
    if not indexes:
        raise LookupError(f"No search index named {index_name!r}")
    index_id = indexes[0]["id"]
    matches = [n for n in client[MV_DATABASE].list_collection_names()
               if n.startswith(index_id)]
    if not matches:
        return None
    # If several collections match, pick the newest.
    matches.sort(reverse=True)
    return matches[0]

def get_mv_storage_per_shard(shard_uris, source_db, source_collection, index_name):
    """Get per-shard and total storage for a sharded cluster."""
    per_shard = {}
    totals = {"count": 0, "data_mb": 0, "storage_mb": 0,
              "indexes_mb": 0, "total_mb": 0}

    for shard_name, uri in shard_uris.items():
        # Connect directly to the shard and close the connection when done.
        with MongoClient(uri) as client:
            mv_name = _resolve_mv_name(client, source_db, source_collection, index_name)

            if mv_name is None:
                per_shard[shard_name] = {"note": "no MV found (still building?)"}
                continue

            s = client[MV_DATABASE].command("collStats", mv_name, scale=MB)

        row = {
            "mv":           mv_name,
            "count":        s["count"],
            "data_mb":      s["size"],
            "storage_mb":   s["storageSize"],
            "indexes_mb":   s["totalIndexSize"],
            "total_mb":     s["totalSize"],
        }
        per_shard[shard_name] = row

        # Accumulate cluster-wide totals for the numeric metrics.
        for k in totals:
            totals[k] += row[k]

    return {"per_shard": per_shard, "totals": totals}

# Usage
shard_uris = {
    "shard-00": "mongodb://<user>:<pwd>@shard-00.example.net:27017/?replicaSet=shard-00",
    "shard-01": "mongodb://<user>:<pwd>@shard-01.example.net:27017/?replicaSet=shard-01",
    "shard-02": "mongodb://<user>:<pwd>@shard-02.example.net:27017/?replicaSet=shard-02",
}

result = get_mv_storage_per_shard(
    shard_uris,
    source_db="<source_db>",
    source_collection="<source_collection>",
    index_name="<index_name>",
)

for shard, row in result["per_shard"].items():
    print(shard, row)
print("TOTAL:", result["totals"])

Example output:

shard-00 {'mv': '69e183...-1-3', 'count': 416000, 'data_mb': 1714, 'storage_mb': 612, 'indexes_mb': 14, 'total_mb': 626}
shard-01 {'mv': '69e183...-1-3', 'count': 418200, 'data_mb': 1721, 'storage_mb': 615, 'indexes_mb': 14, 'total_mb': 629}
shard-02 {'mv': '69e183...-1-3', 'count': 415800, 'data_mb': 1707, 'storage_mb': 603, 'indexes_mb': 14, 'total_mb': 617}
TOTAL: {'count': 1250000, 'data_mb': 5142, 'storage_mb': 1830, 'indexes_mb': 42, 'total_mb': 1872}

Check Storage for All Generated Embeddings Collections

Get the total Automated Embedding footprint across all indexes.

To check storage for all Automated Embedding indexes on a cluster, sum the collStats output for every collection in __mdb_internal_search. This is useful for capacity audits and for identifying orphaned generated embeddings collections.

In mongosh, run the following commands on a single replica set or on each shard of a sharded cluster:

use __mdb_internal_search

const MB = 1024 * 1024;
// Collect storage metrics (in MB) for every collection in the database.
const rows = db.getCollectionNames().map(name => {
   const s = db.runCommand({ collStats: name, scale: MB });
   return {
      collection: name,
      count:      s.count,
      storageMB:  s.storageSize,
      indexesMB:  s.totalIndexSize,
      totalMB:    s.totalSize,
   };
});

// Sum the per-collection metrics.
const total = rows.reduce((a, r) => ({
   storageMB: a.storageMB + r.storageMB,
   indexesMB: a.indexesMB + r.indexesMB,
   totalMB:   a.totalMB   + r.totalMB,
}), { storageMB: 0, indexesMB: 0, totalMB: 0 });

print("Per-collection:");
printjson(rows);
print("Cluster total:");
printjson(total);

Per-collection:
[
   { "collection": "69e183...-1-3", "count": 1250000, "storageMB": 1830, "indexesMB": 42, "totalMB": 1872 },
   { "collection": "71fa42...-1-1", "count":   84000, "storageMB":  121, "indexesMB":  3, "totalMB":  124 }
]
Cluster total:
{ "storageMB": 1951, "indexesMB": 45, "totalMB": 1996 }

Note

On a sharded cluster, run these commands on each shard and sum the results.

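Retrieve Embeddings from the Generated Embeddings Collection

Retrieve the Embedding for a Document

Use mongosh to retrieve an embedding from the generated embeddings collection.

Connect to your MongoDB deployment using `mongosh`.

Retrieve the embedding for a document.

Replace the following placeholders, and then run the query:

<generated_embeddings_collection_name> - Name of the generated embeddings collection.
<document_id> - _id of the document in the source collection.
<auto_embed_field> - Name of the field indexed for Automated Embedding.

Example: Retrieve an Embedding from the Generated Embeddings Collection

use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
// If the source _id is an ObjectId, use ObjectId("<document_id>") in the filter.
db.getCollection(mvColl).findOne(
   { _id: "<document_id>" },
   { _id: 1, "_autoEmbed.<auto_embed_field>": 1 }
)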

[
   { _autoEmbed: {}, _id: "ObjectId('573a1390f29313caabcd5c0f')" },
   { _autoEmbed: {}, _id: "ObjectId('573a1390f29313caabcd5c0f')" },
   {
      _autoEmbed: {
         fullplot: Binary.fromInt8Array(new Int8Array([
              5, -30,  16,   4, -57,  -8, -17, -13,  16,  11, -22,  15,
             -7,  13,   8,  -2,  -1, -14,  27,  10,  -9,  20,  14,  -2,
              3, -56, -21,  10, -24,  12,  10,   9,  12,   7,   4,  14,
             -7, -24, -15,  16,  13,  21,  -4, -16, -12, -15,   3, -33,
              5, -21,   2,  -1,   0,  16,   7,  13,  19,   4,   5, -14,
            -34,   7, -16,  38,   4,   4,   7, -22,   8,  14,  15, -14,
             -4,   6,  22, -17,   8,  27,   8,  13,  46, -12,  -7,  -9,
            -20,  13,  10,   4, -14, -11,  31,  -7,   0,  -3,   1,  16,
              9,   5,   6,  -2,
            ... 924 more items
         ]))
      },
      _id: "ObjectId('573a1390f29313caabcd5c0f')"
   },
   {
      _autoEmbed: {
         fullplot: Binary.fromInt8Array(new Int8Array([
             -5, -22,  22,  -6, -43, -13,  -5,   4,   5,   2,   4,  13,
              0,  -3,  -3, -50,  -5,  -2,  -2,  27,  -5,  36,  27,  12,
            -12,  -6,  -1,   9,  -7,  25,   4, -28,   3,   9,   3,  23,
              8,  11,  11,  25, -19,  27,  17,  18,  -1,   0,   5, -12,
             13,  -5,  -3,   3, -17,  16, -15,  43,  -1,   1,   1,  -6,
            -26,  16, -11,  13,  14,   0,  -9, -23,  25, -16,  11, -25,
              7,   9,  -1,   0,  33,  -8,  -3, -18,   3,   4, -20, -14,
             17,  -2,  -2, -10,  17, -25, -11,   9,   1,   2,  -8,   7,
             20,  18,  17,  -2,
            ... 924 more items
         ]))
      },
      _id: "ObjectId('573a1390f29313caabcd5c0f')"
   },
   {
      _autoEmbed: {
         fullplot: Binary.fromInt8Array(new Int8Array([
              0,  -1,  47,   6, -20, -14,  29,  -2,  13,  -1,  20,  11,
            -18,  -7,  12, -10, -25,  10,   7, -15,  11,   9, -14,  12,
             -9, -22,  16,   0,  18,   5,   9, -26,  14, -27,   6,  20,
            -19,  -8,   1,  -5,  21,  13, -37,  -7,   0, -21, -51,   1,
            -38, -14,   4,   6, -23,  15,  19,  33,   8,   0,  -7,  -3,
            -25,   8, -29,  25,  -1,  12,   4, -21,  -1,   0, -14,  -3,
             -6,  -3,   7,  30,   8,  -8,  34, -19, -12, -29, -15, -14,
              1,  -4,   6,  -2, -36, -18,  -2,   4,  23,  17, -13,   1,
              0,   7,  25, -19,
            ... 924 more items
         ]))
      },
      _id: "ObjectId('573a1390f29313caabcd5c0f')"
   }
]
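
When an embedding is stored as a quantized binary vector, as in the Int8Array output above, a driver returns it as BSON binary data rather than a list of numbers. The following sketch decodes such a payload with only the standard library. It assumes the BSON binary vector layout: one byte for the element type (0x03 for int8, 0x27 for float32), one padding byte, and then the packed values. The function name is hypothetical. If your PyMongo version provides Binary.as_vector(), that helper is likely the better choice.

```python
import struct

INT8 = 0x03     # Element type byte for signed 8-bit integers.
FLOAT32 = 0x27  # Element type byte for little-endian 32-bit floats.


def decode_vector(payload):
    """Decode a BSON binary vector payload into a list of numbers."""
    data = bytes(payload)
    if len(data) < 2:
        raise ValueError("payload is too short to contain a vector header")
    dtype, body = data[0], data[2:]  # data[1] is the padding byte.
    if dtype == INT8:
        return list(struct.unpack(f"<{len(body)}b", body))
    if dtype == FLOAT32:
        return list(struct.unpack(f"<{len(body) // 4}f", body))
    raise ValueError(f"unsupported vector element type: {dtype:#x}")


# Build a tiny int8 payload by hand: header bytes followed by four values.
payload = bytes([INT8, 0x00]) + struct.pack("<4b", 5, -30, 16, 4)
print(decode_vector(payload))
```

Because of the two header bytes, the byte length of the binary value is larger than the number of dimensions in the vector.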

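Retrieve the Embeddings for Multiple Documents

Use mongosh to retrieve embeddings from the generated embeddings collection.

Connect to your MongoDB deployment using `mongosh`.

Retrieve the embeddings for multiple documents.

Replace the following placeholders, and then run the query:

<generated_embeddings_collection_name> - Name of the generated embeddings collection.
<auto_embed_field> - Name of the field indexed for Automated Embedding.
<number_of_documents> - Number of documents to return.

Example: Retrieve Embeddings from the Generated Embeddings Collection

use __mdb_internal_search
const mvColl = "<generated_embeddings_collection_name>"
// The empty filter matches every document; limit() caps how many are returned.
// $slice limits array-valued embeddings to their first 5 elements.
db.getCollection(mvColl).find(
   {},
   { _id: 1, "_autoEmbed.<auto_embed_field>": { $slice: 5 } }
).limit(<number_of_documents>)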

[
   { _autoEmbed: {}, _id: "ObjectId('573a1390f29313caabcd5c0f')" },
   { _autoEmbed: {}, _id: "ObjectId('573a1390f29313caabcd5c0f')" },
   {
      _autoEmbed: {
         fullplot: Binary.fromInt8Array(new Int8Array([
              5, -30,  16,   4, -57,  -8, -17, -13,  16,  11, -22,  15,
             -7,  13,   8,  -2,  -1, -14,  27,  10,  -9,  20,  14,  -2,
              3, -56, -21,  10, -24,  12,  10,   9,  12,   7,   4,  14,
             -7, -24, -15,  16,  13,  21,  -4, -16, -12, -15,   3, -33,
              5, -21,   2,  -1,   0,  16,   7,  13,  19,   4,   5, -14,
            -34,   7, -16,  38,   4,   4,   7, -22,   8,  14,  15, -14,
             -4,   6,  22, -17,   8,  27,   8,  13,  46, -12,  -7,  -9,
            -20,  13,  10,   4, -14, -11,  31,  -7,   0,  -3,   1,  16,
              9,   5,   6,  -2,
            ... 924 more items
         ]))
      },
      _id: "ObjectId('573a1390f29313caabcd5c0f')"
   },
   {
      _autoEmbed: {
         fullplot: Binary.fromInt8Array(new Int8Array([
             -5, -22,  22,  -6, -43, -13,  -5,   4,   5,   2,   4,  13,
              0,  -3,  -3, -50,  -5,  -2,  -2,  27,  -5,  36,  27,  12,
            -12,  -6,  -1,   9,  -7,  25,   4, -28,   3,   9,   3,  23,
              8,  11,  11,  25, -19,  27,  17,  18,  -1,   0,   5, -12,
             13,  -5,  -3,   3, -17,  16, -15,  43,  -1,   1,   1,  -6,
            -26,  16, -11,  13,  14,   0,  -9, -23,  25, -16,  11, -25,
              7,   9,  -1,   0,  33,  -8,  -3, -18,   3,   4, -20, -14,
             17,  -2,  -2, -10,  17, -25, -11,   9,   1,   2,  -8,   7,
             20,  18,  17,  -2,
            ... 924 more items
         ]))
      },
      _id: "ObjectId('573a1390f29313caabcd5c0f')"
   },
   {
      _autoEmbed: {
         fullplot: Binary.fromInt8Array(new Int8Array([
              0,  -1,  47,   6, -20, -14,  29,  -2,  13,  -1,  20,  11,
            -18,  -7,  12, -10, -25,  10,   7, -15,  11,   9, -14,  12,
             -9, -22,  16,   0,  18,   5,   9, -26,  14, -27,   6,  20,
            -19,  -8,   1,  -5,  21,  13, -37,  -7,   0, -21, -51,   1,
            -38, -14,   4,   6, -23,  15,  19,  33,   8,   0,  -7,  -3,
            -25,   8, -29,  25,  -1,  12,   4, -21,  -1,   0, -14,  -3,
             -6,  -3,   7,  30,   8,  -8,  34, -19, -12, -29, -15, -14,
              1,  -4,   6,  -2, -36, -18,  -2,   4,  23,  17, -13,   1,
              0,   7,  25, -19,
            ... 924 more items
         ]))
      },
      _id: "ObjectId('573a1390f29313caabcd5c0f')"
   }
]

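PyMongo Scripts to Retrieve Embeddings from the Generated Embeddings Collection

Use PyMongo to retrieve embeddings from the generated embeddings collection.

To retrieve an embedding from the generated embeddings collection, you can use the following Python script. To run the script, install the PyMongo driver.

Create a file named `get_embedding.py`.

Copy and paste the following code into the `get_embedding.py` file.

Example: Retrieve an Embedding from the Generated Embeddings Collection

from pymongo import MongoClient

MV_DATABASE = "__mdb_internal_search"

def get_mv_collection(client, source_db, source_collection, index_name):
    """Resolve the generated embeddings (MV) collection for an auto-embedding index."""
    # 1. Look up the index ID via $listSearchIndexes on the source collection.
    src = client[source_db][source_collection]
    indexes = list(src.aggregate([{"$listSearchIndexes": {"name": index_name}}]))
    if not indexes:
        raise LookupError(f"No search index named {index_name!r} on {source_db}.{source_collection}")
    index_id = indexes[0]["id"]

    # 2. Find the MV collection in __mdb_internal_search whose name starts with the index ID.
    mv_db = client[MV_DATABASE]
    matches = [n for n in mv_db.list_collection_names() if n.startswith(index_id)]
    if not matches:
        raise LookupError(f"No MV collection found for index {index_id} (index may still be building)")
    # Several matches are possible briefly during an auto-embed field update; pick the newest.
    matches.sort(reverse=True)
    return mv_db[matches[0]]

def get_embedding(client, source_db, source_collection, index_name, embed_path, source_id):
    """Fetch the embedding for a single source document, or None if there is none."""
    mv = get_mv_collection(client, source_db, source_collection, index_name)
    doc = mv.find_one(
        {"_id": source_id},
        {"_id": 1, f"_autoEmbed.{embed_path}": 1},
    )
    if doc is None:
        return None
    # _autoEmbed can be empty if the embedding has not been generated yet.
    return doc.get("_autoEmbed", {}).get(embed_path)

# --- Usage ---
# The guard lets other scripts import get_mv_collection without running this part.
if __name__ == "__main__":
    client = MongoClient("mongodb+srv://<user>:<pwd>@<cluster>/")

    embedding = get_embedding(
        client,
        source_db="<source_db>",
        source_collection="<source_collection>",
        index_name="<auto_embed_index_name>",
        embed_path="<auto_embed_field>",
        # If the source _id is an ObjectId, pass bson.ObjectId("<document_id>") instead.
        source_id="<document_id>",
    )

    if embedding is None:
        print("No embedding found for this document.")
    else:
        # For quantized vectors stored as binary data, len() counts bytes, not dimensions.
        print(f"dims:    {len(embedding)}")
        print(f"first 5: {embedding[:5]}")

Replace the following placeholders in the `get_embedding.py` file:

Placeholder	Description
`<user>`	Username for your MongoDB deployment.
`<pwd>`	Password for your MongoDB deployment.
`<cluster>`	Cluster connection string for your MongoDB deployment.
`<source_db>`	Name of the database that contains the source collection.
`<source_collection>`	Name of the source collection.
`<auto_embed_index_name>`	Name of the Automated Embedding index.
`<document_id>`	`_id` of the document in the source collection.
`<auto_embed_field>`	Name of the field indexed for Automated Embedding.

Run the following command to retrieve the embedding from the generated embeddings collection.

python get_embedding.py

To stream embeddings from the generated embeddings collection, you can use the following Python script.

Create a file named `stream_embedding.py` in the same directory as `get_embedding.py`.

Copy and paste the following code into the `stream_embedding.py` file.

Example: Stream Embeddings from the Generated Embeddings Collection

from pymongo import MongoClient

# Reuse the helper defined in get_embedding.py.
from get_embedding import get_mv_collection

# --- Usage ---
client = MongoClient("mongodb+srv://<user>:<pwd>@<cluster>/")

mv = get_mv_collection(client, "<source_db>", "<source_collection>", "<auto_embed_index_name>")

# Fetch documents from the server in batches of 500.
cursor = mv.find(
    {},
    {"_id": 1, "_autoEmbed.<auto_embed_field>": 1},
    batch_size=500,
)

for doc in cursor:
    src_id = doc["_id"]
    vec = doc.get("_autoEmbed", {}).get("<auto_embed_field>")
    # Process each embedding here; this example only prints its length.
    print(src_id, None if vec is None else len(vec))

Replace the following placeholders in the `stream_embedding.py` file:

Placeholder	Description
`<user>`	Username for your MongoDB deployment.
`<pwd>`	Password for your MongoDB deployment.
`<cluster>`	Cluster connection string for your MongoDB deployment.
`<source_db>`	Name of the database that contains the source collection.
`<source_collection>`	Name of the source collection.
`<auto_embed_index_name>`	Name of the Automated Embedding index.
`<auto_embed_field>`	Name of the field indexed for Automated Embedding.

Run the following command to stream embeddings from the generated embeddings collection.

python stream_embedding.py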

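Troubleshooting

The following section helps you troubleshoot common issues with Automated Embedding.

No generated embeddings collection matches the index ID: Your index might still be in the Building or Pending state. The generated embeddings collection is created lazily on the first write. Use $listSearchIndexes to check the status.
A source document _id is missing from the generated embeddings collection: The embedding for that document hasn't been generated yet, or the index's filter expression filtered out the document.
Multiple collections match the index ID: The Automated Embedding field configuration was updated. A new generated embeddings collection has been created, but the old one might linger briefly until it is cleaned up.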
