Discover Latent Semantic Structure With Vector Clustering
Scott Kurowski • 10 min read • Published Sep 11, 2024 • Updated Oct 11, 2024
There might be hidden business knowledge in your MongoDB vector database.
When business data is vectorized, its embeddings enable retrieval outside of retrieval-augmented generation (RAG) as well as document search. Example data could be customer text inputs, large language model (LLM) classification annotation strings or string arrays, or other varying text content. Applications that store changing or cumulative embedded vector content can accumulate strategic latent semantic knowledge across the whole population of vectors, and this article shows a method to extract it.
You are probably familiar with vector embedding of text using an embedding model. OpenAI's text-embedding-3-small and text-embedding-ada-002 semantically encode the meaning of the input text, not merely its token sequence. Given a text string under roughly 8,000 tokens, text-embedding-ada-002 returns an array vector of 1536 floats or doubles, each valued in the range [-1.0, 1.0]. Each vector is the output of the first few neural network layers of a partial LLM. The model-encoded vector maps its embedded text to a specific point in the model's normalized conceptual vector (hyper-)space, and this map is model-specific: We cannot mix vectors embedded by text-embedding-ada-002 and text-embedding-3-small in the same analysis.
A model's embedding vector space is normalized to the value range [-1.0, 1.0] and scaled so that its standard deviation (sigma) is unit 1.0, making old-fashioned analytic geometry a convenient toolbox. Yet it is the model that gives the vector space its meaning. For our AI embedding model, a vector's (point) location maps to specific semantic content. Casting this meaning through our toolbox further tells us that vector proximity indicates semantic similarity, and vectors separated by 3.0 or more (3 unit sigmas) can be considered non-matching ("distinct" semantic content) to a confidence of 99.7%.
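To make these properties concrete, here is a minimal sketch (assuming the openai Python package v1.x and an OPENAI_API_KEY environment variable; the two sample texts are arbitrary) that embeds two short texts and inspects the resulting vectors:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input=[
        "My cat is a fun black and white half-sized tuxedo.",
        "MongoDB Atlas Vector Search",
    ],
)
cat_vec, atlas_vec = (np.array(d.embedding) for d in resp.data)

print(len(cat_vec))                         # 1536 dimensions
print(cat_vec.min(), cat_vec.max())         # every component lies in [-1.0, 1.0]
print(np.linalg.norm(cat_vec - atlas_vec))  # distance between two unrelated concepts
```

Two semantically unrelated texts like these should sit noticeably farther apart than, say, two of the MongoDB product names embedded later in this article's demo.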
MongoDB's $vectorSearch aggregation stage takes advantage of this semantic proximity: Given an input's vector point, it returns documents whose vector points lie "nearby," effectively up to a cutoff distance in vector space (an inverse function of the match score) or up to a max_k count limit.
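For orientation, a $vectorSearch call from Python might look like the following sketch; the URI, database, collection, index name (vector_index), and field names are placeholders rather than the demo's actual values:

```python
from pymongo import MongoClient

def semantic_search(query_vector: list[float], limit: int = 10):
    """Return the `limit` documents nearest to query_vector, with scores."""
    # Placeholder URI, database, collection, and index names.
    coll = MongoClient("mongodb+srv://...")["my_db"]["my_collection"]
    pipeline = [
        {
            "$vectorSearch": {
                "index": "vector_index",      # Atlas Vector Search index name
                "path": "embedding",          # field holding the embedded vectors
                "queryVector": query_vector,  # embedding of the input text
                "numCandidates": 200,         # candidates scanned before ranking
                "limit": limit,               # cap on returned documents
            }
        },
        {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}, "_id": 0}},
    ]
    return list(coll.aggregate(pipeline))
```

The score surfaced via vectorSearchScore is higher for closer vectors, which is the inverse-distance relationship mentioned above.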
Let’s see how this further applies to semantic vector ensembles. Assume we somehow observed a possible pattern in the source texts embedded as vectors and want to try an unsupervised semantic-density self-grouping of its vectors.
A powerful example strategy to discover unsupervised and unbiased semantic structure in a vector database is to group the vectors into a hierarchy of central concept clusters, leveraging analytic geometry and statistical properties of AI embedding model vector spaces. In other words, the model makes the vector space semantically meaningful. Note it is the mathematical treatment of vector space data which is unbiased, not the model’s embedding encoding which, by definition, must reflect its intrinsic training biases.
In Python 3.11, one can use scikit-learn's OPTICS density-clustering class [1]. Clustering is an unsupervised pre-LLM AI technique. The main tunable parameter, the minimum number of vector (point) members per semantic cluster (which I call the "granularity"), configures how generalized or distinct the clustered concept centroids become: A lower minimum groups many distinct smaller clusters, while a higher minimum groups fewer, more generalized larger clusters. The other tunables are how many total vectors to cluster and their common vector length.
```python
from sklearn.cluster import OPTICS

# vectors: NumPy array of shape (n_vectors, 1536) holding the embeddings.
# min_samples is the minimum number of vectors per cluster (the "granularity").
clustering = OPTICS(min_samples=512).fit(vectors)
```
It took about six hours to cluster N = 100K embedding vectors of length 1536 on Apple silicon, reflecting OPTICS's O(N^2) performance.
The result of OPTICS clustering is a hierarchy tree of density-grouped vectors (semantic concepts), where each cluster has a vector membership list. In addition to the "leaf" cluster concepts, the hierarchy identifies how they branch from more generalized concepts (each also having a vector membership list) up to a "root" concept cluster at the top of the tree.
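As a sketch of how that tree can be read back out of scikit-learn, assuming the clustering object fitted earlier (cluster_hierarchy_ is populated with the default cluster_method="xi"):

```python
import numpy as np

# Leaf-level assignment: labels_[i] is the cluster id of vectors[i]; -1 means noise.
leaf_members = {
    label: np.where(clustering.labels_ == label)[0]
    for label in set(clustering.labels_) if label != -1
}

# Full hierarchy: each row is an inclusive [start, end] span over clustering.ordering_,
# so larger spans are ancestor clusters that contain the smaller spans nested inside them.
for start, end in clustering.cluster_hierarchy_:
    member_indices = clustering.ordering_[start:end + 1]
    print(f"cluster of {len(member_indices)} vectors")
```

Only labels_ is needed for the leaf clusters; cluster_hierarchy_ adds the more generalized parent clusters described above.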
This tree directly reveals the semantic structure within the vector set. Now that we know which vectors form which conceptual clusters of the tree, we finally have the data to create a semantic centroid for each identified cluster.
Almost magically, we again leverage the math properties of the embedding vector space: Simply averaging the vectors in a cluster's membership list creates the cluster's centroid concept vector.
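A minimal sketch of that step, assuming vectors is the NumPy array that was clustered and member_indices is one cluster's membership list from the hierarchy above (the unit-length re-normalization is an optional assumption, not a requirement):

```python
import numpy as np

def cluster_centroid(vectors: np.ndarray, member_indices) -> np.ndarray:
    """Average a cluster's member vectors into a single centroid concept vector."""
    centroid = vectors[member_indices].mean(axis=0)
    # Optionally re-normalize to unit length so the centroid lives on the same
    # unit hypersphere as the model's embeddings.
    return centroid / np.linalg.norm(centroid)
```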
The resultant centroid vector’s encoding “describes” its entire cluster as a group. But exactly what text does each centroid vector represent?
We need the text that the clustered mean centroid concept vector represents. This is somewhat more difficult than embedding a text into a vector.
Three basic methods of reversing a clustered centroid embedding vector back into its conceptual source text were effective in my limited testing:
- Method 1: Brute-force iterative error gradient descent of embedding + LLM calls
  - Very slow, expensive $$$$$, ~20K iterations, a few hours
  - Very simple: LLM next-guess + embed and diff its guess for error
  - Faster when the initial guess text is somewhat "close" to the target
  - Faster when the LLM is given the best three guesses+errors and the last eight guesses+errors
- Method 2: Trained Embedding Dictionary Probe Model (vec2text)
  - Fast-ish, ~1 minute to reverse a vector into model-generated text
  - Python module on PyTorch + GPUs
  - Pre-trained on the text-embedding-ada-002 semantic vector space
  - Re-trainable on other embedding models and semantic content
  - Up to a few dozen text-embedding-ada-002 model calls
  - Ideal when the vector source texts required for Method 3 are unavailable
  - Not very good when only a few vectors are averaged as a centroid
- Method 3: LLM-merging of the source texts of clustered embedded vectors
  - Fast, 1x LLM call, cheap $, at most a few seconds
  - Text-frequency table of clustered member vector texts + prompt
  - Tested qualitatively superior to Method 1
  - On average, provided better detail and fewer errors than Method 2
Methods 1 and 2 generate new (model-specific) conceptual language text directly from a computed or unknown source text's embedding vector. Method 1 is not very practical, yet it demonstrates effective, if tedious and costly, text reversal. Method 2 (vec2text) uses a custom model, trained to unmap embedding-model vectors back to texts, to inform its iterative text guesses [2, 3].
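For Method 2, a rough sketch of the vec2text flow might look like this, assuming the package's pre-trained text-embedding-ada-002 corrector and a working PyTorch install (the function name and step count here are illustrative):

```python
import torch
import vec2text

def reverse_centroid_vec2text(centroid, num_steps: int = 20) -> str:
    """Invert a 1536-float ada-002 centroid vector back into model-generated text."""
    # Corrector model pre-trained against text-embedding-ada-002's vector space.
    corrector = vec2text.load_pretrained_corrector("text-embedding-ada-002")
    embeddings = torch.tensor([centroid], dtype=torch.float32)
    texts = vec2text.invert_embeddings(
        embeddings=embeddings,
        corrector=corrector,
        num_steps=num_steps,
    )
    return texts[0]
```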
When the embedding source texts are available for clustered vectors, Method 3 skips reversing the cluster centroid vector by simply asking an LLM to merge a text-frequency table of the cluster's members' source texts into a new, single centroid text. Qualitatively, in my experience, this method generated superior centroid texts relative to Methods 1 and 2. I tested Method 2 using its pre-trained model, yet with subject-domain content training, it might be competitive with Method 3.
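Method 3 can be sketched as follows; the prompt wording, model name, and function name are illustrative, not the demo's exact implementation:

```python
from collections import Counter
from openai import OpenAI

def merge_centroid_text(member_texts: list[str], model: str = "gpt-4o-mini") -> str:
    """Ask an LLM to merge a cluster's member source texts into one centroid text."""
    # Text-frequency table: repeated member texts carry more weight in the merge.
    freq_table = "\n".join(
        f"{count}x: {text}" for text, count in Counter(member_texts).items()
    )
    prompt = (
        "The following lines are texts from one semantic cluster, each prefixed by "
        "how many times it occurs. Merge them into a single short text that best "
        "describes the cluster's central concept:\n\n" + freq_table
    )
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

The demo additionally caps the size of the frequency table to a completion-model token budget (see the adaptation steps below).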
Let's take a look at a real example of embedding vector clustering with a "toy-sized" set of 26 short texts selected for their obvious semantically distinct conceptual grouping into five broader topics, with two of them (the MongoDB and Gödel topics) rich enough to potentially split into semantic sub-groups. You can run it yourself on Python 3.11 using the demo code linked below.
Here are the demo input texts, which are vectorized and then clustered as centroids:
```python
TEXTS = [
    "Be at one with your power, joy and peace.",
    "Know the flow of the much greater oneness we share.",
    "Let one's superpower be considered choices in the network of life.",

    "MongoDB Ops Manager",
    "MongoDB Cloud Manager",
    "MongoDB Cloud Manager Backups",
    "MongoDB Atlas Database",
    "MongoDB Atlas Stream Processing",
    "MongoDB Atlas Vector Search",
    "MongoDB Atlas Data Lake",
    "MongoDB Enterprise Database Server",

    "Gödel, Escher, Bach: An Eternal Golden Braid",
    "How Gödel's Theorems Shape Quantum Physics as explored by Wheeler and Hawking",
    "Bach, Johann Sebastian - Six Partitas BWV 825-830 for Piano",
    "M.C. Escher, the Graphic Work",
    "Bach's baroque style features recursion or self-referencing iterated functions like the artwork of Escher.",
    "In 1931, Gödel proved the profound duality that formal systems cannot be both self-consistent and complete.",
    "John Von Neumann was able to derive Gödel's 2nd theorem from his 1st before Gödel published it.",

    "My cat is a fun black and white half-sized tuxedo.",
    "Some people prefer the company of dogs instead of cats.",
    "My friend has a large saltwater aquarium with colorful and exotic tropical fish.",
    "My clever dog opens locked windows and doors to follow me.",

    "Mesopotamian tablets record a fantastic version of human history.",
    "North American burial mounds often held deceased local royal families.",
    "Mayan pyramids predated most Aztec pyramids.",
    "The Aztecs Quetzalcoatl closely resembles the Egyptian god Thoth.",
]
```
The specific embedding model and vector length can be varied; the demo uses text-embedding-ada-002 at 1536 dimensions.
Can it find and describe the expected semantic structure? Yes! Here is the generated tree, post-annotated with its LLM-merged (Method 3) centroid texts:
The tree and clustered centroid texts beautifully reveal the semantic structure of the input texts’ concepts at a clustering “granularity” minimum of two vectors.
For a larger vector set with additional associated metadata, it may be of further interest to examine that metadata across each cluster's member vectors, as grouped, for any related insights.
As a final thought, it should be obvious that Atlas Vector Search and clustering of normalized, scaled vectors from any data model (not just AI models) work the same way. A world of possibilities.
For effective manipulation of vector data, it is crucially important that embedding source texts be “cleaned” strings or arrays of strings omitting any formatting syntax and newlines, keeping only sentence separators.
When clustering vectors, it can be useful to test a range of "granularities," starting with a large minimum-vectors-per-cluster setting (say, 512) and then lowering it in large steps until clustering yields a useful balance between the generality and distinctness of the reversed centroid texts and the number of centroids classified. If only one (the hierarchical root) cluster develops, use a lower minimum. If more than a few dozen clusters form, try a higher minimum.
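Such a sweep might look like this sketch, assuming vectors is the array of embeddings to cluster (the demo drives the same idea from its *MINIMUM_VECTORS_PER_CLUSTER* list):

```python
from sklearn.cluster import OPTICS

for min_per_cluster in [512, 256, 128, 64, 32, 16]:
    clustering = OPTICS(min_samples=min_per_cluster).fit(vectors)
    labels = clustering.labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude noise label
    print(f"min_samples={min_per_cluster}: {n_clusters} clusters")
    if n_clusters > 36:  # "more than a few dozen": granularity is already too fine
        break
```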
The demo code uses the Python openai API module 1.36 or later and ships with (Method 2) vec2text disabled, as this demo has too few, and too dissimilar, vectors per cluster to create a meaningful mean centroid vector to reverse. To enable it, set
CONFIGURE_VEC2TEXT = True
in demo_vector_clustering.py. When enabled in the demo, the vec2text centroid texts are somewhat nonsensical because each cluster has only a sparse few member texts with dissimilar words and tenses. A larger centroid membership should converge much better to a valid central concept text using vec2text.

To adapt the demo to a collection in an existing MongoDB vector database, follow these steps:
- Verify you have the demo working, with results highly similar to those shown above, before proceeding. You might need to do step (3)(e) below for this.
- Skip using `load_vector_db.py` since we’re using an existing vector database.
- Edit the MongoDB instance URI string (including required auth credentials), the name of the target vector database, and the collection name on lines 39, 40, and 43.
- Edit the embedded source text field name and its embedding field name to cluster upon, as a dictionary object *{ "text_field_name": "embedding_field_name" }*, in line 49. You can get this info using *db.collection.getSearchIndexes()* in a *mongosh* shell. Add more than one entry if there are multiple embedded fields to cluster in separate passes over each vector. Here, the single source text field name is *text* and its embedding vector is stored in the array field *embedding*.
- Edit the output fitted cluster centroids collection name (in the same vector database), line 53. It does not need to start with *VECTOR_COLLECTION* as a prefix.
- To use *vec2text* for a second version of each semantic centroid text in addition to LLM-merged centroid texts, first verify it functions correctly using the demo data, and then edit ```CONFIGURE_VEC2TEXT = True``` on line 57 here.
- Edit the OpenAI completion model to merge centroid texts, line 62. gpt-3.5-turbo might perform okay, but I recommend gpt-4, gpt-4o, or gpt-4o-mini. For Azure OpenAI models, this is the deployment name.
- Edit the OpenAI completion model token budget, line 63. This limits the text-frequency table size for centroid text merging. You could consider increasing it above 16000; I recommend trying 16000 first.
- Edit the collection vector index path field’s embedding model used, line 67. I found *"text-embedding-3-small"* works well, too, if that’s what the search index is embedded upon. If using vec2text, however, you must use *text-embedding-ada-002*.
- Edit the cluster grouping "granularity" settings, where each *MINIMUM_VECTORS_PER_CLUSTER* list integer value is clustered one at a time in an outer loop (line 72). Here is a guess that starts with coarse granularity at 512 and progressively clusters at finer and finer granularities: ```MINIMUM_VECTORS_PER_CLUSTER = [512,256,128,64,32,16]```. If only one semantic cluster is discovered, the granularity is too high; cut it in half and try again until you start seeing clustered centroids at a useful level of detail.
- If you’d like to keep the centroids data between repeated clustering code runs, comment-out line 81: ```vector_db[fitted_collection].delete_many({})```
- Unless you are using the environment variable *OPENAI_API_KEY* for the *OpenAI()* class instance declaration in line 84, edit it with your api_key parameter and value. If using *AzureOpenAI()*, use that class instead and populate any other parameters required. If ```CONFIGURE_VEC2TEXT = True```, repeat this in *reverse_vector_vec2text.py* in line 56, also: ```model_client = OpenAI(api_key="my api key string")```
- Edit and comment-out *load_demo_vectors()* in line 260.
- Run the semantic vector clustering code. For ~100K vectors of length 1536, an Apple M2 required just over six hours. Half as many vectors (~50K) would take about a quarter as long, ~1.5 hours, because clustering time scales as O(N^2):
```
$ python demo_vector_clustering.py
```
- Inspect the output data after each clustering iteration over the values in *MINIMUM_VECTORS_PER_CLUSTER* for the level of detail in the centroid texts, and stop the code run when there is "too much" detail. In a *mongosh* shell, try:
```
db.<your_fitted_semantic_clusters_collection>.find({
  min_vectors_per_cluster: 512 // etc.
}, { centroid_embedding: 0, _id: 0 });
```
- Inspect each generated cluster hierarchy structure chart PNG file, which tends to branch more and more as *min_vectors_per_cluster* gets lower:
```
demo_vectors_clustered_<<source_text_field_name>>_<<min_vectors_per_cluster>>minPerCluster.png
```
- Optional: If each cluster's member vectors carry another metadata field that is worth accumulating into a list per fitted centroid for further external analysis, complete these steps as well (a sketch of the idea follows this list):
- Edit and un-comment the code lines which refer to a non-vector field *field_to_accumulate* in lines 221, 243, 249, 273, 279, 295, 357, and 388.
- Edit and replace *field_to_accumulate* to use the actual metadata field name everywhere it is referenced in the uncommented code.
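For orientation, that optional accumulation amounts to something like this sketch; *field_to_accumulate*, the document shape, and the function name are placeholders rather than the demo's actual code:

```python
from collections import defaultdict

def accumulate_metadata(docs, labels, field="field_to_accumulate"):
    """Collect one metadata field from each member document, grouped by cluster label."""
    accumulated = defaultdict(list)
    for doc, label in zip(docs, labels):
        if label == -1:        # skip noise points that joined no cluster
            continue
        if field in doc:
            accumulated[label].append(doc[field])
    return accumulated         # {cluster_label: [metadata values of its members]}
```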
The mathematical properties of AI embedding vectors enable us to directly use "old school AI" clustering techniques on semantic content. By modifying the demo code, you can experimentally apply this technique to your own embedding vector databases. You just might reveal new, hidden structure and intelligence in your business data.
If you have questions, want to share what you’re building, or want to see what other developers are up to, visit the MongoDB Developer Community next.