How to Implement Working Memory in AI Agents and Agentic Systems for Real-time AI Applications
Richmond Alake • 12 min read • Published Nov 18, 2024 • Updated Nov 18, 2024
Memory is the cornerstone on which all forms of intelligence emerge and evolve. It creates the foundation for human cognition and artificial systems to build complex understanding. For humans, memory is a dynamic biological process of encoding, storing, and retrieving information through neural networks, shaping our ability to learn, adapt, and make decisions.
For computational systems in the modern AI application landscape, such as LLM-powered chatbots, AI agents, and agentic systems, memory is the foundation for their reliability, performance, and applicability, determining their capacity to maintain context, learn from interactions, and exhibit consistent, intelligent behavior.
In this tutorial, we will cover:
- Memory in AI agents and agentic systems.
- How to implement working memory in agentic systems.
- How to use Tavily and MongoDB to implement working memory.
- A practical use case: implementing an AI sales assistant with real-time access to internal product catalogs and online information, showcasing working memory's role in personalized recommendations and user interactions.
- The benefits of working memory in AI applications in real-time scenarios.
Why should you read this tutorial?
Your ability to understand memory from a holistic perspective and implement various forms of memory within computational systems positions you at a critical intersection of cognitive architecture design and practical AI development, making your expertise invaluable as these paradigms proliferate and become the dominant form factor of modern AI systems.
For a while now, intelligence has not been limited to just humans. The emergence of AI has expanded the association of memory, which was once solely a psychological process, to include computational processes. This tutorial's definition of memory will encompass human and computational paradigms.
Memory in intelligent entities is the mechanism that facilitates the storage, retrieval, and organization of information that is derived from an intelligent entity's interaction with its environment, other entities, and experiences. There are two primary forms of memory: short and long-term memory.
- Short-term memory: Holds information for a limited period of time
- Long-term memory: Stores information for an extended period of time
There are various forms of short-term memory. The key focus of this tutorial is working memory in agentic systems, mainly as a capability of an AI agent. Read more about AI agents and agentic systems.
Working memory in an agentic system is a core computational component that manages transient and temporary information through integrated system resources (databases, APIs, state management systems) to enable real-time processing and decision-making. It functions as the active execution context where immediate information is temporarily stored and manipulated, supporting:
- Real-time context integration.
- Dynamic response generation.
- Adaptive execution planning.
- State-aware decision making.
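To make this more concrete, the sketch below models working memory as a small in-process buffer that holds recent conversation turns and freshly retrieved facts, and assembles them into the context an LLM would see on the next step. This is a conceptual illustration only; the class and field names are assumptions, not part of the implementation covered later in this tutorial.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Conceptual sketch of an agent's working memory (illustrative names only)."""
    max_turns: int = 10
    recent_turns: deque = field(default_factory=deque)   # short-lived conversation state
    retrieved_facts: list = field(default_factory=list)  # transient results from tools/APIs/databases

    def add_turn(self, role: str, content: str) -> None:
        # Keep only the most recent turns in the active execution context
        self.recent_turns.append({"role": role, "content": content})
        while len(self.recent_turns) > self.max_turns:
            self.recent_turns.popleft()

    def add_fact(self, fact: str) -> None:
        # Facts gathered during the current task; later discarded or promoted to long-term storage
        self.retrieved_facts.append(fact)

    def build_context(self) -> str:
        # Assemble the transient state used for the next response
        turns = "\n".join(f"{t['role']}: {t['content']}" for t in self.recent_turns)
        facts = "\n".join(self.retrieved_facts)
        return f"Conversation so far:\n{turns}\n\nRelevant facts:\n{facts}"
```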
This section focuses on a use case that is common across the retail industry and can be adapted to other sectors: implementing a knowledge assistant that is aware of company-specific data. The code implementation shown in this section produces an AI sales assistant that can access an internal product catalog (knowledge from long-term memory) while also accessing real-time online information (working memory) related to a query.
Let’s set the scene.
In the fast-paced world of retail, providing exceptional customer service and personalized shopping experiences is paramount. AI-powered sales assistants are emerging as a transformative tool, enabling retailers to understand customer needs better and offer tailored product recommendations.
However, building knowledgeable sales assistants requires more than just understanding language; it demands the ability to retain and recall past interactions, retrieve information from an existing knowledge base, and obtain and store new information, much like human memory.
Here, we introduce a method for implementing working memory in an AI sales assistant using Tavily, Cohere, and MongoDB.
Key aspects:
- Data storage and retrieval with MongoDB
- Tavily Hybrid RAG client for working memory
- Real-time search and retrieval
The full code implementation for this use case can be found on GitHub. In this section, we focus on the critical steps required to understand the implementation of working memory in an AI application, including LLM-enabled chatbots, AI agents, and agentic systems; refer to the full code on GitHub for a complete, end-to-end implementation.
In this step, the aim is to create a knowledge base consisting of product data accessible by the sales assistant via retrieval mechanisms. The retrieval mechanism used in this tutorial is vector search. MongoDB is used as both the operational and vector database for the sales assistant's knowledge base. This means we can conduct a semantic search between the vector embeddings of each product, generated from concatenated existing product attributes, and an embedding of a user's query passed into the assistant.
Data storage
The process begins with data ingestion into MongoDB. The product data, including attributes like product name, category, description, and technical details, is structured into a pandas DataFrame.
The product data used in this example is sourced from the Hugging Face Datasets library using the load_dataset() function. Specifically, it is obtained from the "philschmid/amazon-product-descriptions-vlm" dataset, which contains a vast collection of Amazon product descriptions and related information.

```python
from datasets import load_dataset
import pandas as pd

# Make sure you have an HF_TOKEN in your environment variables to access the dataset on Hugging Face
product_dataset = load_dataset("philschmid/amazon-product-descriptions-vlm")

# Convert product_dataset to pandas dataframe
product_dataframe = pd.DataFrame(product_dataset['train'])
```
This DataFrame is then converted into a list of dictionaries, each representing a product. The insert_many() method from the PyMongo library is then used to efficiently insert these product documents into the MongoDB collection, named products, within the amazon_products database. This crucial step establishes the foundation of the AI sales assistant's knowledge base, making the product data accessible for downstream retrieval and analysis processes.

```python
try:
    documents = product_dataframe.to_dict('records')
    product_collection.insert_many(documents)
    print("Data ingestion into MongoDB completed")
except Exception as e:
    print(f"Error during data ingestion into MongoDB: {e}")
```
Embedding generation and storage
To facilitate semantic search capabilities, each product document is enriched with embeddings. The get_embedding() function utilizes the Cohere API to generate a numerical representation of each product's semantic meaning. This function leverages the embed-english-v3.0 model from Cohere to embed the combined textual information stored in each document's product_semantics field.

```python
import cohere

co = cohere.ClientV2()

def get_embedding(texts, model="embed-english-v3.0", input_type="search_document"):
    """Gets embeddings for a list of texts using the Cohere API.

    Args:
        texts: A list of texts to embed.
        model: The Cohere embedding model to use.
        input_type: The input type for the embedding model.

    Returns:
        A list of embeddings, where each embedding is a list of floats.
    """
    try:
        response = co.embed(
            texts=[texts],
            model=model,
            input_type=input_type,
            embedding_types=["float"],
        )
        # Extract and return the embeddings
        return response.embeddings.float[0]
    except Exception as e:
        print(f"Error generating embeddings: {e}")
        print("Couldn't generate embedding for text: ")
        print(texts)
        return None
```
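The product_semantics field that gets embedded is not constructed in the snippets shown here. A minimal sketch of how it could be built, assuming the dataset exposes columns such as Product Name, Category, and Description (the column names are assumptions; check the dataset schema):

```python
# Combine several textual attributes into a single field to embed.
# Column names below are assumptions; adjust them to match the dataset schema.
def create_product_semantics(row):
    return (
        f"Product: {row.get('Product Name', '')}. "
        f"Category: {row.get('Category', '')}. "
        f"Description: {row.get('Description', '')}"
    )

product_dataframe['product_semantics'] = product_dataframe.apply(create_product_semantics, axis=1)
```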
The resulting embeddings are then stored within a dedicated embedding field in each product document. This step enables the system to search for products based on their semantic similarity, allowing for more nuanced and relevant recommendations.

```python
# Generate an embedding attribute for each data point in the dataset
# Embedding is generated from the new product semantics attribute
try:
    product_dataframe['embedding'] = product_dataframe['product_semantics'].apply(get_embedding)
    print("Embeddings generated successfully")
except Exception as e:
    print(f"Error generating embeddings: {e}")
```
Retrieval and vector search
MongoDB Atlas Vector Search is used for semantic-based retrieval. This feature allows for efficient similarity searches using the pre-calculated product embeddings. The system can retrieve products that are semantically similar to a query by querying the embedding field with a target embedding. This approach significantly enhances the AI sales assistant's ability to understand user intent and offer relevant product suggestions. Variables like embedding_field_name and vector_search_index_name are used to configure and interact with the vector search index within MongoDB, ensuring efficient retrieval of similar products.

```python
# The field containing the text embeddings on each document
embedding_field_name = "embedding"

# MongoDB Atlas Vector Search index name
vector_search_index_name = "vector_index"
```
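Tavily issues these vector queries on our behalf later, but for illustration, a direct semantic search against the collection could look like the sketch below, using MongoDB's $vectorSearch aggregation stage and the get_embedding() function defined earlier. The numCandidates and limit values are arbitrary choices:

```python
def vector_search(user_query, limit=5):
    # Embed the user's query with the same model used for the product documents
    query_embedding = get_embedding(user_query, input_type="search_query")

    pipeline = [
        {
            "$vectorSearch": {
                "index": vector_search_index_name,
                "path": embedding_field_name,
                "queryVector": query_embedding,
                "numCandidates": 100,  # number of approximate nearest-neighbour candidates to consider
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "product_semantics": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
    return list(product_collection.aggregate(pipeline))

# Example usage
# for doc in vector_search("black office laptop"):
#     print(doc["score"], doc["product_semantics"][:80])
```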
Vector indexes are required to enable efficient semantic search within MongoDB. By creating a vector index on the embedding field of the product documents, MongoDB can leverage the HNSW algorithm to perform fast similarity searches. This means that when the AI sales assistant needs to find products similar to a user's query, MongoDB can quickly identify and retrieve the most relevant products based on their semantic embeddings. This significantly improves the system's ability to understand user intent and deliver accurate recommendations in real time.

```python
import time

from pymongo.operations import SearchIndexModel

def setup_vector_search_index(collection, index_definition, index_name="vector_index"):
    """
    Setup a vector search index for a MongoDB collection and wait for 30 seconds.

    Args:
        collection: MongoDB collection object
        index_definition: Dictionary containing the index definition
        index_name: Name of the index (default: "vector_index")
    """
    new_vector_search_index_model = SearchIndexModel(
        definition=index_definition,
        name=index_name,
        type="vectorSearch"
    )

    # Create the new index
    try:
        result = collection.create_search_index(model=new_vector_search_index_model)
        print(f"Creating index '{index_name}'...")

        # Sleep for 30 seconds
        print(f"Waiting for 30 seconds to allow index '{index_name}' to be created...")
        time.sleep(30)
        print(f"30-second wait completed for index '{index_name}'.")

        return result
    except Exception as e:
        print(f"Error creating new vector search index '{index_name}': {str(e)}")
        return None
```
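The index_definition passed to this function is not shown above. A minimal sketch for this tutorial's setup, assuming cosine similarity and the 1,024 dimensions produced by Cohere's embed-english-v3.0 model:

```python
# Vector index definition for the embedding field
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": embedding_field_name,
            "numDimensions": 1024,  # embed-english-v3.0 output dimensionality
            "similarity": "cosine",
        }
    ]
}

setup_vector_search_index(product_collection, index_definition, vector_search_index_name)
```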
The Tavily Hybrid RAG client forms the core of the AI sales assistant's working memory, bridging the gap between the internal knowledge base stored in MongoDB and the vast external knowledge available online.
Unlike traditional RAG systems that rely solely on retrieving documents, adding Tavily into our system introduces a hybrid approach, which combines information from static (local) and dynamic (foreign) sources to provide comprehensive and context-aware responses. This is a form of HybridRAG, as we use two retrieval techniques to supplement information provided to an LLM.
In this implementation step, Tavily acts as a central orchestrator, integrating with MongoDB and external search engines. When a user makes a query, Tavily first searches the local knowledge base in MongoDB, leveraging the vector index to identify semantically similar products. Simultaneously, Tavily can query external sources for broader contexts or information that is not found locally. The results from both sources are then intelligently combined and presented to the user, providing a more complete and insightful answer.
The code snippet below initializes the Tavily Hybrid RAG client, which is the core component responsible for implementing working memory in AI sales assistants. It imports the necessary libraries (pymongo and tavily) and then creates an instance of the TavilyHybridClient class. During initialization, it configures the client with the Tavily API key, specifies MongoDB as the database provider, and provides references to the MongoDB collection, vector search index, embedding field, and content field.
This setup establishes the connection between Tavily and the underlying knowledge base, enabling the client to perform a hybrid search and manage working memory effectively.
```python
import os

from pymongo import MongoClient
from tavily import TavilyHybridClient

hybrid_rag = TavilyHybridClient(
    api_key=os.environ.get("TAVILY_API_KEY"),
    db_provider="mongodb",
    collection=product_collection,
    index=vector_search_index_name,
    embeddings_field="embedding",
    content_field="product_semantics"
)
```
Tavily manages a working memory that stores and recalls previously searched or generated information from foreign sources. Working memory enables the AI sales assistant to maintain a relevant conversation state. When processing a new query, the system integrates information from both its knowledge base and a real-time information source.
```python
results = hybrid_rag.search("Get me a black laptop to use in an office", max_local=5, max_foreign=2)
```
The code snippet above initiates a search using the Tavily Hybrid RAG client. It calls the search() method of the hybrid_rag object with the user's query ("Get me a black laptop to use in an office") as input. The parameters max_local=5 and max_foreign=2 limit the number of results retrieved from the local knowledge base (MongoDB) to five and the number of results fetched from external sources to two. The results of the search, containing both local and foreign documents, are stored in the results variable for further processing or display.

Take note that both of the items below are sourced from the internet, i.e., a "foreign" source:
- "Shop Office Depot for Black Laptop Computers..."
- "Actual charge time will vary based on operating..."
Further explanation of the code:
- hybrid_rag.search(): This initiates a search operation using the previously initialized Tavily Hybrid RAG client.
- "Get me a black laptop to use in an office": This is the user's query that the system will try to answer by searching both local and foreign sources.
- max_local=5: This parameter limits the number of results retrieved from the local knowledge base (MongoDB) to a maximum of 5. This helps prioritize relevant information stored internally.
- max_foreign=2: This parameter limits the number of results fetched from external sources (such as websites or search engines) to a maximum of 2. This controls the amount of external information incorporated into the response.
- results: This variable stores the search results returned by the Tavily client, which will include both local and foreign documents based on the specified limitations.
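Once the combined results are available, they can be assembled into context for a response model. The sketch below joins the result contents into a prompt and passes it to a Cohere chat model; the result field name ("content") and the model name are assumptions, so adjust them to the actual response structure and the models available to you:

```python
# Assemble retrieved content (local + foreign) into a single context block.
# The 'content' key is an assumption about the result structure.
context = "\n\n".join(str(result.get("content", result)) for result in results)

prompt = (
    "You are a helpful retail sales assistant. Using the product information below, "
    "answer the customer's request.\n\n"
    f"Product information:\n{context}\n\n"
    "Customer request: Get me a black laptop to use in an office."
)

# Model name is an assumption; use any Cohere chat model you have access to.
chat_response = co.chat(
    model="command-r-plus-08-2024",
    messages=[{"role": "user", "content": prompt}],
)
print(chat_response.message.content[0].text)
```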
There are scenarios where storing new information from the working memory into a long-term memory component within a system is required.
For example, let's assume the user asks for "a black laptop with a long battery life for office use." Tavily might retrieve information about a specific laptop model with a long battery life from an external website. By saving this foreign data, the next time a user asks for a "laptop with long battery life", the AI sales assistant can directly retrieve the previously saved information from its local knowledge base, providing a faster and more efficient response.
Below are a few more benefits and rationale for saving foreign data from working memory to long-term memory:
- Enriched knowledge base: By saving foreign data, the AI sales assistant's knowledge base becomes more comprehensive and up-to-date with information from the web. This can significantly improve the relevance and accuracy of future responses.
- Reduced latency: Subsequent searches for similar queries will be faster as the relevant information is now available locally, eliminating the need to query external sources again. This also reduces the operational cost of the entire system.
- Offline access: If external sources become unavailable, the AI sales assistant can still provide answers based on the previously saved foreign data, ensuring continuity of service.
```python
results = hybrid_rag.search("Get me a black laptop to use in an office", max_local=5, max_foreign=2, save_foreign=True)
```
The line of code above initiates a search using the Tavily Hybrid RAG client, similar to the previous example. However, it includes an additional parameter, save_foreign=True, which instructs the client to save the retrieved foreign results (from external sources) into the local knowledge base (MongoDB). This means that the information retrieved from external sources is stored and becomes part of the AI sales assistant's long-term memory.

Observe that the "local" results now include search results that were once "foreign": items used in working memory have been moved to long-term memory without any extensive implementation effort.
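To see the effect, a follow-up query can be run with external retrieval disabled; because the previously foreign documents were saved with save_foreign=True, a purely local search can now answer it. A minimal sketch, assuming the client accepts a zero value to turn off foreign retrieval:

```python
# Foreign retrieval disabled: only the local knowledge base (now enriched with
# previously saved foreign results) is searched.
local_only_results = hybrid_rag.search(
    "black laptop with long battery life for office use",
    max_local=5,
    max_foreign=0,
)
```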
Working memory, enabled by Tavily and MongoDB in your AI application stack, offers several key benefits for LLM-powered chatbots, AI agents, and agentic systems, including AI-powered sales assistants:
- Enhanced context and personalization: AI agents can remember past interactions and user preferences, allowing them to provide more contextually relevant and personalized responses. This is demonstrated in the code through the use of the Tavily Hybrid RAG client, which stores and retrieves information from both local and foreign sources, allowing the system to recall past interactions.
- Improved efficiency and speed: Working memory allows AI agents to access previously retrieved information quickly, reducing the need for repeated external queries. This is evident in the code where the save_foreign=True parameter enables saving foreign data into the local knowledge base, accelerating future searches for similar information.
- Increased knowledge base and adaptability: By saving foreign data, AI agents can continuously expand their knowledge base, learning from new interactions and adapting to evolving user needs. This is reflected in the code's use of MongoDB as a long-term memory store, enabling the system to build a more comprehensive knowledge base over time.
- Enhanced user experience: Working memory enables more natural and engaging interactions, as AI agents can understand and respond to user queries with greater context and personalization. This is a crucial benefit highlighted in the AI sales assistant use case, where remembering past interactions leads to more satisfying customer experiences.
Overall, working memory empowers AI agents and agentic systems to become more intelligent, adaptable, reliable, and user-centric, significantly improving their adoption, effectiveness, and overall user experience.
For more information on working memory and other forms of memory in agentic systems, this extensive tutorial implements an agentic system that uses Tavily and MongoDB for memory components.
Memory in AI agents, agentic, and compound AI systems refers to the mechanisms that enable these intelligent entities to store, retrieve, and organize information derived from their interactions with the environment, other entities, and experiences. This memory is crucial for maintaining context, learning from past interactions, and making intelligent decisions. It encompasses short-term memory, which holds information temporarily for immediate processing, and long-term memory, which stores information for extended periods.
Working memory in intelligent systems is a form of short-term memory that manages transient and temporary information necessary for real-time processing and decision-making. It functions as the active execution context where immediate information is temporarily stored and manipulated. In AI agents, working memory supports real-time context integration, dynamic response generation, adaptive execution planning, and state-aware decision-making.
Implementing working memory in AI agents involves integrating Tavily and MongoDB to manage transient information effectively. Start by creating a knowledge base (long-term memory) in MongoDB, storing product data with semantic embeddings generated using Cohere. Then, set up Tavily's Hybrid RAG client to handle working memory, allowing the AI agent to access both local and real-time external information. This hybrid approach enables the agent to retrieve, store, and manipulate immediate information, enhancing real-time processing and decision-making in AI applications.
The Tavily Hybrid RAG client serves as the core component for implementing working memory in AI agents. It acts as an orchestrator that bridges the internal knowledge base (MongoDB) and external real-time information sources. By combining local and foreign data retrieval, Tavily allows AI agents to maintain context, adapt to new information, and provide dynamic responses. This hybrid retrieval mechanism enhances the agent's ability to process data in real-time, supporting more intelligent and context-aware interactions.
Working memory enhances AI agents by:
- Maintaining context: Allows agents to keep track of ongoing interactions and user preferences.
- Dynamic decision-making: Supports real-time processing of new information for adaptive responses.
- Learning from interactions: Enables agents to incorporate new data into their knowledge base, improving over time.
- Providing personalized experiences: Leads to more contextually relevant and satisfying user interactions.
- Enhancing efficiency: Reduces latency by storing frequently accessed information for quick retrieval.
Working memory makes AI agents more intelligent, responsive, and user-centric in real-time applications.