Sip, Swig, and Search With Playwright, OpenAI, and MongoDB Atlas Search

Anaiya Raisinghani • 12 min read • Published Sep 27, 2024 • Updated Oct 01, 2024
AI • Python • Atlas
It’s not often that I find myself craving a 44-oz Diet Coke with cream, but ever since I watched the sensational Hulu original “The Secret Lives of Mormon Wives,” I’ve been thinking about it much more than I’d like to admit.
If you haven’t tried a “dirty” soda before, I highly recommend it. It’s a fun combination of any soda of your choice, half-and-half or creamer, and flavored syrups. If you don’t like soda, you’ll hate it, but I fear that’s a given. I’ll admit I was skeptical, but if soda is your beverage of choice, your biggest takeaway will be wondering what other delicious things people in Utah are keeping from the rest of us.
After consulting with the professionals (a very quick Google search), I found that Swig is “home of the original dirty soda.” A look at their menu shows a ton of super unique drink options, with various flavor combinations. Since we’re heading into fall, let’s see if we can replace the traditional Pumpkin Spice Latte with something a bit less basic and a little worse for our teeth.
Let’s use Playwright to scrape the menu with the ingredients from their website, OpenAI’s structured outputs to help us decide which drinks are the most appropriate for each season, and MongoDB Atlas Search to filter out our “dirty” sodas based on their ingredients and what we’re craving.
Before we get started, let’s go over these platforms in a bit more detail just so we’re all on the same page.

What is Playwright and why are we using it?

Playwright is a powerful browser automation tool built by Microsoft. It works with all modern rendering engines, including Chromium (which Google Chrome runs on), Firefox, and WebKit, and it lets developers open new browser pages, navigate to URLs, and interact with every element on a page.
Playwright was chosen for this tutorial because of how simple it makes returning a website's elements, especially from dynamic websites like the one we are scraping. While Playwright has a ton of other incredible capabilities, it is perfect for our use case because our drink items are loaded with JavaScript after the initial page loads. With other, more traditional Python web scrapers, I was getting timeout errors or empty lists since the menu items aren’t embedded in the raw HTML of the website. Playwright, on the other hand, handles JavaScript execution nicely and waits for the content to fully load before grabbing the information we need.

What are OpenAI’s Structured Outputs?

OpenAI’s new Structured Outputs guarantee that API responses look exactly the way the developer specifies. This works by constraining whichever model you use (we are using GPT-4o) to produce a response that matches the schema the developer provides. This tutorial uses it to make sure the drinks from the Swig menu come back in a structured JSON format, since we want to analyze the model’s responses later on. I will go over how to do this in detail in the tutorial!

What is MongoDB Atlas Search?

MongoDB Atlas Search is full-text search embedded inside MongoDB Atlas, MongoDB’s cloud database service for developers. It’s crucial for this tutorial since we will be saving our scraped menu items into an Atlas cluster and then running an aggregation pipeline on the data to find which drinks match our specific season and ingredients.

Tutorial prerequisites

These prerequisites are crucial to ensure we are successful in this tutorial.
  1. An IDE of your choice: This tutorial uses a Google Colab notebook. Please feel free to follow along.
  2. OpenAI API key: You will need a paid OpenAI account to access an API key.
  3. MongoDB Atlas cluster: Please make sure you are using the free tier, that you have set your IP address to “access from anywhere,” and that you have copied your connection string to a safe place for future reference.
Once you have your OpenAI API key and connection string saved someplace safe and your MongoDB Atlas cluster provisioned, you are ready to begin!

Part 1: Scrape all menu items from Swig’s website

Inspect your website!

Before we write our main function to scrape the website, we need to thoroughly inspect the website we are hoping to scrape so that we can figure out exactly where the information we want lives.
Head over to Swig, click on the “Dirty Dr Pepper” option (or any soda option of your choice), and then click on the American Fork store to see all the menu items available for a location. You can choose any location you like; I just picked American Fork since it was the first one shown.
Click on the Dirty Dr Pepper or any soda choice
Just picking American Fork since it’s the first one shown
Then, you’ll see all the menu items for that location, as shown below:
Menu items for American Fork location
Now, we can click on any soda, right click to inspect the page, and see what we’re working with.
The first thing I want to point out is that this website is a dynamic website. How do I know this?
Swig is a dynamic website
When I inspect a menu item to figure out how to scrape our HTML, I can see that the information is located inside an iframe, meaning the content I want to scrape is HTML embedded inside of another webpage. From this example, I can see that everything I want to find (and I know this because I highlighted and inspected other parts, such as the name “Dirty Dr Pepper” and the ingredients “Dr Pepper + Coconut (120 - 440 Calories)”) is nested under the same iframe URL, https://swig-orders.crispnow.com/tabs/locations/menu.
So what does a dynamic website mean? It means I cannot just scrape the HTML from the top level directly, since the website changes and doesn’t load entirely in one go. Luckily, Playwright is a scraper that can wait for the other page elements to load before scraping, but not all web scrapers can. Before I realized this, while using a different scraper, I was getting empty outputs every time I ran my code. So please, please inspect your web page before you try to scrape it!
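If you’re ever unsure whether a page serves its content through an iframe, Playwright itself can help confirm it. Here’s a small illustrative snippet (it assumes the page object we create in the scraping function below) that lists every frame a page has loaded, which is where embedded URLs like the crispnow one show up:
# list every frame the page has loaded; iframe URLs show up here
# (assumes the `page` object created in the scraping function below)
for frame in page.frames:
    print(frame.url)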
Once we have a solid understanding of the website we are attempting to scrape, let’s write up our scraping function.

Write our scraping function

So now, let’s scrape all of the menu items Swig offers so we can see which combinations are best suited for fall. We want to get the name of each menu item and its description. An example of this is the name “Dirty Dr Pepper” and the description “Dr Pepper + Coconut (120 - 440 Calories).”
To do this, let’s first install Playwright itself. This can be done with a simple pip statement, and please keep in mind we are running this in our Google Colab notebook:
!pip install playwright
!playwright install
Now, let’s define our imports. Because we are using a notebook, we have to use async. If you’re working locally through your IDE, feel free to use sync.
import asyncio
from playwright.async_api import async_playwright
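For reference, if you do run this locally instead, a minimal sketch of the same flow with Playwright’s sync API might look like this (an illustrative variant, not part of the notebook code):
from playwright.sync_api import sync_playwright

def swig_scraper_sync():
    # same flow as the async version below, using the sync API for local scripts
    with sync_playwright() as playwright:
        browser = playwright.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto('https://swig-orders.crispnow.com/tabs/locations/menu')
        page.wait_for_selector('ion-card-content', state='attached', timeout=60000)
        browser.close()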
Let’s use the name swigScraper for our definition. Once again, we are going to use async, and run the browser in headless mode, since we are using a notebook. Learn more about when to use headless vs. headed mode.
We also want to make sure we are using the correct URL. Remember from above, we want to use the URL that is located inside of the iframe that our elements are being dynamically generated from. We don’t want the normal Swig website URL.
async def swigScraper():
    async with async_playwright() as playwright:

        browser = await playwright.chromium.launch(headless=True)
        page = await browser.new_page()

        # make sure to have the correct URL
        await page.goto('https://swig-orders.crispnow.com/tabs/locations/menu')
Since the web page we’re trying to scrape has a lot of hidden elements, let’s first scroll through the menu to see what loads after about a minute. Then, we can right click and inspect the page to see where our name and description are nested. After scrolling through for a minute, I highlighted the drink name and then right clicked and hit “inspect.” This screenshot shows my result:
Result of inspecting the name of the drink
As we can see from this screenshot, we need to wait for our ion-card-content to load before we can see where the information we want lives. This lets us finish up our function with a wait_for_selector call saying we want to wait until that specific selector loads:
        await page.wait_for_selector('ion-card-content', state='attached', timeout=60000)

        # our item names and descriptions are all located in this area
        items = await page.query_selector_all('ion-card-content')
Now, we can create a list to store our menu, loop through the HTML and take what we need, extract our text, and then make it look pretty:
        menu = []

        for item in items:
            name = await item.query_selector('p.text-h3')
            description = await item.query_selector('p.text-b2')

            # use inner_text to extract our info
            if name and description:
                result = {}
                result['name'] = await name.inner_text()
                result['description'] = await description.inner_text()
                menu.append(result)

        for item in menu:
            print(f"Name: {item['name']}, Description: {item['description']}")

        await browser.close()
        return menu

scraped_menu = await swigScraper()
print(scraped_menu)
Our results will look like this:
Results of scraping our Swig menu
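Optionally, you might want to checkpoint the scraped menu to disk so you don’t have to re-scrape while iterating on the later steps. A small convenience sketch:
import json

# optional: save the scraped menu so we don't need to re-scrape while iterating
with open('swig_menu.json', 'w') as f:
    json.dump(scraped_menu, f, indent=2)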
Now that we have all our menu items nicely formatted, let’s use OpenAI’s Structured Outputs to let us know which drinks are perfect for each season and why!

Part 2: OpenAI structured schema outputs

Please make sure you have your OpenAI API key ready! We will be using it in this section.
We are going to be using Structured Outputs so that we can let the model know exactly what we’re looking for and how we want our output to be styled.
Our first step is to install OpenAI:
!pip install openai
Now, let’s import openai along with json and getpass for our OpenAI API key.
import openai
import json
import getpass
Using getpass, input your API key so that it’s easy for us to use throughout this section of the tutorial.
# put in your OpenAI API key here
openai_api_key = getpass.getpass(prompt="Put in OpenAI API Key here")
Before we can get started, we need to make sure our menu is properly formatted so OpenAI and our model can understand it. We can do this by putting all of our drinks and their descriptions into a single string. We also want to give OpenAI a prompt explaining exactly what it is and which drinks and ingredients are available. I am going to tell our model that it is the best soda mixologist Utah has ever seen and that I am providing a list of our sodas with their descriptions. I also want to ask which sodas are best for each season (spring, summer, fall, winter) based on their descriptions:
def swigJoined(scraped_menu):
    drink_list = []

    # just formatting our menu from above
    for drink in scraped_menu:
        drink_format = f"{drink['name']}: {drink['description']}"
        drink_list.append(drink_format)

    # put all the drinks into a single string so OpenAI can understand it
    drink_string = "\n".join(drink_list)

    # we have to tell OpenAI which drinks/combinations are available
    prompt = (
        "You are the best soda mixologist Utah has ever seen! This is a list of sodas and their descriptions, or ingredients:\n"
        f"{drink_string}\n\nPlease sort each and every drink provided into spring, summer, fall, or winter seasons based on their ingredients\n"
        "and give me reasonings as to why by stating which ingredients make it best for each season. For example, cinnamon is more fall, but peach\n"
        "is more summer."
    )

    return prompt
Now, let’s generate our prompt using the menu we scraped. We are going to be using our prompt down below in our structured outputs part of this tutorial:
my_prompt = swigJoined(scraped_menu)
openai.api_key = openai_api_key
Now that this is ready, we can use our structured call and JSON schema. For help on this part, please refer to the documentation. Under the “extracting structured data from unstructured data” section, we can see that the request should follow this structure:
POST /v1/chat/completions
{
  "model": "gpt-4o-2024-08-06",
  "messages": [
    {
      "role": "system",
      "content": "Extract action items, due dates, and owners from meeting notes."
    },
    {
      "role": "user",
      "content": "...meeting notes go here..."
    }
  ],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "action_items",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "action_items": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "description": {
                  "type": "string",
                  "description": "Description of the action item."
                },
                "due_date": {
                  "type": ["string", "null"],
                  "description": "Due date for the action item, can be null if not specified."
                },
                "owner": {
                  "type": ["string", "null"],
                  "description": "Owner responsible for the action item, can be null if not specified."
                }
              },
              "required": ["description", "due_date", "owner"],
              "additionalProperties": false
            },
            "description": "List of action items from the meeting."
          }
        },
        "required": ["action_items"],
        "additionalProperties": false
      }
    }
  }
}
So, we can take this skeleton code and make it our own. This tutorial uses GPT-4o, but please feel free to use whichever model you’re most comfortable with:
response = openai.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are the best soda mixologist Utah has ever seen!"},
        {"role": "user", "content": my_prompt}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "drink_response",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "seasonal_drinks": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "drink": {"type": "string"},
                                "reason": {"type": "string"}
                            },
                            "required": ["drink", "reason"],
                            "additionalProperties": False
                        }
                    }
                },
                "required": ["seasonal_drinks"],
                "additionalProperties": False
            }
        }
    }
)
Now, let’s print it and see our structured response:
print(json.dumps(response.model_dump(), indent=2))
Structured output from OpenAI Structured Outputs in the API
Here, we can see that the output we are looking for is located inside of the “content” part of the response. Let’s extract it so that we can see a list of the drinks and the reasons why each drink is best for each season. Let’s do this by printing it out using model_dump:
content = response.model_dump()['choices'][0]['message']['content']
print(content)
After printing out the “content” line from our structured schema
It’s still all on one line, so let’s parse it and print out the drinks so they look pretty:
parsed_drinks = json.loads(content)
seasonal_drinks_pretty = parsed_drinks['seasonal_drinks']
print(json.dumps(seasonal_drinks_pretty, indent=2))
Our drinks and the reasons why they’re good for each season
Now, we can see all the drinks that are perfect for each season from the Swig menu! Let’s take a look at some of them.
OpenAI believes the Dirty S.O.P is perfect for summer since "The inclusion of peach makes this drink more suited for summer, as peach is typically associated with warm weather and summer harvests." A great drink for fall and winter is the Dr Spice: "Cinnamon and cinnamon stick are warm spices typically associated with fall and winter, making this drink best suited for chillier weather."
Now that we know which soda-based drinks are perfect for each season based on our output, let’s go ahead and insert our drinks as documents into our MongoDB Atlas cluster so we can run an aggregation pipeline on them and figure out which ones are perfect for our upcoming fall season.

Part 3: Insert into MongoDB Atlas and create an aggregation pipeline

Our first step is to install pymongo. PyMongo is the official MongoDB driver for Python applications.
Install it using pip:
!pip install pymongo
Here, we are going to import our MongoClient and set up our MongoDB connection using getpass. We can name our database and collection anything we want, since they will be created when we insert our data. I am naming my database “swig_menu” and the collection “seasonal_drinks.”
from pymongo import MongoClient

# set up your MongoDB connection
connection_string = getpass.getpass(prompt="Enter connection string WITH USER + PASS here")
client = MongoClient(connection_string, appname="devrel.showcase.swig_menu")

# name your database and collection anything you want since they will be created when you insert your data
database = client['swig_menu']
collection = database['seasonal_drinks']

# insert our seasonal drinks
collection.insert_many(seasonal_drinks_pretty)
Once you run this block, double-check that everything was inserted correctly:
Correctly inserted documents in MongoDB Atlas
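If you’d rather verify from the notebook instead of the Atlas UI, a quick sanity check against the collection we just created might look like this:
# count the inserted documents and peek at one of them
print(collection.count_documents({}))  # should equal len(seasonal_drinks_pretty)
print(collection.find_one())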
Now, let’s create an Atlas Search index so we can use MongoDB’s Atlas Search on our documents!
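If you use the JSON editor while creating the index in the Atlas UI, the default dynamic mapping is all this tutorial needs, since we only run text queries against the reason field. A minimal index definition (kept under the default index name, default, which is what $search uses when no index is specified) looks like this:
{
  "mappings": {
    "dynamic": true
  }
}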
Once you’ve created your search index, create an aggregation pipeline using the MongoDB Atlas UI. To do this, head over to “Collections” and then click on “Aggregation.” Here, we can search through our seasonal drinks and use Atlas Search’s exact match feature to figure out which drinks are best for fall!
How to create an aggregation pipeline in the MongoDB Atlas UI
Let’s first see all the fall drinks that our AI model found for us. To do this, we can use our $search operator and create a stage in our aggregation pipeline:
{
  "text": {
    "query": "fall",
    "path": "reason"
  }
}
We have eight results!
Results from finding drinks with “fall” in them
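You can also run the same stage from the notebook with PyMongo instead of the Atlas UI. A sketch, assuming your search index kept the default name, default:
# run the same $search stage from Python instead of the Atlas UI
pipeline = [
    {
        "$search": {
            "index": "default",  # assumes the index name from the step above
            "text": {"query": "fall", "path": "reason"}
        }
    }
]

for doc in collection.aggregate(pipeline):
    print(doc['drink'], '-', doc['reason'])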
Now, let’s say I want fall drinks that have the ingredient “apple” in them. To do this, we need to use the compound operator, which combines two or more queries. Here, I want to find drinks that contain “fall” AND “apple,” so each clause goes inside a “must.” If I wanted “fall” OR “apple,” I would use a “should” instead.
{
  "compound": {
    "must": [
      {
        "text": {
          "query": "fall",
          "path": "reason"
        }
      },
      {
        "text": {
          "query": "apple",
          "path": "reason"
        }
      }
    ]
  }
}
Output after running the aggregation pipeline
We have two great options for fall drinks that include apples!
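For comparison, here is what the OR version could look like: swap “must” for “should” and set minimumShouldMatch to 1 so at least one clause has to match, returning drinks whose reason mentions “fall” OR “apple”:
{
  "compound": {
    "should": [
      {
        "text": {
          "query": "fall",
          "path": "reason"
        }
      },
      {
        "text": {
          "query": "apple",
          "path": "reason"
        }
      }
    ],
    "minimumShouldMatch": 1
  }
}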
Now, we can find drinks from Swig’s website that are super specific to what we are craving for each season.

Conclusion

In this tutorial, we covered how to scrape a website using Playwright, ran our scraped information through OpenAI to get structured results for seasonal drinks from the menu along with the reasoning behind them, and finished off by importing our drinks and their reasons into MongoDB Atlas and using MongoDB Atlas Search to find fall drinks by their ingredients!
I hope you enjoyed this tutorial. Please connect with us in the Developer Forum.
