Looking for help in aggregating user orders data

Neeraj_Sathe · 2024-11-19T16:12:33.459Z

Hello,

I am having an aggregation use case where I have a large JSON containing product orders data that I need to aggregate.

Below is a small subset of the data which I am planning to aggregate and seek your assistance.

[
  {
    "orderid": "1111",
    "order_type": "individual",
    "users": [
      {
        "name": "user1",
        "phone": 982928,
        "items": [
          {
            "name": "AAA",
            "qty": 20,
            "price": 10
          },
          {
            "name": "BBB",
            "qty": 30,
            "price": 15
          }
        ]
      },
      {
        "name": "user2",
        "phone": 948783,
        "items": [
          {
            "name": "AAA",
            "qty": 10,
            "price": 10
          },
          {
            "name": "CCC",
            "qty": 5,
            "price": 20
          }
        ]
      },
      {
        "name": "user3",
        "phone": 787868,
        "items": [
          {
            "name": "BBB",
            "qty": 40,
            "price": 10
          },
          {
            "name": "CCC",
            "qty": 15,
            "price": 20
          }
        ]
      }
    ]
  },
  {
    "orderid": "2222",
    "order_type": "bulk",
    "users": [
      {
        "name": "user1",
        "phone": 982928
      },
      {
        "name": "user3",
        "phone": 787868
      }
    ],
    "items": [
      {
        "name": "AAA",
        "qty": 3,
        "price": 10
      },
      {
        "name": "BBB",
        "qty": 15,
        "price": 10
      }
    ]
  }
]

The output I am looking to achieve would look something like:

[
  {
    "user":"user1",
    "orders":[1111,2222],
    "unique_items":2
  },
  {
    "user":"user2",
    "orders":[1111],
    "unique_items":2
  },
  {
    "user":"user3",
    "orders":[1111,2222],
    "unique_items":3
  }
]

Here is a link to mongo playground for convenience.

I am relatively new at aggregation, any help in this regard would be highly appreciated.

Neeraj_Sathe · 2024-11-20T01:16:55.558Z

@Akshat_Gupta3 can you help?

John_Sewell · 2024-11-22T09:02:56.460Z

I’d start with an $unwind on the users and then re-group using a $group with the new ID being the user ID, you can then $addToSet the orders to get a distinct list of them, you can also probably run a reduce on the items to get an array of the item names then run another stage after that to get a count of items in the unique set.
If you have a massive dataset this could be slow due to the unwind, in which case you could build it gradually by filtering the input data and effectively running in batches.

steevej · 2024-11-23T14:34:58.345Z

First, a big thank you for providing a playground. It makes helping you so much easier to do.

As mentioned by John

John_Sewell:

this could be slow due to the unwind

and $group. Group is expensive in terms of memory as all documents must be processed before the first document is output.

So in a case like this, where I assume that you have a collection of unique user profile, I would start the aggregation with the user_collection rather than the order_collection. You already have all the groups, you then $lookup the order_collection to get what you want.

The first stages would looked like the one at this updated playground where I added a user_collection. With an index on users.name the $lookup might be faster than $unwind/$group.

The stages to count the unique items would be a bit more complicated due to the different schema between bulk and individual. The $lookup pipeline probably could be use to $set items with $cond of order_type, for bulk $items is used as-is while for indivual would use the items specific for the user.

I’ll come back here when I have an idea and time.

steevej · 2024-11-24T15:09:29.721Z

Here is an updated playground that improves from the previous one to do

steevej:

the $lookup pipeline probably could be use to $set items with $cond of order_type, for bulk $items is used as-is while for individual would use the items specific for the user.

There is a new $set in the lookup pipeline that $map(s) the unified items to only keep the item’s names. This is the $reduce mentioned in

John_Sewell:

run a reduce on the items to get an array of the item names

The final $set of main pipeline then uses $setUnion to produces an array of unique items.

I left out the counting of unique_items because I prefer to get the list of unique items rather than just the count. The application layer can easily do the calculation. Obviously, the calculation could be done on the server to reduce bandwidth utilization but increase the CPU utilization.

I also left out the cleaning of the temporary results (all prefixed with tmp) like _tmp_lookup_orders. It makes understanding the different steps easier by keeping them.

Ramkumar_Pillai · 2024-11-24T21:10:39.398Z

db.orders.aggregate([
  // Step 1: Unwind the 'users' array to get one document per user per order
  { $unwind: "$users" },
  
  // Step 2: Create a unique identifier for each user (combining order ID and user)
  {
    $project: {
      orderid: 1,
      user: "$users.name",
      items: { $ifNull: ["$users.items", []] } // Ensuring 'items' field is included, even if empty
    }
  },
  
  // Step 3: Group by user to collect all orders and unique items
  {
    $group: {
      _id: "$user",
      orders: { $addToSet: "$orderid" }, // Collect unique order ids
      unique_items: {
        $addToSet: { $map: {
            input: { $ifNull: ["$items", []] }, // Handle missing 'items' array
            as: "item",
            in: "$$item.name"
          }
        }
      }
    }
  },
  
  // Step 4: Project the final desired output format
  {
    $project: {
      user: "$_id",
      orders: 1,
      unique_items: { $size: "$unique_items" } // Count unique items
    }
  },

  // Step 5: Sort the results by user name (optional)
  { $sort: { user: 1 } }
]);

steevej · 2024-11-24T22:08:46.497Z

Even if your aggregation may provide the appropriate answer with the sample data provided, your aggregation does not work properly. The way you $project items in the 2nd stage does not take into consideration that items is a top level array for order_type:bulk. You set it to an empty array which completely wipes out the items array that already exists. This was mentioned by

steevej:

The stages to count the unique items would be a bit more complicated due to the different schema between bulk and individual.

Your pipeline works with the provided dataset because all the items of the only bulk order are already present in invidual orders.

Your $addToSet seems to add the whole items array rather than individual item names. So it looks like the unique_items count it provides is the number of orders rather than the number of items. This will be confirmed if the count for user3 is 2 rather than 3.

I created a playground with the original data set and your pipeline but with an extra item in the bulk invoice that is not present in any of the individual orders. You will see that this item (which I named only_in_bulk) is not in your result set. I also use a different field of the unique item counts and kept the result of $addToSet to show that it may count the number of items list rather than the number of items.

I also update my previous playground with only_in_bulk to show that the updated result set is different when some items only appers in bulk orders.

steevej · 2024-12-04T02:30:58.536Z

@Neeraj_Sathe, despite having no response form @Akshat_Gupta3 like requested you did receive feedback from 3 fellows.

We would really appreciate that you follow up on your issue.

Neeraj_Sathe · 2024-12-04T07:25:04.582Z

Thank you @John_Sewell @steevej, I was able to work out a working solution based on your responses and @steevej’s recommendations.

muhammad_ali10 · 2024-12-04T09:14:50.814Z

It looks like a nice idea. I have not personally used mongodb for this purpose yet. But some of my friends are suggesting it to use it for the multi-lingual and multi-regional projects. Like I have a project where my client is offering certificate translation services, so I want to update them according to the users locations and their previous record. If you know about such features, then leave a reply here.

Mehul_Sanghvi · 2025-02-06T05:05:03.505Z

@steevej Does this mean we should always perform calculations at the application layer instead of within the query?

steevej · 2025-02-06T16:40:47.180Z

Mehul_Sanghvi:

we should always perform calculations at the application layer

I did not wrote that. I wrote that I prefer:

steevej:

I left out the counting of unique_items because I prefer to get the list of unique items rather than just the count.

and I continued with

steevej:

the calculation could be done on the server to reduce bandwidth utilization but increase the CPU utilization

Mehul_Sanghvi · 2025-02-10T08:24:40.955Z

Noted. Thanks for the clarification.