Sensors check in periodically, but network connectivity issues may cause them to submit the same data more than once.
MongoDB does not allow the unique property on secondary indexes for time series collections (MongoDB 5.0; see the Time Series Collection Limitations documentation).
In addition, calculations need to be done on the data (preferably using aggregations) that involve counting the number of entries, and those counts will be inaccurate if there are duplicates. Duplicates also bloat the database and are just messy.
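A minimal sketch of the counting problem, assuming a hypothetical sensorData collection and the field names used later in this thread: if the same check-in is stored twice, any count-based aggregation reports it twice.

// Hypothetical collection; one sensor check-in stored twice.
db.sensorData.insertMany([
  { sensor: 1, timestamp: 2, mesure: "pressure", value: 10 },
  { sensor: 1, timestamp: 2, mesure: "pressure", value: 10 }  // duplicate
])

// Reports 2 check-ins where only 1 real check-in occurred.
db.sensorData.aggregate([ { $count: "checkins" } ])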
Question
Is there any way to prevent duplicate entries in a MongoDB Timeseries collection?
// Usual inserted document that produces duplicates
insert = { "sensor": 1,
           "timestamp": 2,
           "mesure": "pressure",
           "value": 10 }

// Split into a query that defines your uniqueness...
query = { "sensor": 1, "timestamp": 2, "mesure": "pressure" }

// ...and the remaining value
value = { "value": 10 }

// Then replace the following insert, which produces a new duplicate
// every time it is called:
c.insert( insert );

// with an update using upsert:true, which inserts only once no matter
// how often it is called. Note the $set: with a plain replacement
// document, the upserted document would not keep the query fields.
c.update( query , { $set: value } , { upsert: true } )
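For what it's worth, with the current shell and drivers the same pattern would be written with updateOne; a sketch, using the same assumed field names:

// Same upsert pattern with the non-deprecated API.
db.c.updateOne(
  { sensor: 1, timestamp: 2, mesure: "pressure" },  // uniqueness key
  { $set: { value: 10 } },                          // payload
  { upsert: true }
)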
I am not yet familiar with the new time series collections, so I do not know whether this is applicable, but I feel it could be.
It appears that the limitations imposed on time series collections are temporary: the upcoming MongoDB 5.1 allows some delete and update operations, so I'm guessing time series collections might support upsert and the unique property on secondary indexes in future releases.
I came up with my own method to ensure data is never retrieved from an API and accidentally inserted twice, which was my original problem.
I'm still learning the aggregation pipeline, but @sergeimeza's way of creating a query that ignores duplicates is probably the best solution for now.
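For reference, a minimal sketch of that read-time deduplication idea, assuming the field names from the example above (the actual pipeline @sergeimeza posted may differ):

// Group on the fields that define uniqueness so duplicates collapse,
// then count the distinct check-ins. Collection name is hypothetical.
db.sensorData.aggregate([
  { $group: {
      _id: { sensor: "$sensor", timestamp: "$timestamp", mesure: "$mesure" },
      value: { $first: "$value" }   // keep one value per unique key
  } },
  { $count: "uniqueEntries" }       // accurate count despite duplicates
])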