Mongorestore fails with error: `Failed: corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator`

I have a MongoDB dump (created with --archive) that I’m trying to restore to our database using mongorestore. However, when I try to do this, it fails with the following error:

Failed: corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator

We received the dump from our former cloud provider, so I don’t know much about how it was created. I do know that it was created with Mongodump 100.6.0 and MongoDB 6.0.2, and I have tried restoring with that exact version, but it still fails.

mongorestore -h localhost --archive=Dump.mongodump -vvvv

This outputted the following data:

2023-09-02T18:39:19.230+0000	archive prelude quiltmc_modmail.logs
2023-09-02T18:39:19.230+0000	archive prelude quiltmc_modmail.config
2023-09-02T18:39:19.230+0000	archive format version "0.1"
2023-09-02T18:39:19.230+0000	archive server version "6.0.2"
2023-09-02T18:39:19.230+0000	archive tool version "100.6.0"
2023-09-02T18:39:19.232+0000	preparing collections to restore from
2023-09-02T18:39:19.232+0000	using  as dump root directory
2023-09-02T18:39:19.232+0000	reading collections for database quiltmc_modmail in quiltmc_modmail
2023-09-02T18:39:19.232+0000	found collection quiltmc_modmail.logs bson to restore to quiltmc_modmail.logs
2023-09-02T18:39:19.232+0000	found collection metadata from quiltmc_modmail.logs to restore to quiltmc_modmail.logs
2023-09-02T18:39:19.232+0000	adding intent for quiltmc_modmail.logs
2023-09-02T18:39:19.232+0000	found collection quiltmc_modmail.config bson to restore to quiltmc_modmail.config
2023-09-02T18:39:19.232+0000	found collection metadata from quiltmc_modmail.config to restore to quiltmc_modmail.config
2023-09-02T18:39:19.232+0000	adding intent for quiltmc_modmail.config
2023-09-02T18:39:19.243+0000	demux End
2023-09-02T18:39:19.243+0000	demux finishing (err:corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator)
2023-09-02T18:39:19.243+0000	received  from namespaceChan
2023-09-02T18:39:19.243+0000	reading metadata for quiltmc_modmail.logs from archive 'Dump.mongodump'
2023-09-02T18:39:19.244+0000	reading metadata for quiltmc_modmail.config from archive 'Dump.mongodump'
2023-09-02T18:39:19.244+0000	restoring up to 4 collections in parallel
2023-09-02T18:39:19.244+0000	starting restore routine with id=3
2023-09-02T18:39:19.244+0000	ending restore routine with id=3, no more work to do
2023-09-02T18:39:19.244+0000	starting restore routine with id=1
2023-09-02T18:39:19.244+0000	ending restore routine with id=1, no more work to do
2023-09-02T18:39:19.244+0000	starting restore routine with id=0
2023-09-02T18:39:19.244+0000	ending restore routine with id=0, no more work to do
2023-09-02T18:39:19.244+0000	starting restore routine with id=2
2023-09-02T18:39:19.244+0000	ending restore routine with id=2, no more work to do
2023-09-02T18:39:19.244+0000	building indexes up to 4 collections in parallel
2023-09-02T18:39:19.244+0000	starting index build routine with id=3
2023-09-02T18:39:19.244+0000	restoring indexes for collection quiltmc_modmail.logs from metadata
2023-09-02T18:39:19.244+0000	index: &idx.IndexDocument{Options:primitive.M{"default_language":"english", "language_override":"language", "name":"messages.content_text_messages.author.name_text_key_text", "textIndexVersion":3, "v":2, "weights":primitive.M{"key":1, "messages.author.name":1, "messages.content":1}}, Key:primitive.D{primitive.E{Key:"_fts", Value:"text"}, primitive.E{Key:"_ftsx", Value:1}}, PartialFilterExpression:primitive.D(nil)}
2023-09-02T18:39:19.244+0000		run create Index command for indexes: messages.content_text_messages.author.name_text_key_text
2023-09-02T18:39:19.245+0000	starting index build routine with id=0
2023-09-02T18:39:19.245+0000	no indexes to restore for collection quiltmc_modmail.config
2023-09-02T18:39:19.245+0000	ending index build routine with id=0, no more work to do
2023-09-02T18:39:19.245+0000	starting index build routine with id=1
2023-09-02T18:39:19.245+0000	ending index build routine with id=1, no more work to do
2023-09-02T18:39:19.245+0000	starting index build routine with id=2
2023-09-02T18:39:19.245+0000	ending index build routine with id=2, no more work to do
2023-09-02T18:39:19.248+0000	ending index build routine with id=3, no more work to do
2023-09-02T18:39:19.248+0000	Failed: corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator
2023-09-02T18:39:19.248+0000	0 document(s) restored successfully. 0 document(s) failed to restore.

I have examined the dump in a hex editor, and I have found that the 32-bit integer 858927154 is equivalent to the hex 0x33323032, which, when encoded, translates to 2023. The database contains quite a few dates, so it’s hard to know exactly where the problematic data is in the file. The last date in the file occurs here:

32303233 2D30372D 33315431 353A3035 3A30312E 3539322B 30303030 09646F6E 65206475 6D70696E 67207175 696C746D 635F6D6F 646D6169 6C2E6C6F 67732028 37313120 646F6375 6D656E74 73290D0A 45000000 02646200 10000000 7175696C 746D635F 6D6F646D 61696C00 02636F6C 6C656374 696F6E00 05000000 6C6F6773 0008454F 46000112 43524300 AD1DD5DB A30C5FCD 00FFFFFF FF

Which encodes to:

2023-07-31T15:05:01.592+0000	done dumping quiltmc_modmail.logs (711 documents)
Edbquiltmc_modmailcollectionlogsEOFCRC­ÕÛ£_Íÿÿÿÿ

I’m not sure if this is the cause, though, because the file has a similar “done dumping” message for each time the process finishes dumping a table.

I did create a dump of another database for testing purposes, which I’ve verified that I can restore. It was created in a newer version of MongoDB (6.0.8 100.6.0), but I noticed that it lacked the “done dumping” messages. I also noticed that the test dump ended with:

454F4600 01124352 43000000 00000000 000000FF FFFFFF

which showed in the hex editor as EOF CRC ...., meanwhile, the corrupted database shows as

454F4600 01124352 4300AD1D D5DBA30C 5FCD00FF FFFFFF

which showed in the hex editor as EOF CRC . ... _. .....

I did try and replace the corrupted file’s file end with the working file’s, but this didn’t seem to work.

I found absolutely nothing about this error online, so I wondered if anyone had any ideas about what I might try next.

Hi @Southpaw_1496

Amazingly thorough investigation on your part.

If this is encoded in the archive then the vendor has created the dump incorrectly. I’d ask them for a new one.

What I think has happened is they managed to redirect stderr to stdout and pipe the whole thing to the output file.

Possibly ran with stderr redirection over ssh? I’m just guessing.


I could reproduce the situation like so:

(mongodump --archive 2>&1) >dump.archive

mongorestore --drop --archive=dump.archive
2023-09-02T18:47:07.203-0400	preparing collections to restore from
2023-09-02T18:47:07.354-0400	demux finishing when there are still outs (1)
2023-09-02T18:47:07.354-0400	reading metadata for test.foo from archive 'dump.archive'
2023-09-02T18:47:07.354-0400	no indexes to restore for collection test.foo
2023-09-02T18:47:07.354-0400	Failed: corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator
2023-09-02T18:47:07.354-0400	0 document(s) restored successfully. 0 document(s) failed to restore.

It’s almost certain that they will have deleted the database by now since we are no longer using their services, but thank you for confirming where the corruption lies.

Do you think it’s feasible to edit these lines out and make a valid file?

I tried one yesterday with success using vim

bbe might be a good tool for this, I was not able to come up with a valid pattern so far though.

Each valid data end with FFFFFFFF before the invalid data. Removing everything 2023-.*\n and the final \n at the end of file should see you right.

Thanks so much!

The cloud provider has offered to repair the dump for us even though they no longer have the original; I will send your advice to help them along :slight_smile:

1 Like

I removed the first occurrence as you recommended:

32 30 32 33 2D 30 37 2D 33 31 54 31 35 3A 30 34 3A 35 39 2E 39 38 39 2B 30 30 30 30 09 77 72 69 74 69 6E 67 20 71 75 69 6C 74 6D 63 5F 6D 6F 64 6D 61 69 6C 2E 63 6F 6E 66 69 67 20 74 6F 20 61 72 63 68 69 76 65 20 6F 6E 20 73 74 64 6F 75 74 0D

and attempted to import it as a test. I figured that it would at least get the first document imported if it was doing it correctly. However, it failed with a different error:

Failed: corruption found in archive; bson (size: 18186, byte: 54) doesn't end with a null byte

Could this be to do with the six bytes preceding the bytes I removed?

00 00 FF FF FF FF

For sure only be removing:
\x32\x30\x32\x33.......\x0a

I’m still working on this myself on and off for a repeatable solution.

Some refs that may help are the archive format spec and the BSON spec.

Do I need to remove the line breaks as well? The regex you provided didn’t match them when I put it into a regex tester, so I didn’t remove them.

Yes. \x0a or \n should match that.

After removing the patterns you suggested in the file, the operation now fails with the following error:

2023-09-03T18:57:28.787+0100	Failed: quiltmc_modmail.config: error restoring from archive '/Users/southpaw/Docker/MongoDB/Working-dump.mongodump': reading bson input: error demultiplexing archive; archive io error

I will consult the specs you sent to see if anything sticks out and update with any progress.

I don’t suppose there’s any way to know in what part of the file the error occurred?

I have learned that the corrupted file was created using kubectl exec, and so I created a valid backup file of a test database, and also one using kubectl exec to try and compare the two. However, the output of the dump command seems to change even when given the exact same data, making comparing difficult.

As a test, I took two dumps of the same database. The first dump began as follows:

6D E2 99 81 66 00 00 00 10 63 6F 6E 63 75 72 72 65 6E 74 5F 63 6F 6C 6C 65 63 74 69 6F 6E 73 00 04 00 00 00 02 76 65 72 73 69 6F 6E 00 04 00 00 00 30 2E 31 00 02 73 65 72 76 65 72 5F 76 65 72 73 69 6F 6E 00 06 00 00 00 36 2E 30 2E 38 00 02 74 6F 6F 6C 5F 76 65 72 73 69 6F 6E 00 08 00 00 00 31 30 30 2E 36 2E 30 00 00 D3 02 00 00

which encodes to

m♁fconcurrent_collectionsversion0.1server_version6.0.8tool_version100.6.0Ó

However, the dump I took immediately after starts like this:

6D E2 99 81 66 00 00 00 10 63 6F 6E 63 75 72 72 65 6E 74 5F 63 6F 6C 6C 65 63 74 69 6F 6E 73 00 04 00 00 00 02 76 65 72 73 69 6F 6E 00 04 00 00 00 30 2E 31 00 02 73 65 72 76 65 72 5F 76 65 72 73 69 6F 6E 00 06 00 00 00 36 2E 30 2E 38 00 02 74 6F 6F 6C 5F 76 65 72 73 69 6F 6E 00 08 00 00 00 31 30 30 2E 36 2E 30 00 00 0E 01 00 00 02 64 62 00 05 00 00 00 74 65 73 74 00 02 63 6F 6C 6C 65 63 74 69 6F 6E 00 0C 00 00 00 62 61 63 6B 75 70 2D 64 75 6D 70 00 02 6D 65 74 61 64 61 74 61 00 B3 00 00 00 7B 22 69 6E 64 65 78 65 73 22 3A 5B 7B 22 76 22 3A 7B 22 24 6E 75 6D 62 65 72 49 6E 74 22 3A 22 32 22 7D 2C 22 6B 65 79 22 3A 7B 22 5F 69 64 22 3A 7B 22 24 6E 75 6D 62 65 72 49 6E 74 22 3A 22 31 22 7D 7D 2C 22 6E 61 6D 65 22 3A 22 5F 69 64 5F 22 7D 5D 2C 22 75 75 69 64 22 3A 22 65 38 37 32 30 34 37 66 64 37 34 39 34 61 39 37 61 63 30 36 61 30 62 39 65 38 35 35 65 35 35 36 22 2C 22 63 6F 6C 6C 65 63 74 69 6F 6E 4E 61 6D 65 22 3A 22 62 61 63 6B 75 70 2D 64 75 6D 70 22 2C 22 74 79 70 65 22 3A 22 63 6F 6C 6C 65 63 74 69 6F 6E 22 7D 00 10 73 69 7A 65 00 00 00 00 00 02 74 79 70 65 00 0B 00 00 00 63 6F 6C 6C 65 63 74 69 6F 6E 00 00

which encodes to

m♁fconcurrent_collectionsversion0.1server_version6.0.8tool_version100.6.0dbtestcollectionbackup-dumpmetadata³{"indexes":[{"v":{"$numberInt":"2"},"key":{"_id":{"$numberInt":"1"}},"name":"_id_"}],"uuid":"e872047fd7494a97ac06a0b9e855e556","collectionName":"backup-dump","type":"collection"}sizetypecollection

To more plainly illustrate, here is a screenshot of the diff from my hex editor

I will need to do more investigation as to why this might be happening.

2 Likes

Cannot reproduce. What are the commands you are running for the dump?

Yes. I think they must have run with the tty flag and redirected to the output file. Running without the tty flag will have resulted in a valid archive.

kubectl exec --tty podname mongodump --archive > yourdump.archive

vs

kubectl exec podname mongodump --archive > yourdump.archive

mongodump -u root -p Fw0NRvwNvgFSyitQYqQS --archive=Test-Dump-1.mongodump

followed immediately by

mongodump -u root -p Fw0NRvwNvgFSyitQYqQS --archive=Test-Dump-2.mongodump

I tested in a database inside Docker on my local machine that wasn’t connected to anything. It couldn’t have been modified in between the two dumps.

Edit: I have rolled the root password of my database, and it was isolated and couldn’t have been connected to externally anyway.

The difference is likely the order the collections are dumped.

Collections can be dumped in parallel and their contents can also switch between them during the dump.

Don’t expect different dumps of the same data to be the same.

#sha256 sum of 10 dumps of the same database.
sha256sum dump? | sort
4329c4c7eea4114be68e8d70b5bedc1481a24d46ded445ef62cbf42d0ae1964c  dump3
5bb5299bfce2aeee2ecb144120ff3d3b8ef4cb85bc8ab4a247aaaebaa137b700  dump0
5bb5299bfce2aeee2ecb144120ff3d3b8ef4cb85bc8ab4a247aaaebaa137b700  dump9
98c831a6564f11f8d5eeed9544fb1a612135ea217f740920335d35e4f7e5e8ee  dump2
98c831a6564f11f8d5eeed9544fb1a612135ea217f740920335d35e4f7e5e8ee  dump4
98c831a6564f11f8d5eeed9544fb1a612135ea217f740920335d35e4f7e5e8ee  dump5
bd6682c623606a38f0bb9f735638034bfaba147513e7e40f180e908e1ffe9c63  dump1
ec70a7541713c374a24f00d06b0c9fcc5c31442b34c12006412db2176b205a1e  dump8
eecaac98a78941f95aa215187515f493c7e44f0be56c715b36a0db01b2ac5dbe  dump6
eecaac98a78941f95aa215187515f493c7e44f0be56c715b36a0db01b2ac5dbe  dump7

I’m still struggling with the demultiplexing archive error. I took a -vvvv log to make sure I wasn’t missing anything.

2023-09-14T07:51:47.243+0000	using write concern: &{majority false 0}
2023-09-14T07:51:47.265+0000	checking options
2023-09-14T07:51:47.265+0000		dumping with object check disabled
2023-09-14T07:51:47.265+0000	will listen for SIGTERM, SIGINT, and SIGKILL
2023-09-14T07:51:47.266+0000	connected to node type: standalone
2023-09-14T07:51:47.282+0000	archive prelude quiltmc_modmail.logs
2023-09-14T07:51:47.282+0000	archive prelude quiltmc_modmail.config
2023-09-14T07:51:47.282+0000	archive format version "0.1"
2023-09-14T07:51:47.282+0000	archive server version "6.0.2"
2023-09-14T07:51:47.282+0000	archive tool version "100.6.0"
2023-09-14T07:51:47.283+0000	preparing collections to restore from
2023-09-14T07:51:47.283+0000	using  as dump root directory
2023-09-14T07:51:47.283+0000	reading collections for database quiltmc_modmail in quiltmc_modmail
2023-09-14T07:51:47.283+0000	found collection quiltmc_modmail.logs bson to restore to quiltmc_modmail.logs
2023-09-14T07:51:47.283+0000	found collection metadata from quiltmc_modmail.logs to restore to quiltmc_modmail.logs
2023-09-14T07:51:47.283+0000	adding intent for quiltmc_modmail.logs
2023-09-14T07:51:47.283+0000	found collection quiltmc_modmail.config bson to restore to quiltmc_modmail.config
2023-09-14T07:51:47.283+0000	found collection metadata from quiltmc_modmail.config to restore to quiltmc_modmail.config
2023-09-14T07:51:47.283+0000	adding intent for quiltmc_modmail.config
2023-09-14T07:51:47.294+0000	demux namespaceHeader: {quiltmc_modmail config false 0}
2023-09-14T07:51:47.294+0000	received quiltmc_modmail.config from namespaceChan
2023-09-14T07:51:47.294+0000	first non special collection quiltmc_modmail.config found. The demultiplexer will handle it and the remainder
2023-09-14T07:51:47.294+0000	reading metadata for quiltmc_modmail.logs from archive 'backup-dump.mongodump'
2023-09-14T07:51:47.295+0000	reading metadata for quiltmc_modmail.config from archive 'backup-dump.mongodump'
2023-09-14T07:51:47.295+0000	restoring up to 4 collections in parallel
2023-09-14T07:51:47.295+0000	starting restore routine with id=3
2023-09-14T07:51:47.295+0000	starting restore routine with id=1
2023-09-14T07:51:47.295+0000	starting restore routine with id=2
2023-09-14T07:51:47.295+0000	starting restore routine with id=0
2023-09-14T07:51:47.295+0000	demux Open for quiltmc_modmail.config
2023-09-14T07:51:47.297+0000	demux End
2023-09-14T07:51:47.299+0000	restoring to existing collection quiltmc_modmail.config without dropping
2023-09-14T07:51:47.299+0000	collection quiltmc_modmail.config already exists - skipping collection create
2023-09-14T07:51:47.299+0000	restoring quiltmc_modmail.config from archive 'backup-dump.mongodump'
2023-09-14T07:51:47.299+0000	using 1 insertion workers
2023-09-14T07:51:47.311+0000	finished restoring quiltmc_modmail.config (0 documents, 0 failures)
2023-09-14T07:51:47.311+0000	Failed: quiltmc_modmail.config: error restoring from archive 'backup-dump.mongodump': reading bson input: error demultiplexing archive; archive io error
2023-09-14T07:51:47.311+0000	0 document(s) restored successfully. 0 document(s) failed to restore.

The only things I could find online about the error seemed to deal with networked filesystems that were too slow for the restore, but I am running a Docker container locally, so that can’t be the issue. I did find TOOLS-2458 on the MongoDB bug tracker, but that also had (err:corruption found in archive; ParserConsumer.BodyBSON() ( EOF )), which my log does not have.

I will continue my investigations.

I originally assumed that the maximum logging verbosity was 4, but now I’ve found it goes up to five. Here is the new log:

2023-09-14T09:34:12.221+0000	using write concern: &{majority false 0}
2023-09-14T09:34:12.265+0000	checking options
2023-09-14T09:34:12.265+0000		dumping with object check disabled
2023-09-14T09:34:12.265+0000	will listen for SIGTERM, SIGINT, and SIGKILL
2023-09-14T09:34:12.266+0000	connected to node type: standalone
2023-09-14T09:34:12.281+0000	archive prelude quiltmc_modmail.logs
2023-09-14T09:34:12.282+0000	archive prelude quiltmc_modmail.config
2023-09-14T09:34:12.282+0000	archive format version "0.1"
2023-09-14T09:34:12.282+0000	archive server version "6.0.2"
2023-09-14T09:34:12.282+0000	archive tool version "100.6.0"
2023-09-14T09:34:12.289+0000	preparing collections to restore from
2023-09-14T09:34:12.290+0000	using  as dump root directory
2023-09-14T09:34:12.290+0000	reading collections for database quiltmc_modmail in quiltmc_modmail
2023-09-14T09:34:12.290+0000	found collection quiltmc_modmail.logs bson to restore to quiltmc_modmail.logs
2023-09-14T09:34:12.290+0000	found collection metadata from quiltmc_modmail.logs to restore to quiltmc_modmail.logs
2023-09-14T09:34:12.290+0000	adding intent for quiltmc_modmail.logs
2023-09-14T09:34:12.290+0000	found collection quiltmc_modmail.config bson to restore to quiltmc_modmail.config
2023-09-14T09:34:12.290+0000	found collection metadata from quiltmc_modmail.config to restore to quiltmc_modmail.config
2023-09-14T09:34:12.290+0000	adding intent for quiltmc_modmail.config
2023-09-14T09:34:12.311+0000	demux namespaceHeader: {quiltmc_modmail config false 0}
2023-09-14T09:34:12.311+0000	received quiltmc_modmail.config from namespaceChan
2023-09-14T09:34:12.311+0000	first non special collection quiltmc_modmail.config found. The demultiplexer will handle it and the remainder
2023-09-14T09:34:12.311+0000	reading metadata for quiltmc_modmail.logs from archive 'backup-dump.mongodump'
2023-09-14T09:34:12.312+0000	reading metadata for quiltmc_modmail.config from archive 'backup-dump.mongodump'
2023-09-14T09:34:12.312+0000	restoring up to 4 collections in parallel
2023-09-14T09:34:12.312+0000	starting restore routine with id=3
2023-09-14T09:34:12.312+0000	starting restore routine with id=1
2023-09-14T09:34:12.312+0000	starting restore routine with id=0
2023-09-14T09:34:12.312+0000	starting restore routine with id=2
2023-09-14T09:34:12.312+0000	demux Open for quiltmc_modmail.config
2023-09-14T09:34:12.314+0000	demux End
2023-09-14T09:34:12.317+0000	restoring to existing collection quiltmc_modmail.config without dropping
2023-09-14T09:34:12.317+0000	collection quiltmc_modmail.config already exists - skipping collection create
2023-09-14T09:34:12.317+0000	restoring quiltmc_modmail.config from archive 'backup-dump.mongodump'
2023-09-14T09:34:12.318+0000	using 1 insertion workers
2023-09-14T09:34:12.328+0000	finished restoring quiltmc_modmail.config (0 documents, 0 failures)
2023-09-14T09:34:12.328+0000	demux finishing when there are still outs (1)
2023-09-14T09:34:12.328+0000	demux finishing (err:corruption found in archive; ParserConsumer.BodyBSON() ( corruption found in archive; bson (size: 2458, byte: 103) doesn't end with a null byte ))
2023-09-14T09:34:12.328+0000	ending restore routine with id=0, no more work to do
2023-09-14T09:34:12.328+0000	Failed: quiltmc_modmail.config: error restoring from archive 'backup-dump.mongodump': reading bson input: error demultiplexing archive; archive io error
2023-09-14T09:34:12.328+0000	0 document(s) restored successfully. 0 document(s) failed to restore.
root@cec8a2653bef:/# mongorestore -h localhost -u root -p aI73yBJzrItJcZxq2Nr0 --archive="backup-dump.mongodump" -vvvvvvv
2023-09-14T09:35:40.658+0000	using write concern: &{majority false 0}
2023-09-14T09:35:40.674+0000	checking options
2023-09-14T09:35:40.674+0000		dumping with object check disabled
2023-09-14T09:35:40.674+0000	will listen for SIGTERM, SIGINT, and SIGKILL
2023-09-14T09:35:40.675+0000	connected to node type: standalone
2023-09-14T09:35:40.685+0000	archive prelude quiltmc_modmail.logs
2023-09-14T09:35:40.685+0000	archive prelude quiltmc_modmail.config
2023-09-14T09:35:40.687+0000	archive format version "0.1"
2023-09-14T09:35:40.687+0000	archive server version "6.0.2"
2023-09-14T09:35:40.687+0000	archive tool version "100.6.0"
2023-09-14T09:35:40.689+0000	preparing collections to restore from
2023-09-14T09:35:40.689+0000	using  as dump root directory
2023-09-14T09:35:40.689+0000	reading collections for database quiltmc_modmail in quiltmc_modmail
2023-09-14T09:35:40.690+0000	found collection quiltmc_modmail.logs bson to restore to quiltmc_modmail.logs
2023-09-14T09:35:40.690+0000	found collection metadata from quiltmc_modmail.logs to restore to quiltmc_modmail.logs
2023-09-14T09:35:40.690+0000	adding intent for quiltmc_modmail.logs
2023-09-14T09:35:40.690+0000	found collection quiltmc_modmail.config bson to restore to quiltmc_modmail.config
2023-09-14T09:35:40.690+0000	found collection metadata from quiltmc_modmail.config to restore to quiltmc_modmail.config
2023-09-14T09:35:40.690+0000	adding intent for quiltmc_modmail.config
2023-09-14T09:35:40.699+0000	demux namespaceHeader: {quiltmc_modmail config false 0}
2023-09-14T09:35:40.699+0000	received quiltmc_modmail.config from namespaceChan
2023-09-14T09:35:40.699+0000	first non special collection quiltmc_modmail.config found. The demultiplexer will handle it and the remainder
2023-09-14T09:35:40.699+0000	reading metadata for quiltmc_modmail.config from archive 'backup-dump.mongodump'
2023-09-14T09:35:40.699+0000	reading metadata for quiltmc_modmail.logs from archive 'backup-dump.mongodump'
2023-09-14T09:35:40.699+0000	restoring up to 4 collections in parallel
2023-09-14T09:35:40.699+0000	starting restore routine with id=3
2023-09-14T09:35:40.699+0000	starting restore routine with id=1
2023-09-14T09:35:40.699+0000	demux Open for quiltmc_modmail.config
2023-09-14T09:35:40.700+0000	starting restore routine with id=2
2023-09-14T09:35:40.700+0000	starting restore routine with id=0
2023-09-14T09:35:40.701+0000	demux End
2023-09-14T09:35:40.702+0000	restoring to existing collection quiltmc_modmail.config without dropping
2023-09-14T09:35:40.703+0000	collection quiltmc_modmail.config already exists - skipping collection create
2023-09-14T09:35:40.703+0000	restoring quiltmc_modmail.config from archive 'backup-dump.mongodump'
2023-09-14T09:35:40.703+0000	using 1 insertion workers
2023-09-14T09:35:40.715+0000	finished restoring quiltmc_modmail.config (0 documents, 0 failures)
2023-09-14T09:35:40.715+0000	demux finishing when there are still outs (1)
2023-09-14T09:35:40.715+0000	ending restore routine with id=0, no more work to do
2023-09-14T09:35:40.715+0000	demux finishing (err:corruption found in archive; ParserConsumer.BodyBSON() ( corruption found in archive; bson (size: 2458, byte: 103) doesn't end with a null byte ))
2023-09-14T09:35:40.715+0000	ending restore routine with id=2, no more work to do
2023-09-14T09:35:40.715+0000	Failed: quiltmc_modmail.config: error restoring from archive 'backup-dump.mongodump': reading bson input: error demultiplexing archive; archive io error
2023-09-14T09:35:40.715+0000	0 document(s) restored successfully. 0 document(s) failed to restore.

I now see in my logs the same corruption found in archive; error present on the MongoDB issue. I think bson (size: 2458, byte: 103) is telling me where the error is occurring, which is the other thing I couldn’t figure out, however, I don’t know how to interpret it.

I was working with some larger mongodumps and it does appear that the corruption is more that just a log line.

In the test file I created the was a \x0d inserted in the midst of a document. Fixing a dump corrupted in this fashion I think will take a long time and a log of effort.

The dump I have is around 18.5MB, I’m not sure whether that’s considered to be a large dump or not.

I can see a few x0ds in my dump, the hard part is knowing whether they’re supposed to be there or not.

I would be prepared to go through the bad ones one by one if I knew where they were. Is there any way to tell when it fails at what byte the error occurs?

Yes exactly, it could be valid data or an anomaly. I was writing something in python to try and clean it up when I came across it.

After reading the magic bytes the rest of a proper archive is documents with separations of terminator bytes \xffffffff.

The python below can help locate corruption, it could be the document before the byte location reported though.

Print Location of Bad Documents
# pip install pymongo  to get the mongo implementation of bson
import bson
import struct

MAGIC=b'\x6d\xe2\x99\x81'

with open('dump.archive','br') as dump:
    # read magic bytes
    magic = dump.read(4)
    if not magic == MAGIC:
        raise SystemExit("Not a mongodump archive")
    while True:
        pos = dump.tell()
        # read document length
        docLen, = struct.unpack_from('<i', dump.read(4))
        if docLen == -1:
            continue
        dump.seek(pos)
        docStart = dump.peek()[:16]
        # read the document
        try:
            bson.decode(dump.read(docLen))
        except bson.errors.InvalidBSON as e:
            print(f'{pos:#08x} {e} {docStart[:16]}')
            raise SystemExit(1)


For the dump file I have create I have determined that any \x0a byte has had a \x0d prefixed to it in addition to the loglines from the dump.

The below steps actually cleaned it up really good and I could mongorestore it.

# remove added \x0d before \x0a and remove loglines starting with timestamp 2023-
bbe -e 's/\r\n/\n/' dump.bad | sed -z 's/2023-.\+\n//g' >dump.archive
#remove added `\x0a` at EOF
truncate -s -1 dump.archive

I have tried the script you recommended and the error is now different. This was the archive that I already tried modifying and not the original, so I will try with the original and report back.

2023-09-16T18:21:45.563+0000	preparing collections to restore from
2023-09-16T18:21:45.576+0000	reading metadata for quiltmc_modmail.config from archive './dump.archive'
2023-09-16T18:21:45.576+0000	reading metadata for quiltmc_modmail.logs from archive './dump.archive'
2023-09-16T18:21:45.579+0000	restoring to existing collection quiltmc_modmail.config without dropping
2023-09-16T18:21:45.579+0000	restoring quiltmc_modmail.config from archive './dump.archive'
2023-09-16T18:21:45.589+0000	finished restoring quiltmc_modmail.config (2 documents, 0 failures)
2023-09-16T18:21:45.590+0000	restoring to existing collection quiltmc_modmail.logs without dropping
2023-09-16T18:21:45.590+0000	restoring quiltmc_modmail.logs from archive './dump.archive'
2023-09-16T18:21:45.632+0000	finished restoring quiltmc_modmail.logs (0 documents, 0 failures)
2023-09-16T18:21:45.632+0000	Failed: quiltmc_modmail.logs: error restoring from archive './dump.archive': cannot transform type bson.Raw to a BSON Document: not enough bytes available to read type. bytes=3 type=string
2023-09-16T18:21:45.632+0000	0 document(s) restored successfully. 0 document(s) failed to restore.
root@62c1a0b36a15:/Dumps# mongorestore -h localhost -u root -p aI73yBJzrItJcZxq2Nr0 --archive="./dump.archive" -vvvvv
2023-09-16T18:22:04.482+0000	using write concern: &{majority false 0}
2023-09-16T18:22:04.495+0000	checking options
2023-09-16T18:22:04.495+0000		dumping with object check disabled
2023-09-16T18:22:04.495+0000	will listen for SIGTERM, SIGINT, and SIGKILL
2023-09-16T18:22:04.496+0000	connected to node type: standalone
2023-09-16T18:22:04.512+0000	archive prelude quiltmc_modmail.logs
2023-09-16T18:22:04.512+0000	archive prelude quiltmc_modmail.config
2023-09-16T18:22:04.512+0000	archive format version "0.1"
2023-09-16T18:22:04.513+0000	archive server version "6.0.2"
2023-09-16T18:22:04.513+0000	archive tool version "100.6.0"
2023-09-16T18:22:04.517+0000	preparing collections to restore from
2023-09-16T18:22:04.517+0000	using  as dump root directory
2023-09-16T18:22:04.517+0000	reading collections for database quiltmc_modmail in quiltmc_modmail
2023-09-16T18:22:04.517+0000	found collection quiltmc_modmail.logs bson to restore to quiltmc_modmail.logs
2023-09-16T18:22:04.518+0000	found collection metadata from quiltmc_modmail.logs to restore to quiltmc_modmail.logs
2023-09-16T18:22:04.519+0000	adding intent for quiltmc_modmail.logs
2023-09-16T18:22:04.519+0000	found collection quiltmc_modmail.config bson to restore to quiltmc_modmail.config
2023-09-16T18:22:04.519+0000	found collection metadata from quiltmc_modmail.config to restore to quiltmc_modmail.config
2023-09-16T18:22:04.519+0000	adding intent for quiltmc_modmail.config
2023-09-16T18:22:04.524+0000	demux namespaceHeader: {quiltmc_modmail config false 0}
2023-09-16T18:22:04.524+0000	received quiltmc_modmail.config from namespaceChan
2023-09-16T18:22:04.524+0000	first non special collection quiltmc_modmail.config found. The demultiplexer will handle it and the remainder
2023-09-16T18:22:04.524+0000	reading metadata for quiltmc_modmail.logs from archive './dump.archive'
2023-09-16T18:22:04.524+0000	reading metadata for quiltmc_modmail.config from archive './dump.archive'
2023-09-16T18:22:04.525+0000	restoring up to 4 collections in parallel
2023-09-16T18:22:04.525+0000	starting restore routine with id=3
2023-09-16T18:22:04.525+0000	starting restore routine with id=1
2023-09-16T18:22:04.525+0000	starting restore routine with id=2
2023-09-16T18:22:04.525+0000	starting restore routine with id=0
2023-09-16T18:22:04.525+0000	demux Open for quiltmc_modmail.config
2023-09-16T18:22:04.528+0000	restoring to existing collection quiltmc_modmail.config without dropping
2023-09-16T18:22:04.528+0000	collection quiltmc_modmail.config already exists - skipping collection create
2023-09-16T18:22:04.528+0000	restoring quiltmc_modmail.config from archive './dump.archive'
2023-09-16T18:22:04.528+0000	using 1 insertion workers
2023-09-16T18:22:04.528+0000	demux namespaceHeader: {quiltmc_modmail config true -1616536111820191563}
2023-09-16T18:22:04.530+0000	continuing through error: E11000 duplicate key error collection: quiltmc_modmail.config index: _id_ dup key: { _id: ObjectId('608acdbda56c97501eda92ea') }
2023-09-16T18:22:04.530+0000	continuing through error: E11000 duplicate key error collection: quiltmc_modmail.config index: _id_ dup key: { _id: ObjectId('636c3de7b2cd74350b2c0670') }
2023-09-16T18:22:04.539+0000	finished restoring quiltmc_modmail.config (0 documents, 2 failures)
2023-09-16T18:22:04.539+0000	demux checksum for namespace quiltmc_modmail.config is correct (-1616536111820191563), 2496 bytes
2023-09-16T18:22:04.539+0000	demux namespaceHeader: {quiltmc_modmail logs false 0}
2023-09-16T18:22:04.539+0000	demux Open for quiltmc_modmail.logs
2023-09-16T18:22:04.540+0000	restoring to existing collection quiltmc_modmail.logs without dropping
2023-09-16T18:22:04.540+0000	collection quiltmc_modmail.logs already exists - skipping collection create
2023-09-16T18:22:04.540+0000	restoring quiltmc_modmail.logs from archive './dump.archive'
2023-09-16T18:22:04.540+0000	using 1 insertion workers
2023-09-16T18:22:04.574+0000	demux End
2023-09-16T18:22:04.579+0000	finished restoring quiltmc_modmail.logs (0 documents, 0 failures)
2023-09-16T18:22:04.579+0000	Failed: quiltmc_modmail.logs: error restoring from archive './dump.archive': cannot transform type bson.Raw to a BSON Document: not enough bytes available to read type. bytes=3 type=string
2023-09-16T18:22:04.579+0000	0 document(s) restored successfully. 0 document(s) failed to restore.