Recovering a corrupted mongodump archive

For the past month or so, I have been working to recover a corrupted mongodump archive that was provided to us when we switched cloud providers. Unfortunately, the original database no longer exists, so recovering this archive is the only way to get the data back. I originally posted in the topic "Mongorestore fails with error: `Failed: corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator`" (post #18 by Southpaw_1496), but that topic was closed due to inactivity. This is a summary of the progress that was made:

We first found that the error was caused by log messages being written to the same output stream (stdout) as the dump data and therefore being recorded as part of the dump file. Removing those messages didn’t make the restore work, but it did change the error, and Chris (who was helping in the thread) found that there was corruption elsewhere in the file, not just the log messages. He created a shell script to try to eliminate these errors, as well as a Python script designed to find corrupted parts of a dump file.

After the topic was closed, I tried running the shell script. Restoring still fails with the “corruption found in archive” error, but with a different value. The 32-bit integer it complained about (962801664) corresponds to the little-endian bytes 00 30 63 39, which decode as ASCII to a NUL byte followed by 0c9, part of an ID in the database near the end of the file. The surrounding data is

5F 69 64 00 0D 00 00 00 30 63 39 36 63 30 61 35 37 39 66 36 00 02 6B 65 79

which decodes (ignoring the non-printable bytes) to

_id
0c96c0a579f6key

Unlike the first error I received, which was clearly the result of log messages being added to the dump file, the ID and the bytes surrounding it seem to be valid data, so it’s not obvious where the corruption is.
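For anyone following along, here is a minimal Python sketch of how the integers in the two error messages map back to bytes in the file. It assumes mongorestore reads each length as a little-endian signed int32, which is how BSON stores lengths; the function and variable names are only illustrative.

```python
import struct

def int32_as_bytes(value: int) -> bytes:
    """Pack a signed 32-bit integer the way BSON stores it (little-endian)."""
    return struct.pack("<i", value)

# The two values reported by mongorestore in this thread.
first_error = int32_as_bytes(858927154)
second_error = int32_as_bytes(962801664)

print(first_error.hex(" "), first_error)    # 32 30 32 33 b'2023'
print(second_error.hex(" "), second_error)  # 00 30 63 39 b'\x000c9'

# The bytes quoted above from around the second error.
surrounding = bytes.fromhex(
    "5F 69 64 00 0D 00 00 00 30 63 39 36 63 30 61 35 37 39 66 36 00 02 6B 65 79"
)

# Where do the four offending bytes sit inside that excerpt?
print(surrounding.find(second_error))       # 7
```

The first value is the ASCII text 2023, which looks like the start of a timestamp and fits the log-message explanation. The second value starts at index 7 of the excerpt, so it straddles the last byte of the string length prefix (0D 00 00 00) and the first three characters of the ID. That suggests the reader was already a few bytes out of step when it reached this point, so the damage may be somewhere earlier rather than in these bytes themselves.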

I also tried the Python script, but I’m unsure how to interpret its output:

0x10db26c invalid length or type code b'R<\x00\x00\x02_id\x00\r\x00\x00\x00036'

The data at offset 0x10db26c is 52 3C 00 00 02, whose printable bytes decode to R<. The surrounding data is:

74 69 74 6C 65 00 00 52 3C 00 00 02 5F 69 64

which decodes (ignoring the non-printable bytes) to:

titleR<_id

From my limited understanding, the data appears to be valid: there are 7 bytes between title and _id, just like in all the other (presumably valid) occurrences of the pattern in the file. Possibly the Python script is telling me exactly where the corruption is, but I can’t understand what it’s trying to say.

Does anyone have ideas for other things I could try?

All the script is doing is attempting to decode a document. It prints the offset at which decoding failed, the error, and the first 16 bytes of the document.

As I mentioned in that topic, the corruption could be in the preceding document (I think this would end up changing what gets read as this document’s length) or in the document starting at this location (I think that is the case for this document).
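One way to tell those two cases apart is to test whether the length prefix at an offset is self-consistent. The sketch below is not the script from the earlier topic, only a minimal illustration. It relies on two facts from the BSON spec: a document is at least 5 bytes long, and its last byte is always 00. The function name and the sample bytes are made up for the example.

```python
import struct

def check_document(buf: bytes, offset: int) -> str:
    """Apply two cheap BSON sanity checks to the document starting at offset."""
    if offset + 4 > len(buf):
        return "truncated length prefix"
    (length,) = struct.unpack_from("<i", buf, offset)  # little-endian int32
    if length < 5:
        return f"implausible length {length}"
    end = offset + length
    if end > len(buf):
        return f"length {length} runs past end of data"
    if buf[end - 1] != 0:
        return f"length {length} but no 0x00 terminator at {end - 1:#x}"
    return f"ok, next document should start at {end:#x}"

# Hand-built BSON for {"a": "b"}: 14 bytes in total.
good = bytes.fromhex("0e 00 00 00 02 61 00 02 00 00 00 62 00 00")
print(check_document(good, 0))

# Simulate damage: drop one byte from the first of two documents.
damaged = good[:6] + good[7:] + good
print(check_document(damaged, 0))
```

Against the real file, the same function could be called with the whole archive read into memory and the offset 0x10db26c. If the check passes there, the 15442-byte length is at least plausible and the problem is more likely inside the document. Note that in an archive not every int32 is a document length (the error message itself mentions an archive terminator), so a failure on its own is not proof of corruption.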

The first 16B of this document is:
b'R<\x00\x00\x02_id\x00\r\x00\x00\x00036'
or in hex:
52 3c 00 00 02 5f 69 64 00 0d 00 00 00 30 33 36

52 3c 00 00 is the int32 length of the document (little-endian): 15442 bytes
02 is the type code for the next element: string
5f 69 64 00 is the cstring for the e_name (field name): _id, terminated with 00
Because the element is a string, the next 4 bytes are the int32 length of the string
0d 00 00 00 is that length: 13 bytes (including the terminating 00)
30 33 36 is the start of the 13-byte string: 036
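The same breakdown can be reproduced with a few lines of Python. This is only a sketch for these 16 bytes, not a general BSON parser; it assumes the first element is a string, as it is here, and the variable names are illustrative.

```python
import struct

# The 16 bytes printed by the script for offset 0x10db26c.
header = bytes.fromhex("52 3c 00 00 02 5f 69 64 00 0d 00 00 00 30 33 36")

# Bytes 0-3: little-endian int32 length of the whole document.
(doc_length,) = struct.unpack_from("<i", header, 0)

# Byte 4: type code of the first element (0x02 means string).
type_code = header[4]

# Bytes 5 onward: the field name as a cstring, ending at the first 00.
name_end = header.index(b"\x00", 5)
field_name = header[5:name_end].decode("ascii")

# Next 4 bytes: int32 length of the string value, counting its trailing 00.
(str_length,) = struct.unpack_from("<i", header, name_end + 1)

# Whatever is left in the 16-byte excerpt is the start of the string itself.
value_start = header[name_end + 5:]

print(doc_length, type_code, field_name, str_length, value_start)
# 15442 2 _id 13 b'036'
```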

You’ll have to get familiar with the BSON spec and the mongodump archive format. I knew very little about them before I looked into the original post.

I guess I’ll start there then.