For the past month or so, I have been working to recover a corrupted `mongodump` archive that was provided to us when we switched cloud providers. Unfortunately, the original database no longer exists, so recovering this archive is the only way to get the data back. I originally posted in "Mongorestore fails with error: `Failed: corruption found in archive; 858927154 is neither a valid bson length nor a archive terminator`" (#18 by Southpaw_1496), but the topic was closed due to inactivity. This is a summary of the progress that was made:
We first found that the error was caused by log messages being written to the same stream as the dump data and therefore being recorded as part of the dump file. Removing those messages didn’t make the restore succeed, but it did change the error, and Chris (who was helping in the thread) found that there was corruption elsewhere in the file, not just from the log messages. He created a shell script to try to eliminate these errors, as well as a Python script designed to find corrupted parts of a dump file.
Since the topic was closed, I have tried running the shell script. Restoring once again fails with the “Corruption found in archive” error, but with a slightly different value. The 32-bit integer it complained about (962801664) corresponds to the little-endian bytes `00 30 63 39`, which decode to a NUL byte followed by `0c9`, part of an ID in the database near the end of the file. The surrounding data is `5F 69 64 00 0D 00 00 00 30 63 39 36 63 30 61 35 37 39 66 36 00 02 6B 65 79`, which decodes (ignoring non-printable bytes) to `_id`, then `0c96c0a579f6key`.
Unlike the first error I received where the error is clearly the result of log messages being added to the dump file, the ID and the bytes surrounding it seem to be valid data, so it’s not obvious where the corruption is.
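To make the byte arithmetic above checkable, here is a minimal Python sketch that converts the reported integer back into its four little-endian bytes and parses the surrounding bytes as a BSON string element body (name, NUL, int32 length, string, NUL). The helper names `int32_to_bytes` and `parse_string_element_body` are made up for illustration; only the standard `struct` module is used.

```python
import struct

# Value from the mongorestore error and the surrounding bytes quoted above.
REPORTED = 962801664
SURROUNDING = bytes.fromhex(
    "5F 69 64 00 0D 00 00 00 30 63 39 36 63 30 61 35 37 39 66 36 00 02 6B 65 79"
)


def int32_to_bytes(value: int) -> bytes:
    """Return the 4 little-endian bytes that a reader would have seen."""
    return struct.pack("<i", value)


def parse_string_element_body(buf: bytes):
    """Parse 'name NUL int32-length string NUL' (no leading type byte).

    Returns (name, declared_length, string_value, terminator_byte).
    """
    name_end = buf.index(b"\x00")
    name = buf[:name_end].decode("ascii")
    (length,) = struct.unpack_from("<i", buf, name_end + 1)
    start = name_end + 5  # skip the NUL and the 4-byte length
    value = buf[start:start + length - 1].decode("ascii")
    terminator = buf[start + length - 1]
    return name, length, value, terminator


raw = int32_to_bytes(REPORTED)
print(" ".join(f"{b:02X}" for b in raw))       # 00 30 63 39
print(parse_string_element_body(SURROUNDING))  # ('_id', 13, '0c96c0a579f6', 0)
print(SURROUNDING.find(raw))                   # 7
```

The element itself parses cleanly: the declared length of 13 matches the 12-character ID plus its NUL terminator. The four bytes that were reported start at index 7, so they straddle the last byte of the length field and the first three characters of the ID. One possible reading, and it is only a guess, is that the reader was already out of step with the data by this point, in which case the actual damage may lie somewhere before this ID and not in it.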
I also tried the Python script, but I’m unsure how to interpret its output:
`0x10db26c invalid length or type code b'R<\x00\x00\x02_id\x00\r\x00\x00\x00036'`
The data at offset `0x10db26c` is `52 3C 00 00 02`, whose printable part decodes to `R<`. The surrounding data is `74 69 74 6C 65 00 00 52 3C 00 00 02 5F 69 64`, which decodes (ignoring non-printable bytes) to `titleR<_id`.
From my limited understanding, the data appears to be valid: there are 7 bytes between `title` and `_id`, just like in all the other (presumably valid) occurrences of the pattern in the file. Possibly the Python script is telling me exactly where the corruption is, but I can’t understand what it’s trying to say.
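One way to read the five bytes the script flagged is as a little-endian int32 followed by a BSON type code. The sketch below decodes them that way and adds a small check based on the BSON rule that a document's declared length includes the length field itself and a final NUL byte. The function names `read_int32_le` and `looks_like_document` are hypothetical, and the check only tests BSON framing; it knows nothing about the additional structure of the `mongodump` archive format.

```python
import struct


def read_int32_le(buf: bytes, offset: int = 0) -> int:
    """Read a little-endian signed 32-bit integer at the given offset."""
    return struct.unpack_from("<i", buf, offset)[0]


def looks_like_document(data: bytes, offset: int):
    """Return (declared_length, plausible) for a BSON document at offset.

    'plausible' means the length is at least 5 (the minimum for an empty
    document), fits inside the data, and the last byte is the NUL terminator.
    """
    if offset < 0 or offset + 4 > len(data):
        return None, False
    length = read_int32_le(data, offset)
    end = offset + length
    plausible = length >= 5 and end <= len(data) and data[end - 1] == 0
    return length, plausible


# The five bytes reported at offset 0x10db26c.
flagged = bytes.fromhex("52 3C 00 00 02")
print(read_int32_le(flagged))  # 15442
print(hex(flagged[4]))         # 0x2, the BSON type code for a string

# Tiny synthetic examples: an empty document, and one whose length overruns.
print(looks_like_document(b"\x05\x00\x00\x00\x00", 0))  # (5, True)
print(looks_like_document(b"\x10\x00\x00\x00\x00", 0))  # (16, False)
```

Under this reading, the flagged bytes would be the start of a 15442-byte document whose first element is a string named `_id`. Against the real file, something like `looks_like_document(open("dump.archive", "rb").read(), 0x10db26c)` (the file name is a placeholder) would show whether a NUL terminator really sits 15442 bytes after that offset. If it does not, the damage may be inside that document even though its first few bytes look fine.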
Does anyone have ideas for other things I could try?