Created on 2010-03-24.08:41:40 by loewis, last changed 2010-04-04.18:13:44 by loewis.
| File name |
Uploaded |
Type |
Edit |
Remove |
|
ids
|
loewis,
2010-04-01.23:10:22
|
application/octet-stream |
|
|
| msg1589 (view) |
Author: loewis |
Date: 2010-03-24.08:41:40 |
|
It seems that the postgres database is partially corrupted. For roundup_tracker, doing
select count(*) from _file where id='2'
gives
ERROR: invalid page header in block 0 of relation "_file"
Likewise, pg_dumpall produces the same error, as does "vacuum".
Any help appreciated.
|
| msg1590 (view) |
Author: r.david.murray |
Date: 2010-03-24.22:28:23 |
|
I've done some research on this. This kind of error appears, as far as the postgresql mailing list is concerned, to be associated with hardware or kernel problems. That is, something on the system corrupted the database file. At least, I haven't found any indication that this kind of error message has ever been tracked down to anything else. Perhaps someone with postgresql internals knowledge could make something out of the data in the corrupted file, though. Is it something I could share with others (ie: can the information be considered sensitive in any way? My thought is no)? Assuming someone from the mailing list would be willing to look.
In addition to this, however, we apparently also have missing pg_clog files. These are deleted by vacuum and autovacuum when they are no longer needed, apparently, but currently there is only one, that was created on the 24th...so at some point clog files were deleted that were still needed. A post I found on the mailing list opines that this should only happen if the files are deleted by something outside postgresql.
Do we have a backup of that table somewhere? We may be stuck with restoring it from backup.
We could also try (after making a full backup of the binary db files!) restarting postgres with zero_damaged_pages=1, which will get past the error but result in data loss, which could perhaps be patched to be less data loss working from a backup. Since this is page 0, maybe all the data lost would be in the backup. Conceivably I could try that on a test tracker install first, but I haven't set one up yet.
It may be possible to finesse the missing transaction data too, but again there may be data loss.
I'm concluding all this based on google-foo, so if someone with better postgresql chops can look at this that would of course be better.
And however we recover, I hope we can figure out what caused the corruption to begin with, otherwise it may just reoccur.
|
| msg1591 (view) |
Author: loewis |
Date: 2010-03-24.22:36:15 |
|
> Do we have a backup of that table somewhere? We may be stuck with
> restoring it from backup.
We don't have a postgres dump. However, there are backups, in
~roundup/backup (and probably to-tape backups of that from earlier
dates; I would also hope that there are backups of the raw database tables).
> And however we recover, I hope we can figure out what caused the
> corruption to begin with, otherwise it may just reoccur.
That would be for Upfront to answer.
|
| msg1592 (view) |
Author: roche |
Date: 2010-03-25.15:28:05 |
|
On Wed, 2010-03-24 at 22:36 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
>
> > Do we have a backup of that table somewhere? We may be stuck with
> > restoring it from backup.
>
> We don't have a postgres dump. However, there are backups, in
> ~roundup/backup (and probably to-tape backups of that from earlier
> dates; I would also hope that there are backups of the raw database tables).
>
> > And however we recover, I hope we can figure out what caused the
> > corruption to begin with, otherwise it may just reoccur.
>
> That would be for Upfront to answer.
I have asked Izak to have a look. Apologies for making you wait. I'm
currently on holiday and not checking my mail frequently and our whole
team was out of the office yesterday.
|
| msg1593 (view) |
Author: izak |
Date: 2010-03-25.15:47:07 |
|
Martin v. Löwis wrote:
>> And however we recover, I hope we can figure out what caused the
>> corruption to begin with, otherwise it may just reoccur.
>
> That would be for Upfront to answer.
We didn't change the kernel or the hardware. We did upgrade the virtual
host on which psf runs, including the postgres installation. I suspect
we need only shut it down, run a fsck on the filesystem to make sure
that is in good shape. Let me know when would be a good time. I suppose
any time and as soon as possible?
Unfortunately I am not an expert in postgresql on THAT low a level, so I
would also recommend a restore from backup.
|
| msg1594 (view) |
Author: loewis |
Date: 2010-03-25.23:34:07 |
|
> We didn't change the kernel or the hardware. We did upgrade the virtual
> host on which psf runs, including the postgres installation. I suspect
> we need only shut it down, run a fsck on the filesystem to make sure
> that is in good shape. Let me know when would be a good time. I suppose
> any time and as soon as possible?
Sure, go ahead.
> Unfortunately I am not an expert in postgresql on THAT low a level, so I
> would also recommend a restore from backup.
I'd rather avoid that. The backup is not current enough, and new backups
aren't being made because of the corruption.
|
| msg1595 (view) |
Author: loewis |
Date: 2010-03-26.06:46:54 |
|
I have now restarted postgres; that alone seems to have resolved some of the problems.
|
| msg1596 (view) |
Author: izak |
Date: 2010-03-26.07:35:11 |
|
I've checked the filesystem, and though it wasn't checked in 420 days,
it was still fine. I cleaned up some of the startup scripts that do
nothing on a virtual host (things that set the hardware clock etc) and
then restarted it.
|
| msg1597 (view) |
Author: r.david.murray |
Date: 2010-03-26.15:43:31 |
|
If someone can restore one of the tape backups of file.csv to somewhere on the machine (the one in ~roundup/backup is now empty), I will try to make time this weekend to experiment with ways to recover that table.
|
| msg1602 (view) |
Author: ajaksu2 |
Date: 2010-03-26.18:53:51 |
|
Martin v. Löwis wrote:
>> Unfortunately I am not an expert in postgresql on THAT low a level, so I
>> would also recommend a restore from backup.
>
> I'd rather avoid that. The backup is not current enough, and new backups
> aren't being made because of the corruption.
Martin, are we positive there's missing data or is it possible that we
just can't access the table?
I've found a message[1] suggesting we can still get the data if we
skip accessing the header on _file.
If you issue "SET enable_seqscan TO 'off'; ", does "select * from
_file where __retired__=0" work? And does the dump work?
HTH,
Daniel
[1] http://osdir.com/ml/pgsql-general/2010-02/msg00380.html
|
| msg1603 (view) |
Author: r.david.murray |
Date: 2010-03-26.19:22:45 |
|
Unfortunately, no:
roundup_tracker=# set enable_seqscan to 'off';
SET
roundup_tracker=# select * from _file where __retired__=0;
ERROR: invalid page header in block 0 of relation "_file"
I'm hoping that if there is any missing data it is old data that we have a backup of (somewhere), and that the rest will be readable after zeroing the invalid page. But I'm not going to test that theory on the live database, so I need to set up a test instance in order to test my theory...
|
| msg1604 (view) |
Author: r.david.murray |
Date: 2010-03-26.19:33:23 |
|
Hmm. What I just reported makes me wonder, though. Accessing the data through that index should have worked :(
Using file_retired=1 does not produce the header error, but gives an empty table. (I don't know what values are valid). Maybe retired=0 is trying to read an item from the first database page, and thus gets the error.
|
| msg1606 (view) |
Author: ajaksu2 |
Date: 2010-03-26.20:07:55 |
|
David,
What does "EXPLAIN select * from _file where __retired__=0;" give you?
I get "Index Scan using _file_retired_idx on _file ..." after setting
enable_seqscan to off, but "Seq Scan on _file ..." in the default
case.
IIRC, __retired__ is an integer used as a bool, defaulting to 0 (file
is 'alive').
You could also try to filter by id to avoid hitting page 0, something
like this seems to perform an index scan too:
select * from _file where __retired__=0 and id>15000;
|
| msg1607 (view) |
Author: loewis |
Date: 2010-03-26.20:28:57 |
|
> What does "EXPLAIN select * from _file where __retired__=0;" give you?
>
> I get "Index Scan using _file_retired_idx on _file ..." after setting
> enable_seqscan to off, but "Seq Scan on _file ..." in the default
> case.
Same here.
> select * from _file where __retired__=0 and id>15000;
That does indeed produce a result. Not sure how this is going to help,
though.
|
| msg1608 (view) |
Author: ajaksu2 |
Date: 2010-03-26.20:57:59 |
|
Martin v. Löwis wrote:
>> select * from _file where __retired__=0 and id>15000;
>
> That does indeed produce a result. Not sure how this is going to help,
> though.
Not sure it will help. It gives us some info on whether we lost data
(and if so, where) or just can't access records easily.
It seems to be possible to get a (lossy?) dump of the DB under
zero_damaged_pages='on', as David proposed.
If it makes any sense, this might help getting most of the data from
_file before zeroing pages, if only to know how useful an ancient
backup for _file would be.
|
| msg1609 (view) |
Author: r.david.murray |
Date: 2010-03-26.21:10:20 |
|
That query works the same whether you use retired or not. Looks like we at least have the current data from id 10336 on (creation date 2008-05-16). If we have a backup of the earlier data we can at least recover the most of the data. We'd still be missing any changes made to older objects, but I don't think file information changes very often.
We could also ask on the postgres forum to see if anyone is knowledgeable enough to actually repair the page for us. It's possible just some bits got twiddled rather than wholesale corruption of the block, but I don't know how to tell. I won't have time to get subscribed and put together a post until tomorrow sometime.
|
| msg1610 (view) |
Author: loewis |
Date: 2010-03-26.21:24:46 |
|
> That query works the same whether you use retired or not. Looks like
> we at least have the current data from id 10336 on (creation date
> 2008-05-16). If we have a backup of the earlier data we can at least
> recover the most of the data. We'd still be missing any changes made
> to older objects, but I don't think file information changes very
> often.
I'm tempted to create the lossy dump right away, drop the table, and
recreate it. We shouldn't invest too much time into this: the actual
file data are on disk, and hopefully Upfront can provide us with some
old backup within a few days (certainly not over the weekend, though).
As for changes, we also have file__journal, which seems to dump fine.
|
| msg1611 (view) |
Author: loewis |
Date: 2010-03-26.22:49:54 |
|
I have now created a full database tgz if anybody wants to take a look.
I have then enabled the zero_damaged_pages setting, and vacuum'ed. That produced these messages:
invalid page header in block 0 of relation "_file"; zeroing out page
invalid page header in block 8 of relation "pg_type_typname_nsp_index"; zeroing out page
Dumping the database still failed with a missing pg_clog file. I created one that was zeroed out; this let me proceed. Postgres didn't decide to delete the log file afterwards; this worries me somewhat.
Dumping the database still doesn't work; it gives this error message:
ERROR: invalid memory alloc request size 18446744073709551613
This happens when doing "select *" or "copy" on _file. Notice that this number is 0xfffffffffffffffd.
izak: can you please provide a copy of the latest backup of /home/roundup/backups/tracker-export where file.csv is not empty?
|
| msg1613 (view) |
Author: ajaksu2 |
Date: 2010-03-29.00:20:38 |
|
> Dumping the database still doesn't work; it gives this error message:
>
> ERROR: invalid memory alloc request size 18446744073709551613
>
> This happens when doing "select *" or "copy" on _file. Notice that this number is 0xfffffffffffffffd.
Ugh. It seems we're left with copy-ing the other tables to files,
salvaging _file and rebuilding it with a backup and/or file_journal.
I could try to roll a script to perform these tasks if you're interested.
|
| msg1614 (view) |
Author: r.david.murray |
Date: 2010-03-29.14:13:47 |
|
I think any help will be welcome, Daniel :)
It turns out I can't even play around with this because I don't have a 64bit linux on which to load the binary copy of the database. I should be able to set up a KVM instance (which I want to do anyway), but that will take me a bit.
|
| msg1615 (view) |
Author: loewis |
Date: 2010-03-29.20:25:27 |
|
I'm going to try to write a script that selects each individual _file record, dumps them all into a pickle, so that I can drop the table afterwards. I hope that this will then allow to dump the database.
|
| msg1631 (view) |
Author: loewis |
Date: 2010-04-01.07:15:55 |
|
I have now run a script that selects each _file record individually; this failed for 113 and 152. Some of them returned a NULL id (even though I selected them by id), namely 107, 116, 119, 126, 127, 10337. I have deleted all these records from the database; afterwards, pg_dump would create a dump of the table (I actually had to delete 119 from the dump instead). I dropped the table, and reloaded the dump.
Now I can create dumps of the entire postgres cluster.
|
| msg1632 (view) |
Author: loewis |
Date: 2010-04-01.23:10:22 |
|
Upfront has provided a backup from Feb 15. I have checked that none of the issues backed up at that point had been modified after after Feb 15. I have restored all missing _file entries to that point.
Attached is a list (file:ids1) of _file records that have been restored.
|
| msg1633 (view) |
Author: loewis |
Date: 2010-04-01.23:12:57 |
|
Here is a list of file ids which are missing from range(16731): [0, 9109, 9110, 9184, 9278, 9347, 9375, 9376, 9377, 9378, 10156, 10157, 10158, 10159, 10160, 10161, 10162, 10163, 10164, 10165, 10166, 10167, 10168, 10169, 10295, 11287, 11294, 11825, 13128, 13340, 13341, 13342, 13343, 13344, 13345, 14084, 14209, 14898, 15500, 16149]
It would probably be useful to find out which of these are referenced from issues. I'll do that when I have time.
|
| msg1636 (view) |
Author: loewis |
Date: 2010-04-03.15:32:07 |
|
None of these fileids are linked to issues, so it appears these ids really don't exist. Closing this as resolved.
|
| msg1637 (view) |
Author: r.david.murray |
Date: 2010-04-04.02:09:51 |
|
Interesting. We just got a traceback email for one of them (13128), from an access by anonymous directly to the 'unnamed' file. Looks related to some of the spam we get.
|
| msg1638 (view) |
Author: loewis |
Date: 2010-04-04.07:26:24 |
|
I think that's actually a roundup bug. If somebody accesses fileXXX/path, it shouldn't choke, but report http 404.
I checked the apache logs; the access to file13128/unnamed comes from Googlebot. Unfortunately, they don't provide a referer. I guess Googlebot will continue trying until they eventually get a proper result.
|
| msg1639 (view) |
Author: loewis |
Date: 2010-04-04.18:13:44 |
|
I have now fixed the roundup bug with r79748 and r79749.
|
|
| Date |
User |
Action |
Args |
| 2010-04-04 18:13:45 | loewis | set | status: chatting -> resolved messages:
+ msg1639 |
| 2010-04-04 07:26:25 | loewis | set | status: resolved -> chatting messages:
+ msg1638 |
| 2010-04-04 02:09:59 | r.david.murray | set | status: chatting -> resolved |
| 2010-04-04 02:09:51 | r.david.murray | set | status: resolved -> chatting messages:
+ msg1637 |
| 2010-04-03 15:32:08 | loewis | set | status: chatting -> resolved messages:
+ msg1636 |
| 2010-04-01 23:12:58 | loewis | set | assignedto: izak -> messages:
+ msg1633 |
| 2010-04-01 23:10:23 | loewis | set | files:
+ ids messages:
+ msg1632 |
| 2010-04-01 07:15:55 | loewis | set | messages:
+ msg1631 |
| 2010-03-29 20:25:27 | loewis | set | messages:
+ msg1615 |
| 2010-03-29 14:13:48 | r.david.murray | set | messages:
+ msg1614 |
| 2010-03-29 00:20:39 | ajaksu2 | set | messages:
+ msg1613 |
| 2010-03-26 22:49:55 | loewis | set | messages:
+ msg1611 |
| 2010-03-26 21:24:46 | loewis | set | messages:
+ msg1610 |
| 2010-03-26 21:10:20 | r.david.murray | set | messages:
+ msg1609 |
| 2010-03-26 20:57:59 | ajaksu2 | set | messages:
+ msg1608 |
| 2010-03-26 20:28:57 | loewis | set | messages:
+ msg1607 |
| 2010-03-26 20:07:55 | ajaksu2 | set | messages:
+ msg1606 |
| 2010-03-26 19:33:23 | r.david.murray | set | messages:
+ msg1604 |
| 2010-03-26 19:22:46 | r.david.murray | set | messages:
+ msg1603 |
| 2010-03-26 18:53:51 | ajaksu2 | set | nosy:
+ ajaksu2 messages:
+ msg1602 |
| 2010-03-26 15:43:32 | r.david.murray | set | messages:
+ msg1597 |
| 2010-03-26 07:35:12 | izak | set | messages:
+ msg1596 |
| 2010-03-26 06:46:55 | loewis | set | messages:
+ msg1595 |
| 2010-03-25 23:34:08 | loewis | set | messages:
+ msg1594 |
| 2010-03-25 15:47:07 | izak | set | messages:
+ msg1593 |
| 2010-03-25 15:28:05 | roche | set | messages:
+ msg1592 |
| 2010-03-24 22:36:16 | loewis | set | messages:
+ msg1591 |
| 2010-03-24 22:28:25 | r.david.murray | set | status: unread -> chatting nosy:
+ r.david.murray messages:
+ msg1590 |
| 2010-03-24 08:41:40 | loewis | create | |
|