Friday, November 16, 2018

postgres corruption: ERROR: invalid page in block



Certain select queries will hit the following error. It has something to do with physical disk issue/corruption/outage where the table resides. This is a very common issue from my experience where disk filled up and database crashed.


query failed: ERROR: invalid page in block 0 of relation base/16386/26697


As to how to fix it, I am not completely certain when the database got to this point. I have fairly good luck with doing VACUUM FULL then REINDEX the database. 

But first, set zero out the damaged pages so, the cleaning operations can be carried out and not stuck on the errors.

SET zero_damaged_pages = on;

Vacuum full and reindex.

VACUUM FULL ANALYZE VERBOSE;
REINDEX DATABASE "VCDB";

--note: the database name must match exactly so, double quotes are are necessary.

Might be getting the following in the verbose mode of what is operation is doing.

WARNING:  invalid page in block 6 of relation base/16385/11936; zeroing out page
WARNING:  invalid page in block 4 of relation base/16385/20087; zeroing out page
WARNING:  relation "vpx_compute_resource" page 4 is uninitialized --- fixing
WARNING:  invalid page in block 0 of relation base/16385/20224; zeroing out page
WARNING:  relation "vpx_dvs" page 0 is uninitialized --- fixing
WARNING:  invalid page in block 13 of relation pg_tblspc/16397/PG_9.4_201409291/16385/21977; zeroing out page
WARNING:  relation "vpx_hist_stat4_1" page 13 is uninitialized --- fixing


REINDEX table <broken_table>;
or
REINDEX DATABASE "VCDB";


Some folks reported that by setting zero_damaged_pages to ON and then run a select all on the impacted table, it fixes it. I am not seeing through the logic there. My guess on this logic is, by setting the zero_damaged_pages to ON, it omitted all the bad pages and subsequent operations will not stop by the errors. So, I do not agree that by selecting all on the damaged table is the fix or even attempting to fix anything.

select * from <damaged_table>;


Summary
This isn't really a complete fix of the issue. The zero_damaged_pages set to on basically telling PostgreSQL to zero out the error at the buffer level and continue to do. The outcome of this is loss of data on the invalid page. Bad news, even though, this remedy fixes the majority of the issue but not 100% guaranteed. In my opinion, restoring from the good backup set are the only good remedy and ensuring no data loss that impacting database integrity..