Stuff that are happening at Postgresql desk
While performing a clone on the standby node, I am receiving an error. I am on the Standby node, and nothing is configured incorrectly in repmgr.conf either.
repmgr -h 10.18.10.11 -U repmgr -d repmgr -f repmgr.conf standby clone
This continues from the first part of the Create Hot Standby blog post:
https://postgresdesk.blogspot.com/2023/04/postgresql-creating-hot-standby.html
Verify that repmgr is working on the Master node.
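A quick way to do that (a minimal sketch, assuming repmgr.conf sits in the current directory as in the clone command above) is the cluster show command, which should report the Master node as running:

repmgr -f repmgr.conf cluster show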
On the Standby node, as root, stop the PostgreSQL database.
On the Standby node, perform a repmgr clone from the Master node.
As root, start the PostgreSQL database, then register the Standby node with repmgr (a combined sketch of these steps follows).
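Assuming a systemd-managed service named postgresql and repmgr.conf in the current directory (adjust both, and the user the repmgr commands run as, to your installation), the sequence might look like:

# On the Standby node:
systemctl stop postgresql                                                # as root: stop the database
repmgr -h 10.18.10.11 -U repmgr -d repmgr -f repmgr.conf standby clone   # clone from the Master node
systemctl start postgresql                                               # as root: start the database
repmgr -f repmgr.conf standby register                                   # register the Standby node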
The Replication Manager setup is complete at this point.
There are ways to monitor the repmgr setup. The events table records everything that has happened in the replication cluster.
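For instance, the events can be queried directly (a minimal sketch; in recent repmgr versions the table is repmgr.events in the repmgr database):

psql -U repmgr -d repmgr -c "SELECT node_id, event, successful, event_timestamp, details FROM repmgr.events ORDER BY event_timestamp;"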
repmgr also has commands that essentially report the same information as querying the repmgr database directly.
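For example (assuming a repmgr version that provides the cluster event command):

repmgr -f repmgr.conf cluster event   # formatted view of the repmgr events, no hand-written SQL needed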
That's it. The Standby should not be able to perform DML; all DML should come from the Master node.
On the Standby node.
Create a test1 table on the Master node, then log back into the Standby node and attempt a data insertion. The Standby still will not allow it.
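A minimal sketch of that test (the error below is the standard message PostgreSQL raises for writes on a hot standby):

-- On the Master node:
CREATE TABLE test1 (id int, val text);

-- On the Standby node:
INSERT INTO test1 VALUES (1, 'x');
-- ERROR:  cannot execute INSERT in a read-only transaction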
Enjoy.
2020-05-07 23:57:02.492 UTC 5d35de42.8b5 0 LOG: Updated instance status successfully.
2020-05-07 23:58:02.551 UTC 5d35de42.8b5 0 LOG: Updating instance status...
2020-05-07 23:58:02.551 UTC 5d35de42.8b5 0 LOG: Checking bloat of WAL caused by replication slots
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0 LOG: Memory check: flag = rss, mem_used_bytes = 3216 kB, mem_avail = 715776 kB
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0 LOG: Writing instance status...
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0 LOG: Wrote instance status successfully.
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0 LOG: Updated instance status successfully.
2020-05-07 23:59:02.604 UTC 5d35de42.8b5 0 LOG: Updating instance status...
2020-05-07 23:59:02.604 UTC 5d35de42.8b5 0 LOG: Checking bloat of WAL caused by replication slots
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0 LOG: Memory check: flag = rss, mem_used_bytes = 3216 kB, mem_avail = 715776 kB
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0 LOG: Writing instance status...
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0 LOG: Wrote instance status successfully.
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0 LOG: Updated instance status successfully.
2020-07-07 19:13:04.975 UTC 5f04c940.849d 0 LOG: database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0 LOG: invalid primary checkpoint record
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0 LOG: invalid secondary checkpoint record
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0 PANIC: could not locate a valid checkpoint record
2020-07-07 19:13:07.569 UTC 5f04c940.849a 0 LOG: startup process (PID 33949) was terminated by signal 6: Aborted
2020-07-07 19:13:07.569 UTC 5f04c940.849a 0 LOG: aborting startup due to startup process failure
2020-07-07 19:13:07.631 UTC 5f04c940.849a 0 LOG: database system is shut down
2020-07-07 19:17:58.055 UTC 5f04ca66.8a6b 0 LOG: database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0 LOG: invalid primary checkpoint record
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0 LOG: invalid secondary checkpoint record
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0 PANIC: could not locate a valid checkpoint record
2020-07-07 19:17:59.182 UTC 5f04ca65.8a69 0 LOG: startup process (PID 35435) was terminated by signal 6: Aborted
2020-07-07 19:17:59.182 UTC 5f04ca65.8a69 0 LOG: aborting startup due to startup process failure
2020-07-07 19:17:59.217 UTC 5f04ca65.8a69 0 LOG: database system is shut down
2020-07-07 19:38:57.068 UTC 5f04cf51.9d32 0 LOG: database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0 LOG: invalid primary checkpoint record
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0 LOG: invalid secondary checkpoint record
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0 PANIC: could not locate a valid checkpoint record
2020-07-07 19:39:00.363 UTC 5f04cf50.9d2e 0 LOG: startup process (PID 40242) was terminated by signal 6: Aborted
2020-07-07 19:39:00.363 UTC 5f04cf50.9d2e 0 LOG: aborting startup due to startup process failure
2020-07-07 19:39:00.587 UTC 5f04cf50.9d2e 0 LOG: database system is shut down
2020-07-07 20:01:33.348 UTC 5f04d49d.b1f3 0 LOG: database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0 LOG: invalid primary checkpoint record
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0 LOG: invalid secondary checkpoint record
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0 PANIC: could not locate a valid checkpoint record
2020-07-07 20:01:33.944 UTC 5f04d49d.b1f1 0 LOG: startup process (PID 45555) was terminated by signal 6: Aborted
2020-07-07 20:01:33.944 UTC 5f04d49d.b1f1 0 LOG: aborting startup due to startup process failure
2020-07-07 20:01:33.976 UTC 5f04d49d.b1f1 0 LOG: database system is shut down
2020-07-07 20:06:09.405 UTC 5f04d5b1.829 0 LOG: database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0 LOG: invalid primary checkpoint record
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0 LOG: invalid secondary checkpoint record
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0 PANIC: could not locate a valid checkpoint record
2020-07-07 20:06:11.276 UTC 5f04d5b0.827 0 LOG: startup process (PID 2089) was terminated by signal 6: Aborted
2020-07-07 20:06:11.276 UTC 5f04d5b0.827 0 LOG: aborting startup due to startup process failure
2020-07-07 20:06:11.307 UTC 5f04d5b0.827 0 LOG: database system is shut down
2020-07-07 20:08:32.940 UTC 5f04d640.97a 0 LOG: database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0 LOG: invalid primary checkpoint record
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0 LOG: invalid secondary checkpoint record
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0 PANIC: could not locate a valid checkpoint record
2020-07-07 20:08:33.531 UTC 5f04d640.978 0 LOG: startup process (PID 2426) was terminated by signal 6: Aborted
2020-07-07 20:08:33.531 UTC 5f04d640.978 0 LOG: aborting startup due to startup process failure
2020-07-07 20:08:33.562 UTC 5f04d640.978 0 LOG: database system is shut down
root@photon-machine [ /storage/log/vmware/vpostgres ]# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
psql.bin: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/vpostgres/.s.PGSQL.5432"?
root@photon-machine [ ~ ]# /usr/bin/service-control --start vpostgres
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vmware-vpostgres...
Error executing start on service vmware-vpostgres. Details {
"resolution": null,
"componentKey": null,
"problemId": null,
"detail": [
{
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"vmware-vpostgres"
],
"id": "install.ciscommon.service.failstart",
"localized": "An error occurred while starting service 'vmware-vpostgres'"
}
]
}
Service-control failed. Error: {
"resolution": null,
"componentKey": null,
"problemId": null,
"detail": [
{
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"vmware-vpostgres"
],
"id": "install.ciscommon.service.failstart",
"localized": "An error occurred while starting service 'vmware-vpostgres'"
}
]
}
VMware vCenter Server Appliance 6.7.0.11000
Type: vCenter Server with an embedded Platform Services Controller
Last login: Tue Jul 7 20:04:16 2020 from 10.200.193.21
root@photon-machine [ ~ ]# /usr/bin/service-control --start --all
Operation not cancellable. Please wait for it to finish...
Performing start operation on service lwsmd...
Successfully started service lwsmd
Performing start operation on service vmafdd...
Successfully started service vmafdd
Performing start operation on service vmdird...
Successfully started service vmdird
Performing start operation on service vmcad...
Successfully started service vmcad
Performing start operation on service vmware-sts-idmd...
Successfully started service vmware-sts-idmd
Performing start operation on service vmware-stsd...
Successfully started service vmware-stsd
Performing start operation on service vmdnsd...
Successfully started service vmdnsd
Performing start operation on profile: ALL...
Successfully started service vmware-vmon
root@photon-machine [ ~ ]# su vpostgres -s /opt/vmware/vpostgres/current/bin/pg_resetxlog /storage/db/vpostgres
could not change directory to "/root": Permission denied
Transaction log reset
su vpostgres -s /bin/sh
cd /opt/vmware/vpostgres/9.6/bin/
./pg_resetxlog /storage/db/vpostgres
Transaction log reset
root@photon-machine [ ~ ]# /usr/bin/service-control --start vpostgres
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vmware-vpostgres...
Successfully started service vmware-vpostgres
root@photon-machine [ ~ ]# ps -ef|grep postgres
vpostgr+ 8974 4379 0 20:33 ? 00:00:00 /opt/vmware/vpostgres/current/bin/postgres -D /storage/db/vpostgres
vpostgr+ 8981 8974 0 20:33 ? 00:00:00 postgres: logger process
vpostgr+ 8983 8974 0 20:33 ? 00:00:00 postgres: checkpointer process
vpostgr+ 8984 8974 0 20:33 ? 00:00:00 postgres: writer process
vpostgr+ 8985 8974 0 20:33 ? 00:00:00 postgres: wal writer process
vpostgr+ 8986 8974 0 20:33 ? 00:00:00 postgres: autovacuum launcher process
vpostgr+ 8987 8974 0 20:33 ? 00:00:00 postgres: stats collector process
vpostgr+ 8988 8974 0 20:33 ? 00:00:00 postgres: bgworker: health_status_worker
root 10526 3529 0 20:39 pts/2 00:00:00 grep --color=auto postgres
root@photon-machine [ ~ ]# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
psql.bin (9.6.6 (VMware Postgres 9.6.6.0-7698979 release))
Type "help" for help.
VCDB=#
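Since pg_resetxlog can discard recent transactions, it is prudent to sanity-check the instance and take a fresh backup right away. A minimal sketch (not part of the original session; paths follow the appliance layout above):

/opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB -c "SELECT count(*) FROM pg_class;"   # basic catalog read
/opt/vmware/vpostgres/current/bin/pg_dump -U postgres -Fc VCDB > /storage/VCDB.dump              # logical backup while the instance is up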
root@photon-machine [ /storage/archive/vpostgres ]# /opt/vmware/vpostgres/9.6/bin/pg_xlogdump 00000001000000020000002E
/opt/vmware/vpostgres/9.6/bin/pg_xlogdump /storage/archive/vpostgres/00000001000000020000002
postgresql: Purging WAL files with pg_archivecleanup in VC Appliance.
Essentially, this is similar to "rm", except that it understands WAL file ordering: given a WAL file name, it removes the files that precede it and keeps that file and everything after it.
The -n (dry-run) flag below reports what would be removed if WAL files older than 0000000100000002000000EA were cleaned up; 0000000100000002000000EA itself is kept.
Note: watch out for the space right after the pg_xlog/ directory argument.
root@photon-machine [ /storage/db/vpostgres/pg_xlog ]# /opt/vmware/vpostgres/9.6/bin/pg_archivecleanup -n /storage/db/vpostgres/pg_xlog/ 0000000100000002000000EA
/storage/db/vpostgres/pg_xlog//0000000100000002000000E6
/storage/db/vpostgres/pg_xlog//0000000100000002000000E9
/storage/db/vpostgres/pg_xlog//0000000100000002000000E8
/storage/db/vpostgres/pg_xlog//0000000100000002000000E7
Another example, this time actually deleting (-d) against a different WAL file. The WAL file 0000000100000002000000E6 will no longer be in pg_xlog once deleted.
root@photon-machine [ /storage/db/vpostgres/pg_xlog ]# /opt/vmware/vpostgres/9.6/bin/pg_archivecleanup -d /storage/db/vpostgres/pg_xlog/ 0000000100000002000000E7
pg_archivecleanup: keep WAL file "/storage/db/vpostgres/pg_xlog//0000000100000002000000E7" and later
pg_archivecleanup: removing file "/storage/db/vpostgres/pg_xlog//0000000100000002000000E6"
Friday, November 16, 2018
postgres corruption: ERROR: invalid page in block
Certain SELECT queries will hit the following error. It usually points to a physical disk issue, corruption, or outage affecting the storage where the table resides. In my experience this is a very common scenario: the disk fills up and the database crashes.
query failed: ERROR: invalid page in block 0 of relation base/16386/26697
As for how to fix it, I am never completely certain how the database got to this point. I have had fairly good luck with running VACUUM FULL and then REINDEX on the database.
But first, enable zeroing of the damaged pages so the cleanup operations can run to completion instead of getting stuck on the errors.
SET zero_damaged_pages = on;
Vacuum full and reindex.
VACUUM FULL ANALYZE VERBOSE;
REINDEX DATABASE "VCDB";
--note: the database name must match exactly, so the double quotes are necessary.
You might see warnings like the following in the verbose output as the operation works through the damaged pages.
WARNING: invalid page in block 6 of relation base/16385/11936; zeroing out page
WARNING: invalid page in block 4 of relation base/16385/20087; zeroing out page
WARNING: relation "vpx_compute_resource" page 4 is uninitialized --- fixing
WARNING: invalid page in block 0 of relation base/16385/20224; zeroing out page
WARNING: relation "vpx_dvs" page 0 is uninitialized --- fixing
WARNING: invalid page in block 13 of relation pg_tblspc/16397/PG_9.4_201409291/16385/21977; zeroing out page
WARNING: relation "vpx_hist_stat4_1" page 13 is uninitialized --- fixing
REINDEX TABLE <broken_table>;
or
REINDEX DATABASE "VCDB";
Some folks have reported that setting zero_damaged_pages to on and then running a SELECT * on the impacted table fixes it. I do not follow the logic there. My guess is that with zero_damaged_pages set to on, the bad pages are zeroed as they are read, so subsequent operations are no longer stopped by the errors. So I do not agree that a SELECT * on the damaged table is the fix, or that it even attempts to fix anything.
select * from <damaged_table>;
Summary
This isn't really a complete fix for the issue. Setting zero_damaged_pages to on basically tells PostgreSQL to zero out the damaged pages at the buffer level and carry on. The outcome is the loss of whatever data was on the invalid pages. The bad news is that even though this remedy resolves the majority of cases, it is not 100% guaranteed. In my opinion, restoring from a good backup set is the only reliable remedy that ensures no data loss affecting database integrity.
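For completeness, a minimal sketch of the backup-and-restore route with the standard tools (the /backup path is illustrative; the database name follows the examples above):

pg_dump -U postgres -Fc VCDB > /backup/VCDB.dump         # compressed logical backup taken while the database is healthy
pg_restore -U postgres -d VCDB --clean /backup/VCDB.dump # restore, dropping and recreating the damaged objects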