Tuesday, July 7, 2020

postgresql: corruption: invalid primary checkpoint record

Corruption is fairly common with PostgreSQL. It can easily be triggered by power outages, by shutting down the VM before the database has been shut down gracefully, or by disk-related issues.

Here is a common one from the postgresql.log: the server errors out with a checkpoint-related failure. In PostgreSQL a checkpoint is tied to the WAL in pg_xlog, much as it is tied to archive logs in a database such as Oracle. The first thing that comes to mind is pg_resetxlog, and it is usually a reasonable attempt to regain access to the database. Please note: pg_resetxlog will cause data loss, as the name suggests, but that is still better than a dead database that refuses to come up.
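Before reaching for pg_resetxlog, it is worth checking the cluster's control file with pg_controldata, which prints the recorded checkpoint locations and cluster state without modifying anything. A minimal sketch, assuming the tool ships in the same bin directory as the rest of vPostgres:

su vpostgres -s /bin/sh
/opt/vmware/vpostgres/current/bin/pg_controldata /storage/db/vpostgres

Pay attention to the "Database cluster state" and "Latest checkpoint location" lines. If pg_controldata itself cannot read the file, the control data is damaged as well.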

Here is what's in the postgresql.log:



2020-05-07 23:57:02.492 UTC 5d35de42.8b5 0   LOG:  Updated instance status successfully.
2020-05-07 23:58:02.551 UTC 5d35de42.8b5 0   LOG:  Updating instance status...
2020-05-07 23:58:02.551 UTC 5d35de42.8b5 0   LOG:  Checking bloat of WAL caused by replication slots
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Memory check: flag = rss, mem_used_bytes = 3216 kB, mem_avail = 715776 kB
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Writing instance status...
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Wrote instance status successfully.
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Updated instance status successfully.
2020-05-07 23:59:02.604 UTC 5d35de42.8b5 0   LOG:  Updating instance status...
2020-05-07 23:59:02.604 UTC 5d35de42.8b5 0   LOG:  Checking bloat of WAL caused by replication slots
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Memory check: flag = rss, mem_used_bytes = 3216 kB, mem_avail = 715776 kB
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Writing instance status...
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Wrote instance status successfully.
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Updated instance status successfully.
2020-07-07 19:13:04.975 UTC 5f04c940.849d 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0   LOG:  invalid primary checkpoint record
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0   LOG:  invalid secondary checkpoint record
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 19:13:07.569 UTC 5f04c940.849a 0   LOG:  startup process (PID 33949) was terminated by signal 6: Aborted
2020-07-07 19:13:07.569 UTC 5f04c940.849a 0   LOG:  aborting startup due to startup process failure
2020-07-07 19:13:07.631 UTC 5f04c940.849a 0   LOG:  database system is shut down
2020-07-07 19:17:58.055 UTC 5f04ca66.8a6b 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0   LOG:  invalid primary checkpoint record
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0   LOG:  invalid secondary checkpoint record
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 19:17:59.182 UTC 5f04ca65.8a69 0   LOG:  startup process (PID 35435) was terminated by signal 6: Aborted
2020-07-07 19:17:59.182 UTC 5f04ca65.8a69 0   LOG:  aborting startup due to startup process failure
2020-07-07 19:17:59.217 UTC 5f04ca65.8a69 0   LOG:  database system is shut down
2020-07-07 19:38:57.068 UTC 5f04cf51.9d32 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0   LOG:  invalid primary checkpoint record
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0   LOG:  invalid secondary checkpoint record
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 19:39:00.363 UTC 5f04cf50.9d2e 0   LOG:  startup process (PID 40242) was terminated by signal 6: Aborted
2020-07-07 19:39:00.363 UTC 5f04cf50.9d2e 0   LOG:  aborting startup due to startup process failure
2020-07-07 19:39:00.587 UTC 5f04cf50.9d2e 0   LOG:  database system is shut down
2020-07-07 20:01:33.348 UTC 5f04d49d.b1f3 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0   LOG:  invalid primary checkpoint record
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0   LOG:  invalid secondary checkpoint record
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 20:01:33.944 UTC 5f04d49d.b1f1 0   LOG:  startup process (PID 45555) was terminated by signal 6: Aborted
2020-07-07 20:01:33.944 UTC 5f04d49d.b1f1 0   LOG:  aborting startup due to startup process failure
2020-07-07 20:01:33.976 UTC 5f04d49d.b1f1 0   LOG:  database system is shut down
2020-07-07 20:06:09.405 UTC 5f04d5b1.829 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0   LOG:  invalid primary checkpoint record
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0   LOG:  invalid secondary checkpoint record
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 20:06:11.276 UTC 5f04d5b0.827 0   LOG:  startup process (PID 2089) was terminated by signal 6: Aborted
2020-07-07 20:06:11.276 UTC 5f04d5b0.827 0   LOG:  aborting startup due to startup process failure
2020-07-07 20:06:11.307 UTC 5f04d5b0.827 0   LOG:  database system is shut down
2020-07-07 20:08:32.940 UTC 5f04d640.97a 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0   LOG:  invalid primary checkpoint record
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0   LOG:  invalid secondary checkpoint record
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 20:08:33.531 UTC 5f04d640.978 0   LOG:  startup process (PID 2426) was terminated by signal 6: Aborted
2020-07-07 20:08:33.531 UTC 5f04d640.978 0   LOG:  aborting startup due to startup process failure
2020-07-07 20:08:33.562 UTC 5f04d640.978 0   LOG:  database system is shut down


Attempting to connect to PostgreSQL, or to start it with vCenter's service-control, will fail:

root@photon-machine [ /storage/log/vmware/vpostgres ]# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
psql.bin: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/vpostgres/.s.PGSQL.5432"?
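The missing socket simply means no postmaster is listening. A quick sketch to confirm the server is actually down, rather than just unreachable (postmaster.pid sits in the data directory on a standard layout):

ps -ef | grep [p]ostgres
ls -l /var/run/vpostgres/
cat /storage/db/vpostgres/postmaster.pid

If ps shows nothing but a postmaster.pid file is still present, it is a leftover from the crashed instance.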




root@photon-machine [ ~ ]# /usr/bin/service-control --start vpostgres
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vmware-vpostgres...
Error executing start on service vmware-vpostgres. Details {
    "resolution": null,
    "componentKey": null,
    "problemId": null,
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vmware-vpostgres"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-vpostgres'"
        }
    ]
}
Service-control failed. Error: {
    "resolution": null,
    "componentKey": null,
    "problemId": null,
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vmware-vpostgres"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-vpostgres'"
        }
    ]
}



vCenter's service-control --start --all will not work either, obviously, since vPostgres isn't starting. It gets stuck at the "vmware-vmon" process:

VMware vCenter Server Appliance 6.7.0.11000

Type: vCenter Server with an embedded Platform Services Controller

Last login: Tue Jul  7 20:04:16 2020 from 10.200.193.21
root@photon-machine [ ~ ]# /usr/bin/service-control --start --all
Operation not cancellable. Please wait for it to finish...
Performing start operation on service lwsmd...
Successfully started service lwsmd
Performing start operation on service vmafdd...
Successfully started service vmafdd
Performing start operation on service vmdird...
Successfully started service vmdird
Performing start operation on service vmcad...
Successfully started service vmcad
Performing start operation on service vmware-sts-idmd...
Successfully started service vmware-sts-idmd
Performing start operation on service vmware-stsd...
Successfully started service vmware-stsd
Performing start operation on service vmdnsd...
Successfully started service vmdnsd
Performing start operation on profile: ALL...
Successfully started service vmware-vmon

It is time to run pg_resetxlog. Again, this will lose any transactions in the WAL files after the last known good checkpoint.
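To preview the operation first, pg_resetxlog in 9.6 supports a dry run: the -n flag prints the values that would be written to pg_control and exits without resetting anything. A sketch:

su vpostgres -s /bin/sh
/opt/vmware/vpostgres/current/bin/pg_resetxlog -n /storage/db/vpostgres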

root@photon-machine [ ~ ]# su vpostgres -s /opt/vmware/vpostgres/current/bin/pg_resetxlog   /storage/db/vpostgres
could not change directory to "/root": Permission denied
Transaction log reset


The "could not change directory" warning is harmless: su keeps the current working directory, /root, which the vpostgres user cannot access. To avoid it, one can switch to a shell and cd into the /opt/vmware/vpostgres/9.6/bin directory first:

su vpostgres -s /bin/sh
cd /opt/vmware/vpostgres/9.6/bin/
./pg_resetxlog /storage/db/vpostgres
Transaction log reset

pg_resetxlog has completed. Most of the time this will work, but there are times when it just won't.
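One case where a plain run fails is when pg_control itself is unreadable, so pg_resetxlog refuses to guess safe values. It can be forced with -f, which makes it reconstruct pg_control from scratch. A last-resort sketch, not something to run casually, since badly guessed values deepen the data loss:

su vpostgres -s /bin/sh
cd /opt/vmware/vpostgres/current/bin
./pg_resetxlog -f /storage/db/vpostgres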


The following shows that PostgreSQL can now be started with service-control --start vpostgres, "ps -ef|grep postgres" confirms that the postmaster and its helper processes are up and running, and psql can log in to the database again.

root@photon-machine [ ~ ]# /usr/bin/service-control --start vpostgres
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vmware-vpostgres...
Successfully started service vmware-vpostgres

root@photon-machine [ ~ ]# ps -ef|grep postgres
vpostgr+  8974  4379  0 20:33 ?        00:00:00 /opt/vmware/vpostgres/current/bin/postgres -D /storage/db/vpostgres
vpostgr+  8981  8974  0 20:33 ?        00:00:00 postgres: logger process
vpostgr+  8983  8974  0 20:33 ?        00:00:00 postgres: checkpointer process
vpostgr+  8984  8974  0 20:33 ?        00:00:00 postgres: writer process
vpostgr+  8985  8974  0 20:33 ?        00:00:00 postgres: wal writer process
vpostgr+  8986  8974  0 20:33 ?        00:00:00 postgres: autovacuum launcher process
vpostgr+  8987  8974  0 20:33 ?        00:00:00 postgres: stats collector process
vpostgr+  8988  8974  0 20:33 ?        00:00:00 postgres: bgworker: health_status_worker
root     10526  3529  0 20:39 pts/2    00:00:00 grep --color=auto postgres


root@photon-machine [ ~ ]#  /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
psql.bin (9.6.6 (VMware Postgres 9.6.6.0-7698979 release))
Type "help" for help.

VCDB=#


That's it. The database is accessible again. Whether the application is working, or able to sync up with the database after the reset, is a different story.
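A couple of generic sanity checks from inside psql before handing the system back, just to confirm the server is out of recovery and the catalogs are readable (nothing VMware-specific here):

VCDB=# SELECT pg_is_in_recovery();
VCDB=# SELECT datname FROM pg_database;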

If all else fails, consider restoring the database to its last working state from backup, or redeploying the appliance. Overall, PostgreSQL is less resilient and more attention-intensive than commercial databases such as Oracle.
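Going forward, a periodic logical dump is cheap insurance against the next incident. A minimal sketch using the stock pg_dump bundled with vPostgres (VMware also ships its own appliance backup tooling, which is the supported route):

/opt/vmware/vpostgres/current/bin/pg_dump -U postgres VCDB > /tmp/VCDB.sql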