Postgres desk

Thursday, April 20, 2023

Postgres: ERROR: this node should be a standby

On a standby node, while performing a clone. I am receiving an error. So, I am on Standby and nothing's incorrectly done on the repmgr.conf either.

repmgr -h 10.18.10.11 -U repmgr -d repmgr -f repmgr.conf standby clone

[postgres@center-2 ~]$ repmgr -f repmgr.conf standby register -F
INFO: connecting to local node "centos-2" (ID: 2)
ERROR: this node should be a standby (host=10.18.10.11 port=5432 dbname=repmgr user=repmgr)

Reason the clone was done without the-F parameter. It should be done with -F to overwrite the original. The error definitely bogus.

repmgr -h 10.18.10.11 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone -F

Postgresql: Setup and configuring Postgresql Replication Manager (Repmgr)

Continue from the first part of Create Hot Standby blog.

https://postgresdesk.blogspot.com/2023/04/postgresql-creating-hot-standby.html

Download repmgr https://repmgr.org/ I will really need the repmgr.conf file from the download to make modifications for my setup.

It is necessary to modify the repmgr.conf for both Master and Standby nodes. The repmgr.conf file can be located anywhere in the VM. I am going to have mine copied to /var/lib/pgsql/

/var/lib/pgsql/repmgr.conf

Install repmgr version 13 through yum on both Master and Slave nodes. That way, it set up and link all the necessary libraries for my setup.

yum update
yum install repmgr13.x86_64

Modify the shared_preload_libraries with repmgr. Perform this for Standby node as well.

vi /var/lib/pgsql/13/data/postgresql.conf

shared_preload_libraries = 'repmgr'

Modify pg_hba.conf and add the following line for all repmgr. Perform this for Standby node as well.

vi /var/lib/pgsql/13/data/pg_hba.conf

local   repmgr          repmgr                                  trust
host    repmgr          repmgr          192.168.1.0/24          trust
local   replication     repmgr                                  trust
host    replication     repmgr          192.168.1.0/24          trust

Modify repmgr.conf file. Standby node requires similar changes. So, make the same modification for repmgr.conf in Standby node for id as 2.The follow_command entry is crucial. Every modification will require the postgres database to be restarted (systemctl start postgresql-13.service).

node_id=1             
node_name='centos-1
conninfo='host=192.168.1.11 port=5432 dbname=repmgr user=repmgr
data_directory='/var/lib/pgsql/13/data'
primary_follow_timeout=10
failover = 'automatic'
promote_command='repmgr standby promote -f /var/lib/pgsql/repmgr.conf'
follow_command='repmgr standby follow -f /var/lib/pgsql/repmgr.conf -W --upstream-node-id=2'

on Master node, create repmgr database and user. This is done once on Master node postgresql database.

[postgres@center-1 ~]$ createuser -s repmgr
[postgres@center-1 ~]$ createdb repmgr -O repmgr

Create and register Master node

[postgres@center-1 ~]$ repmgr -f /var/lib/pgsql/repmgr.conf master register
INFO: connecting to primary database...
NOTICE: attempting to install extension "repmgr"
NOTICE: "repmgr" extension successfully installed
NOTICE: primary node record (ID: 1) registered
[postgres@center-1 ~]$

Verify the repmgr is working on the Master node.

[postgres@center-1 ~]$ psql -d repmgr -U repmgr
psql (13.10)
Type "help" for help.

repmgr=# \dt
              List of relations
 Schema |        Name        | Type  | Owner  
--------+--------------------+-------+--------
 repmgr | events             | table | repmgr
 repmgr | monitoring_history | table | repmgr
 repmgr | nodes              | table | repmgr
 repmgr | voting_term        | table | repmgr
(4 rows)

repmgr=# \x
Expanded display is on.
repmgr=# select * from nodes;
-[ RECORD 1 ]----+------------------------------------------------------
node_id          | 1
upstream_node_id | 
active           | t
node_name        | centos-1
type             | primary
location         | default
priority         | 100
conninfo         | host=192.168.1.11 port=5432 dbname=repmgr user=repmgr
repluser         | repmgr
slot_name        | 
config_file      | /var/lib/pgsql/repmgr.conf

On Standby node.

as root, stop the postgresql database.

[root@center-2 ~]# systemctl stop postgresql-13.service 

On Standby node, perform a repmgr clone from the Master node.

[postgres@center-2 pgsql]$ repmgr -h 192.168.1.11 -U repmgr -d repmgr -f /var/lib/pgsql/repmgr.conf standby clone -F

NOTICE: destination directory "/var/lib/pgsql/13/data" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.1.11 user=repmgr dbname=repmgr
DETAIL: current installation size is 40 MB
INFO: replication slot usage not requested;  no replication slot will be set up for this standby
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
WARNING: data checksums are not enabled and "wal_log_hints" is "off"
DETAIL: pg_rewind requires "wal_log_hints" to be enabled
WARNING: directory "/var/lib/pgsql/13/data" exists but is not empty
NOTICE: -F/--force provided - deleting existing data directory "/var/lib/pgsql/13/data"
NOTICE: starting backup (using pg_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  pg_basebackup -l "repmgr base backup"  -D /var/lib/pgsql/13/data -h 192.168.1.11 -p 5432 -U repmgr -X stream 
NOTICE: standby clone (using pg_basebackup) complete
NOTICE: you can now start your PostgreSQL server
HINT: for example: pg_ctl -D /var/lib/pgsql/13/data start
HINT: after starting the server, you need to register this standby with "repmgr standby register"

As root startup, the postgres database follows with registering the Standby node to the database.

[root@center-2 ~]# systemctl start postgresql-13.service 
[root@center-2 ~]# su - postgres

[postgres@center-2 ~]$ repmgr -f /var/lib/pgsql/repmgr.conf standby register
INFO: connecting to local node "centos-2" (ID: 2)
INFO: connecting to primary database
WARNING: --upstream-node-id not supplied, assuming upstream node is primary (node ID 1)
INFO: standby registration complete
NOTICE: standby node "centos-2" (ID: 2) successfully registered

The Replication Manager is completed as this point.

There are ways to monitor the repmgr setup. The event table will record everything that has been done in the replication.

repmgr=# select node_id, event, successful, details from events;

node_id | event | successful | details

---------+---------------------------+------------+----------------------------------------------------------------------------------------------

1 | cluster_created | t |

1 | primary_register | t |

2 | standby_clone | t | cloned from host "192.168.1.11", port 5432; backup method: pg_basebackup; --force: Y

2 | standby_register | t | standby registration succeeded; upstream node ID is 1

1 | repmgrd_start | t | monitoring cluster primary "centos-1" (ID: 1)

2 | repmgrd_start | t | monitoring connection to upstream node "centos-1" (ID: 1)

2 | standby_promote | t | server "centos-2" (ID: 2) was successfully promoted to primary

2 | repmgrd_failover_promote | t | node "centos-2" (ID: 2) promoted to primary; old primary "centos-1" (ID: 1) marked as failed

2 | repmgrd_reload | t | monitoring cluster primary "centos-2" (ID: 2)

1 | standby_clone | t | cloned from host "192.168.1.12", port 5432; backup method: pg_basebackup; --force: Y

1 | repmgrd_standby_reconnect | t | node restored as standby after 827 seconds, monitoring connection to upstream node -1

1 | standby_register | t | standby registration succeeded; upstream node ID is 2 (-F/--force option was used)

2 | child_node_new_connect | t | new standby "centos-1" (ID: 1) has connected

1 | standby_promote | t | server "centos-1" (ID: 1) was successfully promoted to primary

1 | repmgrd_failover_promote | t | node "centos-1" (ID: 1) promoted to primary; old primary "centos-2" (ID: 2) marked as failed

1 | repmgrd_reload | t | monitoring cluster primary "centos-1" (ID: 1)

2 | standby_clone | t | cloned from host "192.168.1.11", port 5432; backup method: pg_basebackup; --force: Y

2 | repmgrd_standby_reconnect | t | node restored as standby after 252 seconds, monitoring connection to upstream node -1

2 | standby_register | t | standby registration succeeded; upstream node ID is 1 (-F/--force option was used)

1 | child_node_new_connect | t | new standby "centos-2" (ID: 2) has connected

(20 rows)

repmgr=# select * from nodes;

---------+------------------+--------+-----------+---------+----------+----------+-------------------------------------------------------+----------+-----------+----------------------------

(2 rows)

repmgr=# \dt

List of relations

Schema | Name | Type | Owner

--------+--------------------+-------+--------

repmgr | events | table | repmgr

repmgr | monitoring_history | table | repmgr

repmgr | nodes | table | repmgr

repmgr | voting_term | table | repmgr

(4 rows)

repmgr has some commands that essentially perform the same stuff as looking into the repmgr database.

repmgr -f /var/lib/pgsql/repmgr.conf cluster show
repmgr -f /var/lib/pgsql/repmgr.conf cluster crosscheck
repmgr -f /var/lib/pgsql/repmgr.conf cluster events

That's it. The Standby should not be able to perform DML and all DML should be coming from Master node.

On Standby node.

TestDB=# create table test1 (int serial);
ERROR:  cannot execute CREATE TABLE in a read-only transaction

Create test1 table in Master node then log back into Standby node to perform data insertion. It still would not allowed it.

TestDB=# insert into test1 (select generate_series(1,1000));
ERROR:  cannot execute INSERT in a read-only transaction

Summary of the 3 step process for switchover.

repmgr -h <IP_From_The_Primary_Node> -U repmgr -d repmgr -f repmgr.conf standby clone -F
sudo systemctl start postgresql
repmgr -f repmgr.conf standby register

Enjoy.

Monday, April 10, 2023

Postgresql: Creating Hot Standby

Create hot standby

As a fictitious example, I will assign them 2 IPs, 192.168.1.11 as centos-1 and 192.168.1.12 as centos-1. This can be achieved by staging the first VM as centos-1 and then cloning it to centos-2 and assigning new ip to it as 192.168.1.12. Two VMs with Postgresql database should have been staged at this point. Both of the should be accessible as normal database would.

Creating a hot standby is relatively straightforward. Basically, creating a new database replicator user, and perform pg_basebackup that contains the standby.signal file and copy the centos-1 postgres data over to overwrite centos-2 data directory.

Create replicator user

postgres=# create user replicator with replication password 'password';
CREATE ROLE

Make changes to the following on postgresql.conf

archive_mode = on
archive_command = 'test  | -f /var/lib/pgsql/archivelog/%f && cp %p /var/lib/pgsql/archivelog/%f'
listen_addresses = '*'

Add a new one for coming from the host and everything coming from the same subnet

host    replication     replicator      192.168.1.0/24          md5

Perform pg_basebackup

[postgres@center-1 pgsql]$ pg_basebackup -h 192.168.1.11 -U replicator -p 5432 -D basebackup -Fp -Xs -P -R
Password: 
33610/33610 kB (100%), 1/1 tablespace

[postgres@center-1 pgsql]$ ls
13  archivelog  basebackup

perform entire data copy over to Slave node. This will overwrite the entire data directory in the Slave node with a standby.signal file indicating the node is a standby node.

rsync -a basebackup/ postgres@192.168.1.12:/var/lib/pgsql/13/data/

Hot Standby is completed. The standby node will not be able to write by select only.

Tuesday, July 7, 2020

postgresql: corruption: invalid primary checkpoint record

Corruptions are very common with PostgreSQL. It can easily be triggered on outages, shutting down the VM before shutting down the database gracefully or disk-related issues.

Here is a common one from the postgresql.log. It errors out with Checkpoint related error. In any database such as Oracle, Checkpoint has something to do with archivelogs or pg_xlogs for PostgreSQL. First thing coming to my mind is performing a pg_resetxlog and is probably a good attempt to gain access to the database again. Please note. pg_resetxlog will cause data loss but it still better than a dead database that refuses to come up. As the name suggested.

Here is what's in the postgresql.log



2020-05-07 23:57:02.492 UTC 5d35de42.8b5 0   LOG:  Updated instance status successfully.
2020-05-07 23:58:02.551 UTC 5d35de42.8b5 0   LOG:  Updating instance status...
2020-05-07 23:58:02.551 UTC 5d35de42.8b5 0   LOG:  Checking bloat of WAL caused by replication slots
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Memory check: flag = rss, mem_used_bytes = 3216 kB, mem_avail = 715776 kB
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Writing instance status...
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Wrote instance status successfully.
2020-05-07 23:58:02.552 UTC 5d35de42.8b5 0   LOG:  Updated instance status successfully.
2020-05-07 23:59:02.604 UTC 5d35de42.8b5 0   LOG:  Updating instance status...
2020-05-07 23:59:02.604 UTC 5d35de42.8b5 0   LOG:  Checking bloat of WAL caused by replication slots
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Memory check: flag = rss, mem_used_bytes = 3216 kB, mem_avail = 715776 kB
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Writing instance status...
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Wrote instance status successfully.
2020-05-07 23:59:02.605 UTC 5d35de42.8b5 0   LOG:  Updated instance status successfully.
2020-07-07 19:13:04.975 UTC 5f04c940.849d 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0   LOG:  invalid primary checkpoint record
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0   LOG:  invalid secondary checkpoint record
2020-07-07 19:13:05.195 UTC 5f04c940.849d 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 19:13:07.569 UTC 5f04c940.849a 0   LOG:  startup process (PID 33949) was terminated by signal 6: Aborted
2020-07-07 19:13:07.569 UTC 5f04c940.849a 0   LOG:  aborting startup due to startup process failure
2020-07-07 19:13:07.631 UTC 5f04c940.849a 0   LOG:  database system is shut down
2020-07-07 19:17:58.055 UTC 5f04ca66.8a6b 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0   LOG:  invalid primary checkpoint record
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0   LOG:  invalid secondary checkpoint record
2020-07-07 19:17:58.142 UTC 5f04ca66.8a6b 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 19:17:59.182 UTC 5f04ca65.8a69 0   LOG:  startup process (PID 35435) was terminated by signal 6: Aborted
2020-07-07 19:17:59.182 UTC 5f04ca65.8a69 0   LOG:  aborting startup due to startup process failure
2020-07-07 19:17:59.217 UTC 5f04ca65.8a69 0   LOG:  database system is shut down
2020-07-07 19:38:57.068 UTC 5f04cf51.9d32 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0   LOG:  invalid primary checkpoint record
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0   LOG:  invalid secondary checkpoint record
2020-07-07 19:38:57.562 UTC 5f04cf51.9d32 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 19:39:00.363 UTC 5f04cf50.9d2e 0   LOG:  startup process (PID 40242) was terminated by signal 6: Aborted
2020-07-07 19:39:00.363 UTC 5f04cf50.9d2e 0   LOG:  aborting startup due to startup process failure
2020-07-07 19:39:00.587 UTC 5f04cf50.9d2e 0   LOG:  database system is shut down
2020-07-07 20:01:33.348 UTC 5f04d49d.b1f3 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0   LOG:  invalid primary checkpoint record
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0   LOG:  invalid secondary checkpoint record
2020-07-07 20:01:33.431 UTC 5f04d49d.b1f3 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 20:01:33.944 UTC 5f04d49d.b1f1 0   LOG:  startup process (PID 45555) was terminated by signal 6: Aborted
2020-07-07 20:01:33.944 UTC 5f04d49d.b1f1 0   LOG:  aborting startup due to startup process failure
2020-07-07 20:01:33.976 UTC 5f04d49d.b1f1 0   LOG:  database system is shut down
2020-07-07 20:06:09.405 UTC 5f04d5b1.829 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0   LOG:  invalid primary checkpoint record
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0   LOG:  invalid secondary checkpoint record
2020-07-07 20:06:09.993 UTC 5f04d5b1.829 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 20:06:11.276 UTC 5f04d5b0.827 0   LOG:  startup process (PID 2089) was terminated by signal 6: Aborted
2020-07-07 20:06:11.276 UTC 5f04d5b0.827 0   LOG:  aborting startup due to startup process failure
2020-07-07 20:06:11.307 UTC 5f04d5b0.827 0   LOG:  database system is shut down
2020-07-07 20:08:32.940 UTC 5f04d640.97a 0   LOG:  database system was interrupted; last known up at 2020-05-09 12:46:59 UTC
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0   LOG:  invalid primary checkpoint record
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0   LOG:  invalid secondary checkpoint record
2020-07-07 20:08:33.024 UTC 5f04d640.97a 0   PANIC:  could not locate a valid checkpoint record
2020-07-07 20:08:33.531 UTC 5f04d640.978 0   LOG:  startup process (PID 2426) was terminated by signal 6: Aborted
2020-07-07 20:08:33.531 UTC 5f04d640.978 0   LOG:  aborting startup due to startup process failure
2020-07-07 20:08:33.562 UTC 5f04d640.978 0   LOG:  database system is shut down

Attempting to start up the PostgreSQL or vCenter service-control, it will refuse to.

root@photon-machine [ /storage/log/vmware/vpostgres ]# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
psql.bin: could not connect to server: No such file or directory
        Is the server running locally and accepting
        connections on Unix domain socket "/var/run/vpostgres/.s.PGSQL.5432"?


root@photon-machine [ ~ ]# /usr/bin/service-control --start vpostgres
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vmware-vpostgres...
Error executing start on service vmware-vpostgres. Details {
    "resolution": null,
    "componentKey": null,
    "problemId": null,
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vmware-vpostgres"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-vpostgres'"
        }
    ]
}
Service-control failed. Error: {
    "resolution": null,
    "componentKey": null,
    "problemId": null,
    "detail": [
        {
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "vmware-vpostgres"
            ],
            "id": "install.ciscommon.service.failstart",
            "localized": "An error occurred while starting service 'vmware-vpostgres'"
        }
    ]

vCenter service-control --start --all will not work either .. obviously since vPostgreql isn't starting. It will be stuck at the "vmware-vmon" process

VMware vCenter Server Appliance 6.7.0.11000

Type: vCenter Server with an embedded Platform Services Controller

Last login: Tue Jul  7 20:04:16 2020 from 10.200.193.21
root@photon-machine [ ~ ]# /usr/bin/service-control --start --all
Operation not cancellable. Please wait for it to finish...
Performing start operation on service lwsmd...
Successfully started service lwsmd
Performing start operation on service vmafdd...
Successfully started service vmafdd
Performing start operation on service vmdird...
Successfully started service vmdird
Performing start operation on service vmcad...
Successfully started service vmcad
Performing start operation on service vmware-sts-idmd...
Successfully started service vmware-sts-idmd
Performing start operation on service vmware-stsd...
Successfully started service vmware-stsd
Performing start operation on service vmdnsd...
Successfully started service vmdnsd
Performing start operation on profile: ALL...
Successfully started service vmware-vmon

It is time to reset the pg_resetlog. Again, this will cause loss of transactions until the last known working checkpoint with the WAL files.

root@photon-machine [ ~ ]# su vpostgres -s /opt/vmware/vpostgres/current/bin/pg_resetxlog   /storage/db/vpostgres
could not change directory to "/root": Permission denied
Transaction log reset

One might want to cd into the /opt/vmware/vpostgres/9.6/bin directory.

su vpostgres -s /bin/sh
cd /opt/vmware/vpostgres/9.6/bin/
./pg_resetxlog /storage/db/vpostgres
Transaction log reset

pg_resetxlog is completed. Most of the time this will work but there times, it just wouldn't.

The following show that the PostgreSQL can be started with service-control --start vpostgres. "ps -ef|grep postgres" is showing PostgreSQL and other processes are up and running. Able to login to the PostgreSQL.

root@photon-machine [ ~ ]# /usr/bin/service-control --start vpostgres
Operation not cancellable. Please wait for it to finish...
Performing start operation on service vmware-vpostgres...
Successfully started service vmware-vpostgres

root@photon-machine [ ~ ]# ps -ef|grep postgres
vpostgr+  8974  4379  0 20:33 ?        00:00:00 /opt/vmware/vpostgres/current/bin/postgres -D /storage/db/vpostgres
vpostgr+  8981  8974  0 20:33 ?        00:00:00 postgres: logger process
vpostgr+  8983  8974  0 20:33 ?        00:00:00 postgres: checkpointer process
vpostgr+  8984  8974  0 20:33 ?        00:00:00 postgres: writer process
vpostgr+  8985  8974  0 20:33 ?        00:00:00 postgres: wal writer process
vpostgr+  8986  8974  0 20:33 ?        00:00:00 postgres: autovacuum launcher process
vpostgr+  8987  8974  0 20:33 ?        00:00:00 postgres: stats collector process
vpostgr+  8988  8974  0 20:33 ?        00:00:00 postgres: bgworker: health_status_worker
root     10526  3529  0 20:39 pts/2    00:00:00 grep --color=auto postgres


root@photon-machine [ ~ ]#  /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB
psql.bin (9.6.6 (VMware Postgres 9.6.6.0-7698979 release))
Type "help" for help.

VCDB=#

That's it. The database is accessible. Whether the application is working or able to synch up with the database is a different story.

If all else not working, consider restoring the database to last working condition or redeploy. Overall, Postgresql is less resilient and more attention intensive than other commercial databases.

Friday, January 17, 2020

postgresql: Looking at the detail of wal file.

Using pg_xlogdump to look at the contents of a wal file.



root@photon-machine [ /storage/archive/vpostgres ]#  /opt/vmware/vpostgres/9.6/bin/pg_xlogdump 00000001000000020000002E
/opt/vmware/vpostgres/9.6/bin/pg_xlogdump /storage/archive/vpostgres/00000001000000020000002





Posted by



Migu




at

9:00 AM



No comments:
  









Email ThisBlogThis!Share to XShare to FacebookShare to Pinterest








postgresql: Purging WAL files with pg_archivecleanup in VC Appliance.







Purging wall files with pg_archivecleanup in VC Appliance. 
Essentially, this is similar to "rm" except, it detects dependency of WAL files and subsequent  WAL files.
The following -n flag provide information of what else to be removed if WAL file 0000000100000002000000EA is removed.
Note: watch out for space right after pg_xlog



root@photon-machine [ /storage/db/vpostgres/pg_xlog ]# /opt/vmware/vpostgres/9.6/bin/pg_archivecleanup -n /storage/db/vpostgres/pg_xlog/ 0000000100000002000000EA
/storage/db/vpostgres/pg_xlog//0000000100000002000000E6
/storage/db/vpostgres/pg_xlog//0000000100000002000000E9
/storage/db/vpostgres/pg_xlog//0000000100000002000000E8
/storage/db/vpostgres/pg_xlog//0000000100000002000000E7

Another example of deletion on different WAL file. The WAL file of 0000000100000002000000E6 will no longer in pg_xlog once deleted.


root@photon-machine [ /storage/db/vpostgres/pg_xlog ]# /opt/vmware/vpostgres/9.6/bin/pg_archivecleanup -d /storage/db/vpostgres/pg_xlog/ 0000000100000002000000E7
pg_archivecleanup: keep WAL file "/storage/db/vpostgres/pg_xlog//0000000100000002000000E7" and later
pg_archivecleanup: removing file "/storage/db/vpostgres/pg_xlog//0000000100000002000000E6"







Posted by



Migu




at

8:58 AM



No comments:
  









Email ThisBlogThis!Share to XShare to FacebookShare to Pinterest


        

          
        
Friday, November 16, 2018

          
        






postgres corruption: ERROR: invalid page in block









Certain select queries will hit the following error. It has something to do with physical disk issue/corruption/outage where the table resides. This is a very common issue from my experience where disk filled up and database crashed.









query failed: ERROR: invalid page in block 0 of relation base/16386/26697







As to how to fix it, I am not completely certain when the database got to this point. I have fairly good luck with doing VACUUM FULL then REINDEX the database. 




But first, set zero out the damaged pages so, the cleaning operations can be carried out and not stuck on the errors.






SET zero_damaged_pages = on;






Vacuum full and reindex.






VACUUM FULL ANALYZE VERBOSE;

REINDEX DATABASE "VCDB";






--note: the database name must match exactly so, double quotes are are necessary.




Might be getting the following in the verbose mode of what is operation is doing.






WARNING:  invalid page in block 6 of relation base/16385/11936; zeroing out page

WARNING:  invalid page in block 4 of relation base/16385/20087; zeroing out page

WARNING:  relation "vpx_compute_resource" page 4 is uninitialized --- fixing

WARNING:  invalid page in block 0 of relation base/16385/20224; zeroing out page

WARNING:  relation "vpx_dvs" page 0 is uninitialized --- fixing

WARNING:  invalid page in block 13 of relation pg_tblspc/16397/PG_9.4_201409291/16385/21977; zeroing out page

WARNING:  relation "vpx_hist_stat4_1" page 13 is uninitialized --- fixing











REINDEX table <broken_table>;

or

REINDEX DATABASE "VCDB";









Some folks reported that by setting zero_damaged_pages to ON and then run a select all on the impacted table, it fixes it. I am not seeing through the logic there. My guess on this logic is, by setting the zero_damaged_pages to ON, it omitted all the bad pages and subsequent operations will not stop by the errors. So, I do not agree that by selecting all on the damaged table is the fix or even attempting to fix anything.




select * from <damaged_table>;







Summary

This isn't really a complete fix of the issue. The zero_damaged_pages set to on basically telling PostgreSQL to zero out the error at the buffer level and continue to do. The outcome of this is loss of data on the invalid page. Bad news, even though, this remedy fixes the majority of the issue but not 100% guaranteed. In my opinion, restoring from the good backup set are the only good remedy and ensuring no data loss that impacting database integrity..





Posted by



Migu




at

7:31 AM



2 comments:
  









Email ThisBlogThis!Share to XShare to FacebookShare to Pinterest




Older Posts

Home




Subscribe to:
Posts (Atom)












Menu



Home


Administrator Desk


Oracle  Desk


SQL Server Desk


Data Engineering


Learning Python


MySql Desk


Links





Search This Blog




















Postgres Pages



postgresql: database size


postgresql: object description for vcenter tables ...


Postgres: postgres way to rebuild index online.


Postgresql: Tracking what's going on in the postgr...


Postgres: locating session information


Postgres: Index size


Postgresql: Getting views definition


Postgresql: Getting View definition


Postgresql: Getting index


Postgresql: Looking at Postgresql view the Oracle'...


Postgresql: interactive with psql  "&"


Postgresql: Getting pg_toast objects


Postgresql: tracking which tables belong to which ...


Postgresql: tablespace tables node id tracking


postgresql: table size


Postgres: Where are my Postgres files ?


postgres common database commands





Blog Archive








        ▼ 
      



2023

(3)





        ▼ 
      



April

(3)

Postgres: ERROR: this node should be a standby
Postgresql: Setup and configuring Postgresql Repli...
Postgresql: Creating Hot Standby










        ► 
      



2020

(3)





        ► 
      



July

(1)







        ► 
      



January

(2)









        ► 
      



2018

(1)





        ► 
      



November

(1)









        ► 
      



2017

(2)





        ► 
      



July

(1)







        ► 
      



January

(1)









        ► 
      



2016

(10)





        ► 
      



November

(1)







        ► 
      



October

(2)







        ► 
      



July

(1)







        ► 
      



May

(1)







        ► 
      



April

(4)







        ► 
      



January

(1)









        ► 
      



2015

(15)





        ► 
      



October

(1)







        ► 
      



August

(1)







        ► 
      



July

(2)







        ► 
      



June

(1)







        ► 
      



May

(6)







        ► 
      



April

(1)







        ► 
      



February

(1)







        ► 
      



January

(2)









        ► 
      



2014

(1)





        ► 
      



December

(1)









Popular Posts



postgres corruption: ERROR: invalid page in block


Postgresql: REINDEX DATABASE requires double quotes within psql.


postgresql: corruption: ERROR: unexpected chunk number 0 (expected 1)  for toast value XXXXXXXXX in pg_toast_XXXX


postgresql: corruption:  invalid primary checkpoint record


Postgresql: could not fork autovacuum worker process: Cannot allocate memory


Postgresql: database running out of space.


Postgres: Installing Postgres in RHel 6.4.


Postgresql: Python: Installing pyscopg on Red Hat 6.4


postgresql corruption: ERROR: tuple concurrently update (duplicate key in primary key column)


Postgres Corruption: ERROR: could not open file "base/.../.."





Microsoft







Certifications





















About Me





Migu


Welcome to my blog and my pen name is Migu. I am passionate about database technologies and automation. Blogging is a way to keep improving my skills and something I am having fun doing. I have been in the IT industry for a little over 2 decades primarily in database field. I have worked in different industries during those time (ERP, CRM, Healthcare and virtualization software development) and involved in a few different to roles from QA,System Engineer, System Administration, Customer Support Engineer, escalation engineer and Database Administration. I love continuous learning hence with my professional certifications. Obtaining a certification is the beginning of my learning. It offers the knowledge of what's available for certain technologies. There is 2 type of experiences one can be obtained with daily work routine which can be repetitive after awhile. The later is through academy where enable one to acquire additional skills that can't be obtained from the job.

View my complete profile



























Simple theme. Powered by Blogger.