Friday, January 17, 2020

postgresql: Looking at the details of a WAL file.

Using pg_xlogdump to look at the contents of a WAL file.


root@photon-machine [ /storage/archive/vpostgres ]#  /opt/vmware/vpostgres/9.6/bin/pg_xlogdump 00000001000000020000002E
/opt/vmware/vpostgres/9.6/bin/pg_xlogdump /storage/archive/vpostgres/00000001000000020000002
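The 24-hex-digit WAL file name itself encodes three 8-digit fields: timeline ID, log number, and segment number within the log. A quick sketch of pulling those apart in shell (decode_wal is my own hypothetical helper, not part of PostgreSQL):

```shell
#!/bin/sh
# Split a 24-character WAL segment file name into its three 8-hex-digit
# fields: timeline ID, log number, and segment number within the log.
# (decode_wal is a hypothetical helper, for illustration only.)
decode_wal() {
    name=$1
    tli=${name%????????????????}   # first 8 hex digits: timeline ID
    rest=${name#????????}
    log=${rest%????????}           # middle 8 hex digits: log number
    seg=${rest#????????}           # last 8 hex digits: segment number
    printf 'timeline=%s log=%s seg=%s\n' "$tli" "$log" "$seg"
}

decode_wal 00000001000000020000002E   # timeline=00000001 log=00000002 seg=0000002E
```

So 00000001000000020000002E is timeline 1, log 2, segment 0x2E.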

postgresql: Purging WAL files with pg_archivecleanup in VC Appliance.


pg_archivecleanup is essentially an "rm" that understands the ordering of WAL files: given a WAL file name, it works out which older segments are safe to remove.
The -n flag does a dry run, listing the files that would be removed if everything older than WAL file 0000000100000002000000EA were purged (the named file itself is kept).
Note: watch out for the space between the pg_xlog directory argument and the WAL file name.



root@photon-machine [ /storage/db/vpostgres/pg_xlog ]# /opt/vmware/vpostgres/9.6/bin/pg_archivecleanup -n /storage/db/vpostgres/pg_xlog/ 0000000100000002000000EA
/storage/db/vpostgres/pg_xlog//0000000100000002000000E6
/storage/db/vpostgres/pg_xlog//0000000100000002000000E9
/storage/db/vpostgres/pg_xlog//0000000100000002000000E8
/storage/db/vpostgres/pg_xlog//0000000100000002000000E7
Another example, this time actually deleting with -d. WAL file 0000000100000002000000E6 will no longer be in pg_xlog once the command completes.

root@photon-machine [ /storage/db/vpostgres/pg_xlog ]# /opt/vmware/vpostgres/9.6/bin/pg_archivecleanup -d /storage/db/vpostgres/pg_xlog/ 0000000100000002000000E7
pg_archivecleanup: keep WAL file "/storage/db/vpostgres/pg_xlog//0000000100000002000000E7" and later
pg_archivecleanup: removing file "/storage/db/vpostgres/pg_xlog//0000000100000002000000E6"
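The rule both examples follow is simple: WAL segment names sort in the same order they were written, so every 24-character file whose name sorts lexically before the named segment is removable, and the named segment plus everything after it is kept. A minimal simulation of that rule on dummy files (illustration only; run the real pg_archivecleanup on the appliance):

```shell
#!/bin/bash
# Simulate pg_archivecleanup's selection rule on dummy files: a WAL segment
# is removable when its 24-character name sorts lexically BEFORE the named
# oldest-kept segment; that segment and everything after it stay.
dir=$(mktemp -d)
for f in 0000000100000002000000E6 0000000100000002000000E7 \
         0000000100000002000000E8; do
    touch "$dir/$f"
done

keep=0000000100000002000000E7    # oldest segment that must be kept
removed=""
for f in "$dir"/????????????????????????; do
    base=${f##*/}
    if [[ "$base" < "$keep" ]]; then
        rm "$f"
        removed="$removed$base"
    fi
done
echo "removed: $removed"          # only ...E6 goes away
kept=$(echo $(ls "$dir"))         # ...E7 and ...E8 remain
rm -rf "$dir"
```

This matches the -d output above: E7 and later are kept, E6 is removed.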

Friday, November 16, 2018

postgres corruption: ERROR: invalid page in block



Certain SELECT queries will hit the following error. It usually comes from a physical disk issue/corruption/outage where the table resides. In my experience this is a very common situation after the disk fills up and the database crashes.


query failed: ERROR: invalid page in block 0 of relation base/16386/26697


As for how to fix it, I am not completely certain how the database got to this point. I have had fairly good luck with doing a VACUUM FULL and then a REINDEX of the database.

But first, turn on zeroing of damaged pages so the cleanup operations can be carried out instead of getting stuck on the errors.

SET zero_damaged_pages = on;

Vacuum full and reindex.

VACUUM FULL ANALYZE VERBOSE;
REINDEX DATABASE "VCDB";

--note: the database name must match exactly (it is case-sensitive), so the double quotes are necessary.

In verbose mode you might see output like the following, showing what the operation is doing:

WARNING:  invalid page in block 6 of relation base/16385/11936; zeroing out page
WARNING:  invalid page in block 4 of relation base/16385/20087; zeroing out page
WARNING:  relation "vpx_compute_resource" page 4 is uninitialized --- fixing
WARNING:  invalid page in block 0 of relation base/16385/20224; zeroing out page
WARNING:  relation "vpx_dvs" page 0 is uninitialized --- fixing
WARNING:  invalid page in block 13 of relation pg_tblspc/16397/PG_9.4_201409291/16385/21977; zeroing out page
WARNING:  relation "vpx_hist_stat4_1" page 13 is uninitialized --- fixing


REINDEX TABLE <broken_table>;
or
REINDEX DATABASE "VCDB";


Some folks have reported that setting zero_damaged_pages to ON and then running a SELECT of everything on the impacted table fixes it. I do not see the logic there. My guess is that with zero_damaged_pages ON, the bad pages get zeroed as they are read, so subsequent operations are no longer stopped by the errors. So I do not agree that a SELECT of everything on the damaged table is the fix, or is even attempting to fix anything.

select * from <damaged_table>;


Summary
This isn't really a complete fix of the issue. Setting zero_damaged_pages to on basically tells PostgreSQL to zero out the invalid pages at the buffer level and continue. The outcome is loss of the data on those invalid pages. And while this remedy fixes the majority of cases, it is not 100% guaranteed. In my opinion, restoring from a good backup set is the only remedy that ensures no data loss impacting database integrity.

Thursday, July 6, 2017

postgresql corruption: ERROR: tuple concurrently updated (duplicate key in primary key column)


I am not exactly sure how to classify this type of issue. It seems like either a pg_toast out-of-sync issue or a "tuple concurrently updated" error. Either way, the table ends up with identical records in a column where a PK has been defined.

The workaround is to delete one of the duplicate records from the table.

The only way to delete the record is by using its CTID instead of any ID defined in the PK column: since the two records carry identical PK values, deleting by ID would not work. It would wipe out both records, and one of them is needed.


Typical errors
delete from vpx_vm where id=2398;
ERROR:  tuple concurrently updated


There are times when a SELECT on the table may be fine, but the example id returns 2 identical records. The workaround is to delete one of them so there is no duplicate. The trick is: which one to delete?

select ctid, id from vpx_vm where id = 1433; -- this may return 2 records with 2 different CTIDs. One of the 2 records will result in an unexpected chunk number error.







Example: 
select * from vpx_vm where ctid = '(8,2)';   -- fine
select * from vpx_vm where ctid = '(69,2)';  -- errors out with pg_toast missing chunk

Delete the bad ctid (note the quotes: the ctid is compared against a string literal):
delete from vpx_vm where ctid = '(69,2)';    -- track down that VM in vpx_host too


Select each of the rows by its corresponding CTID; one of them will fail. Delete the one that fails and leave the one that returns output alone.
select * from vpx_vm where ctid = 'xxxx';


A technique that has been used to address this issue, which I found may not be that reliable, since there is a chance that the good record gets deleted and the corrupt record is left in the table:

delete from vpx_vm where ctid=(select ctid from vpx_vm where id=2398 limit 1);  

The above issue is fairly common, typically found right after the disk fills up and the database crashes.

Saturday, January 7, 2017

postgresql: corruption: ERROR: unexpected chunk number 0 (expected 1) for toast value XXXXXXXXX in pg_toast_XXXX


This issue is very common but luckily can be resolved the majority of the time (only once did it not work for me). The corruption is a bit difficult to describe, so I am using a real scenario here.

The vCenter crashed, and the error reported in the logs was:


2017-01-05T15:51:06.640Z [7FC61081D700 error 'Default' opID=SWI-3a1cc46] [VdbStatement] SQLError was thrown: "ODBC error: (XX000) - ERROR: unexpected chunk number 0 (ex
pected 1) for toast value 25270341 in pg_toast_16875;
--> Error while executing the query" is returned when executing SQL statement "SELECT IS_CONSOLIDATE_NEEDED, DAS_PROTECTED_ACTUAL, DAS_PROTECTED_DESIRED, MANAGED_BY_TYP
E, MANAGED_BY_EXT_KEY, STORAGE_SPACE_UPDATED_TIME, AGGR_UNSHARED_STORAGE_SPACE, AGGR_UNCOMMITED_STORAGE_SPACE, AGGR_COMMITED_STORAGE_SPACE, RECORD_REPLAY_STATE, FAULT_T
OLERANCE_STATE, PENDING_ANNOT_SET_FLG, PENDING_NAME, ONLINE_STANDBY, MKS_CONNECTIONS, SCREEN_WIDTH, SCREEN_HEIGHT, TOOLS_MOUNTED, MEMORY_OVERHEAD, SUSPEND_INTERVAL, BOO
T_TIME, SUSPEND_TIME, UUID_INSTANCE, UUID_BIOS, NUM_DISK, NUM_NIC, NUM_VCPU, MEMORY_RES"


Trying to delete the offending entry from vpx_vm using the id throws a different error.

ERROR: tuple concurrently updated

Backing up the table throws the same error:

ERROR: unexpected chunk number 0 (expected 1) for toast value 25270341 in pg_toast_16875;


Steps to resolve the issue for VCSA.


1: Back up the database first.

2: Stop vpxd & vdcs service:
     service vmware-vpxd stop
     service vmware-vdcs stop

3: Connect to VCDB with:
   /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

4: Execute the query below (the number after pg_toast_ is the OID of the owning table):
   select 16875::regclass;

   The query above should give you output something like:
   regclass
   --------
   VPX_VM

5: Now re-index the table with the queries below:
   REINDEX table pg_toast.pg_toast_16875;
   REINDEX table VPX_VM; (Table name should be whatever is returned from step#4)

6: Run VACUUM ANALYZE on the vpx_vm table with the following query:
    VACUUM ANALYZE VPX_VM; (Table name should be whatever is returned from step#4)

7: \q

8: Start vpxd & vdcs service:
   service vmware-vpxd start
   service vmware-vdcs start

NOTE: These steps can lead to some data loss from the table.

The key is to locate the pg_toast table, then reindex both it and its corresponding base table.

Friday, November 18, 2016

postgresql: pg_relation_size glitches?

How did pg_relation_size fail to locate something that is in the db? It turns out pg_relation_size resolves an unqualified name through the current search_path, while information_schema.tables lists tables from every schema, so a name from a schema outside the search_path fails to resolve. Schema-qualifying the name, as in the last query below, avoids this.

VCDB=# select current_database(), version();
 current_database |                                                                     version

------------------+---------------------------------------------------------------------------------------------------------------
-----------------------------------
 VCDB             | PostgreSQL 9.4.9 (VMware Postgres 9.4.9.1-4536683 release) on x86_64-unknown-linux-gnu, compiled by x86_64-vmk
-linux-gnu-gcc (GCC) 4.8.4, 64-bit
(1 row)

VCDB=# select pg_relation_size(table_name) as Size  FROM information_schema.tables WHERE table_schema NOT IN ('information_schema', 'pg_catalog') ORDER BY size DESC LIMIT 10;
ERROR:  relation "vci_version" does not exist

VCDB=# \d vci_version;
Did not find any relation named "vci_version".
VCDB=#

VCDB=#  select table_name FROM information_schema.tables WHERE table_schema NOT IN ('information_schema', 'pg_catalog');
           table_name
--------------------------------
 cis_kv_providers
 cis_kv_keyvalue
 vpx_alarm_expression
 vpx_alarm_expr_comp
 vpx_alarm
...

SELECT tablename,  pg_size_pretty(pg_relation_size(schemaname || '.' || tablename))
FROM pg_tables WHERE schemaname = 'vc';

           tablename            | pg_size_pretty
--------------------------------+----------------
 cis_kv_providers               | 8192 bytes
 cis_kv_keyvalue                | 8192 bytes
 vpx_alarm_expression           | 112 kB
 vpx_alarm_expr_comp            | 48 kB
 vpx_alarm                      | 104 kB
 vpx_alarm_runtime              | 8192 bytes


Tuesday, October 18, 2016

postgresql: Confusing piece of search_path in vcdb.


For some reason, every time I need to look at the tables "vc" owns, even when logged in as "vc", I have to run the following to find objects in the db. This isn't causing much of a problem interactively, but when connecting with Python psycopg2, it will not find the tables.


set search_path to public, vpx;



vcdb=> select usename, useconfig from pg_user
vcdb-> ;
 usename  |      useconfig
----------+----------------------
 postgres |
 vc       | {search_path=public}
(2 rows)


No matter how I changed the search_path in postgresql.conf, a fresh login just wouldn't let me find my own tables. Finally, I decided to just set the search_path at the database level.


vcdb=> alter database vcdb set search_path to public, vpx, vc, postgres;
ALTER DATABASE
vcdb=> \d
No relations found.
vcdb=> \c vcdb
psql (9.4.9, server 9.2.9)
You are now connected to database "vcdb" as user "vc".
vcdb=> \d
                     List of relations
 Schema |              Name              |   Type   | Owner
--------+--------------------------------+----------+-------
 vpx    | vpx_access                     | table    | vc
 vpx    | vpx_access_seq                 | sequence | vc
 vpx    | vpx_alarm                      | table    | vc
 vpx    | vpx_alarm_action               | table    | vc
 vpx    | vpx_alarm_disabled_actions     | table    | vc
 vpx    | vpx_alarm_expr_comp            | table    | vc
 vpx    | vpx_alarm_expression           | table    | vc
 vpx    | vpx_alarm_repeat_action        | table    | vc
 vpx    | vpx_alarm_runtime              | table    | vc
 vpx    | vpx_alarm_seq                  | sequence | vc
 vpx    | vpx_alarm_state                | table    | vc
 vpx    | vpx_ansfile_seq                | sequence | vc
 vpx    | vpx_array_seq                  | sequence | vc
 vpx    | vpx_binary_data                | table    | vc
 vpx    | vpx_binary_data_seq            | sequence | vc
 vpx    | vpx_bulletin_operation         | table    | vc
 vpx    | vpx_change_tag                 | table    | vc
 vpx    | vpx_change_tag_seq             | sequence | vc
 vpx    | vpx_compliance_status          | table    | vc
 vpx    | vpx_compliance_status_seq      | sequence | vc
 vpx    | vpx_compute_res_failover_host  | table    | vc
 vpx    | vpx_compute_res_user_hb_ds     | table    | vc
 vpx    | vpx_compute_resource           | table    | vc
 vpx    | vpx_compute_resource_das_vm    | table    | vc
 vpx    | vpx_compute_resource_dpm_host  | table    | vc
vcdb=>


postgres=# SELECT r.rolname, d.datname, rs.setconfig
postgres-# FROM   pg_db_role_setting rs
postgres-# LEFT   JOIN pg_roles      r ON r.oid = rs.setrole
postgres-# LEFT   JOIN pg_database   d ON d.oid = rs.setdatabase;
 rolname | datname |                 setconfig
---------+---------+-------------------------------------------
 vc      |         | {"search_path=public, vpx"}
         | vcdb    | {"search_path=public, vpx, vc, postgres"}
(2 rows)




Voila, everybody can access now!