Posted by David Z in Highgo Software on 2020-10-23 at 22:51

1. Overview

Previously, we discussed the MAIN fork file and the corresponding extension in Heap file and page in details. This blog will explain a little more about the Free Space Mapping file and its corresponding extension.

2. What is a Free Space Mapping file

A Free Space Mapping (FSM) file keeps track of the free space available in each page of a relation. Within each FSM is a binary tree, stored in an array with one byte per node. The FSM is used to quickly locate a page with enough free space to hold a record during insertion, and it is updated during insertions and vacuums on the table.
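For a hands-on look at what the FSM records, the contrib extension pg_freespacemap (shipped with PostgreSQL, though not covered in this post) exposes the per-page free space, for example for the orders1 table created in the example below:

```sql
-- Requires the pg_freespacemap contrib extension
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

-- One row per page: block number and the free space the FSM has recorded for it
SELECT * FROM pg_freespace('orders1');
```

Since the FSM stores one byte per page node, the reported sizes are rounded to a granularity of 1/256th of the page size.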

Starting from PostgreSQL 8.4, each relation has its own extensible FSM file stored on disk, alongside the relation's main fork in the database directory. Continuing the previous example,

postgres=# create table orders1 (id int4, test text) using heap;
postgres=# insert into orders1 values(1, 'hello world!');

You will see the heap files under the database folder like below,

$ ls -ltrh /home/user/pgdata/base/12705/16384*
-rw------- 1 user user  24K Oct  8 14:33 /home/user/pgdata/base/12705/16384_fsm
-rw------- 1 user user 8.0K Oct  8 14:34 /home/user/pgdata/base/12705/16384_vm
-rw------- 1 user user 512K Oct  8 14:34 /home/user/pgdata/base/12705/16384

Where 16384 is the table orders1’s oid, and 16384_fsm is the corresponding Free Space Mapping file.
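The file name can also be obtained directly from SQL with the built-in pg_relation_filepath function:

```sql
-- Path of the main fork relative to the data directory;
-- the FSM fork uses the same name with an _fsm suffix
SELECT pg_relation_filepath('orders1');
```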

3. How the Free Space Mapping files are managed

Before PostgreSQL 8.4, the FSM was maintained in memory and had a fixed size. Now, all FSM files are on disk and can be extended in the same way as heap segment files. The FSM files are updated during insertions and by the Vacuum process, whether it runs periodically or is triggered manually. In some conditions the FSM may need to be rebuilt, for example if a leaf node’s value is greater than the root node’s, or if there is a truncate operation.

The FSM file is mainly used to quickly locate a page for an insert operation, and in the meantime it helps speed up insertion by trying to insert records in


Today we are going to walk through some of the preliminary data shaping steps in data science using SQL in Postgres. I have a long history of working in data science, including my Master's degree (in Forestry) and Ph.D. (in Ecology), and during this work I would often get raw data files that I had to get into shape before running analyses.

In the first part of this blog series, I’ve presented a couple of benchmark results showing how PostgreSQL OLTP performance changed since 8.3, released in 2008. In this part I plan to do the same thing but for analytical / BI queries, processing large amounts of data. There’s a number of industry benchmarks for testing […]

In our previous blog post, we talked about upgrading a PostgreSQL cluster to a new major version with pg_upgrade. This can be fast and with little downtime even for large databases, but in some instances zero downtime may be essential for doing a major upgrade. While logical replication can enable a zero-downtime upgrade, there are still some things to consider. For some hands-on experience, you can check out this learning module on logical replication.

Posted by Ian Barwick on 2020-10-20 at 17:31

There have been a lot of articles about new features in the recent PostgreSQL 13 release, but one set of improvements which seems to have escaped much attention is in pg_rewind, which has gained some additional useful functionality deserving of closer examination.

New option "-R / --write-recovery-conf"

As the name implies, this option causes pg_rewind to write recovery (replication) configuration (and additionally the "standby.signal" file) after rewinding a server, in exactly the same way that pg_basebackup does. This is useful, for example, when restoring a former primary to become a standby. Note that it can only be used in conjunction with the option --source-server, and (unless the --dry-run option is supplied) that the configuration will be written regardless of whether a rewind is needed.

As replication configuration via the recovery.conf file was removed in PostgreSQL 12, the generated replication configuration is actually appended to postgresql.auto.conf, which is the one configuration file which is guaranteed to be in a known location (the data directory) and which will be read last when PostgreSQL starts up, ensuring any prior replication configuration is overridden by whatever pg_rewind has written.

Currently the only replication configuration item actually written is primary_conninfo - be aware that this is generated from a blend of the connection information passed in --source-server and the default parameters generated by libpq, which may be different from the primary_conninfo string you would normally generate, and will look something like this:

primary_conninfo = 'user=postgres passfile=''/var/lib/pgsql/.pgpass'' channel_binding=prefer host=node1 port=5432 sslmode=prefer sslcompression=0 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres target_session_attrs=any'

Note that currently replication slot configuration (primary_slot_name) cannot be generated by pg_rewind.

Also be aware that if attempting to rewind a standby which was not shut down prop

Posted by Hans-Juergen Schoenig in Cybertec on 2020-10-20 at 07:30

To dig a bit deeper into zheap and PostgreSQL storage technology in general, I decided to provide some more empirical information about space consumption. As stated in my previous blog post about zheap, it is more efficient in terms of storage consumption. The reasons are:

  • The tuple header is much smaller
  • Alignment has been improved

The question is: while those theoretical statements are true, one wants to know what they mean in a real-world scenario. This blog will shed some light on this question and give you some more empirical insights as to what is to be expected as soon as zheap is production-ready (which it is NOT as of October 2020).

Creating sample data for PostgreSQL heap and zheap

To show the differences in storage consumption I have created some sample data. To make it fair I have first added data to a temporary table which is in memory. This way there are no undesired side effects:

test=# SET temp_buffers TO '1 GB';
test=# CREATE TEMPORARY TABLE raw AS
SELECT  id, 
        hashtext(id::text) AS name, 
        random() * 10000 AS n, true AS b 
FROM generate_series(1, 10000000) AS id;
SELECT 10000000

10 million rows will roughly translate to half a gigabyte of data:

test=# \d+
                            List of relations
  Schema   | Name | Type  | Owner | Persistence |  Size  | Description
-----------+------+-------+-------+-------------+--------+-------------
 pg_temp_5 | raw  | table | hs    | temporary   | 498 MB |
(1 row)

A standard temporary table is absolutely fine for our purpose.

Populating a zheap table

One of my favorite features in PostgreSQL is CREATE TABLE … LIKE …. It allows you to quickly create identical tables. This feature is especially useful if you want to clone a table containing a large number of columns and you don’t want to list them all, manually create all indexes, etc.
Copying the data from “raw” into a normal heap table takes around 7.5 seconds:

test=# \timing
Timing is on.
test=# CREATE TABLE h1 (LIKE raw) USING heap;
test=# INSERT INTO h1 SELECT * FROM raw;
Time: 7.836 

When inserting data into a table, you can specify that the value of a field is null. For example, for a table t (i int, j int, k int), we can execute insert into t values (8,1,6), or insert into t values (3, null, 7) to insert a record with a null value. This blog will explore some technical details of storing null values in PostgreSQL.

The issue: how PG stores null values

A tuple in PG is divided into a TupleHead part and a data part. Two examples of inserting data into table ‘t’ were mentioned above. Here, a picture is used to pose the question: how is a null value represented in the tuple?


As shown in the figure, an int data type can occupy 4 bytes of space without a null value. But in the case of null value, how should the database indicate that there is a null between 3 and 7?

Storage method of null value in PG

We can use the pageinspect extension to observe how null values are stored. The following tests are performed:

postgres=# create table t(i int, j int, k int);
postgres=# insert into t values(8,1,6);
postgres=# insert into t values(3,NULL,7);
postgres=# insert into t values(4,9,2);
postgres=# select lp,t_infomask,t_bits,t_data  from  heap_page_items(get_raw_page('t', 0));
 lp | t_infomask |  t_bits  |           t_data           
----+------------+----------+----------------------------
  1 |       2048 |          | \x080000000100000006000000
  2 |       2049 | 10100000 | \x0300000007000000
  3 |       2048 |          | \x040000000900000002000000
(3 rows)


As you can see, null values are not marked in the data area.


At the same time, we notice that the ‘t_infomask’ and ‘t_bits’ values of the second row, which contains a null, differ from those of the first and third rows. This, therefore, may be the secret of reading the null value.

  1. When (t_infomask & HEAP_HASNULL) is true, the tuple contains at least one null column.

  2. The ‘t_bits’ field is a bitmap with one bit per column, marking which columns of the tuple are null (0) or not null (1).

Posted by Andreas 'ads' Scherbaum on 2020-10-19 at 14:00
PostgreSQL Person of the Week Interview with Damien Clochard: I’m Damien Clochard, I live in France and I’m the president of PostgreSQLFr, the French Speaking PostgreSQL association. I’m also one of the founders of Dalibo, a French PostgreSQL company, and one of the sysadmins behind the platform.
Posted by Michał Mackiewicz on 2020-10-19 at 10:28

In contrast to physical (streaming) replication, write access on subscriber node is not disabled. This means that DML and DDL statements like INSERT, UPDATE, DELETE or ALTER can be executed on subscriber. As the replication is unidirectional, however, these changes won’t be replicated back to publisher. This behavior can lead to problems – the replication can stop with all its consequences or the data can become out of sync. In this post, I’d like to share some tips how to deal with such problems.

Scenario 1: data INSERTed on subscriber side

Data can be inserted into a replication target table without any problems, provided that a row with identical REPLICA IDENTITY won't be subsequently inserted on the publisher node. For example, consider the following table:

CREATE TABLE mydata(id integer primary key, value integer, description varchar);

Its REPLICA IDENTITY will use the “id” primary key, which is the default.
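If you are unsure what identity a table uses, the setting can be checked in pg_class, and changed when no suitable key exists (FULL is a fallback that works without a key, at the cost of larger WAL records):

```sql
-- 'd' = default (primary key), 'f' = full row, 'i' = index, 'n' = nothing
SELECT relreplident FROM pg_class WHERE relname = 'mydata';

-- Fall back to full-row identity if the table has no usable key
ALTER TABLE mydata REPLICA IDENTITY FULL;
```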

You can safely execute the following INSERT on subscriber:

INSERT INTO mydata VALUES (100, 1024, 'test');

…but if you execute an INSERT with identical primary key value on the publisher,

INSERT INTO mydata VALUES (100, 65536, 'test on publisher');

you will get an error on subscriber node:

ERROR: duplicate key value violates unique constraint "mydata_pkey"
DETAIL: Key (id) = (100) already exists.

The replication will stop, and WAL segments will start to pile up on the publisher. Luckily, the log contains all the required information: you have to DELETE the problematic row on the subscriber:

DELETE FROM mydata WHERE id=100;

and the replication will restart automatically.

Scenario 2: data UPDATEd on subscriber side

As long as required privileges are granted, UPDATE statements can be executed on subscriber without any error or warning. The replication will not be halted. The publisher doesn’t know about the change, and as the result, the values in the same row can differ between publisher and subscriber. But if the same row gets updated on publisher, the whole row will be


With Postgres 13 released recently, a new season of testing has begun! I recently wrote about the impact of BTree index deduplication, finding that improvement is a serious win. This post continues looking at Postgres 13 by examining performance through a few steps of an OpenStreetMap data (PostGIS) workflow.

Reasons to upgrade

Performance appears to be a strong advantage to Postgres 13 over Postgres 12. Marc Linster wrote there's "not one headline-grabbing feature, but rather a wide variety of improvements along with updates to previously released features." I am finding that to be an appropriate description. At this point I intend to upgrade our servers for the improved performance, plus a few other cool benefits.

Temporary tables have been around forever and are widely used by application developers. However, there is more to temporary tables than meets the eye. PostgreSQL allows you to configure the lifespan of a temporary table in a nice way and helps to avoid some common pitfalls.


By default, a temporary table will live as long as your database connection. It will be dropped as soon as you disconnect. In many cases this is the behavior people want:

tmp=# CREATE TEMPORARY TABLE x (id int);
tmp=# \d
        List of relations
  Schema   | Name | Type  | Owner 
-----------+------+-------+-------
 pg_temp_3 | x    | table | hs
(1 row)

tmp=# \q
iMac:~ hs$ psql tmp
psql (12.3)
Type "help" for help.

tmp=# \d
Did not find any relations.

Once we have reconnected, the table is gone for good. Also, keep in mind that the temporary table is only visible within your session. Other connections are not going to see the table (which is, of course, the desired behavior). This also implies that many sessions can create a temporary table having the same name.

However, a temporary table can do more. The most important thing is the ability to control what happens on commit:


There are three options: ON COMMIT PRESERVE ROWS, ON COMMIT DELETE ROWS, and ON COMMIT DROP. “PRESERVE ROWS” is the default and the behavior you have just witnessed. Sometimes you don’t want that. It is therefore also possible to empty a temporary table on commit:

tmp=# CREATE TEMPORARY TABLE x (id int) ON COMMIT DELETE ROWS;
tmp=# BEGIN;
tmp=# INSERT INTO x SELECT * FROM generate_series(1, 5) AS y;
tmp=# SELECT * FROM x;
(5 rows)

tmp=# COMMIT;
tmp=# SELECT * FROM x;
(0 rows)

In this case, PostgreSQL simply leaves us with an empty table as soon as the transaction ends. The table itself is still around and can be used.

Let us drop the table for now:

tmp=# DROP TABLE x;

Sometimes you want the entire table to be gone at the end of the transaction: “ON COMMIT DROP”

Posted by Scott Mead on 2020-10-15 at 20:12

Database applications are living, [(sometimes) fire-]breathing systems that behave in unexpected ways. As a purveyor of the pgCraft, it’s important to understand how to interrogate a Postgres instance and learn about the workload. This is critical for lots of reasons:

  1. Understanding how the app is using the database
  2. Understanding what risks there are in the data model
  3. Designing a data lifecycle management plan (i.e. partitions, archiving)
  4. Learning how the ORM is behaving towards the database
  5. Building a VACUUM strategy

There’s lots of other reasons this data is useful, but let’s take a look at some examples and get down to a few scripts you can use to pull this together into something useful.

First, take a visit to the pgCraftsman’s toolbox to find an easy-to-use snapshot script. This script is designed to be completely self-contained. It will run at whatever frequency you’d like and will save snapshots of the critical monitoring tables right inside your database. There’s even a few reporting functions included to help you look at stats over time.

What to Watch

There’s a number of critical tables and views to keep an eye on in the Postgres catalog, this isn’t an exhaustive list, but a quick set that the toolbox script already watches.

  • pg_stat_activity
  • pg_locks
  • pg_stat_all_tables
  • pg_statio_all_tables
  • pg_stat_all_indexes
  • pg_stat_database

These tables and views provide runtime stats on how your application is behaving in regards to the data model. The problem with many of these is that they're either point-in-time (like pg_stat_activity) or cumulative (pg_stat_all_tables.n_tup_ins contains the cumulative number of inserts since pg_stat_database.stats_reset). In order to glean anything useful from these runtime performance views, you should be snapshotting them periodically and saving the results.
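As a minimal sketch of the snapshotting idea (the table and column choice here are illustrative, not the actual toolbox script):

```sql
-- First run: create a snapshot table with a timestamp column
CREATE TABLE snap_stat_tables AS
SELECT now() AS snap_time, relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables;

-- Subsequent runs (e.g. from cron): append a new snapshot
INSERT INTO snap_stat_tables
SELECT now(), relname, n_tup_ins, n_tup_upd, n_tup_del
FROM pg_stat_all_tables;
```

Deltas between rows with consecutive snap_time values then give per-interval rates instead of lifetime totals.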

I’ve seen (and built) lots of interesting ways to do this over the years, but the simplest way to generate some quick stats over time is with the P

Posted by Alexey Lesovsky in Data Egret on 2020-10-15 at 17:38

New shiny Postgres 13 has been released, and now it's time to make some updates to the “Postgres Observability” diagram.

New release includes many improvements related to monitoring, such as new stats views and new added fields to existing views. Let’s take a closer look at these.

The list of progress views has been extended with two new views. The first one is “pg_stat_progress_basebackup”, which helps to observe running base backups and estimate their progress, ETA and other properties. The second one is “pg_stat_progress_analyze” which, as the name suggests, watches over running ANALYZE operations.

The third new view is called “pg_shmem_allocations”; it is intended for deeper inspection of how shared memory is allocated (including, but not limited to, shared buffers).

The fourth and last new view, “pg_stat_slru”, is related to the inspection of SLRU caches. Both of these recently added views help to answer the question “How does Postgres spend its allocated memory?”
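A quick way to poke at both views (column names as of Postgres 13):

```sql
-- Largest consumers of shared memory
SELECT name, allocated_size
FROM pg_shmem_allocations
ORDER BY allocated_size DESC
LIMIT 5;

-- Hit/read counts per SLRU cache
SELECT name, blks_hit, blks_read FROM pg_stat_slru;
```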

Other improvements are general-purpose and related to the existing views.

“pg_stat_statements” has a few modifications:

  • New fields related to planning time have been added, and because of this the existing “time” fields have been renamed to refer to execution time. All monitoring tools that rely on pg_stat_statements should be adjusted accordingly. 
  • New fields related to WAL have been added – now it’s possible to understand how much WAL has been generated by each statement.
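For example, using the Postgres 13 column names, the statements generating the most WAL can be listed like this:

```sql
-- total_time was split into total_plan_time and total_exec_time in Postgres 13;
-- wal_records, wal_fpi and wal_bytes are also new
SELECT query, total_plan_time, total_exec_time, wal_bytes
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 5;
```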

WAL usage statistics have also been added to EXPLAIN (via the new WAL keyword), auto_explain and autovacuum. WAL usage stats are appended to the logs (if log_autovacuum_min_duration is enabled).

Pg_stat_activity has a new column “leader_pid”, which shows the PID of the parallel group leader and helps to explicitly identify background workers with their leader.

A huge thank you goes to many who contributed to this new release, among which are my colleagues Victor Yegorov and Sergei Kornilov and also those who help to spread the word about Postgres to other communi


As a database grows and scales up from a proof of concept to a full-fledged production instance, there are always a variety of growing pains that database administrators and systems administrators will run into.

Very often, the engineers on the Crunchy Data support team help support enterprise projects which start out as small, proof of concept systems, and are then promoted to large scale production uses.  As these systems receive increased traffic load beyond their original proof-of-concept sizes, one issue may be observed in the Postgres logs as the following:

LOG:  checkpoints are occurring too frequently (9 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".
LOG:  checkpoints are occurring too frequently (2 seconds apart)
HINT:  Consider increasing the configuration parameter "max_wal_size".
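One hedged first response, as the HINT suggests (the right value depends on your disk headroom and recovery-time goals; 4GB below is purely illustrative), is to raise max_wal_size, which can be changed with a reload rather than a restart:

```sql
-- Spread checkpoints further apart by allowing more WAL between them
ALTER SYSTEM SET max_wal_size = '4GB';
SELECT pg_reload_conf();
```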

I did an interview with EDB recently, and a blog post based on that interview was published yesterday. It covers the Postgres 13 feature set and the effects of open source on the software development process.

Is your database growing at a rapid rate? Does your database system slow down all the time? And maybe you have trouble understanding why this happens? Maybe it is time to take a look at pg_squeeze and fix your database once and for all. pg_squeeze has been designed to shrink your database tables without downtime. No more need for VACUUM FULL – pg_squeeze has it all.

The first question any PostgreSQL person will ask is: why not use VACUUM or VACUUM FULL? There are various reasons: a normal VACUUM does not really shrink the table on disk. Normal VACUUM will look for free space, but it won’t return this space to the operating system. VACUUM FULL does return space to the operating system, but it needs a table lock. In case your table is small this usually does not matter. However, what if your table is many TBs in size? You cannot simply lock up a large table for hours just to shrink it after table bloat has ruined performance. pg_squeeze can shrink large tables using only a small, short lock.

However, there is more. The following listing contains some of the operations pg_squeeze can do with minimal locking:

  • Shrink tables
  • Move tables and indexes from one tablespace to another
  • Index organize (“cluster”) a table
  • Change the on-disk FILLFACTOR

After this basic introduction it is time to take a look and see how pg_squeeze can be installed and configured.

PostgreSQL: Installing pg_squeeze

pg_squeeze can be downloaded for free from our GitHub repository. However, binary packages are available for most Linux distributions. If you happen to run Solaris, AIX, FreeBSD or some other less widespread operating system, just get in touch with us. We are eager to help.

After you have compiled pg_squeeze or installed the binaries some changes have to be made to postgresql.conf:

wal_level = logical
max_replication_slots = 10 # minimum 1
shared_preload_libraries = 'pg_squeeze'

The most important thing is to set the wal_level to logical. Internally pg_squeeze works as follows: It crea

Posted by rob sullivan on 2020-10-13 at 00:00

As your database keeps growing, there is a good chance you're going to have to address database bloat.
While Postgres 13 has launched with some exciting features with built-in methods to rebuild indexes concurrently, many people still end up having to use pg_repack to do an online rebuild of the tables to remove the bloat. Customers on AWS RDS struggle figuring out how to do this. Ready to learn how?

Since you have no server to access the local binaries, and because AWS RDS provides no binaries for the versions they are using, you're going to have to build your own. This isn't as hard as one might think, because the official pg repos have an installer (ie: sudo apt install postgresql-10-pg_repack). If you don't use the repos, the project itself is an open source project with directions:

While you were getting up to speed above, I was spinning up a postgres 10.9 db on RDS. I started it yesterday so that it would be ready by the time you got to this part of the post. Let's create some data:

-- let's create the table
CREATE TABLE burritos (
id SERIAL UNIQUE NOT NULL primary key,
title VARCHAR(10) NOT NULL,
toppings TEXT NOT NULL,
thoughts TEXT,
code VARCHAR(4) NOT NULL,
UNIQUE (title, toppings)
);

--disable auto vacuum
ALTER TABLE burritos SET (autovacuum_enabled = false, toast.autovacuum_enabled = false);

-- orders up
INSERT INTO burritos (title, toppings, thoughts, code)
SELECT
left(md5(i::text), 10),
md5(random()::text),
md5(random()::text),
left(md5(random()::text), 4)
FROM GENERATE_SERIES(1, 1000000) s(i);

UPDATE burritos SET toppings = md5(random()::text) WHERE id < 250;
UPDATE burritos SET toppings = md5(random()::text) WHERE id between 250 and 500;
UPDATE burritos SET code = left(md5(random()::text), 4) WHERE id between 2050 and 5000;
UPDATE burritos SET thoughts = md5(random()::text) WHERE id between 10000 and 20000;
UPDATE burritos SET thoughts = md5(random()::text) WHERE id between 800000 and 900000;
UPDATE burritos SET toppings = md5(random()::text) WHERE id between 600000

We've seen an example of how to set up PostgreSQL monitoring in Kubernetes. We've looked at two sets of statistics to keep track of in your PostgreSQL cluster: your vitals (CPU/memory/disk/network) and your DBA fundamentals.

While staring at these charts should help you to anticipate, diagnose, and respond to issues with your Postgres cluster, the odds are that you are not staring at your monitor 24 hours a day. This is where alerts come in: a properly set up alerting system will let you know if you are on the verge of a major issue so you can head it off at the pass (and alerts should also let you know that there is a major issue).

Dealing with operational production issues was a departure from my application developer roots, but I looked at it as an opportunity to learn a new set of troubleshooting skills. It also offered an opportunity to improve communication skills: I would often convey to the team and customers what transpired during a downtime or performance degradation situation (VSSE: be transparent!). Some of what I observed I used to help us improve the application, while other parts helped me to better understand how PostgreSQL works.

But I digress: let's drill into alerts on your Postgres database.

Note that just because an alert or alarm is going off, it does not mean you need to immediately react: for example, a transient network degradation issue may cause a replica to lag further behind a primary for a bit too long but will clear up when the degradation passes. That said, you typically want to investigate the alert to understand what is causing it.

Additionally, it's important to understand what actions you want to take to solve the problem. For example, a common mistake during an "out-of-disk" error is to delete the PostgreSQL WAL logs with a rm command; doing so can lead to a very bad day (and is also an advertisement for ensuring you have backups).

As mentioned in the post on setting up PostgreSQL monitoring in Kubernetes, the Postgres Operator uses pgMonito

Posted by Andreas 'ads' Scherbaum on 2020-10-12 at 14:00
PostgreSQL Person of the Week Interview with Greg Sabino Mullane: My name is Greg Sabino Mullane, and I live in Pennsylvania (USA). I have been working remotely a long time - since before it became necessary this year - as a programmer and a DBA (database administrator).
Posted by Paul Ramsey in PostGIS on 2020-10-12 at 08:00

Here in the Covid-times, I haven’t been able to keep up my previous schedule of speaking at conferences, but I have managed to participate in a couple of episodes of the MapScaping Podcast, hosted by Daniel O’Donohue.


Daniel is a great interviewer and really puts together a tight show. So far I’ve been on two, and I quietly hope to join him again some time in the future.

Posted by Jędrzej Biedrzycki on 2020-10-11 at 18:39

0. The problem

When I hear the term fuzzy searching, I think of high computation cost, but this is not always the case. In the great book Art of Computer Programming by D. Knuth, vol. 3, ch. 6: Searching, we can read about the soundex algorithm. It was originally used to deal with misspelled English surnames. Other usages could include, for example:

  • Validation of user input data upon registration against the database. Maybe this user already exists, but they misspelled their name. Maybe there’s a typo in the name of the city, etc.
  • Clearing out the meaning or finding the name during speech recognition. Some words are spelled similarly and it is easy to confuse them.

1. Soundex

The soundex algorithm takes a string as input and produces a 4-character code as the result. It is obvious that this method is faster than computing the distance between each value and the reference. The question is – how much faster? It turns out Postgres already has a package for fuzzy searching that includes everything we need. And even more…

2. Benchmark

I have used the sakila database to conduct the benchmarks. The customer table I wanted to use is small – 599 rows. To make it more significant I have created a new table and populated it with all the names cross-joined with the surnames from the customer table:

create table huge_cust
(
    id         bigint primary key,
    first_name character varying,
    last_name  character varying
);

insert into huge_cust
select nextval('customer_customer_id_seq'), c1.first_name, c2.last_name
from customer c1
         cross join customer c2;

Now the table huge_cust has 358801 rows, which is simply 599 squared.

To use the fuzzy search package we need to create an extension:

create extension if not exists fuzzystrmatch;
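Once the extension is in place, a quick sanity check shows the idea: similar-sounding surnames collapse to the same code.

```sql
-- Classic soundex example: both names yield the code S530
SELECT soundex('Smith'), soundex('Smythe');

-- difference() counts how many of the 4 code positions match (0-4)
SELECT difference('Smith', 'Smythe');
```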

Besides the soundex and levenshtein methods, the package contains an implementation of yet another algorithm that turns any string into a code – metaphone. The levenshtein method mentioned computes the Levenshtein distance between tw


Postgres is an amazing RDBMS implementation. Postgres is open source and it’s one of the most standard-compliant SQL implementations that you will find (if not the most compliant.) Postgres is packed with extensions to the standard, and it makes writing and deploying your applications simple and easy. After all, Postgres has your back and manages all the complexities of concurrent transactions for you.

In this post I am excited to announce that a new version of pg_auto_failover has been released, pg_auto_failover 1.4.

pg_auto_failover is an extension to Postgres built for high availability (HA), that monitors and manages failover for Postgres clusters. Our guiding principles from day one have been simplicity and correctness. Since pg_auto_failover is open source, you can find it on GitHub and it’s easy to try out. Let’s walk through what’s new in pg_auto_failover, and let’s explore the new capabilities you can take advantage of.

Fault Tolerance and Graceful Degradation

As soon as your application depends on Postgres in production, many of you want to implement a setup that is resilient to the most common failures we know about. That’s what Fault Tolerance is all about:

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components. If its operating quality decreases at all, the decrease is proportional to the severity of the failure, as compared to a naively designed system, in which even a small failure can cause total breakdown. Fault tolerance is particularly sought after in high-availability or life-critical systems. The ability of maintaining functionality when portions of a system break down is referred to as graceful degradation.

If you are running your application in the cloud and you’re using a managed database service—such as Azure Database for PostgreSQL—then you’re probably relying on your database service provider to implement HA for you. But what if you are ma


I am an accidental DBA, with a huge emphasis on "accidental." I came to PostgreSQL as an application developer who really liked to program with SQL and use the database to help solve my problems. Nonetheless, these systems would enter into production, and as such I had to learn to support them.

PostgreSQL monitoring and performance optimization is a vast topic. In fact, I'll read content like what my colleague Greg Smith wrote on benchmarking PostgreSQL 13 on Ubuntu and be reminded that I have much more to learn! There's no way a few blog posts on the topic can do justice to it, but I want to try and share some of the insights I gleaned as the application developer thrust into an operations role.

In the previous post, I talked about how I could learn a lot from monitoring my Postgres system just by looking at the vital statistics: CPU, memory, disk, and network utilization. These stats about resource utilization were helpful as a starting point: they could indicate what part of the system was having an issue, but maybe not exactly what the problem is.

This is where the fundamental DBA statistics come into play.

These statistics provide an overall look at PostgreSQL database activity and can help you spot problems with performance and availability, and can provide "early warnings" before "bad things" start to happen.

So, what do these metrics tell you, and how can you use them to successfully manage your PostgreSQL database?

For these examples, we are using the monitoring stack powered by pgMonitor that can be installed alongside the Postgres Operator.

Posted by David Z in Highgo Software on 2020-10-09 at 23:56

1. Overview

PostgreSQL is a great open source database, and many users choose it because of the efficiency of its central algorithms and data structures. As a software developer, I was always curious about how each part was done, such as the physical file storage. The reason is that I always see a lot of files and folders created after a simple initdb command. For example,

$ ls -ltr /home/user/pgdata
drwx------ 4 user user  4096 Oct  8 13:38 pg_logical
drwx------ 5 user user  4096 Oct  8 13:38 base
drwx------ 2 user user  4096 Oct  8 13:38 global
...
-rw------- 1 user user     3 Oct  8 13:38 PG_VERSION

I can run a simple command like the one below to check the file PG_VERSION, but what about the rest of the files and folders?

$ cat /home/user/pgdata/PG_VERSION 

In this blog, I am going to share what I learned about the heap files under base folder.

2. How are the Heap files created?

To explain this, let’s take a look at two key data structures.

  • RelFileNode defined in src/include/storage/relfilenode.h.
typedef struct RelFileNode
{
  Oid     spcNode;    /* tablespace */
  Oid     dbNode;     /* database */
  Oid     relNode;    /* relation */
} RelFileNode;

Here, spcNode is the tablespace oid, dbNode is the database oid, and relNode is the table oid. The table oid will be used as the file name to create the corresponding heap file. However, the heap files will not be created until you insert the first record into the table. For example,

postgres=# create table orders1 (id int4, test text) using heap;
postgres=# insert into orders1 values(1, 'hello world!');

You will see the heap files under the database folder like below,

$ ls -ltrh /home/user/pgdata/base/12705/16384*
-rw------- 1 user user  24K Oct  8 14:33 /home/user/pgdata/base/12705/16384_fsm
-rw------- 1 user user 8.0K Oct  8 14:34 /home/user/pgdata/base/12705/16384_vm
-rw------- 1 user user 512K Oct  8 14:34 /home/user/pgdata/base/12705/16384

Where 16384 is the table orders1’s oid. To verify it, there is a built-in fu
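Putting the pieces above together: for the default tablespace, a heap file lives at base/<database oid>/<table oid>, and once a relation outgrows the segment size, additional segment files get numbered suffixes. The following is a minimal illustrative sketch (Python, not PostgreSQL code; the function name is made up for this post):

```python
def heap_file_path(db_oid: int, rel_oid: int, segment: int = 0) -> str:
    """Relative path of a heap file under the data directory,
    assuming the default tablespace (stored under base/).
    Segments after the first get a ".N" suffix."""
    name = str(rel_oid) if segment == 0 else f"{rel_oid}.{segment}"
    return f"base/{db_oid}/{name}"

print(heap_file_path(12705, 16384))     # base/12705/16384
print(heap_file_path(12705, 16384, 1))  # base/12705/16384.1
```

This matches the listing above, where database 12705 contains the table orders1 with oid 16384.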


The topic of caching appeared in PostgreSQL as far back as 22 years ago, and at that time the focus was on database reliability.

Fast forward to 2020: the disk platters are hidden even deeper inside virtualized environments, hypervisors, and associated storage appliances. Furthermore, interconnected, distributed applications operating at global scale demand low-latency connections, and suddenly tuning server caches and SQL queries competes with ensuring that results are returned to clients within milliseconds. Application-level and in-memory caches were born, and read query results are now saved close to the application servers. As a result, I/O operations are reduced to writes only, and network latency is dramatically improved. With one catch: implementations are responsible for their own cache management, which sometimes leads to performance degradation.

Caching writes is a much more complicated matter, as explained in the PostgreSQL wiki.

This blog is an overview of the in-memory query caches and load balancers that are being used with PostgreSQL.

PostgreSQL Load Balancing

The idea of load balancing was brought up at about the same time as caching, in 1999, when Bruce Momjian wrote:

[...] it is possible we may be _very_ popular in the near future.

The foundation for implementing load balancing in PostgreSQL is provided by the built-in Hot Standby feature. The only requirement is for the application to handle the failover, and this is where third-party solutions come in. We’ll look at some of those solutions in the next sections.

Load balanced queries can only return consistent results so long as the synchronous replication lag is kept low. In practice, even state of the art network infrastructure such as AWS may exhibit tens of milliseconds delays:

We typically observe lag times in the 10s of milliseconds. [...] However, under typical conditions, under a minute of replication lag is common. [...]


Posted by Michał Mackiewicz on 2020-10-09 at 11:07

Row level security is a great Postgres feature that allows granting privileges on selected rows only, without having to create additional data structures. A common setup is to add a column with users’ names, and a policy that compares this column’s value with CURRENT_USER:

CREATE TABLE issues(id integer generated always as identity primary key, title varchar, description varchar, status varchar, assignee varchar);


CREATE POLICY issue_rls ON issues FOR ALL TO public USING (assignee = CURRENT_USER);

And then, any user that isn’t a superuser or doesn’t have the BYPASSRLS privilege will see (and be able to edit) only rows with the relevant “assignee” value. This works well in an environment with only a handful of users. But as the number of users grows, it can become difficult to maintain: sometimes we’ll want to assign some rows to multiple users, or to a whole group of users. What to do then?

Enter the pg_has_role function. It checks whether a particular user has a given role. Its usage is quite simple:

SELECT pg_has_role('chinook', 'service', 'USAGE');

The first argument is the user name, the second is a role name, and the third is a bit tricky: ‘MEMBER’ determines whether the user is an actual member of the role or can execute SET ROLE to become one, while ‘USAGE’ checks for actual membership only. As users are in fact a specific case of roles with the LOGIN privilege (for very old Postgres versions, 8.0 and earlier, this isn’t true, but those versions don’t have row level security anyway), this function works equally well for individual users and group roles.

So, we’ll modify the policy as such:

DROP POLICY issue_rls ON issues;
CREATE POLICY issue_rls ON issues FOR ALL TO public USING (pg_has_role(current_user, assignee, 'USAGE'));

…and now the rows can be assigned to individual users or whole roles.

Posted by Egor Rogov in Postgres Professional on 2020-10-09 at 00:00

In the previous articles we discussed PostgreSQL indexing engine and the interface of access methods, as well as B-trees, GiST, SP-GiST, GIN, RUM, and BRIN. But we still need to look at Bloom indexes.


General concept

A classical Bloom filter is a data structure that enables us to quickly check membership of an element in a set. The filter is highly compact, but allows false positives: it can mistakenly report that an element is a member of the set (a false positive), but it is never permitted to report that an actual member is absent (no false negatives).

The filter is an array of m bits (also called a signature) that is initially filled with zeros. k different hash functions are chosen that map any element of the set to k bits of the signature. To add an element to the set, we need to set each of these bits in the signature to one. Consequently, if all the bits corresponding to an element are set to one, the element can be a member of the set, but if at least one bit equals zero, the element is not in the set for sure.
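The add/check mechanics described above can be sketched as a toy in Python (illustrative only; it assumes k hash functions derived by salting SHA-256, whereas PostgreSQL's bloom extension uses its own hashing):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-bit signature and k hash functions."""

    def __init__(self, m: int = 64, k: int = 3):
        self.m, self.k = m, k
        self.bits = 0  # the signature, initially all zeros

    def _positions(self, item: str):
        # Derive k bit positions by salting SHA-256 with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p  # set each corresponding bit to one

    def might_contain(self, item: str) -> bool:
        # All bits set -> possibly a member; any bit zero -> definitely not.
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))  # True: members are never reported absent
```

Note the asymmetry: a False answer is definitive, while a True answer may be a false positive, which is exactly why matches found by a Bloom index must be rechecked against the table.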

In the case of a DBMS, we actually have N separate filters, one built for each index row. As a rule, several fields are included in the index, and it is the values of these fields that compose the set of elements for each row.

By choosing the length of the signature m, we can find a trade-off between the index size and the probability of false positives. The application area for the Bloom index is large, considerably "wide" tables to be queried using filters on any of the fields. This access method, like BRIN, can be regarded as an accelerator of sequential scan: all the matches found by the index must be rechecked against the table, but there is a chance to avoid considering most of the rows at all.

Posted by Andres Freund in CitusData on 2020-10-08 at 18:30

One common challenge with Postgres for those of you who manage busy Postgres databases, and those of you who foresee being in that situation, is that Postgres does not handle large numbers of connections particularly well.

While it is possible to have a few thousand established connections without running into problems, there are some real and hard-to-avoid problems.

Since joining Microsoft last year in the Azure Database for PostgreSQL team—where I work on open source Postgres—I have spent a lot of time analyzing and addressing some of the issues with connection scalability in Postgres.

In this post I will explain why I think it is important to improve Postgres’ handling of large numbers of connections, followed by an analysis of the different aspects limiting connection scalability in Postgres.

In an upcoming post I will show the results of the work we’ve done to improve connection handling and snapshot scalability in Postgres—and go into detail about the identified issues and how we have addressed them in Postgres 14.

  1. Why connection scalability in Postgres is important
  2. Surveying connection scalability issues
  3. Memory usage
  4. Constant connection overhead
    1. Conclusion: connection memory overhead is acceptable
  5. Cache bloat
    1. Problem illustration
    2. Conclusion: cache bloat is not the major issue at this moment
  6. Query memory usage
  7. Snapshot scalability
    1. Cause
    2. Conclusion: Snapshot scalability is a significant limit
  8. Connection model & context switches
  9. Conclusion: Start by improving snapshot scalability in Postgres

Why connection scalability in Postgres is important

In some cases problems around connection scalability are caused by unfamiliarity with Postgres, broken applications, or other issues in the same vein. And as I already mentioned, some applications can have a


Watchdog is the high availability component of Pgpool-II. Over the past few releases watchdog has gotten a lot of attention from the Pgpool-II developer community and received lots of upgrades and stability improvements.

One of the weaker areas of the Pgpool-II watchdog is its configuration interface. A watchdog cluster requires quite a few config settings on each node, and it is very easy to get them wrong and hard to debug.

For example in a three-node Pgpool-II cluster, we normally require to configure the following parameters in each pgpool.conf

# Local node identification (Unique for each node)

# Node #1 endpoint

# Node #2 endpoint

# Local device setting for sending heartbeat

#Node #1 heartbeat endpoint

#Node #2 heartbeat endpoint

The main issue here is not the number of parameters. The issue is that the value of almost every one of these parameters differs for each Pgpool-II node, and that makes configuring, debugging, adding, and removing watchdog nodes difficult to manage.

One pgpool.conf for every node

When we were discussing the features for Pgpool-II 4.2 last year, we decided to make ease of use one of the priorities for the next release. Since the most difficult configuration belongs to the watchdog area, we decided to fix that first.

In the upcoming 4.2 version, Pgpool-II uses a unified watchdog configuration, and unlike previous versions, the same pgpool.conf file can now be used for every Pgpool-II node.

For the same three-node Pgpool-II cluster, the configuration file now needs to set the parameters below only once, and the same pgpool.conf file can be used on each node:

# Node #1 config

# Node #2 config

# Node #3 config

My professional background has been in application development with a strong affinity for developing with PostgreSQL (which I hope comes through in previous articles). However, in many of my roles, I found myself as the "accidental" systems administrator, where I would troubleshoot issues in production and do my best to keep things running and safe.

When it came to monitoring my Postgres databases, I initially took what I knew about monitoring a web application itself, i.e. looking at CPU, memory, and network usage, and used that to try to detect issues. In many cases, it worked: for instance, I could see a CPU spike on a PostgreSQL database and deduce that there was a runaway query slowing down the system.

Over time, I learned about other types of metrics that would make it easier to triage and mitigate PostgreSQL issues. Combined with what I learned as an accidental systems administrator, I've found they make a powerful toolkit that even helps with application construction.

To help with sharing these experiences, I set up a few PostgreSQL clusters with the PostgreSQL Operator monitoring stack. The Postgres Operator monitoring stack uses the Kubernetes support of pgMonitor to collect metrics about the deployment environment (e.g. Pods) and the specific PostgreSQL instances. I'll also add a slight Kubernetes twist to it, given there are some special considerations you need to make when monitoring Postgres on Kubernetes.

We'll start with my "go to" set of statistics, what I call "the vitals."

The Vital Statistics: CPU, Memory, Disk, and Network Utilization