Just so that it will be perfectly clear: the logs I have in mind are the ones for DBAs to read – with slow queries, errors, and other interesting information. So, how does one find them? If you can connect to PostgreSQL, you can get settings from there, and usually, you will not even need … Continue reading "Starting with Pg – where are logs?"

target_session_attrs used in real life - well, almost real
© Laurenz Albe 2021

PostgreSQL commit ee28cacf61 has added new options for the target_session_attrs connection parameter. This article explores how the new options can be used to simplify high availability and horizontal scaling solutions.

What is target_session_attrs?

The PostgreSQL client shared library libpq has support for connection strings to more than one database server:

psql ', port=5432,5432 dbname=mydb user=myself'

In the URL connection string format, that would look like

psql 'postgresql://,'

The client will first try to connect to the first server in the list, and if it cannot reach that server, it will try to connect to the second one.

This is useful in combination with high-availability solutions, because it allows the client to fail over to a secondary server without the need for additional load balancing software. However, there is one shortcoming: if we don’t know which of the servers is the currently active primary server, we might accidentally end up connecting to a standby server. This is usually not what you want, because you cannot run data modifying statements there.

To make up for this, you can specify the additional parameter “target_session_attrs”, which determines when a server is acceptable. If you specify

target_session_attrs=read-write

the PostgreSQL client will not accept a connection to a server where it cannot modify data. This allows you to connect to the streaming replication primary, regardless of the order of the servers in the connection string.
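For illustration, a complete connection string with this parameter might look as follows (the host names are placeholders, not from the original article):

psql 'host=host1.example.com,host2.example.com port=5432,5432 dbname=mydb user=myself target_session_attrs=read-write'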

Improved options for target_session_attrs in v14

PostgreSQL v14 introduces these new options:

  • read-only: only connect to a server that does not allow data modifications. This is the converse of read-write.
  • primary: only connect to a database that is not a streaming replication standby.
  • standby: only connect to a streaming replication standby.
  • prefer-standby: connect to a

1. Overview

PostgreSQL supports many System Information Functions, such as Session Information Functions, Access Privilege Inquiry Functions, Schema Visibility Inquiry Functions, System Catalog Information Functions, Transaction ID and Snapshot Information Functions, etc. However, you may want to build some special functions and integrate them into PostgreSQL. This blog walks through the steps for building your own System Information Functions into PostgreSQL.

2. Analyze the requirement

Since there are so many functions built into PostgreSQL already, you should perform some research and analysis before you decide to create a new one. In this blog, I want to check the transaction id after each savepoint in an ongoing transaction, so that I can perform some visibility checks before the whole transaction is committed. For example, I have a query like the one below, but I can’t figure out the transaction id after each savepoint using the existing System Information Functions.

postgres=# CREATE TABLE tbl (data text);
postgres=# BEGIN;
postgres=# INSERT INTO tbl VALUES('HelloWorld-1');
postgres=# SELECT txid_current();
 txid_current
--------------
          488
(1 row)

postgres=# SAVEPOINT AA;
postgres=# INSERT INTO tbl VALUES('HelloWorld-2');

postgres=# SELECT txid_current();
 txid_current
--------------
          488
(1 row)


As you can see, using the existing txid_current() function, I always get the same Transaction ID 488, even after a savepoint. For this reason, I decided to create my own system information function to retrieve the information I want.

3. System Information Function template

To create your own System Information Function, you need to check the System Catalogs to see in which category it can fit. For the case mentioned above, I chose the catalog pg_proc which stores information about functions, procedures, aggregate functions, and window functions. The document System Catalog Initial Data provides more deta


1. Introduction

As an experienced PostgreSQL user, you may have a lot of experience in setting up streaming replication in your database clusters to make multiple backups of your data. But have you wondered how the standby is able to correctly determine if a tuple sent from the primary should be visible to the user or not? In the case where a transaction contains multiple subtransactions, how does the standby determine visibility? You might say… well, it is PostgreSQL, so it will just work… This is true. If you are someone who is curious and interested in knowing how PostgreSQL does certain things internally, then this blog may be interesting for you, as we will talk about normal transactions and subtransactions and how both affect the standby’s ability to determine tuple visibility.

2. How is Tuple Visibility Determined?

PostgreSQL determines a tuple’s visibility based on the concept of the transaction ID (a.k.a. txid), which is normally stored in a tuple’s header as either t_xmin or t_xmax, where t_xmin holds the txid of the transaction that inserted the tuple and t_xmax holds the txid of the transaction that deleted or updated it. If a tuple has not been deleted, its t_xmax is 0. There is also another mechanism called the commit log (a.k.a. clog), which records the status of transactions. When an inserted tuple has a valid t_xmin and an invalid t_xmax, PostgreSQL first has to consult the clog to determine whether the txid stored in t_xmin is visible or not; the result is then stored in the tuple’s hint bits, so that the clog, which could be resource intensive to consult, does not have to be checked every time. If a tuple has a valid t_xmax, then there is no need to consult the clog, as the tuple must be invisible.

These are just the basics of how visibility is determined, but it is enough for me to continue with this blog. For more information about visibility, you can refer to this resource.
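As a minimal illustration (the table and values are mine, not the author's), the t_xmin and t_xmax values discussed above can be inspected through the hidden system columns of any table:

postgres=# CREATE TABLE vis_demo (val text);
postgres=# INSERT INTO vis_demo VALUES ('a');
postgres=# SELECT xmin, xmax, val FROM vis_demo;

A freshly inserted, never-updated row shows the inserting transaction's txid in xmin and 0 in xmax, matching the description above.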

3. What is a Subtransaction?

As the nam

Yaroslav Schekin (ysch) reported on IRC that Index Scans Backward do not display properly. After checking, I found out that if the explain is in JSON/YAML/XML format, the node type is changed to “Index Scan” (or “Index Only Scan” if it was originally “Index Only Scan Backward”). Some more debugging/diagnosing later, and I found the problem. Basically, … Continue reading "Fixed display of Backward scans on"
Recently landed in PostgreSQL, jsonb subscripting support doesn't look as exciting as some other improvements around jsonb. But its user-visible changes are only the tip of the iceberg. How many people were involved in making it, and what design choices were made? How long did it take, and what are the good/bad ideas for working on a patch?
In this post I will show you how I discovered that the backend was putting data into a new table (generate_plist) but the website was still using the ports.pkg_plist column. The result was stale data being shown on the website (for older ports) or no data at all (for newer ports). How long? Since Oct [...]

To make it easier for our readers to install PostgreSQL on Ubuntu, we have compiled a quick guide to getting started. Don’t worry, it’s not too hard. You will learn:

  • How to download PostgreSQL
  • How to install PostgreSQL on Ubuntu
  • How to create a database instance
  • How to ensure the service is running
  • How to create your first table

Note that this is an introduction. If you are looking to create a PostgreSQL High-Availability cluster, we recommend checking out some of our other content.

Preparing the system

For this demonstration I have created a virtual machine using VirtualBox on my local machine. We use Ubuntu 20.10 in this tutorial.
Installing Ubuntu on VirtualBox is easy. Simply download the Ubuntu ISO file from the website and create a virtual machine:

Then ensure that the ISO file is inserted into the virtual CD drive:

You can then boot the machine. Simply follow the instructions. Once Ubuntu is installed, we can proceed with the installation of PostgreSQL itself.

Installing repositories and keys

To download PostgreSQL, we suggest checking out the official PostgreSQL website. Those PostgreSQL packages provided by the community are high quality and we recommend using them for your deployment:

Please select your desired operating system. In our example, I have selected the latest version of Ubuntu (20.10). The following steps are now necessary to install PostgreSQL:

  • Add the repository
  • Add the keys for the repository
  • Update your packages
  • Deploy PostgreSQL

The first thing you have to do is to add the repository to your Ubuntu installation. Here is how it works:

 hans@cybertec:~# sudo sh -c 'echo "deb $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list' 

Then you can add the keys to the system to make sure the repository is trustworthy:

 hans@cybertec:~# wget --quiet -O -
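Assuming the repository and key have been set up, the remaining two steps from the list above (updating your packages and deploying PostgreSQL) can be sketched as follows; the package name is assumed to be the distribution default:

 hans@cybertec:~# sudo apt-get update
 hans@cybertec:~# sudo apt-get -y install postgresql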
Posted by Andreas 'ads' Scherbaum on 2021-03-01 at 14:00
PostgreSQL Person of the Week Interview with David E. Wheeler: I’m a long-time open-source hacker in my early fifties. I’ve lived and worked in San Francisco, Portland OR, and now New York City.
Previously I wrote about locating config files. The thing is – postgresql.conf is not the only place you can set your configuration. Here, I'll describe all the places that can be used, why we even have more than one place, and finally – how to find out where a given value comes from. … Continue reading "Starting with Pg – where/how can I set configuration parameters?"
Posted by Suman Michael in MigOps on 2021-02-28 at 18:18

While migrating from Oracle to PostgreSQL, many developers ask about the options available for handling global associative arrays in PostgreSQL. It can get challenging if the variables need to be accessible in the exception block. To help developers avoid difficulties when converting PL/SQL from Oracle to PostgreSQL, we are writing this blog post to demonstrate the workarounds available without much additional programming effort.

By the way, one fact to note after reading this blog post is that several features you may want in PostgreSQL during migrations or code conversions are available without the need for an additional Enterprise license (which could create vendor lock-in).

Ask us about possibilities with Community PostgreSQL (Open Source) before switching to an Enterprise licensing. 🙂

What is an Associative Array ?

Before getting to the workarounds, let us understand what an associative array is. In programming languages, an associative array may also be called a map or a dictionary. An associative array is a collection of (key, value) pairs where each key is associated with a value. This association between a key and a value may be referred to as a mapping. The key can be of either a text or a numeric type and can be mapped to any value type.

PostgreSQL Array type

PL/pgSQL in PostgreSQL allows us to declare a variable as an ARRAY type. This ARRAY can be either a base or a custom type. For example, if we want to store a list of PINCODE values, then, we can declare the variable as v_pincode INT[]. Once we declare this variable, then, we can load all the pincode values from a table/view/function into this array. 
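A minimal sketch of such a declaration and initialization (the values are invented for illustration):

DO $$
DECLARE
    -- declare and initialize an array of pincodes
    v_pincode INT[] := ARRAY[560001, 560002, 560003];
BEGIN
    RAISE NOTICE 'first pincode: %', v_pincode[1];
END
$$;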

We can also initialize the v_pincode array type with a static set of pincode values. Here, the v_pincode holds all the values and we can fetch a specific state’s pincode valu

Posted by Bo Peng in SRA OSS, Inc. on 2021-02-28 at 08:35

Pgpool-II was first released more than 15 years ago. When version 1.0 was released, Pgpool-II had only the replication mode. With the addition of new features (e.g. automatic failover, support for logical replication, etc.), the parameters for configuring the cluster mode have increased.

Clustering mode

Pgpool-II has six clustering modes available: 

It is important to specify the correct clustering mode to make Pgpool-II work as expected, because the clustering mode is used to determine how the data is replicated. Since Pgpool-II 4.2, the multiple parameters for configuring cluster mode have been merged into a single parameter backend_clustering_mode.

Clustering mode             | Pgpool-II 4.1 or earlier                                       | Pgpool-II 4.2 or later
                            | (master_slave_mode / master_slave_sub_mode / replication_mode) | (backend_clustering_mode)
Streaming replication mode  | on / stream / off                                              | streaming_replication
Logical replication mode    | on / logical / off                                             | logical_replication
Slony mode                  | on / slony / off                                               | slony
Native replication mode     | off / - / on                                                   | native_replication
Raw mode                    | off / - / off                                                  | raw
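With Pgpool-II 4.2 or later, the choice above reduces to a single line in pgpool.conf; for example, for streaming replication mode (the value is taken from the table above):

 backend_clustering_mode = 'streaming_replication'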
Connection Queuing in pgBouncer

Yes, this post is about connection queueing, not just pooling. Because “connection pooling” – pre-created connections as a pool – is a much-celebrated feature. Almost every discussion on connection pool/pgBouncer starts with the overhead of establishing a new connection to PostgreSQL… and how pre-created connections in the pool can save the world.

But there is a less-celebrated feature in pgBouncer (not denying the others) that can address some really big operational challenges: connection queueing. Many new PostgreSQL users don’t know that something like this exists. In this blog post, I am planning to discuss some of the problems related to connection management first, and then explore how connection queueing can address those problems.

Problem 1: Spike in Load can Jam and Halt the Server

PostgreSQL has a dedicated backend server process for every user connection. So there will be one backend process running on a CPU core for every active query/session. This means there is a one-to-one mapping between active sessions and running processes on the server. If we consider parallel execution of SQL statements, there will be many more running processes than active sessions. In many real-world cases, a sudden spike in load can result in hundreds of active queries starting at once while the server is equipped with a small number of CPUs (sometimes just virtual CPUs with only a fraction of the performance). As the number of active sessions/processes increases, the overhead of scheduling and context switches takes over. Many times, the host server becomes unresponsive, and even opening a bash shell/terminal can take time. This is quite easy to simulate: just 10 active connections on a two virtual CPU server with a SELECT-only workload can cause it.

With two active sessions:

$ time ssh -t postgres@pghost 'ls -l'
real 0m0.474s
user 0m0.017s
sys 0m0.006s

When there are 10 active sessions on PostgreSQL, just establishing an ssh connection to the server took 15 seconds.

real 0m15.307s
user 0m0.026
Posted by Lætitia AVROT in EDB on 2021-02-26 at 14:43
This is something a lot of people want: being able to track changes happening on their database – when they happened and who ordered them. Well, I have good news, my friends: there is a native feature in Postgres that can help you achieve that exact goal! DDL and DML First things first, what are DDL and DML? Both DDL and DML are subsets of the SQL language. DDL means Data Definition Language (or Data Description Language).
Posted by Ryan Lambert on 2021-02-26 at 05:01

A few weeks ago I decided to seriously consider Postgres' declarative table partitioning for our OpenStreetMap data. Once the decision was made to investigate this option, I outlined our use case with requirements to keep multiple versions of OpenStreetMap data over time. That process helped draft my initial plan for how to create and manage the partitioned data. When I put the initial code to the test I found a snag and adjusted the plan.

This post shows a working example of how to partition OpenStreetMap data loaded using PgOSM-Flex.


Spoiler alert!

It works, I love it! I am moving forward with the plan outlined in this post. Some highlights from testing with Colorado sized data:

  • Bulk import generates 17% less WAL
  • Bulk delete generates 99.8% less WAL
  • Simple aggregate query runs 75% faster
Posted by Pavel Stehule on 2021-02-26 at 04:43
I released a new version of pspg. There is now the possibility to select rows, columns, or a block (by mouse or by keys) for export.

Sizing a connection pool the wrong way
© Laurenz Albe 2021

PostgreSQL v14 has new connection statistics in pg_stat_database. In this article, I want to explore one application for them: estimating the correct size for a connection pool.

New connection statistics in v14

Commit 960869da080 introduced some new statistics to pg_stat_database:

  • session_time: total time spent by sessions in the database
  • active_time: time spent executing SQL statements in the database
  • idle_in_transaction_time: time spent idle in a transaction in the database
  • sessions: total number of sessions established to the database
  • sessions_abandoned: number of sessions to the database that were terminated by lost client connections
  • sessions_fatal: number of sessions to the database terminated by fatal errors
  • sessions_killed: number of sessions to the database killed by an operator
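On v14, these counters can be read like any other pg_stat_database column; a quick look might be (the database name is a placeholder):

test=# SELECT datname, sessions, session_time, active_time, idle_in_transaction_time
         FROM pg_stat_database
        WHERE datname = 'mydb';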

Some applications are obvious: for example, you may want to keep the number of fatal errors or operator interventions low, or you may want to fix your application to properly close database connections.

But I want to show you another, less obvious application for these statistics.

Connection pooling modes

In the following, I assume that the application connects to the database with a single database user. That means that there will be one connection pool per database.

If you use session level connection pooling, a client gets a session from the pool for the whole duration of the database session. In this case, sizing the connection pool is simple: the pool has to be large enough to accommodate the maximum number of concurrent sessions. The number of connections to a database is available in numbackends in pg_stat_database, and many monitoring systems capture this value. Session level pooling is primarily useful if the client’s database connections are short.
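For example, the current value can be read directly (the database name is a placeholder):

test=# SELECT datname, numbackends FROM pg_stat_database WHERE datname = 'mydb';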

However, it is normally more efficient to use transaction level connection pooling. In this mode, a client gets a session only for the duratio


In the previous article, we explored GitOps and how to apply GitOps concepts to PostgreSQL in a Kubernetes environment with the Postgres Operator and custom resources. The article went on to mention additional tooling that has been created to help employ GitOps principles within an environment, including Helm.

While the methods demonstrated in the previous blog post show how direct creation and manipulation of the PostgreSQL custom resources allow for GitOps workflows, tools like Helm can take this a step further by providing simpler encapsulation to both get started and further refine a PostgreSQL cluster.

Let's look at using Helm with the PostgreSQL Operator. To run through this example, you will need to both install the PostgreSQL Operator and Helm. This example assumes that you used the quickstart to install the Postgres Operator and will reference the pgo namespace as where both the Postgres Operator and the PostgreSQL clusters will be deployed.

The Postgres Operator repository provides an example Helm chart that can be used to create a PostgreSQL cluster and let you select the following attributes:

  • Memory / CPU resources
  • Disk size
  • High availability
  • Monitoring

You can see all the attributes listed out in both the README and the values file.

PostgreSQL is one of the most advanced open source databases globally, with a lot of great features. One of them is streaming replication, which was introduced in PostgreSQL 9.0—known as Physical Replication. It works by sending the WAL segments generated by the primary PostgreSQL server to another standby server also running PostgreSQL. However, if we only want to replicate a portion of the database tables, Physical Replication is not ideal. To improve the replication feature, PostgreSQL added Logical Replication (building on the logical decoding introduced in 9.4), with which we can perform efficient streaming replication at the level of individual data objects.

The logical replication model needs one publisher and one or more subscribers. Subscribers pull data from the publisher they subscribe to and can subsequently re-publish data to allow cascading replication or more complex configurations. Every change should be replicated in real time, so the two databases remain synchronized.

Use-Cases for Logical Replication:  Let’s take a look at why we need this feature in the database system.

  • Sending incremental changes in a single database
  • Consolidating multiple databases into a single one
  • Replicating between different major versions of PostgreSQL
  • Sharing a subset of the database between various databases

A subscriber can be a publisher for other databases by defining its own publications. Conflicts can arise when writing to the same table from either side; therefore, the subscriber side should be treated as read-only to avoid this problem.
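The publisher/subscriber model described above boils down to two statements; a minimal sketch with invented names and a placeholder connection string:

-- on the publisher
CREATE PUBLICATION my_pub FOR TABLE tbl;

-- on the subscriber
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=primary_host dbname=mydb user=repuser'
    PUBLICATION my_pub;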

Step by step set up introduction

1. Build Publisher Servers

Create a publisher database cluster using the initdb tool:

$ postgres/bin/initdb /rep_example/db_master

You can install this database cluster in a directory of your choice. db_master is the name of our database cluster, and I put it under /rep_example/.

Now we need to set wal_level to ‘logical’. If you want to have another database cluster as the subscriber on the same machine (both servers on localhost), it


Just like in most databases, in PostgreSQL a trigger is a way to automatically respond to events. Maybe you want to run a function if data is inserted into a table. Maybe you want to audit the deletion of data, or simply respond to some UPDATE statement. That is exactly what a trigger is good for. This post is a general introduction to triggers in PostgreSQL. It is meant to be a simple tutorial for people who want to get started programming them.

Create a simple table

Writing a trigger is easy. The first important thing you will need is a simple table. A trigger is always associated with a table:

test=# CREATE TABLE t_temperature (
  id serial,
  tstamp timestamptz,
  sensor_id int,
  value float4
);
CREATE TABLE
test=# \d t_temperature
                               Table "public.t_temperature"
  Column    |           Type           | Collation | Nullable |                  Default
------------+--------------------------+-----------+----------+-------------------------------------------
 id         | integer                  |           | not null | nextval('t_temperature_id_seq'::regclass)
 tstamp     | timestamp with time zone |           |          |
 sensor_id  | integer                  |           |          |
 value      | real                     |           |          |

The goal of this example is to check the values inserted and silently “correct” them if we think that the data is wrong. For the sake of simplicity, all values below zero will simply be set to -1.

CREATE TRIGGER: Syntax and execution order

If you want to define a trigger, there are two things which have to be done:

  • Define a function
  • Create the trigger based on the function
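A minimal sketch of those two steps for the temperature example described above (the function and trigger names are mine, not the author's):

CREATE FUNCTION temp_trigger_func() RETURNS trigger AS
$$
BEGIN
    -- silently "correct" wrong data: anything below zero becomes -1
    IF NEW.value < 0 THEN
        NEW.value := -1;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER temp_trigger
    BEFORE INSERT ON t_temperature
    FOR EACH ROW EXECUTE FUNCTION temp_trigger_func();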

In the following section you will be guided through that process. Before we get started, let’s first take a look at CREATE TRIGGER:

Description: define a new trigger
     ON table_name
     [ FROM refere
Posted by Paul Ramsey in Crunchy Data on 2021-02-22 at 19:08

A surprisingly common problem in both application development and analysis is: given an input name, find the database record it most likely refers to. It's common because databases of names and people are common, and it's a problem because names are a very irregular identifying token.

The page "Falsehoods Programmers Believe About Names" covers some of the ways names are hard to deal with in programming. This post will ignore most of those complexities, and deal with the problem of matching up loose user input to a database of names.

Posted by Andreas 'ads' Scherbaum on 2021-02-22 at 14:00
PostgreSQL Person of the Week Interview with Takayuki Tsunakawa: I’m a Japanese man who was born and lives in Japan, near Tokyo. I am visually impaired by nature, and I have only enough eyesight to feel the light now. But I had somewhat better vision when I was a child, and enjoyed playing video games with friends or alone. I can’t see the screen, so I use screen reader software that reads characters by synthesized voice.
Posted by Ryan Lambert on 2021-02-21 at 05:01

My previous two posts set the stage to evaluate declarative Postgres partitioning for OpenStreetMap data. This post outlines what I found when I tested my plan and outlines my next steps. The goal with this series is to determine if partitioning is a path worth going down, or if the additional complexity outweighs any benefits. The first post on partitioning outlined my use case and why I thought partitioning would be a potential benefit. The maintenance aspects of partitioning are my #1 hope for improvement, with easy and fast loading and removal of entire data sets being a big deal for me.

The second post detailed my approach to partitioning to allow me to partition based on date and region. In that post I even bragged that a clever workaround was a suitable solution.

"No big deal, creating the osmp.pgosm_flex_partition table gives each osm_date + region a single ID to use to define list partitions."     -- Arrogant Me

Read on to see where that assumption fell apart and my planned next steps.

I was hoping to have made a "Go / No-Go" decision by this point... I am currently at a solid "Probably!"

Load data

For testing I simulated Colorado data being loaded once per month on the 1st of each month and North America once per year on January 1. This was conceptually easier to implement and test than trying to capture exactly what I described in my initial post. This approach resulted in 17 snapshots of OpenStreetMap being loaded, 15 with Colorado and two with North America. I loaded this data twice, once using the planned partitioned setup and the other using a simple stacked table to compare performance against.

Posted by Sai Srirampur in CitusData on 2021-02-20 at 17:55

In my day to day, I get to work with many customers migrating their data to Postgres. I work with customers migrating from homogenous sources (PostgreSQL) and also from heterogenous database sources such as Oracle and Redshift. Why do people pick Postgres? Because of the richness of PostgreSQL—and features like stored procedures, JSONB, PostGIS for geospatial workloads, and the many useful Postgres extensions, including my personal favorite: Citus.

A large chunk of the migrations that I help people with is homogeneous Postgres-to-Postgres data migrations to the cloud. As Azure Database for PostgreSQL runs open source Postgres, in many cases the application migration can be drop-in and doesn’t require a ton of effort. The majority of the effort usually goes into deciding on and implementing the right strategy for performing the data migration. For those of you who cannot afford any downtime during the Postgres migration process, there are of course data migration services that can help. But if you can afford some downtime for the migration during a specific maintenance window (e.g. during weekends, nights, etc.), then simple Postgres utilities such as pg_dump and pg_restore can be used.

In this post, let’s walk through the tradeoffs to consider while using pg_dump and pg_restore for your Postgres database migrations—and how you can optimize your migrations for speed, too. Let’s also explore scenarios in which you need to migrate very large Postgres tables. With large tables, using pg_dump and pg_restore to migrate your database might not be the most optimal approach. The good news is we’ll walk through a nifty Python tool for migrating large database tables in Postgres. With this tool we observed the migration of a large Postgres table (~1.4TB) complete in 7 hrs. 45 minutes vs. more than 1 day with pg_dump/pg_restore.

Faster migrations with pg_dump & pg_restore

pg_dump is a standard and traditional utility for backing up a PostgreSQL database. pg_dump takes a consistent snapshot of your Postgres da

Posted by Kat Batuigas in Crunchy Data on 2021-02-19 at 19:42

"I want to work on optimizing all my queries all day long because it will definitely be worth the time and effort," is a statement that has hopefully never been said. So when it comes to query optimizing, how should you pick your battles? Luckily, in PostgreSQL we have a way to take a system-wide look at database queries: 

  • Which ones have taken up the most amount of time cumulatively to execute
  • Which ones are run the most frequently
  • And how long on average they take to execute

And that's just a few places you can shine a light on, using pg_stat_statements.

pg_stat_statements is an extension that has been around in PostgreSQL since version 8.4. It's evolved over the years (notably since 9.2) to become such a helpful tool. So we thought now would be as good a time as any to highlight it here on our Crunchy blog. 
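To try it out, the extension must be listed in shared_preload_libraries and created in the database; a typical "top queries by cumulative time" query, matching the three bullets above, might look like this (column names as of PostgreSQL 13):

test=# CREATE EXTENSION pg_stat_statements;
test=# SELECT query, calls, total_exec_time, mean_exec_time
         FROM pg_stat_statements
        ORDER BY total_exec_time DESC
        LIMIT 5;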

Ever since PostgreSQL 11, we have had JIT (Just-In-Time compilation). Information about JIT is displayed in explain analyze plans, but up to this moment, it wasn't displayed properly. But not anymore. Thanks to release 1.05 of Pg::Explain and a subsequent change in the templates for the site, you can now see nice JIT info. Hope you'll find … Continue reading " – now with JITs"

Recently I ran across grand sweeping statements that suggest containers are not ready for prime time as a vehicle for deploying your databases. The definition of "futile" is something like "serving no useful purpose; completely ineffective". See why I say this below, but in short, you probably are already, for all intents and purposes, running your database in a "container". Therefore, your resistance is futile.

And I'm here to tell you that, at least in so far as PostgreSQL is concerned, those sweeping statements are patently false. At Crunchy Data we have many customers that are very successfully running large numbers of PostgreSQL clusters in containers. Examples of our success in this area can be found with IBM and SAS.

However, just as you better have a special license and skills if you want to drive an 18 wheeler down the highway at 70 MPH, you must ensure that you have the skills and knowledge (either yourself or on your team) to properly operate your infrastructure, whether it be on-prem or in the cloud. This has always been true, but the requisite knowledge and skills have changed a bit.

PostgreSQL query optimization with CREATE STATISTICS is an important topic. Usually, the PostgreSQL optimizer (query planner) does an excellent job. This is not only true for OLTP but also for data warehousing. However, in some cases the optimizer simply lacks the information to do its job properly. One of these situations has to do with cross-column correlation. Let’s dive in and see what that means.

Violating statistical independence

Often data is not correlated at all. The color of a car might have nothing to do with the amount of gasoline consumed. You wouldn’t expect a correlation between the number of dogs in China and the price of coffee. However, sometimes there is indeed a correlation between columns which can cause a problem during query optimization.

Why is that the case? In PostgreSQL, statistics are stored for each column. PostgreSQL knows about the number of distinct entries, the distribution of data, and so on – by default it does NOT know how values relate to each other. Some examples: you know that age and height correlate (babies are not born 6 ft tall), you know that “country” and “language” spoken are usually correlated. However, the PostgreSQL query planner does not know that.
The solution to the problem is extended statistics (“CREATE STATISTICS”).

Let us create some sample data:

test=# CREATE TABLE t_test (id serial, x int, y int, z int);
test=# INSERT INTO t_test (x, y, z)
          SELECT id % 10000,
                   (id % 10000) + 50000,
                   random() * 100
          FROM generate_series(1, 10000000) AS id;
INSERT 0 10000000
test=# ANALYZE ;

What we do here is create 10 million rows. The magic is in the first two columns: we can see that “y” is directly related to “x” – we simply add 50,000 to make sure that the data is perfectly correlated.
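The extended statistics mentioned above can then be created on the correlated columns; a sketch (the statistics name and the "dependencies" kind are chosen by me for illustration):

test=# CREATE STATISTICS s_xy (dependencies) ON x, y FROM t_test;
test=# ANALYZE t_test;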

Inspecting optimizer estimates

In the next example we will take a look at a simple query and see how PostgreSQL handles statistics. To make the plan more readable and t

Posted by Kat Batuigas in Crunchy Data on 2021-02-16 at 14:31

As a GIS newbie, I've been trying to use local open data for my own learning projects. I recently relocated to Tampa, Florida and was browsing through the City of Tampa open data portal and saw that they have a Public Art map. That looked like a cool dataset to work with but I couldn't find the data source anywhere in the portal. I reached out to the nice folks on the city's GIS team and they gave me an ArcGIS-hosted URL. 

To get the public art features into PostGIS I decided to use the "ArcGIS Feature Service" option in QGIS to point to the ArcGIS API, then export the feature layer to PostGIS. 

Our PostgreSQL 24×7 support team recently received a request from one of our customers who was facing a performance problem. The solution to the problem could be found in the way PostgreSQL handles query optimization (specifically, statistics). So I thought it would be nice to share some of this knowledge with my beloved readers. The topic of this post is therefore: What kinds of statistics does PostgreSQL store, and where can they be found? Let’s dive in and find out.

The purpose of optimizer statistics

Before we dig into PostgreSQL optimization and statistics, it makes sense to understand how PostgreSQL runs a query. The typical process works as follows:

PostgreSQL query execution, run query

First, PostgreSQL parses the query. Then, the traffic cop separates the utility commands (ALTER, CREATE, DROP, GRANT, etc.) from the rest. After that, the entire thing goes through the rewrite system, which is in charge of handling rules and so on.

Next comes the optimizer – which is supposed to produce the best plan possible. The plan can then be executed by the executor. The main question now is: What does the optimizer do to find the best possible plan? In addition to many mathematical transformations, it uses statistics to estimate the number of rows involved in a query. Let’s take a look and see:

test=# CREATE TABLE t_test AS SELECT *, 'hans'::text AS name
        FROM generate_series(1, 1000000) AS id;
SELECT 1000000
test=# ALTER TABLE t_test ALTER COLUMN id SET STATISTICS 10;
test=# ANALYZE;

I have created 1 million rows and told the system to calculate statistics for this data. To make things fit onto my website, I also told PostgreSQL to reduce the precision of the statistics. By default, the statistics target is 100. However, I decided to use 10 here, to make things more readable – but more on that later.
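The per-column statistics discussed here are exposed in the pg_stats view; a sketch of how to look at them for this table:

test=# SELECT attname, n_distinct, most_common_vals
         FROM pg_stats
        WHERE tablename = 't_test';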

Now let’s run a simple query:

test=# explain SELECT * FROM t_test WHERE id < 150000; 
                        QUERY PLAN