PostgreSQL
The world's most advanced open source database
When you run an application with a relational database attached, you will no doubt have encountered this question: Which indexes should I create? For some of us, indexing comes naturally, and B-tree, GIN and GiST are words of everyday use. For others, it's more challenging to figure out which index to create, and getting it right takes a lot of time. But what unites us is that creating and tweaking indexes is part of our job when we use a relational database such as Postgres in production. We need…
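
To put concrete syntax to the index types named above, here is a hedged sketch using hypothetical tables:

-- hypothetical tables, purely to illustrate the index types mentioned above
CREATE TABLE articles (id bigint PRIMARY KEY, title text, tags text[]);
CREATE INDEX articles_title_idx ON articles (title);           -- B-tree: equality and range lookups
CREATE INDEX articles_tags_idx ON articles USING gin (tags);   -- GIN: containment queries on arrays, JSONB, full text

CREATE TABLE places (id bigint PRIMARY KEY, location point);
CREATE INDEX places_location_idx ON places USING gist (location);  -- GiST: geometric and nearest-neighbour searches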

How hard could it be to reset a sequence?

Restarting a sequence: how hard could it be? (PostgreSQL and Oracle)

One reason I like PostgreSQL so much is that it makes me feel at home: it has a very consistent and coherent interface to its objects. An example of this is the management of sequences: ALTER SEQUENCE allows you to modify pretty much every detail about a sequence, in particular to restart it from its initial value.
Let’s see this in action:



testdb=> create sequence batch_seq 
         increment by 1 start with 1;
CREATE SEQUENCE

testdb=> do $$
declare
  i int;
begin
  for i in 1..100 loop
     perform nextval( 'batch_seq' );
  end loop;
end
$$
;
DO


testdb=> select currval( 'batch_seq' );
 currval 
---------
     100



In the above piece of code, I've created batch_seq and called nextval on it one hundred times, so that the current value of the sequence is 100.

How is it possible to make the sequence start over again?
A first possibility is to use the setval function:



testdb=> select setval( 'batch_seq', 1 );
 setval 
--------
      1


testdb=> select currval( 'batch_seq' );
 currval 
---------
       1



Another option is to use ALTER SEQUENCE, a command intended for this purpose (among others):



testdb=> alter sequence batch_seq restart;
ALTER SEQUENCE

testdb=> select nextval( 'batch_seq' );   
 nextval 
---------
       1



An important thing to note here is that the only option specified was RESTART: the sequence already knows what restarting means, namely resetting to its original starting value.
It is also possible to specify an explicit value to restart from:



testdb=> alter sequence batch_seq restart with 666;
ALTER SEQUENCE
        
testdb=> select nextval( 'batch_seq' );
 nextval 
---------
     666



That’s so simple!
The above behaviour has been guaranteed since at least PostgreSQL 8.1 (and probably even earlier): see the old documentation here.

Wait, what a

[...]
Posted by Franck Pachot on 2021-09-22 at 21:18

Let's say you have a script of one million insert statements with literal values. Of course, this is a bad idea. SQL is a language, not a data exchange format. You should have a file to import with COPY, with all values as CSV for example. Or at least, if you are a big fan of INSERT statements, put thousands of row values in each one. Or prepare the statement with parameters and execute it once per parameter set.
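
For reference, the three alternatives mentioned above look roughly like this (a minimal sketch, assuming a table demo(key int, value int) and a hypothetical CSV file path):

COPY demo (key, value) FROM '/tmp/demo.csv' WITH (FORMAT csv);   -- bulk load from a CSV file
INSERT INTO demo (key, value) VALUES (1, 42), (2, 17), (3, 99);  -- one statement carrying many rows
PREPARE ins (int, int) AS INSERT INTO demo (key, value) VALUES ($1, $2);
EXECUTE ins(4, 7);                                               -- parsed once, executed per parameter set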

However, this is the kind of thing we encounter in real life. I still remember, 20 years ago, when I got to look at a Mobile Telecom Billing application where the vendor specification required 6 large servers for a system that had at most one thousand post-paid customers to bill. Call Data Records came in a very efficient proprietary binary format from the network Mobile Switching Centre. They had a C program to decode it into... plain text INSERT statements to be "ingested" with sqlplus. Clearly, it is hard to imagine a worse design, and that was for the most critical part of the system, as it required nearly real-time actions. Of course performance was poor. And you can imagine there was no correct error handling there. And no security, with all call records in plain text. But they sold it to people who do not look at how it works, easily satisfied by nice powerpoints. Today, I'm convinced that, whatever the reputation of the vendor, you should not put your data on software that is not open-source. Look at the code, look at the design, and you will get a good idea of the quality of the software.

Back to our technical workaround, I'm generating one million insert statements:

for i in {1..1000000}
do
 echo "insert into demo (key,value) values ($i,$RANDOM);" 
done > inserts.sql

PostgreSQL

Loading them as-is with psql is not very efficient because it has to send the commands one by one, parse them each time, and commit each row:

time psql <
[...]


In the previous post, I explained how to control client connections by using the reserved_connections parameter. In this post I would like to introduce how to use the shared relation cache to improve performance.

The shared relation cache was added in Pgpool-II 4.1. Before explaining this feature, let's have a look at what Pgpool-II's relation cache is.

What is relation cache?

When Pgpool-II receives a client query, it parses the query and extracts the table names, then accesses the PostgreSQL system catalogs to get the relation information. Pgpool-II obtains the following information from the PostgreSQL system catalogs:

  • whether a table included in the query is a temporary table or not
  • whether a table included in the query is an unlogged table or not
  • whether a function included in the query is "immutable" or not (if a function is used in the query and query cache mode is enabled)

If the table included in the query is a temporary table or an unlogged table, then Pgpool-II must send the query to the primary PostgreSQL.
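
As an illustration, the kind of catalog information involved can be read directly from pg_class; this is only a sketch of the underlying lookup, not necessarily the exact query Pgpool-II issues:

SELECT relname, relpersistence   -- 'p' = permanent, 'u' = unlogged, 't' = temporary
FROM pg_class
WHERE relname = 'my_table';      -- 'my_table' is a placeholder table name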

To reuse the information obtained from the system catalogs, Pgpool-II stores the information in a local cache. If the same object (e.g. table, function) appears in subsequent queries, Pgpool-II fetches the information from the local cache instead of accessing the system catalogs.

Each child process stores its local relation cache in process-private memory, and other child processes cannot access it. If a different process is assigned to a client request, that process has to access the PostgreSQL system catalogs even if the same table information is already stored in another child process's local relation cache. If num_init_children is set to a large value, Pgpool-II may access the PostgreSQL system catalogs as many times as the value of num_init_children, which can increase the overhead.

The shared relation cache feature introduced in Pgpool-II 4.1 resolves this problem.

Shared relation cache

Pgpool-II can create relation cache in shared memory by enabling

[...]

Moving from Oracle to PostgreSQL has become a popular sport, widely adopted by many who want to free themselves from license costs, hefty support costs and also technical limitations on the Oracle side. The same is true for people moving from MS SQL and other commercial platforms to PostgreSQL. However, my impression is that moving from Oracle to PostgreSQL is by far the most popular route.

A lot has been written about Oracle to PostgreSQL migration; today, I want to share some of the lessons learned when it comes to actually building a migration tool. Over the years, we’ve tried out many different approaches to creating the perfect tool, and we’ve definitely made some mistakes on the way – mistakes you will not want to repeat. I’ve decided to share some of the insights I personally had on this multi-year journey. I can’t give you a complete account of them in the space of a short article, so what I’ve done here is to highlight the most useful methods you can take advantage of to make migrations more efficient.

Transaction length matters

In case you are migrating just one or two gigabytes of data from Oracle to PostgreSQL, transaction length is really not relevant. However, things are different if we are talking about 10+ TB of data. For many clients, downtime is not an option. To achieve close-to-zero downtime, you need to support CDC (= change data capture). The idea is to take an initial snapshot of data and then apply the changes as the target system (= PostgreSQL) catches up with the source database.

In reality, this means that we have to copy a vast amount of data from Oracle while changes are still being made to the source database. What you soon face on the Oracle side is a famous problem:

ORA-01555 Error Message “Snapshot Too Old”

Many of you with migration experience will be painfully aware of this issue. Ideally, you need to ensure that Oracle is configured properly to handle real transactions, not just short read bursts. It makes sense to teach the migrator beforehand to check

[...]

Most years, a Postgres release has some big pillar or theme. Often this is identified after the fact: everything that has been committed is looked at, and someone thinks, "This is the key thing to talk about." In Postgres 9.2 it was JSON, in 9.4 it was JSONB, in 10 it was logical replication, and 12 had a broader performance theme. While I look forward to each of these big highlights, in each release I'm equally excited to browse through and pull out the small things that simply make my life better.

Posted by Bo Peng in SRA OSS, Inc. on 2021-09-21 at 17:57

Pgpool-II is a feature-rich PostgreSQL cluster management tool. To determine which configuration is best for your database cluster, you need to understand the purpose of the parameters. Starting with this blog, I will introduce several parameters that are effective for improving performance.

In this blog, I will explain the reserved_connections parameter and how to configure it.

This parameter is used to refuse incoming client connections with the error message "Sorry, too many clients already", rather than blocking them, when the number of client connections exceeds the value of "num_init_children - reserved_connections".

First, let me describe why this parameter was added.

Control of client connections in Pgpool-II 4.0 and earlier

At startup, the Pgpool-II parent process preforks num_init_children child processes, and each child process waits for a client connection. If the maximum number of concurrent client connections (the value of num_init_children) has been reached, how does Pgpool-II handle new requests?

Pgpool-II stacks up the new request in the request queue. When a child process is released and is ready to accept a new connection request again, the OS assigns the queued connection request to that process. In other words, the connection request waits for an available child process.

However, if a client connects to Pgpool-II and occupies a child process for a long time without doing anything, the OS queue fills up and new client requests will never be accepted. In addition, this may cause high server load and an unstable state.

Configure reserved_connections

Since Pgpool-II 4.1 you can use the reserved_connections parameter to control client connections.

The default value is 0. 

reserved_connections = 0 

If reserved_connections is set to 1 or greater, incoming client connections will be refused with the error message "Sorry, too many clients already", rather than waiting for an available process, when the number of client connections exceed

[...]
Posted by Andreas 'ads' Scherbaum on 2021-09-20 at 14:00
PostgreSQL Person of the Week Interview with Michael Goldberg: I’m from a small town located in the Ural mountains. In 1991 our family moved to Israel. I live in Tel Aviv since then.

1. Overview

In my previous blog, I briefly walked through how the bulk/batch insertion was done for postgres_fdw in PG14. In this blog, I am going to run some basic tests to compare the performance before and after batch insertion was introduced in postgres_fdw, so that we can get a general idea of whether this feature makes any difference.

2. PG Servers Setup

The key point of this blog is to see whether batch insertion makes any difference. To keep the testing simple, here is how I set up a basic environment.

As this bulk/batch insertion was introduced in PG14, we need to switch to the stable PG14 branch, i.e. REL_14_STABLE. After checking out the source code, simply run the commands configure, make and make install. Here are the commands used in this blog.

./configure --prefix=$HOME/sandbox/postgres/pgapp --enable-tap-tests --enable-debug CFLAGS="-g3 -O0"
make && make install
cd contrib/postgres_fdw/
make && make install
export PGDATA=$HOME/sandbox/postgres/pgdata
initdb -D $PGDATA
pg_ctl -D $PGDATA -l logfile start

In order to test the Foreign Data Wrapper, we need to start another PG Server. To make it easy, I simply start a PG Server on a Foreign data cluster and change the default port to a different one, for example 5433. Below are the commands used to set up the Foreign Server.

export FPGDATA=$HOME/sandbox/postgres/pgdata2
initdb -D $FPGDATA

After the Foreign data cluster has been initialized, change the port to 5433, then start the Foreign PG Server.

vim $FPGDATA/postgresql.conf 
pg_ctl -D $FPGDATA -l logfile-f start

3. Foreign Tables Setup

Now we can set up the basic Foreign Data Wrapper testing environment as below.

On the Local PG Server:

3.1. Create a Foreign Server using default batch settings
postgres=# create server fs foreign data wrapper postgres_fdw options (dbname 'postgres', host '127.0.0.1', port '5433');
CREATE SERVER
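
As a hedged aside (not part of the original test setup), PG14's postgres_fdw also lets you set the batch size explicitly, either per server or per foreign table:

-- per server (fs is the server created above); 'ft' below is a placeholder foreign table name
ALTER SERVER fs OPTIONS (ADD batch_size '100');
ALTER FOREIGN TABLE ft OPTIONS (ADD batch_size '100');
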
3.2. Create the user mapping
postgres=# create user mapping for david server fs options( user 'david');
CREATE USER MAPPI
[...]
Posted by Onder Kalacı in CitusData on 2021-09-17 at 15:23

Citus 10.2 is out! If you are not yet familiar with Citus, it is an open source extension to Postgres that transforms Postgres into a distributed database—so you can achieve high performance at any scale. The Citus open source packages are available for download. And Citus is also available in the cloud as a managed service.

You can see a bulleted list of all the changes in the CHANGELOG on GitHub. This post is your guide to what’s new in Citus 10.2, including some of these headline features.

And if you want to catch up on all the new things in the previous releases, check out our earlier blog posts about Citus 10.1, Citus 10, Citus 9.5, and Citus 9.4.

The Citus 10.2 release graphic with an elephant in the pilot seat and of course a decal of our open source mascot, the Citus Elicorn.

PostgreSQL 14 Beta3 support—and ready for PostgreSQL 14

Because Citus is an extension to Postgres, it’s easier for us to keep Citus in sync with the newest releases of Postgres (which would not be true if Citus were a fork.) This means when a new Postgres version is released, we can quickly release a new version of Citus that supports all or most of the new features.

One of the things to be excited about: Citus 10.2 is compatible with the Postgres 14beta3 release and already supports

[...]
Posted by Lætitia AVROT in EDB on 2021-09-17 at 00:00
As a consultant, I have to analyze various databases daily. I need to be able to grasp what the major problems are quickly and find the best way to solve them. Analyzing logs is a very efficient way to find problems. I was frustrated with the existing tools as I couldn’t deep dive into a problem to find more accurate data related to that problem only. That’s where I decided to use SQL to analyze my customers' logfiles.
Posted by Michał Mackiewicz on 2021-09-16 at 09:15

Upgrading one's operating system to a new major version is an important system maintenance task and it's usually a good thing. It brings new features, security improvements, access to newer packages and so on. Sometimes it doesn't go that smoothly, for example the updated system will refuse to start. But upgrading the OS running a Postgres cluster and involving a glibc library version update, or migrating a database to another machine running another OS (and glibc) version, poses a little-known but very significant risk…

The disease

The technical reasons behind the problem are very well explained in this post by Citus. Long story short, changes in glibc code can make the operating system text sorting order inconsistent with the previous version, and in turn with already created B-Tree indexes. The possible damage is limited to indexes on textual data (VARCHAR, TEXT); numbers (integers, floats, NUMERICs…) and of course booleans are immune.

Symptoms and diagnosis

Index corruption can cause a variety of bad things, all of which can be described as "weird behavior" or "Postgres doesn't do such things!". Examples include:

  • write transactions (INSERT, UPDATE, or ALTER) on an affected table causing unexpected crashes or restarts, with a "segmentation fault" message in server logs,
  • write transactions failing due to “ERROR: posting list tuple with X items cannot be split at offset”
  • pgAgent jobs failing due to “std::__ios failure’ what(): basic_filebuf::_M_convert_to_external conversion error: iostream error
  • duplicate key values despite UNIQUE constraint in place
  • SELECT queries not returning rows that indeed exist in a table and fulfill the WHERE clause

If those things happen to you, and you have recently upgraded your OS, switched over to a replica that was running a newer OS version, or migrated the database using streaming replication to another machine running a newer OS version – you may be a victim of index corruption.
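
One way to confirm the suspicion (a hedged sketch, not necessarily the procedure the author goes on to recommend) is to verify the suspect B-tree indexes with the amcheck extension:

CREATE EXTENSION IF NOT EXISTS amcheck;
-- raises an error if the B-tree structure violates its expected ordering
SELECT bt_index_check('my_index');         -- 'my_index' is a placeholder index name
-- slower variant that additionally verifies every heap tuple has an index entry
SELECT bt_index_check('my_index', true);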

The cure

First, check the system logs for any

[...]
Posted by Pavlo Golub in Cybertec on 2021-09-16 at 08:00

Hello, my name is Pavlo Golub, and I am a scheduler addict. That began when I implemented pg_timetable for PostgreSQL. I wrote a lot about it. In this post, I want to share the result of my investigations on the schedulers available for PostgreSQL. I gave a talk about this topic at the CERN meetup, so you may want to check it out for more details.

Comparison table of PostgreSQL schedulers

Let’s start with the complete comparison table. If you want to know more about each aspect, you’ll find further explanations below.

I would like to get comments and suggestions on this table, especially from developers or users of these products. I can be biased towards my own creation, so please don’t judge me too harshly. 🙂

Feature \ Product           | pg_timetable | pg_cron  | pgAgent    | jpgAgent   | pgbucket
----------------------------+--------------+----------+------------+------------+-----------
Architecture
Year                        | 2019         | 2016     | 2008       | 2016       | 2015
Implementation              | standalone   | bgworker | standalone | standalone | standalone
Language                    | Go           | C        | C++        | Java       | C++
Can operate w/o extension   | ✔            | ❌        | ❌          | ❌          | ✔
Jobs meta stored in         | database     | database | database   | database   | file
Remote Database Execution   | ✔            | ❌        | ✔
[...]
In the upcoming release of PostgreSQL-14, we will see multiple enhancements in Logical Replication which I hope will further increase its usage. This blog is primarily to summarize and briefly explain all the enhancements in Logical Replication.

Decoding of large transactions:

Allow streaming large in-progress transactions to subscribers. Before PostgreSQL-14, transactions were streamed only at commit time, which led to a large apply lag for large transactions. With this feature, the apply lag will be reduced, and in certain scenarios that will lead to a big performance win. I have explained this feature in detail in my previous blog.
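
For example, a subscriber opts in to this behaviour when creating (or altering) the subscription; a minimal sketch with placeholder names:

CREATE SUBSCRIPTION mysub
    CONNECTION 'host=primary dbname=postgres'
    PUBLICATION mypub
    WITH (streaming = on);   -- stream large in-progress transactions instead of decoding them only at commit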

Performance of logical decoding:

Reduced CPU usage and improved decoding performance for transactions containing a lot of DDL. It has been observed that decoding a transaction containing the truncation of a table with 1000 partitions now finishes in 1s, whereas before this work it used to take 4-5 minutes. Before explaining how we achieved this performance gain, let me briefly describe what an invalidation message is in PostgreSQL, as that is important for understanding this optimization. These are messages used to flush system cache entries that have become invalid in each backend session. We normally execute them at command end in the backend that generated them, and send them at transaction end via a shared queue to other backends for processing. They are normally generated for insert/delete/update operations on system catalogs, which happen during DDL operations.

While decoding, we used to execute all the invalidations of the entire transaction at each command end, as we had no way of knowing which invalidations happened before that command. Due to this, transactions involving large amounts of DDL used to take more time and also led to high CPU usage. But now we know the specific invalidations at each command end, so we execute only the required invalidations. This work was accomplished by commit d7eb52d718.

Initial table sync:

The initial table sy
[...]

Introduction

In the previous posts ([1], [2], [3]), I explained the Incremental View Maintenance (IVM) feature that we are proposing for PostgreSQL. As I explained in [1], our IVM implementation supports tuple duplicates, that is, it allows views to contain duplicated tuples. This article describes how we handle tuple duplicates in the IVM implementation.

Tuple Duplicates in Incremental View Maintenance

In [3], I introduced the basic theory of Incremental View Maintenance (IVM). Now, I will briefly review it.

Suppose table R is updated and the changes are stored into two delta tables, ∇R and ΔR. ∇R is the collection of tuples deleted from R, and ΔR is the collection of tuples inserted into R. Using them, we can calculate the delta tables of the view, ∇V and ΔV, that contain the changes that will occur on the view.

Finally, we can update the view incrementally by applying the delta tables ∇V and ΔV to the view V. Specifically, tuples in ∇V are deleted from V, then tuples in ΔV are inserted into V.

Note here that tables and views may contain duplicate tuples, because SQL is based on bag semantics, not set semantics. For example, suppose view V contains two tuples A and three tuples B. This is expressed as V = {A (2), B (3)}, which means that tuple A has a multiplicity of 2 and tuple B has a multiplicity of 3. Then, suppose you want to delete one tuple A and two tuples B. That is, ΔV = {A (1), B (2)}, and we want to perform the operation V ← V ∸ ΔV. (Here, ∸ is an operator called “monus”.)

How to Remove the Specified Number of Tuples?

Now, how can we get the result of this operation in PostgreSQL?

If you simply use DELETE as shown below, you will not be able to delete the specified number of tuples, because all rows with matching values will be deleted.

DELETE FROM V WHERE x IN (SELECT x FROM ΔV);

One way to get the desired result is to use EXCEPT ALL as follows:

(SELECT * FROM V) EXCEPT ALL (SELECT * FROM ΔV);
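
Applied to the earlier example, V = {A (2), B (3)} and ΔV = {A (1), B (2)}, a small self-contained sketch shows the bag semantics at work:

SELECT * FROM (VALUES ('A'), ('A'), ('B'), ('B'), ('B')) AS v(x)   -- V: two A's, three B's
EXCEPT ALL
SELECT * FROM (VALUES ('A'), ('B'), ('B')) AS d(x);                -- ΔV: one A, two B's
-- result: one 'A' and one 'B', i.e. V ∸ ΔV = {A (1), B (1)}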

However, while this gives the result of the deletion, it does

[...]

In my recent post on time-weighted averages, I described how my early career as an electrochemist exposed me to the importance of time-weighted averages, which shaped how we built them into TimescaleDB hyperfunctions. A few years ago, soon after I started learning more about PostgreSQL internals (check out my aggregation and two-step aggregates post to learn about them yourself!), I worked on backends for an ad analytics company, where I started using TimescaleDB.

Like most companies, we cared a lot about making sure our website and API calls returned results in a reasonable amount of time for the user; we had billions of rows in our analytics databases, but we still wanted to make sure that the website was responsive and useful.

There’s a direct correlation between website performance and business results: users get bored if they have to wait too long for results, which is obviously not ideal from a business and customer loyalty perspective. To understand how our website performed and find ways to improve, we tracked the timing of our API calls and used API call response time as a key metric.

Monitoring an API is a common scenario and generally falls under the category of application performance monitoring (APM), but there are lots of similar scenarios in other fields including:

  1. Predictive maintenance for industrial machines
  2. Fleet monitoring for shipping companies
  3. Energy and water use monitoring and anomaly detection

Of course, analyzing raw (usually time-series) data only gets you so far. You want to analyze trends, understand how your system performs relative to what you and your users expect, and catch and fix issues before they impact production users, and so much more. We built TimescaleDB hyperfunctions to help solve this problem and simplify how developers work with time-series data.

For reference, hyperfunctions are a series of SQL functions that make it easier to manipulate and analyze time-series data in PostgreSQL with fewer lines of code. You can use hyperfunct

[...]

Respecting the majority, questioning the status quo as a minority

Yesterday, the PostgreSQL Core Team launched an unprecedented attack on Fundación PostgreSQL as a whole and individually against Alvaro Hernandez. Fundación PostgreSQL responded promptly.

This was an attack against a Postgres Community Non-Profit and a Community member at the individual level. It was unprecedented, as no similar attack has happened in the past. But more concerning, it was unnecessary and disproportionate:

  • Unnecessary: Core had previously decided that they didn't want to have any conversation and would substantiate their claims in court. The processes are awaiting resolution. If Core believes court is the way to go, why take this action before proceedings are finalized? Fundación PostgreSQL has always expressed its willingness to hold conversations and negotiations with the Core Team, which has never wanted to establish any conversation with us, with or without lawyers.
    As the steward of the Postgres Community, Core should have tried everything possible to resolve the dispute, in an amicable manner (never tried, despite their claims), and in private. To try every possible measure without causing harm to the Community and its members. Core’s way of acting has weakened the Community, and wasn’t necessary. It has created an unnecessary divide, only to polarize and attack part of it. Conflicts and disputes may appear in a Community; but the stewards must exercise extreme care in how they are handled, and should put global interests ahead, trying to its fullest to resolve disputes, without creating outrage and undermining some members of the Community. Core has never tried to resolve this dispute, only to strong-arm it.

  • Disproportionate: there was no urgency, nor harm being inflicted upon the Postgres Community. Even if the claim and defense of the trademarks by Core Team may be legit, Fundación PostgreSQL has made it very clear that the registered trademarks are solely for the use of the

[...]

Over the years, many of our PostgreSQL clients have asked whether it makes sense to create indexes before – or after – importing data. Does it make sense to disable indexes when bulk loading data, or is it better to keep them enabled? This is an important question for people involved in data warehousing and large-scale data ingestion. So let’s dig in and figure it out:

B-tree indexes in PostgreSQL

Before we dive into the difference in performance, we need to take a look at how B-trees in PostgreSQL are actually structured:


An index points to pages in the heap (= “table”). In PostgreSQL, a B-tree index contains an entry for every row in the underlying table. Inserting content into the table therefore also means that data has to be added to the B-tree structure, which can easily turn a single change into a more expensive process. Keep in mind that this happens for each row (unless you are using partial indexes).

In reality, the additional data can pile up, resulting in a really expensive operation.

Some basic performance data

To show the difference between indexing after an import and before an import, I have created a simple test on my old Mac OS X machine. However, one can expect to see similar results on almost all other operating systems, including Linux.

Here are some results:

test=# CREATE TABLE t_test (id int, x numeric);
CREATE TABLE
test=# INSERT INTO t_test SELECT *, random()
FROM generate_series(1, 10000000);
INSERT 0 10000000
Time: 50249.299 ms (00:50.249)
test=# CREATE INDEX idx_test_x ON t_test (x);
CREATE INDEX
Time: 68710.372 ms (01:08.710)

As you can see, we need roughly 50 seconds to load the data, and a little more than one minute to create the index.

But what happens if we drop the table, and create the index before loading the data?

test=# CREATE TABLE t_test (id int, x numeric);
CREATE TABLE
test=# CREATE INDEX idx_test_x ON t_test (x);
CREATE INDEX
Time: 11.192 ms
test=# INSERT INTO t_test SELECT *, random()
FROM generate_series(1, 100
[...]
Posted by Andreas 'ads' Scherbaum on 2021-09-13 at 14:00
PostgreSQL Person of the Week Interview with Soumyadeep Chakraborty: My hometown is Kolkata, India and I have spent the majority of my life there, including 3 years in the tech industry. More recently I have been living in the United States: I lived in Long Island, NY for a year and a half. I have been living in San Jose, California for the past two years, working at VMware.
Posted by Pavel Stehule on 2021-09-13 at 05:14
you can download source code from https://github.com/okbob/pspg/releases/tag/5.3.5

Postgres Core Team launches unprecedented attack against the Postgres Community

Being distributed is the key to the project’s resilience

One of the often cited advantages of the PostgreSQL project is its resiliency. Especially in the presence of rogue actors: being a distributed Community, it is hard to target any individual, group or entity and affect/disrupt the whole Community. Similarly, it is not possible to “buy PostgreSQL”, irrespective of how much money you have, since there’s no single entity, group or company that constitutes the whole project and thus could be acquired. PostgreSQL is a Distributed Community, and this is one of its core strengths.

But actually: Who is the PostgreSQL Community? Who develops PostgreSQL? The “PostgreSQL Global Development Group” (PGDG), which is an abstract term that covers developers and volunteers around the world that have contributed to PostgreSQL. Diving into the COPYRIGHT, we see that effectively for the period 1996-2021 it is assigned to the PGDG. The PostgreSQL Developer FAQ further clarifies that “contributors keeps their copyright […]. They simply consider themselves to be part of the PGDG”. Copyright is also distributed, reinforcing the previous reasoning about PostgreSQL’s resiliency.

Intellectual Property (IP) protection for the project also requires adequate trademark protection. As of today there are three PostgreSQL Community Non-Profit Organizations (NPOs) that hold trademark registrations for the PostgreSQL project. And the three of them have made them public, open and free for anyone to use for the benefit of PostgreSQL: the PostgreSQL Association of Canada (PAC), PostgreSQL Europe (PEU) and Fundación PostgreSQL (FPG), though some of them have lately introduced some restrictions on their use, as we shall see later.

There are several other variations of these trademarks related to Postgres products and Postgres company names, that are registered by their respective (for-profit) companies.

PostgreSQL trademark protection history

T

[...]
Posted by Regina Obe in PostGIS on 2021-09-11 at 00:00

The PostGIS Team is pleased to release the first alpha of the upcoming PostGIS 3.2.0 release.

Best served with PostgreSQL 14 beta3. This version of PostGIS utilizes the faster GiST building support API introduced in PostgreSQL 14. If compiled with the in-development GEOS 3.10dev you can take advantage of improvements in ST_MakeValid. This release also includes many additional functions and improvements for postgis_raster and postgis_topology extensions.

Posted by brian davis on 2021-09-10 at 20:00

Below are the dates of Postgres major version releases and when they first became available on RDS and Aurora.

Useful for gaining a leg up in your office AWS Managed Postgres Major Version Release Date betting pool.

WITH releases AS (
  SELECT * FROM
  (VALUES
    ('13',  '2020-09-24'::date, '2021-02-24'::date, '2021-08-26'::date),
    ('12',  '2019-10-03'::date, '2020-03-31'::date, '2021-01-28'::date),
    ('11',  '2018-10-18'::date, '2019-03-13'::date, '2019-11-26'::date),
    ('10',  '2017-10-05'::date, '2018-02-27'::date, '2018-09-25'::date),
    ('9.6', '2016-09-29'::date, '2016-11-11'::date, '2017-10-24'::date)
  ) AS t (version, pg, rds, aurora)
)
SELECT
  version,
  pg AS "Postgres Release",
  rds AS "RDS Release",
  aurora AS "Aurora Release",
  rds - pg AS "PG -> RDS Lag",
  aurora - pg AS "PG -> Aurora Lag",
  aurora - rds AS "RDS -> Aurora Lag"
FROM
  releases
ORDER BY
  version::decimal
 version | Postgres Release | RDS Release | Aurora Release | PG -> RDS Lag | PG -> Aurora Lag | RDS -> Aurora Lag
---------+------------------+-------------+----------------+---------------+------------------+-------------------
 9.6     | 2016-09-29       | 2016-11-11  | 2017-10-24     |            43 |              390 |               347
 10      | 2017-10-05       | 2018-02-27  | 2018-09-25     |           145 |              355 |               210
 11      | 2018-10-18       | 2019-03-13  | 2019-11-26     |           146 |              404 |               258
 12      | 2019-10-03       | 2020-03-31  | 2021-01-28     |           180 |              483 |               303
 13      | 2020-09-24       | 2021-02-24  | 2021-08-26     |           153 |              336 |               183
(5 rows)

Postgres release dates https://www.postgresql.org/support/versioning/

13 on RDS https://aws.amazon.com/about-aws/whats-new/2021/02/amazon-rds-now-supports-postgresql-13/

13 on Aurora https://aws.amazon.com/about-aws/whats-new/2021/08/amazon-aurora-postgresql-13/

12 on RDS https://aws.amazon.co

[...]
Posted by Brandur Leach on 2021-09-10 at 15:55

One of the major revelations for almost every new user to Postgres is that there's no technical advantage to specifying columns as varchar(n) compared to just using boundless text. Not only is the text type provided as a convenience (it's not in the SQL standard), but using it compared to constrained character types like char and varchar carries no performance penalty. From the Postgres docs on character types (and note that character varying is the same thing as varchar):

There is no performance difference among these three types, apart from increased storage space when using the blank-padded type, and a few extra CPU cycles to check the length when storing into a length-constrained column. While character(n) has performance advantages in some other database systems, there is no such advantage in PostgreSQL; in fact character(n) is usually the slowest of the three because of its additional storage costs. In most situations text or character varying should be used instead.

For many of us this is a huge unburdening, as we’re used to micromanaging length limits in other systems. Having worked in large MySQL and Oracle systems, I was in the habit of not just figuring out what column to add, but also how long it needed to be – should this be a varchar(50) or varchar(100)? 500? (Or none of the above?) With Postgres, you just stop worrying and slap text on everything. It’s freeing.
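
To make the contrast concrete, here is a hypothetical sketch of the two styles: an unconstrained text column, and the same column with an explicit limit expressed as a CHECK constraint instead of varchar(n):

-- "slap text on everything": no limit at all
CREATE TABLE events (id bigint PRIMARY KEY, description text);

-- same storage and performance characteristics, but with an explicit, easily adjusted limit
CREATE TABLE events_limited (
    id          bigint PRIMARY KEY,
    description text CHECK (length(description) <= 500)
);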

I’ve since changed my position on that somewhat, and to explain why, I’ll have to take you back to Stripe circa ~2018.

One day we came to a rude awakening that we weren’t checking length limits on text fields in Stripe’s API. It wasn’t just that a few of them weren’t checked – it was that practically none of them were. While the API framework did allow for a maximum length, no one had ever thought to assign it a reasonable default, and as a matter of course the vast majority of parameters (of which there were thousands by this point) didn’t set one. As long as senders didn’t break any limits around aroun

[...]
On 10th of September 2021, Noah Misch committed patch: Revoke PUBLIC CREATE from public schema, now owned by pg_database_owner.   This switches the default ACL to what the documentation has recommended since CVE-2018-1058. Upgrades will carry forward any old ownership and ACL. Sites that declined the 2018 recommendation should take a fresh look. Recipes for … Continue reading "Waiting for PostgreSQL 15 – Revoke PUBLIC CREATE from public schema, now owned by pg_database_owner."
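
For existing databases, the recommendation that this commit turns into the default boils down to something like the following (a hedged sketch of the documented recipe, with a placeholder role name):

REVOKE CREATE ON SCHEMA public FROM PUBLIC;   -- stop ordinary users from creating objects in public
CREATE SCHEMA alice AUTHORIZATION alice;      -- optionally, give each user a schema of their own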

pgbackrest supports the JSON output format, and this can be useful to automate some information analysis.

Using jq to get information out of pgbackrest

pgbackrest offers the output of its commands in the JSON format. I'm not a great fan of JSON, but having such an output offers a few advantages, most notably that it is a stable text format that can be inspected easily with other tools.
In other words, there is no need for regular expressions to parse the textual output; moreover, the output is guaranteed to be stable, meaning no changes will happen (or rather, no fields will be removed), while a simple rephrasing of the text output could break your crafty regular expression!

Among the available tools, jq is a good shell program that allows you to parse and navigate JSON content.
Let’s see how it is possible to get some output combining jq and pgbackrest.

Get the last backup information

When your stanza has a lot of backups, you probably don't want to monitor all of them in depth, but would rather get a quick hint about when the last backup took place.
The pgbackrest info command reports all the backups available for a given stanza, and it can then be piped into jq to get more human-readable information.
Quick! Show me the snippet:



$ pgbackrest info --output json | jq '"Stanza:  " + .[].name + " (" +  .[].status.message + ") " + "Last backup completed at "  +   (.[].backup[-1].timestamp.stop | strftime("%Y-%m-%d %H:%M") )' 

"Stanza:  miguel (ok) Last backup completed at 2021-07-27 09:23"



This is what I would like to see when I'm in a rush and need to see which machines are in trouble with backups: it shows me the name of the stanza, the status of the backup (ok) and the date and time the backup ended.
Let’s analyze the command in more detail:

  • pgbackrest info --output json enables the output of the info command as JSON;
    • jq is used to parse the JSON output, concatenating strings (delimited by ") with +
    • .[].name pr
[...]
Posted by Miranda Auhl in Timescale on 2021-09-09 at 15:32

Time-series data is everywhere, and it drives decision-making in every industry. Time-series data collectively represents how a system, process, or behavior changes over time. Understanding these changes helps us to solve complex problems across numerous industries, including observability, financial services, Internet of Things, and even professional football.

Depending on the type of application they’re building, developers end up collecting millions of rows of time-series data (and sometimes millions of rows of data every day or even every hour!). Making sense of this high-volume, high-fidelity data takes a particular set of data analysis skills that aren’t often exercised as part of the classic developer skillset. To perform time-series analysis that goes beyond basic questions, developers and data analysts need specialized tools, and as time-series data grows in prominence, the efficiency of these tools becomes even more important.

Often, data analysts’ work can be boiled down to evaluating, cleaning, transforming, and modeling data. In my experience, I’ve found these actions are necessary for me to gain understanding from data, and I will refer to this as the “data analysis life cycle” throughout this post.

Graphic showing the “data analysis lifecycle”, Evaluate -> Clean -> Transform -> Model
Data analysis lifecycle

Excel, R, and Python are arguably some of the most commonly used data analysis tools, and, while they are all fantastic tools, they may not be suited for every job. Speaking from experience, these tools can be especially inefficient for “data munging” at the early stages of the lifecycle; specifically, the evaluating data, cleaning data, and transforming data steps involved in pre-modeling work.

As I’ve worked with larger and more complex datasets, I’ve come to believe that databases built for specific types of data - such as time-series data - are more effective for data analysis.

For background, TimescaleDB is a relational database for time-series data. If your analysis is based on time-series datasets, TimescaleDB can be a great choice not

[...]

After a few months of research and experimentation with running a heavily DB-dependent Go app, we’ve arrived at the conclusion that sqlc is the figurative Correct Answer when it comes to using Postgres (and probably other databases too) in Go code beyond trivial uses. Let me walk you through how we got there.

First, let’s take a broad tour of popular options in Go’s ecosystem:

  • database/sql: Go’s built-in database package. Most people agree – best to avoid it. It’s database agnostic, which is kind of nice, but by extension that means it conforms to the lowest common denominator. No support for Postgres-specific features.

  • lib/pq: An early Postgres frontrunner in the Go ecosystem. It was good for its time and place, but has fallen behind, and is no longer actively maintained.

  • pgx: A very well-written and very thorough package for full-featured, performant connections to Postgres. However, it’s opinionated about not offering any ORM-like features, and gets you very little beyond a basic query interface. Like with database/sql, hydrating database results into structs is painful – not only do you have to list target fields off ad nauseam in a SELECT statement, but you also have to Scan them into a struct manually.

    • scany: Scany adds some quality-of-life improvement on top of pgx by eliminating the need to scan into every field of a struct. However, the desired field names must still be listed out in a SELECT ... statement, so it only reduces boilerplate by half.
  • go-pg: I’ve used this on projects before, and it’s a pretty good little Postgres-specific ORM. A little more below on why ORMs in Go aren’t particularly satisfying, but another downside with go-pg is that it implements its own driver, and isn’t compatible with pgx.

    • Bun: go-pg has also been put in maintenance mode in favor of Bun, which is a go-pg rewrite that works with non-Postgres databases.
[...]
Posted by Laurenz Albe in Cybertec on 2021-09-08 at 08:00

Index bloat in bad hands
© Laurenz Albe 2021

PostgreSQL v12 brought more efficient storage for indexes, and v13 improved that even more by adding deduplication of index entries. But Peter Geoghegan is not done yet! PostgreSQL v14 will bring “bottom-up” index entry deletion, which is targeted at reducing unnecessary page splits, index bloat and fragmentation of heavily updated indexes.

Why do we get index bloat?

In a B-tree index, there is an index entry for every row version (“tuple”) in the table that is not dead (invisible to everybody). When VACUUM removes dead tuples, it also has to delete the corresponding index entries. Just like with tables, that creates empty space in an index page. Such space can be reused, but if no new entries are added to the page, the space remains empty.

This “bloat” is unavoidable and normal to some extent, but if it gets to be too much, the index will become less efficient:

  • for an index range scan, more pages have to be scanned
  • index pages cached in RAM means that you cache the bloat, which is a waste of RAM
  • fewer index entries per page mean less “fan out”, so the index could have more levels than necessary

This is particularly likely to happen if you update the same row frequently. Until VACUUM can clean up old tuples, the table and the index will contain many versions of the same row. This is particularly unpleasant if an index page fills up: then PostgreSQL will “split” the index page in two. This is an expensive operation, and after VACUUM is done cleaning up, we end up with two bloated pages instead of a single one.
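
To get a rough feel for how bloated an index already is, one option (a hedged sketch, not taken from this article) is the pgstattuple extension:

CREATE EXTENSION IF NOT EXISTS pgstattuple;
-- low avg_leaf_density and high leaf_fragmentation hint at a bloated, fragmented index
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('my_index');   -- 'my_index' is a placeholder index name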

Current features to improve index bloat and performance

HOT tuples

The creation of HOT tuples is perhaps the strongest weapon PostgreSQL has to combat unnecessary churn in the index. With this feature, an UPDATE creates tuples that are not referenced from an index, but only from the previous version of the table row. That way, there is no need to write a new index entry at all, which is good for performance and completely avoids index

[...]
Posted by Ryan Lambert on 2021-09-07 at 03:15

If you use Postgres and Python together you are almost certainly familiar with psycopg2. Daniele Varrazzo has been the maintainer of the psycopg project for many years. In 2020 Daniele started working full-time on creating psycopg3, the successor to psycopg2. Recently, the Beta 1 release of psycopg3 was made available via PyPI install. This post highlights two pieces of happy news with psycopg3:

  • Migration is easy
  • The connection pool rocks

As the first section shows, migration from psycopg2 to psycopg3 is quite easy. The majority of this post is dedicated to examining psycopg3's connection pool and the difference this feature can make to your application's performance.

Migration

Easy migration is an important feature to encourage developers to upgrade. It is frustrating when a "simple upgrade" turns into a cascade of error after error throughout your application. Luckily for us, psycopg3 got this part right! In the past week I fully migrated two projects to psycopg3 and started migrating two more projects. So far the friction has been very low and confined to edge case uses.

The following example shows a simplified example of how my projects have used psycopg2.