PostgreSQL
The world's most advanced open source database
Top posters
Number of posts in the past two months
Top teams
Number of posts in the past two months
Feeds
Planet
  • Policy for being listed on Planet PostgreSQL.
  • Add your blog to Planet PostgreSQL.
  • List of all subscribed blogs.
  • Manage your registration.
Contact
  • Get in touch with the Planet PostgreSQL administrators at planet at postgresql.org.

It has been some time since something has been posted on this blog, and here is a short story about the following commit, for a feature that will land in the upcoming version 14 of PostgreSQL:

commit: bbe0a81db69bd10bd166907c3701492a29aca294
author: Robert Haas 
date: Fri, 19 Mar 2021 15:10:38 -0400
Allow configurable LZ4 TOAST compression

There is now a per-column COMPRESSION option which can be set to pglz
(the default, and the only option in up until now) or lz4. Or, if you
like, you can set the new default_toast_compression GUC to lz4, and
then that will be the default for new table columns for which no value
is specified. We don't have lz4 support in the PostgreSQL code, so
to use lz4 compression, PostgreSQL must be built --with-lz4.

[...]

Dilip Kumar. The original patches on which this work was based were
written by Ildus Kurbangaliev, and those were patches were based on
even earlier work by Nikita Glukhov, but the design has since changed
very substantially, since allow a potentially large number of
compression methods that could be added and dropped on a running
system proved too problematic given some of the architectural issues
mentioned above; the choice of which specific compression method to
add first is now different; and a lot of the code has been heavily
refactored.  More recently, Justin Przyby helped quite a bit with
testing and reviewing and this version also includes some code
contributions from him. Other design input and review from Tomas
Vondra, Álvaro Herrera, Andres Freund, Oleg Bartunov, Alexander
Korotkov, and me.

Discussion: http://postgr.es/m/20170907194236.4cefce96%40wp.localdomain
Discussion: http://postgr.es/m/CAFiTN-uUpX3ck%3DK0mLEk-G_kUQY%3DSNOTeqdaNRR9FMdQrHKebw%40mail.gmail.com

Storing large values in Postgres is addressed with the concept of TOAST, that basically compresses the data value using a compression algorithm proper to PostgreSQL called pglz (from src/common/pg_lzcompress.c in the code tree for its APIs). This compression method

[...]

A Few Belated PGXN Updates

The last couple weeks I’ve returned to PGXN and made a few updates. Nothing huge, but all long overdue.

First up, PGXN Manager is all TSL now. No public HTTP-only site. What started as a convenient division of labor (http for the public site and https for the authenticated site) turned out to be quite irksome — especially since some browsers wouldn’t load the non-TLS site at all anymore. So I did away with it, and now it’s all TLS, authenticated and not. (The API and search sites have been TLS for a while now.)

I also fixed a few long-standing bugs in PGXN Manager, most of which weren’t visible to end-users, but have annoyed me over the years. A few silly server errors, some instances of uploads failing for anything other than zip files, that sort of thing.

Oh, and if you’re a extension author, PGXN Manger now allows updates to old distribution versions to be uploaded. Previously it only allowed a new version to be greater than all previous versions. Now it will allow a new X.Y.Z version if X.Y previously existed and the new .Z is greater, and a new X.Y version if X previously existed and the new .Y is greater than any previous X.Y. To get this to work properly, I also dropped the check for versions of extensions in the uploaded files. It would just be too complicated to add a bunch of rules more likely to annoy than not. So it now only enforces patterns for distribution release versions. I trust extension authors not to then lower extension versions on new releases — that would just be silly, and not helpful to your users.

As part of this work, I also revamped the management of the PGXN server. Back in 2010 I wrote Capistrano files to manage the server, but have long ceased to use them, as they ceased to work. I’ve now removed all that detritus from the PGXN Manager, API, and Site repositories, and replaced them all with a single new repository, pgxn-ops, which contains Ansible playbooks to manage all the services. They don’t deal with a lot of the server-side st

[...]
Posted by Bruce Momjian in EDB on 2021-05-14 at 16:00

Optimizing compilers allow users to control how hard the compiler will work to produce efficient CPU instructions, e.g., -O, -O3. Postgres also has an optimizer, but there is limited control over its aggressiveness in generating efficient plans for the executor.

The default behavior of the Postgres optimizer is considered to be sufficient for almost all queries, both OLTP and OLAP. Postgres does have the ability to automatically reduce the aggressiveness of the optimizer for queries with many tables. These queries are known to cause the optimizer to take a pathological amount of time, perhaps more than the time required to execute even less-optimized queries. It is easy to enable the solution for this problem, called the Genetic Query Optimizer (GEQO), because it is a well-known problem, and it is easy to detect if the problem could happen by counting the number of tables listed in the FROM clause. From_collapse_limit and join_collapse_limit can also prevent the inefficient increase in the number of tables processed by each FROM clause, and hence the plan/optimization duration.

There are a few options for increasing the aggressiveness of the optimizer, and potentially increasing plan creation time. Just-in-Time Compilation (JIT) enables a completely new way of processing expressions and row formation. There are three JIT costs that control when its optimizations are enabled. Constraint_exclusion can also be increased and decreased in its aggressiveness.

Continue Reading »

Not too long ago I wrote a blog post about how to deploy TLS for Postgres on Kubernetes in attempt to provide a helpful guide from bringing your own TLS/PKI setup to Postgres clusters on Kubernetes. In part, I also wanted a personal reference for how to do it!

Posted by Egor Rogov in Postgres Professional on 2021-05-13 at 00:00

So, we got acquainted with the structure of the buffer cache and in this context concluded that if all the RAM contents got lost due to failure, the write-ahead log (WAL) was required to recover. The size of the necessary WAL files and the recovery time are limited thanks to the checkpoint performed from time to time.

In the previous articles we already reviewed quite a few important settings that anyway relate to WAL. In this article (being the last in this series) we will discuss problems of WAL setup that are unaddressed yet: WAL levels and their purpose, as well as the reliability and performance of write-ahead logging.

WAL levels

The main WAL task is to ensure recovery after a failure. But once we have to maintain the log anyway, we can also adapt it to other tasks by adding some more information to it. There are several logging levels. The wal_level parameter specifies the level, and each next level includes everything that gets into WAL of the preceding level plus something new.

...

Posted by Bruce Momjian in EDB on 2021-05-12 at 13:15

Hash aggregation has always confused me. I know what hash joins are, and even wrote about it, and I know what aggregates are. So, what is hash aggregation? Is it a hash join that uses an aggregate function?

Well, no. Hash aggregates allow accumulation/aggregation of duplicate values or removal of duplicate values. There are four ways to perform this aggregation, as listed in AggStrategy:

  1. Plain: simple agg across all input rows
  2. Sorted: grouped agg, input must be sorted
  3. Hashed: grouped agg, use internal hashtable
  4. Mixed: grouped agg, hash and sort both used

The first option does not retain or "pass upward in the executor plan" any individual values. The second option processes the rows in order, allowing easy accumulation or removal of duplicate values. The last two use hash aggregation, that is accumulation or removal of duplicates using hashing. Hashing aggregation is used to perform GROUP BY, DISTINCT, and UNION (without ALL) processing.

Continue Reading »

There are two types of replication available in PostgreSQL at the moment: Streaming replication & Logical replication. If you are looking to set up streaming replication for PostgreSQL 13, this is the page you have been looking for. This tutorial will show you how to configure PostgreSQL replication and how to set up your database servers quickly.

PostgreSQL replication: What we want to achieve

Before we get started with configuring PostgreSQL, it makes sense to take a look at what we want to achieve. The goal of this tutorial is to create a primary server replicating data to a secondary one, using asynchronous replication.

Here is what the desired setup will look like:

PostgreSQL streaming replication

The entire setup will be done using CentOS 8.3. The process on RHEL (Redhat Enterprise Linux) is expected to be the same. Simply follow the same procedures.

To show how the setup works, we are using two virtual machines with the following IPs:

  • Primary: 10.0.3.200 (node1)
  • Secondary: 10.0.3.201 (node2)

Let’s prepare these systems step-by-step.

Installing PostgreSQL

Once you have installed CentOS / RHEL you can already prepare the installation of PostgreSQL itself. The way to do that is to go to the PostgreSQL website and follow the instructions. The following script shows how things work. You can simply copy / paste the script:


sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
sudo dnf -qy module disable postgresql
sudo dnf install -y postgresql13-server

# can be skipped on the 2nd node
sudo /usr/pgsql-13/bin/postgresql-13-setup initdb
sudo systemctl enable postgresql-13

# can be skipped on the 2nd node
sudo systemctl start postgresql-13

Let’s check what we should now see on node1:


[root@node1 ~]# ps axf | grep post
5542 pts/1 S+ 0:00 \_ grep --color=auto post
5215 ? Ss 0:00 /usr/pgsql-13/bin/postmaster -D /var/lib/pgsql/13/data/
5217 ? Ss 0:00 \_ postgres: logger
5219 ? Ss 0:00 \_ postgres: checkpointer
5220 ? Ss 
[...]
Posted by Jonathan Katz in Crunchy Data on 2021-05-11 at 10:05

A recent (well depending on when you read this) Twitter discussion mentioned the topic of creating the quintessential "read-only Postgres user" that can, well, only read info from a database, not write to it. A simple way to handle this case is to create a read-only Postgres replica, but that may not make sense based on your application.

So, how can you simply create a read-only Postgres user (and note that I will use "user" and "role" interchangeably)? Let's explore!

Posted by Bruce Momjian in EDB on 2021-05-11 at 01:30

Having written over 600 blog entries, I thought I would have already covered the complexities of the CLUSTER command, but it seems I have not, so let's do that now.

CLUSTER is an unusual SQL command because, like non-unique CREATE INDEX, it only affects performance. In fact, CLUSTER requires the existence of an index. So, what does CLUSTER do? Well, what does CREATE INDEX do? Let's look at how storage works in Postgres.

User data rows are stored in heap files in the file system, and those rows are stored in an indeterminate order. If the table is initially loaded in INSERT/COPY order, later inserts, updates, and deletes will cause rows to be added in unpredictable order in the heap files. CREATE INDEX creates a secondary file with entries pointing to heap rows, and index entries are ordered to match the values in the columns specified in the CREATE INDEX command. By quickly finding desired values in the index, index pointers can be followed to quickly find matching heap rows.

Continue Reading »

Posted by Andreas 'ads' Scherbaum on 2021-05-10 at 13:30
PostgreSQL Person of the Week Interview with Laurenz Albe: I live in Vienna, Austria, the city of my birth. I am working for Cybertec as a PostgreSQL consultant, trainer and developer.
In part 1 of this blog mini-series we looked at how to setup PostgreSQL so that we can perform regression analysis on our data using TensorFlow from within the database server using the pl/python3 procedural language. In part 2 we looked at how to understand and pre-process the data prior to training a neural network. [Continue reading...]
Posted by Luca Ferrari on 2021-05-10 at 00:00

A possible system that differs from sudo.

A glance at doas & pg_ctl

doas(1) is a replacement for sudo(1), a program that allows you to execute commands as a different user. The main advantage of using sudo(1) and hence doas(1) is that you can gain different privileges without the need to know the authentication tokens (e.g., a password) to do that.
I use sudo(1) on pretty much every machine I use, both Linux and FreeBSD.
In this post I glance at doas(1) and how it can be quickly configured to run PostgreSQL commands, mainly pg_ctl.

doas introduction

doas(1) is a program that was born in the OpenBSD ecosystem as a replacement for sudo(1) because, in short, the latter is too big and cannot be easily integrated into the base system.
doas is now available on FreeBSD and Linux too, so it is worth spending some time to learn how it works.
doas(1) is based on a configuration file, namely doas.conf (in FreeBSD /usr/local/etc/doas.conf), that has a syntax a lot clearer than that of sudo, at least in my opinion.

Rules are pretty simple:

  • every line in the configuration file is a rule, and rules are read from top to the bottom;
  • a rule can be either permit or deny, allowing a user to run a command or not;
  • a command is prefix by the special keyword cmd;
  • a target user, that is the user you want to run the command as, is prefix by the keyword as;
  • the special keyword nopass does not ask for password (same as NOPASSWD option for sudo);
  • it is possible to specify or keep the environment or change it.


The usage of doas(1) is pretty much the same of sudo(1), and mainly;

  • doas is the entry command;
  • -u specifies the user to run the command as;
  • the command is the remaining part of the command line.



doas has a lot less features (and thus syntax cluttering) than sudo, and therefore it is a lot faster and easy to setup, and according to me a lot less prone to errors.

Using doas to control a PostgreSQL cluster

Assuming you want t

[...]

Getting a single random row, or a few rows, from a table in order to get representative data for example is a frequent need. The most common way to do this in PostgreSQL is using ORDER BY random() like:

SELECT id FROM data ORDER BY random() LIMIT 1

But when run on a large table this can be very slow because it will have to scan the entire table to find the rows. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.

And while Jonathan's method (he has the super simple sample code and results up on a gist) is still about twice as fast as ORDER BY random() on my test (with his data), it comes with some problems. For example, it requires a contiguous set of id values, that have to be integers. And it still takes about a second to run on my machine with his sample of 5 million rows -- and will keep getting slower as the table grows.

And it turns out, if you don't need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan's version and 40,000 times faster than ORDER BY random(). Enter TABLESAMPLE.

Posted by Bruce Momjian in EDB on 2021-05-07 at 14:15

Postgres has always run well on commodity hardware, and works well on Raspberry Pi and consumer devices too. However, we have always lagged in optimizing Postgres on very larger hardware, mostly because our community has limited access to such hardware. In recent years, Postgres support companies have given the community access to large hardware, and run performance test themselves, which has allowed the community to enhance Postgres for such hardware. However, since typical workloads do not use large hardware, there is always a lag in how quickly Postgres can adjust to ever-larger hardware platforms.

This email report from July of 2020 is about a server actively using 5k tables. When there are only a few hundred connections, basically 2–3x the number of CPUs, the system works fine, but the system experiences a steep performance decline as thousands of active connections try to issue queries. Whether the thousands of connections happen because of increased user demand or increased slowness due to the number of active connections is unclear. The email thread suggests various lock table and autovacuum improvements that might improve the situation, but there is no clear conclusion. One bright spot is that two weeks after this report, another report complained about the performance of autovacuum in a database with many tables, and supplied a patch which was applied in December and will be in Postgres 14.

Ideally we would have been able to test this patch against the workload reported in the July email to see if it fixes the problem. However, it is difficult to get the patch into production use, or get a test server of suitable size and produce a workload that matches the report. This is why it is much slower to improve the performance of large systems compared to simple workloads on commodity hardware — we make progress, but it is slower.

PostgreSQL has an extension to jsonpath: ** operator, which explores arbitrary depth finding your values everywhere. At the same time, there is a lax mode, defined by the standard, providing a “relaxed” way for working with json. In the lax mode, accessors automatically unwrap arrays; missing keys don’t trigger errors; etc. In short, it appears that the ** operator and lax mode aren’t designed to be together :)

The story started with the bug report. The simplified version is below. Jsonpath query is intended to select the value of key "y" everywhere. But it appears to select these values twice.

1
2
3
4
5
6
7
8
9
# SELECT * FROM jsonb_path_query([{“x: a, y: [{“x:b”}]}]::jsonb,
                                 ‘$.**.x);
 jsonb_path_query
——————
 a
 a
 b
 b
(4 rows)

This case looks like a bug. But is it? Let’s dig into details. Let’s split the jsonpath query into two parts: one containing the ** operator and another having the key accessor.

1
2
3
4
5
6
7
8
9
10
11
12
13
# SELECT var,
         jsonb_path_query_array(var, ‘$.x) key_x
  FROM jsonb_path_query([{“x: a, y: [{“x:b”}]}]::jsonb,
                        ‘$.**) var;
               var               | key_x
———————————+——-
 [{“x: a, y: [{“x: b”}]}] | [a]
 {“x: a, y: [{“x: b”}]}   | [a]
 a                             | []
 [{“x: b”}]                    | [b]
 {“x: b”}                      | [b]
 b                             | []
(6 rows)

As you can see, the ** operator selects every child in the json document as expected. The key accessor extracts corresponding values from both objects themselves and their wrapping arrays. And that’s also expected in the lax mode. So, it appears there is no bug; everything works as designed, although it’s surprising for users.

Finally, I’ve committed a paragraph to the

[...]

What happens to table that are not logged into WALs when a physical replication is in place?

To WAL or not to WAL? When unlogged becomes logged…

Creating and populating a database to test

First of all, let’s create a clean database just to keep the test environment separated from other databases:



testdb=# CREATE DATABASE rep_test WITH OWNER luca;
CREATE DATABASE



Now let’s create and populate three tables (one temporary, one unlogged and one normal):



rep_test=> CREATE TABLE t_norm( pk int GENERATED ALWAYS AS IDENTITY,
                  t text,
                  primary key( pk ) );

rep_test=> CREATE UNLOGGED TABLE 
           t_unlogged( like t_norm including all );

rep_test=> CREATE TEMPORARY TABLE 
           t_temp( like t_norm including all );
           

rep_test=> INSERT INTO t_norm( t )
               SELECT 'Row #' || v
               FROM generate_series( 1, 1000000 ) v;
INSERT 0 1000000
Time: 4712.185 ms (00:04.712)

rep_test=> INSERT INTO t_temp( t )
            SELECT 'Row #' || v
            FROM generate_series( 1, 1000000 ) v;
INSERT 0 1000000
Time: 1789.473 ms (00:01.789)

rep_test=> INSERT INTO t_unlogged( t )
               SELECT 'Unlogged #' || v
               FROM generate_series( 1, 1000000 ) v;
INSERT 0 1000000
Time: 1746.729 ms (00:01.747)



The situation now is as follows:

Table Status Insertion time
t_norm Ordinary table 4.7 secs
t_temp Temporary table 1.8 secs
t_unlogged Unlogged table 1.7 secs


As you can see, timing for temporary and unlogged tables is pretty much the same, and this is because both are not inserted into WAL records, and therefore there is no crash-recovery machinery involved. This al

[...]

Analysing data within PostGIS is just one side of the coin. What about publishing our datasets as maps that various clients can consume and profit from? Let’s have a look at how this can be accomplished with Geoserver. Geoserver acts in this regard as a full-fledged open source mapping server, supporting several OGC compliant services, such as OGC Web Mapping (WMS) or Web Feature Service (WFS). OGC is an abbreviation for Open GIS Consortium, which is a community working on improving access to spatial information by developing standards in this area.

Here’s a quick overview of this article’s structure:

  1. Geoserver setup
  2. Dataset and layer configuration
  3. Service configuration
  4. Service consumption

Geoserver setup

Let’s first start by setting up Geoserver as a single, non-clustered instance. For our demo, I recommend using one of the available Docker images to start up quickly. The steps listed below show how to pull our geoserver image (kartoza) first, and subsequently start the container with its default configuration within a tmux session, which allows us to easily monitor geoserver’s output.

docker pull kartoza/geoserver:2.18.2
tmux
docker run -it --name geoserver  -p 8080:8080 -p 8600:8080 kartoza/geoserver:2.18.2

If everything runs smoothly, Apache Tomcat reports back with

org.apache.catalina.startup.Catalina.start Server startup in XX ms

Dataset and layer configuration

In one of my last blog posts, I imported a csv containing locations of airports (https://ourairports.com/data/airports.csv). Let’s build on this and use these point geometries as the main vector layer for our visualization.

By using kartoza’s image defaults, Apache Tomcat will listen on port 8080 serving Geoserver’s web-interface with http://XX.XX.XX.XX:8080/geoserver. So, let’s continue and login with geoserver/admin.

Figure 1 Geoserver Web Administrator

Figure 1 represents Geoserver’s main web screen after logging in. To publish our airport dataset, we must register our PostGIS

[...]
Posted by Paul Ramsey in Crunchy Data on 2021-05-05 at 20:43
Spatial indexes are used in PostGIS to quickly search for objects in space. Practically, this means very quickly answering questions of the form:
  • "all the things inside this this" or
  • "all the things near this other thing"

Because spatial objects are often quite large and complex (for example, coastlines commonly are defined with thousands of points), spatial indexes use "bounding boxes" as index and search keys:

  • Bounding boxes are of a small, fixed size, only 4 floats for a 2D box; and,
  • Bounding boxes are very inexpensive to compare to test things like containment.

So the simple story of spatial indexes is: if you are planning to do spatial queries (which, if you are storing spatial objects, you probably are) you should create a spatial index for your table.

Posted by Bruce Momjian in EDB on 2021-05-05 at 15:00

I previously covered the use of jsonb_to_record(), which uses record types to cast multiple JSON values. Record types are effectively special SQL rows.

There is an even more complicated SQL type called SET OF RECORD. It is a set of records, or set of rows. While the RECORD type allows you to create a row with multiple columns, SET OF RECORD allows you to create multiple rows, each having multiple columns. If you do \dfS in psql, you will see many functions defined as accepting or returning RECORD values, and many returning SET OF RECORD values, particularly database system information functions, e.g., pg_get_keywords().

As an example, the unnest function converts an array to a list of rows, effectively a set of records:

Continue Reading »

Rocky Linux 8 RC is out. This is a great step for the final release. Rocky Linux is a "community enterprise operating system designed to be 100% bug-for-bug compatible with Red Hat Enterprise Linux". Thus, PostgreSQL packages built for RHEL will also work on Rocky Linux 8 (and already verified this!) Continue reading "How to install PostgreSQL 13, 12, 11, 10 and 9.6 to Rocky Linux 8"
Posted by Paul Ramsey in PostGIS on 2021-05-04 at 08:00

So, this happened:

Tweet about Indexes

Basically a GeoDjango user posted some workarounds to some poor performance in spatial queries, and the original query was truly awful and the workaround not a lot better, so I snarked, and the GeoDjango maintainer reacted in kind.

Sometimes a guy just wants to be a prick on the internet, you know? But still, I did raise the red flag of snarkiness, so it it seems right and proper to pay the fine.

I come to this not knowing entirely the contract GeoDjango has with its users in terms of “hiding the scary stuff”, but I will assume for these purposes that:

  • Data can be in any coordinate system.
  • Data can use geometry or geography column types.
  • The system has freedom to create indexes as necessary.

So the system has to cover up a lot of variability in inputs to hide the scary stuff.

We’ll assume a table name of the_table a geometry column name of geom and a geography column name of geog.

Searching Geography

This is the easiest, since geography queries conform to the kind of patterns new users expect: the coordinates are in lon/lat but the distances are provided/returned in metres.

Hopefully the column has been spatially indexed? You can check in the system tables.

SELECT * 
FROM pg_indexes 
WHERE tablename = 'the_table';

Yes, there are more exact ways to query the system tables for this information, I give the simple example for space reasons.

If it has not been indexed, you can make a geography index like this:

CREATE INDEX the_table_geog_x 
  ON the_table
  USING GIST (geog);

And then a “buffered” query, that finds all objects within a radius of an input geometry (any geometry, though only a point is shown here) looks like this.

SELECT *
FROM the_table
WHERE ST_DWithin(
    the_table.geog,
    ST_SetSRID(ST_MakePoint(%lon, %lat), 4326),
    %radius
    );

Note that there is no “buffering” going on here! A radius search is logically equivalent and does not pay the cost of building up buffers, which is an expensive

[...]
Posted by Paul Laurence in Crunchy Data on 2021-05-03 at 20:45

Whether you are starting a new development project, launching an application modernization effort, or engaging in digital transformation, chances are you are evaluating Kubernetes.  If you selected Kubernetes, chances are you will ultimately need a database

Kubernetes provides many benefits for running applications including efficiency, automation, or infrastructure abstraction. These features allow you to deploy highly availability databases and scale, making it easier to manage hardware for databases as they grow. 

More users are adopting databases on Kubernetes. The Cloud Native Computing Foundation (CNCF) provides great data on Kubernetes adoption.  The Cloud Native Survey 2020 that 55% of respondents are using stateful applications in containers in production.  Crunchy Data has many customers who successfully run Postgres on Kubernetes. 

So how do you start with databases on Kubernetes? 

Posted by Halil Ozan Akgul in CitusData on 2021-05-03 at 16:22

Citus is an extension to Postgres that lets you distribute your application’s workload across multiple nodes. Whether you are using Citus open source or using Citus as part of a managed Postgres service in the cloud, one of the first things you do when you start using Citus is to distribute your tables. While distributing your Postgres tables you need to decide on some properties such as distribution column, shard count, colocation. And even before you decide on your distribution column (sometimes called a distribution key, or a sharding key), when you create a Postgres table, your table is created with an access method.

Previously you had to decide on these table properties up front, and then you went with your decision. Or if you really wanted to change your decision, you needed to start over. The good news is that in Citus 10, we introduced 2 new user-defined functions (UDFs) to make it easier for you to make changes to your distributed Postgres tables.

Before Citus 9.5, if you wanted to change any of the properties of the distributed table, you would have to create a new table with the desired properties and move everything to this new table. But in Citus 9.5 we introduced a new function, undistribute_table. With the undistribute_table function you can convert your distributed Citus tables back to local Postgres tables and then distribute them again with the properties you wish. But undistributing and then distributing again is… 2 steps. In addition to the inconvenience of having to write 2 commands, undistributing and then distributing again has some more problems:

  1. Moving the data of a big table can take a long time, undistribution and distribution both require to move all the data of the table. So, you must move the data twice, which is much longer.
  2. Undistributing moves all the data of a table to the Citus coordinator node. If your coordinator node isn’t big enough, and coordinator nodes typically don’t have to be, you might not be able to fit the table into your coordinator node.

[...]
Posted by Bruce Momjian in EDB on 2021-05-03 at 16:15

I already covered JSONB type casting of single values. However, if you are extracting multiple values from JSONB that casting method can be cumbersome, and it also has limitations. Fortunately, there is another way — a way to retrieve multiple JSONB or JSON values and cast everything in one location.

Here is a simple table taken from my previous blog post:

CREATE TABLE test(x JSONB);
 
INSERT INTO test VALUES ('{"a": "xyz", "b": 5}');

Continue Reading »

Posted by Andreas 'ads' Scherbaum on 2021-05-03 at 14:00
PostgreSQL Person of the Week Interview with Dave Page: I’m Dave Page, also known as pgSnake on IRC, Twitter and so on. I grew up and continue to live near Oxford in the UK. I’ve been working on PostgreSQL for over 20 years, contributing (or hindering, depending on your point of view!) in many different areas of the project. I currently serve on the project’s Core and Sysadmin teams, and on the board of directors for PostgreSQL Europe and the PostgreSQL Community Association of Canada.

In PostgreSQL, every database connection is a server-side process. This makes PostgreSQL a robust multi-process rather than a multi-threaded solution. However, occasionally people want to terminate database connections. Maybe something has gone wrong, maybe some kind of query is taking too long, or maybe there is a maintenance window approaching.

In this blog you will learn how to terminate queries and database connections in PostgreSQL.

How to cancel PostgreSQL queries

In PostgreSQL there are two functions we need to take into consideration when talking about cancellation or termination:

  • pg_cancel_backend(pid): Terminate a query but keep the connection alive
  • pg_terminate_backend(pid): Terminate a query and kill the connection

pg_cancel_backend ist part of the standard PostgreSQL distribution and can be used quite easily:


test=# \x
Expanded display is on.
test=# \df+ pg_cancel_backend
List of functions
-[ RECORD 1 ]-------+---------------------------------------
Schema              | pg_catalog
Name                | pg_cancel_backend
Result data type    | boolean
Argument data types | integer
Type                | func
Volatility          | volatile
Parallel            | safe
Owner               | hs
Security            | invoker
Access privileges   |
Language            | internal
Source code         | pg_cancel_backend
Description         | cancel a server process' current query

As you can see, all that’s necessary is to pass the process ID (pid) to the function. The main question is therefore: How can I find the ID of a process to make sure that the right query is cancelled?

The solution to the problem is a system view: pg_stat_activity.


test=# \d pg_stat_activity
View "pg_catalog.pg_stat_activity"
Column            | Type                     | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
datid             | oid                      |           |          |
datname           | name           
[...]

If you have ever wanted OpenStreetMap data in Postgres/PostGIS, you are probably familiar with the osm2pgsql tool. Lately I have been writing about the osm2pgsql developments with the new Flex output and how it is enabling improved data quality. This post changes focus away from the flex output and examines the performance of the osm2pgsql load itself.

One challenge with osm2pgsql over the years has been generic recommendations have been difficult to make. The safest recommendation for nearly any combination of hardware and source data size was to use osm2pgsql --slim --drop to put most of the intermediate data into Postgres instead of relying directly on RAM, which it needed a lot of. This choice has offsetting costs of putting all that data into Postgres (only to be deleted) in terms of disk usage and I/O performance.

A few days ago, a pull request from Jochen Topf to create a new RAM middle caught my eye. The text that piqued my interest (emphasis mine):

When not using two-stage processing the memory requirements are much much smaller than with the old ram middle. Rule of thumb is, you'll need about 1GB plus 2.5 times the size of the PBF file as memory. This makes it possible to import even continent-sized data on reasonably-sized machines.

Wait... what?! Is this for real??

Posted by Bruce Momjian in EDB on 2021-04-30 at 14:30

I have already covered the use of clusters, databases, and schemas, but a recent email thread highlighted the complexity of restoring database attributes and the objects inside of databases. In summary, CREATE DATABASE copies the contents of databases, but not their attributes, except for database attributes that are tied to the database contents, like collation and encoding. The thread ends with application of a documentation patch that mentions that database-level permissions are also not copied. Thanks to Tom Lane for clarifying this issue.

Posted by Bo Peng in SRA OSS, Inc. on 2021-04-30 at 14:00

black and yellow computer mouse on white desk

Logging and debugging help to monitor and identify issues or problems occurring in your program. Sometimes we need to log debug information to figure out the problems during software development and testing. However, if debug is enabled,  a large number of debug messages are generated and it is hard to read. Proper logging and debugging configurations are important.

There are a number of ways to retrieve debug information from Pgpool-II. In this post, I will describe the various ways for logging and debugging Pgpool-II. 

Logging

Before Pgpool-II 4.1, some log processing tools (e.g. rsyslog) are required to store Pgpool-II logs and rotate them.  

For example, below are the relevant configuration parameters for logging to syslog.

log_destination = 'syslog'
syslog_facility = 'LOCAL1'
syslog_ident = 'pgpool' 

Since Pgpool-II 4.2, logging collector process has been implemented. The logging collector process collects log messages sent to stderr and redirects them into the specified log file. And the built-in log rotation functionality is available by setting the configuration parameter log_rotation_age and log_rotation_size.

Below are the relevant configuration parameters:

log_destination = 'stderr'
logging_collector = on
log_directory = '/tmp/pgpool_log'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = off
log_rotation_age = 1d
log_rotation_size = 10MB

Useful logging parameters for debugging

Debugging messages can be enabled by starting Pgpool-II with "-d" option or configuring log_min_messages. However, it generates a large number of debug massages.

Proper logging and debugging configurations are important. in this section I will describe some useful configuration parameters for troubleshooting.

log_line_prefix

log_line_prefix outputs a printf-style string at the beginning of each log line. %a in log_line_prefix parameter outputs the application name. 

Since Pgpool-II 4.2, %a allows Pgpool-II i

[...]