Posted by Bruce Momjian in EDB on 2021-05-12 at 13:15

Hash aggregation has always confused me. I know what hash joins are, and have even written about them, and I know what aggregates are. So, what is hash aggregation? Is it a hash join that uses an aggregate function?

Well, no. Hash aggregates allow either the accumulation/aggregation of duplicate values or the removal of duplicates. There are four ways to perform this aggregation, as listed in AggStrategy:

  1. Plain: simple agg across all input rows
  2. Sorted: grouped agg, input must be sorted
  3. Hashed: grouped agg, use internal hashtable
  4. Mixed: grouped agg, hash and sort both used

The first option does not retain or "pass upward in the executor plan" any individual values. The second option processes the rows in order, allowing easy accumulation or removal of duplicate values. The last two use hash aggregation, that is, accumulation or removal of duplicates using hashing. Hash aggregation is used to perform GROUP BY, DISTINCT, and UNION (without ALL) processing.
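A quick way to see hash aggregation chosen by the planner is a grouped aggregate over unsorted input; this is only a sketch with an illustrative table name, and the exact plan depends on table statistics and work_mem:

-- the planner typically chooses HashAggregate for a GROUP BY on unsorted input
EXPLAIN SELECT customer_id, count(*)
FROM orders
GROUP BY customer_id;
--  HashAggregate
--    Group Key: customer_id
--    ->  Seq Scan on orders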

Continue Reading »

There are two types of replication available in PostgreSQL at the moment: Streaming replication & Logical replication. If you are looking to set up streaming replication for PostgreSQL 13, this is the page you have been looking for. This tutorial will show you how to configure PostgreSQL replication and how to set up your database servers quickly.

PostgreSQL replication: What we want to achieve

Before we get started with configuring PostgreSQL, it makes sense to take a look at what we want to achieve. The goal of this tutorial is to create a primary server replicating data to a secondary one, using asynchronous replication.

Here is what the desired setup will look like:

PostgreSQL streaming replication

The entire setup will be done using CentOS 8.3. The process on RHEL (Red Hat Enterprise Linux) is expected to be the same; simply follow the same procedures.

To show how the setup works, we are using two virtual machines with the following IPs:

  • Primary: 10.0.3.200 (node1)
  • Secondary: 10.0.3.201 (node2)

Let’s prepare these systems step-by-step.

Installing PostgreSQL

Once you have installed CentOS / RHEL, you can prepare the installation of PostgreSQL itself. The way to do that is to go to the PostgreSQL website and follow the instructions there. The following script shows how things work; you can simply copy / paste it:


sudo dnf install -y https://download.postgresql.org/pub/repos/yum/reporpms/EL-8-x86_64/pgdg-redhat-repo-latest.noarch.rpm
sudo dnf -qy module disable postgresql
sudo dnf install -y postgresql13-server

# can be skipped on the 2nd node
sudo /usr/pgsql-13/bin/postgresql-13-setup initdb
sudo systemctl enable postgresql-13

# can be skipped on the 2nd node
sudo systemctl start postgresql-13

Let’s check what we should now see on node1:


[root@node1 ~]# ps axf | grep post
5542 pts/1 S+ 0:00 \_ grep --color=auto post
5215 ? Ss 0:00 /usr/pgsql-13/bin/postmaster -D /var/lib/pgsql/13/data/
5217 ? Ss 0:00 \_ postgres: logger
5219 ? Ss 0:00 \_ postgres: checkpointer
5220 ? Ss 
[...]
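The excerpt ends with the primary running on node1. As a rough sketch of how the standby on node2 is then typically created (this is not part of the post's text; it assumes a replication role named repuser exists on node1 and is allowed in pg_hba.conf, and that the data directory on node2 is still empty):

# on node2: take a base backup from the primary and configure it as a standby
sudo -iu postgres /usr/pgsql-13/bin/pg_basebackup \
    -h 10.0.3.200 -U repuser -D /var/lib/pgsql/13/data \
    -R -X stream -C -S node2_slot -P
sudo systemctl start postgresql-13

The -R flag writes standby.signal and the primary_conninfo setting, and -C -S creates a replication slot on the primary.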
Posted by Jonathan Katz in Crunchy Data on 2021-05-11 at 10:05

A recent (well depending on when you read this) Twitter discussion mentioned the topic of creating the quintessential "read-only Postgres user" that can, well, only read info from a database, not write to it. A simple way to handle this case is to create a read-only Postgres replica, but that may not make sense based on your application.

So, how can you simply create a read-only Postgres user (and note that I will use "user" and "role" interchangeably)? Let's explore!
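For context, a minimal sketch of the usual GRANT-based approach (role, database, and schema names are illustrative, and this is not necessarily the approach from the full post):

CREATE ROLE readonly WITH LOGIN PASSWORD 'change_me';
GRANT CONNECT ON DATABASE mydb TO readonly;
GRANT USAGE ON SCHEMA public TO readonly;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly;
-- also cover tables created later in this schema
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO readonly;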

Posted by Bruce Momjian in EDB on 2021-05-11 at 01:30

Having written over 600 blog entries, I thought I would have already covered the complexities of the CLUSTER command, but it seems I have not, so let's do that now.

CLUSTER is an unusual SQL command because, like non-unique CREATE INDEX, it only affects performance. In fact, CLUSTER requires the existence of an index. So, what does CLUSTER do? Well, what does CREATE INDEX do? Let's look at how storage works in Postgres.

User data rows are stored in heap files in the file system, and those rows are stored in an indeterminate order. If the table is initially loaded in INSERT/COPY order, later inserts, updates, and deletes will cause rows to be added in unpredictable order in the heap files. CREATE INDEX creates a secondary file with entries pointing to heap rows, and index entries are ordered to match the values in the columns specified in the CREATE INDEX command. By quickly finding desired values in the index, index pointers can be followed to quickly find matching heap rows.
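As a minimal illustration of the command itself (table and index names are hypothetical):

-- order the heap file to match the index, and remember the index
CREATE INDEX orders_customer_idx ON orders (customer_id);
CLUSTER orders USING orders_customer_idx;
-- later re-clustering can reuse the remembered index
CLUSTER orders;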

Continue Reading »

Posted by Andreas 'ads' Scherbaum on 2021-05-10 at 13:30
PostgreSQL Person of the Week Interview with Laurenz Albe: I live in Vienna, Austria, the city of my birth. I am working for Cybertec as a PostgreSQL consultant, trainer and developer.
In part 1 of this blog mini-series we looked at how to setup PostgreSQL so that we can perform regression analysis on our data using TensorFlow from within the database server using the pl/python3 procedural language. In part 2 we looked at how to understand and pre-process the data prior to training a neural network. [Continue reading...]
Posted by Luca Ferrari on 2021-05-10 at 00:00

A possible system that differs from sudo.

A glance at doas & pg_ctl

doas(1) is a replacement for sudo(1), a program that allows you to execute commands as a different user. The main advantage of using sudo(1), and hence doas(1), is that you can gain different privileges without needing to know the target user's authentication tokens (e.g., a password).
I use sudo(1) on pretty much every machine I use, both Linux and FreeBSD.
In this post I glance at doas(1) and how it can be quickly configured to run PostgreSQL commands, mainly pg_ctl.

doas introduction

doas(1) is a program that was born in the OpenBSD ecosystem as a replacement for sudo(1) because, in short, the latter is too big and cannot be easily integrated into the base system.
doas is now available on FreeBSD and Linux too, so it is worth spending some time to learn how it works.
doas(1) is based on a configuration file, namely doas.conf (in FreeBSD /usr/local/etc/doas.conf), that has a syntax a lot clearer than that of sudo, at least in my opinion.

Rules are pretty simple:

  • every line in the configuration file is a rule, and rules are read from top to bottom;
  • a rule can be either permit or deny, allowing a user to run a command or not;
  • a command is prefixed by the special keyword cmd;
  • a target user, that is the user you want to run the command as, is prefixed by the keyword as;
  • the special keyword nopass means no password is asked for (same as the NOPASSWD option for sudo);
  • it is possible to keep the environment, change it, or set specific variables.


The usage of doas(1) is pretty much the same as that of sudo(1); mainly:

  • doas is the entry command;
  • -u specifies the user to run the command as;
  • the command is the remaining part of the command line.
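Putting the two together, a minimal sketch (user names, paths, and the data directory are illustrative, and the pg_ctl location depends on your platform and PostgreSQL version):

# /usr/local/etc/doas.conf
# let user luca run pg_ctl as the postgres user without a password,
# keeping the environment
permit nopass keepenv luca as postgres cmd /usr/local/bin/pg_ctl

# invocation
doas -u postgres /usr/local/bin/pg_ctl -D /postgres/13/data start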



doas has far fewer features (and thus less syntax clutter) than sudo, and is therefore a lot faster and easier to set up, and, in my opinion, a lot less prone to errors.

Using doas to control a PostgreSQL cluster

Assuming you want t

[...]

Getting a single random row, or a few rows, from a table (for example, in order to get representative data) is a frequent need. The most common way to do this in PostgreSQL is using ORDER BY random() like:

SELECT id FROM data ORDER BY random() LIMIT 1

But when run on a large table this can be very slow because it will have to scan the entire table to find the rows. Jonathan Katz mentioned a different way to do it on Twitter, which reminded me that people keep coming up with different (and sometimes very complicated) ways of trying to solve this problem.

And while Jonathan's method (he has the super simple sample code and results up on a gist) is still about twice as fast as ORDER BY random() on my test (with his data), it comes with some problems. For example, it requires a contiguous set of id values that have to be integers. And it still takes about a second to run on my machine with his sample of 5 million rows -- and will keep getting slower as the table grows.

And it turns out, if you don't need your row to be perfectly random, just mostly random, and can deal with some caveats, PostgreSQL has built-in functionality that does the job about 20,000 times faster than Jonathan's version and 40,000 times faster than ORDER BY random(). Enter TABLESAMPLE.
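A hedged sketch of what that looks like, reusing the table from the ORDER BY random() example above (SYSTEM sampling picks random pages, so it is fast but only "mostly random", and a very small percentage can occasionally return no rows):

-- sample roughly 0.1% of the table's pages, then take one row
SELECT id FROM data TABLESAMPLE SYSTEM (0.1) LIMIT 1;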

Posted by Bruce Momjian in EDB on 2021-05-07 at 14:15

Postgres has always run well on commodity hardware, and works well on Raspberry Pi and consumer devices too. However, we have always lagged in optimizing Postgres on very large hardware, mostly because our community has limited access to such hardware. In recent years, Postgres support companies have given the community access to large hardware, and run performance tests themselves, which has allowed the community to enhance Postgres for such hardware. However, since typical workloads do not use large hardware, there is always a lag in how quickly Postgres can adjust to ever-larger hardware platforms.

This email report from July of 2020 is about a server actively using 5k tables. When there are only a few hundred connections, basically 2–3x the number of CPUs, the system works fine, but the system experiences a steep performance decline as thousands of active connections try to issue queries. Whether the thousands of connections happen because of increased user demand or increased slowness due to the number of active connections is unclear. The email thread suggests various lock table and autovacuum improvements that might improve the situation, but there is no clear conclusion. One bright spot is that two weeks after this report, another report complained about the performance of autovacuum in a database with many tables, and supplied a patch which was applied in December and will be in Postgres 14.

Ideally we would have been able to test this patch against the workload reported in the July email to see if it fixes the problem. However, it is difficult to get the patch into production use, or get a test server of suitable size and produce a workload that matches the report. This is why it is much slower to improve the performance of large systems compared to simple workloads on commodity hardware — we make progress, but it is slower.

PostgreSQL has an extension to jsonpath: the ** operator, which explores arbitrary depth, finding your values everywhere. At the same time, there is a lax mode, defined by the standard, providing a "relaxed" way of working with json. In the lax mode, accessors automatically unwrap arrays; missing keys don't trigger errors; etc. In short, it appears that the ** operator and lax mode weren't designed to work together :)

The story started with a bug report. The simplified version is below. The jsonpath query is intended to select the value of key "x" everywhere. But it appears to select these values twice.

# SELECT * FROM jsonb_path_query('[{"x": "a", "y": [{"x": "b"}]}]'::jsonb,
                                 '$.**.x');
 jsonb_path_query
------------------
 "a"
 "a"
 "b"
 "b"
(4 rows)

This case looks like a bug. But is it? Let’s dig into details. Let’s split the jsonpath query into two parts: one containing the ** operator and another having the key accessor.

# SELECT var,
         jsonb_path_query_array(var, '$.x') key_x
  FROM jsonb_path_query('[{"x": "a", "y": [{"x": "b"}]}]'::jsonb,
                        '$.**') var;
               var                | key_x
----------------------------------+-------
 [{"x": "a", "y": [{"x": "b"}]}]  | ["a"]
 {"x": "a", "y": [{"x": "b"}]}    | ["a"]
 "a"                              | []
 [{"x": "b"}]                     | ["b"]
 {"x": "b"}                       | ["b"]
 "b"                              | []
(6 rows)

As you can see, the ** operator selects every child in the json document, as expected. The key accessor extracts corresponding values both from the objects themselves and from their wrapping arrays. And that's also expected in lax mode. So, it appears there is no bug; everything works as designed, although it's surprising for users.
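The unwrapping on its own is easy to reproduce; a minimal illustration (not taken from the original post):

-- in lax mode the key accessor is applied to the elements of an array
-- automatically, so .x "sees through" the wrapping array
SELECT jsonb_path_query('[{"x": "a"}]'::jsonb, 'lax $.x');
-- returns "a"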

Finally, I’ve committed a paragraph to the

[...]

What happens to tables that are not logged into WALs when physical replication is in place?

To WAL or not to WAL? When unlogged becomes logged…

Creating and populating a database to test

First of all, let’s create a clean database just to keep the test environment separated from other databases:



testdb=# CREATE DATABASE rep_test WITH OWNER luca;
CREATE DATABASE



Now let’s create and populate three tables (one temporary, one unlogged and one normal):



rep_test=> CREATE TABLE t_norm( pk int GENERATED ALWAYS AS IDENTITY,
                  t text,
                  primary key( pk ) );

rep_test=> CREATE UNLOGGED TABLE 
           t_unlogged( like t_norm including all );

rep_test=> CREATE TEMPORARY TABLE 
           t_temp( like t_norm including all );
           

rep_test=> INSERT INTO t_norm( t )
               SELECT 'Row #' || v
               FROM generate_series( 1, 1000000 ) v;
INSERT 0 1000000
Time: 4712.185 ms (00:04.712)

rep_test=> INSERT INTO t_temp( t )
            SELECT 'Row #' || v
            FROM generate_series( 1, 1000000 ) v;
INSERT 0 1000000
Time: 1789.473 ms (00:01.789)

rep_test=> INSERT INTO t_unlogged( t )
               SELECT 'Unlogged #' || v
               FROM generate_series( 1, 1000000 ) v;
INSERT 0 1000000
Time: 1746.729 ms (00:01.747)



The situation now is as follows:

Table      | Status          | Insertion time
-----------+-----------------+---------------
t_norm     | Ordinary table  | 4.7 secs
t_temp     | Temporary table | 1.8 secs
t_unlogged | Unlogged table  | 1.7 secs


As you can see, the timing for temporary and unlogged tables is pretty much the same, because neither is written to WAL records, and therefore no crash-recovery machinery is involved. This al

[...]

Analysing data within PostGIS is just one side of the coin. What about publishing our datasets as maps that various clients can consume and profit from? Let’s have a look at how this can be accomplished with Geoserver. Geoserver acts in this regard as a full-fledged open source mapping server, supporting several OGC compliant services, such as the OGC Web Map Service (WMS) or Web Feature Service (WFS). OGC is an abbreviation for Open GIS Consortium, which is a community working on improving access to spatial information by developing standards in this area.

Here’s a quick overview of this article’s structure:

  1. Geoserver setup
  2. Dataset and layer configuration
  3. Service configuration
  4. Service consumption

Geoserver setup

Let’s first start by setting up Geoserver as a single, non-clustered instance. For our demo, I recommend using one of the available Docker images to start up quickly. The steps listed below show how to pull our geoserver image (kartoza) first, and subsequently start the container with its default configuration within a tmux session, which allows us to easily monitor geoserver’s output.

docker pull kartoza/geoserver:2.18.2
tmux
docker run -it --name geoserver  -p 8080:8080 -p 8600:8080 kartoza/geoserver:2.18.2

If everything runs smoothly, Apache Tomcat reports back with

org.apache.catalina.startup.Catalina.start Server startup in XX ms

Dataset and layer configuration

In one of my last blog posts, I imported a csv containing locations of airports (https://ourairports.com/data/airports.csv). Let’s build on this and use these point geometries as the main vector layer for our visualization.

By using kartoza’s image defaults, Apache Tomcat will listen on port 8080, serving Geoserver’s web interface at http://XX.XX.XX.XX:8080/geoserver. So, let’s continue and log in with geoserver/admin.

Figure 1 Geoserver Web Administrator

Figure 1 represents Geoserver’s main web screen after logging in. To publish our airport dataset, we must register our PostGIS

[...]
Posted by Paul Ramsey in Crunchy Data on 2021-05-05 at 20:43
Spatial indexes are used in PostGIS to quickly search for objects in space. Practically, this means very quickly answering questions of the form:
  • "all the things inside this this" or
  • "all the things near this other thing"

Because spatial objects are often quite large and complex (for example, coastlines commonly are defined with thousands of points), spatial indexes use "bounding boxes" as index and search keys:

  • Bounding boxes are of a small, fixed size, only 4 floats for a 2D box; and,
  • Bounding boxes are very inexpensive to compare to test things like containment.

So the simple story of spatial indexes is: if you are planning to do spatial queries (which, if you are storing spatial objects, you probably are) you should create a spatial index for your table.
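Creating one is a single statement; a minimal sketch with hypothetical table and column names:

-- GiST is the usual index type for PostGIS geometry and geography columns
CREATE INDEX roads_geom_idx ON roads USING GIST (geom);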

Posted by Bruce Momjian in EDB on 2021-05-05 at 15:00

I previously covered the use of jsonb_to_record(), which uses record types to cast multiple JSON values. Record types are effectively special SQL rows.

There is an even more complicated SQL type called SETOF RECORD. It is a set of records, or set of rows. While the RECORD type allows you to create a row with multiple columns, SETOF RECORD allows you to create multiple rows, each having multiple columns. If you do \dfS in psql, you will see many functions defined as accepting or returning RECORD values, and many returning SETOF RECORD values, particularly database system information functions, e.g., pg_get_keywords().

As an example, the unnest function converts an array to a list of rows, effectively a set of records:
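A minimal illustration (not the example from the full post):

SELECT * FROM unnest(ARRAY['a', 'b', 'c']) AS t(letter);
--  letter
-- --------
--  a
--  b
--  c
-- (3 rows)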

Continue Reading »

Rocky Linux 8 RC is out. This is a great step towards the final release. Rocky Linux is a "community enterprise operating system designed to be 100% bug-for-bug compatible with Red Hat Enterprise Linux". Thus, PostgreSQL packages built for RHEL will also work on Rocky Linux 8 (and we have already verified this!) Continue reading "How to install PostgreSQL 13, 12, 11, 10 and 9.6 to Rocky Linux 8"
Posted by Paul Ramsey in PostGIS on 2021-05-04 at 08:00

So, this happened:

Tweet about Indexes

Basically a GeoDjango user posted some workarounds to some poor performance in spatial queries, and the original query was truly awful and the workaround not a lot better, so I snarked, and the GeoDjango maintainer reacted in kind.

Sometimes a guy just wants to be a prick on the internet, you know? But still, I did raise the red flag of snarkiness, so it seems right and proper to pay the fine.

I come to this not knowing entirely the contract GeoDjango has with its users in terms of “hiding the scary stuff”, but I will assume for these purposes that:

  • Data can be in any coordinate system.
  • Data can use geometry or geography column types.
  • The system has freedom to create indexes as necessary.

So the system has to cover up a lot of variability in inputs to hide the scary stuff.

We’ll assume a table name of the_table, a geometry column name of geom, and a geography column name of geog.

Searching Geography

This is the easiest, since geography queries conform to the kind of patterns new users expect: the coordinates are in lon/lat but the distances are provided/returned in metres.

Hopefully the column has been spatially indexed? You can check in the system tables.

SELECT * 
FROM pg_indexes 
WHERE tablename = 'the_table';

Yes, there are more exact ways to query the system tables for this information; I give the simple example for space reasons.

If it has not been indexed, you can make a geography index like this:

CREATE INDEX the_table_geog_x 
  ON the_table
  USING GIST (geog);

And then a “buffered” query that finds all objects within a radius of an input geometry (any geometry, though only a point is shown here) looks like this:

SELECT *
FROM the_table
WHERE ST_DWithin(
    the_table.geog,
    ST_SetSRID(ST_MakePoint(%lon, %lat), 4326),
    %radius
    );

Note that there is no “buffering” going on here! A radius search is logically equivalent and does not pay the cost of building up buffers, which is an expensive

[...]
Posted by Paul Laurence in Crunchy Data on 2021-05-03 at 20:45

Whether you are starting a new development project, launching an application modernization effort, or engaging in digital transformation, chances are you are evaluating Kubernetes. If you selected Kubernetes, chances are you will ultimately need a database.

Kubernetes provides many benefits for running applications, including efficiency, automation, and infrastructure abstraction. These features allow you to deploy highly available databases and scale them, making it easier to manage hardware for databases as they grow.

More users are adopting databases on Kubernetes. The Cloud Native Computing Foundation (CNCF) provides great data on Kubernetes adoption. The Cloud Native Survey 2020 found that 55% of respondents are running stateful applications in containers in production. Crunchy Data has many customers who successfully run Postgres on Kubernetes.

So how do you start with databases on Kubernetes? 

Posted by Halil Ozan Akgul in CitusData on 2021-05-03 at 16:22

Citus is an extension to Postgres that lets you distribute your application’s workload across multiple nodes. Whether you are using Citus open source or using Citus as part of a managed Postgres service in the cloud, one of the first things you do when you start using Citus is to distribute your tables. While distributing your Postgres tables you need to decide on some properties such as distribution column, shard count, colocation. And even before you decide on your distribution column (sometimes called a distribution key, or a sharding key), when you create a Postgres table, your table is created with an access method.

Previously you had to decide on these table properties up front, and then you went with your decision. Or if you really wanted to change your decision, you needed to start over. The good news is that in Citus 10, we introduced 2 new user-defined functions (UDFs) to make it easier for you to make changes to your distributed Postgres tables.

Before Citus 9.5, if you wanted to change any of the properties of the distributed table, you would have to create a new table with the desired properties and move everything to this new table. But in Citus 9.5 we introduced a new function, undistribute_table. With the undistribute_table function you can convert your distributed Citus tables back to local Postgres tables and then distribute them again with the properties you wish. But undistributing and then distributing again is… 2 steps. In addition to the inconvenience of having to write 2 commands, undistributing and then distributing again has some more problems:

  1. Moving the data of a big table can take a long time; undistribution and distribution both require moving all the data of the table. So, you must move the data twice, which takes much longer.
  2. Undistributing moves all the data of a table to the Citus coordinator node. If your coordinator node isn’t big enough, and coordinator nodes typically don’t have to be, you might not be able to fit the table into your coordinator node.
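For reference, the pre-Citus 10 two-step workflow described above looks like this (table and column names are illustrative):

-- convert the distributed table back to a local Postgres table,
-- then distribute it again with the new properties
SELECT undistribute_table('events');
SELECT create_distributed_table('events', 'device_id');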

[...]
Posted by Bruce Momjian in EDB on 2021-05-03 at 16:15

I already covered JSONB type casting of single values. However, if you are extracting multiple values from JSONB that casting method can be cumbersome, and it also has limitations. Fortunately, there is another way — a way to retrieve multiple JSONB or JSON values and cast everything in one location.

Here is a simple table taken from my previous blog post:

CREATE TABLE test(x JSONB);
 
INSERT INTO test VALUES ('{"a": "xyz", "b": 5}');
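A minimal sketch of where this is going (not the exact query from the full post): jsonb_to_record() lets you name and cast both values in one place.

SELECT rec.a, rec.b
FROM test, jsonb_to_record(test.x) AS rec(a TEXT, b INTEGER);
--   a  | b
-- -----+---
--  xyz | 5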

Continue Reading »

Posted by Andreas 'ads' Scherbaum on 2021-05-03 at 14:00
PostgreSQL Person of the Week Interview with Dave Page: I’m Dave Page, also known as pgSnake on IRC, Twitter and so on. I grew up and continue to live near Oxford in the UK. I’ve been working on PostgreSQL for over 20 years, contributing (or hindering, depending on your point of view!) in many different areas of the project. I currently serve on the project’s Core and Sysadmin teams, and on the board of directors for PostgreSQL Europe and the PostgreSQL Community Association of Canada.

In PostgreSQL, every database connection is a server-side process. This makes PostgreSQL a robust multi-process rather than a multi-threaded solution. However, occasionally people want to terminate database connections. Maybe something has gone wrong, maybe some kind of query is taking too long, or maybe there is a maintenance window approaching.

In this blog you will learn how to terminate queries and database connections in PostgreSQL.

How to cancel PostgreSQL queries

In PostgreSQL there are two functions we need to take into consideration when talking about cancellation or termination:

  • pg_cancel_backend(pid): Cancel the currently running query but keep the connection alive
  • pg_terminate_backend(pid): Terminate the query and kill the connection

pg_cancel_backend is part of the standard PostgreSQL distribution and can be used quite easily:


test=# \x
Expanded display is on.
test=# \df+ pg_cancel_backend
List of functions
-[ RECORD 1 ]-------+---------------------------------------
Schema              | pg_catalog
Name                | pg_cancel_backend
Result data type    | boolean
Argument data types | integer
Type                | func
Volatility          | volatile
Parallel            | safe
Owner               | hs
Security            | invoker
Access privileges   |
Language            | internal
Source code         | pg_cancel_backend
Description         | cancel a server process' current query

As you can see, all that’s necessary is to pass the process ID (pid) to the function. The main question is therefore: How can I find the ID of a process to make sure that the right query is cancelled?

The solution to the problem is a system view: pg_stat_activity.


test=# \d pg_stat_activity
View "pg_catalog.pg_stat_activity"
Column            | Type                     | Collation | Nullable | Default
------------------+--------------------------+-----------+----------+---------
datid             | oid                      |           |          |
datname           | name           
[...]
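Putting pg_stat_activity and pg_cancel_backend together, a hedged sketch (the query filter below is hypothetical; adjust it to match the statement you want to cancel):

-- cancel every active query matching a pattern, but never our own backend
SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND query ILIKE '%reporting_query%'
  AND pid <> pg_backend_pid();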

If you have ever wanted OpenStreetMap data in Postgres/PostGIS, you are probably familiar with the osm2pgsql tool. Lately I have been writing about the osm2pgsql developments with the new Flex output and how it is enabling improved data quality. This post changes focus away from the flex output and examines the performance of the osm2pgsql load itself.

One challenge with osm2pgsql over the years has been that generic recommendations are difficult to make. The safest recommendation for nearly any combination of hardware and source data size was to use osm2pgsql --slim --drop to put most of the intermediate data into Postgres instead of relying directly on RAM, of which it needed a lot. This choice has the offsetting cost of putting all that data into Postgres (only to be deleted) in terms of disk usage and I/O performance.

A few days ago, a pull request from Jochen Topf to create a new RAM middle caught my eye. The text that piqued my interest (emphasis mine):

When not using two-stage processing the memory requirements are much much smaller than with the old ram middle. Rule of thumb is, you'll need about 1GB plus 2.5 times the size of the PBF file as memory. This makes it possible to import even continent-sized data on reasonably-sized machines.

Wait... what?! Is this for real??

Posted by Bruce Momjian in EDB on 2021-04-30 at 14:30

I have already covered the use of clusters, databases, and schemas, but a recent email thread highlighted the complexity of restoring database attributes and the objects inside of databases. In summary, CREATE DATABASE copies the contents of databases, but not their attributes, except for database attributes that are tied to the database contents, like collation and encoding. The thread ends with application of a documentation patch that mentions that database-level permissions are also not copied. Thanks to Tom Lane for clarifying this issue.
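In other words, cloning a database copies what is inside it, not the GRANTs or settings attached to the database object itself; a minimal illustration (database names are hypothetical):

-- newdb gets olddb's contents, collation and encoding,
-- but not, e.g., GRANT CONNECT privileges made on olddb itself
CREATE DATABASE newdb TEMPLATE olddb;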

Posted by Bo Peng in SRA OSS, Inc. on 2021-04-30 at 14:00


Logging and debugging help to monitor and identify issues or problems occurring in your program. Sometimes we need to log debug information to figure out problems during software development and testing. However, if debug is enabled, a large number of debug messages is generated and they are hard to read. Proper logging and debugging configuration is important.

There are a number of ways to retrieve debug information from Pgpool-II. In this post, I will describe the various ways for logging and debugging Pgpool-II. 

Logging

Before Pgpool-II 4.1, some log processing tools (e.g. rsyslog) are required to store Pgpool-II logs and rotate them.  

For example, below are the relevant configuration parameters for logging to syslog.

log_destination = 'syslog'
syslog_facility = 'LOCAL1'
syslog_ident = 'pgpool' 

Since Pgpool-II 4.2, a logging collector process has been implemented. The logging collector process collects log messages sent to stderr and redirects them into the specified log file. Built-in log rotation functionality is available by setting the configuration parameters log_rotation_age and log_rotation_size.

Below are the relevant configuration parameters:

log_destination = 'stderr'
logging_collector = on
log_directory = '/tmp/pgpool_log'
log_filename = 'pgpool-%Y-%m-%d_%H%M%S.log'
log_file_mode = 0600
log_truncate_on_rotation = off
log_rotation_age = 1d
log_rotation_size = 10MB

Useful logging parameters for debugging

Debugging messages can be enabled by starting Pgpool-II with the "-d" option or by configuring log_min_messages. However, this generates a large number of debug messages.

In this section, I will describe some useful configuration parameters for troubleshooting.

log_line_prefix

log_line_prefix outputs a printf-style string at the beginning of each log line. %a in the log_line_prefix parameter outputs the application name.
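For example, a possible setting (this exact value is illustrative, not taken from the post) that prints a timestamp, the process id and the application name:

log_line_prefix = '%t: pid %p: %a: '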

Since Pgpool-II 4.2, %a allows Pgpool-II i

[...]

The world changes. The ARM architecture is breaking into new areas of computing. Only a decade ago, only your mobile, router, or other specialized device could be ARM-based, while your desktop and server were typically x86-based. Nowadays, your new MacBook is ARM-based, and your EC2 instance could be ARM as well.

In mid-2020, Amazon made graviton2 instances publicly available. The maximum number of CPU cores there is 64. This number is where it becomes interesting to check PostgreSQL scalability. It’s exciting to check because ARM implements atomic operations using a pair of load/store instructions. So, in a sense, ARM is just like Power, where I’ve previously seen a significant effect of platform-specific atomics optimizations.

But on the other hand, ARM 8.1 defines a set of LSE instructions, which, in particular, provide a way to implement an atomic operation in a single instruction (just like x86). What would be better: a special optimization, which puts custom logic between load and store instructions, or just a simple loop of LSE CAS instructions? I’ve tried them both.

You can see the results of read-only and read-write pgbench on the graphs below (details on experiments are here). pg14-devel-lwlock-ldrex-strex is the patched PostgreSQL with special load/store optimization for lwlock, pg14-devel-lse is PostgreSQL compiled with LSE support enabled.

You can see that the load/store optimization gives a substantial positive effect, but LSE rocks here!

So, if you’re running PostgreSQL on a graviton2 instance, make sure your binaries are compiled with LSE support (see the instructions), because the effect is dramatic.
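One possible way to do that when building from source is to target an ISA level that includes LSE; this is an assumption about your toolchain rather than a quote from the post, so adjust it to your compiler and packaging:

./configure CFLAGS="-O2 -march=armv8-a+lse"
make && make install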

BTW, it appears that none of these optimizations have a noticeable effect on the performance of Apple M1. Probably, M1 has a smart enough inner optimizer to recognize these different implementations to be equivalent. And it was surprising that LSE usage might give a small negative effect on Kunpeng 920. It was discouraging for me to know an ARM processor, where single instruction operation is s

[...]

Introduction

A materialized view is a view that stores the result of a SELECT query in the database, which enables you to get the query results quickly compared to executing the query every time. On the other hand, when a table underlying the view is updated, the data in the materialized view gets out of date, so you need to refresh the materialized view.

PostgreSQL provides the REFRESH MATERIALIZED VIEW command as a way to update materialized views. This command refreshes the materialized view by re-running the query.

However, this is not very efficient because it recalculates all the data. If only a small part of the table is updated, updating only the affected part of the materialized view should be more efficient than a full recalculation.

This method is called Incremental View Maintenance (IVM). However, this feature is not implemented in current PostgreSQL, so we are developing it and proposing it to the development community.

In this article, I will explain the IVM feature under development.

Note that the feature described in this article is under development and not available in current PostgreSQL. If you have any comments or requests for this feature, please give us your feedback!

Features

Overview

The proposed IVM feature allows materialized views to be updated automatically and incrementally just after an underlying table is modified. Because this is an incremental update, it is faster than running a regular REFRESH. And, of course, you don’t have to write trigger functions yourself.
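A hedged sketch of the proposed syntax from the patch under discussion (table and column names are illustrative):

CREATE INCREMENTAL MATERIALIZED VIEW sales_by_product AS
    SELECT product_id, count(*) AS cnt, sum(amount) AS total
    FROM sales
    GROUP BY product_id;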

Currently, the SQL queries that can be used to define views are SELECT ... FROM ... WHERE ..., joins (inner joins, outer joins, self-joins), some built-in aggregate functions (count, sum, avg, min, max), the GROUP BY clause, the DISTINCT clause, simple sub-queries in the FROM clause, simple CTEs (WITH clauses), and EXISTS sub-queries in the WHERE clause, where “simple” means that a sub-query or CTE does not include aggregates, outer joins, or DISTINCT. The view c

[...]
Posted by Luca Ferrari on 2021-04-30 at 00:00

pg_dump supports a few useful options to export data as a list of INSERTs

pg_dump and inserts

pg_dump(1) is the default tool for doing backups of a PostgreSQL database.
I often get questions about how to produce a more portable output of the database dump, with portable meaning truly “loadable into another PostgreSQL version or even a different database”.
In fact, pg_dump defaults to using COPY for bulk loading data:



% pg_dump -a  -t wa -U luca testdb 
...
COPY luca.wa (pk, t) FROM stdin;
9200673 Record #1
9200674 Record #2
9200675 Record #3
9200676 Record #4
9200677 Record #5
9200678 Record #6
9200679 Record #7
9200680 Record #8
9200681 Record #9
...



As you can guess, COPY is usable only in PostgreSQL and not in other databases. So, how do you handle a text dump that can be used with other databases?
No need to worry: pg_dump has a few features to handle such a need.
In particular, the following options can be useful:

  • --inserts removes the COPY and substitutes it with INSERT statements, one per tuple;
  • --column-inserts is similar to the previous one, but each INSERT has the list of named columns;
  • --rows-per-insert specifies the number of tuples a single INSERT statement can handle, useful for better bulk loading (but could be less portable).


There are also some other useful options:

  • --quote-all-identifiers forces the quoting of identifiers, which is useful when preparing data for a different database;
  • --use-set-session-authorization uses SQL standard commands when dealing with ownership of objects;
  • --no-comments, this is not a very “technical” aspect, but when you are going to load your dump into another database you probably do not want to import comments since they could be handled differently. Similarly, there are other --no options that are specific to PostgreSQL, like --no-publications to avoid replicating publications, and so on.
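For instance, re-running the earlier dump with --column-inserts produces one INSERT per tuple instead of COPY (output abridged and slightly approximated):

% pg_dump -a -t wa -U luca --column-inserts testdb
...
INSERT INTO luca.wa (pk, t) VALUES (9200673, 'Record #1');
INSERT INTO luca.wa (pk, t) VALUES (9200674, 'Record #2');
...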


In the following I will use the same example wa table with just two colum

[...]

In this post, we want to give you a brief introduction to TLS (Transport Layer Security), which is a technology widely used in combination with PostgreSQL to encrypt client / server connections.

What is TLS?

TLS is short for “Transport Layer Security“, which is a means of making sure that whatever data you are sending via TCP connections is secured against attackers.

To understand how TLS works, you first need to understand a few things about encryption.

Symmetric Encryption

symmetric key encryption

 

If you have a text “I like cheese” and replace every character with the next one in the English alphabet (replacing a with b, b with c, … and z with a), you have basic symmetric encryption. You can transfer the message “j mjlf difftf” to your friend, and they can go through the message and replace every character with the preceding one (replacing b with a, c with b, … and a with z), and they’ll know your favourite snack.

Mathematically, symmetric encryptions can be expressed like this:
if enc(x, k)=e is the function that encrypts x, using key k, into e, then the inverse function enc^-1(e, k)=x is used to decrypt e, using k, back into x.
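As a toy demonstration of that scheme in SQL (purely illustrative; real symmetric ciphers such as AES work very differently):

-- "encrypt" by shifting every letter forward by one
SELECT translate('i like cheese',
                 'abcdefghijklmnopqrstuvwxyz',
                 'bcdefghijklmnopqrstuvwxyza');   -- j mjlf difftf

-- "decrypt" by applying the inverse mapping
SELECT translate('j mjlf difftf',
                 'bcdefghijklmnopqrstuvwxyza',
                 'abcdefghijklmnopqrstuvwxyz');   -- i like cheese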

This is an extremely simplified example, and easy to crack. There are way more sophisticated ways of encrypting data symmetrically, which cannot be cracked unless you know the key. A well-known example of symmetric encryption is the Enigma machine, used during the second world war. While the enigma (III) encryption was crackable for the later part of World War Two, the more advanced enigma (IV) is still difficult to crack, even to this day.

There are two practicability issues with symmetric encryption. On one hand, you’ll need many different keys (one for each connection, essentially), so client B cannot decrypt whatever the server was sending to client A, for example. On the other hand, there is no simple way of securely communicating the key itself to new clients.

This is not an issue when you’re managing two servers that talk to each other, because they bo

[...]

Currently, the global PostgreSQL development team, led by Amit Kapila, continues to contribute to the community as part of Fujitsu's work on PostgreSQL.