In the famous fairy tale the hedgehog wins the race against the hare because he uses his brain to outwit the much faster hare: Brain beats muscle. But is that always the case? And what if we combine the two virtues?

The case at hand: screening large sets of molecules for chemical similarity.

Since (sub)graph isomorphism searching faces some mathematical challenges because of its non-polynomial complexity - even if you can use a specialized index, like pgchem::tigress does - fast similarity searching based on binary fingerprints has gained popularity in recent years.

I was tasked with evaluating a solution to the problem of similarity screening large sets of molecules with PostgreSQL where the fingerprints are generated externally, e.g. with the CDK.

This is what I came up with...

Preparing the Racetrack


CREATE TABLE cdk.externalfp (
id int4 NOT NULL,
smiles varchar NOT NULL,
pubchemfp varbit NULL,
"cardinality" int4 NULL,
CONSTRAINT externalfp_pk PRIMARY KEY (id)
);

Above is the table definition of the final table. The cardinality column will not be used now, but since it is calculated by the fingerprint generator anyway, keeping it will save some work later. If you want to copy my example code 1:1, please use a database named chemistry and a schema named cdk.

First we need to load some data into the table. I used the free NCISMA99 dataset from the National Cancer Institute, containing 249081 chemical structures in SMILES notation.

COPY cdk.externalfp (id, smiles) FROM '/tmp/NCISMA99' 
WITH (DELIMITER ' ', HEADER false, FORMAT csv);

And a few seconds later you should have 249081 rows in the table. Now we need to generate the fingerprints. The generator code is here; additionally, you need CDK 2.2 and a PostgreSQL JDBC driver. After changing the code to reflect your JDBC URL you are good to go.

Running the FingerprintGenerator should show no errors and takes about 30 minutes on my Core i5 Linux notebook. The fingerprint used is the PubChem fingerprint as described here.
Now we can put an index on the cardinality column[...]
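The excerpt is cut off here, but as a rough sketch of where this is heading (my own illustration, not the original post's code), the precomputed cardinality can prune candidates before an exact Tanimoto comparison: for a similarity threshold t, a hit's bit count b must satisfy t·a ≤ b ≤ a/t, where a is the query's bit count. The helper nbits_set below is hypothetical (PostgreSQL before version 14 has no built-in popcount for varbit), and id = 42 stands in for an arbitrary query molecule:

```sql
-- Sketch only: index the precomputed bit count for range pruning
CREATE INDEX externalfp_cardinality_idx ON cdk.externalfp ("cardinality");

-- Hypothetical popcount helper: count the '1' characters in the bit string
CREATE FUNCTION cdk.nbits_set(varbit) RETURNS int
    AS $$ SELECT length(replace($1::text, '0', '')) $$
    LANGUAGE sql IMMUTABLE;

-- Tanimoto T = c / (a + b - c), with c = popcount(A & B);
-- the BETWEEN clause lets the index discard most rows cheaply
SELECT e.id, e.smiles
FROM cdk.externalfp e,
     (SELECT pubchemfp AS qfp, "cardinality" AS qc
        FROM cdk.externalfp WHERE id = 42) q
WHERE e."cardinality" BETWEEN ceil(q.qc * 0.8) AND floor(q.qc / 0.8)
  AND cdk.nbits_set(e.pubchemfp & q.qfp)::float
      / (q.qc + e."cardinality" - cdk.nbits_set(e.pubchemfp & q.qfp)) >= 0.8;
```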
Yesterday I released the next version of plpgsql_check.

With this release, a developer can check for some well-known patterns of SQL injection vulnerabilities. The code of stored procedures in languages like PL/SQL, T-SQL or PL/pgSQL is safe from SQL injection unless dynamic SQL is used (the EXECUTE command in PL/pgSQL). Safe programming requires sanitization of all string variables, using the functions quote_literal, quote_ident or format. This check can be slow, so it has to be enabled explicitly by setting the security_warnings parameter:

CREATE OR REPLACE FUNCTION public.foo1(a text)
RETURNS text
LANGUAGE plpgsql
AS $function$
DECLARE result text;
BEGIN
-- secure
EXECUTE 'SELECT $1' INTO result USING a;
-- secure
EXECUTE 'SELECT ' || quote_literal(a) INTO result;
-- secure
EXECUTE format('SELECT %L', a) INTO result;
-- insecure
EXECUTE 'SELECT ''' || a || '''' INTO result;
-- insecure
EXECUTE format(e'SELECT \'%s\'', a) INTO result;
RETURN result;
END;
$function$

postgres=# select * from plpgsql_check_function('foo1');
┌────────────────────────┐
│ plpgsql_check_function │
╞════════════════════════╡
└────────────────────────┘
(0 rows)

postgres=# select * from plpgsql_check_function('foo1', security_warnings => true);
┌─────────────────────────────────────────────────────────────────────────────┐
│ plpgsql_check_function │
╞═════════════════════════════════════════════════════════════════════════════╡
│ security:00000:11:EXECUTE:text type variable is not sanitized │
│ Query: SELECT 'SELECT ''' || a || '''' │
│ -- ^ │
│ Detail: The EXECUTE expression is SQL injection vulnerable. │
│ Hint: Use quote_ident, quote_literal or format function to secure variable. │
│ security:00000:13:EXECUTE:text type variable is not sanitized │
│ Query: SELECT format(e'SELECT \'%s\'', a)
[...]

Data checksums

Added in PostgreSQL 9.3, data checksums can help to detect data corruption happening on the storage side.

Checksums are only enabled if the instance was set up using initdb --data-checksums (which isn’t the default behavior), or if activated afterwards with the new pg_checksums tool added in PostgreSQL 12.

When enabled, checksums are written each time a block is written to disk, and verified each time a block is read from disk (or from the operating system cache). If the checksum verification fails, an error is reported in the logs. If the block was read by a backend, the query will obviously fail, but if the block was read by a BASE_BACKUP operation (such as pg_basebackup), the command will continue its processing. While data checksums will only catch a subset of possible problems, they still have some value, especially if you don’t trust your storage reliability.

Up to PostgreSQL 11, any checksum validation error could only be found by looking into the logs, which clearly isn’t convenient if you want to monitor such errors.

New counters available in pg_stat_database

To make checksum errors easier to monitor, and help users to react as soon as such a problem occurs, PostgreSQL 12 adds new counters in the pg_stat_database view:

commit 6b9e875f7286d8535bff7955e5aa3602e188e436
Author: Magnus Hagander <magnus@hagander.net>
Date:   Sat Mar 9 10:45:17 2019 -0800

Track block level checksum failures in pg_stat_database

This adds a column that counts how many checksum failures have occurred
on files belonging to a specific database. Both checksum failures
during normal backend processing and those created when a base backup
detects a checksum failure are counted.

Author: Magnus Hagander
Reviewed by: Julien Rouhaud

 

commit 77bd49adba4711b4497e7e39a5ec3a9812cbd52a
Author: Magnus Hagander <magnus@hagander.net>
Date:   Fri Apr 12 14:04:50 2019 +0200

    Show shared object statistics in pg_stat_database

    This adds a row to the pg_stat_database view with datoid 0 and datname
    NULL fo
[...]
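Once on PostgreSQL 12, monitoring these counters becomes a plain SQL query; a minimal example (column names as introduced by the commits above):

```sql
-- The shared-objects row added by the second commit has datname NULL
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;
```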
On 3rd of April 2019, Alvaro Herrera committed patch: Log all statements from a sample of transactions   This is useful to obtain a view of the different transaction types in an application, regardless of the durations of the statements each runs.   Author: Adrien Nayrat Commit message makes it pretty clear, so let's see … Continue reading "Waiting for PostgreSQL 12 – Log all statements from a sample of transactions"
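The committed patch exposes this as a sampling setting; e.g. to log every statement of roughly 1% of transactions (setting name as added in PostgreSQL 12):

```
# postgresql.conf
log_transaction_sample_rate = 0.01
```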
On 2nd of April 2019, Alvaro Herrera committed patch: Report progress of CREATE INDEX operations     This uses the progress reporting infrastructure added by , adding support for CREATE INDEX and CREATE INDEX CONCURRENTLY.   There are two pieces to this: one is index-AM-agnostic, and the other is AM-specific. The latter is fairly elaborate … Continue reading "Waiting for PostgreSQL 12 – Report progress of CREATE INDEX operations"
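The progress reporting lands in a view that can be polled from another session while the index build runs; a minimal example (view and columns as added in PostgreSQL 12):

```sql
SELECT pid, phase, blocks_done, blocks_total
FROM pg_stat_progress_create_index;
```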
On 30th of March 2019, Peter Eisentraut committed patch: Generated columns   This is an SQL-standard feature that allows creating columns that are computed from expressions rather than assigned, similar to a view or materialized view but on a column basis.   This implements one kind of generated column: stored (computed on write). Another kind, … Continue reading "Waiting for PostgreSQL 12 – Generated columns"
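A minimal example of the stored kind of generated column (PostgreSQL 12 syntax):

```sql
CREATE TABLE people (
    height_cm numeric,
    height_in numeric GENERATED ALWAYS AS (height_cm / 2.54) STORED
);
-- height_in is computed on every write and cannot be assigned directly
```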
Posted by Ibrar Ahmed in Percona on 2019-04-17 at 10:17

PostgreSQL has a unique way to review the code submitted by developers. Most open source software uses GitHub pull requests to accommodate users’ code. PostgreSQL has a GitHub page but doesn’t manage pull requests there. Many years ago, PostgreSQL introduced the CommitFest to manage its patches: a three to four month “festival” during which patches are accepted. The CommitFest is set up to keep track of patches’ statuses; in practice, it is mainly an admin page used to manage patches. Volunteers and community committers review the submitted code, after which committers are able to commit those patches that have been approved. There have been almost 22 CommitFest events since 2014.

The process

When a patch is submitted, a volunteer can choose to review it. A review can be done by more than one person. At any point, a patch will be in one of the following statuses:

  • Committed
  • Moved to next CF (CommitFest)
  • Needs Review
  • Ready for Committer
  • Rejected
  • Returned with Feedback
  • Waiting on Author
  • Withdrawn

Needs review indicates that the patch is waiting for a reviewer to analyze it. Once a reviewer has confirmed that the patch in question is good to be committed, a community committer will pick up the patch, review it and commit it if all is well, otherwise, they reject it or return it with feedback.

Patch categories

Patches are categorized in various ways including:

  • Bug Fixes
  • Clients
  • Code Comments
  • Documentation
  • Miscellaneous
  • Monitoring & Control
  • Performance
  • Procedural Languages
  • Refactoring
  • Replication & Recovery
  • SQL Commands
  • Server Features
  • System Administration

A complete history of a patch, along with the discussion on email, is maintained with each revision, which makes it very simple to track the complete process from submission of a patch to its eventual conclusion.

Here are some graphs that show various stats for the past CommitFests.

Total new patches in PostgreSQL commitfest by date

 

Total patches in PostgreSQL Commitfests including not new

 

Total patches reviewed during Commitfest versus those committed to PostgreSQL

Now it’s time to end the 22nd CommitFest, the first such festival for 2019. This event saw 207 patches, almost 52% of which were committed to the master b

[...]

Have you ever heard about cursors in PostgreSQL or in SQL in general? If not, you should definitely read this article in depth and learn how to reduce memory consumption in PostgreSQL easily. Cursors have been around for many years and are, in my judgement, one of the most underappreciated features of all time. Therefore it makes sense to take a closer look at cursors and see what they can be used for.

The purpose of a cursor in PostgreSQL

Consider the following example:

test=# CREATE TABLE t_large (id int);
CREATE TABLE
test=# INSERT INTO t_large 
        SELECT * FROM generate_series(1, 10000000);
INSERT 0 10000000

I have created a table containing 10 million rows so that we can play with the data. Let us run a simple query now:

test=# SELECT * FROM t_large;
    id    
----------
        1
        2
        3
…

The first thing you will notice is that the query does not return immediately. There is a reason for that: PostgreSQL will send the data to the client, and the client returns only once ALL the data has been received. If you happen to select a couple thousand rows, life is good and everything will be just fine. However, what happens if you do a “SELECT * …” on a table containing 10 billion rows? Usually the client will die with an “out of memory” error and your application will simply crash. There is no way to keep such a large table in memory. Throwing ever more RAM at the problem is not feasible either (and pretty stupid too).

Using DECLARE CURSOR and FETCH

DECLARE CURSOR and FETCH can come to the rescue. What is the core idea? We can fetch data in small chunks and only prepare the data at the time it is fetched – not earlier. Here is how it works:

test=# BEGIN;
BEGIN
test=# DECLARE mycur CURSOR FOR 
        SELECT * FROM t_large WHERE id > 0;
DECLARE CURSOR
test=# FETCH NEXT FROM mycur;
 id 
----
  1
(1 row)

test=# FETCH 4 FROM mycur;
 id 
----
  2
  3
  4
  5
(4 rows)

test=# COMMIT;
COMMIT

The first important thing to notice is that a cursor can only be declared inside a transaction. However,

[...]
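As an aside (my own sketch, not from the post), cursors also support skipping rows and explicit cleanup, which is handy when paging through a large result:

```sql
BEGIN;
DECLARE mycur CURSOR FOR SELECT * FROM t_large;
MOVE FORWARD 100 IN mycur;  -- skip 100 rows without transferring them
FETCH 10 FROM mycur;        -- then fetch the next 10
CLOSE mycur;                -- release the cursor before commit
COMMIT;
```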
Posted by Robins Tharakan on 2019-04-17 at 02:59
The other day, I remembered an old 9.0-era mail thread (when Streaming Replication had just launched) where someone had tried to daisy-chain Postgres Replicas and see how many (s)he could muster.

If I recall correctly, the OP could squeeze only ~120 or so, mostly because the Laptop memory gave way (and not really because of an engine limitation).

I couldn't find that post, but it was intriguing to know whether we could reach (at least) the thousand mark and see what kind of "Replica Lag" that would entail; thus NReplicas.

On a (very) unscientific test, my 4-Core 16G machine can spin up 1000 Replicas in ~8m (and tear them down in another ~2m). Now I am sure this could get better, but I am not complaining, since this was a breeze to set up (in that it just worked without much tinkering ... besides lowering shared_buffers).

For those interested, a single UPDATE on the master, could (nearly consistently) be seen on the last Replica in less than half a second, with top showing 65% CPU idle (and 3.5 on the 1-min CPU metric) during a ~15 minute test.

So although (I hope) this isn't a real-world use-case, I am still impressed that without much tweaking, we're way under the 1 second mark, and that's right out of the box.

Am sure there's more to squeeze here, but still felt this was worthy of a small post nonetheless!
Posted by Luca Ferrari on 2019-04-17 at 00:00

A few months ago I worked to improve the great pgenv tool by theory. Today, I try to spread the word in the hope this tool can grow a little more!

An article about pgenv

tl;dr

I proposed a talk about pgenv, a Bash tool to manage several PostgreSQL instances on the same local machine. My talk was rejected, and I hate to waste what I had already prepared, so I decided to transform the talk into an article, which was quickly accepted for the Haikin9 Devops Issue!



I should have written about this a couple of months ago, but I did not have time to.
My hope is that pgenv gets more and more users, so that it can grow and become someday a widely used tool. Quite frankly, I don’t see this happening while being in Bash, for both portability and flexibility, and I suspect Perl is much more the language for a more flexible implementation. However, who knows? Gathering users is also a way to gather contributors and bring therefore new ideas to this small but very useful project.

In the meantime, if you have time and will, try testing the build from git patch, that allows you to build and manage a development version of our beloved database.

Posted by Stefan Fercot in Dalibo on 2019-04-17 at 00:00

As of 15 April 2019, there is only one repository RPM per distro, and it includes repository information for all available PostgreSQL releases.

This change, announced by Devrim on the pgsql-pkg-yum mailing list, has some impacts.


Announce

The announcement from Devrim may be found here.

  • Instead of having separate repo RPMs per PostgreSQL major version, we now have one single repo RPM that supports all supported PostgreSQL releases. The new packages obsolete the current ones.

  • The repo RPM version has been bumped to 42. Hopefully that will be the end of the “The repo RPM is 10-4, how can I find the 10-7 repo RPM, so that I can install PostgreSQL 10.7?” type of questions.

  • The “latest” suffix has been added to all repo RPMs.


Installation

Let’s see some impacts of those changes on CentOS 7.

As usual, go to https://www.postgresql.org/download/linux/redhat/ and choose the version (11), the platform (CentOS 7) and the architecture (x86_64) you want to install.

Today, you still get the link to the pgdg-centos11-11-2 rpm.

Let’s install it:

# yum install https://download.postgresql.org/pub/repos/yum/11/redhat/rhel-7-x86_64/pgdg-centos11-11-2.noarch.rpm
Loaded plugins: fastestmirror
pgdg-centos11-11-2.noarch.rpm
Examining /var/tmp/yum-root-5eSWGp/pgdg-centos11-11-2.noarch.rpm: pgdg-redhat-repo-42.0-4.noarch
Marking /var/tmp/yum-root-5eSWGp/pgdg-centos11-11-2.noarch.rpm to be installed

Resolving Dependencies
--> Running transaction check
---> Package pgdg-redhat-repo.noarch 0:42.0-4 will be installed
--> Finished Dependency Resolution

Dependencies Resolved
========================================================================================================
 Package                   Arch            Version            Repository                           Size
========================================================================================================
Installing:
 pgdg-redhat-repo          noarch          42.0-4             /pgdg-centos11-11-2.noarch          6.8 k

Transaction Summary
======================
[...]
pg_upgrade to upgrade postgresql

When you need to upgrade your PostgreSQL databases, there are a number of options available to you. In this post we’ll take a look at how you can upgrade PostgreSQL versions using pg_upgrade, a built-in tool that allows in-place upgrade of your software. Using pg_upgrade allows you, potentially, to minimize your downtime, an essential consideration for many organizations. It also allows you to perform a postgres upgrade with very minimal effort.

In our previous posts, we discussed various methods and tools that can help us perform a PostgreSQL upgrade – (1) pg_dumpall, (2) pg_dump and pg_restore with pg_dumpall, (3) logical replication and pglogical, and (4) slony. Methods 1 and 2 can involve additional downtime compared to the approaches taken in 3 and 4. Whilst performing an upgrade using logical replication or slony may be time consuming and require a lot of monitoring, it can be worth it if you can minimize downtime. If you have large databases that are busy with a lot of transactions, you may be better served using logical replication or slony.

This post is the fifth of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series. These posts lead up to a live webinar, where we’ll be exploring different methods available to upgrade your PostgreSQL databases. If it’s beyond the live webinar date when you read this, you’ll find the recording at that same link.

pg_upgrade

pg_upgrade (formerly pg_migrator – until PostgreSQL 8.4) is a built-in tool that helps in upgrading a legacy PostgreSQL server to a newer version without the need for a dump and restore. The oldest version from which you can upgrade your PostgreSQL using pg_upgrade is 8.4.x. It is capable of performing faster upgrades by taking into consideration that system tables are the ones that undergo the most change between two major versions. The internal data storage format is less often affected.
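A typical invocation looks roughly like this (a sketch; the paths are hypothetical, and --check performs a dry run before touching anything):

```
pg_upgrade --check \
    -b /usr/pgsql-9.6/bin  -B /usr/pgsql-11/bin \
    -d /var/lib/pgsql/9.6/data -D /var/lib/pgsql/11/data
# rerun without --check to upgrade; --link avoids copying data files
```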

In fact, in one of our tests we were able to perform an upgrade of a 2 TB database server from PostgreSQL 9.6.5 to 11.1 in less than

[...]

The problem

If your database has a large number of small tables, you likely have a lot of wasted space. To demonstrate this, let’s create a table with a single record:

create table foo (str text);
insert into foo values ('a');
VACUUM foo;

Now let’s find out the path of the file containing our data, relative to the data directory:

select pg_relation_filepath('foo');
 pg_relation_filepath
----------------------
 base/16384/16405
(1 row)

(For more information on PostgreSQL files and directories, see Craig Ringer’s article on the topic.)

Notice what happens when we drop to the system command line within the data directory and list the files, adding a wild card to the path above:

$ ls base/16384/16405*

base/16384/16405
base/16384/16405_fsm
base/16384/16405_vm

The file named 16405 is the heap of our table, but what about the others? These are auxiliary files, called “relation forks” that contain additional information to help PostgreSQL access and maintain the table.

  • 16405_fsm is the free space map. Its job is to know which pages in the table have space available for inserting records.
  • 16405_vm is the visibility map. Its job is to know which heap pages may need VACUUM-ing or freezing, and also which heap pages must be visited during index-only scans.

There are other files associated with this table, but to find them we have to use a query:

select
pg_relation_filepath(c.reltoastrelid) as toast_table_path,
pg_relation_filepath(i.indexrelid) as toast_index_path
from pg_class c
left outer join pg_index i on c.reltoastrelid=i.indrelid
where c.relname = 'foo';

 toast_table_path | toast_index_path
------------------+------------------
 base/16384/16408 | base/16384/16410
(1 row)

This gives us the path of the toast table for our table, as well as that of the toast table’s index. If we insert any records with large enough strings, they will be compressed and stored here.

How much disk space do we need to store our one record?

\x
select
pg_relation_size(c.oid, 'main') as heap_size,
pg_relation_size(c.oid, 'fsm') as fsm_size,
[...]

Native streaming replication in PostgreSQL works only between servers running the same major version. We discussed logical replication in our previous blog post. In that post, we saw how logical replication could help us set up migration between two different PostgreSQL versions. However, logical replication works only for the currently supported versions of PostgreSQL, for example between PostgreSQL 9.4 and PostgreSQL 11. So what about the legacy versions that are older than 9.4? Slony-I could help us meet this replication requirement.

Replication between different PostgreSQL versions with Slony-I is useful for migration from legacy database installations to the latest available version. So what is Slony and how does it work?

This post is the fourth of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series where we’ll be exploring different methods available to upgrade your PostgreSQL databases.

Slony

Slony is an application-level logical replication implementation for PostgreSQL. Rather, we could say that it is an external replication tool that requires a separate installation and configuration. Slony has been around for a long time. The latest version supports PostgreSQL versions from 8.4 and up to 11.

The main goal of replication is to ship changes from one database server to another. To better understand the architecture, you should know the terms Slon, Events and Slonik in Slony-I.

An aside: Slony means elephants in Russian, and elephants are indeed reputed to have a great memory. A slightly angry, but nevertheless pretty elephant, “Slonik”, looks at you from the PostgreSQL logo image.

Slon

Slon is a daemon that runs on each PostgreSQL node in Slony-I replication. These daemons are used for processing configuration and replication events for each PostgreSQL server. Each PostgreSQL server is called a “node”. All nodes together form a Slony “cluster”.

The “publisher node” is a source for replicated changes. While “subscriber” nodes receive and apply changes from

[...]
Posted by Michael Paquier on 2019-04-09 at 10:14

pg_checksums is a renaming of the tool called pg_verify_checksums which was introduced in Postgres 11. Version 12 introduces new options and possibilities which explain the renaming, as the tool has become much more multi-purpose.

First, it is now possible to enable and disable checksums for an offline cluster:

commit: ed308d78379008b2cebca30a986f97f992ee6122
author: Michael Paquier <michael@paquier.xyz>
date: Sat, 23 Mar 2019 08:12:55 +0900
Add options to enable and disable checksums in pg_checksums

An offline cluster can now work with more modes in pg_checksums:
- --enable enables checksums in a cluster, updating all blocks with a
correct checksum, and updating the control file at the end.
- --disable disables checksums in a cluster, updating only the control
file.
- --check is an extra option able to verify checksums for a cluster, and
the default used if no mode is specified.

When running --enable or --disable, the data folder gets fsync'd for
durability, and then it is followed by a control file update and flush
to keep the operation consistent should the tool be interrupted, killed
or the host unplugged.  If no mode is specified in the options, then
--check is used for compatibility with older versions of pg_checksums
(named pg_verify_checksums in v11 where it was introduced).

Author: Michael Banck, Michael Paquier
Reviewed-by: Fabien Coelho, Magnus Hagander, Sergei Kornilov
Discussion: https://postgr.es/m/20181221201616.GD4974@nighthawk.caipicrew.dd-dns.de

Here is how it works. The tool is able to do three modes now in total:

  • --check, the default if nothing is specified, and what pg_verify_checksums was already able to do. This mode scans all the relation file blocks, reporting any mismatch.
  • --enable, which enables data checksums. This rewrites all the relation file blocks, and finishes the operation by updating the control file. Note that this can take time depending on the size of the instance, and that the tool has no parallel mode.
  • --disable, which disables data checksums by only updat
[...]
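In practice the three modes map to simple invocations against a cleanly stopped cluster (a sketch; -D points at the data directory):

```
pg_checksums --check   -D "$PGDATA"   # default mode: verify only
pg_checksums --enable  -D "$PGDATA"   # rewrites every relation block
pg_checksums --disable -D "$PGDATA"   # updates only the control file
```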
Posted by Andrew Dunstan in 2ndQuadrant on 2019-04-08 at 22:38

If you’ve visited almost any web page on the PostgreSQL Build Farm server in the last few days you might have noticed that it is sporting a new RSS feed, of changes in status. This is similar to the information on the buildfarm-status-green-chgs mailing list, except that it has all status changes, not just to and from green. This new feature fulfills a long outstanding request.

PostgreSQL has been my livelihood since 2004 – so I am naturally biased in its favor. I think it is the coolest piece of software on the planet, and I am not alone.

DB-Engines

PostgreSQL – DBMS of the Year 2018 | PostgreSQL – DBMS of the Year 2017

See those 2 badges up there? That’s 2 years in a row. DB-Engines monitors a total of 343 databases and their trends. For each of 2017 and 2018, DB-Engines states that PostgreSQL gained more popularity in their rankings than any other database – hence the award.

This is no small feat. DB-Engines has a complex methodology of calculating popularity including search engine queries, technical discussions on popular forums, related job offers, and social media trends. Coming out on top 2 years in a row shows how much technologists love Postgres.

Stack Overflow

You Google for the solution to a problem you are facing. Chances are the first hit you get will be a link from Stack Overflow. Millions of registered users interact hundreds of thousands of times every day, answering questions, sharing knowledge, and making technology much easier to use for fellow enthusiasts.

Stack Overflow runs a survey each year asking the developer community about their favorite technologies. Want to guess what I am driving at?

Stackoverflow Most Loved Database PostgreSQL

That’s the result of over 100,000 technology enthusiasts from across the world coming together and loving PostgreSQL.

Hacker News

Hacker News – run by Y Combinator, one of the top US accelerators since 2005 – is another forum that brings together technology enthusiasts from all around the world.

Hacker News PostgreSQL Trends

Do I really need to label the blue line? ;-)

The graph above posts trends over the past 8 years and compares popularity between MySQL, SQL Server, MongoDB, and our rising star – PostgreSQL. And just look at how it is rising!

Postgres is definitely the coolest database out there, and this is only reason #1 :-)

Stay tuned for more!

Posted by Shaun M. Thomas in 2ndQuadrant on 2019-04-05 at 16:00

A while back, 2ndQuadrant notified a few of us that we should get more involved in Postgres Development in some capacity. Being as I’ve essentially fallen off the map in corresponding with the mailing lists in general, it would be a good way to get back into the habit.

But wait! Don’t we want more community involvement in general? Of course we do! So now is also a great opportunity to share my journey in the hopes others follow suit. Let’s go!

In the Beginning

So how does one start contributing? Do I have to know C? Must I poke around in the internals for weeks to form an understanding, however so tenuous, of the Black Magic that animates Postgres? Perhaps I should chant incantations to summon some dark entity to grant otherworldly powers necessary to comprehend the mind-bleedingly high-level conversations regularly churning within the deeply foreboding confines of the Hackers mailing list.

No.

God no.

If those were the prerequisites for getting involved, there would be approximately one person pushing Postgres forward, and his name is Tom Lane. Everyone else would be too terrified to even approach the process. I certainly count myself among those more timid individuals.

Instead, let’s start somewhere simple, and with something small. Sometimes all it takes to make a difference is to piggyback on the coattails of someone else who knows what they’re doing. If we’re too inexperienced to submit a patch, maybe we can review one instead.

Getting the Party Started (In Here)

Postgres development marches forward steadily in a form of punctuated equilibrium. Every few months, we throw a bunch of patches against the wall, and see what sticks. That cadence is currently marshaled by a master list of past and present commit fests going back to late 2014. It’s hardly an exhaustive resource dating back to antiquity, but it doesn’t need to be.

All we really need is the most recent iteration that’s in progress. At the time of this writing, that’s 2019-03.

Decisions, Decisions

Now for what might just be the most difficult portion

[...]
postgresql extensions

PostgreSQL is a powerful open source relational database management system. It extends the SQL language with additional features. A DBMS is not only defined by its performance and out-of-the-box features, but also by its ability to support bespoke/additional user-specific functionality. Some of these functionalities may be in the form of database constructs or modules, like stored procedures or functions, but their scope is generally limited to the functionality being exposed by the DBMS. For instance, how would you write a custom query-analyzing application that resides within your DBMS?

To support such options, PostgreSQL provides a pluggable architecture that allows you to install extensions. Extensions may consist of a configuration (control) file, a combination of SQL files, and dynamically loadable libraries.

This means you can write your own code as per the defined guidelines of an extension and plug it into a PostgreSQL instance without changing the actual PostgreSQL code tree. An extension, by definition, extends what PostgreSQL can do, but more than that, it gives you the ability to interact with external entities. These external entities can be other database management systems like ClickHouse, Mongo or HDFS (normally these are called foreign data wrappers), or other interpreters or compilers (thus allowing us to write database functions in another language like Java, Python, Perl or TCL, etc.). Another potential use case of an extension is code obfuscation, which allows you to protect your super secret code from prying eyes.

Build your own

To build your own extension, you don’t need the complete PostgreSQL code base. You can build and install an extension against an installed PostgreSQL (it may require you to install a devel RPM or Debian package). Details about extensions can be found in PostgreSQL’s official documentation[1]. There are many extensions available for different features in the contrib directory of the PostgreSQL source. Other than the contrib directory, people are also writing extensions readily

[...]
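To make the moving parts concrete, here is a hedged sketch of a minimal extension; the name pg_twice and both file contents are made up for illustration:

```
-- pg_twice.control (key = value lines, shown here as comments):
--   comment = 'toy extension'
--   default_version = '1.0'
--   relocatable = true

-- pg_twice--1.0.sql:
CREATE FUNCTION twice(i int) RETURNS int
    AS $$ SELECT i * 2 $$
    LANGUAGE SQL IMMUTABLE;
```

With a PGXS Makefile, make install copies both files into place, after which CREATE EXTENSION pg_twice; makes the function available in a database.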
Inspired by commit support for partial TOAST decompression
"When asked for a slice of a TOAST entry, decompress enough to return the slice instead of decompressing the entire object."

Nikita Glukhov and I ran a quick experiment to see how jsonb could benefit from this commit. The idea is simple: store short (more valuable) values before long ones. Currently, access time is independent of the key, but with support for partial decompression we can benefit for the front keys.

Since jsonb stores key/value pairs sorted by key, we generate values whose length depends on the key name:

{
  "key1": "aaaa", /* 4 ^ 1 = 4 */
  "key2": "aaaaaaaaaaaaaaaa", /* 4 ^ 2 = 16 */
  ...
  "key10": "aaa ... aaa" /* 4 ^ 10 = 1M */
}

create table t(jb jsonb);
insert into t select (
  select jsonb_object_agg('key' || i, repeat('a', pow(4, i)::int)) from generate_series(1,10) i
) from generate_series(1, 1000);



We applied partial decompression to the '->' operator and tested performance with this simple query:
select jb->'key1' from t;


The result is as expected – access time depends on the key:
key1-key5   key7    key8     key9     key10
  10 ms     48 ms   152 ms   548 ms   2037 ms

Access time for the non-optimized operator '->>' is the same for all keys, roughly 2000 ms.

So, this is what we can get for now. Ideally, access time for all keys would equal that of the first (fastest) key; currently we have the opposite.

I hope TOAST will be improved so that we can decompress any slice using a data-type-specific algorithm.

Needless to say, security is a topic that nobody in the wider IT industry can ignore nowadays, with a constant flow of reports on data breaches of various scales. Most such cases don’t result from direct attacks against databases, though, but rather from targeting application / API / webserver problems, as database servers are usually just not directly exposed to the Internet. And even if they were, PostgreSQL installations at least have a good starting point, as the default settings are sane enough to listen only for local connections, preventing the most embarrassing kinds of headlines. But of course PostgreSQL setups can also be compromised in many ways, so here’s a short listing of possible threats and some simple suggestions to mitigate them where possible.

Unnecessary PostgreSQL superuser access

This, my friends, is the #1 point of danger in my opinion, so an appropriate time to repeat the classic – ”with great power comes great responsibility”. Superuser accounts are of course needed for maintenance, so they cannot really be disabled… but mostly they are vastly overused, even for simple things like normal schema evolution, which can (mostly) be perfectly managed with standard “owner privileges”. But OK, about the threats – there’s the obvious one: pumping out or dropping all table data. The most important thing to remember, though, is that superusers can execute arbitrary commands on the OS level, under the PostgreSQL process owner’s privileges. The easiest way to do that is the relatively unknown PROGRAM variation of the COPY command, which is also nicely documented as such. Some other methods to achieve the same are quite hidden and sneaky:

  • Gaining access to the machine with COPY PROGRAM

This attack assumes that the server has been set up for passwordless communication / data transfer over SSH using default settings (which is very common). The trick itself is disturbingly easy – just create a table with a single text field and execute something like this:

krl@postgres=# create table ssh(dat
[...]
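For context, the mechanism itself is easy to demonstrate (a minimal sketch, runnable only as superuser; the table name is arbitrary): the documented COPY ... FROM PROGRAM form executes a shell command as the PostgreSQL OS user and reads its output.

```sql
-- Runs 'id' in a shell as the postgres OS user;
-- each line of the command's output becomes a row.
CREATE TEMP TABLE cmd_output(line text);
COPY cmd_output FROM PROGRAM 'id';
SELECT * FROM cmd_output;
```

Anything the postgres OS user can do on the host – including writing to its ~/.ssh directory – is reachable this way, which is exactly why superuser access must be guarded.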

Announcing Release 10 of the PostgreSQL Buildfarm client

Principal feature: support for non-standard repositories:

  • support multi-element branch names, such as “dev/featurename” or “bug/ticket_number/branchname”
  • provide a get_branches() method in SCM module
  • support regular expressions for branches of interest, matched against the list of available branches
  • prune branches when doing git fetch.

This feature and some server side changes will be explored in detail in my presentation at pgCon in Ottawa next month. The feature doesn’t affect owners of animals in our normal public Build Farm. However, the items below are of use to them.

Other features / behaviour changes:

  • support for testing cross version upgrade extended back to 9.2
  • support for core Postgres changes:
    • extended support for USE_MODULE_DB
    • new extra_float_digits regime
    • removal of user table oid support
    • removal of abstime and friends
    • changed log file locations
  • don’t search for valgrind messages unless valgrind is configured
  • make detection of when NO_TEMP_INSTALL is allowed more bulletproof

There are also various minor bug fixes and code improvements.

The release can be downloaded from https://github.com/PGBuildFarm/client-code/archive/REL_10.tar.gz or https://buildfarm.postgresql.org/downloads/latest-client.tgz


When you need to upgrade PostgreSQL, there are a variety of approaches you can use, but not all of them are suitable if you want to avoid application downtime. When avoiding downtime is essential, you might consider using replication as a means of upgrade; depending on your scenario, you can approach this task using either logical replication or physical (streaming) replication. Here we take a look at the difference between logical and physical replication in PostgreSQL. Then we explore in more detail how to accomplish an upgrade using logical replication and, by doing so, avoid application downtime. In a subsequent article, we’ll investigate physical replication.

We have already discussed a few methods available to perform PostgreSQL upgrades in our previous blog posts – PostgreSQL Upgrade Using pg_dumpall and PostgreSQL Upgrade Using pg_dump/pg_restore – as part of our Upgrading or Migrating Your Legacy PostgreSQL to Newer PostgreSQL Versions series. However, both of these methods involve application downtime.

Types of logical replication

Here we’ll cover two types of replication you could implement:

  1. Replication between PostgreSQL 10 and 11 versions using built-in logical replication.
  2. Replication from PostgreSQL 9.4 (or any version < PG 11) to PostgreSQL 11 using an extension named pglogical.

We might opt to implement replication as a method of upgrade to minimize downtime for the application. Once all the data to date has been replicated to another PostgreSQL server on the latest version, you can simply switch your application to the new server with minimal downtime… though of course this does depend on the complexity of your application stack.
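For the built-in (PostgreSQL 10+) variant, the core of such a setup can be sketched as follows (host, database, and role names are placeholders, and the schema must already exist on the subscriber):

```sql
-- On the old (publisher) server:
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new (subscriber) server, after restoring the schema
-- (e.g. with pg_dump --schema-only):
CREATE SUBSCRIPTION upgrade_sub
    CONNECTION 'host=old-server dbname=appdb user=repuser'
    PUBLICATION upgrade_pub;
```

Once the subscription has caught up with the publisher, the application can be pointed at the new server.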

Logical replication in PostgreSQL allows users to perform selective replication of a few tables and to open the standby for writes, whereas physical replication in PostgreSQL is block-level replication: each database on the master is replicated to the standby, and the standby is not open for writes. Going forw

[...]
Posted by Craig Kerstiens in CitusData on 2019-04-04 at 16:56

A few days ago a CVE was announced for Postgres. To say this CVE is a bit overblown is an understatement. The first thing to know is that you’re likely completely safe. If you run on a managed service provider you are not going to be affected by this, and if you’re managing your own Postgres database, chances are you are equally safe. This CVE received a note from Tom Lane on the pgsql-announce mailing list in response to the broad awareness and attention it was getting.

But we thought this might be a good time to talk about a few principles and concepts that underlie how Postgres works.

For quick context, the CVE states that the COPY program within Postgres, when run as superuser, will allow arbitrary code execution. The COPY command within Postgres is a great tool for bulk loading of data; it essentially loads data to/from disk.

Shifting to the Postgres superuser… Postgres is an extremely powerful database – really, it is more of a data platform. As superuser you can perform all types of powerful operations, including ones that touch the underlying system itself or that enable extensions which allow for low-level system access. These types of operations should always be treated with care and caution.

In Tom’s email to the mailing list he states:

We encourage all users of PostgreSQL to follow the best practice that is to never grant superuser access to remote or otherwise untrusted users. This is a standard security operating procedure that is followed in system administration and extends to database administration as well.

If you’re handing out superuser access to your database or connecting to your application with a superuser role, consider changing that immediately.

Superuser essentially has the full power of the OS user it is running as. Postgres does not allow running itself as the root user for this very reason. We actually commonly receive requests to grant superuser access from customers running on Citus Cloud. Despite it being commonly requested, it is not something we s

[...]
Posted by Laurenz Albe in Cybertec on 2019-04-03 at 08:00
count(*) in a children's rhyme
© Laurenz Albe 2019

 

It is a frequent complaint that count(*) is so slow on PostgreSQL.

In this article I want to explore the options you have to get your result as fast as possible.

Why is count(*) so slow?

Most people have no trouble understanding that the following is slow:

SELECT count(*)
FROM /* complicated query */;

After all, it is a complicated query, and PostgreSQL has to calculate the result before it knows how many rows it will contain.

But many people are appalled if the following is slow:

SELECT count(*) FROM large_table;

Yet if you think again, the above still holds true: PostgreSQL has to calculate the result set before it can count it. Since there is no “magical row count” stored in a table (like it is in MySQL’s MyISAM), the only way to count the rows is to go through them.

So count(*) will normally perform a sequential scan of the table, which can be quite expensive.

Is the “*” in count(*) the problem?

The “*” in SELECT * FROM ... is expanded to all columns. Consequently, many people think that using count(*) is inefficient and should be written count(id) or count(1) instead. But the “*” in count(*) is quite different, it just means “row” and is not expanded at all.

Writing count(1) is the same as count(*), but count(id) is something different: It will only count the rows where id IS NOT NULL, since most aggregates ignore NULL values.

So there is nothing to be gained by avoiding the “*”.
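The difference is easy to demonstrate with a throwaway table (a sketch; the table name is arbitrary):

```sql
-- count(*) and count(1) count rows; count(id) skips NULLs.
CREATE TEMP TABLE demo(id int);
INSERT INTO demo VALUES (1), (2), (NULL);
SELECT count(*), count(1), count(id) FROM demo;
-- count(*) = 3, count(1) = 3, count(id) = 2 (the NULL row is ignored)
```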

Using an index only scan

It is tempting to scan a small index rather than the whole table to count the number of rows.
However, this is not so simple in PostgreSQL because of its multi-version concurrency control strategy. Each row version (“tuple”) contains the information to which database snapshot it is visible. But this information is not (redundantly) stored in the indexes. So it usually isn’t enough to count the entries in an index, because PostgreSQL has to visit the table entry (“heap tuple”) to make sure an index entry is visible.

To mitigate this problem, PostgreSQL has introduced the visibility ma

[...]

Recently, references to a "new PostgreSQL vulnerability" have been circulating on social media (and maybe elsewhere). It has even got its own CVE entry. The origin appears to be a blog post from Trustwave.

So is this actually a vulnerability? (Hint: it's not) Let's see:

Crunchy Data has recently announced an update to the CIS PostgreSQL Benchmark by the Center for Internet Security, a nonprofit organization that provides publications on standards and best practices for securing technology systems. This newly published CIS PostgreSQL 10 Benchmark joins the existing CIS Benchmarks for PostgreSQL 9.5 and 9.6, while continuing to build upon Crunchy Data's efforts with the PostgreSQL Security Technical Implementation Guide (PostgreSQL STIG).

What is a CIS Benchmark?

As mentioned in an earlier blog post, a CIS Benchmark is a set of guidelines and best practices for securely configuring a target system. The benchmark contains a series of recommendations that help test the security of the system: some of the recommendations are "scored" (where a top score of 100 is the best), while others are provided to establish best practices for security.

Posted by Dave Conlin on 2019-04-02 at 13:45

Indexes are really important for Postgres performance, but they’re often misunderstood and misapplied. This post aims to give you a good grounding in indexes to avoid a lot of beginner mistakes.

Step one: understand what you want to achieve

Because indexes are such a powerful tool, a new index is often viewed as “the answer” to whatever performance problems people are experiencing. Wading straight in and creating an index for every sequential scan in sight is the simplest thing to do, but indexes have costs as well as benefits.

Not only do indexes take up space, they raise the cost of writing to the table in question. Any speed-up an index may provide for reads isn’t free — it’s offset by more work to keep the index up to date when the data in the table change. So an unused index isn’t just useless — it’s actively harmful to your database’s performance.

First, take the time to understand which bits of your query are running slowly (use the query plan), make a hypothesis as to why they’re slow, and then validate that hypothesis by attempting to speed them up.
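In practice that means reading the plan before reaching for `CREATE INDEX` (a sketch; the orders table and customer_id column are invented examples):

```sql
-- Look at the plan and the actual timings first:
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- Only if a slow Seq Scan on a selective predicate shows up, try an index
-- and compare the plan and timings again:
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
```

If the new plan doesn’t use the index, or the overall timing doesn’t improve, drop the index rather than paying its write cost for nothing.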

In order to understand when the answer might be an index, it’s important to understand the difference between sequential scans and index scans in Postgres.

Sequential scans

Sequential scans are the simplest, most obvious way of reading data from a table. Postgres jumps to the first block of memory (“page”) that holds rows for the table in question and reads in all the data, row by row, page by page, and passes it on.

Sequential scans can get a bit of a bad rap. One of the things we often hear from people when we ask them about their current performance analysis is “the first thing I do is look for sequential scans”.

It’s true that using an index on a table can make a big difference to query performance, but it’s also true that if your query just needs to get all of the data from a table in an unordered mass of rows, then things aren’t going to get any more efficient than just reading those rows in directly from consecutive pages.

Index scans

An index is jus

[...]
Today it is no longer necessary to argue that central logging makes sense or is even necessary. Most medium-sized companies now have a central logging system or are just introducing one. Once the infrastructure has been created, it must be used sensibly and efficiently! Especially as an...
Alexander Sosna