Posted by Stefan Fercot in Dalibo on 2019-07-19 at 00:00

pgBackRest is a well-known powerful backup and restore tool.

While the documentation describes all the parameters, it’s not always that simple to imagine what you can really do with it.

In this post, I will introduce some of the parameters needed to configure the access to an Amazon S3 bucket.


MinIO

For the purpose of this demo setup, we’ll install a MinIO server, which is an Amazon S3 Compatible Object Storage.

To do so, I followed the guide from centosblog.com.


Installation

On a fresh CentOS 7 server:

$ sudo useradd -s /sbin/nologin -d /opt/minio minio
$ sudo mkdir -p /opt/minio/bin
$ sudo mkdir -p /opt/minio/data
$ sudo yum install -y wget
$ sudo wget https://dl.min.io/server/minio/release/linux-amd64/minio -O /opt/minio/bin/minio
$ sudo chmod +x /opt/minio/bin/minio
$ cat<<EOF | sudo tee "/opt/minio/minio.conf"
MINIO_VOLUMES=/opt/minio/data
MINIO_DOMAIN=minio.local
MINIO_OPTS="--certs-dir /opt/minio/certs --address :443 --compat"
MINIO_ACCESS_KEY="accessKey"
MINIO_SECRET_KEY="superSECRETkey" 
EOF
$ sudo chown -R minio:minio /opt/minio

MinIO is installed in /opt/minio with a specific system user. The domain name we’ll use will be minio.local and the bucket name will be pgbackrest.

The data will be stored in /opt/minio/data.

We then have to set up the hosts file accordingly:

$ cat<<EOF | sudo tee "/etc/hosts"
127.0.0.1 pgbackrest.minio.local minio.local s3.eu-west-3.amazonaws.com
EOF

HTTPS

Since we’ll need to run MinIO in https mode to be able to work with pgBackRest, let’s create some self-signed certificates:

$ mkdir ~/certs
$ cd ~/certs
$ openssl genrsa -out ca.key 2048
$ openssl req -new -x509 -extensions v3_ca -key ca.key -out ca.crt -days 99999 -subj "/C=BE/ST=Country/L=City/O=Organization/CN=some-really-cool-name"
$ openssl genrsa -out server.key 2048
$ openssl req -new -key server.key -out server.csr -subj "/C=BE/ST=Country/L=City/O=Organization/CN=some-really-cool-name"
$ openssl x509 -req -in server.csr -CA ca.crt -CAkey ca.key -CAcreate
[...]
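For reference, pointing pgBackRest at an S3-compatible repository like this one is then a matter of a few repo1-s3-* options in pgbackrest.conf. A minimal, hedged sketch reusing the demo values above (option names follow pgBackRest 2.x and may differ in other versions):

[global]
repo1-type=s3
repo1-path=/repo1
repo1-s3-bucket=pgbackrest
repo1-s3-endpoint=s3.eu-west-3.amazonaws.com
repo1-s3-region=eu-west-3
repo1-s3-key=accessKey
repo1-s3-key-secret=superSECRETkey
# with the self-signed demo certificates, TLS verification also has to be dealt with (CA file or disabled)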

The Crunchy PostgreSQL Operator supports various forms of storage for provisioning PostgreSQL clusters in a Kubernetes environment. One such provider is Rook, which provides an abstraction layer around multiple storage systems available in Kubernetes, making it even more convenient to choose between storage engines. One storage engine that Rook supports is Ceph, which provides several types of distributed storage platforms, including block-level storage, which is very helpful for expanding cloud-based workloads.

This post explores the use of the Rook storage engine with the PostgreSQL Operator, specifically demonstrating how the PostgreSQL Operator can be utilized to create a PostgreSQL cluster that is backed by Rook Ceph block storage.

For this example, the rook-ceph-block storage class will be created and utilized in conjunction with the PostgreSQL Operator to dynamically provision Ceph block storage for use by a PostgreSQL cluster and its supporting services. This will effectively demonstrate how Rook can be utilized to deploy a Ceph cluster in your Kubernetes environment, allowing you to leverage the power of Ceph storage, e.g. highly-available and scalable block storage, in your PostgreSQL clusters.
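For orientation, a storage class with that name is declared roughly as follows. This is a hedged sketch in the flex-volume style used by Rook around version 1.0; the provisioner string and parameters vary by Rook version and driver, and the pool name is an assumption:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: ceph.rook.io/block        # CSI-based deployments use a different provisioner
parameters:
  blockPool: replicapool               # assumes a CephBlockPool named replicapool exists
  clusterNamespace: rook-ceph
  fstype: ext4
reclaimPolicy: Delete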

Many thanks to Andrew L'Ecuyer for helping with the methodology and testing that this post presents. For more information about PostgreSQL Operator storage configuration, please see the  documentation.

Standby in production: scaling application in the second largest classified site in the world

Hi. My name is Konstantin Evteev, I’m a DBA Unit Leader of Avito. Avito is the biggest Russian classified site, and the second largest classified site in the world (after Craigslist of USA). Items offered for sale on Avito can be brand new or used. The website also publishes job vacancies and CVs.

Via its web and mobile apps, the platform monthly serves more than 35 million users. They add approximately a million new ads a day and close over 100,000 transactions per day. The back office has accumulated more than a billion ads. According to Yandex, in some Russian cities (for example, in Moscow), Avito is considered a high load project in terms of page views. Some figures can give a better idea of the project’s scale:

  • 600+ servers;
  • 4.5 Gbit/sec TX, 2 Gbit/sec RX without static;
  • about a million queries per minute to the backend;
  • 270TB of images;
  • >20 TB in Postgres on 100 nodes:
  • 7–8K TPS on most nodes;
  • the largest — 20k TPS, 5 TB.

At the same time, these volumes of data need not only to be accumulated and stored but also processed, filtered, classified and made searchable. Therefore, expertise in data processing is critical for our business processes.

The picture below shows the growth dynamics of page views.

Our decision to store ads in PostgreSQL helps us to meet the following scaling challenges: the growth of data volume and the growth of the number of requests to it, the scaling and distribution of the load, the delivery of data to the DWH and the search subsystems, inter-database and inter-network data synchronization, etc. PostgreSQL is the core component of our architecture. Its rich set of features, legendary durability, built-in replication, archiving and backup tools all find a use in our infrastructure, and the professional community helps us to use all these features effectively.

In this report, I would like to share Avito’s experience in different cases of s

[...]

Nowadays, it's common to see a large amount of data in a company's database, but depending on its size, it could be hard to manage, and performance could suffer during high traffic if we don't configure or implement it in the correct way. In general, if we have a huge database and we want a low response time, we'll want to scale it. PostgreSQL is no exception to this. There are many approaches available to scale PostgreSQL, but first, let's learn what scaling is.

Scalability is the property of a system/database to handle a growing amount of demands by adding resources.

The reasons for this demand could be temporary, for example, if we're running a discount sale, or permanent, due to an increase in customers or employees. In any case, we should be able to add or remove resources to manage these changes in demand or increases in traffic.

In this blog, we’ll look at how we can scale our PostgreSQL database and when we need to do it.

Horizontal Scaling vs Vertical Scaling

There are two main ways to scale our database...

  • Horizontal Scaling (scale-out): It's performed by adding more database nodes, creating or enlarging a database cluster.
  • Vertical Scaling (scale-up): It’s performed by adding more hardware resources (CPU, Memory, Disk) to an existing database node.

For Horizontal Scaling, we can add more database nodes as slave nodes. It can help us improve read performance by balancing the traffic between the nodes. In this case, we'll need to add a load balancer to distribute traffic to the correct node depending on the policy and the node state.
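For example, a new replica for such a setup is commonly seeded with pg_basebackup (a minimal sketch; the host, user and data directory are placeholders, and -R writes the recovery settings for a PostgreSQL 11-style standby):

$ pg_basebackup -h primary.example.com -U replicator -D /var/lib/pgsql/11/data -R -P
$ pg_ctl -D /var/lib/pgsql/11/data start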

To avoid a single point of failure adding only one lo

[...]

Personally I'm a big fan of email, just like blogging. To me a good email thread can be like a good novel where you're following along, always curious about what comes next. And no, I don't mean the ones where there is an email to all-employees@company.com and someone replies all, only to receive reply-alls asking people not to reply-all. I mean ones like the thread started last week internally among the Azure Postgres team.

The first email was titled: Random Citus development and psql tips, and from there it piled on with more and more tips and power user suggestions for Postgres. Some of these tips are relevant if you're working directly on the Citus codebase, others are relevant to anyone that works with Postgres, and some are useful for debugging Postgres internals. While the thread is still ongoing, here are just a few of the great tips:

In psql, tag your queries and use Ctrl+R

Psql supports Ctrl+R to search previous queries you ran. For demos and when testing complex scenarios, I like adding a little comment to queries that then becomes the tag by which I can later find the query:

# SELECT count(*) FROM test; -- full count
┌───────┐
│ count │
├───────┤
│     0 │
└───────┘
(1 row)

Time: 127.124 ms
(reverse-i-search)`f': SELECT count(*) FROM test; -- full count

In most cases, 2-3 letters is going to be enough to find the query.

Better psql output

I find \x lacking, but pspg is great. It is available from PGDG via sudo yum install -y pspg or the equivalent on your system. I have the following .psqlrc which sets up pspg with a very minimalistic configuration:

$ cat > ~/.psqlrc
\timing on
\pset linestyle unicode 
\pset border 2
\setenv PAGER 'pspg --no-mouse -bX --no-commandbar --no-topbar'
\set HISTSIZE 100000

Get a stack trace for an error

In psql:

# SELECT pg_backend_pid();
┌────────────────┐
│ pg_backend_pid │
├────────────────┤
│         156796 │
└────────────────┘
(1 row)

In another shell:

$ gdb -p 156796 
(gdb) b errfinish
Breakpoint 1 at 0x83475b: file elog.c, line 251.
(gdb) c
Continuing.

Ba

[...]

The above-mentioned PostgreSQL server configuration parameter was introduced already some time ago, in version 9.6, but has been flying under the radar, so to say, and had not caught my attention previously. Until recently, when I was pasted (not being on Twitter) a tweet from one of the Postgres core developers, Andres Freund, that basically said – if your workload is bigger than Shared Buffers, you should enable the "backend_flush_after" parameter for improved throughput and also reduced jitter. Hmm, who wouldn't like an extra boost in performance for free? FOMO kicked in… but before adding this parameter to my "standard setup toolbox" I hurried to test things out – one's own eye is king! So here is a small test and my conclusion on the effects of enabling "backend_flush_after" (which is not enabled by default!).

What does this parameter actually do?

Trying to interpret the documentation (link here) in my own wording – “backend_flush_after” is basically designed to enable sending “hints” to the OS, that if user has written more than X bytes (configurable from 0 to max. 2MB) it would be very nice if the kernel could already do some flushing of recently changed data files in the background, so that when the “checkpointer” comes or the kernel’s “dirty” limit is reached, there would be less bulk “fsyncing” to do – meaning less IO contention (spikes) for our user sessions, thus smoother response times.
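Enabling it is just a matter of setting the GUC, for example (the value here is purely illustrative, not a recommendation):

ALTER SYSTEM SET backend_flush_after = '512kB';  -- 64 x 8kB pages; 0, the default, disables it
SELECT pg_reload_conf();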

Be warned though – unlike most Postgres settings, this one is actually not guaranteed to function, and currently it can only work on Linux systems that have sync_file_range() functionality available – which again depends on the kernel version and the file system used. So in short this explains why the parameter has not gotten too much attention. It is a similar story with the "sister" parameters – "bgwriter_flush_after", "checkpoint_flush_after", "wal_writer_flush_after"… with the difference that they are already enabled by default!

NB! Also note that this parameter, being controlled and initiated by Postgres, might be the only way to influence

[...]
BRIN Index for PostgreSQL

The BRIN index was introduced in PostgreSQL 9.5, but many users postponed using it in their design and development just because it was "new". But now we understand that it has stood the test of time! It is time to reconsider BRIN if you have not done so yet. I often see users who forget there is a provision to select the type of index by specifying the USING clause when creating an index.

A BRIN index is a revolutionary idea in indexing, first proposed by PostgreSQL contributor Alvaro Herrera. BRIN stands for "Block Range INdex". A block range is a group of pages adjacent to each other, and summary information about all those pages is stored in the index. For example, for datatypes with a linear sort order, like integers and dates, the summary can be just the minimum and maximum value in the range. Other database systems, including Oracle, announced similar features later. A BRIN index often gives gains similar to partitioning a table.
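As a quick syntax reminder (table and column names here are placeholders, not from the post), a BRIN index is chosen with the USING clause and its range size can be tuned with pages_per_range:

CREATE INDEX orders_created_brin ON orders USING brin (created_at) WITH (pages_per_range = 64);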

A BRIN index scan returns all the tuples in all the pages of the matching block ranges, so the index is lossy and extra work is needed to filter out non-matching records. While one might say that is not good, there are a few advantages.

  1. Since only summary information about a range of pages is stored, BRIN indexes are usually very small compared to B-Tree indexes. So if we want to squeeze the working set of data to shared_buffer, this is a great help.
  2. Lossiness of BRIN can be controlled by specifying pages per range (discussed in a later section)
  3. Offloads the summarization work to vacuum or autovacuum. So the overhead of index maintenance on transaction / DML operation is minimal.

Putting BRIN into a test

Let’s take a simple example to examine the benefits of BRIN index by creating a simple table.

postgres=# CREATE TABLE testtab (id int NOT NULL PRIMARY KEY,date TIMESTAMP NOT NULL, level INTEGER, msg TEXT);
CREATE TABLE

Now let’s Insert some data into this table.

postgres=# INSERT INTO testtab (id, date, level, msg) SELECT g, CURRENT_TIMESTAMP + ( g || 'minute' ) :: interval, random
[...]
Declarative partitioning got some attention in the PostgreSQL 12 release, with some very handy features. There has been some pretty dramatic improvement in partition selection (especially when selecting from a few partitions out of a large set), referential integrity improvements, and introspection. In this article, we’re going to tackle the referential integrity improvement first. This […]
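To illustrate the referential integrity improvement mentioned above: in PostgreSQL 12 a foreign key can reference a partitioned table, which earlier versions rejected. A minimal sketch with made-up table names:

CREATE TABLE customers (id bigint PRIMARY KEY) PARTITION BY HASH (id);
CREATE TABLE customers_p0 PARTITION OF customers FOR VALUES WITH (MODULUS 2, REMAINDER 0);
CREATE TABLE customers_p1 PARTITION OF customers FOR VALUES WITH (MODULUS 2, REMAINDER 1);
CREATE TABLE orders (id bigint, customer_id bigint REFERENCES customers (id));  -- allowed from v12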

In a lot of PostgreSQL environments, it’s common practice to protect user accounts with a password. Starting with PostgreSQL 10, the way PostgreSQL manages password-based authentication got a major upgrade with the introduction of SCRAM authentication, a well-defined standard that is a significant improvement over the current system in PostgreSQL. What’s better is that almost all PostgreSQL drivers now support this new method of password authentication, which should help drive further adoption of this method.

While it may be easy to take advantage of SCRAM authentication in new PostgreSQL deployments, there are a few steps involved in upgrading your existing systems to utilize this method. This article will briefly explain how SCRAM works in PostgreSQL (to try to encourage you to upgrade!) and then walk you through the steps of how to upgrade your existing PostgreSQL clusters to use SCRAM authentication.
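In outline, the upgrade usually boils down to three steps (a hedged sketch, not the article's exact procedure; the role name is a placeholder, and every existing password has to be re-entered so that a SCRAM verifier can be stored):

-- 1. hash newly set passwords with SCRAM
SET password_encryption = 'scram-sha-256';
-- 2. re-enter each role's password (in psql) so a SCRAM verifier is stored
\password app_user
-- 3. change "md5" to "scram-sha-256" in pg_hba.conf and reload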

A Very Brief Overview of SCRAM

Having recently witnessed quite an ingenious hack for implementing some good old "println"-style development debugging from stored procedures into a file, I was prompted to share with the Interwebs two other ways such a goal can be achieved more transparently – also, in one case, with the help of some other good old legacy technology. By the way, the main reason they went for the hack was that, being relatively new to databases, they didn't know how to name the thing they were implementing, so Google wasn't able to help – once again proof that naming things is one of the hardest problems of computing 🙂

What’s an “autonomous transaction”?

But to start with, let's explain what an "autonomous transaction" is, as most people working with databases probably haven't needed or even heard of them; they're not part of any standard either, and thus not widely supported by database engines. The phrase itself comes from the Oracle world, I believe, and it basically denotes "fire and forget" (sub-)transactions that are not connected to the main transaction. And sadly Postgres does not have direct built-in support for that either – you're always in a real transaction and need to improvise a bit if the need arises.

And what would be the common use case? Mostly some logging / auditing / progress tracking into tables, in such a way that the information on the attempt would persist even when the main transaction is rolled back due to an error for example. Remember – in a standard transaction everything is thrown away in case no special measures (savepoints or exception handling sub-blocks in stored procedures) are taken.
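One common workaround for exactly this need (not necessarily one of the approaches described further down) is to push the write through a separate connection with the dblink extension, so that it commits independently of the caller; the app_log table here is made up:

CREATE EXTENSION IF NOT EXISTS dblink;
-- the INSERT runs and commits in its own connection, so it survives a ROLLBACK of the caller
SELECT dblink_exec('dbname=' || current_database(),
                   $$INSERT INTO app_log(msg) VALUES ('step 1 done')$$);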

The hacky way

So what did the initial implementation that wowed me a bit look like?

CREATE FUNCTION public.log(appid  text, msg text)
 RETURNS void
 LANGUAGE plpgsql
 SECURITY DEFINER
AS $function$
BEGIN
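-- build and run: COPY (SELECT '<msg>') TO PROGRAM 'tee -a /tmp/<appid>.log';
-- the write to the log file is an external side effect, so it persists even if the caller rolls back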
execute $$ copy (select $$ || quote_literal(msg) || $$) to program 'tee -a /tmp/$$ ||appid|| $$.log' $$;
END;
$function$;

Not bad – it works and is relatively short and

[...]

PostgreSQL Conference Europe 2019 in Milan, Italy, on October 15-18 is now open for registrations.

The Tuesday training sessions have also been finalized, and are now available for registration. Six sessions are available, in a mix of full-day and half-day formats. Training sessions are available at an extra cost of €150 per half day. Attendees of training sessions will also receive a €90 discount on the regular conference fee. Seats in these sessions are limited to smaller groups, so make sure you register early!

Our call for papers is still open! If you have already submitted a talk, or are planning to submit one, we suggest you wait to register until you have received confirmation of whether or not the talk was accepted. If your talk is accepted, attendance is of course free!

See you in Milan!

Postgres 12 is around the corner, and as always it is packed with new features. One of them is the option to let the user control the behaviour of CTE materialization. You can find the commit from Tom Lane here, which explains everything about this new change, but the TLDR is that, so far, CTEs were always fully materialized, so restrictions from the query that uses them were not applied to the CTE. That is the right behaviour when you are using CTEs to INSERT/UPDATE/DELETE or when they are recursive, but when a CTE is side-effect-free and non-recursive it's safe to push down the restrictions from the outer query.
So, from Postgres 12, when it's safe and the CTE is referenced only once, Postgres will inline the CTE into the outer query, removing the optimization fence. Users will be able to override this behaviour by using the MATERIALIZED / NOT MATERIALIZED keywords.

Here's an example:

drop table if exists test_cte;
create table test_cte as
select generate_series(1,1000000) as id,
floor(random() * 10 + 1)::int as random;
-- OLD WAY
EXPLAIN analyze
WITH a_cte AS MATERIALIZED
(
SELECT random,count(*) FROM test_cte group by random
)
SELECT * FROM a_cte WHERE random = 5;
-- NEW
EXPLAIN analyze
WITH a_cte AS NOT MATERIALIZED
(
SELECT random,count(*) FROM test_cte group by random
)
SELECT * FROM a_cte WHERE random = 5;

It's not really important to show all the differences in the explain plans; the point is to see how many rows had to be processed to create the CTE when the condition is applied directly:

-- Materialized (old):
-> Parallel Seq Scan on test_cte (cost=0.00..8591.88 rows=416688 width=4) (actual time=0.031..19.077 rows=333333 loops=3)

-- Not Materialized (New):
-> Parallel Seq Scan on test_cte (cost=0.00..9633.59 rows=2083 width=4) (actual time=0.021..24.469 rows=33222 loops=3)
Filter: (random = 5)
Rows Removed by Filter: 300112

A production system running a 2TB on 9.4 with a dev that has a lot of free diskspace that we plan upgrading soon  ma[...]

PostgreSQL WAL Retention with pg_archivecleanup

WAL retention is a very important topic for PostgreSQL database management. But very often we come across DBAs getting into surprising situations such as:

1. Several TBs of WALs piled up in archive destination

2. WALs filling up pg_wal/pg_xlog directory due to failing archive

3. Necessary WALs are no longer preserved

External backup projects for PostgreSQL are good at addressing retention policies. But there is a simple program named pg_archivecleanup, which comes along with the PostgreSQL binaries, that can be useful both for maintaining WAL retention and for handling emergency situations. pg_archivecleanup is probably the least talked about utility among the standard PostgreSQL binaries. It is extremely simple and useful, and it can work in 2 scenarios:

  • Can be used as a standalone program to clean up old WAL files from any file system location.
  • It can also be used on the standby side for cleaning up WAL files which are no longer required by the standby.
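On the standby side it is typically wired in via the standard archive_cleanup_command recovery setting; the archive path below is a placeholder:

archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'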

This program can operate in dryrun mode (-n option) or delete mode (-d option).

Dryrun (-n)

This feature is useful to list all WALs older than a specific WAL. In the following demonstration, I am listing all WALs which are still in the default wal location (pg_wal) inside my data directory and older than a specific WAL:

$ ./pg_archivecleanup -n ~/bigsql/data/pg11/pg_wal 00000001000000000000001E
/home/jobin/bigsql/data/pg11/pg_wal/00000001000000000000001B
/home/jobin/bigsql/data/pg11/pg_wal/000000010000000000000017
...

Delete Mode (-d)

In this mode, pg_archivecleanup does the cleanup by removing all the candidate WALs:

$ pg_archivecleanup -d /home/postgres/archive 00000002000000000000006B
pg_archivecleanup: keeping WAL file "/home/postgres/archive/00000002000000000000006B" and later
pg_archivecleanup: removing file "/home/postgres/archive/000000020000000000000069"
pg_archivecleanup: removing file "/home/postgres/archive/00000002000000000000006A"
...

Understanding WAL retention criteria

In order to do a clean up

[...]
Posted by Dave Page in EnterpriseDB on 2019-07-10 at 12:33
Reverse proxying requests to a pgAdmin server is becoming more and more popular, if posts to the mailing lists are to be taken as an indicative measure – more often than not when using pgAdmin in a container (of which there have now been over 10 million pulls)! Typically users will deploy a reverse proxy for a couple of reasons: to host multiple applications in different subdirectories under the same domain, or to add SSL/TLS support independently of the application.

Because of the number of questions asked, I spent a little time over the last couple of days doing some testing and updating the documentation with some examples. Here's a blog-ified version of that work.


Nginx

Nginx is winning the battle of the web servers these days, beating out Lighttpd (which is still a fine piece of software) and the ageing and arguably bloated Apache HTTPD. All of these servers support reverse proxying, and whilst I've looked at Nginx, the examples shown below can easily be translated to the other servers if you prefer to run them instead.

In the following examples, we have pgAdmin running in a Docker container (in which it's hosted under Gunicorn). For simplicity, the examples have Nginx running on the host machine, but it can also be easily run in another container, sharing a Docker network with pgAdmin. In such a configuration there is no need to map the pgAdmin container port to the host.
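For orientation, a reverse-proxy server block of the general shape this post builds up to might look like the following. This is a hedged sketch rather than the post's exact configuration; it assumes the container's port 80 is mapped to port 5050 on the host (as done further below) and that pgAdmin should be served from the /pgadmin4/ subdirectory:

server {
    listen 80;
    server_name pgadmin.example.com;                      # hypothetical hostname

    location /pgadmin4/ {
        proxy_pass http://localhost:5050/;                # host port mapped to the container
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Scheme $scheme;
        proxy_set_header X-Script-Name /pgadmin4;         # tells pgAdmin its mount point
    }
}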

The container is launched as shown below. See the documentation for information on other useful environment variables you can set and paths you can map.

The commands below will pull the latest version of the container from the repository, and run it with port 5050 on the host mapped to port 80 on the container. It will set the default username and password to user@domain.com and SuperSecret respectively.

docker pull dpage/pgadmin4
docker run -p 5050:80 \
-e "PGADMIN_DEFAULT_EMAIL=user@domain.com" \
-e "PGADMIN_DEFAULT_
[...]

One of the most popular features of PostGIS 2.5 was the introduction of the "vector tile" output format, via the ST_AsMVT() function.

Elephant with character troubles wishes for an ICU collation – © Laurenz Albe 2019

As announced by the glibc mailing list and discussed by Daniel Vérité, version 2.28 of the GNU C library will ship with many changes to the collations.

PostgreSQL uses the operating system’s collations by default. When this version hits your Linux operating system as part of a routine bugfix upgrade, and you don’t use the “C” or “POSIX” collation, you stand a good chance that some indexes on strings will become corrupted.

Since version 10, PostgreSQL supports not only "libc collations" but also "ICU collations", which are not affected by this particular problem. This article describes ICU collations and how you can switch your database to using them.

Why is it a problem?

Collations determine how strings are compared, so they affect everything in the database that has to do with the order of strings:

  • the result of the ORDER BY clause in SELECT statements
  • the order of the keys in b-tree indexes

Now changes in the behavior of ORDER BY can be annoying, but that doesn’t threaten data integrity.
But if your index is suddenly ordered in the wrong fashion, that is data corruption. It can lead to incorrect query results or to duplicate entries in UNIQUE indexes.

What can you do to avoid the problem?

The straightforward remedy is to rebuild all indexes on strings.

The simple way to do this is to run REINDEX INDEX for all affected indexes. However, that will block all attempts to use the index and will prevent data modifications on the table.

In PostgreSQL v12, you will be able to use REINDEX CONCURRENTLY to avoid these disadvantages. In older versions of PostgreSQL you can work around this by using CREATE INDEX CONCURRENTLY to create a copy of the index, then drop the original one. I’ll demonstrate that further down.
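In rough outline the pattern is the following (placeholder names; not necessarily the exact commands demonstrated further down):

-- PostgreSQL 12 and later
REINDEX INDEX CONCURRENTLY mytable_text_idx;

-- older versions: build a replacement, then swap it in
CREATE INDEX CONCURRENTLY mytable_text_idx_new ON mytable (textcol);
DROP INDEX CONCURRENTLY mytable_text_idx;
ALTER INDEX mytable_text_idx_new RENAME TO mytable_text_idx;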

But I wanted to tell you about ICU collations and their benefits.

ICU collations

ICU is a set of libraries that provide, among other things, standardized Unicode collations. They are available on many op

[...]

What if your database design is so poor that you need to refactor tables in order to add primary keys?

Generate Primary Keys (almost) Automatically

While playing with a quite large database (in terms of number of tables) together with a friend of mine, we discovered that almost none of the tables had a primary key!
Gosh!
This is really baaaad!

Why is that bad? Well, you should not ask, but let's leave the poor database design alone and focus on some more concrete problems: in particular, not having a primary key prevents a lot of smart software and middleware from working on your database. As you probably know, almost every ORM requires each table to have at least one surrogate key in order to properly identify each row and enable persistence (that is, modification of rows).

Luckily, fixing tables for such software is quite simple: just add a surrogate key and everyone will be happy again. But unfortunately, while adding a primary key is a matter of issuing an ALTER TABLE, doing so for a long list of tables is boring.

Here comes the power of PostgreSQL again: thanks to its rich catalog, it is possible to automate the process.
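As a rough illustration of the idea (this is not the post's query, which follows below; the choice of a bigserial surrogate key is arbitrary), the catalog can be asked for every table in public that lacks a primary key and made to emit the fixing statement:

SELECT format('ALTER TABLE %I ADD COLUMN id BIGSERIAL PRIMARY KEY;', c.relname)
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public'
  AND c.relkind = 'r'
  AND NOT EXISTS (SELECT 1 FROM pg_constraint x
                  WHERE x.conrelid = c.oid AND x.contype = 'p');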



In this post you will see how to go from a single query to a whole procedure that does the trick.

A query to generate the ALTER TABLE commands

A first example is the following query, which searches for every table in the schema public that does not have a constraint of type p (primary key) and issues an ALTER TABLE for each such table:

testdb=# WITH to_be_fixed...
Posted by Luca Ferrari on 2019-07-08 at 00:00

The coming version of PostgreSQL, 12, will lose the recovery.conf file. It will take some time to get used to!

PostgreSQL & recovery.conf

According to the documentation for the upcoming version 12, the recovery.conf file is gone! The release notes state it clearly: the server will not start if recovery.conf is in place, and all the configuration parameters have moved to the classic postgresql.conf (or included files).

The change proposal is quite old, but it represents a deep change in the way PostgreSQL handles server startup and recovery, and it could take a while for all the software out there to handle it too.

Please note that since PostgreSQL 12 is still in beta, things could change a little, even if the discussion and the implementation is nearly ended.

Two files can be created to instrument a standby node:

  • standby.signal: if present in the PGDATA directory, the host will work as a standby, that is, it will wait for new incoming WAL segments and replay them for the rest of its life;
  • recovery.signal: if present, the WAL replay will stop as soon as all the WALs have been consumed or the recovery_target parameter has been reached.


It is interesting to note that standby.signal takes precedence over recovery.signal, meaning that if both files exist the node will act as a standby. Both files may be empty: they now act as trigger files rather than configuration files (hence the change in the suffix).
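In practice, turning a freshly restored data directory into a standby then looks something like this (a sketch; the connection string is a placeholder and now lives in postgresql.conf rather than in the old recovery.conf):

$ touch $PGDATA/standby.signal
$ echo "primary_conninfo = 'host=primary.example.com user=replicator'" >> $PGDATA/postgresql.conf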

So, what is the rationale for this change? There are several reasons, including removing the need to duplicate configuration files. But...

Posted by Daniel Vérité on 2019-07-06 at 12:50

Exporting query results in CSV has been possible for a long time (since version 8.0), either with
COPY (SELECT ...) TO STDOUT CSV as an SQL command, or with the \copy meta-command in psql, which invokes COPY under the hood and handles the flow of data on the client side.

But there are still a few cases not covered by this functionality, which is why in PostgreSQL 12, CSV has been added to the output formats. It means that we can now issue \pset format csv, so that all commands producing tabular results can output them in the CSV format. It can also be opted for on the command line with the --csv option.

In practice, we need to use an output format instead of \copy:

  • when data must be fetched through a query that is not supported by COPY, for instance a cursor, since fetch from c is valid but copy (fetch from c) to stdout csv is not.

  • when the result is produced by a meta-command instead of a query: \crosstabview, \l, \d etc…
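For instance, with the new format even the output of a cursor fetch comes out as CSV (a quick illustrative psql session):

\pset format csv
BEGIN;
DECLARE c CURSOR FOR SELECT schemaname, tablename FROM pg_tables;
FETCH 2 FROM c;   -- printed as CSV, something COPY (FETCH ...) cannot do
COMMIT;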

CSV as a replacement for the unaligned format (-A)

Aside from \copy, simple exports in tabular format are often done with the “unaligned” format with its field separator given by the fieldsep parameter (-F option).

But this format has two weaknesses:

  • when the separator appears inside the data, there’s no way to distinguish it (no escape mechanism).
  • when line feeds are present inside fields (in multi-line text), there’s no way to distinguish a line feed within a field from a record separator.

Example:

$ query="select 'ab,cd' as col1, 'ef'||chr(10)||'gh' as col2"
$ psql -AtF',' -c "$query"
ab,cd,ef
gh

In the above example, when trying to interpret the output, we can’t know where the fields start and end, nor how many records there were in the source data.

Although it comes from a single record of two columns, this output could just as well represent a single column with ab,cd,ef as the value of the first record, and gh for the second one.

The CSV format solves this problem:

$ psql  --csv -t
[...]
Does anyone really know what time it is? First we have the bare bones of the PostgreSQL Interval, Date and Timestamp Data types. Here are the questions: What types are they? And what options do they have? What postgresql.conf variables affect date and time i/o? Available Date and Time Data types Date day of year […]
Posted by Louise Grandjonc in CitusData on 2019-07-05 at 14:02

Recently, I started working on the django-multitenant application. The main reason we created it was to help django developers use citus in their app. While I was working on it, I wrote unit tests. And to be able to reproduce a customer's production environment, I wanted the tests to use citus and not a single node postgres. If you are using citus as your production database, we encourage you to have it running in your development environment as well as your staging environments to be able to minimise the gap between dev and production. To understand better the importance of dev/prod parity, I recommend reading the Twelve-Factor app, which will give you ideas to lower the chances of last minute surprises when deploying to prod.

The goal of this article is to explain how I set it up as I think that if you are using citus on production, it is better to have tests running with citus.

Setting up docker

I will assume that you already have docker installed; otherwise, I recommend taking a look at our article [https://docs.citusdata.com/en/v8.1/installation/singlemachinedocker.html#single-machine-docker].

Once you have docker running, create in your app directory a docker-compose.yml containing:

version: '2.1'

services:
  master:
    image: 'citusdata/citus:8.2.1'
    ports: ['5600:5432']
    labels: ['com.citusdata.role=Master']
    volumes: ['/var/run/postgresql']
  manager:
    container_name: "${COMPOSE_PROJECT_NAME:-citus}_manager"
    image: 'citusdata/membership-manager:0.1.0'
    volumes: ['/var/run/docker.sock:/var/run/docker.sock']
    depends_on: { master: { condition: service_healthy } }
  worker1:
    image: 'citusdata/citus:8.2.1'
    ports: ['5601:5432']
    labels: ['com.citusdata.role=Worker']
    depends_on: { manager: { condition: service_healthy } }
  worker2:
    image: 'citusdata/citus:8.2.1'
    ports: ['5602:5432']
    labels: ['com.citusdata.role=Worker']
    depends_on: { manager: { condition: service_healthy } }
  healthcheck:
    image: busybox
    depends_on:
   
[...]
Posted by Kirk Roybal in 2ndQuadrant on 2019-07-05 at 13:26
Computed columns in PostgreSQL 12 are here, but still have a ways to go.
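For reference, a computed (generated) column in PostgreSQL 12 is declared like this (a minimal sketch; note that only STORED generation exists in v12, there is no VIRTUAL variant yet):

CREATE TABLE items (
    price numeric NOT NULL,
    qty   integer NOT NULL,
    total numeric GENERATED ALWAYS AS (price * qty) STORED
);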

I already wrote a blog post, "Building PostgreSQL with MSYS2 and MinGW under Windows", a year ago. But this time I want to show you two important things:

  1. cross compiling (x86, x64) using the same run-time;
  2. compiling with the latest OpenSSL support.

It's true that nowadays you won't find binaries or installations of PostgreSQL without OpenSSL support. It is used not only for encryption but for compression as well.

Even though it's Windows, let's do all the steps in the command prompt only. Ya, I told you Linux guys, we can run a console too!

Setting up the environment

First things first. I launch my msys2 shell. I assume you have it on your system; if not, check the installation instructions in my previous post.

c:\msys64\ is the msys2 installation folder.

C:\Users\pasha>c:\msys64\msys2_shell.cmd -mingw64 -conemu

I'm telling msys2 to run a shell and use -mingw64 as the default toolchain. That means we will build x64 binaries. -conemu is an optional parameter; you may want to pass it in case you're a fan of ConEmu as I am. If not, just skip it.

pasha@PG480 MINGW64 ~
$ mkdir /src

pasha@PG480 MINGW64 ~
$ cd /src

pasha@PG480 MINGW64 /src
$ git clone https://github.com/openssl/openssl.git --branch OpenSSL_1_1_1-stable --depth 1
Cloning into 'openssl'...
remote: Enumerating objects: 18297, done.
remote: Counting objects: 100% (18297/18297), done.
remote: Compressing objects: 100% (15182/15182), done.
remote: Total 18297 (delta 884), reused 14976 (delta 582), pack-reused 0
Receiving objects: 100% (18297/18297), 13.48 MiB | 1.38 MiB/s, done.
Resolving deltas: 100% (884/884), done.
Checking out files: 100% (18166/18166), done.

pasha@PG480 MINGW64 /src
$ git clone https://github.com/postgres/postgres.git --branch REL_11_STABLE --depth 1
Cloning into 'postgres'...
remote: Enumerating objects: 5868, done.
remote: Counting objects: 100% (5868/5868), done.
remote: Compressing objects: 100% (5365/5365), done.
remote: Total 5868 (delta 754), reused 1296 (delta 317), pack-reused 0
Receiving objects: 100% (5
[...]

Crunchy Data has recently announced an update to the CIS PostgreSQL Benchmark by the Center for Internet Security, a nonprofit organization that provides publications around standards and best practices for securing technologies systems. This newly published CIS PostgreSQL 11 Benchmark joins the existing CIS Benchmarks for PostgreSQL 9.5, 9.6, and 10 while continuing to build upon Crunchy Data's efforts with the PostgreSQL Security Technical Implementation Guide (PostgreSQL STIG).

Posted by Louise Grandjonc on 2019-07-01 at 17:20
Introduction In the previous article, I introduced the internal data structure of a postgres GIN index. Now it’s time to talk about searching and maintaining them. Searching in a GIN index When postgres uses the GIN index, you’ll notice in the EXPLAIN that it’s always doing a bitmap heap scan. It means that the search algorithm returns a map of tids (pointers to rows) in physical memory order. Considering the way the data is stored inside the posting lists and trees, it makes a lot of sense to use this method.
Posted by Louise Grandjonc on 2019-07-01 at 17:20
Introduction GIN (Generalized Inverted Index) indexes are often used to index arrays, jsonb, and tsvector (for full text search) columns. When it comes to arrays, they are, for example, used to verify whether an array is contained in another array or contains certain elements (the <@ operator). In the documentation you can see the full list of operators. But the question that I want to answer in this article is: “why should we use GIN indexes for these data types and operators?”
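As a concrete example of the kind of setup these articles discuss (table and column names are made up):

CREATE TABLE docs (id serial PRIMARY KEY, tags text[]);
CREATE INDEX docs_tags_gin ON docs USING gin (tags);
-- containment queries such as this one can use the GIN index
SELECT id FROM docs WHERE tags @> ARRAY['postgres'];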
PostgreSQL Indexes

I want to thank everybody who attended my session “Deep Dive in PostgreSQL Indexes” in Austin. It was quite a wonderful experience! To cover a bigger audience, I also did a webinar on the same topic. There were many questions in that webinar, but unfortunately, there was not enough time to cover each and every question, so I decided to have a followup Q&A session with a blog post on the topic.

The recording and slides are available here. Below is the list of your questions.

Q1: Is there a concept of the functional index in PostgreSQL?

A functional index is an index defined on the result of a function. This has been present since PostgreSQL 7. In later releases of PostgreSQL, you can define an index on the result of an expression, which can be any PostgreSQL expression; a function call is just a special case of an expression.
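For example (an illustrative expression index, not taken from the webinar):

CREATE INDEX users_lower_email_idx ON users (lower(email));
-- a query using the same expression can use the index
SELECT * FROM users WHERE lower(email) = 'alice@example.com';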

Q2: Can I create an index using a table row data that is attached to the index? Does my query have to use index-only-scan?

Yes, if all selected columns are in the index, then index-only-scans will be used.

Q3: It's a little bit hard to choose the right index type. Is there any mechanism in Postgres that suggests an index type?

Yes, it's true that it is hard to choose, but it all depends on your data and query types. I have mentioned recommendations in my slides. There is no mechanism for that in PostgreSQL.

Q4: Is it possible to manipulate which index is used by a query if you have multiple indexes?

No, it's the optimizer's responsibility to choose between the defined indexes. Trust that the PostgreSQL optimizer will choose the best option for you.

Q5: If there’s a need to index a date column, would it be better to use B-TREE or BRIN index?

A date data type is mostly aligned with the physical storage order. If you are not making many update and delete operations, then go for BRIN. But if you have many updates and deletes, then B-TREE is best for you.

Q6: For time-based filters what kind of index is preferable?

Time is also aligned with physical storag

[...]

Machine Learning, Deep Learning, Artificial Intelligence, and Data Science have become really important topics these days. Everybody is looking for new technologies and tries to make use of those recent developments. PostgreSQL can help to achieve those goals, and it offers reliable backend storage for your data.

PostgreSQL vs. CSV files vs. commercial databases

In the past we have seen that many in the Data Science and AI community are still using CSV files to process data. While there is a lot of tooling available (e.g. the Python pandas library), it might not be the best of all decisions to build your models on CSV files.

So what is wrong with a CSV file? Well, actually nothing as long as …

  • the format of your file is correct
  • the amount of data is reasonable
  • ALL the data in the file is needed
  • nothing changes

All those points might seem obvious BUT my practical experience shows that they are not and that those issues are seriously underappreciated. What do I mean by that?

Recently we saw a customer who wanted to store interest rates for some financial models. What did we find in the data? Let me quote: "3 percent" (as a string). Now: how do you expect your AI models to work if the data is so wrong? To do data mining you will need high quality data – not some crap. Which leads to an important observation:

“If you are doing Artificial Intelligence, 80% of your effort has to go into data preparation – only 20% is the real model”

In short: You cannot seriously expect a model to learn something about “interest rates” if the raw data does not even ensure that the interest rate is actually a number.
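This is exactly where a database schema earns its keep; for instance (an illustrative table, not the customer's actual one):

CREATE TABLE interest_rates (
    model_id   bigint       NOT NULL,
    valid_from date         NOT NULL,
    rate       numeric(6,4) NOT NULL CHECK (rate BETWEEN 0 AND 1)  -- stored as 0.0300, never '3 percent'
);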

The second remark is about the amount of data: suppose you have a TB of data that you want to process or maybe sort. What do you think: who is better at sorting? A database, which has been optimized to sort data for 30 years, or some hand-written Python program which you have written over a weekend? From experience I can tell you: you've got no chance. The d

[...]