
PostgreSQL is one of the most advanced open source databases in the world, with a lot of great features. One of them is Streaming Replication (physical replication), which was introduced in PostgreSQL 9.0. It is based on WAL (XLOG) records that are transferred to the destination server and applied there. However, it is cluster-based: we cannot replicate a single database or a single object (selective replication). Over the years we have depended on external tools like Slony, Bucardo and BDR for selective or partial replication, as there was no such feature in core through PostgreSQL 9.6. PostgreSQL 10 introduced a feature called Logical Replication, with which we can perform database- or object-level replication.

Logical Replication replicates changes to objects based on their replication identity, which is usually a primary key. It differs from physical replication, which replicates blocks byte by byte. Logical Replication does not require an exact binary copy on the destination server, and unlike physical replication it leaves the destination writable. The feature originates from the pglogical module.
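
If a table has no primary key, the replication identity can be set explicitly. A minimal sketch (the table name is made up for illustration):

-- Use the whole row as the replication identity for a table
-- that has no primary key or suitable unique index
ALTER TABLE audit_log REPLICA IDENTITY FULL;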

In this blog post, we are going to discuss:

  • How it works - Architecture
  • Features
  • Use cases - when it is useful
  • Limitations
  • How to achieve it

How it Works - Logical Replication Architecture

Logical Replication implements a publish-and-subscribe model (Publication & Subscription). Below is a high-level architectural diagram of how it works.

Basic Logical Replication Architecture

A publication can be defined on the master server; the node on which it is defined is referred to as the "publisher". A publication is a set of changes from a single table or a group of tables. It is defined at the database level, and each publication exists in exactly one database. Multiple tables
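
As a minimal sketch of the publish/subscribe setup (the object names and connection string are invented for illustration, not taken from the post):

-- On the publisher:
CREATE PUBLICATION my_pub FOR TABLE customers, orders;

-- On the subscriber (the tables must already exist there):
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=pub.example.com dbname=appdb user=repuser'
    PUBLICATION my_pub;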


Last week I was at Nordic PGDay 2018, and I had quite a few conversations about the tool that I wrote, pglupgrade, which automates PostgreSQL major version upgrades in a replication cluster setup. I was quite happy to see that it has been heard of, and that other people in different communities are giving talks at meetups and conferences about near-zero downtime upgrades using logical replication. Since I gave this talk at PGDAY'17 Russia, PGConf.EU 2017 in Warsaw and most recently at FOSDEM PGDay 2018 in Brussels, I thought it would be better to write a blog post to keep the presentation available to folks who could not make it to any of those conferences. If you would like to go straight to the talk and skip reading this post, here is your link: Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

Near-Zero Downtime Automated Upgrades of PostgreSQL Clusters in Cloud

The main motivation behind developing this tool was to propose a solution that minimizes downtime during major version upgrades, which unfortunately affect almost everyone who uses PostgreSQL. Currently there is no tool that allows PostgreSQL users to upgrade their databases without downtime, and it is clearly a pain point for many users, especially businesses. And if we are to solve the upgrade problem, we should think of more than one server (referred to as a cluster from now on), simply because not many people run only a single database server anymore. The most common scenario is a physical streaming replication setup, either for high availability or for scaling read queries.

Database Upgrades

Before diving into the solution, let’s discuss a bit about how database upgrades work in general. There are four main possible approaches to database upgrades:

  1. The first approach would be for databases to keep their storage format the same, or at least compatible, across versions. However, this is hard to guarantee long-term, as new features might require changes in how data is stored or add more metadata to work properly. Also, performance is often improved by optimi

PostgreSQL is the world’s most advanced open source database, and per the PostgreSQL Wikipedia page it is an object-relational database management system (ORDBMS) with an emphasis on extensibility and standards compliance.

In this article, we try to understand why PostgreSQL would be called an object-relational thing. What is Object-Oriented Programming, and how does it apply to a database system?

Posted by Gulcin Yildirim in 2ndQuadrant on 2018-03-22 at 12:11

Last month, I was at the 2ndQuadrant booth at P2D2, an annual conference organized in Prague by the local user group, CSPUG (Czech and Slovak PostgreSQL User Group). The conference was certainly a huge success, and more than 200 people showed up! Meeting a lot of PostgreSQL enthusiasts in town, I wondered why there were no regular meetups in Prague. At the conference after-party, I started asking folks and learned that, given a chance, some of them would give talks and many more would be happy to attend. The initial reactions confirmed my observation, and I felt motivated enough to start organizing meetups myself!

Even though I've been living in Prague for some time now, it would have been a little difficult to manage everything all by myself. My colleague Tomas Vondra stepped forward to help and gave me access to our already existing meetup page (surprise!), which he had founded.

I immediately started to work on the meetup page and designed a cover photo using the existing CSPUG logo. Now it has a brand new identity and looks good to the eye. The kickstart for the page has already paid off: we have more than 100 members and the number is growing almost every day! You can visit and join our group here: Prague PostgreSQL Meetup

Having set up the meetup page, we needed a place for announcements so we created a Twitter account. You may follow us @PrahaPostgreSQL  and help us to grow by following/retweeting!

After reviving the meetup page and creating a social channel, the tougher part was organizing the meetups themselves. Luckily, we found some volunteers (thank you so much!), both from Prague and abroad, to help us! We will be organizing a meetup on the last Monday of each month, and we plan to have one talk in English and another in Czech to attract both locals and non-Czech speakers at the same time. Please feel free to write to me if you have any suggestions/feedback.

Upcoming Meetups

We have already secured our meetups for the next two months and are inviting speakers for our session in June. You can have a look at our upcoming meetups here:

Posted by Ashutosh Bapat in EnterpriseDB on 2018-03-21 at 12:54
PGConf India 2018 attracted a large number of PostgreSQL users and developers. I talked about "Query optimization techniques for partitioned tables" (slides). Last year, I gave an introductory talk about PostgreSQL's declarative partitioning support (slides). Many conference participants shared their perspective on partitioning with me. One particular query got me experimenting a bit.

The user had a huge table, almost 1TB in size, with one of the columns recording the data-creation time. Applications added MBs of new data daily and updated only the recent data. The old data was retained in the table for reporting and compliance purposes. The updates bloated the table, and autovacuum wasn't clearing the bloat efficiently. Manual vacuum was out of scope, as that would have locked the table for much longer. As a result queries were slow, and performance degraded day by day. (Read more about bloats and vacuum here and here.) The user was interested in knowing whether partitioning would help.

Hot and Cold partitioning

The concept of hot and cold partitioning isn't new. The idea is to separate data that is accessed and modified frequently (Hot data) from data that is accessed and modified rarely (Cold data). In the above case, that can be achieved by partitioning the data by the creation timestamp. The partitions should be sized such that the updates and inserts access only a handful of Hot partitions (ideally at most two). The Cold partitions containing the stale data remain almost unchanged. Since the updates take place in the Hot partitions, those get bloated, but their sizes are much smaller than the whole table. Vacuuming them doesn't take as much time as vacuuming the whole table, and once they become Cold they hardly need any vacuuming, which contains the bloat effectively.
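
A rough sketch of such a layout using PostgreSQL 10 declarative partitioning (table name, columns and boundaries are invented for illustration):

-- Partition the table by the data-creation time
CREATE TABLE events (
    id         bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    text
) PARTITION BY RANGE (created_at);

-- Hot partition: receives the current inserts and updates
CREATE TABLE events_2018_03 PARTITION OF events
    FOR VALUES FROM ('2018-03-01') TO ('2018-04-01');

-- Cold partition: old data that is rarely modified and hardly needs vacuuming
CREATE TABLE events_2018_02 PARTITION OF events
    FOR VALUES FROM ('2018-02-01') TO ('2018-03-01');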

In PostgreSQL, autovacuum, if enabled on the given table, runs its job when the number of inserted, deleted or updated rows rises above certain thresholds (see details). Since all the action happens in the Hot partitions, only those partitions can

You have all probably heard about the Meltdown and Spectre vulnerabilities, which exploit bugs in modern CPU architectures. The biggest problem with these vulnerabilities is that they are hardware bugs, so they cannot easily be fixed; the fixes need to be done in the OS kernel itself (for each OS individually). These flaws exist in the first place because of CPU performance optimizations, and because the mitigations are applied at the software level, we should expect some performance penalties.

At the time these vulnerabilities were released to the public, we were in the process of testing some new SSDs. That’s why we used that server to test how these mitigations would impact the performance of our most used database, Postgres. Before we go into the results, we have to explain three mitigation features that were added to the Linux kernel:

  • PTI – Meltdown mitigation
  • IBRS – Spectre v1 mitigation
  • IBPB – Spectre v2 mitigation

All the news reports were saying that PTI should have the biggest performance impact, while the other two should have a minor effect. It was also said that if the CPU supports PCID (Process-Context Identifier) and INVPCID (Invalidate Process-Context Identifier), these mitigations should have a smaller impact on performance. Our CPUs support both of these features.

Here is the testing server:

  • CPU: 2x Intel Xeon E5-2630 v4 (20 cores total)
  • RAM: 29GB limited with mem=32G on boot
  • DISK: RAID10 4x SM863a 1.9TB (Dell H730p controller)
  • OS: CentOS 7.4.1708
  • Kernel: kernel-3.10.0-693.11.6.el7.x86_64
    (microcode_ctl-2.1-22.2, linux-firmware-20170606-57.gitc990aae)
  • DB: PostgreSQL 9.5.10 (postgresql.conf used for these benchmarks)
  • FS: XFS (nobarrier, noatime)

Tests were done using pgbench with two different sizes of the database (different scale parameter -s):  9000 (DB of size ~140GB) and 1000 (DB of size ~15GB). The reason for this is that we wanted to test two different scenarios:

  • DB cannot fit into the OS page cache, so there would be a lot more reading directly from the SSDs, and

As part of their enterprise monitoring systems, organizations rely on alerts and notifications as their first line of defense for achieving high availability and, consequently, lowering outage costs.

The terms alert and notification are sometimes used interchangeably; for example, we can say "I have received a high load system alert", and replacing "alert" with "notification" does not change the meaning. However, in the world of management systems it is important to note the difference: alerts are events generated as a result of system trouble, while notifications are used to deliver information about system status, including trouble. As an example, the Severalnines blog Introducing the ClusterControl Alerting Integrations discusses one of ClusterControl's integration features, the notification system, which is able to deliver alerts via email, chat services, and incident management systems. Also see PostgreSQL Wiki — Alerts and Status Notifications.

In order to accurately monitor the PostgreSQL database activity, a management system relies on the database activity metrics, custom features or monitor advisors, and monitoring log files.

In this article I review the tools listed in the Monitoring and PostgreSQL GUI sections of the PostgreSQL Wiki, skipping those that aren't actively maintained or that do not provide alerting and notifications, either within the product or with a free trial account. While this is not an exhaustive review, each tool was installed and configured up to the point where I could understand its alerting and notification capabilities.


Nagios is a popular on-premise, general-purpose monitoring system that offers a wide range of plugins. While Nagios Core is open source, the recommended solution for monitoring PostgreSQL is Nagios XI.

Notification settings are per user, and in order to change them the administrator must "log in as" the user — Nagios uses the term masquerade as. Once on the account settings page, the user can choose to enable or disable the notification methods:

Nagios XI Notification Preferences
Nagios XI Notification
Posted by Pavel Stehule on 2018-03-20 at 17:55
I sent a new patch to the mailing list - a variable can be composite, scalar or array - and it works like plpgsql variables, which was my target:
postgres=# create variable foo as numeric default 0;
postgres=# select foo;
┌─────┐
│ foo │
├─────┤
│   0 │
└─────┘
(1 row)

postgres=# let foo = pi();
postgres=# select foo;
┌──────────────────┐
│       foo        │
├──────────────────┤
│ 3.14159265358979 │
└──────────────────┘
(1 row)

postgres=# create variable boo as (x numeric default 0, y numeric default 0);
postgres=# let boo.x = 100;
postgres=# select boo;
┌─────────┐
│   boo   │
├─────────┤
│ (100,0) │
└─────────┘
(1 row)

postgres=# select boo.x;
┌─────┐
│  x  │
├─────┤
│ 100 │
└─────┘
(1 row)

Many of us have known for years how great Postgres is.

In fact, I recall a conversation about six years ago with some sales engineers who had previously worked for a large database vendor down in Redwood City that really no one likes. They were remarking that the biggest threat to them was Postgres. At first they were able to just brush it off, saying it was open source and no real database could be open source. Then, as they dug in, they realized there was more there than most knew about, and that they would have to keep finding ways to discredit it in sales conversations. Well, it doesn't look like those SEs or the rest of that company were too successful.

Postgres is certainly having its moment, and I personally don't expect it to fade soon.

An equally interesting shift I've watched from the outside has been Microsoft's move to support and engage with the open source movement. Personally, that shift is extremely exciting to see, especially today as they announce general availability of their Postgres offering. And with this announcement it looks like they're not just dabbling but shipping a very compelling offering; notably, high availability is built in, which means they're very much targeting production workloads. With their GA release there are a number of interesting boxes checked:

  • HIPAA, SOC, ISO compliance
  • 99.99% uptime SLA
  • Available in 22 regions

Personally, I'm looking forward to the new competition for Postgres users, as it will make Postgres better and better for everyone. When I was at Heroku running product for Heroku Postgres seven years ago, we were the only major provider. Today that landscape looks a lot different, and that just means more choice and more quality if you want to run Postgres. So welcome, Microsoft; I look forward to giving Azure Postgres a try.

The slides from my talk at Nordic PGDay, Why PostgreSQL is Terrible, are now available.

Posted by Umair Shahid in 2ndQuadrant on 2018-03-20 at 07:28

1. Introduction

EclipseLink was announced in 2008 as the JPA 2.0 reference implementation from the Eclipse Foundation. It is based on the TopLink project, from which Oracle contributed code to EclipseLink. The project delivers an open source runtime framework supporting the Java Persistence API standards, providing a proven, commercial-quality persistence solution that can be used in both Java SE and Java EE applications.

EclipseLink is open source and is distributed under the Eclipse Public License.

2. Implementation Details

Like Hibernate, EclipseLink is also fully JPA 2.0 compliant. This makes the implementation details quite similar to those already described in a preceding Hibernate blog.

For the illustration below, we will continue using the ‘largecities’ example with table structure:

postgres=# \d largecities

     Table "public.largecities"
 Column |          Type          | Modifiers 
--------+------------------------+-----------
 rank   | integer                | not null
 name   | character varying(255) | 
Indexes:
    "largecities_pkey" PRIMARY KEY, btree (rank)

2.1 Creating an Entity

You can create the entity using JPA’s standard @Entity and @Id (and other) annotations:

import javax.persistence.Entity;
import javax.persistence.Id;

@Entity
public class LargeCities {

    @Id
    private int rank;
    private String name;

    public int getRank() {
        return rank;
    }

    public String getName() {
        return name;
    }

    public void setRank(int rank) {
        this.rank = rank;
    }

    public void setName(String name) {
        this.name = name;
    }
}

2.2 Creating the Configuration XML

EclipseLink’s configuration XML file goes by the name of persistence.xml and should be placed under the META-INF folder at the root of the persistence unit or on the classpath. A sample XML is given below:

<?xml version="1.0" encoding="UTF-8"?>
<persistence version="2.0"
    xmlns="" xmlns:xsi="
From time to time, you may need to figure out which file in a PostgreSQL data directory corresponds to a particular table or index in the database. For example, pg_hexedit users sometimes need this information, since pg_hexedit is a frontend utility that works by reading relation files from the filesystem. In practice, a pg_hexedit convenience script can usually be used instead. Users need only give the name of the table or index that is to be examined. The convenience scripts call the built-in function pg_relation_filepath() via an SQL query.
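
On a healthy database, that lookup boils down to a single query (the argument here is just an example):

-- Path of a relation's main file, relative to the data directory
SELECT pg_relation_filepath('pg_attribute');
-- returns something like base/16384/1249 (the exact path varies per cluster)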

This approach won't always work, though. pg_hexedit is a tool for investigating corruption, and sometimes corruption can affect system catalogs in a way that makes it impossible to even establish a connection to the database. You may find that you're greeted with an arcane error any time you attempt to connect to the database. The error may look something like this:

ERROR:  catalog is missing 3 attribute(s) for relid 827

In this example, the issue that prevents us from connecting must have something to do with the system catalog pg_attribute, and/or an index on pg_attribute. The catalog with relid/pg_class OID 827 (pg_default_acl_role_nsp_obj_index) appears to lack pg_attribute entries, making the built-in/catalog index pg_default_acl_role_nsp_obj_index unusable (note that there is no reason to think that the underlying relation file for pg_default_acl_role_nsp_obj_index is itself corrupt). To confirm this theory, we'll need to directly examine the pg_attribute relation for the database. Of course, there is no way to query pg_attribute, because we cannot connect. Moreover, there is no easy way to know where the files associated with pg_attribute are, so that we can at least examine pg_attribute using pg_hexedit. The "relfilenode" number that corresponds to pg_attribute (or any other table) isn't hard-coded or stable. For example, the relfilenode of a table will change anytime VACUUM FULL is used on the table.

I've written a tool called pg_filenodemap that can help when someth[...]

Thanks to Tomas Vondra's presentation CREATE STATISTICS – What is it for? at Nordic PGDay 2018, my Trustly colleague Claes Jakobsson came up with the idea for a new extension.

The presentation explained how to manually compare the estimated rows in a plan with the actual rows, by running EXPLAIN ANALYZE by hand.
Claes came up with the idea of doing this comparison automatically, right after a query has been executed.
During the conference, Claes implemented the idea and emailed me some code to test.

Below is a demo of this extension, using the same data set and queries as in the presentation.

git clone
psql -X -f create-statistics-talk/create.sql

Let’s first try the query without pg_badplan using EXPLAIN ANALYZE:

\copy zip_codes from ~/src/create-statistics-talk/no_postal_codes_utf.csv with csv header;
COPY 4574

EXPLAIN ANALYZE SELECT * FROM zip_codes WHERE city = 'Oslo' AND county = 'Oslo';
                                      QUERY PLAN                                       
---------------------------------------------------------------------------------------
 Seq Scan on zip_codes  (cost=0.00..108.61 rows=90 width=36) (actual rows=642 loops=1)
   Filter: ((city = 'Oslo'::text) AND (county = 'Oslo'::text))
   Rows Removed by Filter: 3932
 Planning time: 0.357 ms
 Execution time: 0.679 ms
(5 rows)

As you can see, we can manually compute the ratio: approximately 7.133 (= 642/90).

Next, let’s see how we can automate this manual process using pg_badplan.

git clone
cd pg_badplan
make install
echo "shared_preload_libraries = 'pg_badplan'" >> /usr/local/var/postgres/postgresql.conf
echo "pg_badplan.min_row_threshold = 10" >> /usr/local/var/postgres/postgresql.conf
brew services restart postgresql

The pg_badplan.min_row_threshold has been set to 10 just for testing; the default is 1000, but the query in the example returns fewer rows than that.

We can now try to run the same query without EXPLAIN ANALYZE,


As a professional PostgreSQL support company we see a lot of SQL performance stuff, which is often worth sharing. One of those noteworthy things, which might be interesting to you, the reader, happened this week when I was out on a business trip to Berlin, Germany. This (excellent) customer was making extensive use of windowing functions and analytics. However, there is always room to speed things up.

“Improving SQL performance” or “how to trick the optimizer”

Most people simply write their SQL code and execute it, assuming that the optimizer will take care of things on its own. While this is usually the case, there are still corner cases where some clever optimization – done by a professional – can give you an edge and better database performance.

One way to improve speed is by rearranging windowing functions in a clever way. Consider the following example:

test=# CREATE TABLE data (id int);
test=# INSERT INTO data SELECT * FROM generate_series(1, 5);

Our example is pretty simple: All we need is a table containing 5 rows:

test=# SELECT * FROM data;
 id 
----
  1
  2
  3
  4
  5
(5 rows)

Let us take a look at a simple example now:

test=# SELECT *, array_agg(id) OVER (ORDER BY id) FROM data;
 id |  array_agg  
----+-------------
  1 | {1}
  2 | {1,2}
  3 | {1,2,3}
  4 | {1,2,3,4}
  5 | {1,2,3,4,5}
(5 rows)

What we have here is a simple aggregation. For the sake of simplicity I have used array_agg, which simply shows how our data is aggregated. Of course we could also use min, max, sum, count or any other window function.
Let me add a second column to the example:

test=# SELECT *,
        array_agg(id) OVER (ORDER BY id),
        array_agg(id) OVER (ORDER BY id DESC)
FROM    data;
 id |  array_agg  |  array_agg  
----+-------------+-------------
  5 | {1,2,3,4,5} | {5}
  4 | {1,2,3,4}   | {5,4}
  3 | {1,2,3}     | {5,4,3}
  2 | {1,2}       | {5,4,3,2}
  1 | {1}         | {5,4,3,2,1}
(5 rows)

In this case there are two columns with two different OVER-clauses. Note that those two aggregations are u

We compared conformance to the SQL/JSON standard (SQL:2016) in the latest versions of the major relational databases, and it is clear from the table below that PostgreSQL's support is the best! Nikita Glukhov and I started this project a year ago, a couple of months after the standard was published. Our initial intention was to have it in PG 10, but the community was wary of the size of the patch. A year of development, reading the standard and a lot of chatting were really helpful to us, and we are now confident that the SQL/JSON standard is really useful and our implementation is solid.

We have a lot of items on our TODO list, for example Smart indexing of json (PDF).

There is still a chance it will make it into PG 11, but looking at the activity on -hackers I find that hard to believe (I still hope Andrew Dunstan will help, as he did 4 years ago with jsonb). In any case, I hope we will commit SQL/JSON to the Postgres Professional products.

Description of implementation of SQL/JSON in PostgreSQL and our extensions.

Patches on Commitfest:


SQL/JSON feature | PostgreSQL 11.0 | Oracle 18c | MySQL 8.0 | MS SQL Server 2017 | Comments

JSON_OBJECT(k : v)
  MySQL: JSON_OBJECT(k, v, ...) syntax, duplicate keys removed
  Oracle: duplicate keys removed

JSON_OBJECT(k VALUE v)
  Oracle: duplicate keys removed
  MySQL: NULL ON NULL by default
  PostgreSQL: FORMAT JSON is not supported
  MySQL: k, v
  MySQL: NULL ON NULL by default
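
As a small illustration of the constructor syntax the table refers to, here is the SQL:2016 form as implemented by these patches (this is not available in core PostgreSQL 10):

-- SQL/JSON object constructor, SQL:2016 syntax
SELECT JSON_OBJECT('city' VALUE 'Oslo', 'zip' VALUE '0150');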

Whether migrating a database or project from MySQL to PostgreSQL, or choosing PostgreSQL for a new project with only MySQL knowledge, there are a few things to know about PostgreSQL and the differences between the two database systems.

PostgreSQL is a fully open source database system released under its own license, the PostgreSQL License, which is described as "a liberal Open Source license, similar to the BSD or MIT licenses." This has allowed the PostgreSQL Global Development Group (commonly referred to as PGDG), which develops and maintains the open source project, to improve it with help from people around the world, turning it into one of the most stable and feature-rich database solutions available. Today, PostgreSQL competes with the top proprietary and open source database systems for features, performance, and popularity.

PostgreSQL is a highly standards-compliant relational database system that's scalable, customizable, and has a thriving community of people improving it every day.

What PostgreSQL Needs

In a previous blog, we discussed setting up and optimizing PostgreSQL for a new project. It is a good introduction to PostgreSQL configuration and behavior, and can be found here:

If migrating an application from MySQL to PostgreSQL, the best place to start would be to host it on similar hardware or hosting platform as the source MySQL database.

On Premise

If hosting the database on premise, bare metal hosts (rather than Virtual Machines) are generally the best option for hosting PostgreSQL. Virtual Machines do add some helpful features at times, but they come at the cost of losing power and performance from the host in general, while bare metal allows the PostgreSQL software to have full access to performance with fewer layers between it and the hardware. On premise hosts would need an administrator to maintain the databases, whether it’s a full time employee or contractor, whichever makes more sense for the application needs.

In The

I released version 1.0.0 of the pspg pager. It supports psql, mysql, vertica and pgcli output formats, and can be used with these databases.

There are times in Postgres when you may want to generate sample data, or a consistent series of records to join against for reporting. Enter the simple but handy set-returning function generate_series. As the name implies, generate_series lets you generate a set of values starting at some point, ending at another point, and optionally stepping by a given increment. generate_series works on two datatypes:

  • integers
  • timestamps

Let’s get started with the most basic example:

SELECT *
FROM generate_series(1, 5);
 generate_series 
-----------------
               1
               2
               3
               4
               5
(5 rows)

So generate_series is pretty straightforward, but in what interesting ways can it be used?

Generating fake data

By putting generate_series inside a CTE we can easily generate a set of numbers and then perform some operation against each value. If we want to generate some fake numbers we can use random(), which generates a random number between 0.0 and 1.0.

WITH numbers AS (
  SELECT *
  FROM generate_series(1, 5)
)
SELECT generate_series * random()
FROM numbers;

(5 rows)

Pretty weekly reporting with joins

Aggregating across some time dimension is a fairly common report. A good example might be new users per week. The simplest way to get this would be by leveraging the Postgres date_trunc function:

SELECT date_trunc('week', created_at),
       count(*)
FROM users
GROUP BY 1;

The issue with the above query arises when two things are true: first, you're charting your data over time, and second, you have a week with no sign-ups. In the case of a week with no sign-ups, you'd simply miss the 0 on your graph, leaving a misleading impression. To smooth this out we go back to generate_series and do an outer join on the week:

WITH range_values AS (
  SELECT date_trunc('week', min(created_at)) as minval,
         date_trunc('week', max(c

Key Things to Monitor in PostgreSQL - Analyzing Your Workload

In computer systems, monitoring is the process of gathering metrics, analyzing, computing statistics and generating summaries and graphs regarding the performance or the capacity of a system, as well as generating alerts in case of unexpected problems or failures which require immediate attention or action. Therefore, monitoring has two uses: one for historic data analysis and presentation which help us identify medium and long term trends within our system and thus help us plan for upgrades, and a second one for immediate action in case of trouble.

Monitoring helps us identify problems and react to those problems concerning a wide range of fields such as:

  • Infrastructure/Hardware (physical or virtual)
  • Network
  • Storage
  • System Software
  • Application Software
  • Security

Monitoring is a major part of the work of a DBA. PostgreSQL has traditionally been known to be "low-maintenance" thanks to its sophisticated design, meaning the system can get by with relatively little attention compared to the alternatives. However, for serious installations where high availability and performance are of key importance, the database system has to be monitored regularly.

The role of the PostgreSQL DBA can extend beyond the strictly technical to higher levels of the company's hierarchy: apart from basic monitoring and performance analysis, the DBA must be able to spot changes in usage patterns, identify the possible causes, verify the assumptions and finally translate the findings into business terms. As an example, the DBA must be able to identify a sudden change in a certain activity that might be linked to a possible security threat. So the PostgreSQL DBA holds a key role within the company and must work closely with other departmental heads in order to identify and solve the problems that arise. Monitoring is a great part of this responsibility.

PostgreSQL provides many out-of-the-box tools to help us gather and analyze data. In addition, due to its extensibility,



Unlock the tremendous energy of the vacuum!
© xkcd.com (Randall Munroe) under the Creative Commons Attribution-NonCommercial 2.5 License

Whenever rows in a PostgreSQL table are updated or deleted, dead rows are left behind. VACUUM gets rid of them so that the space can be reused. If a table doesn’t get vacuumed, it will get bloated, which wastes disk space and slows down sequential table scans (and – to a smaller extent – index scans).

VACUUM also takes care of freezing table rows to avoid problems when the transaction ID counter wraps around, but that's a different story.

Normally you don’t have to take care of all that, because the autovacuum daemon built into PostgreSQL does that for you.

The problem

If your tables get bloated, the first thing you check is whether autovacuum has processed them or not:

SELECT schemaname, relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup
    / (n_live_tup
       * current_setting('autovacuum_vacuum_scale_factor')::float8
          + current_setting('autovacuum_vacuum_threshold')::float8)

If your bloated table does not show up here, n_dead_tup is zero and last_autovacuum is NULL, you might have a problem with the statistics collector.

If the bloated table is right there on top, but last_autovacuum is NULL, you might need to configure autovacuum to be more aggressive so that it gets done with the table.
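
One way to make autovacuum more eager for a single table is through its storage parameters; a minimal sketch (the value is only illustrative, not a recommendation from the post):

-- Consider the table for autovacuum after roughly 1% of it has changed
ALTER TABLE laurenz.vacme SET (autovacuum_vacuum_scale_factor = 0.01);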

But sometimes the result will look like this:

 schemaname |   relname    | n_live_tup | n_dead_tup |   last_autovacuum
------------+--------------+------------+------------+---------------------
 laurenz    | vacme        |      50000 |      50000 | 2018-02-22 13:20:16
 pg_catalog | pg_attribute |         42 |        165 |
 pg_catalog | pg_amop      |        871 |        162 |
 pg_catalog | pg_class     |          9 |         31 |
 pg_catalog | pg_type      |         17 |         27 |
 pg_catalog | pg_index     |          5 |         15 |
 pg_catalog | pg_depend    |       9162 |        471 |
 pg_catalog | pg_trigger   |          0 |       
Posted by Simon Riggs in 2ndQuadrant on 2018-03-13 at 17:01

The 2018 StackOverflow survey has just been published, with good news for PostgreSQL.

StackOverflow got more than 100,000 responses from people in a comprehensive 30 minute survey.

PostgreSQL is the third most commonly used database, with 33% of respondents, slightly behind MySQL and SQLServer, yet well ahead of other options. Early in January, the DBEngines results showed PostgreSQL in 4th place behind Oracle, yet here we see that actually Oracle heads up the Most Dreaded list along with DB2, leaving PostgreSQL to power through to 3rd place.

PostgreSQL at 62% is the second most loved database, so close behind Redis (at 64%) that they're almost even. But then Redis is only used by 18.5% of people, and it's very much a different beast anyway – yes, it's a datastore, but not a full-featured database like PostgreSQL and others.

Notice that neither MySQL nor SQLServer is particularly well loved, yet enough people use them that we can be pretty confident in that collective opinion.

Later we learn that SQLServer has a strong correlation with C# and that MySQL has a strong correlation with PHP/HTML/CSS/WordPress, so they are both the main database choice for those software stacks. What’s interesting there is that PostgreSQL doesn’t have any correlation towards Java, Python, Ruby etc. Or if I might interpret that differently, it is equally popular amongst developers from all languages who aren’t already using LAMP or MS stacks.

SQL is the 4th most pervasive language in use, behind Javascript, HTML and CSS. At 58.5% it is way ahead of 5th place Java at 45%.

Later we learn that 57.5% of people love SQL, which is pretty much everyone that uses it.

We’ll do some more analysis when the anonymized data is available, just to double check these analyses.

In my previous blogs, I wrote about Hibernate Query Language (HQL) and Querydsl in detail; now I'm going to talk about MyBatis.

While ORMs typically map Java objects to database tables (or vice versa), MyBatis takes a different approach by mapping Java methods to SQL statements. This gives you complete control over writing SQL and its subsequent execution. With the help of a mapper, MyBatis also allows automatic mapping of database objects to Java objects.

Like all other Java persistence frameworks, the main aim of MyBatis is to reduce the time and coding required to talk to a database compared with raw JDBC. It is licensed under the Apache License 2.0 and is free to use.

Why Use MyBatis?

MyBatis has a database-centric design, so if your application is driven by relational design, MyBatis is a very good option. It is also a good option if you are developing a new application, or extending an existing one, on top of an existing database infrastructure.

MyBatis can execute READ operations very quickly and neatly, so it comes in handy for applications oriented towards analytics and reporting. Because it is designed to use SQL directly, it gives you low-level and complete control over the queries executed against the database. On top of that, with the help of the MyBatis data mapper, the object model in Java and the data model in your database are allowed to differ. This gives greater flexibility in Java coding.

Prominent Features

Let’s continue using the ‘largecities’ table for MyBatis features.


To start using MyBatis, first you need to download its jar file, which you can get from the MyBatis project site. The file needs to be on the project's classpath along with the PostgreSQL JDBC driver.

Next, you need to create the Java object class as follows:

package org.secondquadrant.javabook.mybatis;

public class LargeCities {

        private int rank;
        private String name;

        public int getRank() {
                return rank;
Posted by Hubert 'depesz' Lubaczewski on 2018-03-12 at 10:03
Some time ago I wrote a site for pasting SQL queries with reformatting/pretty-printing, using the pgFormatter library. Today I figured I should update the library, since it has had quite a few changes recently, and it would be good to incorporate its fixes into the paste site. Unfortunately – the new version is not backward compatible, and I […]

What does PostgreSQL Full-Text Search have to do with VACUUM? Many readers might be surprised that there is a relevant connection worth talking about at all. However, those two topics are more closely related than people might think. The reason is buried deep inside the code, and many people might not be aware of these issues. Therefore I decided to shed some light on the topic and explain what is really going on here. The goal is to help end users speed up their Full-Text Indexing (FTI) and offer better performance to everybody making use of PostgreSQL.

Controlling VACUUM and autovacuum

Before digging into the real stuff it is necessary to create some test data. For that purpose I created a table. Note that I turned autovacuum off so that all operations are fully under my control. This makes it easier to demonstrate what is going on in PostgreSQL.

test=# CREATE TABLE t_fti (payload tsvector) 
   WITH (autovacuum_enabled = off);

In the next step we can create 2 million random texts. For the sake of simplicity I did not import a real data set containing real texts but simply created a set of md5 hashes, which are absolutely good enough for the job:

test=# INSERT INTO t_fti 
    SELECT to_tsvector('english', md5('dummy' || id)) 
    FROM generate_series(1, 2000000) AS id;
INSERT 0 2000000

Here is what our data looks like:

test=# SELECT to_tsvector('english', md5('dummy' || id)) 
   FROM generate_series(1, 5) AS id;
(5 rows)

To make things more efficient, I decided to use the tsvector data type in the table directly. The advantage is that we can directly create a full text index (FTI) on the column:

test=# CREATE INDEX idx_fti 
         ON t_fti USING gin(payload);

In PostgreSQL a GIN index is usually used to take

Databases use relational integrity to enforce expected situations. A common scenario is duplicates. Case in point, I present the port_dependencies table: For those not familiar with FreeBSD ports, each port (you could also refer to them as a package or application) can have zero or more dependencies. The FreshPorts database extracts and lists these dependencies [...]
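
As a sketch of the kind of constraint that enforces this (the column names are assumptions, not the actual FreshPorts schema):

-- A port may list a given dependency only once
CREATE TABLE port_dependencies (
    port_id            integer NOT NULL,
    dependency_port_id integer NOT NULL,
    UNIQUE (port_id, dependency_port_id)
);
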
Posted by Oleg Bartunov in Postgres Professional on 2018-03-11 at 19:09
This is a technical post to illustrate the house data set example, which I use in sqljson documentation.

Streaming replication with PostgreSQL 10

In this post, I will explain how to set up streaming replication with PostgreSQL 10. I will not explain how to install PostgreSQL 10 on your system.

Posted by Adrien Nayrat in Dalibo on 2018-03-10 at 11:19
Table of Contents: Introduction · Spills changes on disk · Example with a single transaction · Example with two transactions · CPU · Database-wide statistics · Network · Replication · OLTP workload · Conclusion

Introduction

I introduced replication through several posts:

  • PostgreSQL 10 : Logical replication - Overview
  • PostgreSQL 10 : Logical replication - Setup
  • PostgreSQL 10 : Logical replication - Limitations

This new post will dig a little deeper. We will see Postgres internals about logical replication.

In our previous article we saw three classic Database Modelization Anti-Patterns. The article also contains a reference to a Primary Key section of my book Mastering PostgreSQL in Application Development, so it’s only fair that I would now publish said Primary Key section!

So in this article, we dive into Primary Keys as a cornerstone of database normalization. It's so important to get Primary Keys right that you would think everybody knows how to do it, and yet most of the primary key constraints I've seen used in database designs are actually not primary keys at all.

The latest release of the Citus database brings a number of exciting improvements for analytical queries across all the data and for real-time analytics applications. Citus already offered full SQL support on distributed tables for single-tenant queries and support for advanced subqueries that can be distributed (“pushed down”) to the shards. With Citus 7.2, you can also use CTEs (common table expressions), set operations, and most subqueries thanks to a new technique we call “recursive planning”.

Recursive planning looks for CTEs and subqueries that cannot be distributed along with the rest of the query because their results first need to be merged in one place. To generate a (distributed) plan for executing these subqueries, the internal APIs in Postgres allow us to do something mind-blowingly simple: We recursively call the Postgres planner on the subquery, and we can push the results back into the Citus database cluster. We then get a multi-stage plan that can be efficiently executed in a distributed way. As a result, Citus is now getting much closer to full SQL support for queries across all shards, in a way that’s fast and scalable.

In this post, we’ll take a deeper dive into how the distributed query planner in Citus handles subqueries—both subqueries that can be distributed in a single round, and multi-stage queries that use the new recursive planning feature.

Pushing down subqueries that join by distribution column

Citus divides tables into shards, which are regular tables distributed over any number of worker nodes. When running an analytical query, Citus first checks if the query can be answered in a single round of executing a SQL query on all the shards in parallel, and then merging the results on the coordinator.

For example, a SELECT count(*) would be computed by taking the count(*) on each shard and then summing the results on the coordinator. Internally, the Citus planner builds a multi-relational algebra tree for the query and then optimises the query tree by “pushing down” computation to t
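
For a rough feel of what that looks like from the user's side (the table and column are made up; create_distributed_table is the Citus function for sharding a table):

-- Shard the table across worker nodes by its distribution column
SELECT create_distributed_table('page_views', 'site_id');

-- Citus runs count(*) on every shard in parallel and
-- sums the per-shard counts on the coordinator
SELECT count(*) FROM page_views;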