The world's most advanced open source database
Top posters
Number of posts in the past month
Top teams
Number of posts in the past month
  • Policy for being listed on Planet PostgreSQL.
  • Add your blog to Planet PostgreSQL.
  • List of all subscribed blogs.
  • Manage your registration.
  • Get in touch with the Planet PostgreSQL administrators at planet at
Posted by Abdul Yadi on 2014-09-01 at 04:58:56

When I read new release of Bucardo 5 with capability of asynchronous multi-master replication, I was eagered to wet my foot before swimming. Not really about multi-master features, just to implement simple master-slave-cascaded_slave replication, as I have done with Slony.

master table A => slave table B => cascaded slave table C.

So, I built Bucardo from source version 5.1.1 then configured databases, tables, syncs. I have turned on “makedelta” for slave table which is source for cascaded replication to another slave table. Master to slave (A to B) replication run well. But, unfortunately, the cascaded replication (B to C) does not work.

Then I dig into source code. In sub start_kid, the “does_makedelta hash” populated as follows:

sub start_kid {

for my $dbname (@dbs_source) {
$x = $sync->{db}{$dbname};

for my $g (@$goatlist) {

## Set the per database/per table makedelta setting now
if (defined $g->{makedelta}) {
if ($g->{makedelta} eq ‘on’ or $g->{makedelta} =~ /\b$dbname\b/) {
$x->{does_makedelta}{$S}{$T} = 1;






I suspect that “does_makedelta hash” reference was incorrectly taken from dbs_source. To my opinion, it should be taken from dbs_target. So, I move “does_makedelta hash” to another new dbs_target loop:

sub start_kid {

for my $dbname (@dbs_source) {
$x = $sync->{db}{$dbname};

for my $g (@$goatlist) {

## Set the per database/per table makedelta setting now
## if (defined $g->{makedelta}) {
## if ($g->{makedelta} eq ‘on’ or $g->{makedelta} =~ /\b$dbname\b/) {
## $x->{does_makedelta}{$S}{$T} = 1;
## …

## }

## }



for my $dbname (@dbs_target) {

$x = $sync->{db}{$dbname};
for my $g (@$goatlist) {
next if $g->{reltype} ne ‘table';
($S,$T) = ($g->{safeschema},$g->{safetable});
## Set the per database/per table makedelta setting now
if (defined $g->{makedelta}) {
if ($g->{makedelta} eq ‘on’ or $g->{makedelta} =~ /\b$dbname\b/) {
$x->{does_makedelta}{$S}{$T} = 1;
$self->glog(“Set table $dbname.$S.$T to makedelta”, LOG_NORMAL);






I have not ver

[continue reading]

Posted by Dimitri Fontaine in 2ndQuadrant on 2014-08-29 at 12:26:00

Next month, Postgres Open 2014 is happening in Chicago, and I'll have the pleasure to host a tutorial about PostgreSQL Extensions Writing & Using Postgres Extensions, and a talk aimed at developers wanting to make the best out of PostgreSQL, PostgreSQL for developers:

The tutorial is based on first hand experience on the PostgreSQL Extension Packaging System both as a user and a developer. It's a series of practical use cases where using extensions will simplify your life a lot, and each of those practical use case is using real world data (thanks to pgloader).

Most of the examples covered in the tutorial have a blog entry here that present the idea and the solution, so the tutorial is all about putting it all together. You can already read the blog posts under the YeSQL and Extensions for a preview.

The developer talk itself is based on the Reset Counter use case where we learn a lot about Common Table Expressions, or WITH queries and Window Functions, my favourite SQL clauses.

If you want to learn how to implement a modern search user interface for your own product, something that your users know how to use already, then the tutorial is for you, as we will cover PostgreSQL based approximate searches with suggestions ( did you mean ...?) and autocompletion.

The tutorial gives you the toolset you will use to avoid the situation depicted here.

See you all in Chicago!

Posted by Pavel Stehule on 2014-08-29 at 10:22:00
if you use it, please update

Adam Bartoszewicz sent instructions and compiled ddl for plpgsql_check for windows

Ooooh that smell! Can’t you smell that smell?“. That’s a classic rock song by legends Lynyrd Skynyrd, I know. But also a warning that your new Barman 1.3.3 installation can now emit.

Ooooh that smell! Can't you smell that smell?Consider the following scenario:

  • You have scheduled a weekly full backup of your Postgres server with Barman – the usual (and boring) 4AM on a Saturday
  • You have even configured Nagios/Icinga to correctly monitor the state of the barman check command, as well as any other standard metrics of your Linux system
  • You have also setup pre and post backup hook scripts that notify you via email when the backup starts and finishes


  • last Saturday your backup server had to go through a maintenance operation and had been switched off by IT operations for a few hours at 3AM
  • Activity was resumed at 6AM

It is Monday morning, you go back to work and start reading your email.

You do not (cannot?) realise that two emails are missing from your routine (the ones about start and stop of the backup operation) and you assume everything is fine. A warning email would just be better, wouldn’t it?

You are not aware that the weekly backup has not been taken and therefore, your recovery point objective (the time it takes for you to restore a database – in this case – after a disaster) could be compromised.

This is obviously a very simple (and mild) case.

Scenarios could be way more serious and painful, and generally they do fall under the “Inadequate Monitoring” category. A recurring case, unfortunately, is: the backup server is not properly monitored and barman backup repeatedly fails in the preceeding weeks due to insufficient disk space availability.

This is the reason behind the implementation of the last_backup_maximum_age option, a new feature of Barman 1.3.3 – going also under the name of smelly backups.

Let’s go back in time – we can, given that we work with Point In Time Recovery, can’t we? ;)

It is Friday morning, your read that Barman 1.3.3 has come out and you decide to:

  • update Barman to version 1.3.3
  • configure the last_backup_max_age

[continue reading]

Posted by Josh Berkus in pgExperts on 2014-08-26 at 22:24:00
Just so folks know:  due to unsurmountable issues, I will not make it to DjangoCon this year.  My apologies for anyone who was planning on attending my sessions there.  However, I will be at both the Portland pgDay and FOSS4G.  See you in Portland, just later.
Posted by Vasilis Ventirozos in OmniTI on 2014-08-26 at 12:42:00
Recently i had to evaluate ElasticSearch for a possible installation, elasticsearch is basically a search server that provides a distributed full-text search engine using a restful web interface, stores documents in json, it is written in java it is fast and really works out of the box with almost minimum effort. After the installation, it's just reading the documentation and adding / searching documents, I didn't really experiment much with searching but the API looks really good.
One interesting question that i had to answer was about connectivity with postgres, and how to maintain a table in both datastores and that's what this post is all about.

The first (fast and easy) answer here was rivers, it creates a jdbc connection with another datastore and based on a query it can pump data from any database table. It is available for postgres, twitter, mongo etc.. Because its jdbc its relatively slow and elasticsearch will (re)pump the data once every restart so pay extra attention if you use this and read the documentation first.

One other way is to use LISTEN/NOTIFY commands of postgres which is basically a message queueing system. The idea is to raise a notification on every insert, a deamon would grab that and insert the record into elasticsearch..

For a single postgres table it would work like this :

create table messages (
 id serial primary key,
 date timestamp without time zone,
 carrier text,
 message text

CREATE OR REPLACE FUNCTION table_message_notify() RETURNS trigger AS $$
  PERFORM pg_notify('table_messages_notifier',CAST( AS text));
$$ LANGUAGE plpgsql;

CREATE TRIGGER object_post_insert_notify AFTER insert ON messages FOR EACH ROW EXECUTE PROCEDURE table_message_notify();

This will simply send a notification on 'table_messages_notifier' channel after an insert that a new record has been inserted. Now you need something that would grab and handle these notifications, i tried with various ways like python request

[continue reading]

Posted by Josh Berkus in pgExperts on 2014-08-25 at 23:30:00
One thing we're waiting on to get 9.4 out the door is making a decision on JSONB storage.  Currently under debate are patches to the JSONB format which would, for users with large numbers (as in 150+) of top-level keys, cause JSONB to take up to 60% less space on disk in exchange for an up to 80% increase (that is slowdown) in response times on key extraction.  We need your feedback as potential users of the feature to make a good decision about what's best.

At stake is the question of whether the key locations should be saved as successive offsets, or as lengths.  The advantage of the former is that it speeds up extraction by making it only two lookups to locate a top-level key regardless of the number of keys you have.  The disadvantage is that a series of increasing offsets isn't very compressible.

For JSONB fields which consist of a few top-level keys and large values, this question makes no difference at all.  However, for the worst case ... 150+ top-level keys with short (under 10 bytes) values, the difference is quite dramatic.  For example, I constructed a test with 183 keys, of which 175 were NUMERIC.  I checked both table size and time to extract key #160 from 100,000 rows:

Metric Offsets Lengths
Table Size 1147 MB 541 MB
Extract 100K 2.8s 5.2s

This is a "worst case" scenario for the difference between these two designs.  Note that the extraction slowdown affects only retrieving the value to the client; it does not affect index lookups of JSONB rows, which are speedy no matter which patch is employed.

However, we're undecided on this "fix" because we don't know a couple things:

  • How likely are users to have 150+ top-level keys in one field (or keys on any single level together) with short values?
  • Is up to 60% space savings in return for up to 80% extraction slowdown a good tradeoff?  Or a bad one?
As such, I am turning to you, the users, to help us decide what tradeoff makes sense for 9.4.  Let me know in the comments.
Posted by Keith Fiske in OmniTI on 2014-08-25 at 19:16:07

One of the on-going issues with PostgreSQL that every administrator must deal with at some point is table and/or index bloat. The MVCC architecture of PostgreSQL lends itself to requiring some extra monitoring & maintenance, especially in large enterprise systems. I’ll save the description of why bloat happens until further down in this post (for those that are curious) and get to the point of this article first.

So far the best check for bloat I’ve come across is the one contained in the check_postgres suite from Bucardo. And in places where we can get check_postgres set up and used, it works great. But sometimes we just want to check the bloat on a system and also have a little more fine grained control of the output. So recently I pulled out the rather complex bloat check query from check_postgres and adapted it to a command-line python script:

General options:
  -m {tables,indexes}, --mode {tables,indexes}
                        Provide bloat report for the following objects:
                        tables, indexes. Note that the "tables" mode does not
                        include any index bloat that may also exist in the
                        table. Default is "tables".
  -c CONNECTION, --connection CONNECTION
                        Connection string for use by psycopg. Defaults to
                        "host=" (local socket).
  -f {simple,dict}, --format {simple,dict}
                        Output formats. Simple is a plaintext version suitable
                        for any output (ex: console, pipe to email). Dict is a
                        python dictionary object, which may be useful if
                        taking input into another python script or something
                        that needs a more structured format. Dict also
                        provides more details about object pages. Default is
  -a MIN_PAGES, --min_pages MIN_PAGES
                        Minimum number of pages an object must have to be

[continue reading]

Earlier this year we did compare compare Aggregating NBA data, PostgreSQL vs MongoDB then talked about PostgreSQL, Aggregates and histograms where we even produced a nice Histogram chart directly within the awesome psql console. Today, let's get that same idea to the next level, with pgcharts:

The new pgcharts application

The application's specifications are quite simple: edit an SQL query, set your categories and your data series, add in some legends, and get a nice chart. Currently supported are bar, column, pie and donut charts, and we should be able to add anything that has support for.

Currently, you need to compile the application yourself, and for that you need to install the SBCL compiler. Soon enough you will have a debian package to play with! The README at the pgcharts github place has the details to get you started. Enjoy!

Introduced in PostgreSQL 9.1, an unlogged table offers the possibility to create a table whose definition is permanent on server, but its content is not WAL-logged making it not crash-safe, with data that cannot be accessed on a read-only stannby continuously replaying WAL at recovery. Postgres 9.5 offers an improvement in this area with the possibility to switch the persistency of an unlogged to permanent and vice-versa:

commit: f41872d0c1239d36ab03393c39ec0b70e9ee2a3c
author: Alvaro Herrera <>
date: Fri, 22 Aug 2014 14:27:00 -0400

This enables changing permanent (logged) tables to unlogged and

(Docs for ALTER TABLE / SET TABLESPACE got shuffled in an order that
hopefully makes more sense than the original.)

Author: Fabrízio de Royes Mello
Reviewed by: Christoph Berg, Andres Freund, Thom Brown
Some tweaking by Álvaro Herrera

Something to be careful: running this command actually rewrites entirely the table, generating new WAL in consequence, while taking an exclusive lock on it. Hence the table cannot be accessed by other operations during the rewrite in order to recreate a new relfilenode for the relation whose persistence is changed.

Now, this command is rather simple to use. Let's use an unlogged table that has some data.

   AS SELECT generate_series(1,5) AS a;
=# SELECT oid,relfilenode FROM pg_class where oid = 'aa'::regclass;
  oid  | relfilenode
 16391 |       16397
(1 row)

This data cannot be requested on a standby and any query on it will fail like that:

=# SELECT pg_is_in_recovery();
(1 row)
=# SELECT * FROM tab_test;
ERROR:  0A000: cannot access temporary or unlogged relations during recovery
LOCATION:  get_relation_info, plancat.c:104

Now running the command ALTER TABLE .. SET LOGGED results in the relation to become persistent:

=# SELECT oid,relfileno

[continue reading]

Posted by Tomas Vondra on 2014-08-25 at 00:00:00

Last week, I briefly explained the recent improvements in count_distinct, a custom alternative to COUNT(DISTINCT) aggregate. I presented some basic performance comparison illustrating that count_distinct is in most cases significantly faster than the native COUNT(DISTINCT) aggregate. However, performance was not the initial goal of the changes, and the improvement in this respect is not that big (a few percent, maybe, which is negligible when compared to ~3x speedup against COUNT(DISTINCT)). The main issue was excessive memory consumption, and I promised to present the improvements. So here we go!

If I had to show you a single chart illustrating the improvements, it'd be this one. It compares the amount of memory necessary to aggregate the large tables from the previous post (100M rows). Remember, the lower the values the better, and red is the new implementation:


While the old implementation used 1.5GB - 2GB in most cases, the new implementation needs just ~600MB. I don't want to brag, but that's a pretty significant improvement (also you could also say I messed up the initial implementation and now I merely fixed it).

So, why did the hash table suck so much?

Clearly, the original count_distinct implementation was quite bad when dealing with memory, and there are two or three reasons why. If you're writing extensions in C, this might give you a few hints for improvements.

Simple hash table implementation

First, I have used a very simple hash table implementation, resulting in some unnecessary overhead for each value - you need the buckets, a pointer to the next item in the same bucket, etc. This could be lowered a bit by using a more elaborate hash table implementation (e.g. open addressing, but either the gains were not as significant as I expected, or the resulting implementation was more complex than I wished for, or slower.

After a lot of experiments, I switched to this simple array-based structure:

typedef struct element_set_t {

    uint32 item_size; /* length of the value (depends on the actual data ty

[continue reading]

Posted by Fabien Coelho on 2014-08-23 at 15:30:00

This post discusses the performance impact of PostgreSQL FILLFACTOR table storage parameter on an UPDATE OLTP load. Note that this FILLFACTOR is indeed the table storage parameter, although there is also an eponemous parameter for indexes.

How FILLFACTOR impacts UPDATE performance

By default, PostgreSQL packs as many tuples as possible within a page, that is FILLFACTOR=100 (percent).

When one UPDATE is performed, the old tuple is marked as deleted by the current transaction (xmax hidden MVCC attribute), a new tuple is inserted in another page (as the current page is full), and the indexes (at least the primary key) is modified to point to the new page location. This implies at least 3 page writes: the deleted-tuple page, the inserted-tuple page and the index modifications.

However, if the page is not full, the new tuple can be inserted in the same page as the old tuple, and the index is still valid because it points to the right page, so it does not need to be modified (hopefully?). We are down to 1 random page write instead of about 3 (not counting WAL) in the previous full page case.

When many updates are occuring on a reasonably large base (we assume that concurrent updates hit distinct pages), it is not so simple: first, new-tuples being added will probably be stored together in a page, so the page writes for new tuples would be shared between concurrent updates. Second, as the index is much smaller, distinct updates are more likely to modify the same index pages, thus index page writes would also be shared. As actual page writes are delayed to checkpoints, more sharing may occur. So the performance improvement would be less than an optimistic 3 to 1.


In order to measure this effect, I ran a simple pgbench UPDATE custom script:

\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom delta -5000 5000
UPDATE pgbench_accounts
  SET abalance = abalance + :delta, filler = NOW()::TEXT
  WHERE aid = :aid;

The standard setup is scaled to 100 with a shorter (30 instead

[continue reading]

I've mentored a number of students in 2013 and 2014 for Debian and Ganglia and most of the companies I've worked with have run internships and graduate programs from time to time. GSoC 2014 has just finished and with all the excitement, many students are already asking what they can do to prepare and become selected in 2015.

My own observation is that the more time the organization has to get to know the student, the more confident they can be selecting that student. Furthermore, the more time that the student has spent getting to know the free software community, the more easily they can complete GSoC.

Here I present a list of things that students can do to maximize their chance of selection and career opportunities at the same time. These tips are useful for people applying for GSoC itself and related programs such as GNOME's Outreach Program for Women or graduate placements in companies.


There is no guarantee that Google will run the program again in 2015 or any future year.

There is no guarantee that any organization or mentor (including myself) will be involved until the official list of organizations is published by Google.

Do not follow the advice of web sites that invite you to send pizza or anything else of value to prospective mentors.

Following the steps in this page doesn't guarantee selection. That said, people who do follow these steps are much more likely to be considered and interviewed than somebody who hasn't done any of the things in this list.

Understand what free software really is

You may hear terms like free software and open source software used interchangeably.

They don't mean exactly the same thing and many people use the term free software for the wrong things. Not all open source projects meet the definition of free software. Those that don't, usually as a result of deficiencies in their licenses, are fundamentally incompatible with the majority of software that does use approved licenses.

Google Summer of Code is about both writing and publishing your code and it is also

[continue reading]

Posted by Andrew Dunstan in pgExperts on 2014-08-22 at 14:51:00
I mentioned to a client that I was going to write a function they had asked me to write using PLPerl, because it has good dynamic string handling capabilities that make it a better fit for this particular task than PLPgsql. "Oh," said the client, "we don't do much Perl. But ever since you hooked us up with PLV8 we do everything in JavaScript. We all know that, and use it every day." So I'm now writing it in Javascript. Proof once again that no good deed goes unpunished. It remains to be seen if it's going to be quite such a good fit as Perl would be, but at least it will give me a thorough refresher in JS.
Posted by Vasilis Ventirozos in OmniTI on 2014-08-22 at 13:54:00
A while ago I wanted to explore the options i had for multi master replication in postgres, I have previously worked with Oracle MMR and to be honest i don't like the idea of mmr, mostly because of the all conflicts that you sometime get, most of the times these conflicts are easy to resolve though and mmr, or in our case BDR can be a very nice way to scale up your database writes.
So, while searching i came across a project called postgresql BDR (bi-directional replication) developed by 2ndQuadrant, it allows users to create a geographically distributed asynchronous multi-master database using Logical Log Streaming Replication based on the changeset extraction feature introduced in PostgreSQL 9.4.
This post is about showing how to set this up.
For this setup i used 2 debian VMs (debian-7.5.0-amd64-netinst.iso)
the extra packages i installed were :
git, mc, joe,sudo, curl, libreadline-dev, zlib1g-dev, bison, flex
(by the way, comments about joe wont be tolerated !!)

With the vms ready and a user postgres1 created :

git clone git://
cd 2ndquadrant_bdr
git checkout bdr/0.6
./configure --prefix=$HOME/bdr
make install
cd ~/2ndquadrant_bdr/contrib/btree_gist && make && make install && cd ../../contrib/bdr && make && make install
add to user's profile:
export PATH=/home/postgres1/bdr/bin:$PATH

check version :
psql (PostgreSQL) 9.4beta1_bdr0601

sudo mkdir /opt/pgdata1
sudo mkdir /opt/pgdata2
sudo chown postgres1:postgres1 /opt/pgdata1/
sudo chown postgres1:postgres1 /opt/pgdata2/

as postgres1
initdb -D /opt/pgdata1/ -U postgres1 --auth-host=md5 --auth-local=peer

edit /opt/pgdata1/pg_hba.conf and add :
host    replication     postgres1          trust
edit /opt/pgdata1/postgresql.conf and change :
listen_addresses = '*'

wal_level = 'logical'
max_replication_slots = 3
max_wal_senders = 4
shared_preload_libraries = 'bdr'
bdr.connections = 'postgres1'
bdr.postgres2_dsn = 'dbname=postgres host= user=postgres p

[continue reading]

PostgreSQL has proven to be one of the most reliable Open Source databases around. In the past we have dealt with customers who have achieved superior uptimes and who are really happy with PostgreSQL. But even if PostgreSQL does its job day after day after day – there are some components, which can just fail […]
Posted by gabrielle roth on 2014-08-21 at 23:24:00

I updated our speaker and topic tag clouds for our 8th anniversary.

I've been looking for a way to edit mixed mode files in emacs, so I can edit a plperl function, and have the create statement part (and everything except the body) use standard SQL mode and the body use CPerl mode or whatever mode some magic marker tells it to use. I've taken a bit of a look at mmm-mode, but haven't been able to get it to work, and don't have more time to spend on it. If someone has a good recipe for this please let me know.

Some good news for packagers of PostgreSQL on Windows with many improvements in versioning that are showing up in 9.5. This is the result of a couple of months of work, concluded with the two following commits. The first commit is covering a major portion of executables and libraries:

commit: 0ffc201a51395ca71fe429ef86c872850a5850ee
author: Noah Misch <>
date: Mon, 14 Jul 2014 14:07:52 -0400
Add file version information to most installed Windows binaries.

Prominent binaries already had this metadata.  A handful of minor
binaries, such as pg_regress.exe, still lack it; efforts to eliminate
such exceptions are welcome.

This has added versioning for contrib modules, conversion_procs, most of the ecpg thingies, WAL receiver and PL languages. And then the final shot has been done with this commit, for utilities like regression tools or even zic.exe, part of the timezone code path:

commit: ee9569e4df1c3bdb6858f4f65d0770783c32a84d
author: Noah Misch <>
date: Mon, 18 Aug 2014 22:59:53 -0400
Finish adding file version information to installed Windows binaries.

In support of this, have the MSVC build follow GNU make in preferring
GNUmakefile over Makefile when a directory contains both.

Michael Paquier, reviewed by MauMau.

This work has basically needed the following things to get correct versioning coverage when building with either MinGW and MSVC:

  • Addition of PGFILEDESC to have a file description
  • Addition of some WIN32RES in the object list being built by make to compile the version number.
  • For MSVC, some refactoring of the scripts used for build to have them pick up correctly PGFILEDESC, and create version files.

Now, the result of this work is directly visible on the files themselves by looking at the file details in menu "Property" by left-clicking on a given file, tab "Details". With that, it is possible to see fields for:

  • File Description, for example for adminpack.dll, the description is "adminpack - support functions for pgAdmin".
  • Version number, made of 4 integer di

[continue reading]

There is a new release - version 4.14 - of the buildfarm client, now available at

The only change of note is that a bug which only affects MSVC clients (such that the client will not complete a run) and is present in releases 4.12 and 4.13 is fixed. Clients on other platforms do not need to upgrade.
Posted by Chris Travers on 2014-08-19 at 12:29:00
The SELECT statement is the workhorse of SQL.  Updates and inserts are necessary, but selects are where the power is.  One of the significant issues many people have in understanding these and using them is a clear understanding of the math involved, in part because there are a large number of implicit possibilities in the syntax.  In general folks learn to avoid the implicit aspects of the statement, but there are some implicit odditities that can't go away because they are baked into the language.  One of these is the confusion between selection and projection, and because SQL operates on bags instead of sets, neither of these work in SQL quite like they do in relational math.

In "Relational Theory and SQL," Chris Date says that tuples are unordered.  Tuples are strongly ordered, but what Date is getting at is that a relation has no natural ordering of columns, and therefore from any relation with a given ordering, a new relation can be created with a different ordering.  This process is called 'projection' and it is entirely implicit in SQL.  In the interest of being concise, in that book, Date does not stick with his typical approach of starting from clear definitions and discussing what these mean.

This being said, understanding projection, selection, and the difference between them makes understanding relational databases far easier in my view.

Because SQL operates on bags, I will use both SQL (for bag examples) and Set::Relation (for set examples) in order to discuss the math differences.  I will finally offer an idea of what a clearer English rendition of SQL's math semantics would be.

I:  Projection - Transform One Relation into Vertical Subset

Projection, represented by a Pi character (π) creates a set of ordered tuples with an ordering based on the operation.  What projection does, essentially, is take a subset (or the whole set) from each tuple, possibly re-ordering it in the process, in order to transform one relation into a derived relation.

In relational algebra, projection takes a set of tup

[continue reading]

Posted by gabrielle roth in EnterpriseDB on 2014-08-19 at 02:20:51
We had so many good submissions for Postgres Open this year, we had to make some very difficult choices. While I haven’t quite achieved Dan Langille’s (of PgCon fame) level of conference-running zen and rarely get to all the talks I’d like to see, here are my picks: Wednesday: The PostGIS tutorials. I am really […]
Posted by Tomas Vondra on 2014-08-19 at 00:00:00

Almost a year ago, I wrote a custom experimental aggregate replacing COUNT(DISTINCT). The problem with the native COUNT(DISTINCT) is that it forces a sort on the input relation, and when the amount of data is significant (say, tens of millions rows), that may be a significant performance drag. And sometimes we really need to do COUNT(DISTINCT).

Extensibility is one of the great strengths of PostgreSQL - users of most other databases can only dream about things like defining custom aggregates. And this extension point was exactly the foundation for my idea was - implementing a custom aggregate, counting the distinct items in a more efficient way.

That's how count_distinct was conceived, and the principle was really simple. For each group, a small hash table is maintained, making it trivial to keep track of distinct values (and counting them). And it worked quite well - instead of the COUNT(DISTINCT) query


you can call the customa aggregate


and in most cases it was much faster (usually ~3x), without doing the sort, etc. The main disadvantage was memory consumption - the overhead of the additional hash table structure and palloc overhead was bad (and in some cases quite terrible - consuming an order of magnitude more memory than the amount of data being processed). I experimented with various hash table variants, allocation schemes, but either the impact on performance was unacceptable, or the memory consumption was not much lower. Until about a month ago ...

Luckily, I've been messing with the hash join implementation recently, and it occured to me that a hash table may not be the right tool for this task. Hash tables are great for lookups, but that's not really the point of count_distinct - it only needs to detect and remove duplicates, and lookups are only one of the ways to achieve that goal.

Also, as the hash table grows, the L2/L3 cache hit ratio gets worse and worse. And after a bit of experimenting, I managed to come up

[continue reading]

Posted by Pavel Stehule on 2014-08-18 at 08:39:00
I released new version of plpgsql_check.

What is new?

lot of new checks:

  • assignment stmt
  • performance warnings
  • return stmts

create or replace function fx()
returns t2 as $$
return (10,20,30)::t1;
$$ language plpgsql;
select * from plpgsql_check_function('fx()', performance_warnings := true);
error:42804:3:RETURN:returned record type does not match expected record type
Detail: Returned type integer does not match expected type numeric in column 2.
(2 rows)

create or replace function f1()
returns setof int as $$
return next 1::numeric; -- tolerant, doesn't use tupmap
$$ language plpgsql;
select * from plpgsql_check_function('f1()', performance_warnings := true);
performance:42804:3:RETURN NEXT:target variable has different type then expression result
Detail: assign "numeric" value to "integer" variable
Hint: Hidden casting can be a performance issue.
(3 rows)

create or replace function f1()
returns int as $$
return current_date;
$$ language plpgsql;
select * from plpgsql_check_function('f1()', performance_warnings := true);
warning:42804:3:RETURN:target variable has different type then expression result
Detail: assign "date" value to "integer" variable
Hint: There are no possible explicit coercion between those types, possibly bug!
(3 rows)

Last month was a discussion about speed of plpgsql. There was example of synthetic tests - for these tests plpgsql is strongly penalized. There is zero optimization for intensive mathematics calculations :
DO LANGUAGE plpgsql $$ DECLARE n real;
DECLARE f integer;
FOR f IN 1..10000000 LOOP
if 0=0 then
n = SQRT (f);
end if;
RAISE NOTICE 'Result => %',n;
END $$;

There is two issues: a) useless test 0=0 (Oracle throws it, Postgres has no any optimiz

[continue reading]

Posted by Fabien Coelho on 2014-08-17 at 17:20:00

I have posted recently performance figures with varying page size using pgbench on SSD, which show a +10% improvement with smaller 4 kB page size over the default 8 kB page size.

Josh Berkus pointed out that pgbench test uses rather small 100-bytes rows, and that changing the tuple size might induce a different conclusion. To assess this point, I ran some tests with different row sizes: 1 kB, 3 kB, 7 kB and 15 kB.

Test Setup

I ran prepared transactions (-M prepared) to avoid cpu cycles and focus on write performance. The scale is 10 for 1 kB rows (about 1 GB base), 10 for 3 kB rows, 5 for 7 kB rows and 2 for 15 kB rows (about 4 GB bases): bases are small enough to fit in memory, so that reads should not require disk accesses. I ran 200 seconds tests with 4 clients (-c 4) on Postgresql 9.4b2. Although this is short, I am confident that it is enough because the rather good performance of SSDs implies dozens of checkpoints during that time, and also prior tests I did with 1000 vs 200 seconds on SSDs showed that the later were consistent with the former, although slightly on the optimistic side. I lowered autovacuum_naptime to 20 seconds and checkpoint_timeout to 1min and raised checkpoint_completion_target to 0.8 to shake things a bit, but I do not think that there is a great influence of these parameters given the overall performance. I used the default configuration for everything else. The hardware is the same as the previous test.

The row size is changed by updating the filler attributes of all tables with commands like:

ALTER TABLE pgbench_accounts
  DROP COLUMN filler,
VACUUM FULL pgbench_accounts;

The key point is to change the STORAGE setting, otherwise with the default EXTENDED value TOAST compression kicks in after 2 kB and with the default empty string rows end up smaller than with the standard test…


Here are the performance results for two series of test.

The first serie uses simple updates (-N) t

[continue reading]

Posted by Craig Kerstiens on 2014-08-15 at 07:00:00

As I followed along with the 9.4 release of Postgres I had a few posts of things that I was excited about, some things that missed, and a bit of a wrap-up. I thought this year (year in the sense of PG releases) I’d jump the gun and lay out areas I’d love to see addressed in PostgreSQL 9.5. And here it goes:


Merge/Upsert/Insert or Update whatever you want to call it this is still a huge wart that it doesn’t exist. There’s been a few implementations show up on mailing lists, and to the best of my understanding there’s been debate on if it’s performant enough or that some people would prefer another implementation or I don’t know what other excuse. The short is this really needs to happen, until that time you can always implement it with a CTE which can have a race condition.

Foreign Data Wrappers

There’s so much opportunity here, and this has easily been my favorite feature of the past 2-3 years in Postgres. Really any improvement is good here, but a hit list of a few valuable things:

  • Pushdown of conditions
  • Ability to accept a DSN to a utility function to create foreign user and tables.
  • Better security around creds of foreign tables
  • More out of the box FDWs


Today there’s madlib for machine learning, and 9.4 got support for ordered set aggregates, but even still Postgres needs to keep moving forward here. PL-R and PL-Python can help a good bit as well, but having more out of the box functions for stats can continue to keep it at the front of the pack for a database that’s not only safe for your data, but powerful to do analysis with.


This is definitely more of a dream than not. Full multi-master replication would be amazing, and it’s getting closer to possible. The sad truth is even once it lands it will probably require a year of maturing, so even more reason for it to hopefully hit in 9.5

Logical Replication

The foundation made it in for 9.4 which is huge. This means we’ll probably see a good working out of the box logical replication in 9.5. For those less familiar this m

[continue reading]

Posted by Josh Berkus in pgExperts on 2014-08-15 at 00:10:00
Some of you may be following the discussion on JSONB compression on -hackers, which is the remaining major issue for 9.4.  As part of this, I needed to run some statistics on average column sizes for JSON vs JSONB tables. This makes a good example of how incredibly useful WITHIN GROUP will be in 9.4.

First, I created two versions of the data, one called "jsonic" which has text-JSON data, and one called "jsonbish" which has the JSONB version.  Then I ran some comparisons using pg_column_size on the JSON columns.

with colsizes_b as (
    select pg_column_size(jsonbish.jcol) as colsize
    from jsonbish
distrib_b as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_b
colsizes_j as (
    select pg_column_size(jsonic.jcol) as colsize
    from jsonic
distrib_j as (
    select percentile_cont(array [0,0.25,0.5,0.75,1])
    within group (order by colsize) as colsize_distribution
    from colsizes_j
select 'json' as thetype,
    colsize_distribution from distrib_j
union all
select 'jsonb',
    from distrib_b;

So I'm taking the column size of each row, then sorting them by size, and then doing a percentile distribution using the new WITHIN GROUP.  There's no group there because the group is actually the whole table.  I've chosen the usual box plot percentages: minimum, 25%, median, 75%, and maximum.

And the results:

 thetype |    colsize_distribution   
 json    | {1741,1767,1854,1904,2292}
 jsonb   | {3551,5866,5910,5958,6168}

... which demonstrates that we do indeed have a major issue with compressability.

A reasonable understanding of the relational model requires understanding the basic data types which make it up, both in the idealized model and in real-world applications.  This post discusses both the idealized model and the accommodations the standard implementations of it make to the messiness of the real world.  We won't deal with NULLs here.  That will be the subject of a future entry.

I Relation and Set

At its root, the relational model is about relations.  Consequently a relation needs to be understood on a mathematical level.  Since a relation is a form of set, let's explore sets first.

A set is an unordered list of unique items.  Thus, {a, b} is a set, but {a, b, b} is not (it is a multi-set or bag, however, see below).  Additionally since it is unordered, {a, b} = {b, a}.  We can project ordering onto a list for our convenience but this does not affect the underlying mathematics.  Sets can be finite or infinite.  In databases, we focus on finite sets (infinite sets can be calculated, but cannot be stored in set form on a finite filesystem).  Infinite sets are for math libraries and don't really concern us here.

A relation is a set of propositions, of correlating facts.   These are represented in tuples (see below). At this point, just recognize that a tuple represents a correlation between two or more facts.  The individual facts are also trivial functions of the tuple as a whole, because each tuple is necessarily unique, each fact is a function of the correlation.

Consider a point in the format of (x, y). For any given point, x and y are trivial functions of it.  For (2, 2), x=2 is trivially true, as is y=2.  This seems tautological (and on one level it is) but it is also important because it gets to the basic operation of a relational database.  For more on this, see part 2 of this series.

II Row and Tuple

A tuple is an ordered set of items.  Unlike a set the order is important and repetition is allowed.  (1, 2, 3), (1, 3, 3), and (1, 1, 1) are all valid tuples.  In tuples, the order ascribes

[continue reading]

Posted by Hans-Juergen Schoenig in Cybertec on 2014-08-14 at 12:30:52
With PostgreSQL 9.4 just around the corner and 8.4 just retired I thought it is time to review those years of development to see how the size of the PostgreSQL code base has evolved over time. In the past couple of years (I even already got to say “decades”) a lot of development has happened […]
Posted by Denish Patel in OmniTI on 2014-08-14 at 01:07:08

I thought to share conference details that I’m going to attend next month. If you aren’t aware but interested attending any of them.

  1. PostgresOpen at Chicago
    • Postgres conference (Sept 17th – Sept 19th)
  2. Surge at National Harbor,MD
    • Scalability and Performance Conference  (Sept 24th – Sept  26th)

The speakers and talks lineup is looking interesting for both conferences. If you haven’t registered for them, it’s not too late yet!!  If you are planning to attend anyone of them, looking forward to meet you in-person.

See you soon!