This is the third and final post of the series intended to introduce PostgreSQL users to PL/R, a loadable procedural language that enables a user to write user-defined SQL functions in the R programming language. The information below demonstrates sample use of R functions against the NDVI dataset.
For many years MySQL and PostgreSQL were somewhat competing databases, which addressed slightly different audiences. In my judgement (which is of course a bit biased) PostgreSQL always served the more professional users, while MySQL had a strong standing among web developers and so on. But after Oracle took over MySQL I had the feeling that […]
Modern systems offer several storage options and databases are very sensitive to the I/O characteristics of these options. The simplest is direct-attached storage (DAS), e.g. SATA, SAS. It is the simplest because it uses a dedicated connection between the server and storage.
A third option is network-attached storage (NAS), e.g. NFS. While NAS is networked storage like SAN, it offers a remote file system to the server instead of remote storage blocks. (The fact that the SAN/NAS acronym letters are just reversed only adds to the confusion. I remember it as (block) "storage" being the first letter of SAN.)
You should always be careful what you ask for. A couple of months ago, while I was feeling particularly brave, I submitted to present at the DataLayer Conference. The next thing I knew, I was speaking at the DataLayer Conference. The conference takes place in Austin on May 17th. Conferences like this one are an awesome channel for PostgreSQL advocacy. Of course I am partial to PgConf US, but a conference such as DataLayer allows us to reach people who may be using one of those "other" databases.
Here is the synopsis of the presentation I will be giving:
PostgreSQL: The Center of Your Data Universe
Although there are still battles to be fought, the war has already been won. Find out how PostgreSQL answers all of your data layer needs. PostgreSQL is one of the longest-standing Open Source database systems, with legions of users leading the way to a sane, productive, and performance-driven data layer. This presentation will cover an overview of PostgreSQL technologies including:
If you feel like going, you can use the code JDROCKS (I promise, I didn't pick it) for a 15% discount. Let's have a large PostgreSQL contingent at the conference and show those "other" technologies what real community feels like!
…and why I’m glad I did.
It’s not all technical… Who knew?!
Last year, Pycon7 was held right about the time I joined 2ndQuadrant. Seeing as I was new to the technology AND the Italian language (note: there was an English track), I opted out of attending. Well, after attending Pycon8, I can say that I won’t make that mistake again!
Over the past year working in the Open Source community, I’ve learned more technical information than I could have ever imagined. Even then, attending a technical conference and understanding (completely) technical talks seemed a little far-fetched. Or so I thought!
Since being introduced to the wonderful world of Open Source, PostgreSQL, and numerous other technologies – I’m continuously fascinated by the way that the communities intertwine. These technical communities are more than just the technology they use – they’re full of collaborative community advocates! There was even a ‘Community’ track at Pycon8. There were interesting talks that were outside the technical box such as: Be(come) a Mentor! Help Others Succeed!, Don’t try to look smart. Be smart, and several more!
It’s also always fun to see your colleagues and friends give talks at conferences. Not to play favorites… but the talk presented by my colleagues Giulio Calacoci and Francesco Canovai – Disaster Recovery: A Series of Unfortunate Events – had to be my favorite.
Joined by our own maestro Leonardo Cecchi, their talk was informational, interesting AND fun! Giulio and Francesco expressed the importance of Disaster Recovery tools (such as Barman) by relating them to fairy tales we all know and love, while Leonardo played his guitar in the background.
Other interesting talks I attended were: PostgreSQL su NFS: miti e verità [PostgreSQL with NFS: Myths and Truths] given by Jonathan Battiato, Taking care of PostgreSQL with Ansible by Rubens Souza, and Python e PostgreSQL, un connubio perfetto [Python and PostgreSQL, a perfect blend] by Marco Nenciarini.
Oh, and who could forget.. Me.. </embarrassed_emoji> Just kidding! I w[...]
Connection poolers have two advantages:
But external connection poolers have disadvantages over internal ones:
Some weeks ago at pgDay Paris, during the evening social event, we got into a small "guesstimation" with another Postgres enthusiast about the perceived speed of using integer-based IDs vs UUID-based ones for bigger amounts of data. In short, he reasoned that one should generally avoid UUIDs for performance reasons as they […]
I have released version 4.19 of the PostgreSQL Buildfarm client.
It can be downloaded from
Apart from some minor bug fixes, the following changes have been made:
These changes mean that the client is more useful for testing development code, and also that testing config settings is much simpler. An initial test run on a fresh installation is now as simple as:
cp build-farm.conf.sample build-farm.conf
./run_build.pl --test --verbose
To test development code, the from-source option is now much more flexible and friendly. For example, one might do something like:
./run_build.pl --from-source=/path/to/postgresql \
    --config-set use_vpath=1 \
    --config-set config_opts+=--enable-tap-tests \
    --config-set locales+=en_US.utf8
If you run something like this[...]
With streaming replication, Postgres allows sophisticated setups of primary and standby servers. There are two ways to promote a standby to be the new primary. A switchover is when the change happens in a planned way:
A failover happens when the steps above can't be performed, usually because the primary has failed in some catastrophic way. The major difficulty with failover is the possibility that some of the final database changes contained in the WAL are not transferred to standbys, unless synchronous_standby_names was used. When a standby is promoted to primary after a failover, the final missing WAL records can cause problems:
Make sure you practice both methods of promoting a standby so, when you have to do the promotion in production, you are ready.
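Before a planned switchover, it is wise to verify that the standby has replayed everything. A minimal sketch, run on the primary (function and column names as of Postgres 9.6; version 10 renames them to pg_current_wal_lsn() and replay_lsn):

-- Bytes of WAL each standby still has to replay; this should be
-- at or near zero before the old primary is shut down.
SELECT application_name,
       pg_current_xlog_location() - replay_location AS replay_lag_bytes
FROM pg_stat_replication;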
A few weeks ago I explained the basics of autovacuum tuning. At the end of that post I promised to look into problems with vacuuming soon. Well, it took a bit longer than I planned, but here we go.

To quickly recap, autovacuum is a background process cleaning up dead rows, e.g. old deleted row versions. You can also perform the cleanup manually by running VACUUM, but autovacuum does that automatically depending on the amount of dead rows in the table, at the right moment – not too often but frequently enough to keep the amount of “garbage” under control.

Of course, autovacuum can't be running too often – the cleanup is only performed after some number of dead rows accumulates in the table. But it may be delayed for various reasons, resulting in tables and indexes getting larger than desirable. And that's exactly the topic of this post. So what are the common culprits and how to identify them?
As explained in the tuning post, autovacuum workers are throttled to only perform a certain amount of work per time interval. The default limits are fairly low – about 4MB/s of writes, 8MB/s of reads. That is suitable for tiny machines like a Raspberry Pi or small servers from 10 years ago, but current machines are way more powerful (both in terms of CPU and I/O) and handle much more data.
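Where do those numbers come from? Assuming the default cost parameters (vacuum_cost_limit = 200, autovacuum_vacuum_cost_delay = 20ms, vacuum_cost_page_miss = 10, vacuum_cost_page_dirty = 20), each 20ms round allows 200/10 = 20 page reads, i.e. 1000 pages/s × 8kB ≈ 8MB/s of reads, or 200/20 = 10 dirtied pages, i.e. 500 pages/s × 8kB ≈ 4MB/s of writes.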
Imagine you have a few large tables and some small ones. If all autovacuum workers start cleaning up the large tables, none of the small tables will get vacuumed regardless of the amount of dead rows they accumulate. Identifying this is not particularly difficult, assuming you have sufficient monitoring. Look for periods when all autovacuum workers are busy while tables are not vacuumed despite accumulating many dead rows. All the necessary information is in pg_stat_activity (number of worker processes) and pg_stat_all_tables (amount of dead rows).
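A minimal sketch of such a check might look like this (assuming Postgres 9.6, where autovacuum workers are visible in pg_stat_activity.query):

-- How many autovacuum workers are busy right now?
SELECT count(*)
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';

-- Which tables have the most dead rows waiting for cleanup?
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_all_tables
ORDER BY n_dead_tup DESC
LIMIT 10;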
Increasing the number of autovacuum workers is not a solution, as the total amount of work remains the same. You can specify per-table throttling limits instead.
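For example, a big table can be given more generous limits than the global defaults (a sketch only – the table name and values are illustrative, not recommendations):

-- Per-table storage parameters override the global throttling
-- for this table's autovacuum workers.
ALTER TABLE big_table SET (autovacuum_vacuum_cost_limit = 1000);
ALTER TABLE big_table SET (autovacuum_vacuum_cost_delay = 10);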
This is the second in a series of posts intended to introduce PostgreSQL users to PL/R, a loadable procedural language that enables a user to write user-defined SQL functions in the R programming language. This post builds on the example introduced in the initial post by demonstrating the steps associated with preprocessing the Normalized Difference Vegetation Index (NDVI) satellite raster data in preparation for spatial analytics.
PostgreSQL support cases are coming in on a regular basis. This week an especially noteworthy one reached our desks here at Cybertec – one so interesting that I decided to sit down and share some information. I guess many people out there have similar issues, and therefore this post might be helpful to developers and […]
An important step in SCRAM authentication is called SASLprep, a mandatory feature to be sure about the equivalence of two strings encoded with UTF-8. A first commit added support for the SCRAM-SHA-256 protocol with the full SASL exchange plugged on top of it, and SASLprep has now been implemented by the following commit:
commit: 60f11b87a2349985230c08616fa8a34ffde934c8
author: Heikki Linnakangas <firstname.lastname@example.org>
date: Fri, 7 Apr 2017 14:56:05 +0300
Use SASLprep to normalize passwords for SCRAM authentication.

An important step of SASLprep normalization, is to convert the string to
Unicode normalization form NFKC. Unicode normalization requires a fairly
large table of character decompositions, which is generated from data
published by the Unicode consortium. The script to generate the table is
put in src/common/unicode, as well test code for the normalization.
A pre-generated version of the tables is included in src/include/common,
so you don't need the code in src/common/unicode to build PostgreSQL,
only if you wish to modify the normalization tables.

The SASLprep implementation depends on the UTF-8 functions from
src/backend/utils/mb/wchar.c. So to use it, you must also compile and
link that. That doesn't change anything for the current users of these
functions, the backend and libpq, as they both already link with wchar.o.
It would be good to move those functions into a separate file in
src/commmon, but I'll leave that for another day.

No documentation changes included, because there is no details on the
SCRAM mechanism in the docs anyway. An overview on that in the protocol
specification would probably be good, even though SCRAM is documented in
detail in RFC5802. I'll write that as a separate patch. An important
thing to mention there is that we apply SASLprep even on invalid UTF-8
strings, to support other encodings.

Patch by Michael Paquier and me.

Discussion: https://www.postgresql.org/message-id/CAB7nPqSByyEmAVLtEf1KxTRh=PWNKiWKEKQR=e1yGehz=wbymQ@mail.gmail.com
As referenced in RFC 4013,[...]
When using continuous archiving, you must restore a file system backup before replaying the WAL. If the file system backup was taken long ago, WAL replay might take a long time. One way to avoid this is to take file system backups frequently.
Another option is to perform an incremental file system backup that can be laid over the original file system backup, then replay WAL over that. This reduces restore time because you only need to replay WAL from the start of the incremental backup, not the start of the full backup. This also reduces the amount of WAL that must be retained.
However, Postgres doesn't natively support incremental backup. The best you can do is to use a tool like pgBackRest or Barman that supports incremental backup at the file level. The only problem is that the database files are potentially one gigabyte in size, so the granularity of the incremental backup isn't great. Ideally solutions will be developed that do page-level (8k) incremental backups, which would be much smaller. The trick is finding an efficient way to record which 8k pages have been changed since the last file system backup.
The April meeting will be held at 18:00 EST on Tuesday the 25th. Once again, we will be holding the meeting in the community space at CoverMyMeds. Please RSVP on MeetUp so we have an idea of the amount of food needed.
CoverMyMeds’ very own CJ will be presenting this month. He’s going to tell us all about how CoverMyMeds uses Consul to assist with scaling PostgreSQL and maintaining the high availability of the database.
Please park at a meter on the street or in the parking garage (see below). You can safely ignore any sign saying to not park in the garage as long as it’s after 17:30 when you arrive. Park on the first level in any space that is not marked ‘24 hour reserved’. Once parked, take the elevator to the 3rd floor to reach the Miranova lobby.
The elevator bank is in the back of the building. Take a left and walk down the hall until you see the elevator bank on your right. Grab an elevator up to the 11th floor. Once you exit the elevator, look to your left and right. One side will have visible cubicles, the other won’t. Head to the side without cubicles. You’re now in the community space. The kitchen is to your right (grab yourself a drink) and the meeting will be held to your left. Walk down the room towards the stage.
Want to get your web development ideas in front of a live audience? The call for papers for the ConFoo Vancouver 2017 web developer conference is open! If you have web development topics to share, we want to see your proposals. The window is open only from April 10 to May 8, 2017, so hurry. An added benefit: if your proposal is selected and you live outside of the Vancouver area, we will cover your travel and hotel.

You'll have 45 minutes for the talk, with 35 minutes for your presentation and 10 minutes for Q&A. We can't wait to see your proposals!

Until the talks are picked, the price for the tickets will be at its lowest. Once the talks are announced, prices will go up. Check out the last conference to get an idea of what to expect.
You probably have heard the term "checkpoint" before, or seen it mentioned in the postgresql.conf file. A checkpoint is a usually-invisible cleanup feature present in most database systems, but it is useful to know what it does.

This diagram illustrates checkpoints. At the top are three Postgres database sessions. Each session reads and writes to the shared buffer cache. Every modification to shared buffers also causes a change record to be written to the write-ahead log (WAL). Over time the WAL would grow unbounded in size if it were not trimmed occasionally – that is what checkpoints do.
A checkpoint writes previously-dirtied shared buffers to durable storage over a period of several minutes, at which point the WAL representing those writes is no longer needed for crash recovery. (Hopefully continuous archiving and streaming replication have also processed those WAL files.) Therefore, the old WAL can then be removed or recycled.
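If you want to watch this happen, a minimal sketch (CHECKPOINT requires superuser; the view columns are as of Postgres 9.6):

-- Force an immediate checkpoint, then compare scheduled (timed)
-- vs. requested checkpoints and the buffers they wrote.
CHECKPOINT;
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;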
After some time of absence due to a high work load, I finally got around to writing more on an issue that has been on my mind for quite a while: removing duplicate rows. It happens from time to time that PostgreSQL support clients accidentally load data twice or somehow manage to duplicate data (human […]
Shortly after I published Benchmarking UUIDs, someone emailed me with a correction. It turns out the approach Jonathan and I used to time how long PostgreSQL takes to generate a million UUIDs is mostly timing how long it takes to generate a million queries:
DO $$
BEGIN
  FOR i IN 0..1000000 LOOP
    PERFORM 1;
  END LOOP;
  RETURN;
END;
$$;
They pointed out a better way to test:
SELECT COUNT(*)
FROM (
  SELECT 1
  FROM generate_series(1, 1000000)
) AS x;
This results in a roughly order-of-magnitude difference in test times, just in overhead.
When we take this insight and apply it to the two UUID generator functions, we find that PostgreSQL is faster at this task than nodejs:
SELECT COUNT(*)
FROM (
  SELECT uuid_generate_v4()
  FROM generate_series(1, 1000000)
) AS x;

SELECT COUNT(*)
FROM (
  SELECT gen_random_uuid()
  FROM generate_series(1, 1000000)
) AS x;
On my machine, I see a big difference, more than 5x:
uuid_generate_v4() | gen_random_uuid()
-------------------+------------------
6484.110 ms        | 1166.969 ms
6451.433 ms        | 1169.010 ms
6285.573 ms        | 1161.001 ms
Interestingly, on another machine, the two functions were approximately equally fast, with one slightly edging out the other.
Thankfully, I don't think the flaw in my original measurements undermines the conclusion I drew: the difference between these methods is vanishingly small, and the likelihood that generating UUIDs is the bottleneck in your system is low. Better to focus your optimization efforts elsewhere!
Many thanks to my anonymous contributor! I'll update this article to credit them if they like.
Jonathan New wrote an interesting article on UUID creation in Postgres vs Node. In it, he described the performance tradeoff of generating a UUID in the database vs in the application. It's not very long, go read it!
I've used PostgreSQL to generate UUIDs before, but I hadn't seen uuid_generate_v4(). It turns out to come from the uuid-ossp extension, which also supports other UUID generation methods. Previously, I've used the pgcrypto extension, which provides the gen_random_uuid() function.
How do they compare? On my machine, using the PostgreSQL package for Ubuntu (as opposed to the Ubuntu package for PostgreSQL...), the pgcrypto version is more than twice as fast as the uuid-ossp version.
How does this compare with nodejs? Using Jonathan's approach, nodejs is about 1.5 times as fast as PostgreSQL with pgcrypto!
uuid_generate_v4() | gen_random_uuid() | nodejs
-------------------+-------------------+-------------
10942.376 ms       | 4173.924 ms       | 2886.117 ms
11235.807 ms       | 4341.270 ms       | 2822.078 ms
10764.468 ms       | 4265.632 ms       | 2829.395 ms
What does this mean? I argue: very little! The slowest method takes ~11 seconds to generate one million UUIDs, and the fastest takes ~3 seconds. That's 3 - 11 microseconds per UUID! If this is the bottleneck in your application, I think you've done a very good job of optimizing - and you might have a pretty unusual use case.
The RETURNING clause, not mentioned in Jonathan's post, is really cool:
> CREATE TABLE example (
    example_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    number INTEGER NOT NULL
  );
CREATE TABLE
> INSERT INTO example (number) VALUES (1) RETURNING example_id;
              example_id
--------------------------------------
 045857b4-6125-4746-94b8-a2e58f342b86
(1 row)

INSERT 0 1
This was a very unscientific benchmark! I'm not controlling for other programs running on my machine, and this is not a server, it's just a laptop.
In the interest of writing things down, here's how I came up with the numbers above.
According to /proc/cpuinfo, I am running on an Intel(R) Core(TM) i7-35[...]
For these reasons, if you are archiving WAL, it is wise to use the major version number in the name of the WAL archive directory, e.g. /archive/pgsql/9.6. This avoids the problem of WAL from an old Postgres major version conflicting with WAL files from a new major version.
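As a sketch of what that can look like (the directory layout is illustrative; archive_command can equally be set directly in postgresql.conf):

-- Embed the major version in the archive path so 9.6 WAL files can
-- never be confused with WAL from another major version.
ALTER SYSTEM SET archive_command =
    'test ! -f /archive/pgsql/9.6/%f && cp %p /archive/pgsql/9.6/%f';
SELECT pg_reload_conf();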
Although normally one should try to avoid using non-core extensions/forks of Postgres, in some rare cases it could be even worse not to use them. But as the BDR project is also here to stay (as I've understood, its main functionality will be gradually integrated into the Postgres core), the following info might still be useful in […]