PostgreSQL
The world's most advanced open source database
Posted by Dimitri Fontaine in CitusData on 2018-07-19 at 10:58

This article fits in the PostgreSQL Concurrency series, where we installed a tweeter-like application schema and had all the characters from Shakespeare’s A Midsummer Night’s Dream tweet their own lines in our database in PostgreSQL Concurrency: Data Modification Language.

A previous article in the series covered how to manage concurrent retweets in an efficient way: Computing and Caching, where we learned how to maintain a cache right in your PostgreSQL database, thanks to materialized views.

Today’s article shows how to maintain an external cache in another application layer: we are going to maintain an in-memory cache in a Golang service, using PostgreSQL’s LISTEN and NOTIFY features.
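
As a rough illustration of the database side of that approach (the table, channel, and payload shape here are hypothetical, not taken from the article), a trigger can publish change events that the Go service subscribes to with LISTEN:

CREATE OR REPLACE FUNCTION notify_new_tweet() RETURNS trigger AS $$
BEGIN
  -- Publish a small JSON payload on the 'tweet_events' channel.
  PERFORM pg_notify('tweet_events',
                    json_build_object('id', NEW.id, 'op', TG_OP)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tweet_notify
  AFTER INSERT ON tweet
  FOR EACH ROW EXECUTE PROCEDURE notify_new_tweet();

-- The caching service simply subscribes with:
LISTEN tweet_events;

Any connected listener then receives the payload and can update or invalidate its in-memory cache.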

Posted by Craig Kerstiens in CitusData on 2018-07-19 at 08:20

ZFS is an open source file system with the option to store data on disk in compressed form. ZFS itself supports a number of compression algorithms, giving you the flexibility to optimize both performance and how much you store on disk. Compressing your data on disk offers two pretty straightforward advantages:

  1. Reduce the amount of storage you need—thus reducing costs
  2. When reading from disk, less data needs to be scanned, improving performance

To date, we have run Citus Cloud—our fully-managed database as a service that scales out Postgres horizontally—in production on EXT4. Today, we’re excited to announce a limited beta program of ZFS support for our Citus Cloud database. ZFS makes Citus Cloud even more powerful for certain use cases. If you are interested in access to the beta contact us to get more info, or continue reading to learn more about the use cases where ZFS and Citus and Postgres can help.

Lots of storage needed in your database cluster

If you have a massive amount of older historical data mixed with more recent data in your Postgres database, then the ZFS support in Citus could be useful for you. By leveraging Citus and pg_partman for time partitioning of data, you’re able to retain a large amount of time-series data, and retain each bucket of data in separate tables.

This means you could easily store a lot of historical data while the most recent data is kept fresh in cache. We’ve commonly seen compression rates on disk of 2X-3X, which means on a standard 2 TB disk sizing on Citus Cloud (that’s 2 TB per node, for each node in your Citus database cluster) you’re able to pack up to 6 TB per node. This can reduce the number of nodes you need in your Citus database cluster, when your biggest need is to retain more data.

Note: We do have custom storage options beyond 2 TB as well. If you have needs larger than 2 TB per node, please get in touch with us.

Scanning lots of data in Postgres doesn’t have to be so painful

Beyond allowing you to store lots of data, certain workloads will see a big improveme

[...]
Posted by pgCMH - Columbus, OH on 2018-07-19 at 04:00

The July meeting will be held at 18:00 EST on Tues, the 24th. Once again, we will be holding the meeting in the community space at CoverMyMeds. Please RSVP on MeetUp so we have an idea of how much food is needed.

What

Our very own Douglas will be presenting again this month. He’s going to tell us all about the most common datatypes you’ll see in PostgreSQL. This will be the 2nd of a two-part talk.

Where

CoverMyMeds has graciously agreed to validate your parking if you use their garage so please park there:

You can safely ignore any sign saying to not park in the garage as long as it’s after 17:30 when you arrive.

Park in any space that is not marked ‘24 hour reserved’.

Once parked, take the elevator/stairs to the 3rd floor to reach the Miranova lobby. Once in the lobby, the elevator bank is in the back (West side) of the building. Take a left and walk down the hall until you see the elevator bank on your right. Grab an elevator up to the 11th floor. (If the elevator won’t let you pick the 11th floor, contact Doug or CJ (info below)). Once you exit the elevator, look to your left and right; one side will have visible cubicles, the other won’t. Head to the side without cubicles. You’re now in the community space:

Community space as seen from the stage

The kitchen is to your right (grab yourself a drink) and the meeting will be held to your left. Walk down the room towards the stage.

If you have any issues or questions with parking or the elevators, feel free to text/call Doug at +1.614.316.5079 or CJ at +1.740.407.7043

Posted by Pavel Stehule on 2018-07-18 at 20:14
This release can be fully configured from the menu (themes can be selected from the menu too). The settings can be persisted to the file ~/.pspgconf:

pg_rewind, introduced in PostgreSQL 9.5, is a powerful utility solving a particular problem: If you have promoted a streaming replication secondary into being a primary, how can you make sure that the former primary, and any other secondaries that used to be connected to it, are able to connect to the new primary? Previously, there was no entirely safe way of doing so without simply rebuilding the secondaries using pg_basebackup (for example), which could take a very long time on large databases.

pg_rewind works by connecting to (or having direct file-system level access to) the new primary, and uses the WAL information to “back up” the target system (the old master or old peer secondaries) to the point that they can reattach as secondaries to the new primary.

It works wonderfully… but like anything powerful, it has some warnings associated with it. Here’s one of them.

In a recent situation, a client was doing a flip-back-and-forth stress test, in which a secondary would be promoted, its former primary rewound, the two reattached, and then back again. This worked well for many iterations, but after one particularly fast iteration, the new secondary (former primary, now rewound) wouldn’t come back up; the message was:

requested timeline 105 does not contain minimum recovery point A58/6B109F28 on timeline 103

What happened?

When PostgreSQL exits recovery, it starts a forced checkpoint; you can see this in the logs with a message:

checkpoint starting: force
database system is ready to accept connections

Note that the relevant text here is that the checkpoint is starting; it hasn’t completed yet.

The hypothesis is (not 100% confirmed, but it seems likely) that the pg_rewind on the new secondary was done before the checkpoint had finished. Since one of the last things a checkpoint does is write the pg_control file, and one of the first things that pg_rewind does is read the control file from the source machine, there’s a window where the pg_control file on disk will be out of date. The result is a rather co

[...]
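
A minimal precaution, given the race described above: after promotion, force the checkpoint to complete and confirm that pg_control already reflects the new timeline before running pg_rewind against the new primary. A hedged sketch (pg_control_checkpoint() is available from 9.6 on):

-- On the newly promoted primary, before rewinding the old one:
CHECKPOINT;
SELECT timeline_id, checkpoint_lsn, redo_lsn
  FROM pg_control_checkpoint();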

Here’s a step-by-step guide to installing PostgreSQL on your machine using PGInstaller. PGInstaller supports three modes of installation: Graphical, Unattended, and Text. We’re going to cover all three of them in this guide.

PGInstaller

To Install PostgreSQL via Graphical Mode

Step 1:

Download PGInstaller here. PGInstaller is available for PostgreSQL 9.5, 9.6, 10, and 11 (beta).

Step 2:

Click on the executable file to run the installer.

Step 3:

Select your preferred language.


Step 4:

Specify the directory where you want to install PostgreSQL.

Step 5:

Specify the PostgreSQL server port. You can leave this as the default if you’re unsure what to enter.

Step 6:

Specify the data directory in which to initialize the PostgreSQL database.

Step 7:

Create a PostgreSQL user password.

Step 8:

Create a password for the database superuser.

Step 9:

Click Next to begin the PostgreSQL installation.

The readme file contains installation paths, service names and database credentials.

To Install PostgreSQL via Unattended Mode

Navigate to the directory that contains PGInstaller and enter:

Windows:

PostgreSQL-10.4-1-windows-installer.exe /q

OSX:

PostgreSQL-10.4-1-osx-installer.app/Contents/MacOS/installbuilder.sh --mode unattended --superuser_password database_superuser_password

Linux:

PostgreSQL-10.4-1-linux-x64-installer.run --mode unattended --superuser_password database_superuser_password

To Install PostgreSQL via Text Mode

Navigate to the directory that contains PGInstaller and enter:

For Windows

PostgreSQL-10.4-1-windows-installer.exe --mode text

For OSX

PostgreSQL-10.4-1-osx-installer.app/Contents/MacOS/installbuilder.sh --mode text

For Linux

PostgreSQL-10.4-1-linux-x64-installer.run --mode text

Space requirements

For the complete PGInstaller installation you will need approximately 60 MB of disk space.

Supported Platforms

Linux

  • RHEL6 and RHEL7
  • CentOS6 and CentOS7
  • Ubuntu Bionic, Xenial and Trusty
  • Fedora 25, Fedora 26
  • OpenSUSE 13.x
  • Debian Stretch and Jessie

Windows

  • Windows 8
  • Windows 10

OSX

  • 10 and above

[...]

In a previous blog post My Favorite PostgreSQL Queries and Why They Matter, I visited interesting queries meaningful to me as I learn, develop, and grow into a SQL developer role.

One of those, in particular, a multi-row UPDATE with a single CASE expression, sparked up an interesting conversation over on Hacker News.

In this blog post, I want to compare that particular query with one involving multiple single UPDATE statements. For better or worse.

Machine/Environment Specs:

  • Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  • 8GB RAM
  • 1TB Storage
  • Xubuntu Linux 16.04.3 LTS (Xenial Xerus)
  • PostgreSQL 10.4

Note: To start, I created a 'staging' table with all TEXT type columns to get the data loaded.

The sample data set I'm using is found at this link here.

But keep in mind, the data itself is used in this example because it's a decent-sized set with multiple columns. Any 'analysis' or UPDATEs/INSERTs to this data set are not reflective of actual 'real-world' GPS/GIS operations and are not intended as such.

location=# \d data_staging;
               Table "public.data_staging"
    Column     |  Type   | Collation | Nullable | Default 
---------------+---------+-----------+----------+---------
 segment_num   | text    |           |          | 
 point_seg_num | text    |           |          | 
 latitude      | text    |           |          | 
 longitude     | text    |           |          | 
 nad_year_cd   | text    |           |          | 
 proj_code     | text    |           |          | 
 x_cord_loc    | text    |           |          | 
 y_cord_loc    | text    |           |          | 
 last_rev_date | text    |           |          | 
 version_date  | text    |           |          | 
 asbuilt_flag  | text    |           |          | 

location=# SELECT COUNT(*) FROM data_staging;
count
--------
546895
(1 row)

We have around half a million rows of data in this table.

For this first comparison, I will UPDATE the proj_code column.

Here is an exploratory query to determine its current values:

location=#
[...]
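
For reference, the two patterns being compared look roughly like this (the proj_code values below are hypothetical placeholders, not taken from the data set):

-- One multi-row UPDATE driven by a single CASE expression:
UPDATE data_staging
   SET proj_code = CASE proj_code
                     WHEN '1' THEN 'A'
                     WHEN '2' THEN 'B'
                     ELSE proj_code
                   END
 WHERE proj_code IN ('1', '2');

-- Versus multiple single-purpose UPDATE statements:
UPDATE data_staging SET proj_code = 'A' WHERE proj_code = '1';
UPDATE data_staging SET proj_code = 'B' WHERE proj_code = '2';
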
Posted by Christophe Pettus in pgExperts on 2018-07-17 at 17:38

Google recently released a tool to check on-disk checksums in PostgreSQL. PostgreSQL being hot, and Google being Google, this generated a lot of press, much of it containing an element of FUD about the integrity of PostgreSQL backups (note that Google’s own announcements were quite straightforward about it).

First, some background: since version 9.3, PostgreSQL has had page checksums available as an option when initializing a new cluster. These are fast, lightweight checksums to verify that the data on the page hasn’t been corrupted. It’s an excellent and low-impact data integrity feature, and highly recommended.
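
For reference, checksums are enabled at cluster creation time with initdb --data-checksums; to see whether an existing cluster has them turned on:

SHOW data_checksums;
-- returns 'on' if the cluster was initialized with --data-checksums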

However, PostgreSQL doesn’t just randomly check the checksums of pages it is not going to read into memory (and why would it?). Thus, there can be lurking corruption in a database that you don’t discover until too late… specifically, after all of the backups also contain it. Some PostgreSQL page-level backup tools, such as pgBackRest, check the checksums during the backup, but other tools (such as WAL-E and WAL-G) do not.

Google’s tool adds to that collection by doing an on-disk scan of each page, and verifying the checksum. That’s all it does. That’s not useless! It’s relevant to backups because doing this scan on a backup image doesn’t create the large I/O burden that doing so on an active server would.

So, if you are using a backup tool that does not verify backups, and that creates a disk image (as opposed to, say, pg_dump), by all means consider this new addition as part of your toolbox. But there’s nothing inherent about PostgreSQL backups that makes them more prone to corruption, and some of the unfortunate press around this has given that implication.

Posted by Julien Rouhaud on 2018-07-17 at 17:34

A new version of pg_stat_kcache is out, with support for Windows and other platforms, and more counters available.

What’s new

Version 2.1 of pg_stat_kcache has just been released.

The two main new features are:

  • compatibility with platforms without native getrusage() support (such as Windows)
  • more fields of getrusage() are exposed

As I explained in a previous article, this extension is a wrapper on top of getrusage() that accumulates performance counters per normalized query. It already provided valuable information that allows a DBA to identify CPU-intensive queries, or compute a real hit ratio, for instance.
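
As a quick illustration (a sketch assuming the join keys documented for pg_stat_kcache 2.x, where the set-returning function shares userid, dbid and queryid with pg_stat_statements), finding the most CPU-hungry normalized queries could look like:

SELECT s.query,
       k.user_time,
       k.system_time
  FROM pg_stat_statements s
  JOIN pg_stat_kcache() k USING (userid, dbid, queryid)
 ORDER BY k.user_time + k.system_time DESC
 LIMIT 10;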

However, it was only available on platforms that have a native version of getrusage(), so Windows and some other platforms were not supported. Fortunately, PostgreSQL does offer basic getrusage() support for those platforms. This infrastructure is used in version 2.1.0 of pg_stat_kcache, which means that you can now use this extension on Windows and all the other platforms that weren’t supported previously. As this support is limited, only the user and system CPU metrics will be available; the other fields will always be NULL.

This new version also exposes all the remaining fields of getrusage() that make sense when accumulated per query:

  • soft page faults
  • hard page faults
  • swaps
  • IPC messages sent and received
  • signals received
  • voluntary and involuntary context switches

Another change is to automatically detect the operating system’s clock tick. Otherwise, very short queries (faster than a clock tick) would be either detected as not consuming CPU time, or consuming CPU time from earlier short queries. For queries faster than 3 clock ticks, where imprecision is high, pg_stat_kcache will instead use the query duration as CPU user time, and won’t use anything as CPU system time.

Small example

Depending on your platform, some of those new counters aren’t maintained. On GNU/Linux, for instance, the swaps, IPC messages and signals are unfortunately not maintained, but those which are are still quit

[...]

(This is another entry in an intermittent series of small things that are easy to forget, and cause irritation in the PostgreSQL world…)

When setting up pgpool2, it’s common to tweak the configuration file repeatedly… and often get a hostname wrong. One common occurrence is then:

  • You fix the hostnames.
  • You restart pgpool2.
  • It refuses to connect to the host you just fixed.
  • You scratch your head, connect manually with psql, everything works…
  • But pgpool2 just will not acknowledge that host.

Often, the problem is that the pgpool_status file, usually kept in /tmp, has cached the status of one of the hosts. The pgpool_status file is retained across pgpool2 restarts. Deleting the pgpool_status file will fix the issue.

(Why, yes, this just happened to me today…)

Posted by Stefan Fercot in Dalibo on 2018-07-17 at 00:00

PGDay Amsterdam brought together more than 90 PostgreSQL fans (according to Devrim) on 12 July 2018. It was a really nice day for me and I’d like to share it with you.


User feedback, PostgreSQL internals, tools… the topics covered were pretty wide-ranging.

Jan Karremans first started with Why I picked Postgres over Oracle?

Jan shared with us some of the lessons he learned during his journey:

  • weigh your needs (what do you need? what does your project need?);
  • don’t be afraid, you are not alone;
  • PostgreSQL is the best hidden secret.

Inspired by Dimitri Fontaine and his Mastering PostgreSQL in Application Development book, Oleksii Kliukin presented next Ace it with ACID: Postgres transactions for fun and profit.

Building a sample application to find his bike in Amsterdam, Oleksii explained what ACID means. MVCC, isolation levels, transactional DDL,… complete and local, great talk!

Daniel Westermann then tried to summarize What we already know about PostgreSQL 11.

A lot of new interesting improvements are coming regarding partitioning, parallelism, covering unique indexes, procedures with transaction control,… and also, my favorite:

$ psql
postgres=# help
You are using psql, the command-line interface to PostgreSQL.
Type:  \copyright for distribution terms
       \h for help with SQL commands
       \? for help with psql commands
       \g or terminate with semicolon to execute query
       \q to quit
postgres=# let met get out !
postgres-# exit
Use \q to quit.
postgres-# quit
Use \q to quit.
postgres-# \q

After a short coffee break, Stefanie Stoelting explained how to use PostgreSQL As Data Integration Tool by using Foreign Data Wrappers.

FDW, what is it? Which ones exist? An overview with useful examples.

Then, well…

My turn: Save your data with pgBackRest.

Short introduction to this A.W.E.S.O.M.E! backup and restore system. For a more detailed example, there’s actually another post on this blog which talks about it :-)

Jeroen de Graaff told us, after lunch, the Step-by-step implementation of PostgreSQL at Rijkswaterst

[...]
Posted by Dimitri Fontaine in CitusData on 2018-07-16 at 07:27

In the previous article of the series Modeling for Concurrency, we saw how to model your application for highly concurrent activity. It was a follow-up to the article entitled PostgreSQL Concurrency: Isolation and Locking, which was a primer on PostgreSQL isolation and locking properties and behaviors.

Today’s article takes us a step further and builds on what we did in the previous articles in our series. After having had all the characters from Shakespeare’s A Midsummer Night’s Dream tweet their own lines in our database in PostgreSQL Concurrency: Data Modification Language, and having had them like and retweet a lot in PostgreSQL Concurrency: Isolation and Locking, we saw how to manage concurrent retweets in an efficient way in Computing and Caching.

What we implemented in the previous article is a cache system, with its necessary cache invalidation policy. Sometimes, though, the processing of an event needs to happen within the same transaction where the event is registered in your system. PostgreSQL makes it possible to maintain a summary table transactionally thanks to its trigger support. Today, we’re going to dive into how to maintain a summary table with triggers, and into its impact on concurrency.
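
As a minimal sketch of the idea (the schema here is hypothetical, not the book’s), a trigger can keep a per-tweet counter in step with the transaction that records the event:

CREATE TABLE tweet_counters (
  tweet_id bigint PRIMARY KEY,
  retweets bigint NOT NULL DEFAULT 0
);

CREATE OR REPLACE FUNCTION bump_retweet_counter() RETURNS trigger AS $$
BEGIN
  INSERT INTO tweet_counters (tweet_id, retweets)
       VALUES (NEW.tweet_id, 1)
  ON CONFLICT (tweet_id)
  DO UPDATE SET retweets = tweet_counters.retweets + 1;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER retweet_counter_trig
  AFTER INSERT ON retweet
  FOR EACH ROW EXECUTE PROCEDURE bump_retweet_counter();

Because the trigger runs in the same transaction as the INSERT, the summary row either commits with the event or rolls back with it.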


This article is extracted from my book Mastering PostgreSQL in Application Development, which teaches SQL to developers so that they may replace thousands of lines of code with very simple queries. The book has a full chapter about Data Manipulation and Concurrency Control in PostgreSQL, including caching with materialized views; check it out!


Posted by Federico Campoli on 2018-07-16 at 00:00

In the previous post we configured three Devuan servers from scratch using Ansible, adding the pgdg repository to the apt sources and installing the PostgreSQL binaries.

This tutorial will revisit the apt role’s configuration and will introduce a new role for configuring the postgres operating system user for passwordless ssh connections to the other servers.

Posted by Dimitri Fontaine in CitusData on 2018-07-13 at 11:10

Let’s continue our dive into PostgreSQL Concurrency. In the previous article of the series, Modeling for Concurrency, we saw how to model your application for highly concurrent activity. It was a follow-up to the article entitled PostgreSQL Concurrency: Isolation and Locking, which was a primer on PostgreSQL isolation and locking properties and behaviors.

Today’s article takes us a step further and builds on what we did in the previous articles in our series. After having had all the characters from Shakespeare’s A Midsummer Night’s Dream tweet their own lines in our database in PostgreSQL Concurrency: Data Modification Language, and having had them like and retweet a lot in PostgreSQL Concurrency: Isolation and Locking, it’s time to think about how to display our counters in an efficient way.

In this article, we’re going to think about when we should compute results and when we should cache them for instant retrieval, all within the SQL tooling. The SQL tooling for handling cache is a MATERIALIZED VIEW, and it comes with cache invalidation routines, of course.
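
A hedged sketch of that SQL tooling (the table and view names are hypothetical, not the article’s):

CREATE MATERIALIZED VIEW tweet_activity AS
  SELECT tweet_id,
         count(*) FILTER (WHERE action = 'like')    AS likes,
         count(*) FILTER (WHERE action = 'retweet') AS retweets
    FROM activity
   GROUP BY tweet_id;

-- A unique index allows the cache to be rebuilt without blocking readers:
CREATE UNIQUE INDEX ON tweet_activity (tweet_id);
REFRESH MATERIALIZED VIEW CONCURRENTLY tweet_activity;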

In this blog we will continue the discussion of parallel query in PostgreSQL. In the previous blog of this series, my colleague covered parallel index scans, their design in PostgreSQL, and the performance improvement achieved for a few queries on the industrial TPC-H benchmark. There we analysed the performance improvement only for a small scale factor of 20 (a database of approximately 20 GB). But the performance benefits realised by parallel operators aren’t that significant until we leverage them at higher scale factors. Hence, in this blog we will analyse the performance of parallel index scans at scale factor 300.

For this experiment we used the TPC-H inspired benchmark for PostgreSQL. We used scale factor 300, which gives a 300+ GB database, depending on the available indexes, etc. The additional indexes we created were on the columns (l_shipmode, l_shipdate, o_orderdate, o_comment). We tuned the following parameters:
  • random_page_cost = 0.1
  • seq_page_cost = 0.1
  • effective_cache_size = 10GB
  • shared_buffers = 10GB
  • work_mem = 1GB
In the table below, we compare the performance of the queries using parallel index scans on v10 to their performance on v9.6. Note that in v9.6 parallel seq scans and parallel nested loop joins were already available. Additionally, we tabulated the total contribution of parallel index scans for each of the queries. All timing values are in seconds.

 TPC-H query | On v9.6 | On v10 | Contribution of PIS
-------------+---------+--------+---------------------
 6           |     883 |    155 |                 137
 12          |     740 |    717 |                 122
 14          |     422 |    111 |                  53
 15          |    2295 |   1138 |                 157
 17          |   17724 |   3376 |                  16

From the above table, it is clear that the benefits of parallel index scans are significant on large amounts of data as well. To get a better understanding of how this scan method benefits the queries in question, we studied their query plans on both versions. On analysing those query plans we found two primary explanations for the benefits attained:
  • Contribution of parallel index scan in total execution time of the query
If the most time-consuming
[...]
Having covered some of the reasons why you would want to run PostgreSQL on Kubernetes, I'm now going to cover some of the reasons why you wouldn't want to. This should help you make a balanced judgment about whether or not it's the right next move for you.

Not a Kubernetes Shop


Of course, everything I said about an integrated environment applies in reverse; if you're not already running other services on Kubernetes, or planning to migrate them, then moving your databases to Kubernetes is a lot of effort with minimal reward.  I'd even go so far as to say that your databases should probably be the last thing you move. Stateful services are harder, and require a greater mastery of Kubernetes. Start with the easy stuff.

Also, and perhaps more importantly, nothing in Kubernetes will remove from you (as admin) the necessity of having knowledge of the database platform you're deploying. While it does make some things (like HA) easier to automate, it doesn't change the basics of database management. You still need to have someone on staff who knows PostgreSQL, it's just that that person can manage a lot more databases than they could before.

Everything is Alpha


Like everything in the new stacks, Kubernetes is under heavy development and many of the tools and features you're going to want to use are alpha, beta, or pre-1.0 software.  For folks used to PostgreSQL, where even our daily builds are runnable in production, you're going to find the world of containerish stuff very alpha indeed.

This means that moving your databases (or anything else) to Kubernetes at this stage means embracing the requirement to develop both some Kubernetes expertise, and a tolerance for risk.  Realistically, this means that you're only going to do it today if you have pressing needs to scale your database support and admin team that you can't meet otherwise. Folks in regulated environments with very low tolerance for mistakes should probably be the last to move.

In a year or so, this will change; there is more work today on making [...]

1st RULE: You do not upgrade PostgreSQL with trigger-based replication
2nd RULE: You DO NOT upgrade PostgreSQL with trigger-based replication
3rd RULE: If you upgrade PostgreSQL with trigger-based replication, prepare to suffer. And prepare well.

There must be a very serious reason not to use pg_upgrade for upgrading PostgreSQL.

OK, let’s say you can’t afford more than seconds of downtime. Use pglogical then.

OK let’s say you run 9.3 and thus can’t use pglogical. Use Londiste.

Can’t find readable README? Use SLONY.

Too complicated? Use streaming replication - promote the slave and run pg_upgrade on it - then switch apps to work with the newly promoted server.

Your app is relatively write-intensive all the time? You looked into all possible solutions and still want to set up custom trigger-based replication? There are things you should pay attention to, then:

  • All tables need a PK. You shouldn’t rely on ctid (even with autovacuum disabled)
  • You will need to enable triggers for all constraint-bound tables (and might need deferred FKs)
  • Sequences need manual sync
  • Permissions aren’t replicated (unless you also set up an event trigger)
  • Event triggers can help with automation of support for new tables, but better to not overcomplicate an already complicated process. (like creating a trigger and a foreign table on table creation, also creating same table on foreign server, or altering remote server table with same change, you do on old db)
  • A FOR EACH STATEMENT trigger is less reliable but probably simpler
  • You should vividly imagine your preexisting data migration process
  • You should plan limited tables accessibility while setting up and enabling trigger based replication
  • You should absolutely, totally know your relations’ dependencies and constraints before you go this way.

Enough warnings? You want to play already? Let’s begin with some code then.

Before writing any triggers we have to build some mock up data set. Why? Wouldn't it be much easier to have a trigger before we have data? So the data would replicate to the “upgrade

[...]
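
To make the trigger-plus-foreign-table idea from the warnings above concrete, here is a minimal sketch (host, table, and column names are hypothetical) of forwarding INSERTs to the upgrade target over postgres_fdw:

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER upgrade_target
  FOREIGN DATA WRAPPER postgres_fdw
  OPTIONS (host 'new-primary.example.com', dbname 'app', port '5432');

CREATE USER MAPPING FOR CURRENT_USER
  SERVER upgrade_target
  OPTIONS (user 'replicator', password 'secret');

CREATE FOREIGN TABLE users_new (
  id   bigint NOT NULL,
  name text
) SERVER upgrade_target OPTIONS (schema_name 'public', table_name 'users');

CREATE OR REPLACE FUNCTION forward_users_insert() RETURNS trigger AS $$
BEGIN
  -- Forward each new local row to the remote (upgraded) server.
  INSERT INTO users_new (id, name) VALUES (NEW.id, NEW.name);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER users_forward_insert
  AFTER INSERT ON users
  FOR EACH ROW EXECUTE PROCEDURE forward_users_insert();

UPDATE and DELETE need analogous branches on TG_OP, which is where the per-table primary keys from the warnings above become essential.
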
In preparation for my workshop on running PostgreSQL on Kubernetes on Monday, I wanted to talk a bit about why you'd want to run your database there in the first place -- and why you wouldn't.

The container world, starting with Docker and then moving to Kubernetes, has focused on stateless services like web applications. This has been largely because stateless services are simply easier to manage in new environments, and can be handled generically as well, allowing Kubernetes to be relatively platform-agnostic. Once you get into services that require storage and other persistent resources, the idea of "run your containerized application on any platform" becomes much more challenging.

Somewhere along the way "stateful services are hard" morphed into "you shouldn't run stateful services" as a kind of mantra.  Considering how much work contributors have put into making stateful apps possible, that's simply wrong.  And, for that matter, when has running databases on any stack ever been easy?

Let's start with some of the reasons you would want to run Postgres (or other databases) on Kubernetes. In tomorrow's post, I'll go into some of the reasons why you would not want to.

One Environment to Rule Them All


The biggest reason is to simplify your development and deployment picture by putting all application components on Kubernetes.  It supplies a whole set of scaffolding to make deploying and integrating applications and databases easier, including shared secrets, universal discovery, load balancing, service monitoring, and scaling.  While there are integration points, like the Service Catalog, that support treating external databases as Kubernetes services, it's always going to be harder to manage a database that has to be deployed by a completely different process from the application it supports.

As a small example, let's take database connection credentials. If both the database and the application are on Kubernetes, rotating/updating credentials is simple: you just update a Secrets object (or a plugin p[...]

PostgreSQL administration, configuration, and deployment can be a tough ask while working in an agile environment with strict deadlines. The key to simplify these operational tasks for PostgreSQL is Ansible – an open source IT automation tool.

To explain how these technologies work together, 2ndQuadrant hosted a Webinar on PostgreSQL deployments using Ansible. The webinar was presented by Tom Kincaid, GM North America at 2ndQuadrant, who gave an overview of Ansible and PostgreSQL, covered best strategies for deployments, and a variety of other topics.

If you weren’t able to make it to the live session, you can now view the recording here.

For any questions, comments, or feedback, please visit our website or send an email to webinar@2ndquadrant.com.

I spent the week of June 25th in Beijing, China with the outstanding Chinese Open Source and Postgres communities. I was there to speak at both Postgres Open China and the China Open Source World conferences, as well as participate in a Chinese Open Source panel and the International Consultants committee meeting, of which I am the President. This was my first trip to Asia and it was amazing. The Chinese culture, hospitality, and friendliness were unparalleled, as was their drive to be more influential and helpful to the International Open Source and Postgres communities.

The entire week was spent trying to answer the question, “How can China participate more thoroughly in the International Open Source and Postgres communities?” We had a lively panel that included the COPU (Chinese Open Source Promotion Union), and representatives from local universities, George Neville-Neil, Alibaba, Stephen Walli of Microsoft, the President of the FreeBSD Foundation, and others. The panel was of particular interest as I was able to hear some of the struggles the local community has had, including respecting copyright, language and cultural barriers, and ensuring economic viability.

I look forward to continuing to assist the Chinese community in being more productive with not only Postgres but also Open Source. There is a wealth of culturally rich, intelligent, and inventive talent available that the PostgreSQL Global Development Group has yet to tap. It will be an exciting few years as both cultures adapt to work together, the contributor list grows, and we start seeing prominent Chinese developers assisting in the growth of PostgreSQL.

The video of my presentation below walks you through the major features of the native JSON data type in PostgreSQL 9.3 and beyond.

This presentation covers the following topics:

  • What is JSON?
  • How is it available in PostgreSQL?
  • What’s the difference between JSON and JSONB?
  • Accessing JSON values
  • Creating JSON from table data
  • Creating table data from JSON
  • Crosstabs with JSON
  • Indexing and JSON
  • When to use JSON, when to use JSONB, and when neither should be used
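
For a small taste of the topics above (the inline row below is just a stand-in for real table data):

SELECT row_to_json(t)           AS as_json,     -- build JSON from a row
       row_to_json(t)->>'title' AS title        -- pull a value back out as text
  FROM (SELECT 1 AS id, 'hello' AS title) AS t;

-- JSONB is the binary, indexable variant; e.g. a GIN index on a jsonb column:
-- CREATE INDEX ON docs USING GIN (payload jsonb_path_ops);
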
Posted by Dimitri Fontaine in CitusData on 2018-07-10 at 08:26

Let’s continue our dive into PostgreSQL Concurrency. Last week’s article PostgreSQL Concurrency: Isolation and Locking was a primer on PostgreSQL isolation and locking properties and behaviors.

Today’s article takes us a step further and builds on what we did last week, in particular the database modeling for a tweet-like application. After having had all the characters from Shakespeare’s A Midsummer Night’s Dream tweet their own lines in our database in PostgreSQL Concurrency: Data Modification Language, it’s time for them to take some actions on the tweets: likes and retweets.

Of course, we’re going to put concurrency to the test, so we’re going to have to handle very very popular tweets from the play!

Looking at the types of PostgreSQL support requests we have received recently, it is striking to see how many of them are basically related to autovacuum and UPDATE in particular. Compared to other databases such as Oracle, PostgreSQL’s way of handling UPDATE and storage in general is quite different, so people moving from Oracle to PostgreSQL might be surprised. It can therefore make sense to take a step back and look at the broader picture.

How PostgreSQL handles UPDATE

The most important thing beginners have to keep in mind is: Internally UPDATE will duplicate a row. After an UPDATE the old as well as the new row will be in your table. But why is that the case? Here is an example:

BEGIN;
UPDATE tab SET id = 17 WHERE id = 16;
SELECT …
ROLLBACK;

The idea is simple: UPDATE is not allowed to destroy the old version of the row because otherwise ROLLBACK would not work. UPDATE has to copy the row and ensure that both versions are there to handle transactions properly.

Why is that important? Here is one more example. Let us assume that the table will contain a single row (id = 16):

Connection 1                            Connection 2
SELECT * FROM tab;                      SELECT * FROM tab;
… returns 16 …                          … returns 16 …
BEGIN;
UPDATE tab SET id = 20 WHERE id = 16;
SELECT * FROM tab;                      SELECT * FROM tab;
… returns 20 …                          … returns 16 …
COMMIT;

Note that the second SELECT in the second connection will still see the old row. PostgreSQL has to ensure that the row is still there.

The same applies to DELETE: If your disk is full, deleting 100 million rows will not automatically return space to the filesystem because some concurrent transactions might still use the data.

VACUUM: The art of cleaning dead rows

As stated already DELETE does not actually remove old rows. Who does? The answer is: VACUUM. Let us take a look at an example:

Connection 1            Connection 2                      Connection 3
BEGIN;
DELETE FROM tab;
… running …             SELECT * FROM tab;
… running …             … will return all the rows …
                                                          VACUUM;
COMMIT;                                                   … does not delete anything …
[...]
Posted by Denish Patel on 2018-07-09 at 18:32

As I mentioned in my previous post, “Audit logging using JSONB in Postgres”, audit tables can be partitioned easily in Postgres 10.

Let’s try to implement partitioning on the sample users_audit table…

Drop existing trigger and table

drop trigger users_audit_trig ON public.users;
drop table if exists audit.users_audit;

Create the partitioned table with RANGE partitioning on the audit_ts timestamp column…

set search_path to audit;

create table audit.users_audit(
audit_ts timestamptz not null default now(),
operation varchar(10) not null,
username text not null default "current_user"(),
before jsonb, 
after jsonb
) partition by RANGE (audit_ts);

Create child tables…

CREATE TABLE audit.users_audit_2018_07 PARTITION OF audit.users_audit FOR VALUES FROM ('2018-07-01') TO ('2018-08-01');
CREATE TABLE audit.users_audit_2018_08 PARTITION OF audit.users_audit FOR VALUES FROM ('2018-08-01') TO ('2018-09-01');
CREATE TABLE audit.users_audit_2018_09 PARTITION OF audit.users_audit FOR VALUES FROM ('2018-09-01') TO ('2018-10-01');
CREATE TABLE audit.users_audit_2018_10 PARTITION OF audit.users_audit FOR VALUES FROM ('2018-10-01') TO ('2018-11-01');
CREATE TABLE audit.users_audit_2018_11 PARTITION OF audit.users_audit FOR VALUES FROM ('2018-11-01') TO ('2018-12-01');
CREATE TABLE audit.users_audit_2018_12 PARTITION OF audit.users_audit FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

Create the required indexes on EACH child partition…

create index on audit.users_audit_2018_07 (audit_ts desc,operation);
create index on audit.users_audit_2018_07 using GIN(before);
create index on audit.users_audit_2018_07 using GIN(after);
create index on audit.users_audit_2018_07 using GIN ((after->'userid'));

NOTE: you have to pre-create these child tables, as well as their indexes, in advance, so come up with a process to create them using some kind of script, or add a trigger on the parent partitioned table to create child tables automatically.
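
One hedged way to script that pre-creation (a sketch, not from the post) is a DO block that builds next month’s partition and its main index with dynamic SQL, run from cron or any scheduler:

DO $$
DECLARE
  start_ts date := date_trunc('month', now() + interval '1 month')::date;
  end_ts   date := (start_ts + interval '1 month')::date;
  part     text := 'users_audit_' || to_char(start_ts, 'YYYY_MM');
BEGIN
  EXECUTE format(
    'CREATE TABLE IF NOT EXISTS audit.%I PARTITION OF audit.users_audit
       FOR VALUES FROM (%L) TO (%L)',
    part, start_ts, end_ts);
  EXECUTE format(
    'CREATE INDEX IF NOT EXISTS %I ON audit.%I (audit_ts DESC, operation)',
    part || '_audit_ts_idx', part);
END
$$;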

Place the trigger back on USERS table..

CREATE TRIGGER users_audit_trig
 BEFORE INSERT OR UPDATE OR DELETE
 ON public.users
 FO
[...]


I get asked about Oracle RAC often. My usual answer is that Oracle RAC gives you 50% of high reliability (storage is shared, mirroring helps) and 50% of scaling (CPU and memory are scaled, storage is not). The requirement to partition applications to specific nodes to avoid cache consistency overhead is another downside. (My scaling presentation shows Oracle RAC.)

I said the community is unlikely to go in the Oracle RAC direction because it doesn't fully solve a single problem, and is overly complex. The community prefers to fully solve problems, with simple solutions.

For me, streaming replication fully solves the high availability problem and sharding fully solves the scaling problem. Of course, if you need both, you have to deploy both, which gives you 100% of two solutions, rather than Oracle RAC which gives you 50% of each.

Continue Reading »

Posted by Pavel Stehule on 2018-07-09 at 14:57
The pager pspg has more than 30 keyboard shortcuts, which can be hard to use for beginners without Linux command-line knowledge.

For this reason I enhanced pspg with menus. Unfortunately I didn't find any library that implements menus on the ncurses platform. Native ncurses menus don't support the mouse well, and have a slightly different philosophy than CUA. So I wrote my own st-menu library. This library is available for general and free usage. pspg can be used as a reference usage of this library.

A few screenshots:
Posted by Julien Rouhaud on 2018-07-09 at 10:43

You can now view the Wait Events in PoWA thanks to the pg_wait_sampling extension.

Wait Events & pg_wait_sampling

Wait events are a famous and useful feature in a lot of RDBMS. They were added in PostgreSQL 9.6, quite a few versions ago. Unlike most other PostgreSQL statistics, these are only an instant view of what the processes are currently waiting on, not cumulative counters. You can get those events using the pg_stat_activity view, for instance:

=# SELECT datid, pid, wait_event_type, wait_event, query FROM pg_stat_activity;
 datid  |  pid  | wait_event_type |     wait_event      |                                  query
--------+-------+-----------------+---------------------+-------------------------------------------------------------------------
 <NULL> | 13782 | Activity        | AutoVacuumMain      |
  16384 | 16615 | Lock            | relation            | SELECT * FROM t1;
  16384 | 16621 | Client          | ClientRead          | LOCK TABLE t1;
 847842 | 16763 | LWLock          | WALWriteLock        | END;
 847842 | 16764 | Lock            | transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + 1229 WHERE bid = 1;
 847842 | 16766 | LWLock          | WALWriteLock        | END;
 847842 | 16767 | Lock            | transactionid       | UPDATE pgbench_tellers SET tbalance = tbalance + 3383 WHERE tid = 86;
 847842 | 16769 | Lock            | transactionid       | UPDATE pgbench_branches SET bbalance = bbalance + -3786 WHERE bid = 10;
[...]

In this example, we can see that the wait event for pid 16615 is a Lock on a Relation. In other words, the query is blocked waiting for a heavyweight lock, while pid 16621, which obviously holds the lock, is idle waiting for client commands. This is something we could already see in previous versions, though in a different manner. More interestingly, we can also see that the wait event for pid 16766 is a LWLock, or Lightweight Lock. Those are internal transient locks that you previously couldn't see at the SQL level. In this examp

[...]
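
Since pg_wait_sampling accumulates samples over time (unlike the instant snapshot from pg_stat_activity above), a quick look at the most frequent wait events it has recorded could be (a sketch against the extension’s pg_wait_sampling_profile view):

SELECT event_type, event, count
  FROM pg_wait_sampling_profile
 ORDER BY count DESC
 LIMIT 10;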

Sometimes logs are the last things checked when things are going wrong, but they are usually the first things screaming for help when something happens. Manually looking through the logs for problems helps, but why not use log analyzers to automatically generate reports to provide insight on the database before something goes wrong?

The PostgreSQL log analyzer “pgBadger” is an open source “fast PostgreSQL log analysis report” program written in Perl that takes the log output from a running PostgreSQL instance and processes it into an HTML file. The report it generates shows all information found in a nice and easy to read report format. These reports can help shed light on errors happening in the system, checkpoint behavior, vacuum behavior, trends, and other basic but crucial information for a PostgreSQL system.

PostgreSQL Logging Setup

To use pgBadger effectively, logging in PostgreSQL should be set up to provide pgBadger as much information as possible. A handful of options can be tweaked to allow the database system to log useful information for pgBadger to generate useful reports. Full documentation for PostgreSQL configuration can be found on the pgBadger github page, but some basic information is below.

When run, pgBadger will process the logs from PostgreSQL whether they are syslog, stderr, or csvlog, as long as the log lines themselves have enough information in the prefix.

Example log_line_prefix values:

If log_destination = ‘syslog’

log_line_prefix = 'user=%u,db=%d,app=%a,client=%h '

If log_destination = ‘stderr’

log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '

Basic configuration settings to set in postgresql.conf:

log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0
log_autovacuum_min_duration = 0
log_error_verbosity = default
log_statement = off
lc_messages='C'

log_min_duration_statement = (see below)

One of the more helpful parts of pgBadger’s reports is the slow query report, which relies on the database logging queries that exc

[...]

One of the features we are looking forward to in the upcoming PostgreSQL 11 is the introduction of procedures via the CREATE PROCEDURE ANSI-SQL construct. The major benefit that sets procedures apart from functions is that procedures are not wrapped in an outer transaction and can have COMMITs within them. This means it's not all-or-nothing like it is with functions. Even if you stop a procedure in motion, whatever work has been done and committed is saved. In the case of functions, a stop or failure would roll back all the work. It also means you can see the work in progress of a stored procedure, since the work will already have been committed. This is a huge benefit for batch processing. Batch processing covers a lot of use cases for PostGIS users, since a good chunk of PostGIS work involves doing some kind of batch processing of data you get from third parties or machines.
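
As a hedged sketch of what that enables in PostgreSQL 11 (the table and column names below are hypothetical placeholders for a real geocoding step):

CREATE PROCEDURE geocode_in_batches(batch_size integer DEFAULT 1000)
LANGUAGE plpgsql
AS $$
DECLARE
  rows_done integer;
BEGIN
  LOOP
    UPDATE addresses
       SET geocoded = true          -- stand-in for the real geocoding work
     WHERE id IN (SELECT id FROM addresses
                   WHERE NOT geocoded
                   LIMIT batch_size);
    GET DIAGNOSTICS rows_done = ROW_COUNT;
    EXIT WHEN rows_done = 0;
    COMMIT;                         -- allowed in a procedure, not in a function
  END LOOP;
END;
$$;

CALL geocode_in_batches(500);

If the CALL is interrupted halfway through, every batch that already reached COMMIT stays saved.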

Continue reading "Using procedures for batch geocoding and other batch processing"