PostgreSQL
The world's most advanced open source database
Posted by Bruce Momjian in EnterpriseDB on 2019-02-22 at 20:30

I did a webcast earlier this week about the many options available to people choosing Postgres — many more options than are typically available for proprietary databases. I want to share the slides, which cover why open source has more options, how to choose a vendor that helps you be more productive, and specifically tool options for extensions, deployment, and monitoring.


In case you didn’t already see this news, PostgreSQL has released its first set of minor versions for 2019, with updates for all supported PostgreSQL versions. As we indicated in our previous blog post, PostgreSQL 9.3 has gone EOL and will not receive any further updates. This release includes the following PostgreSQL major versions:

What’s new in this release?

One of the fixes applied to all the supported PostgreSQL versions is to panic instead of retrying after an fsync() failure. This fsync failure issue has been under discussion for a year or two now, so let’s take a look at the implications.

A fix to the Linux fsync issue for PostgreSQL Buffered IO in all supported versions

PostgreSQL performs two types of IO: Direct IO – though almost never – and the much more commonly performed Buffered IO.

PostgreSQL uses O_DIRECT when writing to WALs (Write-Ahead Logs, aka Transaction Logs) only when wal_sync_method is set to open_datasync or to open_sync with no archiving or streaming enabled. The default wal_sync_method may be fdatasync, which does not use O_DIRECT. This means that, almost all the time on your production database server, you’ll see PostgreSQL using O_SYNC / O_DSYNC while writing to WALs, whereas writing modified/dirty buffers from shared buffers to datafiles always goes through Buffered IO. Let’s understand this further.
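
A quick, illustrative way to check how a given server is configured (data_sync_retry is the parameter I believe was added alongside this fix in the new minor releases; verify against your exact version):

SHOW wal_sync_method;   -- fdatasync by default on Linux, so WAL writes do not use O_DIRECT
SHOW data_sync_retry;   -- 'off' means panic on fsync() failure rather than retrying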

Upon checkpoint, dirty buffers in shared buffers are written to the page cache managed by the kernel. Through an fsync(), these modified blocks are applied to disk. If an fsync() call is successful, all dirty pages from the corresponding file are guaranteed to be persisted on disk. When an fsync() is issued to flush the pages to disk, PostgreSQL cannot guarantee that it still has a copy of the modified/dirty pages, because writes to storage from the page cache are completely managed by the kernel, and not by PostgreSQL.

This could still be fine if

[...]
Posted by Paul Ramsey in PostGIS on 2019-02-22 at 13:00

Map projection is a core feature of any spatial database, taking coordinates from one coordinate system and converting them to another, and PostGIS has depended on the Proj library for coordinate reprojection support for many years.

Proj6 in PostGIS

For most of those years, the Proj library has been extremely slow moving. New projection systems might be added from time to time, and some bugs fixed, but in general it was easy to ignore. How slow was development? So slow that the version number migrated into the name, and everyone just called it “Proj4”.

No more.

Starting a couple years ago, new developers started migrating into the project, and the pace of development picked up. Proj 5 in 2018 dramatically improved the plumbing in the difficult area of geodetic transformation, and promised to begin changing the API. Only a year later, here is Proj 6, with yet more huge infrastructural improvements, and the new API.

Some of this new work was funded via the GDALBarn project, so thanks go out to those sponsors who invested in this incredibly foundational library and to GDAL maintainer Even Rouault.

For PostGIS that means we have to accommodate ourselves to the new API. Doing so not only makes it easier to track future releases, but also gains us access to the fancy new plumbing in Proj.


For example, Proj 6 provides:

Late-binding coordinate operation capabilities, that takes metadata such as area of use and accuracy into account… This can avoid in a number of situations the past requirement of using WGS84 as a pivot system, which could cause unneeded accuracy loss.

Or, put another way: more accurate results for reprojections that involve datum shifts.

Here’s a simple example, converting from an old NAD27/NGVD29 3D coordinate with height in feet, to a new NAD83/NAVD88 coordinate with height in metres.

SELECT ST_Astext(
         ST_Transform(
           ST_SetSRID(geometry('POINT(-100 40 100)'),7406), 
           5500));

Note that the height in NGVD29 is 100 feet; converted directly to metres, that would be 30.48 metres. The transformed po

[...]
Posted by Craig Kerstiens in CitusData on 2019-02-21 at 17:44

For those considering Citus, if your use case seems like a good fit, we often are willing to spend some time with you to help you get an understanding of the Citus database and what type of performance it can deliver. We commonly do this in a roughly two hour pairing session with one of our engineers. We’ll talk through the schema, load up some data, and run some queries. If we have time at the end it is always fun to load up the same data and queries into single node Postgres and see how we compare. After seeing this for years, I still enjoy seeing performance speed ups of 10x and 20x over a single node database, and in some cases as high as 100x.

And the best part is it didn’t take heavy re-architecting of data pipelines. All it takes is just some data modeling, and parallelization with Citus.

The first step is sharding

We’ve talked about this before, but the first key to these performance gains is that Citus splits up your data under the covers into smaller, more manageable pieces. These shards (which are standard Postgres tables) are spread across multiple physical nodes. This means that you have more collective horsepower within your system to pull from. When you’re targeting a single shard it is pretty simple: the query is re-routed to the underlying data and once it gets results it returns them.
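
As a rough sketch of what that looks like (the table and column names here are invented; create_distributed_table is the Citus function that performs the sharding):

CREATE TABLE page_views (
    site_id   bigint,
    url       text,
    viewed_at timestamptz
);

-- Split page_views into shards on site_id and spread them across the worker nodes.
SELECT create_distributed_table('page_views', 'site_id');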

Thinking in MapReduce

MapReduce has been around for a number of years now, and was popularized by Hadoop. The thing about large-scale data is that, in order to get timely answers from it, you need to divide up the problem and operate in parallel. Or you find an extremely fast system. The problem with getting a bigger and faster box is that data growth is outpacing hardware improvements.

(Figure: data growth vs. Moore’s law)

MapReduce itself is a framework for splitting up data, shuffling the data to nodes as needed, and then performing the work on a subset of data before recombining for the result. Let’s take an example like counting up total pageviews. If we wanted to leverage MapReduce on this we would split the pageviews into 4 separate buckets. We could do th

[...]
If PostgreSQL is the fastest growing database, then why is the community so small?

The database king continues its reign. For the second year in a row, PostgreSQL is still the fastest growing DBMS.

By comparison, in 2018 MongoDB was the second fastest growing, while Oracle, MySQL, and SQL Server all shrank in popularity.


For those who stay on top of news from database land, this should come as no surprise, given the number of PostgreSQL success stories that have been published recently:

Let’s all pat ourselves on the back, shall we? Not quite yet.

The PostgreSQL community by the numbers

As the popularity of PostgreSQL grows, attendance at community events remains small. This is the case even as more and more organizations and developers embrace PostgreSQL, so from our perspective, there seems to be a discrepancy between the size of the Postgres user base and that of the Postgres community.

The two main PostgreSQL community conferences are Postgres Conference (US) and PGConf EU. Below is a graph of Postgres Conference attendance for the last 5 years, with a projection for the Postgres Conference 2019 event occurring in March.


Last year, PGConf EU had around 500 attendees, a 100% increase since 4 years ago.

Combined, that’s about 1,100 attendees for the two largest conferences within the PostgreSQL community. By comparison, Oracle OpenWorld has about 60,000 attendees. Even MongoDB World had over 2,000 attendees in 2018.

We fully recognize that attendance at Postgres community events will be a portion of the user base. We also really enjoy these events and applaud the organizers for the effort they invest in running them. And in-person events may indeed be a lagging indicator of a system's growth in popularity. Let’s just grow faster!

One relatively new gathering point for the Postgres community is the P

[...]
Posted by Nickolay Ihalainen in Percona on 2019-02-21 at 14:05

Modern CPU models have a huge number of cores. For many years, applications have been sending queries in parallel to databases. Where there are reporting queries that deal with many table rows, the ability for a query to use multiple CPUs helps us with faster execution. Parallel queries in PostgreSQL allow us to utilize many CPUs to finish report queries faster. The parallel queries feature was implemented in version 9.6: starting from PostgreSQL 9.6, a report query is able to use many CPUs and finish faster.

The initial implementation of the parallel queries execution took three years. Parallel support requires code changes in many query execution stages. PostgreSQL 9.6 created an infrastructure for further code improvements. Later versions extended parallel execution support for other query types.
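
As a quick, illustrative sketch of what a parallel plan looks like (the table name and row count are made up):

SET max_parallel_workers_per_gather = 4;

CREATE TABLE big AS SELECT generate_series(1, 10000000) AS id;
ANALYZE big;

EXPLAIN (COSTS OFF) SELECT count(*) FROM big;
-- A parallel plan shows a Finalize Aggregate above a Gather node,
-- with Partial Aggregate -> Parallel Seq Scan running in the workers.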

Limitations

  • Do not enable parallel execution if all CPU cores are already saturated. Parallel execution steals CPU time from other queries, and increases response time.
  • Most importantly, parallel processing significantly increases memory usage with high WORK_MEM values, as each hash join or sort operation takes a work_mem amount of memory.
  • Next, low latency OLTP queries can’t be made any faster with parallel execution. In particular, queries that return a single row can perform badly when parallel execution is enabled.
  • The Pierian spring for developers is a TPC-H benchmark. Check if you have similar queries for the best parallel execution.
  • Parallel execution supports only SELECT queries without lock predicates.
  • Proper indexing might be a better alternative to a parallel sequential table scan.
  • There is no support for cursors or suspended queries.
  • Windowed functions and ordered-set aggregate functions are non-parallel.
  • There is no benefit for an IO-bound workload.
  • There are no parallel sort algorithms. However, queries with sorts still can be parallel in some aspects.
  • Replace CTE (WITH …) with a sub-select to support parallel execution.
  • Foreign data wrappers do not currently support parallel execution (but they c
[...]
Posted by Bruce Momjian in EnterpriseDB on 2019-02-20 at 20:00

Postgres supports two types of server-side languages, trusted and untrusted. Trusted languages are available for all users because they have safe sandboxes that limit user access. Untrusted languages are only available to superusers because they lack sandboxes.

Some languages have only trusted versions, e.g., PL/pgSQL. Others have only untrusted ones, e.g., PL/Python. Other languages like Perl have both.

Why would you want to have both trusted and untrusted languages available? Well, trusted languages like PL/Perl limit access to only safe resources, while untrusted languages like PL/PerlU allow access to file system and network resources that would be unsafe for non-superusers, i.e., it would effectively give them the same power as superusers. This is why only superusers can use untrusted languages.
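
A minimal sketch of the difference, using invented function bodies (creating the plperlu function requires superuser privileges):

CREATE EXTENSION IF NOT EXISTS plperl;
CREATE FUNCTION add_one(i integer) RETURNS integer AS $$
    return $_[0] + 1;        # pure computation, safe inside the trusted sandbox
$$ LANGUAGE plperl;

CREATE EXTENSION IF NOT EXISTS plperlu;
CREATE FUNCTION read_server_file(path text) RETURNS text AS $$
    open(my $fh, '<', $_[0]) or die "cannot open $_[0]";
    local $/; return <$fh>;  # touches the server's file system: untrusted only
$$ LANGUAGE plperlu;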

Continue Reading »

Posted by Paul Ramsey in Crunchy Data on 2019-02-20 at 16:38

New features and better performance get a lot of attention, but one of the relatively unsung improvements in PostGIS over the past ten years has been inclusion in standard software repositories, making installation of this fairly complex extension a "one click" affair.

Once you've got PostgreSQL/PostGIS installed though, how are upgrades handled? The key is having the right versions in place, at the right time, for the right scenario and knowing a little bit about how PostGIS works.
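
For instance, checking what is installed and then upgrading the extension in place, once new packages are present, looks roughly like this:

SELECT postgis_full_version();
ALTER EXTENSION postgis UPDATE;   -- or UPDATE TO 'x.y.z' for a specific target version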

First off – not trying to kindle any flame wars here, just trying to broaden my (your) horizons a bit, gather some ideas (maybe I’m missing out on something cool, it’s the most used Open Source RDBMS after all) and to somewhat compare the two, despite it being a difficult thing to do correctly / objectively. Also I’m leaving aside performance comparisons and looking just at the available features, general querying experience and documentation clarity, as this is, I guess, most important for beginners. So just a list of points I made for myself, grouped in no particular order.

Disclaimer: last time I used MySQL for some personal project it was 10 years ago, so basically I’m starting from zero and only took one and a half days to get to know it – thus if you see that I’ve gotten something screamingly wrong then please do leave a comment and I’ll change it. Also, my bias in this article probably tends to favour Postgres…but I’m pretty sure a MySQL veteran with good knowledge of pros and cons can write up something similar also on Postgres, so my hope is that you can leave this aside and learn a thing or two about either system.

To run MySQL I used the official Docker image, 8.0.14. References to MySQL below mean the default InnoDB engine.

docker run --rm -p 3306:3306 -e MYSQL_ROOT_PASSWORD=root mysql:8

“mysql” CLI (“psql” equivalent) and general querying experience

* When the server requires a password why doesn’t it just ask for it?

mysql -h 0.0.0.0 -u root   # adding '-p' will fix the error
ERROR 1045 (28000): Access denied for user 'root'@'172.17.0.1' (using password: NO)

* Very poor tab-completion compared to “psql”. Using “mycli” instead makes a lot of sense. I’m myself on CLIs 99% of the time, so it’s essential.
* A lot fewer shortcut helpers to list tables, views, functions, etc…
* Can’t set “extended output” (columns as rows) permanently, only “auto” and “per query”.
* One does not need to specify a DB to connect to – I find it positive actually, as it’s easy to forget those database names and, once in, one can call “sh

[...]
Posted by Stefan Fercot in Dalibo on 2019-02-20 at 00:00

pgBackRest is a well-known powerful backup and restore tool.

Relying on the status information given by the “info” command, we’ve built a specific plugin for Nagios: check_pgbackrest.

This post will help you discover this plugin; it assumes you already know pgBackRest and Nagios.


Let’s assume we have a PostgreSQL cluster with pgBackRest working correctly.

Given this simple configuration:

[global]
repo1-path=/some_shared_space/
repo1-retention-full=2

[mystanza]
pg1-path=/var/lib/pgsql/11/data

Let’s get the status of our backups with the pgbackrest info command:

stanza: mystanza
    status: ok
    cipher: none

    db (current)
        wal archive min/max (11-1): 00000001000000040000003C/000000010000000B0000004E

        full backup: 20190219-121527F
            timestamp start/stop: 2019-02-19 12:15:27 / 2019-02-19 12:18:15
            wal start/stop: 00000001000000040000003C / 000000010000000400000080
            database size: 3.0GB, backup size: 3.0GB
            repository size: 168.5MB, repository backup size: 168.5MB

        incr backup: 20190219-121527F_20190219-121815I
            timestamp start/stop: 2019-02-19 12:18:15 / 2019-02-19 12:20:38
            wal start/stop: 000000010000000400000082 / 0000000100000004000000B8
            database size: 3.0GB, backup size: 2.9GB
            repository size: 175.2MB, repository backup size: 171.6MB
            backup reference list: 20190219-121527F

        incr backup: 20190219-121527F_20190219-122039I
            timestamp start/stop: 2019-02-19 12:20:39 / 2019-02-19 12:22:55
            wal start/stop: 0000000100000004000000C1 / 0000000100000004000000F4
            database size: 3.0GB, backup size: 3.0GB
            repository size: 180.9MB, repository backup size: 177.3MB
            backup reference list: 20190219-121527F, 20190219-121527F_20190219-121815I

        full backup: 20190219-122255F
            timestamp start/stop: 2019-02-19 12:22:55 / 2019-02-19 12:25:47
            wal start/stop: 000000010000000500000000 / 00000001000000050000003
[...]
Posted by Hubert 'depesz' Lubaczewski on 2019-02-19 at 20:00
Recent change in layout of PG Docs broke my spider for why-upgrade.depesz.com. Today I got some time and decided to bite the bullet. I fixed the spider code, used it to get a new changelog, and while I was at it, made a couple of slight modifications to the site: display count of all changes that are there in given … Continue reading "why-upgrade updates"
On 16th of February 2019, Tom Lane committed patch: Allow user control of CTE materialization, and change the default behavior.   Historically we've always materialized the full output of a CTE query, treating WITH as an optimization fence (so that, for example, restrictions from the outer query cannot be pushed into it). This is appropriate … Continue reading "Waiting for PostgreSQL 12 – Allow user control of CTE materialization, and change the default behavior."

Barman 2.6 introduces support for geo-redundancy, meaning that Barman can now copy from another Barman instance, not just a PostgreSQL database.

Geographic redundancy (or simply geo-redundancy) is a property of a system that replicates data from one site (primary) to a geographically distant location as redundancy, in case the primary system becomes inaccessible or is lost. From version 2.6, it is possible to configure Barman so that the primary source of backups for a server can also be another Barman instance, not just a PostgreSQL database server as before.

Briefly, you can define a server in your Barman instance (passive, according to the new Barman lingo), and map it to a server defined in another Barman instance (primary).

All you need is an SSH connection between the Barman user in the primary server and the passive one. Barman will then use the rsync copy method to synchronise itself with the origin server, copying both backups and related WAL files, in an asynchronous way. Because Barman shares the same rsync method, geo-redundancy can benefit from key features such as parallel copy and network compression. Incremental backup will be included in future releases.

Geo-redundancy is based on just one configuration option: primary_ssh_command.

Our existing scenario

To explain how geo-redundancy works, we will use the following example scenario. We keep it very simple for now.

We have two identical data centres, one in Europe and one in the US, each with a PostgreSQL database server and a Barman server:

  • Europe:
    • PostgreSQL server: eric
    • Barman server: jeff, backing up the database server hosted on eric
  • US:
    • PostgreSQL server: duane
    • Barman server: gregg, backing up the database server hosted on duane

Let’s have a look at how jeff is configured to backup eric, by reading the content of the /etc/barman.d/eric.conf file:

[eric]
description = Main European PostgreSQL server 'Eric'
conninfo = user=barman-jeff dbname=postgres host=eric
ssh_command = ssh postgres@eric
backup_method = rsync
parallel_jobs = 4
r
[...]
Posted by Bruce Momjian in EnterpriseDB on 2019-02-18 at 16:30

SQL is a declarative language, meaning you specify what you want, rather than how to generate what you want. This leads to a natural language syntax, like the SELECT command. However, once you dig into the behavior of SELECT, it becomes clear that it is necessary to understand the order in which SELECT clauses are executed to take full advantage of the command.

I was going to write up a list of the clause execution ordering, but found this webpage that does a better job of describing it than I could. The ordering bounces from the middle clause (FROM) to the bottom to the top, and then the bottom again. It is hard to remember the ordering, but memorizing it does help in constructing complex SELECT queries.
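
As a rough illustration of that logical ordering (the table and columns are invented):

SELECT dept, count(*) AS n      -- 5. the SELECT list is evaluated
FROM employees                  -- 1. FROM comes first
WHERE active                    -- 2. then WHERE filters rows
GROUP BY dept                   -- 3. then rows are grouped
HAVING count(*) > 5             -- 4. then groups are filtered
ORDER BY n DESC                 -- 6. then the result is ordered
LIMIT 10;                       -- 7. and finally LIMIT/OFFSET is applied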

Posted by Robert Haas in EnterpriseDB on 2019-02-18 at 15:23
One of the things I sometimes get asked to do is review someone's postgresql.conf settings.  From time to time, I run across a configuration where the value of autovacuum_naptime has been increased, often by a large multiple.  The default value of autovacuum_naptime is 1 minute, and I have seen users increase this value to 1 hour, or in one case, 1 day.  This is not a good idea.  In this blog post, I will attempt to explain why it's not a good idea, and also something about the limited circumstances under which you might want to change autovacuum_naptime.

Read more »
Posted by Jonathan Katz in Crunchy Data on 2019-02-17 at 23:46

Common table expressions, aka CTEs, aka WITH queries, are not only the gateway to writing recursive SQL queries, but also help developers write maintainable SQL. WITH query clauses can help developers who are more comfortable writing in imperative languages to feel more comfortable writing SQL, and can also help reduce redundant code by reusing a particular common table expression multiple times in a query.

A new patch, scheduled to be part of the PostgreSQL 12 major release later in the year, introduces the ability, under certain conditions, to inline common table expressions within a query. This is a huge feature: many developers could suddenly see their existing queries speed up significantly, and it adds the ability to explicitly specify when to inline (i.e. the planner "substitutes" a reference to the CTE in the main query and can then optimize further) or, conversely, materialize (i.e. place the CTE into memory but lose out on certain planning & execution optimizations).

But why is this a big deal? Before we look into the future, first let's understand how WITH queries currently work in PostgreSQL.
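
As a rough sketch of the upcoming PostgreSQL 12 syntax (the table and column names are invented; today, without these keywords, every CTE is materialized):

WITH recent AS NOT MATERIALIZED (
    SELECT * FROM events WHERE created_at > now() - interval '1 day'
)
SELECT * FROM recent WHERE user_id = 42;   -- the outer filter can be pushed into the CTE

WITH per_user AS MATERIALIZED (
    SELECT user_id, count(*) AS n FROM events GROUP BY user_id
)
SELECT * FROM per_user WHERE n > 10;   -- forces the old behaviour: the CTE is computed in full first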

This email thread from 2017 asks the question of whether there is an imperative language that generates declarative output that can be converted into an imperative program and executed. Specifically, is there an imperative syntax that can output SQL (a declarative language) which can be executed internally (imperatively) by Postgres?

The real jewel in this email thread is from Peter Geoghegan, who has some interesting comments. First, he explains why developers would want an imperative language interface, even if it has to be converted to declarative:

Some developers don't like SQL because they don't have a good intuition for how the relational model works. While SQL does have some cruft — incidental complexity that's a legacy of the past — any language that corrected SQL's shortcomings wouldn't be all that different to SQL, and so wouldn't help with this general problem. QUEL wasn't successful because it was only somewhat better than SQL was at the time.

Continue Reading »

I wrote yesterday about Vitess, a scale-out sharding solution for MySQL. Another similar product is Citus, which is a scale-out sharding solution for PostgreSQL. Similar to Vitess, Citus is successfully being used to solve problems of scale and performance that have previously required a lot of custom-built middleware.

What’s new in PostgreSQL 11

PostgreSQL 11 was released four months ago and my review is long overdue. Here we go!

With respect to standard SQL, the main theme in PostgreSQL 11 is window functions (over). For almost eight years, from 2009 until 2017, PostgreSQL was the only major free open-source product to support SQL window functions. Just a year later, by September 2018, all open-source competitors had caught up…and some even overtook PostgreSQL. The PostgreSQL community was prepared: PostgreSQL 11, released in 2018, has restored and even expanded its leadership position.

This article explains this race and covers other improvements in PostgreSQL 11.

Complete SQL:2011 Over Clause

The over clause defines which rows are visible to a window function. Window functions were originally standardized with SQL:2003, and PostgreSQL has supported them since PostgreSQL 8.4 (2009). In some areas, the PostgreSQL implementation was less complete than the other implementations (range frames, ignore nulls), but in other areas it was the first major system to support them (the window clause). In general, PostgreSQL was pretty close to the commercial competitors, and it was the only major free database to support window functions at all—until recently.

In 2017, MariaDB introduced window functions. MySQL and SQLite followed in 2018. At that time, the MySQL implementation of the over clause was even more complete than that of PostgreSQL, a gap that PostgreSQL 11 closed. Furthermore, PostgreSQL is again the first to support some aspects of the over clause, namely the frame unit groups and frame exclusion. These are not yet supported by any other major SQL database—neither open-source, nor commercial.
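
A short sketch of those two additions (the table and columns are invented):

SELECT ts, value,
       sum(value) OVER (
           ORDER BY ts
           GROUPS BETWEEN 1 PRECEDING AND 1 FOLLOWING   -- the new "groups" frame unit: peer groups, not rows
           EXCLUDE CURRENT ROW                          -- frame exclusion
       ) AS neighbour_sum
FROM readings;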

The only over clause features not supported by PostgreSQL 11 are the pattern and related clauses. These clauses were only standardized with SQL:2016 and do framing based on a regular expression. No major database supports this framing yet.

Frame Units

Before looking into the new functionality in PostgreSQL 11, I’l

[...]

PostgreSQL is emerging as the standard destination for database migrations from proprietary databases. As a consequence, there is an increase in demand for database side code migration and associated performance troubleshooting. One might be able to trace the latency to a PL/pgSQL function, but explaining what happens within a function could be a difficult question. Things get messier when you know the function call is taking time, but within that function there are calls to other functions as part of its body. It is a very challenging question to identify which line inside a function—or block of code—is causing the slowness. In order to answer such questions, we need to know how much time an execution spends on each line or block of code. The plprofiler project provides great tooling and extensions to address such questions.

Demonstration of plprofiler using an example

The plprofiler source contains a sample for testing plprofiler. This sample serves two purposes: it can be used for testing the configuration of plprofiler, and it is a great place to see how to profile a nested function call. Files related to this are located inside the “examples” directory. Don’t worry—I’ll be running through the installation of plprofiler later in this article.

$ cd examples/

The example expects you to create a database with name “pgbench_plprofiler”

postgres=# CREATE DATABASE pgbench_plprofiler;
CREATE DATABASE

The project provides a shell script along with a source tree to test plprofiler functionality. So testing is just a matter of running the shell script.

$ ./prepdb.sh
dropping old tables...
....

Running session level profiling

This profiling uses session-level local data. By default the plprofiler extension collects runtime data in per-backend hashtables (in-memory). This data is only accessible in the current session, and is lost when the session ends or the hash tables are explicitly reset. plprofiler’s run command will execute the PL/pgSQL code and capture the profile information.

This is illustrated by the example below

[...]

I want to take a few minutes for a deep dive into the effect your data model has on storage density when using PostgreSQL. When this topic came up with a customer, I explained my thoughts on the matter, but I realized at the time that I had never done a reasonably careful apples-to-apples test to see just exactly what the effect is, at least for a model sample size of one. So here it is.

Posted by Bruce Momjian in EnterpriseDB on 2019-02-13 at 17:15

You might not be aware that you can store a virtual row, called a composite value, inside a database field. Composite values have their own column names and data types. This is useful if you want to group multiple statically-defined columns inside a single column. (The JSON data types are ideal for dynamically-defined columns.)
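
A minimal sketch (the type, table, and values are invented):

CREATE TYPE address AS (street text, city text, zip text);

CREATE TABLE customers (id int, addr address);

INSERT INTO customers VALUES (1, ROW('1 Main St', 'Springfield', '01101'));

SELECT (addr).city FROM customers;   -- parentheses are needed to reference a field of a composite column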

This email thread explains how to define and use them, I have a presentation that mentions them, and the Postgres manual has a section about them.

Posted by KUNTAL GHOSH in EnterpriseDB on 2019-02-13 at 08:23
When data are naturally aligned, the CPU can read from and write to memory efficiently. Hence, each data type in PostgreSQL has a specific alignment requirement. When multiple attributes are stored consecutively in a tuple, padding is inserted before an attribute so that it begins at the required aligned boundary. A better understanding of these alignment requirements may help minimize the amount of padding required while storing a tuple on disk, thus saving disk space.
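
A small illustration of how column order affects padding (the tables and values are invented):

CREATE TABLE padded  (a boolean, b bigint, c boolean, d bigint);  -- 7 padding bytes after each boolean to align the bigints
CREATE TABLE compact (b bigint, d bigint, a boolean, c boolean);  -- the booleans are packed together at the end

INSERT INTO padded  VALUES (true, 1, true, 2);
INSERT INTO compact VALUES (1, 2, true, true);

SELECT pg_column_size(p.*) AS padded_row, pg_column_size(c.*) AS compact_row
FROM padded p, compact c;   -- the reordered table stores the same data in a smaller tuple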

Data types in Postgres are divided into the following categories:
  • Pass-by-value, fixed length: Data types that are passed by value to Postgres internal routines and have fixed lengths fall into this category. The length can be 1, 2, or 4 (or 8 on 64-bit systems) bytes.
  • Pass-by-reference, fixed length: For these data types, an address reference from the in-memory heap page is sent to internal Postgres routines. They also have fixed lengths.
  • Pass-by-reference, variable length: For variable length data types, Postgres prepends a varlena header before the actual data. It stores some information about how the data is actually stored on-disk (uncompressed, compressed or TOASTed) and the actual length of the data. For TOASTed attributes, the actual data is stored in a separate relation. In these cases, the varlena header is followed by some information about the actual location of the data in its corresponding TOAST relation.
    Typically, the on-disk size of a varlena header is 1 byte. But, if the data cannot be toasted and the size of the uncompressed data crosses 126 bytes, it uses a 4-byte header. For example,
    CREATE TABLE t1 (a varchar);
    insert into t1 values(repeat('a',126));
    insert into t1 values(repeat('a',127));
    select pg_column_size(a) from t1;
     pg_column_size
    ----------------
                127
                131
    Besides, attributes having a 4-byte varlena header need to be aligned to a 4-byte aligned memory location. This may waste up to 3 bytes of additional padding space. So, some careful length restrictions on such columns may save space.
  • Pass-by_referen
[...]
Posted by Craig Kerstiens in CitusData on 2019-02-12 at 16:52

I’ve learned a lot of skills over the course of my career, but no technical skill more useful than SQL. SQL stands out to me as the most valuable skill for a few reasons:

  1. It is valuable across different roles and disciplines
  2. Learning it once doesn’t really require re-learning
  3. You seem like a superhero. You seem extra powerful when you know it because of the number of people who aren’t fluent

Let me drill into each of these a bit further.

SQL a tool you can use everywhere

Regardless of what role you are in, SQL will find a way to make your life easier. Today as a product manager it’s key for me to look at data, analyze how effective we’re being on the product front, and shape the product roadmap. If we just shipped a new feature, the data on whether someone has viewed that feature is likely somewhere sitting in a relational database. If I’m working on tracking key business metrics such as month over month growth, that is likely somewhere sitting in a relational database. At the other end of almost anything we do there is likely a system of record that speaks SQL. Knowing how to access it most natively saves me a significant amount of effort without having to go ask someone else for the numbers.

But even before becoming a product manager I would use SQL to inform me about what was happening within systems. As an engineer it could often allow me to pull information I wanted faster than if I were to script it in, say, Ruby or Python. When things got slow in my webapp, having an understanding of the SQL that was executed and ways to optimize it was indispensable. Yes, this was going a little beyond just a basic understanding of SQL… but adding an index to a query instead of rolling my own homegrown caching, well, that was well worth the extra time spent learning.

SQL is permanent

I recall roughly 20 years ago creating my first webpage. It was magical, and then I introduced some Javascript to make it even more impressive prompting users to click Yes/No or give me some input. Then about 10 years later jQuery came along and while

[...]

A feature of PostgreSQL that most people don’t even know exists is the ability to export and import transaction snapshots.

The documentation is accurate, but it doesn’t really describe why one might want to do such a thing.

First, what is a “snapshot”? You can think of a snapshot as the current set of committed tuples in the database, a consistent view of the database. When you start a transaction and set it to REPEATABLE READ mode, the snapshot remains consistent throughout the transaction, even if other sessions commit transactions. (In the default transaction mode, READ COMMITTED, each statement starts a new snapshot, so newly committed work could appear between statements within the transaction.)

However, each snapshot is local to a single transaction. But suppose you wanted to write a tool that connected to the database in multiple sessions, and did analysis or extraction? Since each session has its own transaction, and the transactions start asynchronously from each other, they could have different views of the database depending on what other transactions got committed. This might generate inconsistent or invalid results.

This isn’t theoretical: Suppose you are writing a tool like pg_dump, with a parallel dump facility. If different sessions got different views of the database, the resulting dump would be inconsistent, which would make it useless as a backup tool!

The good news is that we have the ability to “synchronize” various sessions so that they all use the same base snapshot.

First, a transaction opens and sets itself to REPEATABLE READ or SERIALIZABLE mode (there’s no point in doing exported snapshots in READ COMMITTED mode, since the snapshot will get replaced at the very next statement). Then, that session calls pg_export_snapshot. This creates an identifier for the current transaction snapshot.
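
Roughly, the two sides look like this (the snapshot identifier shown is only an example of the format):

-- Session 1: export the snapshot
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot();      -- returns something like '00000003-0000001B-1'

-- Session 2: adopt the same snapshot, before running anything else in the transaction
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';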

Then, the client running the first session passes that identifier to the clients that will be using it. You’ll need to do this via some non-database channel. For example, you can’t use LISTEN / NOTIFY,

[...]
Posted by Bruce Momjian in EnterpriseDB on 2019-02-11 at 20:00

I saw AT TIME ZONE used in a query, and found it confusing. I read the Postgres documentation and was still confused, so I played with some queries and finally figured it out. I then updated the Postgres documentation to explain it better, and here is what I found.

First, AT TIME ZONE has two capabilities. It allows time zones to be added to date/time values that lack them (timestamp without time zone, ::timestamp), and allows timestamp with time zone values (::timestamptz) to be shifted to non-local time zones and the time zone designation removed. In summary, it allows:

  1. timestamp without time zone → timestamp with time zone (add time zone)
  2. timestamp with time zone → timestamp without time zone (shift time zone)

It is kind of odd for AT TIME ZONE to be used for both purposes, but the SQL standard requires this.
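
A short sketch of both directions (the literals and zone names are arbitrary examples):

-- 1. timestamp without time zone -> timestamp with time zone: interpret a zone-less value as Moscow local time
SELECT '2019-02-11 12:00:00'::timestamp AT TIME ZONE 'Europe/Moscow';

-- 2. timestamp with time zone -> timestamp without time zone: show the value as Moscow wall-clock time,
--    with the time zone designation removed
SELECT '2019-02-11 12:00:00-05'::timestamptz AT TIME ZONE 'Europe/Moscow';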

Continue Reading »

Posted by Regina Obe in PostGIS on 2019-02-11 at 08:31

We recently installed PostgreSQL 11 on an Ubuntu 18.04 using apt.postgresql.org. Many of our favorite extensions were already available via apt (postgis, ogr_fdw to name a few), but it didn't have the http extension we use a lot. The http extension is pretty handy for querying things like Salesforce and other web api based systems. We'll outline the basic compile and install steps. While it's specific to the http extension, the process is similar for any other extension you may need to compile.

Continue reading "Compiling http extension on ubuntu 18.04"
Posted by Alexey Lesovsky in Data Egret on 2019-02-09 at 19:34
Great news for all pgCenter users - a new version 0.6.0 has been released, with new features and a few minor improvements.

Here are some major changes:

  • new wait events profiler - a new sub-command which allows you to inspect long-running queries and understand what a query spends its time on.
  • goreleaser support - goreleaser helps to build binary packages for you, so you can find .rpm and .deb packages on the releases page.
  • Goreport card A+ status - A+ status is a small step toward making the code better and aligning it with Golang code style.
This release also includes the following minor improvements and fixes:
  • the report tool now has a full help list of supported stats; you can, at any time, get a descriptive explanation of the stats provided by pgCenter. Check out the “--describe” flag of “pgcenter report”;
  • “pgcenter top” has been fixed and now includes configurable aligning of columns, which makes stats viewing more enjoyable (check out the builtin help for new hotkeys);
  • wrong handling of the group mask has been fixed. It is used for canceling groups of queries, or for terminating groups of backends;
  • also fixed: the issue where pgCenter failed to connect to Postgres with SSL disabled;
  • and some other minor internal refactoring was done.
The new release is available here. Check it out and have a nice day.

PostgreSQL has two autovacuum-age related settings, autovacuum_freeze_max_age, and vacuum_freeze_table_age.

Both of them are in terms of the transaction “age” of a table: that is, how long it has been since the table was last scanned completely for “old” tuples that can be marked as “frozen” (a “frozen” tuple is one that no open transaction can cause to disappear by a rollback). In short, the “oldest” a table can become in PostgreSQL is 2^31-1 transactions; if a table were ever to reach that, data loss would occur. PostgreSQL takes great pains to prevent you from reaching that point.
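
To see how this age looks in practice, an illustrative query:

SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;   -- the tables closest to needing a vacuum freeze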

The “vacuum freeze” process is the process that scans the table and marks these tuples as frozen.

vacuum_freeze_table_age causes a regular autovacuum run to be an “autovacuum (to prevent xid wraparound)” run, that is, an (auto)vacuum freeze, if the age of the table is higher than vacuum_freeze_table_age.

autovacuum_freeze_max_age will cause PostgreSQL to start an “autovacuum (to prevent xid wraparound)” run even if it has no other reason to vacuum the table, should a table age exceed that setting.

By default, vacuum_freeze_table_age = 100000000 (one hundred million), and autovacuum_freeze_max_age = 200000000 (two hundred million).

Do not change them.

In the past, I made a recommendation I now deeply regret. Because, before 9.6, each autovacuum freeze run scanned the entire table, and (on its first pass) potentially rewrote the entire table, it could be very high I/O, and when it woke up suddenly, it could cause performance issues. I thus recommended two things:

  1. Increase autovacuum_freeze_max_age and vacuum_freeze_table_age, and,
  2. Do manual VACUUM FREEZE operations on the “oldest” tables during low-traffic periods.

Unfortunately, far too many installations adopted recommendation #1, but didn’t do #2. The result was that they cranked up autovacuum_freeze_max_age so high that by the time the mandatory autovacuum freeze operation began, they were so close to transaction XID wraparound point, they had no choice but to take the system offl

[...]

Extensions are capable of extending, changing, and advancing the behavior of Postgres. How? By hooking into low level Postgres API hooks. The open source Citus database that scales out Postgres horizontally is itself implemented as a PostgreSQL extension, which allows Citus to stay current with Postgres releases without lagging behind like other Postgres forks. I’ve previously written about the various types of extensions, today though I want to take a deeper look at the most useful Postgres extension: pg_stat_statements.

You see, I just got back from FOSDEM. FOSDEM is the annual free and open source software conference in Brussels, and at the event I gave a talk in the PostgreSQL devroom about Postgres extensions. By the end of the day, over half the talks that had been given in the Postgres devroom mentioned pg_stat_statements.

If you use Postgres and you haven’t yet used pg_stat_statements, it is a must to add it to your toolbox. And even if you are familiar, it may be worth a revisit.

Getting started with pg_stat_statements

Pg_stat_statements is what is known as a contrib extension, found in the contrib directory of a PostgreSQL distribution. This means it already ships with Postgres and you don’t have to go and build it from source or install packages. You may have to enable it for your database if it is not already enabled. This is as simple as:

CREATE EXTENSION pg_stat_statements;

If you run on a major cloud provider there is a strong likelihood they have already installed and enabled it for you.
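
Once it has collected some data, a typical first look is something like this (column names as shipped with the Postgres 11-era version of the extension):

SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;   -- the ten normalized queries consuming the most total execution time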

Once pg_stat_statements is installed, it begins silently going to work under the covers. Pg_stat_statements records queries that are run against your database, strips out a number of va

[...]