Latest Blog Posts

Postgres Synchronous Replication - a 99.99% guarantee only
Posted by Kaarel Moppel on 2024-12-08 at 22:00
At last week’s local Postgres user group meetup here in Estonia, one of the topics was HA and recent improvements in Patroni (the most popular cluster manager for Postgres) around supporting quorum commit, which, by the way, has been possible to use on its own for years. Things went deep quickly...

Postgres Partitioning with a Default Partition
Posted by Keith Fiske in Crunchy Data on 2024-12-06 at 13:30

Partitioning is an important database maintenance strategy for a growing application backed by PostgreSQL. As one of the main authors of pg_partman and an engineer here at Crunchy Data, I spend a lot of my time helping folks implement partitioning. One of the nuances of PostgreSQL’s partitioning implementation is the default partition, which I’ll dig into in this post and discuss how to use it effectively.

Why default partitions are important

The default partition is pretty much what it sounds like; you can make a special partition designated as the DEFAULT, which will capture any and all data that does not have an existing partition with matching boundary constraints.

If you’re new to partitioning, you might start out making partitions a week in advance, only to realize after some monitoring that you need to make them more like two weeks in advance. The default partition can help you learn how to manage your child partitions and when to create them.

Default partitions are also there to catch mistakes. Maybe there’s an issue in application code putting timestamps a hundred years into the future instead of one year. Maybe there’s just some bad data getting created. Your default partition can help you spot that.

While having a default partition is a good idea, you don’t actually want to leave data in there. I’ll show you some tips later on about how to monitor the default for the presence of any rows. When you find data in there, you’ll want to evaluate whether the data is valid, and if it is, create the relevant child partitions and move the data there.
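
To make that concrete, here is a minimal sketch (my own, not from the post; table, partition, and column names are invented) of checking a default partition for stray rows and then relocating them into a proper child. pg_partman's partition_data functions automate roughly this:

-- Hypothetical daily-partitioned table "events" with default partition "events_default"
SELECT count(*) FROM events_default;   -- any rows here need attention

BEGIN;
ALTER TABLE events DETACH PARTITION events_default;
CREATE TABLE events_p2024_12_25 PARTITION OF events
    FOR VALUES FROM ('2024-12-25') TO ('2024-12-26');
INSERT INTO events                                 -- rows route into the new child
    SELECT * FROM events_default
    WHERE created_at >= '2024-12-25' AND created_at < '2024-12-26';
DELETE FROM events_default
    WHERE created_at >= '2024-12-25' AND created_at < '2024-12-26';
ALTER TABLE events ATTACH PARTITION events_default DEFAULT;
COMMIT;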

Adding a default

PostgreSQL declarative partitioning does not create any child partitions automatically, including the default. pg_partman can help with that and we’ll discuss that later.

Here we have a daily partition set that has been created but does not yet have a default.

                                      Partitioned table "partman_test.time_taptest_table"
 Column |           Type           | Collation | Nullable |    Default    | Storage  | Compr
[...]
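
(Side note on the excerpt above: adding the missing default is a single statement; the child table name below is a guess, not taken from the post.)

CREATE TABLE partman_test.time_taptest_table_default
    PARTITION OF partman_test.time_taptest_table DEFAULT;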

Highlights from PostgreSQL Conference 2024 in Seattle
Posted by cary huang in Highgo Software on 2024-12-05 at 22:36

Introduction

PostgreSQL Conference was held on November 6th and 7th, 2024 as part of PASS Data Community Summit in Seattle. Bringing together database enthusiasts, developers, and industry professionals from around the world, the event offered an exceptional platform to delve into all things database.

Although nearly a month has passed since I had the opportunity to attend, the information shared during those two days has left a lasting impression. This article serves as a summary of most of the talks I participated in. Overall, it was a great experience!

Understanding Database Statistics – Anita Singh, Raj Jayakrishnan

  • there is an entry in pg_stats for each user table (one row per analyzed column)
  • statistics are used by the planner to estimate and optimize queries
  • metrics like the fraction of nulls, average width, number of distinct values, the most common values per column and their frequencies, correlations, etc. are recorded
  • running ANALYZE collects these metrics, and EXPLAIN shows them being used in query planning
  • we can adjust a particular column’s statistics target using ALTER TABLE ... ALTER COLUMN ... SET STATISTICS
  • extended statistics can be used to correct and improve the planner’s decisions
  • the speakers provided examples of analyzing common PostgreSQL cases (a generic sketch of the relevant commands follows below)
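
A generic sketch of those commands (my own illustration using a hypothetical orders table, not the speakers' material):

ANALYZE orders;                                    -- collect/refresh statistics

SELECT attname, null_frac, avg_width, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'orders';                        -- inspect per-column statistics

ALTER TABLE orders ALTER COLUMN status SET STATISTICS 500;   -- adjust the per-column target

CREATE STATISTICS orders_dep (dependencies) ON customer_id, region FROM orders;
ANALYZE orders;                                    -- extended statistics for correlated columns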

PostgreSQL for Oracle DBA – Thuymy Tran

  • introduced the major differences between Oracle and PostgreSQL
  • PostgreSQL uses MVCC, suffers from table bloat if vacuum does not run in time, depends on transaction IDs to determine visibility, and has transaction ID wraparound issues
  • Oracle has separate storage for UNDO, determines data visibility physically rather than logically, and has no special need to clean up table or index bloat
  • the concept of a tablespace is different
    • in Oracle it is a logical grouping of tables
    • in PostgreSQL it maps to a separate physical directory for storage
  • PostgreSQL has more index types to select from
  • PostgreSQL has no support for SYNONYM, but Oracle does
  • postgresql suppor
[...]

Is pg_dump a backup tool?
Posted by Gülçin Yıldırım Jelínek in Xata on 2024-12-05 at 00:00
I was invited to the Postgres.fm podcast together with Robert Haas, hosted by Michael Christofides and Nikolay Samokhvalov, to discuss the technical aspects of whether pg_dump is a backup tool, and in what ways it is or is not.

Counter Analytics in PostgreSQL: Beyond Simple Data Denormalization
Posted by Jônatas Paganini in Timescale on 2024-12-04 at 21:42

If you've been working with PostgreSQL, you've probably seen memes advocating for denormalized counters instead of counting related records on demand. The debate usually looks like this:

-- The "don't do this" approach: counting related records on demand
SELECT COUNT(*) FROM post_likes WHERE post_id = $1;
-- The "do this instead" approach: maintaining a denormalized counter
SELECT likes_count FROM posts WHERE post_id = $1;

Let's break down these approaches. In the first approach, we calculate the like count by scanning the post_likes table each time we need the number. In the second approach, we maintain a pre-calculated counter in the posts table which we update whenever someone likes or unlikes a post.
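
For readers who haven't built one: a minimal sketch of how such a counter is usually kept in sync (trigger-based; the posts/post_likes names follow the snippet above, the trigger itself is mine, not the article's):

CREATE OR REPLACE FUNCTION bump_likes_count() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE posts SET likes_count = likes_count + 1 WHERE post_id = NEW.post_id;
  ELSIF TG_OP = 'DELETE' THEN
    UPDATE posts SET likes_count = likes_count - 1 WHERE post_id = OLD.post_id;
  END IF;
  RETURN NULL;  -- AFTER trigger: the return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER post_likes_counter
AFTER INSERT OR DELETE ON post_likes
FOR EACH ROW EXECUTE FUNCTION bump_likes_count();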

The denormalized counter approach is often recommended for OLTP (online transaction processing) workloads because it trades write overhead for read performance. Instead of executing a potentially expensive COUNT query that needs to scan the entire post_likes table, we can quickly fetch a pre-calculated number. 

This is particularly valuable in social media applications, where like counts are frequently displayed but rarely updated—you're showing like counts on posts much more frequently than users are actually liking posts.

However, when we enter the world of time-series data and high-frequency updates, this conventional wisdom needs a second look. Let me share an example that made me reconsider this approach while working with a PostgreSQL database optimized for time series via the TimescaleDB extension.

While this advice might make sense for traditional OLTP workloads, when working with time-series data in TimescaleDB, we need to take a different approach to data modeling.

Counter Analytics vs. Data Denormalization and Its Limitations

Let's start with a common scenario: tracking post likes in a social media application. The traditional data denormalization approach m

[...]

Unique Constraint Violations During Inserts Cause Bloat in PostgreSQL
Posted by Josef Machytka in credativ on 2024-12-04 at 11:55

The issue of table and index bloat due to failed inserts on unique constraints is well known and has been discussed in various articles across the internet. However, these discussions sometimes lack a clear, practical example with measurements to illustrate the impact. And despite the familiarity of this issue, we still frequently see this design pattern (or rather, anti-pattern) in real-world applications. Developers often rely on unique constraints to prevent duplicate values from being inserted into tables. While this approach is straightforward, versatile, and generally considered effective, in PostgreSQL, inserts that fail due to unique constraint violations unfortunately always lead to table and index bloat. And on high-traffic systems, this unnecessary bloat can significantly increase disk I/O and the frequency of autovacuum runs. In this article, we aim to highlight this problem once again and provide a straightforward example with measurements to illustrate it. We suggest a simple improvement that can help mitigate this issue and reduce autovacuum workload and disk I/O.

Two Approaches to Duplicate Prevention

In PostgreSQL, there are two main ways to prevent duplicate values using unique constraints:

1. Standard Insert Command (INSERT INTO table)

The usual INSERT INTO table command attempts to insert data directly into the table. If the insert would result in a duplicate value, it fails with a “duplicate key value violates unique constraint” error. Since the command does not specify any duplicate checks, PostgreSQL internally immediately inserts the new row and only then begins updating indexes. When it encounters a unique index violation, it triggers the error and deletes the newly added row. The order of index updates is determined by their relation IDs, so the extent of index bloat depends on the order in which indexes were created. With repeated “unique constraint violation” errors, both the table and some indexes accumulate deleted records leading to bloat, and the resulting write oper

[...]
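
The excerpt is cut off above, but for orientation, the two forms it contrasts look roughly like this (hypothetical table; ON CONFLICT is the usual alternative to letting the error fire, though the article's exact recommendation isn't visible in the excerpt):

CREATE TABLE users (email text PRIMARY KEY, name text);

-- Plain INSERT: a duplicate raises
-- ERROR: duplicate key value violates unique constraint "users_pkey"
INSERT INTO users VALUES ('a@example.com', 'Alice');

-- INSERT ... ON CONFLICT: the duplicate is skipped instead of raising an error
INSERT INTO users VALUES ('a@example.com', 'Alice')
ON CONFLICT (email) DO NOTHING;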

3 Essential PostgreSQL Priorities for 2025
Posted by Umair Shahid in Stormatics on 2024-12-04 at 10:05

As IT budgets tighten and workloads increase, 2025 is the year to focus on maximizing PostgreSQL efficiency, security, and reliability. Whether you are running fully-managed or self-managed PostgreSQL databases, these three priorities

- Reducing cloud costs
- Increasing data security, and
- Enhancing availability

will be key to staying competitive.

Here is a deep dive into each priority and actionable steps to make them a reality.

1. Reduce Cloud Costs Without Compromising Performance

Cloud costs can escalate quickly when PostgreSQL instances are not optimized for the workload. Here is how to implement cost-saving measures with technical precision:

Instance Sizing and Scaling

Analyze Workload Patterns: Use tools like pg_stat_activity and pg_stat_user_tables to identify peak usage and idle times. Leverage this data to choose the right instance type and size.

Autoscaling with Load Balancers: Deploy PostgreSQL in a cloud environment using managed services that support autoscaling or set up custom scaling policies.

Storage and Index Optimization

Partitioning: Use table partitioning to manage large datasets efficiently and reduce query processing times. For instance, partition large logs by time, and ensure that queries use partition pruning.

Index Tuning: Remove redundant indexes using pg_stat_user_indexes and optimize index types (e.g., switching from B-Tree to GiST or GIN indexes for specific queries). This reduces storage requirements and speeds up query performance.
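
A typical query for this kind of check (an illustration of the idea, not from the post), listing indexes that have never been scanned:

SELECT schemaname, relname, indexrelname,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
-- Note: indexes that enforce unique or primary key constraints may legitimately
-- show zero scans, so review before dropping anything.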

Query Optimization

EXPLAIN and ANALYZE: Run slow queries through EXPLAIN to pinpoint inefficiencies. Common culprits include sequential scans on large tables and inefficient join strategies with large datasets.
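
For instance (a generic illustration, not taken from the post):

EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.region = 'EU';
-- Watch for sequential scans on large tables and for large gaps between
-- estimated and actual row counts in the output.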

Caching Frequently Accessed Data: Use tools like pgpool-II to enable query result caching and connection pooling, minimizing redundant query execution.

These optimizations not only reduce costs but also improve overall database responsiveness.

The post 3 Essential PostgreSQL

[...]

Performance archaeology: OLAP
Posted by Tomas Vondra on 2024-12-03 at 12:00

A couple days ago I wrote about performance improvements on OLTP workloads since Postgres 8.0, released 20 years ago. And I promised to share a similar analysis about analytical workloads in a follow-up post. So here we go ;-) Let me show you some numbers from a TPC-H benchmark, with some basic commentary and thoughts about the future.

compiled win x64 dll of orafce-4.14.0 and plpgsql_check-2.7.12 for PostgreSQL 16 and 17
Posted by Pavel Stehule on 2024-12-02 at 21:02

I compiled and uploaded zip files with the latest orafce and plpgsql_check for PostgreSQL 16 and PostgreSQL 17, built with Microsoft Visual C++ 2022.

Setup:

  1. download orafce-4.14.0-x64.zip or plpgsql_check-2.7.12-x64.zip and extract the files
  2. copy the related dll file to the PostgreSQL lib directory (NN is the PostgreSQL major version)
    orafce-NN.dll -> "c:\Program Files\PostgreSQL\NN\lib"
  3. remove the suffix "x64-16" or "x64-17" from the dll file name
    orafce-NN.dll -> orafce.dll
  4. copy the *.sql and *.control files to the extension directory
    *.sql, *.control -> "c:\Program Files\PostgreSQL\NN\share\extension"
  5. execute the SQL command CREATE EXTENSION with superuser rights
    CREATE EXTENSION orafce;

This was the first time I used the meson build system, and I was able to run the regression tests. Fortunately, meson removes a lot of the monkey work; unfortunately not all of it: I didn't find a way to use ninja install with UAC.

Strange data type transformations
Posted by Frederic Delacourt in Data Bene on 2024-12-02 at 00:00

When your function argument types are loosely changed

This article results from a code review I did for a customer.

Our customer created a pg_dump --schema-only of the target database to provide
me with the plpgsql code and database object structures to review. So far
so good.

I started to read the code and then became puzzled. The code looks like this:

CREATE FUNCTION xxx( p_id character, p_info character varying )
RETURNS integer
LANGUAGE plpgsql
AS $$
DECLARE
BEGIN
   ...
   INSERT INTO t1
   SELECT * FROM t2 WHERE t2.id = p_id;
   ...
END;
$$
;

Maybe you saw nothing wrong with the function. Perhaps knowing the table
definition will help:

CREATE TABLE t2 (
   id VARCHAR(130) NOT NULL
   ...
   PRIMARY KEY (id)
);

t2.id is always 130 characters long (in practice) and there are 400 million tuples.
So as you may have guessed, it seems odd to have the p_id CHARACTER matching id VARCHAR(130).
Moreover CHARACTER is the same as CHAR(1).

Our customer had not seen any issues with the code for years. Nevertheless, our customer told me that the function definition he wrote was not like that: it was meant to be p_id CHARACTER(130) - not CHARACTER.

So what went wrong? Let’s test around because it’s fun.

CREATE FUNCTION test( c character, d character varying )
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
  RAISE NOTICE 'c=%, d=%', c,d;
END;
$$;

SELECT test( '123465789', '987654321' );
NOTICE:  c=123465789, d=987654321
 test 
------
 
(1 row)

We have an interesting result here: no casting to CHAR(1) has been done.
Let’s see more details:

EXPLAIN (COSTS OFF,ANALYZE,VERBOSE)
        SELECT test( '123465789', '987654321' );
NOTICE:  c=123465789, d=987654321
                             QUERY PLAN                              
---------------------------------------------------------------------
 Result (actual time=0.040..0.041 rows=1 loops=1)
   Output: test('123465789'::bpchar, '987654321'::character varying)
 Planning Time: 0.023 ms
 Execution Time: 0.053 ms
(4 rows)

We can se

[...]
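
A hedged aside on the excerpt above: as far as I know, CREATE FUNCTION discards type modifiers on parameter types, which would explain how a declared character(130) ends up dumped as plain character. A quick way to check (my own test, not from the article):

CREATE FUNCTION test2( c character(130) ) RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
  RAISE NOTICE 'c=%', c;
END;
$$;

-- \df test2 (or pg_get_functiondef) reports the argument simply as "character";
-- the (130) modifier is not stored, so pg_dump cannot reproduce it either.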

PostGIS Day 2024 Summary
Posted by REGINA OBE in PostGIS on 2024-12-01 at 22:59

PostGIS Day, the yearly conference sponsored by Crunchy Data, is my favorite conference of the year because it's the only conference where I get to pig out on PostGIS content and meet fellow passionate PostGIS users pushing the envelope of what is possible with PostGIS and, by extension, PostgreSQL. Sure, FOSS4G conferences do have a lot of PostGIS content, but that content is never quite so front and center as it is at PostGIS Day. The fact that it's virtual means I can attend in pajamas and robe, everything is recorded, and the videos come out fairly quickly. In fact, the PostGIS Day 2024 videos are available now in case you wanted to see what all the fuss is about.

Continue reading "PostGIS Day 2024 Summary"

Substring function Regex style
Posted by Regina Obe in PostGIS on 2024-12-01 at 21:37

I was reviewing some old code when I stumbled across something I must have completely forgotten, or that someone else knew some time ago: it is possible to use the substring function for regular expression work.
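
For example (my own quick illustration, not from the post):

-- POSIX regular expression flavor of substring():
SELECT substring('PostGIS 3.5.0 on PostgreSQL 17' from '[0-9]+\.[0-9]+\.[0-9]+');
-- returns '3.5.0'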

Continue reading "Substring function Regex style"

New PGXN Mastodon Account
Posted by David E. Wheeler on 2024-12-01 at 16:02

Sadly, the home of the PGXN Mastodon bot for the last two years, botsin.space, is shutting down. I’ve created a new account, @pgxn@mastodon.social, and moved all the followers over. Please give it a follow if you didn’t follow the old account, and stay up-to-date on the latest PGXN releases!

SQL best practices – don’t compare count(*) with 0
Posted by Hubert 'depesz' Lubaczewski on 2024-12-01 at 09:21
Every now and then I see something like this: SELECT u.* FROM users u WHERE 0 = (SELECT COUNT(*) FROM addresses a WHERE a.user_id = u.id); and it kinda pains me. So I figured I'd write about what the problem with it is, and how to avoid such constructs. Let's consider what went through someone's mind … Continue reading "SQL best practices – don't compare count(*) with 0"
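
The usual rewrite (my sketch; the post walks through the reasoning) uses NOT EXISTS instead of comparing a count with zero:

SELECT u.*
FROM users u
WHERE NOT EXISTS (
    SELECT 1 FROM addresses a WHERE a.user_id = u.id
);
-- EXISTS can stop at the first matching row instead of counting them all.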

The PostgreSQL operator labyrinth
Posted by Ashutosh Bapat on 2024-11-29 at 13:29

While working on the SQL/PGQ patch, I wanted to find an equality operator for given left and right argument types, to construct a condition matching an edge with its adjacent vertexes. It looked as simple as calling the C function oper() with "=" as the operator and the required left and right data types. But it soon turned out to be a walk through PostgreSQL's operator labyrinth, which held my equality operator at its center instead of the Minotaur.

First and foremost, in PostgreSQL '=' does not necessarily mean an equality operator. It's simply the name of an operator conventionally used for comparing operands for equality. One could get swallowed by the Sphinx for that. So oper() is useless. Equality operators are instead identified by equality strategy numbers like HTEqualStrategyNumber, BTEqualStrategyNumber, RTEqualStrategyNumber and so on. But there's no C function which would provide you an equality operator given the strategy number and the data types of the left and right operands. Suddenly I found myself trapped in the index labyrinth, since BT, HT and RT relate to b-tree, hash and R-tree indexes respectively. Now all I was doing was begging to get out of the labyrinth rather than finding the answer to my seemingly simple question. But this Wit-Sharpening Potion helped me find my path out of the labyrinth and also answered my question.

The path is surprisingly simple: Index -> Operator Class -> Operator Family -> Operator. Like Daedalus's labyrinth, it's unicursal, but with a four-course design instead of seven. An index needs operators to compare values of a column or an indexed expression. All values being indexed are of the same datatype. An operator class holds all the required comparison operators for that datatype. However, a value being searched for or compared against in that index may not necessarily have the same datatype. For example, an index may be on a column of type int4 but could still be used to search for a value of type int2. PostgreSQL requires different operators for different pairs of operand data types as the semantics

[...]
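
To make the Index -> Operator Class -> Operator Family -> Operator path concrete, here is a catalog query (my own sketch, not from the post) that walks pg_amop to find the btree equality operator for an int4 column compared against an int2 value:

SELECT amopopr::regoperator
FROM pg_amop
WHERE amopmethod = (SELECT oid FROM pg_am WHERE amname = 'btree')
  AND amoplefttype  = 'int4'::regtype
  AND amoprighttype = 'int2'::regtype
  AND amopstrategy  = 3;        -- BTEqualStrategyNumber
-- returns =(integer,smallint)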

vip-manager v2.8 meets Patroni REST API
Posted by Pavlo Golub in Cybertec on 2024-11-28 at 09:01

vip-manager has long been a crucial tool for managing virtual IPs in PostgreSQL HA setups, providing reliable leader routing. With the release of version 2.8.0, we're taking things to the next level by introducing native integration with the Patroni REST API. This new feature adds flexibility and simplifies configuration, making it easier than ever to manage your PostgreSQL HA clusters.

Main Feature: Patroni REST API Integration

In this release, vip-manager can now directly interact with Patroni via its REST API. This integration removes the need for complex custom setups, enabling more dynamic and flexible management of virtual IPs.

What does this mean for you?

  • Native Integration: vip-manager now works seamlessly with Patroni clusters without additional configurations for etcd or consul.
  • Huge Flexibility: The tool isn't limited to managing virtual IPs for primary nodes anymore; it can also handle replicas.
  • Simplified Configuration: With default values in place (e.g., localhost for DCS endpoints), setting up vip-manager is quicker and easier than ever.

Example Usage of Patroni REST API

Here are a couple of examples to showcase the new feature:

Managing Virtual IP for the Primary Node

This command configures vip-manager to set up a virtual IP for the primary node:

vip-manager --ip=10.0.0.42 --interface=eth0 \ 
--trigger-key="/leader" --trigger-value=200 --dcs-type=patroni

What does it do?

  • --ip=10.0.0.42: The virtual IP to be managed.
  • --interface=eth0: The network interface where vip-manager will set up the virtual IP.
  • --trigger-key="/leader": The endpoint in Patroni's REST API to check if the node is the leader.
  • --trigger-value=200: The HTTP status code indicating success.
  • --dcs-type=patroni: Specifies Patroni as the source of truth (the dcs prefix remains for historical reasons).

If the current host is the leader, according to Patroni, the virtual IP will be assigned.

Managing Vi

[...]

Combining cloud storage and dedicated backup host with pgBackRest
Posted by Stefan Fercot in Data Egret on 2024-11-28 at 07:30

pgBackRest is a popular backup and restore tool for PostgreSQL, known for easily handling even the largest databases and workloads. It’s packed with powerful features, but all that flexibility can sometimes feel a bit overwhelming.

In my earlier posts, I showed how to take backups from a standby server and how to set up a dedicated backup host. But there’s another great feature we haven’t explored yet: pgBackRest can store backups in cloud storage like S3, Azure, or Google Cloud.

Most people choose one option: either a backup host or cloud storage. But did you know you can use both at the same time? This gives you even more flexibility in your backup strategy.

Let’s pick up where we left off. We’ve got a PostgreSQL cluster with a primary server (pg1), a standby server (pg2), and a backup server (repo1) storing backups and WAL archives on an NFS mount. Today, we’ll take it a step further by adding an S3 bucket to the setup :-)


Example setup: S3 bucket with a repository host

Initial situation

As a reminder, here’s the initial situation and configuration.

pg1 is our primary:

[postgres@pg1 ~]$ ps -o pid,cmd fx
    PID CMD
    764 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
    814  \_ postgres: logger 
    824  \_ postgres: checkpointer 
    825  \_ postgres: background writer 
    834  \_ postgres: walwriter 
    835  \_ postgres: autovacuum launcher 
    836  \_ postgres: archiver last was 000000010000000000000060
    837  \_ postgres: logical replication launcher 
    841  \_ postgres: walsender replicator 192.168.121.66(38334) streaming 0/610001C0

pg2 is our standby server:

[postgres@pg2 ~]$ ps -o pid,cmd fx
    PID CMD
    744 /usr/pgsql-16/bin/postgres -D /var/lib/pgsql/16/data/
    821  \_ postgres: logger 
    831  \_ postgres: checkpointer 
    832  \_ postgres: background writer 
    833  \_ postgres: startup recovering 000000010000000000000061
    870  \_ postgres: walreceiver streaming 0/610001C0

The /etc/pgbackrest.conf configuration is exactly

[...]

PostGIS Day 2024 Summary
Posted by Paul Ramsey in Crunchy Data on 2024-11-27 at 16:30

In late November, on the day after GIS Day, we hosted the annual PostGIS day online event. 22 speakers from around the world, in an agenda that ran from mid-afternoon in Europe to mid-afternoon on the Pacific coast.

We had an amazing collection of speakers, exploring all aspects of PostGIS, from highly technical specifics, to big picture culture and history. A full playlist of PostGIS Day 2024 is available on the Crunchy Data YouTube channel. Here’s a highlight reel of the talks and themes throughout the day.

The Old and the New

My contribution to the day is a historical look back at the history of databases and spatial databases. The roots of PostGIS are the roots of PostgreSQL, and the roots of PostgreSQL in turn go back to the dawn of databases. The history of software involves a lot of coincidences, and turns on particular characters sometimes, but it’s never (too) dull!

Joshua Carlson delivered one of the stand-out talks of the day, exploring how he built a very old-style cartographic product (a street map with a grid-based index for finding street names) using a very new-style approach: spatial SQL to generate the grid and find the grid numbers for each street to fill in the index. Put Making a Dynamic Street Map Index with ST_SquareGrid at the top of your video playlist.

For the past ten years, Brian Timoney has been warning geospatial practitioners about the complexity of the systems they are delivering to end users. In Simplify, simplify, simplify, Timoney both walks the walk and talks the talk, delivering denunciations of GIS dashboard mania, while building out a minimalist mapping solution using just PostGIS, SVG and (yes!) Excel. It turns out that SVG is an excellent medium for delivering cartographic products, and you can generate them entirely in PostgreSQL/PostGIS.

And then, for example, work with them directly in MS Word! (This is, as Brian says, what customers are looking for, not a dashboard.)

Steve Pousty brought the mandatory AI-centric talk, but avoided the hype

[...]

Building a Better Ruby ORM for Time Series and Analytics
Posted by Jônatas Paganini in Timescale on 2024-11-27 at 13:30

Rails developers know the joy of working with ActiveRecord. DHH didn’t just give us a framework; he gave us a philosophy, an intuitive way to manage data that feels delightful. But when it comes to time-series data, think metrics, logs, or events, ActiveRecord can start to feel a little stretched. Handling huge volumes of time-stamped data efficiently for analytics? That’s a challenge it wasn’t designed to solve (and neither was PostgreSQL).

This is where TimescaleDB comes in. Built on PostgreSQL (it’s an extension), TimescaleDB is purpose-built for time series and other demanding workloads, and thanks to the timescaledb gem, it integrates seamlessly into Rails. You don’t have to leave behind the conventions or patterns you love; it just works alongside them.

One of TimescaleDB’s standout features is continuous aggregates. Think of them as an upgrade to materialized views, automatically refreshing in the background so your data is always up-to-date and fast to query. With the new timescaledb gem continuous aggregates macro, you can define hierarchical time-based summaries in a single line of Ruby. It even reuses your existing ActiveRecord scopes, so you’re not duplicating logic you’ve already written.
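
At the SQL level, a continuous aggregate looks roughly like this (a minimal sketch with hypothetical names, assuming downloads is a hypertable with time column ts; the gem's macro presumably emits something along these lines):

CREATE MATERIALIZED VIEW downloads_per_hour
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', ts) AS bucket,
       gem_name,
       count(*) AS total_downloads
FROM downloads
GROUP BY bucket, gem_name;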

Now, your Rails app can effortlessly handle real-time analytics dashboards or historical reports, scaling your time-series workloads while staying true to the Rails philosophy.

Better Time-Series Data Aggregations Using Ruby: The Inspiration

The following code snippet highlights the real-life use case that inspired me to build a continuous aggregates macro for better time-series data aggregations. It’s part of a RubyGems contribution I made, and it’s still a work in progress. However, it’s worth validating how this idea can reduce the Ruby code you’ll have to maintain.

Example model

class Download < ActiveRecord::Base
  extend Timescaledb::ActsAsHypertable
  include Timescaledb::ContinuousAggregatesHelper

  acts_as_hypertable time_column: 'ts'

  scope :total_downloads, -> { select("coun
[...]

Maintaining Postgres for Modern Workloads
Posted by Gülçin Yıldırım Jelínek in Xata on 2024-11-27 at 00:00
I was invited to the Maintainable Podcast hosted by Robby Russell. We talked about what makes software maintainable and naturally we also talked about Postgres.

Operator Classes: Fine-Tuning Index Performance in PostgreSQL
Posted by Umair Shahid in Stormatics on 2024-11-26 at 20:34

Efficient data retrieval is crucial in any production environment, especially for databases handling heavy traffic and large datasets. PostgreSQL’s operator classes are a powerful but often overlooked tool for fine-tuning index performance. They allow you to control how PostgreSQL compares data within an index, helping to streamline searches and improve query efficiency in ways that default settings simply can’t match.

What Are Operator Classes in PostgreSQL?
An operator class in PostgreSQL is essentially a set of rules that defines how data in an index should be compared and sorted. When you create an index, PostgreSQL assigns a default operator class based on the data type, but different types (like text or geometric data) often have multiple classes to choose from. Selecting the right operator class allows PostgreSQL to work with your data in a way that better matches your search, sort, and retrieval needs.

For example:

Text: Operator classes can control whether a search is case-sensitive or case-insensitive.
Geometric Data: For location-based data, operator classes can compare things like distance or spatial relationships.

Choosing the right operator class can make a measurable difference in how quickly and efficiently your queries run, particularly when dealing with large datasets or complex data types.
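
A common concrete case (my example, not from the post): with a non-C collation, a plain btree index on text cannot serve LIKE 'prefix%' searches, but one created with the text_pattern_ops operator class can:

-- Default operator class: good for =, <, > and ORDER BY under the database collation
CREATE INDEX idx_users_email ON users (email);

-- text_pattern_ops: lets the planner use the index for anchored LIKE searches
CREATE INDEX idx_users_email_pattern ON users (email text_pattern_ops);

EXPLAIN SELECT * FROM users WHERE email LIKE 'alice%';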

Why Operator Classes Matter in Production Databases
In a production setting, performance optimization is critical, not merely a nice to have. While default operator classes work fine for general use, choosing specific classes can bring serious speed and efficiency gains for certain use cases. Here’s where they add the most value:

Faster Text Searches: Tailor searches to be case-sensitive or case-insensitive based on what makes sense for your data.
Geometric Data Efficiency: Use spatially-optimized comparisons for location-based data, like finding points within a certain radius.
Custom Data Types: For specialized data types, custom operator classes ensure that comparisons a

[...]

Contributions for the week of 2024-11-18 (Week 47 overview)
Posted by Floor Drees in postgres-contrib.org on 2024-11-26 at 14:16

Benchmarking PostgreSQL Batch Ingest
Posted by James Sewell in Timescale on 2024-11-26 at 14:00

In a previous article in this series, I explored the magic of INSERT...UNNEST for improving PostgreSQL batch INSERT performance. While it’s a fantastic technique, I know it’s not the fastest option available (although it is very flexible). Originally, I hadn't intended to loop back and benchmark all the batch ingest methods, but I saw a lot of confusion out there, so I'm back, and this time I'm looking at COPY too. As usual for this series, it’s not going to be a long post, but it is going to be an informative one. 

I flipped my approach for this post, comparing not just the PostgreSQL database performance in isolation but the practical performance from an application. To do this, I built a custom benchmarking tool in Rust to measure the end-to-end performance of each method. In this article, I’ll walk you through the batch ingest options you’ve got, and how they stack up (spoiler alert, the spread is over 19x!).

The Introduction: Batch Ingest in PostgreSQL

I’m defining batch ingest as writing a dataset to PostgreSQL in batches or chunks. You’d usually do this because the data is being collected in (near) real time (think a flow of IoT data from sensors) before being persisted into PostgreSQL (hopefully with TimescaleDB, although that's out of scope for this post).

Writing a single record at a time is incredibly inefficient, so writing batches makes sense (the size probably depends on how long you can delay writing). Just to be clear, this isn't about loading a very large dataset in one go; I’d call that bulk ingest, not batch ingest (and you'd usually do that from a file).

Broadly speaking, there are two methods for ingesting multiple values at once in PostgreSQL: INSERT and COPY. Each of these methods has a few variants, so let's look at the differences.

INSERT: VALUES and UNNEST

The most common method to ingest data in PostgreSQL is the standard INSERT statement using the VALUES clause. Everyone recognizes it, and every language and ORM (object-relational mapper) can make use of it

[...]
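
For readers who haven't seen the UNNEST trick the series refers to, the two INSERT flavors look roughly like this (illustrative table and values, not from the post):

-- Multi-row VALUES: one placeholder per value
INSERT INTO readings (sensor_id, ts, value)
VALUES (1, now(), 20.1),
       (2, now(), 19.7);

-- UNNEST: one array parameter per column, unpacked row-wise
INSERT INTO readings (sensor_id, ts, value)
SELECT *
FROM unnest(ARRAY[1, 2]::int[],
            ARRAY[now(), now()]::timestamptz[],
            ARRAY[20.1, 19.7]::float8[]);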

Performance archaeology: OLTP
Posted by Tomas Vondra on 2024-11-26 at 14:00

The Postgres open source project is nearly 30 years old; I personally started using it about 20 years ago, and I’ve been contributing code for at least 10 years. But even with all that experience, I find it really difficult to make judgments about how the performance has changed over the years. Did it improve? And by how much? I decided to do some benchmarks to answer this question.

Why PostgreSQL major version upgrades are hard
Posted by Peter Eisentraut in EDB on 2024-11-26 at 05:00

Upgrades between PostgreSQL major versions are famously annoying. You can’t just install the server binaries and restart, because the format of the data directory is incompatible.

Why is that? Why can’t we just keep the data format compatible?

Perhaps surprisingly, the data format is actually mostly compatible, but not completely. There are just a few things missing that are very hard to solve.

Let’s look at the possible upgrading procedures:

  1. pg_upgrade
  2. dump and restore using pg_dumpall
  3. logical replication to a new instance

2 and 3 are essentially the same idea: Build a new instance and transport the data between the instances in a higher-level, compatible format.

pg_upgrade is more interesting. What pg_upgrade does is:

  1. Take a dump (using pg_dump) of the schema of the old database, and restore that to the new database. (Actually separately for each database in the instance, and in combination with pg_dumpall, but that’s not important here.)

  2. Copy the data files directly from the old instance to the new instance.

So the data files, which contain the table and index data, are actually compatible between major versions. They have been for a long time, ever since pg_upgrade was introduced.

How that is managed depends on what is in those files. For example, btree has a straightforward versioning mechanism:

#define BTREE_VERSION   4       /* current version number */
#define BTREE_MIN_VERSION   2   /* minimum supported version */

heap is more complicated: it just has to maintain compatibility with whatever old versions might have stored. But in any case, this works.

What’s not compatible is the way the schema (the data definitions, the metadata for tables etc.) is stored. This is what pg_upgrade has to deal with.

So then why is the schema storage incompatible between major versions?

The schema data of a PostgreSQL database is stored in so-called system catalogs, which are themselves just tables. For bootstr

[...]

Some of my favorite PostgreSQLisms
Posted by Regina Obe in PostGIS on 2024-11-26 at 04:01

When I work with other relational databases I am reminded about how I can't use my favorite PostgreSQL hacks in them. I call these hacks PostgreSQLisms. A PostgreSQLism is a pattern of SQL unique to PostgreSQL or descendants of PostgreSQL. In this post I'm going to cover some of my favorite ones.

Continue reading "Some of my favorite PostgreSQLisms"

Posted by Hubert 'depesz' Lubaczewski on 2024-11-25 at 22:45
New Blog Post Title: How can I send mail or HTTP request from database? This question comes up every now and then in one of the PostgreSQL support places. Whenever it happens, I just suggest not trying, as interacting with the outside world from the database can be problematic, and to use LISTEN/NOTIFY instead. But it occurred to me, … Continue reading ""
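
A minimal sketch of that pattern (channel name and payload invented):

-- Inside the database, e.g. from a trigger after queueing an outgoing message:
SELECT pg_notify('outbox', json_build_object('kind', 'mail', 'id', 42)::text);

-- A worker process connected to the same database:
LISTEN outbox;
-- The worker receives the asynchronous notification and performs the actual
-- SMTP or HTTP call outside the database.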

Could GROUP-BY clause reordering improve performance?
Posted by Andrei Lepikhov in Postgres Professional on 2024-11-25 at 22:00

PostgreSQL users often employ analytical queries that sort and group data by different rules. Optimising these operators can significantly reduce the time and cost of query execution. In this post, I will discuss one such optimisation: choosing the order of columns in the GROUP BY expression.

Postgres can already reshuffle the list of grouped expressions according to the ORDER BY condition to eliminate additional sorting and save computing resources. We went further and implemented an additional strategy of group-by-clause list permutation in a series of patches (the first attempt and the second one) for discussion with the Postgres community, expecting it to be included in the next version of PostgreSQL core. You can also try it in action in the commercial Postgres Pro Enterprise fork.

A short introduction to the issue

To group table data by one or more columns, DBMSes usually use hashing methods (HashAgg) or preliminary sorting of rows (tuples) with subsequent traversal of the sorted set (SortAgg). When sorting incoming tuples by multiple columns, Postgres must call the comparison operator not just once but for each pair of values. For example, to compare a table row ('UserX1', 'Saturday', $100) with a row ('UserX1', 'Monday', $10) and determine the relative order of these rows, we must first compare the first two values and, if they match, move on to the next pair. If the second pair of values (in our example, 'Saturday' and 'Monday') differs, then there is no point in calling the comparison operator for the third element.

This is the principle on which the proposed SortAgg operator optimisation mechanism is based. If, when comparing rows, we compare column values with fewer duplicates first (for example, first compare UserID numbers and then days of the week), then we will have to call the comparison operator much less often.

Time for a demo case

How much can minimising the number of comparisons speed up a Sort operation? Let's look at the examples. In the first example, we so

[...]
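
The post's own demo is cut off above; as a stand-in, here is a small self-contained experiment (mine, not the authors') that forces the sort-based path and compares the two column orders:

CREATE TABLE payments AS
SELECT (random() * 100000)::int       AS userid,
       'Day ' || (random() * 6)::int  AS day,
       random() * 100                 AS amount
FROM generate_series(1, 1000000);

SET enable_hashagg = off;   -- force the SortAgg path

EXPLAIN ANALYZE SELECT userid, day, sum(amount) FROM payments GROUP BY userid, day;
EXPLAIN ANALYZE SELECT userid, day, sum(amount) FROM payments GROUP BY day, userid;
-- With many distinct userids and only a few distinct days, grouping by userid
-- first should need noticeably fewer comparator calls during the sort.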

RFC: Extension Packaging &amp; Lookup
Posted by David Wheeler in Tembo on 2024-11-25 at 19:14

Several weeks ago, I started a pgsql-hackers thread proposing a new extension file organization and a search path GUC for finding extensions. The discussion of Christoph Berg’s extension_destdir patch inspired this proposal. These threads cover quite a lot of territory, so I want to pull together a more unified, public proposal.

Here goes.

Challenges

A number of challenges face extension users, thanks to extension file organization in the Postgres core. The common thread among them is the need to add extensions without changing the contents of the Postgres installation itself.

Packager Testing

On Debian systems, the user account that creates extension packages lacks permission to add files to the Postgres install. But testing extensions requires installing the extension where Postgres can find it. Moreover, extensions ideally build against a clean Postgres install; adding an extension in order to run make installcheck would pollute it.

Christoph’s patch solves these problems by adding a second lookup path for extensions and dynamic modules, so that Postgres can load them directly from the package build directory.

Alas, the patch isn’t ideal, because it simply specifies a prefix and appends the full pg_config directory paths to it. For example, if --sharedir outputs /opt/share and extension_destdir GUC is set to /tmp/build/myext, the patch will search in /tmp/build/myext/opt/share. This approach works for the packaging use case, which explicitly uses full paths with a prefix, but would be weird for other use cases.

Peter Eisentraut proposed an alternate patch with a new GUC, extension_control_path, that provides a more typical search path pattern to find extension control files, but doesn’t account for shared modules that ship with an extension, requiring that they still live in the dynamic_library_path. Installing into custom directories requires the undocumented datadir and p

[...]

Substituting a variable in a SQL script
Posted by Florent Jardin in Dalibo on 2024-11-25 at 08:00

In a world where we constantly seek to automate repetitive tasks, it is common to write down a query in a script, make it more convenient, and eventually integrate the whole thing into a project’s codebase. Tools like SQL*Plus and psql can be powerful allies in this game, as relevant as Bash or Python interpreters.

In several projects I have been involved in, I have come across a large number of those kinds of scripts. Some of them have the particularity of offering input parameters, processed by SQL*Plus with the very comfortable mechanism named variable substitution. In this article, I share some tips to convert them to an equivalent syntax that PostgreSQL’s psql tool can parse and manage.
