Latest Blog Posts

PostgreSQL 18: part 1 or CommitFest 2024-07
Posted by Pavel Luzanov in Postgres Professional on 2024-11-20 at 00:00

This article is the first in the series about the upcoming PostgreSQL 18 release. Let us take a look at the features introduced in the July CommitFest.

  • Planner: Hash Right Semi Join support
  • Planner: materializing an internal row set for parallel nested loop join
  • Planner support functions for generate_series
  • EXPLAIN (analyze): statistics for Parallel Bitmap Heap Scan node workers
  • Functions min and max for composite types
  • Parameter names for regexp* functions
  • Debug mode in pgbench
  • pg_get_backend_memory_contexts: column path instead of parent, new column type
  • Function pg_get_acl
  • pg_upgrade: pg_dump optimization
  • Predefined role pg_signal_autovacuum_worker

...

Loading the World! OpenStreetMap Import In Under 4 Hours
Posted by Greg Smith in Crunchy Data on 2024-11-19 at 14:30

The OpenStreetMap (OSM) database builds almost 750GB of location data from a single file download. Loading OSM notoriously takes a full day. A fresh OpenStreetMap load involves both a massive write process and large index builds. It is a great performance stress-test bulk load for any Postgres system. I use it to stress the latest PostgreSQL versions and state-of-the-art hardware. The stress test validates new tuning tricks and identifies performance regressions.

Two years ago, I presented (video / slides) at PostGIS Day on challenges of this workload. In honor of this week’s PostGIS Day 2024, I’ve run the same benchmark on Postgres 17 and the very latest hardware. The findings:

  • PostgreSQL keeps getting better! Core improvements sped up index building in particular.
  • The osm2pgsql loader got better too! New takes on indexing speed things up.
  • Hardware keeps getting better! It has been two years since my last report and the state-of-the-art has advanced.

Tune Your Instrument

First, we are using bare metal hardware—a server with 128GB RAM—so let’s tune Postgres for loading and to match that server:

max_wal_size = 256GB
shared_buffers = 48GB
effective_cache_size = 64GB
maintenance_work_mem = 20GB
work_mem = 1GB

Second, let’s prioritize bulk load. The following settings do not make sense for a live system under read/write load, but they will improve performance for this bulk load scenario:

checkpoint_timeout = 60min
synchronous_commit = off
# if you don't have replication:
wal_level = minimal
max_wal_senders = 0
# if you believe my testing, these make things
# faster too
fsync = off
autovacuum = off
full_page_writes = off

It’s also possible to tweak the background writer for the particular case of massive data ingestion, but for bulk loads without concurrency it doesn’t make a large difference.
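
For reference, the background writer knobs referred to here are the ones below; the values are illustrative assumptions, not the article's recommendation:

# background writer tuning (illustrative values only)
bgwriter_delay = 10ms
bgwriter_lru_maxpages = 1000
bgwriter_lru_multiplier = 10.0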

How PostgreSQL has Improved

In 2022, testing that year's new AMD AM5 hardware loaded the data in just under 8 hours with Postgres 14. Today the amount of data in the

[...]

What Happens Behind the Scenes When You Modify a Row in PostgreSQL?
Posted by semab tariq in Stormatics on 2024-11-19 at 10:13

Data is often called the new gold, and databases are where we store and manage this precious resource as it constantly changes and grows. At first glance, updating data might seem like a simple task—just modify a row. 

But behind the scenes, it’s more complex to ensure that data remains consistent and accessible. In today’s blog, I’ll answer some frequently asked questions from our customers and dive into why PostgreSQL relies on a process called VACUUM to efficiently manage data updates.

Updating a row in PostgreSQL isn’t as straightforward as directly changing the existing data. Instead, PostgreSQL avoids in-place updates, meaning it doesn’t overwrite rows directly. 

But what does this actually mean? 

When an update occurs, PostgreSQL creates a new row, inserts the updated data there, and marks this new row as the latest version. The old row, meanwhile, is flagged as an obsolete row.

A similar process applies to deletes, where rows are marked as outdated rather than removed immediately. 
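
You can watch this happen with a quick, throwaway example (a minimal sketch; ctid is the row version's physical location, and xmin/xmax are the transaction IDs that created and invalidated it):

CREATE TABLE mvcc_demo (id int PRIMARY KEY, val text);
INSERT INTO mvcc_demo VALUES (1, 'original');

-- note the physical location (ctid) of the current row version
SELECT ctid, xmin, xmax, val FROM mvcc_demo;

UPDATE mvcc_demo SET val = 'updated' WHERE id = 1;

-- the visible row now sits at a different ctid: a new version was written
-- and the old one was left behind as a dead tuple for VACUUM to reclaim
SELECT ctid, xmin, xmax, val FROM mvcc_demo;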

This raises an interesting question: why did PostgreSQL choose this more complex approach for handling updates and deletes?

The answer lies in its design philosophy, which is rooted in Multi-Version Concurrency Control (MVCC). MVCC ensures data consistency and allows for high concurrency.

What is Multi-Version Concurrency Control (MVCC)?

Multi-Version Concurrency Control (MVCC) allows PostgreSQL to manage multiple transactions at once, enabling consistent data views for each transaction without interference.

Imagine a library with a single book titled The Ultimate Guide to Databases. 

Without Multi-Version Concurrency Control (MVCC), if two people want to check out the book at the same time, one would have to wait for the other to finish reading it. This is similar to a traditional database where transactions can block each other, preventing simultaneous access.

With MVCC, the process works differently. When the first person checks out the book, the library creates a copy just

[...]

PGConf.EU 2024 Lightning Talks
Posted by Andreas Scherbaum on 2024-11-19 at 08:45
This year, the Lightning Talks at PGConf.EU were handled a bit differently compared to previous years. If you attended a previous PGConf.EU, you may remember the whiteboard that at some point appears near the registration desk, where people rush to fill in 12 talks. First come, first selected. This year, Karen Jex suggested having a box into which everyone can drop cards with a talk, with 12 talks then selected at random.

Postgres on Spot VMs - only for the crazy?
Posted by Kaarel Moppel on 2024-11-18 at 22:00
Postgres is already great, surely - even “too popular”, one could complain with a twist…as this broadcast “was” actually supposed to be my Lightning Talk at last month’s pgConf.eu in Athens 🙂 But indeed, it seems Postgres has in an awesome way gotten so big that it’s not like that anymore...

Contributions for the week of 2024-11-11 (Week 46 overview)
Posted by Jimmy Angelakos in postgres-contrib.org on 2024-11-18 at 17:16

Easy Totals and Subtotals in Postgres with Rollup and Cube
Posted by Elizabeth Garrett Christensen in Crunchy Data on 2024-11-18 at 14:30

Postgres is being used more and more for analytical workloads. There are a few hidden gems I recently ran across that are really handy for data analysis in SQL: ROLLUP and CUBE. Rollup and cube don’t get a lot of attention, but follow along with me in this post to see how they can save you a few steps and enhance your date binning and summary reporting.

We also have a web based tutorial that covers Postgres Functions for Rolling Up Data by Date if you want to try it yourself with a sample data set.

Superpowered Group By

Before we dig into rollup and cube, let’s look at GROUP BY statements. Here’s an example query where I want to get totals for all my months of sales and categories.

Using to_char with date_trunc, I can roll things up by month. With GROUP BY, I can roll up data by category.

-- Get totals for each month and category
SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    date_trunc('month', order_date), category
ORDER BY
    date_trunc('month', order_date), category;
     month      |  category   | total_orders | total_amount
----------------+-------------+--------------+--------------
 October 2021   | Books       |            3 |      2375.73
 October 2021   | Clothing    |           18 |     13770.09
 October 2021   | Computers   |           17 |     13005.87

If I wanted to get subtotals for years, I would have to pull this into Excel or write separate SELECT statements and UNIONs. You would have to add a lot more SQL, with snippets and sections like this:

-- Get total for each month
....
UNION ALL

SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    NULL AS category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    date_trunc('month', order_date)
ORDER BY
    date_trunc('month', order_date)

-- Get grand total across all months and categories
...
UNION ALL

SELECT
[...]
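
For comparison, GROUP BY ROLLUP collapses those UNIONs into a single statement, producing the per-category rows, the monthly subtotals, and a grand total in one pass. A minimal sketch against the same orders table:

SELECT
    to_char(date_trunc('month', order_date), 'FMMonth YYYY') AS month,
    category,
    COUNT(*) AS total_orders,
    SUM(total_amount) AS total_amount
FROM
    orders
GROUP BY
    ROLLUP (date_trunc('month', order_date), category)
ORDER BY
    date_trunc('month', order_date), category;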

Ayse Bilge Ince
Posted by Andreas 'ads' Scherbaum on 2024-11-18 at 14:00
PostgreSQL Person of the Week Interview with Ayse Bilge Ince: Hello! I’m Bilge Ince. Originally from Turkey, I now call London home. I’m currently a Machine Learning Engineer at EDB, which is also how I began connecting with the PostgreSQL community.

dbicdump: using PostgreSQL schemas as package separator in produced Perl classes
Posted by Luca Ferrari on 2024-11-18 at 00:00

A way to instrument dbicdump to use PostgreSQL schemas as package separators.

dbicdump: using PostgreSQL schemas as package separator in produced Perl classes

Perl DBIx::Class is a great Object Relational Mapper (ORM), and I use it regularly with dbicdump, which is a tool to synchronize your existing database structure with the classes your program is going to use.

PostgreSQL being PostgreSQL, a great rock solid database we all love, allows us to organize tables into schemas, a flat namespace that is usually transparent to the user because the default schema, public, is always in the search_path for every user.

But how to take advantage of PostgreSQL schemas and DBIx::Class packages?

Well, it turned out that this is possible, with a little customization of the way you synchronize your own data structure.

Example Database

Assume we have an example database with a couple of tables, namely products and orders, each one replicated into two different schemas named respectively italy and japan. Note, this is probably not the best design for your database, but it serves only as an example to give a quick and easy idea of how to achieve things.

The resulting database is as follows:

dbic=> CREATE SCHEMA italy;
CREATE SCHEMA
dbic=> CREATE SCHEMA japan;
CREATE SCHEMA
dbic=> CREATE TABLE italy.product( pk serial,
                                   code text,
                                   description text,
                                   primary key( pk ),
                                   unique( code ) );
CREATE TABLE

dbic=> CREATE TABLE japan.product( pk serial,
                                   code text,
                                   description text,
                                   primary key( pk ),
                                   unique( code ) );
CREATE TABLE

dbic=> CREATE TABLE italy.orders( pk serial,
                                  product int not null,
                                  qty int default 0,
                                  primary key( pk ),
                                  foreign key( product ) references italy.product( pk ) );
CR
[...]

Prague PostgreSQL Meetup on November 18th
Posted by Gülçin Yıldırım Jelínek in Xata on 2024-11-17 at 00:00
The Prague PostgreSQL Meetup will take place on November 18th!

The Doom That Came To PostgreSQL: When Collations Change
Posted by Christophe Pettus in pgExperts on 2024-11-15 at 19:31

In this installment of our series on locales and PostgreSQL, we’ll talk about what can happen when the library that provides locales changes out from under a PostgreSQL database. It’s not pretty.

First, more about collations

Back in an earlier installment, we described a collation as a function that takes two strings and returns whether the strings are greater than, equal to, or less than each other:

coll(a, b) -> {greater than, equal, less than}

This is a slight oversimplification. What the actual collation function does is take a string, and returns a sort key:

coll(a) -> Sa

The sort keys have the property that if the strings they are derived from have (within the rules of the collation) a particular relationship, the sort keys do as well. This is very valuable, because it allows doing these comparisons repeatedly without having to rerun the (potentially expensive) collation function.

In the case of C locale, the collation function is the identity function: you compare exactly the same string of bytes that is present in the input. In more sophisticated locales, the collation function “normalizes” the input string into a string that can be repeatedly compared by just comparing the bytes.
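
A quick illustration of the difference, assuming the en_US collation is available in your database:

-- byte-wise comparison vs. linguistic comparison
SELECT 'a' < 'B' COLLATE "C"     AS c_result,      -- false: byte 0x61 > 0x42
       'a' < 'B' COLLATE "en_US" AS en_us_result;  -- true: 'a' sorts before 'B'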

What is an index, anyway?

Everyone who works with databases has an intuitive idea of what an index is, but for our purposes, an index is a cache of a particular sortation of the input column. You can think of an index defined as:

CREATE INDEX on t(a) WHERE b = 1;

as a cache of the query predicate:

SELECT ... WHERE b = 1 ORDER BY a;

This allows the database engine to quickly find rows that use that predicate, as well as (often) retrieve a set of rows in sorted order without having to re-sort them.

The database automatically updates the “cache” when a row changes in a way that would alter the index. This generally works very, very well.

But:

Cache invalidation is hard.

Suppose that the collation function changes, but no one tells the database

[...]

On CVEs, BoMs, and Releases
Posted by Douglas Hunley on 2024-11-15 at 17:10
As previously mentioned in this Crunchy Postgres via Automation series, we now use a Bill of Materials (BoM) to ship tested-together versions of the components that comprise the CPA product.

A change to RelResultInfo - A Near Miss with Postgres 17.1
Posted by Craig Kerstiens in Crunchy Data on 2024-11-15 at 15:30

Since its inception Crunchy Data has released new builds and packages of Postgres on the day community packages are released. Yesterday's minor version release was the first time we made the decision to press pause on a release. Why did we not release it immediately? There appeared to be a very real risk of breaking existing installations. Let's back up and walk through a near miss of Postgres release day.

Yesterday when Postgres 17.1 was released there appeared to be breaking changes in the Application Binary Interface (ABI). The ABI is the contract that exists between PostgreSQL and its extensions. Initial reports showed that a number of extensions could be affected, triggering warning sirens around the community. In other words, if you were to upgrade from 17.0 to 17.1 and use these extensions, you could be left with a non-functioning Postgres database. Further investigation showed that TimescaleDB and Apache AGE were the primarily affected extensions; if you are using them, you should hold off on upgrading to the latest minor release for now, or rebuild the extension against the latest PostgreSQL release in coordination with your upgrade.

The initial list of extensions for those curious:

Affected      Unaffected
Apache AGE    HypoPG
TimescaleDB   pg-query
              Citus
              pglast
              pglogical
              pgpool2
              ogr-fdw
              pg-squeeze
              mysql-fdw

First, a little bit on Postgres releases. Postgres releases major versions each year, and minor versions roughly every three months. The major versions are expected to be forward compatible, but do introduce bigger changes that result in cata

[...]

Grouping data into array of sums – fun with custom aggregates
Posted by Hubert 'depesz' Lubaczewski on 2024-11-15 at 14:55
Was asked recently about optimization of interesting case. There was table like: =$ CREATE TABLE input_data ( category_id INT8, object_id INT8, interaction_ts timestamptz, interaction_type TEXT, interaction_count INT4 ); And there was a code that was grouping it all by sum()ing interaction_count per category, object, interaction_type, and timestamp truncated to hour. Basically, storing somewhere result of: … Continue reading "Grouping data into array of sums – fun with custom aggregates"

Farm to TABLE: Local(e) Providers
Posted by Christophe Pettus in pgExperts on 2024-11-15 at 00:57

In this installment about character encodings, collations, and locales in PostgreSQL, we’re talking about locale providers.

A reminder about locales

A “locale” is a bundled combination of data structures and code that provides support for handling different localization services. For our purposes, the two most important things that a locale provides are:

  • A character encoding.
  • A collation.

And as a quick reminder, a character encoding is a mapping between a sequence of one or more bytes and a “glyph” (what we generally think of as a character, like A or * or Å or 🤷). A collation is a function that places strings of characters into a total ordering.

UTF-8

Everything here assumes you are using UTF-8 encoding in your database. If you are not, please review why you should.

Getting locale

A locale provider is a library (in the computing sense, a hunk of code and data) whose API provides a set of locales and lets you use the functions in them. There are three interesting locale providers in the PostgreSQL world:

  • libc — This is the implementation of the C standard library on every POSIX-compatible system, and many others. You’ll most often see glibc, which is the GNU project’s implementation of libc.
  • ICU — The International Components for Unicode. This is a (very) extensive library for working with Unicode text. Among other features, it has a comprehensive locale system.
  • builtin — As of version 17, PostgreSQL has a builtin collation provider, based on locales built into PostgreSQL itself.
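
To make that concrete, here is a minimal sketch of defining a collation from each provider (the libc example assumes the de_DE.utf8 OS locale is installed, and the builtin example requires PostgreSQL 17 or later):

CREATE COLLATION german_libc    (provider = libc,    locale = 'de_DE.utf8');
CREATE COLLATION german_icu     (provider = icu,     locale = 'de-DE');
CREATE COLLATION builtin_c_utf8 (provider = builtin, locale = 'C.UTF-8');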

libc

When nearly everyone thinks of locales, they think of libc. Really, they mostly think of the ubiquitous en_US.UTF-8 locale. If you do not specify a locale when creating a new PostgreSQL instance using initdb, it uses the system locale, which is very often en_US.UTF-8. (Or C locale, which is almost never the right choice.)

The structure of a glibc locale name is (using en_US.UTF-8 as an example):

  • en — The base language of the locale, usually the
[...]

Recap of PGConf.EU 2024 in Athens
Posted by Gülçin Yıldırım Jelínek in Xata on 2024-11-15 at 00:00
My notes from PGConf.EU 2024 in Athens, Greece. Talks, extension summit, diversity committee meeting, Xata Postgres dinner, and more!

Transitioning from Oracle to PostgreSQL: Concurrency Control
Posted by Umair Shahid in Stormatics on 2024-11-14 at 17:27

Transitioning from Oracle to PostgreSQL can be a transformative experience for database administrators because of the subtle differences between the two technologies. Understanding how the two handle concurrency differently is critical to managing highly concurrent workloads.

Concurrency control is essential for maintaining data consistency when multiple users access the database simultaneously. Oracle and PostgreSQL take different approaches to concurrency control: Oracle primarily relies on locking and consistent snapshots, while PostgreSQL utilizes a Multi-Version Concurrency Control (MVCC) system.

This article provides an in-depth look at concurrency control in PostgreSQL from an Oracle perspective.

Concurrency Control Basics in Oracle vs. PostgreSQL

Oracle's Concurrency Model
Oracle’s concurrency model is robust, based on a combination of locks, snapshots, and undo segments. When a transaction begins, Oracle isolates its changes by locking rows and using rollback segments to store previous versions of data. This approach maintains consistency but may impact concurrency, especially in high-transaction environments.

Oracle also uses a feature called redo and undo logging to handle multi-user transactions. Redo logs ensure that all committed changes are preserved even in case of a failure, while undo logs allow Oracle to provide a consistent view of data for queries that run alongside updates.

PostgreSQL’s MVCC Approach
PostgreSQL’s MVCC (Multi-Version Concurrency Control) provides an alternative by allowing multiple versions of rows to coexist. This means that when a transaction modifies a row, PostgreSQL creates a new version instead of overwriting the original. The previous version remains accessible to other transactions, allowing read and write operations to occur simultaneously with minimal locking.

In PostgreSQL, MVCC prevents locking conflicts that could slow down the system, providing consistent data snapshots without needing locks for every read. For Oracle DBAs, this

[...]

PostgreSQL Hacking Workshop - December 2024
Posted by Robert Haas in EDB on 2024-11-14 at 13:01

Next month, I'll be hosting a discussion of Melanie Plageman's talk, Intro to Postgres Planner, given at PGCon 2019. You can sign up using this form. To be clear, the talk is not an introduction to how the planner works from a user perspective, but rather how to hack on it and try to make it better and perhaps get your improvements committed to PostgreSQL. If you're interested, please join us. I anticipate that both Melanie and I will be present for the discussions.

Read more »

Accessing Large Language Models from PostgreSQL
Posted by Paul Ramsey in Crunchy Data on 2024-11-13 at 14:30

Large language models (LLM) provide some truly unique capacities that no other software does, but they are notoriously finicky to run, requiring large amounts of RAM and compute.

That means that mere mortals are reduced to two possible paths for experimenting with LLMs:

  • Use a cloud-hosted service like OpenAI. You get the latest models and best servers, at the price of a few micro-pennies per token.
  • Use a locally hosted small model. You get the joy of using your own hardware, and only paying for the electricity.

Amazingly, you can take either approach and use the same access API to hit the LLM services, because the OpenAI API has become a bit of an industry standard.

OpenAI access extension

Knowing this, it makes sense to build a basic OpenAI API access extension in PostgreSQL to make using the API quick and easy. The extension we built for this post has three functions:

  • openai.models() returns a list of models being served by the API
  • openai.prompt(context text, prompt text) returns the text answer to the prompt, evaluated using the context.
  • openai.vector(prompt text) returns the vector embedding of the prompt text.

The OpenAI API just accepts JSON queries over HTTP and returns JSON responses, so we have everything we need to build a client extension, combining native PostgreSQL JSON support with the http extension.
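
To give a feel for that plumbing, here is a hedged sketch of such a call, not the extension's actual code: the function name ask_llm, the hard-coded model, and the /chat/completions path are all illustrative assumptions.

CREATE EXTENSION IF NOT EXISTS http;

-- Illustrative only: POST a chat-completion request and pull out the answer
CREATE OR REPLACE FUNCTION ask_llm(api_uri text, api_key text, prompt text)
RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    resp http_response;
BEGIN
    resp := http((
        'POST',
        api_uri || '/chat/completions',
        ARRAY[http_header('Authorization', 'Bearer ' || api_key)],
        'application/json',
        json_build_object(
            'model', 'gpt-4o-mini',   -- assumed model name
            'messages', json_build_array(
                json_build_object('role', 'user', 'content', prompt))
        )::text
    )::http_request);

    -- the answer lives at choices[0].message.content in the JSON response
    RETURN (resp.content::json) #>> '{choices,0,message,content}';
END;
$$;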

There are two ways to get the extension functions:

  • You can install the extension if you have system access to your database.
  • Or you can just load the openai--1.0.sql file, since it is 100% PL/PgSQL code. Just remember to CREATE EXTENSION http first, because the API extension depends on the http extension.

Local or remote

The API extension determines what API end point to hit and what models to use by reading a handful of global variables.

Using OpenAI is easy.

  • Sign up for an API key.
  • Set up the key, URI and model variables.
SET openai.api_key = 'your_api_key_here';
SET openai.api_uri = 'https://api.openai.
[...]

Schema changes and the power of expand-contract with pgroll
Posted by Andrew Farries in Xata on 2024-11-13 at 00:00
A pgconf EU talk recap covering how the expand-contract pattern and pgroll enable zero-downtime schema changes and rollbacks.

Transitioning from Oracle to PostgreSQL: Indexes
Posted by Umair Shahid in Stormatics on 2024-11-12 at 14:51

For database experts well-versed in Oracle, moving to PostgreSQL opens up new indexing methods that differ significantly in terms of structure, management, and optimization. While both databases leverage indexing to enhance query speed, their approaches vary, particularly in terms of available types, performance tuning, and maintenance. This guide clarifies key differences and provides practical strategies for effectively handling indexes in PostgreSQL.

Understanding Indexing in Databases: The Basics

Indexes reduce query time by creating a more accessible data structure, limiting the need to scan entire tables. Think of them as a ‘Table of Contents’ of sorts to quickly look up the relevant data. However, indexes consume storage and require careful planning—creating too many or inefficient indexes can degrade performance. Both Oracle and PostgreSQL offer various index types, each suited for specific tasks. Here is where they align and where PostgreSQL introduces unique indexing options.

Types of Indexes in Oracle vs. PostgreSQL

B-tree Indexes
Oracle: The default index type, suitable for common lookup operations, range conditions, and queries using comparison operators.
PostgreSQL: B-tree indexes are also default in PostgreSQL, optimized for single and range lookups, and offer operator class flexibility for more precise control.

Bitmap Indexes
Oracle: Bitmap indexes optimize performance for low-cardinality columns with complex WHERE clauses.
PostgreSQL: While bitmap indexes are not available, PostgreSQL’s query planner can use B-tree indexes with bitmap heap scans to achieve a similar effect. This approach is typically used in complex AND/OR queries but doesn’t fully replicate Oracle’s bitmap capabilities.

Hash Indexes
Oracle: Limited in application and typically used in specialized cases as hash clusters.
PostgreSQL: Offers hash indexes but with restricted use cases. They support only equality operations and require careful selection to avoid unnecessary bloat.
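
For illustration, a hash index is declared like any other index, just with the access method named explicitly; the sessions table here is hypothetical:

CREATE TABLE sessions (id bigint PRIMARY KEY, token text NOT NULL);

-- useful only for equality lookups such as: WHERE token = 'abc123'
CREATE INDEX sessions_token_hash ON sessions USING hash (token);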

GIN and Gi

[...]

Deploying the Zalando Kubernetes operator
Posted by Hans-Juergen Schoenig in Cybertec on 2024-11-12 at 06:00

Kubernetes (and OpenShift) have become highly popular deployment methods to run PostgreSQL at scale. While Kubernetes is an excellent way of running things, it also helps make things such as consistency, compliance, and a lot more easier to achieve. In this post you will learn how to deploy the Zalando operator for PostgreSQL and make things work quickly and easily.

Deploying Minikube and cloning the operator

To make things simple for you, I have used Minikube, which is an easy way to run Kubernetes locally on your development environment or even on your laptop.

If you are new to Minikube, I highly recommend taking a look at the "Getting started" guide. It explains the most basic steps in detail.

Let us assume Minikube is already deployed on your machine. You can easily start the system by calling the following command:

$ # We start Minikube with:
$ minikube start

We can verify if things are working as follows:

$ # kubectl should now be working, e.g.:
$ kubectl get pod
...
$ # this should return the list of pods in the 'default' namespace

Once we have ensured that Kubernetes is indeed running, we can download the Zalando operator from the GitHub repository, which is easy to do:

$ # Clone of postgresql operator
$ git clone https://github.com/zalando/postgres-operator.git
$ cd postgres-operator

The repository is public and can be cloned directly. In the next step we can simply apply templates. The way this works is that such templates basically describe what we want from Kubernetes. All the vital information about the service we want to deploy is there.

Before looking at the content of the file, we can simply apply the template. The first thing to throw at Kubernetes is the configmap:

$ # We apply default postgres operator configuration with
$ kubectl create -f manifests/configmap.yaml

The next things to set are identities and permissions, which can be stored in a simple YAML file as well. Here is how it works:

$ # We apply the manifest for identities and permissions
[...]

Contributions for the week of 2024-11-04 (Week 45 overview)
Posted by Boriss Mejias in postgres-contrib.org on 2024-11-11 at 17:27
  • PGConf.Brazil 2024 was held in Belo Horizonte, Brazil, on November 7th-8th. It was organized by:

    • Kenia Galiego
    • Danielle Monteiro
    • Euler Taveira
    • Fernando Franquini
    • Lucio Chiessi
    • William Ivanski

    The Call for Papers committee was formed by:

    • Fabio Telles
    • Gerdan Santos
    • Glauco Torres
    • Kenia Galiego
    • Luciane Pinheiro
  • On the 6th of November, the Copenhagen PostgreSQL User Group held a meetup where Luigi Nardi presented DBTune.

Getting comfortable with psql
Posted by Craig Kerstiens in Crunchy Data on 2024-11-11 at 15:57

psql is the command-line client that ships with Postgres. It’s incredibly powerful for working with Postgres, and doesn’t take too much of a learning curve to start to get comfortable, so you can really feel like an expert working with your database.

Just a rundown of a few things to get you started:

Once connected in psql you can get an idea of all utility commands available with:

\?

A handy thing I use all the time is \d.

\d will describe the tables within the database. You can also add a table/index/etc. onto it to describe that specific object, such as:

\d accounts

There are a number of options you can set in your psqlrc (config) file to customize your CLI experience. But you can also toggle those when directly working in psql.

  • \timing will give you the time it took to run your query
  • \x auto will switch to expanded output automatically when rows are too wide for your terminal
  • \pset pager 0 turns off your pager or 1 to turn it back on
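
As an example, a minimal ~/.psqlrc that makes those toggles permanent could look like this (purely illustrative):

\timing on
\x auto
\pset pager off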

Oh, and for editing a query in your editor of choice: make sure you set your $EDITOR environment variable to the editor of your choice, though the only right choice is vim:

\e

Just a few things to get you started working with psql.

Understanding Volatility in PL/pgSQL Functions: A Real-World Lesson
Posted by Deepak Mahto on 2024-11-10 at 13:44

The PL/pgSQL language, available as a default extension in PostgreSQL, provides powerful tools and flexibility for application developers to build complex, enterprise-scale functionality within the database. Through PL/pgSQL’s functions and procedures, developers can choose different volatility categories—IMMUTABLE, STABLE, or VOLATILE—that offer varying performance benefits, especially in terms of result caching and data state awareness. For a deeper dive, refer to the official documentation. Choosing the appropriate volatility for a function is crucial; as they say, “With great power comes great responsibility.”

During one of my migration projects, we ran into unexpected behavior while debugging code converted from Oracle. Below is a recreated example using mock functions to illustrate the issue:

drop table if exists user_status;
create table user_status(col1 bigint GENERATED BY DEFAULT AS IDENTITY, status text);

create or replace function upd_status(text)
returns bigint language plpgsql 
as $$
declare var1 bigint;
begin
with alias1 as (insert into user_status(status) values($1) returning col1)
        select * into var1 from alias1;
return var1;
end;$$;

create or replace function lookup_status(text)
returns boolean language plpgsql 
immutable parallel safe
as $$
begin return (select exists (select 1 from user_status where status = $1)); end;$$;

create or replace function check_status(text)
returns void language plpgsql 
as $$ 
declare var1 bigint;
begin
var1 := upd_status($1);
if lookup_status($1) then 
    raise notice 'Status exists as part of immutable check - %', $1;
else 
    raise notice 'Status does not exists - %', $1;
end if;
raise notice 'Actual Result for status % is %', $1,(select exists (select 1 from user_status where status = $1 and col1 = var1));
end;$$;
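
The behavior described below can be reproduced with a call like the following (the 'Open' value mirrors the status discussed in the text):

select check_status('Open');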

If you observe the output, the Open status is inserted into user_status via the check_status call, but any call to lookup_status within the same snapshot doesn’t detect this data change.

It took some time to diagnose the

[...]

Text identifiers in PostgreSQL database design
Posted by Radim Marek on 2024-11-09 at 00:00

Whether you are designing a standalone application or a microservice, you will inevitably encounter the topic of sharing identifiers. Whether it’s URLs of web pages, RESTful API resources, JSON documents, CSV exports, or something else, the identifier of specific resources will be exposed.

/orders/123
/products/345/variants/1

While an identifier is just a number and does not carry any negative connotations, there are valid reasons why you might want to avoid exposing it. These reasons include:

  1. Security and Data Exposure: Numerical identifiers are sequential and predictable, which can expose information about the underlying data source (e.g., the volume of the data) and provide a basis for ID enumeration.
  2. Privacy and Confidentiality: Concerns may arise about concealing the volume of referenced data. For example, the number of customers, clients, or orders might be information a business prefers to keep private.
  3. Non-descriptive Nature: Integers as identifiers can lead to confusion. An ID like 123 does not convey any additional information, making debugging edge cases more challenging.

These and other reasons (like SEO optimisation) have led to the increased use of text-based identifiers. Their readability and versatility make them ideal for external data sharing.

However, in database (or data model) design, the advantages of text identifiers are often overshadowed by the problems they introduce. While text identifiers improve interoperability, they frequently come with performance and storage trade-offs. In contrast, integers are naturally faster and more efficient to process, resulting in lower storage requirements and faster indexing, sorting, and searching: tasks that computers are optimised for.

In this article, we will explore scenarios where using text identifiers directly in the database design might seem natural and discuss strategies for using them effectively.
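
As a taste of one common compromise, sketched purely as an illustration and not necessarily the strategy discussed below, you can keep an integer key for joins and indexing while exposing a separate text identifier; gen_random_uuid() (PostgreSQL 13+) stands in for whatever text scheme you prefer:

CREATE TABLE orders (
    id        bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,      -- internal key
    public_id text NOT NULL UNIQUE DEFAULT gen_random_uuid()::text, -- exposed in URLs
    placed_at timestamptz NOT NULL DEFAULT now()
);
-- internal joins use id; a URL such as /orders/{public_id} uses the text column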

What Makes Text Identifiers Appealing?

Let’s be honest: text identifiers are popular f

[...]

Crunchy Postgres via Automation V2.2
Posted by Douglas Hunley on 2024-11-08 at 17:23
Continuing our series on Crunchy Postgres via Automation, we’re here this week to discuss the highlights of our latest release line, v2.

8 Steps in Writing Analytical SQL Queries
Posted by Christopher Winslett in Crunchy Data on 2024-11-08 at 14:30

It is never immediately obvious how to go from a simple SQL query to a complex one -- especially if it involves intricate calculations. One of the “dangers” of SQL is that you can create an executable query but return the wrong data. For example, it is easy to inflate the value of a calculated field by joining to multiple rows.

Let’s take a look at a sample query. This appears to look for a summary total of invoice amounts across teams. If you look closely, you might see that the joins would inflate a team’s yearly invoice spend for each team member.

SELECT
        teams.id,
        json_agg(accounts.email),
        SUM(invoices.amount)
FROM teams
        INNER JOIN team_members ON teams.id = team_members.team_id
        INNER JOIN accounts ON accounts.id = team_members.account_id
        INNER JOIN invoices ON teams.id = invoices.team_id
WHERE lower(invoices.period) > date_trunc('year', current_date)
GROUP BY 1;

The query is joining invoices to teams after already joining team_members to teams. If a team has multiple members and multiple invoices, each invoice amount could be counted multiple times in the SUM(invoices.amount) calculation.
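
One way to avoid that inflation, keeping the example's table names and assuming accounts joins to team_members through an account_id column, is to compute each aggregate against a single team row:

SELECT
    teams.id,
    (SELECT json_agg(accounts.email)
       FROM team_members
       JOIN accounts ON accounts.id = team_members.account_id  -- assumed join key
      WHERE team_members.team_id = teams.id) AS member_emails,
    (SELECT SUM(invoices.amount)
       FROM invoices
      WHERE invoices.team_id = teams.id
        AND lower(invoices.period) > date_trunc('year', current_date)) AS total_amount
FROM teams;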

Building SQL from the ground up

The above error may not be immediately obvious. This is why it’s better to start small and use building blocks.

Writing complex SQL isn’t as much “writing a query” as it is “building a query.” By combining the building blocks, you get the data that you think you are getting. The steps to building a query form a continuous loop that you iterate until the result improves:

  1. Using words, define the data
  2. Investigate available data
  3. Return the simplest data
  4. Confirm the simple data
  5. Augment the data with joins
  6. Perform summations
  7. Augment with details or aggregates
  8. Debugging

Let’s step through the query example above, getting sum aggregate totals, to learn my method for building a query.

Step 1: In human words, write what you want

It is okay if it changes, because data exploration may mean the data is differ

[...]

Scenarios That Trigger Autovacuum in PostgreSQL
Posted by semab tariq in Stormatics on 2024-11-08 at 13:00

PostgreSQL is widely known for its Multi-Version Concurrency Control (MVCC) model, which allows multiple transactions to occur simultaneously without interfering with each other. 

However, one side effect of MVCC is the creation of dead tuples—old versions of data rows that are no longer needed but still occupy space. 

Dead tuples also lead to a phenomenon known as table bloat, which refers to the excessive unused space in a table caused by dead tuples that haven't been cleaned up, resulting in inefficient storage and reduced performance.

To address the issues of dead tuples and table bloat, autovacuum comes into play. It's an automatic process designed to clean up these dead tuples and maintain optimal database performance.
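
A quick way to see dead tuples accumulating, and when autovacuum last dealt with them, is the cumulative statistics view:

SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;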

In this blog, we will explore the main situations when autovacuum should run.


From Backup to Integrity: Leveraging WAL-G for PostgreSQL
Posted by Marat Bogatyrev in Data Egret on 2024-11-07 at 16:36

A key aspect of maintaining backup integrity is understanding data checksums. Without proper checksum validation, detecting data corruption becomes virtually impossible. Therefore, we will start with The Importance of Data Checksums.


The Importance of Data Checksums

Ensuring data integrity is crucial for the reliability of any database system, and checksum validation is essential to that goal: checksums help detect data corruption caused by hardware issues.

How PostgreSQL Calculates Checksums:

1. During Page Writes: PostgreSQL calculates the checksum each time a data page is written to disk. This ensures that any corruption occurring after the write can be detected during future reads.

2. Verification During Reads: Every time a page is read from disk into memory—whether by user queries or maintenance operations—PostgreSQL verifies the checksum to confirm data integrity.
If the checksum does not match, a warning is raised and recorded in the log. The log entry below indicates that a data page is corrupt, and the checksum does not match the expected value. Such logs can be conveniently captured and monitored using the logerrors extension, which helps identify and track these errors for further analysis.

WARNING:  page verification failed, calculated checksum 24693 but expected 58183

Although enabling checksums does add extra overhead to the database, the general rule of thumb is to enable this option on every cluster via initdb. This practice is considered so important that, in PostgreSQL 18, there is a strong possibility that data checksums will be enabled by default.

However, remember that checksums are disabled by default in PostgreSQL through version 17, so you must enable them when creating a cluster. If data checksums are not enabled, PostgreSQL provides a tool called pg_checksums. This tool allows you to enable, disable, or verify checksums, but it only works when the cluster is offline.
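
To see whether an existing cluster was initialized with checksums, and whether any failures have been recorded (the failure counters exist since PostgreSQL 12), you can ask the server directly:

SHOW data_checksums;

SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE checksum_failures > 0;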

While enabling checksums provides a baseline for detecting corru

[...]
