PostgreSQL
The world's most advanced open source database

In the lifecycle of any application, developers are often asked to create proof-of-concept features, test newly released functionality, and visualize data analysis. In many cases, the available test data is stale, not representative of normal usage, or simply doesn't exist for the feature being implemented. In situations like this, knowing how to quickly create sample time-series data with native PostgreSQL and SQL functions is a valuable skill to draw upon!

In this three-part series on generating sample time-series data, we demonstrate how to use the built-in PostgreSQL function, generate_series(), to more easily create large sets of data to help test various workloads, database features, or just to create fun samples.

In part 1 and part 2 of the series, we reviewed how generate_series() works, how to join multiple series using a CROSS JOIN to create large datasets quickly, and finally how to create and use custom PostgreSQL functions as part of the query to generate more realistic values for your dataset. If you haven't used generate_series() much before, we recommend first reading the other two posts. The first one is an intro to the generate_series() function, and the second one shows how to generate more realistic data.

With those skills in hand, you can quickly and easily generate tens of millions of rows of realistic-looking data. Even still, there's one more problem that we hinted at in part 2 - all of our data, regardless of how it's formatted or constrained, is still based on the random() function. This means that over thousands or millions of samples, every device we create data for will likely have the same MAX() and MIN() value, and the distribution of random values over millions of rows for each device generally means that all devices will have similar average values.
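
To make the issue concrete, here is a minimal sketch of the kind of query built up in parts 1 and 2 (the table and column names are illustrative): every device draws from the same uniform random() distribution, so the per-device MIN(), MAX(), and AVG() converge as the row count grows.

-- One reading per device per minute over a week; random() gives every
-- device the same distribution, so aggregates look alike across devices.
SELECT ts, device_id, random() * 100 AS reading
FROM generate_series(
       '2022-01-01 00:00:00+00'::timestamptz,
       '2022-01-07 23:59:00+00'::timestamptz,
       interval '1 minute') AS ts
CROSS JOIN generate_series(1, 100) AS device_id;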

This third post demonstrates a few methods for shaping the data you create so that it mimics a desired shape or trend. Do you need to simulate time-series values that cycle over time? What about demonstrating a counter

[...]
Posted by Elizabeth Garrett Christensen in Crunchy Data on 2022-01-19 at 16:52

 If you’ve read Crunchy blogs recently you probably noticed by now that we’re all big fans of indexing. Indexing is key to optimizing your database workloads and reducing query times. Postgres now supports quite a few types of indexes and knowing the basics is a key part of working with Postgres. 

The role of database indexes is similar to the index section at the back of a book. A database index stores information on where a data row is located in a table so the database doesn't have to scan the entire table for information. When the database receives a query, it consults the index first and then uses that information to retrieve the requested data.

Indexes are their own data structures and they’re part of the Postgres data definition language (the DDL). They're stored on disk along with data tables and other objects. 
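
As a minimal illustration (the table and index names here are made up for this example), the DDL for an index is a single statement, and the default B-tree type covers the common case of equality and range lookups:

CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_id bigint NOT NULL,
    created_at  timestamptz NOT NULL DEFAULT now()
);

-- Lookups by customer_id can now use the index instead of scanning the table.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);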

Posted by Martin Davis in Crunchy Data on 2022-01-18 at 14:53

A common spatial need is to find a polygon that accurately represents a set of points.  The convex hull of the points often does not provide this, since it can enclose large areas which contain no points.  What is required is a non-convex hull, often termed the concave hull.  

The Convex Hull and a Concave Hull of a point set

A concave hull is generally considered to have some or all of the following properties:

  • The hull is a simply connected polygon
  • It contains all the input points
  • The vertices in the hull polygon boundary are all input points
  • The hull may or may not contain holes
For a typical point set there are many polygons which meet these criteria, with varying degrees of concaveness.  Concave Hull algorithms provide a numeric parameter which controls the amount of concaveness in the result.  The nature of this parameter is particularly important, since it affects the ease-of-use in practical scenarios.  Ideally it has the following characteristics:
  • Simple geometric basis: this allows the user to understand the effect of the parameter and aids in determining an effective value
  • Scale-free (dimensionless): this allows a single parameter value to be effective on varying sizes of geometry, which is essential for batch or automated processing
  • Local (as opposed to global):  A local property (such as edge length) gives the algorithm latitude to determine the concave shape of the points. A global property (such as area) over-constrains the possible result. 
  • Monotonic area:  larger (or smaller) values produce a sequence of more concave areas
  • Monotonic containment: the sequence of hulls produced is topologically nested
  • Convex-bounded: an extremal value produces the convex hull
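
In PostGIS, a hull of this kind can be requested with ST_ConcaveHull. A hedged sketch follows; the table my_points and its geom column are illustrative, and the second argument controls the concaveness (1.0 yields the convex hull), though its exact interpretation has varied between PostGIS releases, so check the documentation for your version:

SELECT ST_ConcaveHull(ST_Collect(geom), 0.3, false) AS hull
FROM my_points;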

This is a well-studied problem, and many different approaches have been proposed.  Some notable ones are:

[...]

The pgDay Paris 2022 conference schedule has now been posted. Eight great sessions, eight great speakers with diverse perspectives, all in a single-track event so no one has to miss out on any of the presentations.

Space is limited at the event due to COVID-19 restrictions. Tickets can be purchased on the website.

Posted by Robert Haas in EDB on 2022-01-18 at 13:00

If PostgreSQL had the ability to give to a privileged non-superuser the right to administer objects belonging to some designated group of users just as if the privileged account were superuser, it would get us much closer to a world in which the database can be effectively administered by a non-superuser. A highly privileged user - let's call him sauron - could be given the right to administer tables, schemas, functions, procedures, and a variety of other objects owned by dependent users witchking and khamul just as if sauron were superuser. sauron might indeed feel himself to be virtually a superuser, at least within his own domain, as long as he didn't spend too much time thinking about the users over which he had not been given administrative rights. However, sauron might notice a few irksome limitations.

Read more »
Posted by Andreas 'ads' Scherbaum on 2022-01-17 at 14:00
PostgreSQL Person of the Week Interview with Kuntal Ghosh: Hi, I am Kuntal, and I live in Bangalore, India. I started exploring PostgreSQL during my Master's thesis, and continued by joining EDB in India. Currently, I’m working at Amazon on Aurora Postgres.
On 17th of January 2022, Michael Paquier committed patch: Introduce log_destination=jsonlog   "jsonlog" is a new value that can be added to log_destination to provide logs in the JSON format, with its output written to a file, making it the third type of destination of this kind, after "stderr" and "csvlog". The format is convenient … Continue reading "Waiting for PostgreSQL 15 – Introduce log_destination=jsonlog"
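
For reference, a minimal sketch of turning this on once you are running PostgreSQL 15 or later; jsonlog output goes through the logging collector, so that has to be enabled as well (and changing it requires a restart):

ALTER SYSTEM SET logging_collector = 'on';
-- log_destination can combine values, e.g. 'stderr,jsonlog'
ALTER SYSTEM SET log_destination = 'jsonlog';
SELECT pg_reload_conf();  -- log_destination itself only needs a reload
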
Back in 2018 I wrote why-upgrade.depesz.com – aggregator of changelogs between various versions of Pg. Want to know what you will get when upgrading from 12.1 to 12.9? Here you go. Longer time changes? Like, from 9.5.20 to 14.1? I've got your back. It even has a way to list every change that relates to … Continue reading "Configuration changes across Pg versions"
Posted by Franck Pachot in Yugabyte on 2022-01-14 at 15:08

pg_stat_statements can identify long queries, but it is limited when we need to go further: the queries are normalized, with values replaced by parameters. And the metrics gathered are cumulative timing statistics (min, max, average, stddev), usually over a long time window. It is hard to catch the statements responsible for the activity at peak load, and once caught, it is not easy to reproduce the execution plan or to know the user or application that ran them.

pg_stat_activity holds this information, and here is a small script to sample it. The goal of sampling the running statements is to start the investigation by looking at the top queries. However, they are all different, with different parameters. What I'm doing here is aggregating the ones sharing the same execution plan. For each active query in pg_stat_activity I run an explain and remove the parts that contain values (Filter and Index Conditions). Then I take an md5 of it, and this becomes my "Plan Hash Value".

Sampling function

This installs my sampling function and the table to store the plans:

drop table if exists yb_ash_plans;
drop function if exists yb_ash_sample;

create table if not exists yb_ash_plans(
 plan_hash_value text primary key, 
 samples int, running_for interval,last_plan text,last_pg_stat_activity jsonb
 );
create or replace function yb_ash_sample() returns int as
 $$
declare
 psa record;
 plan_xml text;
 plan_nofilter text;
 plan_md5 text;
 last_running_for interval;
 count_updated int:=0;
begin
  for psa in (select * from pg_stat_activity
   where state='active' and pid!=pg_backend_pid()) loop
   begin 
    --raise notice 'SQL %',psa.query;
    execute format('explain (format xml, costs false, buffers false) %s',psa.query) into plan_xml;
    exception when others then
     if sqlstate not like '42%' then raise warning '% % %',sqlstate,sqlerrm,psa.query; end if;
    end;
   end loop;
   if plan_xml is not null then
    --raise notice 'XML %',plan_xml;
    -- assumed pattern: strip the value-bearing elements (Filter, Index Cond,
    -- Recheck Cond) before hashing, per the description above
    plan_nofilter:=regexp_replace(plan_xml,'<(Filter|Index-Cond|Recheck-Cond)>[^<]*</\1>','','g');
[...]
Posted by Nikolay Samokhvalov in Postgres.ai on 2022-01-14 at 07:11
Progress bar for Postgres queries – let's dive deeper

Recently, I have read a nice post titled "Query Progress Bar", by Brian Davis. It describes an interesting approach to observing the progress of slow query execution.

At some point, the author mentions:

> Don't use this in prod.

And I agree. The article discusses long-running queries such as SELECTs, UPDATEs, DELETEs, and quite "invasive" methods of progress monitoring. In an OLTP production scenario, in most cases, we should try to limit the duration of such queries, setting statement_timeout to a very low value – such as 30 or even 15 seconds.
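
For example, a minimal sketch of capping statement duration for an application role (the role name here is made up); individual sessions can still adjust it when needed:

ALTER ROLE app_user SET statement_timeout = '15s';

-- or just for the current session:
SET statement_timeout = '30s';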

Let's dive deeper into the topic of query progress monitoring, and discuss various types of queries, how to monitor their progress, considering production and non-production environments separately.

Continue reading...

Posted by Regina Obe in PostGIS on 2022-01-14 at 00:00

As of PostGIS 3.0, the PostGIS raster support is no longer part of the postgis extension, but instead spun off into a new PostGIS extension called postgis_raster.

The two main reasons for this break were:

  • Raster functionality in PostGIS is fat with over 150 functions and several types. Wading through these extra functions frustrated many who had no use for rasters.

  • The raster GDAL dependency is very big, and many dreamed of having the postgis extension without it.

While repackaging raster as its own extension resolved many complaints, it meant a slightly more complicated upgrade process going from PostGIS 2.something to 3.something that even experienced PostGIS users managed to screw up.

I will detail the proper way of upgrading PostGIS raster, from a PostGIS 2.* install to 3.* install.

You can run these steps using psql or pgAdmin or any other PostgreSQL tool you want.

Regardless which version of PostGIS you are coming from, you should install the PostGIS 3.* binaries first.
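
A hedged sketch of the kind of call involved (the full, version-specific procedure is in the post itself): postgis_extensions_upgrade(), available since PostGIS 2.5, upgrades the postgis extension and repackages raster support into the separate postgis_raster extension where needed.

SELECT postgis_extensions_upgrade();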

Continue Reading by clicking title hyperlink ..
Posted by Paul Ramsey in Crunchy Data on 2022-01-13 at 18:23

Last month, just under the wire for a 2021 release, the 3.2 version of PostGIS hit the streets! This new PostGIS also supports the latest 3.10 release of GEOS, which underpins a few of the new features.

PostgreSQL Patroni Logical Replication Slot Failover

Failover of the logical replication slot has always been a pain point when using logical replication in PostgreSQL. The lack of this feature undermined the use of logical replication and acted as one of its biggest deterrents. The stakes and impact were so high that many organizations had to discard their plans around logical replication, and it affected many plans for migrations to PostgreSQL. It was painful to see that many had to opt for proprietary/vendor-specific solutions instead.

At Percona, we have written about this in the past: Missing Piece: Failover of the Logical Replication Slot.  In that post, we discussed one of the possible approaches to solve this problem, but there was no reliable mechanism to copy the slot information to a Physical standby and maintain it.

The problem, in a nutshell, is: the replication slot is always maintained on the primary node. If there is a switchover/failover to promote one of the standbys, the new primary won't have any idea about the replication slot maintained by the previous primary node. This breaks logical replication for the downstream systems, or, if a new slot is created, it becomes unsafe to use.
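
You can see the gap directly by checking the slot catalog on each node; on a freshly promoted standby (without a slot-failover mechanism) the following illustrative query simply returns nothing:

SELECT slot_name, plugin, active, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';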

The good news is that Patroni developers and maintainers addressed this problem from Version 2.1.0 and provided a working solution without any invasive methods/extensions. For me, this is a work that deserves a big round of applause from the Patroni community and that is the intention of this blog post and to make sure that a bigger crowd is aware of it.

How to Set it Up

A ready-to-use Patroni package is available from the Percona repository. But you are free to use Patroni from any source.

Basic Configuration

In case you are excited about this and want to try it, the following steps might be helpful.

The entire discussion is about logical replication. So the minimum requirement is to have wal_level set to “logical”. If the existing Patroni configuration has wal_level set to “replica” and you want to use this featu

[...]

I have done a few previous blog posts on who has contributed to PostgreSQL, but I did not do one last year. A couple people mentioned to me that they missed it, so I decided to do one this year, and I decided to gather statistics, using basically the same methodology that I have in the past, for both 2020 and 2021. As always, it is important to remember, first, that many people contribute to the project in ways that these statistics don't capture, and second, that the statistics themselves are prone to mistakes (since there is a bunch of manual work involved) and bias (since each commit is attributed to the first author for lack of knowledge of the relative contributions of the various authors). As usual, I have posted a dump of the database I used to generate this in case anyone wants to check it over for goofs or use it for any other purpose.

I calculate that there were 176 people who were the principal author of at least one PostgreSQL commit in 2020 and 182 such people in 2021. In 2020, 13 people contributed 66% of the lines of new code, and 35 people contributed 90% of the lines of new code. In 2021, these numbers were 14 and 41 respectively. In 2020, there were a total of 2181 commits from 26 committers, and in 2021, there were 2269 commits from 28 committers. In each year, about 5 committers committed about two thirds of the non-self-authored patches, with Tom Lane leading the pack in both years.

Here are the top 35 contributors by lines of new code contributed in 2020. Asterisks indicate non-committers. Note that some of these people are committers now but were not committers at the time.

 #  |          author           | lines | pct_lines | commits 
----+---------------------------+-------+-----------+---------
  1 | Tom Lane                  | 65203 |     25.95 |     436
  2 | Peter Eisentraut          | 28771 |     11.45 |     229
  3 | Paul Jungwirth [*]        | 10723 |      4.27 |       2
  4 | Heikki Linnakangas        |  8293 |      3.30 |      31
  5 | Rober

[...]

If you’ve never done it before, you might be daunted by the idea of giving a conference talk. You know: the work involved, the butterflies, how to make it a good talk and not a boring one, the people who might judge you… And perhaps the hardest bit: choosing a topic others will find interesting.

The last few months I’ve been working on a new event. It’s a free and virtual developer event happening on 12-13 Apr 2022 called Citus Con: An Event for Postgres. Organized by the Postgres and Citus team here at Microsoft Azure, Citus Con is geared toward Postgres users, Azure Database for PostgreSQL customers, and those who use the Citus extension to Postgres (or other PG extensions.)

Like the Postgres devroom at FOSDEM in 2022, Citus Con will be virtual. And while many of us miss the human interaction of in-person events these past years, virtual events do have a silver lining: no jetlag. And the impact on your schedule is not nearly as big. At Citus Con, there will be short 3-hour livestreams in 3 different geographies—Americas, EMEA, and APAC—to make it easy to join from any time zone. Plus on-demand videos you can watch anytime. Talks will be 25 minutes long hence less material to prepare as a speaker. And we’re working to set up a place where y’all can engage with each other during the event, too.

Wearing my talk selection team hat, I’ve been reaching out to people to spread the word about the CFP for Citus Con which closes on Feb 6th. Along the way, this question has come up:

Why give a talk at a Postgres conference?

This post will walk you through the ways you, your team, your project—and especially the Postgres community—can benefit from a talk you give.

First-time speakers are welcome in so many conferences

But what if this is your first conference talk? Or your first talk at a Postgres event? Or your first talk at a Microsoft-organized event?

Many talk selection committees—this is true for PostgreSQL CFPs (including Citus Con) and also loads of Python events—welcome, I

[...]
Standby Cluster With the Percona Distribution for PostgreSQL Operator

A customer recently asked if our Percona Distribution for PostgreSQL Operator supports the deployment of a standby cluster, which they need as part of their Disaster Recovery (DR) strategy. The answer is yes – as long as you are making use of an object storage system for backups, such as AWS S3 or GCP Cloud Storage buckets, that can be accessed by the standby cluster. In a nutshell, it works like this:

  • The primary cluster is configured with pgBackRest to take backups and store them alongside archived WAL files in a remote repository;
  • The standby cluster is built from one of these backups and it is kept in sync with the primary cluster by consuming the WAL files that are copied from the remote repository.

Note that the primary node in the standby cluster is not a streaming replica of any of the nodes in the primary cluster and that it relies on archived WAL files to replicate events. For this reason, this approach cannot be used as a High Availability (HA) solution. Even though the primary use of a standby cluster in this context is DR, it can also be employed for migrations.

So, how can we create a standby cluster using the Percona operator? We will show you next. But first, let’s create a primary cluster for our example.

Creating a Primary PostgreSQL Cluster Using the Percona Operator

You will find a detailed procedure on how to deploy a PostgreSQL cluster using the Percona operator in our online documentation. Here we want to highlight the main steps involved, particularly regarding the configuration of object storage, which is a crucial requirement and is best done during the initial deployment of the cluster. In the following example, we will deploy our clusters using the Google Kubernetes Engine (GKE), but you can find similar instructions for other environments in the previous link.

Considering you have a Google account configured as well as the gcloud (from the Google Cloud SDK suite) and kubectl command-line tools installed, authenticate your

[...]
Posted by Luca Ferrari on 2022-01-11 at 00:00

A couple of different solutions to an interesting problem.

Pentagon numbers

For a few weeks now, I have been implementing some of the tasks from the Perl Weekly Challenge in PostgreSQL-specific code. One interesting problem has been this week's task 2 of Challenge 147: finding a pair of pentagon numbers whose sum and difference are both pentagon numbers as well.
In this post, I discuss two possible solutions to the task.

What is a Pentagon Number?

A pentagon number is defined as the value of the expression n * ( 3 * n - 1 ) / 2, therefore the pentagon number corresponding to 3 is 12.
The task requires finding a pair of pentagon numbers such that:



P(n1) + P(n2) = P(x)
P(n1) - P(n2) = P(y)



It does not matter what x and y are: P(n1) and P(n2) must be pentagon numbers, and both their sum and their difference must be pentagon numbers too.

The first approach: a record based function

The first solution I came up with was inspired by the solution I provided in Raku, and is, quite frankly, a record-based approach.
First of all, I define an IMMUTABLE function named f_pentagon that computes the pentagon number value starting from a given number, so that f_pentagon( 3 ) returns 12. Why do I need a function? Because I want to implement a table with a stored generated column to keep track of numbers and their pentagon values.
For that reason, I created a pentagons table with a generic n column that represents the starting value and the p column that represents the computed pentagon value.



CREATE OR REPLACE FUNCTION
f_pentagon( n bigint )
RETURNS bigint
AS
$CODE$
        SELECT ( n * ( 3 * n - 1 ) / 2 );
$CODE$
LANGUAGE sql
IMMUTABLE;


DROP TABLE IF EXISTS pentagons;
CREATE TABLE pentagons
(
        n bigint
        , p bigint GENERATED ALWAYS AS ( f_pentagon( n ) ) STORED
);



INSERT INTO pentagons( n )
SELECT generate_series( 1, 5000 );




I inserted into the table 5000 records because I know, from the Raku solution, that what I’m look

[...]
Posted by Andreas 'ads' Scherbaum on 2022-01-10 at 14:00
PostgreSQL Person of the Week Interview with Emil Shkolnik: I was born in Riga, Latvia and our family moved to Israel in 1990, since then I am here.
Posted by Avinash Vallarapu in MigOps on 2022-01-07 at 21:37

On behalf of the PostgreSQL community and the team at MigOps, we wish all the readers a very Happy New Year - 2022. It is no wonder that PostgreSQL is the only Open Source database with a consistent increase in its adoption every year. The year 2021 has been another milestone in the PostgreSQL world. Every user of PostgreSQL has either directly or indirectly made a positive impact on PostgreSQL adoption through patch submissions or reviews, feature requests, blogs and articles, speaking at PostgreSQL conferences, discussions in postgres mailing lists about proposed features, etc. In this article we are going to see a summary of the year 2021 in the PostgreSQL ecosystem.

 

What’s being discussed in this article? A summary of great things in PostgreSQL in 2021:
  1. DB-Engines Rankings - PostgreSQL is again the DBMS of the Year 2021.
  2. PostgreSQL 14 released - Let us see some of its features.
  3. Minor versions released in 2021.
  4. PostgreSQL versions that got EOL in 2021.
  5. Extensions that took birth in 2021.
  6. Security Vulnerabilities that got detected and patched.
  7. Ora2Pg includes some more features simplifying Oracle to PostgreSQL migrations.
  8. Core Team report for 2021.
DB-Engines Rankings - PostgreSQL is again the DBMS of the Year 2021

DB-Engines determines the DBMS of the Year by subtracting each database's popularity score in the previous year from its score in the current year. Similarly, MigOps subtracted the popularity score of the databases in Dec, 2020 from the score of Dec, 2021 to find the DBMS of the Year - 2021.

We are proud to announce that PostgreSQL is again the DBMS of the Year - 2021. As per DB-Engines Rankings, PostgreSQL has already been the DBMS of the Year in 2017, 2018 and 2020. It now becomes the only database to win this title 4 times in the past 5 years.

 

PostgreSQL 14 released - Some of its features

The PostgreSQL community has announced the release of PostgreSQL 14 in the year 2021. This release included

[...]

Co-authored by Brian Pace

I was excited to hear that Kubernetes 1.22 was recently released with better support for cgroup-v2 and has support for Linux swap. These changes potentially resolve two of my chief complaints about running Postgres under Kubernetes. Obviously it will take some time before we see uptake in the wild on these features, but I wanted to become familiar with them.

What does High Availability actually mean when we’re discussing Postgres clusters? Sure, we can set an RPO, RTO, SLA, and other 3-letter acronyms regarding expectations and deliverables, but what are we trying to achieve? Fast failover? No lost transactions? Both? What about time necessary to rebuild or reclaim a former Primary node? How do all of our various replication configurables affect this? Let’s see if we can come up with a definition of High Availability that’s suitable to a Postgres context. This may seem a trivial exercise at first, but there’s a method to our madness. [Continue reading...]
Every January, the pganalyze team takes time to sit down to reflect on the year gone by. Of course, we are thinking about pganalyze, our customers and how we can improve our product. But, more importantly, we always take a bird's-eye view at what has happened in our industry, and specifically in the Postgres community. As you can imagine: A lot! So we thought: Instead of trying to summarize everything, let's review what happened with the Postgres project, and what is most exciting from our…
Posted by Jon Jensen in End Point on 2022-01-07 at 00:00


Normally in a newly-created PostgreSQL database cluster there is a single all-powerful administrative role (user) with “superuser” privileges, conventionally named postgres, though it can have any name.

After the initial cluster setup you can create other roles as needed. You may optionally grant one or more of your new roles the superuser privilege, but it is best to avoid granting superuser to any other roles because you and your applications should generally connect as roles with lower privilege to reduce the risk of accidental or malicious damage to your database.

Let’s break something 😎

Imagine you have a cluster with two or more superuser roles. If you accidentally remove superuser privilege from one role, you can simply connect as the other superuser and re-grant it.

But if you have a cluster where only the single postgres role is a superuser, what happens if you connect as that role and try to remove its superuser privilege?

$ psql -U postgres postgres
psql (14.1)
Type "help" for help.

postgres=# \du
                       List of roles
 Role name |            Attributes             | Member of
-----------+-----------------------------------+-----------
 postgres  | Superuser, Create role, Create DB | {}
 somebody  | Create DB                         | {}

postgres=> \conninfo
You are connected to database "postgres" as user "postgres" via socket in "/tmp" at port "5432".
postgres=# ALTER ROLE postgres NOSUPERUSER;
ALTER ROLE
postgres=# \du
                 List of roles
 Role name |       Attributes       | Member of
-----------+------------------------+-----------
 postgres  | Create role, Create DB | {}
 somebody  | Create DB              | {}

postgres=# ALTER ROLE postgres SUPERUSER;
ERROR:  must be superuser to alter superuser roles or change superuser attribute
postgres=# \q

PostgreSQL happily lets us do that, and now we have no superuser, and so we cannot re-grant the privilege to that role or any other!

Homebrew PostgreSQL problem

Aside from such a severe

[...]
Io, one of Jupiter's moons

SQL query optimization is challenging for those who have just started working with PostgreSQL. There are many objective reasons for this, such as:

  • the difficulty of the field of system performance in general,
  • lack of good "playground" environments where people can experience how databases work at a larger scale,
  • lack of certain capabilities in Postgres observability tools that are still developing (though, at a good pace),
  • insufficiency of good educational materials.

All these barriers are reasonable. They limit the number of engineers possessing well-developed Postgres query optimization skills. However, there is a specific artificial barrier that is rather influential and which is relatively easy to eliminate.

Here it is: the EXPLAIN command has the BUFFERS option disabled by default. I am sure it has to be enabled and used by everyone who needs to do some SQL optimization work.
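
Enabling it is just a matter of adding the option when running the command (the table and predicate here are only illustrative):

-- BUFFERS adds shared/local/temp block hit, read, dirtied, and written counts
-- to each plan node, alongside the timing that ANALYZE already provides.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM my_table WHERE id = 42;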

In Part 2 of this series I defined a function to get tokens for API call rate limiting. I ran it in \watch 0.01 loops in 8 sessions to show concurrent access for the same user. With PostgreSQL, the guarantee that reads and writes happen on the same state is enforced by pessimistic locking: sessions wait but don't fail. With YugabyteDB, optimistic locking is more scalable, but it can fail on conflict. This must be handled by the application. In Part 3 of this series, I introduced the JDBC driver for YugabyteDB. I'll use it even when connecting to PostgreSQL because everything is compatible. But of course, no cluster-aware features will be used, as PostgreSQL has only one writer endpoint.

Here is how I installed the driver (but check for newer releases):

wget -qc -O jdbc-yugabytedb.jar https://github.com/yugabyte/pgjdbc/releases/download/v1.0.0/jdbc-yugabytedb-42.3.0.jar
export CLASSPATH=.:./jdbc-yugabytedb.jar

The creation of the table and function is in the Part 2 of this series and I do not reproduce it here. All will go to a github repository at the end of this blog series.

I'll create a RateLimitDemo class with a main() that defines the DataSource ("jdbc:yugabytedb://...") in args[1] and starts a specific number of threads (args[0]). The call to the PL/pgSQL function defined in Part 2 takes a rate limit (args[3]) for an id (args[2]) which can be a user, a tenant, an edge location... on which we want to allow a maximum number of tokens per second. Each thread will connect to the data source (ds) and set the user requesting the tokens (id) with a rate of refill per second (rate token/s here) and the number of retries in case of transaction conflicts (max_retry passed in args[4]).

The constructor public RateLimitDemo(YBClusterAwareDataSource ds, String id, int rate, int max_retry) connects, sets the default transaction isolation level to SERIALIZABLE, and prepares the statement to call the procedure. The SELECT statement returns the host@pid session identification (rs.getString(1)) and th

[...]
Posted by Rahul Shirsat in EDB on 2022-01-06 at 14:45
Senior Software Engineer Rahul Shirsat guides you through the setup process for SSH tunneling with pgAdmin 4.
Posted by Ryan Lambert on 2022-01-04 at 05:01

A data dictionary is an important tool for anyone that stores and consumes data. The PgDD extension makes it easy to inspect and explore your data structures in Postgres. This post shows how PgDD provides access to current and accurate information about your databases for a variety of users:

  • Analysts
  • DBAs and Developers
  • The Business

This data dictionary information from PgDD is made available using standard SQL by querying a small set of views.

Background

Relational databases, including Postgres, track the majority of the information needed for a data dictionary. This is done in the underlying system catalogs; Postgres' system catalogs are in the pg_catalog schema. The challenge with using the system catalogs is that they are not very user friendly to query for the type of details commonly needed. PgDD does not do anything magical; it is simply a wrapper around the Postgres system catalogs!
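
For contrast, here is the kind of raw catalog query a user would otherwise need just to list tables with an approximate row count (a sketch against the standard catalogs, not PgDD itself):

SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       c.reltuples::bigint AS approx_rows
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
ORDER BY approx_rows DESC;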

In a previous post I stored tags and groups as arrays in the posts table. I created GIN indexes on them, but without the possibility of adding the timestamp to them. To optimize queries on a list of tags, or groups, within a date range, I added secondary tables, automatically maintained by triggers. This was much more efficient in YugabyteDB because such an additional table behaves like an index (tables are stored as LSM trees rather than heap tables) and because, without it, the Rows Removed by Filter from the GIN index are very expensive in a distributed database. But when showing the execution plan with the GIN index, I mentioned that PostgreSQL would show another join method. In this post I'll show the difference. I had it in draft, and this tweet by Nikolay Samokhvalov about arrays was a good occasion:

PostgreSQL

I've run the same as the previous post but this time on AWS Aurora with PostgreSQL compatibility. For what I'm doing there, this is the same as PostgreSQL. I've loaded approximately the same generated data:

aurora=> select count(*) from posts_by_user;
  count
----------
 42670000
(1 row)

Time: 13444.326 ms (00:13.444)

The first query was about getting the last 2 days of posts for one user:

aurora=> explain analyze
           select posts_by_user.*
            from posts_by_user where user_id=1
            and created_date > now() - 2*interval '1 day'
           order by created_date desc;
                                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
[...]
Posted by brian davis on 2022-01-03 at 15:00

If this were YouTube, the thumbnail would look something like this

Imagined YouTube Thumbnail

But seriously. Is there any way to monitor long running Postgres queries?

Kinda!

Using sequences.

The Setup

We create two tables. One small and one big.

DROP TABLE IF EXISTS small;
CREATE TABLE small (
  some_val int
);

INSERT INTO small
SELECT
  gs
FROM
  generate_series(1, 100) AS gs;

DROP TABLE IF EXISTS big;
CREATE TABLE big (
  other_val int
);

INSERT INTO big
SELECT
  gs % 100 + 1
FROM
  generate_series(1, 50000000) AS gs;

CREATE INDEX ON big (other_val);

VACUUM ANALYZE small;
VACUUM ANALYZE big;

The Query

This is the slow query we'll try to monitor

SELECT
  some_val,
  (
    SELECT
      count(*)
    FROM
      big AS b
    WHERE
      b.other_val = s.some_val
  )
FROM
  small AS s;

The Technique

Create a dummy sequence

DROP SEQUENCE IF EXISTS my_sequence;
CREATE SEQUENCE my_sequence;

Find a way to incorporate a call to nextval for this sequence

nextval('my_sequence')

Easy enough.

SELECT
  some_val,
  (
    SELECT
      count(*)
    FROM
      big AS b
    WHERE
      b.other_val = s.some_val
  ),
  nextval('my_sequence')
FROM
  small AS s;

Now, while the slow query is running, we can open another psql session and run our 'monitor' in a \watch and see how quickly Postgres is chewing through rows. Sequences aren't transactional, which is why this works. When we see the values from nextval in our watch start shooting up, that's Postgres working its way through the query.

SELECT nextval('my_sequence');
\watch 1

The Video Content

Query monitoring animation

Not Perfect

Unfortunately this won't work for every query. Postgres needs to evaluate the call to nextval in the 'meat' of the execution for it to work, and Postgres won't always choose to do that. It won't work on this query, for example

SELECT
  some_v
[...]
Posted by Franck Pachot in Yugabyte on 2022-01-03 at 14:30

When you provide a service to users but want to prevent spam or attacks from slowing down your system, you need to implement an API rate limiting function to throttle a user's usage after a specified number of calls per second. This can use counters in an in-memory database, like Redis. You can find good descriptions here and here. When your service is distributed, a shared counter is an anti-pattern for scalability. In this series, I'll show a solution with PostgreSQL, and have it scale out in YugabyteDB.

Token Buckets

The simplest solution is to store a counter per user, as a number of tokens. Each call requires a token, and the counter is decreased. Every second, tokens are added according to the defined rate of tokens per second, so storing the last call timestamp is sufficient. The maximum number of tokens is based on the refill rate over, let's say, one minute. The first call for a user will initialize the tokens with this maximum (the rate applied to this one minute).

Here is the PostgreSQL table to store it:

create table rate_limiting_token_bucket (
 id text primary key,
 tk bigint not null,
 ts timestamptz not null);
  • id identifies the user
  • tk is the amount of available tokens for this user
  • ts records the last time a token was retrieved

In YugabyteDB the table is sharded on a hash of the first column, so this is distributed per user.

When there's no existing row for a user, I'll initialize with an INSERT to set the tokens tk to window_seconds*refill_per_sec-1 where window_seconds is the maximum (I'll use 60 seconds) and refill_per_sec is the number of tokens allowed per second. -1 is the token taken for the current call.
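
A minimal sketch of that first-call initialization, using the table above (window_seconds and refill_per_sec are the assumptions just described; the literal values are only examples):

-- First call for a user: fill the bucket to the 60-second maximum,
-- minus the one token consumed by this call (here refill_per_sec = 5).
INSERT INTO rate_limiting_token_bucket (id, tk, ts)
VALUES ('user-1', 60 * 5 - 1, clock_timestamp());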

When there's already a row with a previous amount of tokens, I'll UPDATE to set tk to tk-1+refill_per_sec*extract(epoch from clock_timestamp()-ts). tk-1 takes the token for the current call. extract(epoch from clock_timestamp()-ts) is the number of seconds since the last call for this user, which I multiply by the rate of tokens acquired pe

[...]