PostgreSQL
The world's most advanced open source database
Posted by Bruce Momjian in EDB on 2021-04-19 at 14:15

We all operate on assumptions, e.g., we expect to be alive tomorrow, we expect our family to always support us, we assume our car will start. Everyone needs to make some assumptions to function. But assumptions aren't always true, e.g., some day the sun will rise, and you will not be alive to see it. So, where does that leave us? We have to make assumptions to survive, but what we assume today might be false tomorrow. Fundamentally, successful people, while they rely on assumptions, are always alert for signs that their assumptions are false.

There are a number of assumptions about software that drive user behavior. While some of these assumptions might have been true in the past, they are now of questionable validity:

  • Software that requires payment is always better than free software
  • Open source code is less secure because it makes security flaws visible
  • Open source isn't developed by serious people
  • Oracle is the best database
  • I will always be employed for my Oracle skills

People who actively question assumptions and regularly test their validity are more likely to take advantage of new opportunities, and I think the list above validates that. I often speak to current and prospective EDB customers, and they ask many questions — sometimes I think they are going to ask questions forever. However, the questions are often based on incorrect or no-longer-valid assumptions they have about the software world. I find that explicitly stating those assumptions, and explaining why they might no longer be valid, is an effective way to get beyond specific questions. It helps them realize that Postgres is a great opportunity for their organizations, but only if they are willing to see the world as it is now, and not as it was.

Posted by Andreas 'ads' Scherbaum on 2021-04-19 at 14:00
PostgreSQL Person of the Week Interview with Pavlo Golub: Hello. My name is Pavlo Golub. I’m originally from Ukraine, but now I live in Slovakia. I’m a developer and consultant at Cybertec PostgreSQL International.

Back in Nov 2021, I wrote about calling stored procedures from Java and demonstrated how we can call a simple stored procedure from Java, or one that contains an IN parameter. I have done some more playing around with calling stored procedures from Java, so this blog is about calling a stored procedure with multiple IN parameters and multiple INOUT parameters. As I explained earlier in my “Stored Procedure” series, stored procedures don't have a return value, so if you want to return a value you need to use an INOUT parameter. Typically OUT parameters are used for returning a value, but PostgreSQL doesn't support OUT parameters for stored procedures yet; that support is currently being worked on by the PostgreSQL community.

First, a small introduction to stored procedures before we dive into the main agenda of this blog.

“A stored procedure or a stored function is a module or program that is stored in the database; it extends the database functionality by creating and reusing user-defined programs in supported SQL/PL languages. They can be created in multiple languages (details to follow), and once compiled they become schema objects which can be executed or referenced by multiple applications. Stored procedures/functions are a very useful component of a database application, as they underpin the complex application logic and database operations that need to be executed and reused multiple times by the database application. Without this feature, operations that need to be repeated would have to be carried out with several complex SQL queries and round trips, instead of a single function within the database.”

Calling a procedure with multiple IN parameters

First we are going to create a stored procedure that accepts multiple IN parameters, and we will then call it from Java using the PG JDBC driver.

CREATE OR REPLACE PROCEDURE updStudent
(
    IN std_id        INT,
    IN std_class     VARCHAR,
    IN std_tot_marks INT
)
LANGUAGE plpgsql AS
$$
BEGIN
    -- updating the student's marks
[...]
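Since the example above is cut off here, the following is a minimal, self-contained sketch of my own (not the author's full code) showing a procedure that combines IN and INOUT parameters and how it is invoked with CALL; the table and column names are assumptions made only for this illustration:

-- Hypothetical table used only for this sketch
CREATE TABLE IF NOT EXISTS student (
    std_id        INT PRIMARY KEY,
    std_class     VARCHAR,
    std_tot_marks INT
);

-- Two IN parameters and one INOUT parameter; the INOUT value is
-- returned to the caller as a result row.
CREATE OR REPLACE PROCEDURE getStudentMarks(
    IN    p_id    INT,
    IN    p_class VARCHAR,
    INOUT p_marks INT)
LANGUAGE plpgsql AS
$$
BEGIN
    SELECT std_tot_marks INTO p_marks
    FROM student
    WHERE std_id = p_id AND std_class = p_class;
END;
$$;

-- From SQL, pass any placeholder value for the INOUT parameter:
CALL getStudentMarks(1, '5th', NULL);

On the JDBC side the idea is the same: issue the CALL with the IN values bound and the INOUT parameter registered as an output parameter.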
Posted by Tatsuo Ishii in SRA OSS, Inc. on 2021-04-19 at 08:09

 


Pgpool-II provides a "query cache" feature which speeds up SELECTs to the database. It stores query results in the specified storage (either shared memory or memcached). If the same query (more precisely, a SELECT query with identical text) arrives, Pgpool-II returns the cached result without accessing the database. This is very fast because:

  • Query results are fetched from memory, rather than from the database's (typically on-disk) storage
  • No CPU or memory is used on the database server

Is there any pitfall?

Whether a SELECT result is fetched from the database or from the cache is transparent to the client. That is, you don't notice the difference except for the response speed. Ok, sounds great. But are there any pitfalls to using the query cache? One is the cache hit rate. Since any modification to an underlying table clears the cache, systems with frequent updates are not suitable for this feature. According to the manual, the query cache is not recommended on systems with a cache hit rate lower than 70%.

Row visibility rules

Another pitfall you may need to care about is row visibility rules. The rule defines how the table rows are visible to other sessions (so you don't need to worry about this if you are the only user in the database). PostgreSQL provides several transaction isolation levels. The row visibility rule will vary depending on the transaction isolation level used in the transaction. For example, with the read committed isolation level (which is the default), other users will not see the new data of rows updated by your transaction until it gets committed.

 

However, with the repeatable read isolation level, other sessions that are already inside a transaction will not see the new data of rows updated by your transaction even after it gets committed. The changes become visible to them only after they start a new transaction.
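To make the repeatable read behavior concrete, here is a minimal sketch of my own (the table and values are hypothetical, not from the post):

-- Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;   -- snapshot is taken here

-- Session B, in another connection:
--   UPDATE accounts SET balance = 0 WHERE id = 1;
--   COMMIT;

-- Session A, still in the same transaction:
SELECT balance FROM accounts WHERE id = 1;   -- still sees the old balance
COMMIT;

-- Session A, in a new transaction, now sees balance = 0.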


Query cache could break the visibility rule

Pgpool-II's query cache does not follow the visibility rule. The cache is cre

[...]
Posted by Pavel Stehule on 2021-04-18 at 02:47

I released pspg 4.6.0. There are a few interesting features compared to the 4.3.0 release last mentioned here:

  • fixed stream mode on Apple platforms
  • new option --menu-always - the top bar menu will always be visible
  • fixed some visual artefacts in special cases when the user uses a style with border = 1
  • new query stream mode.
Query stream mode is designed for use from text editors that can start pspg in a separate terminal in permanent mode. In query stream mode, pspg works like a Postgres client: it reads queries from a pipe, executes them and draws the result. The queries are separated by the Group Separator character (GS, 0x1D).
Posted by Avinash Vallarapu in MigOps on 2021-04-17 at 04:40

Recently on Twitter, I saw a poll by Qovery asking how hard it is to create a database like PostgreSQL. With no surprise, people mostly voted it as “Very hard”. However, the word “like” in this question is very ambiguous, as rightly pointed out in one of the comments by Dave Cramer, who is one of the very active contributors and committers of the PostgreSQL JDBC driver. Anyway, this ambiguity resulted in the idea of writing an article on this topic, breaking it into 2 questions and discussing them further.

  1. How difficult is it to install PostgreSQL and create a database?
  2. How difficult is it to create open source database software like PostgreSQL?

Please do not forget to take our Survey at the end of this article. This survey is to help us understand what you think about PostgreSQL. Results will be published through another blog post.

I would like to start with the first question here.

Installing and creating a PostgreSQL database

It might be very easy to get inclined towards a DBaaS platform where a database can be created through a few clicks. DBaaS may also be marketed as a database (available as a service) that requires no maintenance and no administration or tuning, as it is automatically managed. Is this really true? Is Postgres offered on DBaaS really PostgreSQL? Are users paying more than they estimated and getting into a deep vendor lock-in? Are users losing complete visibility into their databases and lacking a lot of features with DBaaS? To enable automatic management of backups and high availability similar to what DBaaS promises as a managed service, isn’t it wise to consider projects like pgBackRest for backups and Patroni for high availability? Anyway, I would love to answer these questions in a separate article. Meanwhile, I would encourage you to watch the presentation on – Why Public Database as a Service is Prime for Open Source Distribution, by Peter Zaitsev, CEO of Percona.

Through this article, I would encourage users to try installing PostgreSQL and

[...]

Connection pooling and management is one of those things most people ignore for far too long when it comes to their database. When starting out, you can easily get by without it. With 1 or 2 application servers spawning 5-10 connections, even the tiniest of Postgres servers can handle that. Even with our $35 a month hobby plan on Crunchy Bridge, we can push through 5,000 transactions per second, which is quite a bit for fewer than 20 connections.

Posted by Bruce Momjian in EDB on 2021-04-16 at 14:00

Postgres has a where, when, and what set of options to control database server log messages. One frequent request is to allow log messages to be split into multiple files or sent to multiple servers. Postgres doesn't support this capability internally, but rsyslog, enabled with log_destination=syslog, allows complex filtering of database server log messages. Specifically, rsyslog's complex filters, combined with log_line_prefix, allow almost unlimited control over how log messages are recorded.
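As a rough sketch of the Postgres side of such a setup (the rsyslog filter rules themselves live in the rsyslog configuration and are not shown; the prefix format below is just an example, not a recommendation from the post):

ALTER SYSTEM SET log_destination = 'syslog';
ALTER SYSTEM SET syslog_ident = 'postgres';
-- include database, user and application name so rsyslog filters can match on them
ALTER SYSTEM SET log_line_prefix = 'db=%d,user=%u,app=%a ';
SELECT pg_reload_conf();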

Posted by Vik Fearing in EDB on 2021-04-15 at 15:21
As part of my work on the CTO Team here at EDB, I do daily benchmarking of the PostgreSQL development branch with 500 concurrent users in order to make sure performance does not go down, and to identify when it does. This also helps us determine which features improve performance for this TPROC-C workload. [Continue reading...]
Posted by Jimmy Angelakos in EDB on 2021-04-15 at 14:34

You have a huge table, and it is necessary to change a column's data type, but your database has to keep running with no downtime. What do you do?

Here's one way to perform this change, in as unobtrusive a manner as possible while your table keeps serving users, by avoiding long DDL table locks and leveraging procedural transaction control.
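One common pattern along these lines (my own hedged sketch, not necessarily the exact recipe from the talk; the table and column names are made up) is to add a new column of the target type, backfill it in small committed batches from a procedure, and then swap the columns in a short final transaction:

ALTER TABLE events ADD COLUMN id_new bigint;

CREATE OR REPLACE PROCEDURE backfill_events(batch_size int)
LANGUAGE plpgsql AS
$$
DECLARE
    rows_done int;
BEGIN
    LOOP
        UPDATE events
           SET id_new = id
         WHERE id IN (SELECT id FROM events
                       WHERE id_new IS NULL
                       LIMIT batch_size);
        GET DIAGNOSTICS rows_done = ROW_COUNT;
        EXIT WHEN rows_done = 0;
        COMMIT;   -- procedural transaction control keeps each batch's locks short
    END LOOP;
END;
$$;

CALL backfill_events(10000);
-- Finally, in one quick transaction, drop and rename the columns (with any indexes rebuilt beforehand).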

Video from my short talk at this year's #FOSDEM #PostgreSQL 🐘 devroom 👇

You can find the slides from the talk here.

I don’t usually post about upcoming PostgreSQL features and rather concentrate on tools available / versions released… but this feature got me excited and will certainly be a huge relief for real-life usage, especially for beginners! I had almost lost hope we might see this day – but after many years of wishing for it, one of the features most requested by normal business users/application developers has landed in the repository. Kudos to the author and reviewer!

commit 6c3ffd697e2242f5497ea4b40fffc8f6f922ff60
Author: Stephen Frost 
Date:   Mon Apr 5 13:42:52 2021 -0400

    Add pg_read_all_data and pg_write_all_data roles
    
    ….
    
    Reviewed-by: Georgios Kokolatos
    Discussion: https://postgr.es/m/20200828003023.GU29590@tamriel.snowman.net

Full Git commit entry here.

Why is it a “killer feature”?

In a perfect world, no one would require such convenience functionality – given, of course, that everyone knew more or less exactly what their data model was going to look like, could predict how it was going to be used and, more importantly, who was going to use it, and designed the access/privilege system accordingly. Sadly, we sometimes have to live and deal with the real world, where we mostly don’t have the time and resources to work out perfect solutions — so there’s a healthy number of question marks hovering around every decent-sized project, to say the least.

Then again, it’s true that for years we have managed to get by without this new feature. But often what we see happening is that database operators cannot foresee and nicely accommodate all future data access requirements… and in many cases, they fix the situation with the hefty price tag of handing out potentially dangerous superuser access!

If you’re new(ish) to Postgres you might think here – hmm, how bad can it be? All our data analysts have been superusers for years, and nothing bad happened? Sure, I get it, most of the time nothing bad happens… but nevertheless, you will have a constant “cloud

[...]
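For context, once you are running a build that includes this commit (the feature is slated for PostgreSQL 14), the new roles are handed out with a plain GRANT instead of superuser; a minimal sketch with a hypothetical role name:

-- instead of: ALTER ROLE analyst SUPERUSER;
CREATE ROLE analyst LOGIN PASSWORD 'change-me';
GRANT pg_read_all_data TO analyst;        -- read access to all tables, views and sequences
-- GRANT pg_write_all_data TO analyst;    -- only if write access is really needed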
Posted by Chengxi Sun in Highgo Software on 2021-04-15 at 04:00

There are many advantages to using bind variables in SQL statements, especially in conditional statements. Using bind variables saves the trouble of parsing and optimizing the query on every execution. There are also disadvantages: bind variables can make the optimizer lose prediction accuracy and cause performance to decrease. In PostgreSQL, bind variables are represented by a dollar sign followed by a number, like ‘$n’. Have you ever wondered how these parameters are passed and processed in Postgres? In general, a query with bind variables is transformed into a query tree with Param nodes. Then, when the query gets executed, it fetches the input parameters stored in the execution state. More details will be introduced below.
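At the SQL level, the easiest way to see $n placeholders in action is a prepared statement; drivers send them in the same form through the extended query protocol. A small sketch (the table and columns are hypothetical):

PREPARE get_orders(int, date) AS
    SELECT * FROM orders WHERE customer_id = $1 AND order_date >= $2;

EXECUTE get_orders(42, '2021-01-01');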

How placeholders are handled in a query

In parser
When the Postgres parser scans a dollar sign followed by integers in the input query string, it treats it as a ‘PARAM’ token, makes a ParamRef node based on that information, and inserts it into the parse tree. If the parameter is of a composite type, you can use dot notation (such as $1.name) to access an attribute of the parameter.

| PARAM opt_indirection
    {
        ParamRef *p = makeNode(ParamRef);
        p->number = $1;
        p->location = @1;
        if ($2)
        {
            A_Indirection *n = makeNode(A_Indirection);
            n->arg = (Node *) p;
            n->indirection = check_indirection($2, yyscanner);
            $$ = (Node *) n;
        }
        else
            $$ = (Node *) p;
    }

In analyzer
In Postgres, there is a primitive node type called ‘Param’. It represents various types of parameters in the query/plan tree. There are four kinds of Param nodes:

  • PARAM_EXTERN: The parameter value is supplied from outside the plan.
  • PARAM_EXEC: The parameter is an internal executor parameter, used for passing values into and out of sub-queries or from nestloop joins to their inner scans.
  • PARAM_SUBLINK: The parameter represents an output column of a SubLink node’s sub-select. (This type of Par
[...]
Posted by Egor Rogov in Postgres Professional on 2021-04-15 at 00:00

Last time we got acquainted with the structure of an important component of the shared memory — the buffer cache. A risk of losing information from RAM is the main reason why we need techniques to recover data after failure. Now we will discuss these techniques.

The log

Sadly, there's no such thing as miracles: to survive the loss of information in RAM, everything needed must be duly saved to disk (or other nonvolatile media).

Therefore, the following approach is taken: along with changing the data, a log of these changes is maintained. When we change something on a page in the buffer cache, we create a record of this change in the log. The record contains the minimum information sufficient to redo the change if the need arises.

For this to work, the log record must reach disk before the changed page does. And this explains the name: write-ahead log (WAL).

In case of failure, the data on disk appears to be inconsistent: some pages were written earlier, others later. But the WAL remains; we can read it and redo the operations that were performed before the failure but whose results did not make it to disk in time.
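To watch the log advance as changes are made, you can ask the server for its current WAL position (a sketch of my own; the table is just a stand-in, while the functions are standard since PostgreSQL 10):

SELECT pg_current_wal_lsn();                     -- current end of the write-ahead log
INSERT INTO some_table VALUES (1);               -- any change generates WAL records first
SELECT pg_current_wal_lsn();                     -- the position has moved forward
SELECT pg_walfile_name(pg_current_wal_lsn());    -- the WAL segment file holding that position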

...

Posted by Bruce Momjian in EDB on 2021-04-14 at 19:30

Postgres has a lot of settings — 329 in Postgres 13. Most of these settings control one specific thing, and the defaults are fine for most use-cases. Synchronous_commit is one of the settings that isn't simple and controls perhaps too many things, though there are operational restrictions which prevent it from being split into separate settings.

In fact, this setting is so complex that when I started to research what it controls, I realized our existing documentation was not completely clear on all of its capabilities. Therefore, I rewrote this setting's documentation in October of 2020, and the updated documentation was released in our November 12th minor releases. As you can see from the table added to the docs, synchronous_commit controls not only the local durability of commits, but also standby behavior.

First, let's focus on its local behavior, that is, without considering standby servers. The value off causes synchronous_commit to not wait for the records associated with the commit to be written to the write-ahead log. It does, however, guarantee data consistency, meaning that in the event of a crash, the transactions it does recover will be consistent with each other. It just doesn't guarantee that all committed transactions will be restored in the event of a crash. This is in contrast to the setting fsync=off, which does not guarantee data consistency, meaning that after an operating system crash, some transactions might be partial or inconsistent with other records.
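Because synchronous_commit can be changed per transaction, a common pattern is to relax it only for low-value writes; a minimal sketch (the audit_log table is hypothetical):

BEGIN;
SET LOCAL synchronous_commit TO off;                  -- applies to this transaction only
INSERT INTO audit_log (message) VALUES ('page viewed');  -- a crash may lose this commit,
COMMIT;                                               -- but the database stays consistent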

Continue Reading »

Some time ago James Courtney reported missing functionality. Specifically, when one uses auto-explain, the logged explains contain the query text. So, when such an explain is pasted on explain.depesz.com, it stands to reason that it should be able to extract the query on its own, without having to manually extract it and put it in the query box. … Continue reading "Changes on explain.depesz.com – extracted query from auto-explain plans"
Posted by Elizabeth Garrett Christensen in Crunchy Data on 2021-04-14 at 15:54

Additional Contributors: David Christensen, Jonathan Katz, and Stephen Frost

In part 1 of this blog mini-series on machine learning, we looked at how to set up PostgreSQL so that we can perform regression analysis on our data using TensorFlow from within the database server using the pl/python3 procedural language. We also looked at some of the basics of pl/python3 as a primer for what is to come in this and part 3 of the series.

In this part, we'll look into data pre-processing. In order to ensure that our regression analysis is as accurate as possible, we first need to analyse and understand the raw data. We can use what we learn to remove outlier values from the training and validation datasets to ensure those values don't skew the results. In addition, we need to look at the columns in the data to understand which are strongly correlated and which are loosely or even uncorrelated with the result. These columns (or features in machine learning parlance) can be removed from the training and validation data to simplify the work of the neural network and increase its accuracy. [Continue reading...]
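As a rough idea of what this kind of first pass can look like in plain SQL (the table, columns and cutoffs below are purely illustrative and not from the series, which works in pl/python3):

-- Drop rows outside the 1st-99th percentile of the target value
WITH bounds AS (
    SELECT percentile_cont(0.01) WITHIN GROUP (ORDER BY price) AS low,
           percentile_cont(0.99) WITHIN GROUP (ORDER BY price) AS high
    FROM house_prices
)
SELECT h.*
FROM house_prices h, bounds b
WHERE h.price BETWEEN b.low AND b.high;

-- Check how strongly a candidate feature correlates with the target
SELECT corr(price, living_area) FROM house_prices;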
Query parallelism has been supported in PostgreSQL for quite a while now. People in the PG community call it "Parallel query", but by now it is not limited to just SELECT queries. Index builds leverage multiple cores, and even utilities like VACUUM now make use of parallelism. Furthermore, the community is working on parallelizing COPY and INSERTs.
                                                                                
I was interested in doing a kind of "sanity" check of this capability specifically on the ARM64 platform. Let's see how it goes. At the same time, we will try to understand a little bit about how to interpret the parallelism part of the plan output. Subqueries and partitions are not covered in this blog; I will probably cover them in another one.
                                                                                
For running the queries I generated a scale-5 TPC-H benchmark schema with the help of scripts taken from https://github.com/tvondra/pg_tpch.git. My machine is an 8 CPU VM with 15GB memory and Ubuntu 18.04, running on a "Kunpeng 920" 2.6 GHz host. The PostgreSQL build was from the git master branch, so you can treat it as somewhere between PostgreSQL 13 and 14. All the tests were run with max_parallel_workers_per_gather = 4. The tables were pre-warmed, so I reduced seq_page_cost and random_page_cost to as low as 0.1.
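In session form, that setup boils down to something like the following (a sketch of the settings named above; lineitem is one of the TPC-H tables):

SET max_parallel_workers_per_gather = 4;
SET seq_page_cost = 0.1;
SET random_page_cost = 0.1;

-- A typical parallel plan then shows a Gather node on top of parallel-aware scan nodes:
EXPLAIN (COSTS OFF) SELECT count(*) FROM lineitem;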
                                                                                
The JIT-related part of the EXPLAIN output is omitted from the plans to keep the focus on the main query plan. Also, estimated costs are omitted in order to make the plan output compact.
                                                      


Parallel sequential scan
                                                                                
This is the simplest one, and the one with which query parallelism got introduced in PostgreSQL 9.6.
                                                                                
Just a plain
[...]

PostgreSQL allows end users to store BLOBs (binary large objects) in the database. Many people use these functions to handle data directly in SQL. There has long been a discussion about whether this is a good thing or not. We expect that particular discussion to go on indefinitely. However, if you are in the “pro BLOB” camp, we want to share some insights into how binary data can be handled in PostgreSQL with maximum efficiency.

Loading files into the database

BLOBs are important: to show how they work, I have created a simple file which we can import into the database later on:

iMac:~ hs$ echo abcdefgh > /tmp/file.txt

To import this file, we can call the lo_import function and pass the name of the file you want to load (as superuser). Here is how it works:


test=# SELECT lo_import('/tmp/file.txt');
 lo_import
-----------
 98425
(1 row)

What we see here is that PostgreSQL gives us a number (= object ID). Note that the filename has “vanished”, so you have to store it somewhere if you want to remember those values. The way to do this is to create a basic table:


test=# CREATE TABLE t_file (name text, oid_number oid);
CREATE TABLE

test=# INSERT INTO t_file VALUES ('/tmp/file.txt', lo_import('/tmp/file.txt')) RETURNING *;
     name      | oid_number
---------------+------------
 /tmp/file.txt |      98432
(1 row)

PostgreSQL gives you the option of using the OID data type to store object IDs. Storing the filename is easy – a text or varchar column will do the job, in case the path is needed later on. In general, the large object is totally independent of the file in the filesystem – it has nothing to do with it. Storing the filename is therefore merely done in order to remember what we imported.
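Going in the other direction also works; this is my own quick aside, not part of the original text. The contents of a large object can be read back with lo_get (as bytea) or written to a server-side file with lo_export (which, like lo_import, needs appropriate privileges):

SELECT lo_get(oid_number) FROM t_file WHERE name = '/tmp/file.txt';
SELECT lo_export(oid_number, '/tmp/file_copy.txt') FROM t_file WHERE name = '/tmp/file.txt';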

However, let's not worry about just a single file. Let's import it a million times, and see what happens:


test=# INSERT INTO t_file SELECT '/tmp/file.txt', lo_import('/tmp/file.txt')
          FROM generate_series(1, 1000000);
INSERT 0 1000000
test=# TABLE t_file LIMIT 10;
          name | oid_numb
[...]
Posted by Bruce Momjian in EDB on 2021-04-12 at 16:00

Postgres makes extensive use of operating system shared memory, and I have already written a presentation about it. The Postgres documentation gives specific instructions on how to determine the amount of shared memory allocated, specifically for sizing huge pages.

However, exactly what is inside Postgres shared memory was somewhat of a mystery to end users until Postgres 13 added pg_shmem_allocations, which gives a detailed list of the uses and sizes of allocated shared memory:

SELECT *, pg_size_pretty(allocated_size) FROM  pg_shmem_allocations ORDER BY size DESC;
 
                name                 |    off     |    size    | allocated_size | pg_size_pretty 
-------------------------------------+------------+------------+----------------+----------------
 Buffer Blocks                       |   69908224 | 6442450944 |     6442450944 | 6144 MB
 Buffer Descriptors                  |   19576576 |   50331648 |       50331648 | 48 MB
 <anonymous>                         |     (null) |   43655168 |       43655168 | 42 MB
 Buffer IO Locks                     | 6512359168 |   25165824 |       25165824 | 24 MB
 Checkpointer Data                   | 6597411712 |   18874432 |       18874496 | 18 MB
 XLOG Ctl                            |      53888 |   16803472 |       16803584 | 16 MB
 Checkpoint BufferIds                | 6537524992 |   15728640 |       15728640 | 15 MB
 Xact                                |   16857856 |    2116320 |        2116352 | 2067 kB
 (null)                              | 6616370560 |    1921664 |        1921664 | 1877 kB
 Subtrans                            |   19107968 |     267008 |         267008 | 261 kB
 CommitTs                            |   18974208 |     133568 |         133632 | 131 kB
 MultiXactMember                     |   19441792 |     133568 |         133632 | 131 kB
 Serial                              | 6596798464 |     133568 |         133632 | 131 kB
 Backend Activity Buffer             | 6597153664 |     132096 |         132096 | 129 kB
 shmInvalBuffer        
[...]
Posted by Andreas 'ads' Scherbaum on 2021-04-12 at 14:00
PostgreSQL Person of the Week Interview with Devrim Gündüz: I’m a community person. Ex-Turkey, now Londoner. Started using Linux in 1996, and PostgreSQL in 1998. Love to travel, listen to music, and enjoy life with friends. Started working around 2000. Joined PostgreSQL community around 1999. Working for EnterpriseDB since 2010, and also running two PostgreSQL companies in Turkey and in the UK.

One of the most critical topics to understand when administering a PostgreSQL database is the concept of transaction IDs (TXID) and the fact that they can be exhausted if not monitored properly. However, this blog post isn’t going to go into the details of what TXID exhaustion actually is. The Routine Vacuuming section of the documentation is probably one of the most important to read and understand, so I will refer you there. What this blog post is going to cover is an easy way to monitor for it and what can be done to prevent it ever being a problem.

Monitoring

Most people initially aware of the issue consider the TXID wraparound itself to be the problem they’re monitoring for, but it’s technically the exhaustion of the TXIDs that’s the real issue. PostgreSQL is technically capable of handling things just fine if the TXID value wraps around. However, reaching the wraparound point while the TXIDs are close to being used up is the real danger, and that is why wraparound itself is treated as such a cause for concern.

The following query is the one that we use at Crunchy Data in our pgMonitor tool to provide very simple data points to trend and alert on.

WITH max_age AS ( 
    SELECT 2000000000 as max_old_xid
        , setting AS autovacuum_freeze_max_age 
        FROM pg_catalog.pg_settings 
        WHERE name = 'autovacuum_freeze_max_age' )
, per_database_stats AS ( 
    SELECT datname
        , m.max_old_xid::int
        , m.autovacuum_freeze_max_age::int
        , age(d.datfrozenxid) AS oldest_current_xid 
    FROM pg_catalog.pg_database d 
    JOIN max_age m ON (true) 
    WHERE d.datallowconn ) 
SELECT max(oldest_current_xid) AS oldest_current_xid
    , max(ROUND(100*(oldest_current_xid/max_old_xid::float))) AS percent_towards_wraparound
    , max(ROUND(100*(oldest_current_xid/autovacuum_freeze_max_age::float))) AS percent_towards_emergency_autovac 
FROM per_database_stats

The percent_towards_wraparound metric is the one for which it is really critical to have an alert set up. Since it is using the age() function to determine th

[...]

Last month we released Citus 10 and we’ve received an overwhelming amount of positive feedback on the new columnar compression and single node Citus features, as well as the news that we’ve open sourced the shard rebalancer.

The new and exciting Citus 10 features are bringing in lots of new users of Citus open source and the managed Hyperscale (Citus) option in Azure Database for PostgreSQL. And many of you are asking:

“How exactly does Citus scale out Postgres?”

How does Citus work?

As it happens, I recently gave a talk as part of the Vaccination Database Tech Talks series at Carnegie Mellon University, organized by Andy Pavlo. If you missed my talk—titled “Citus: Distributed PostgreSQL as an Extension”—the good news is that the CMU team recorded it. The vaccination database tech talk on Citus is now available for you to watch online.

My goal in this talk was to explain how Citus turns PostgreSQL into a distributed database—without changing any Postgres code, by leveraging the Postgres extension APIs.

The talk may contain traces of C code because my primary audience was database students, but I used diagrams in my slides to keep the distributed database concepts accessible for everyone. My talk also covers the workloads and the broader context for which the Citus extension to Postgres was developed.

Key moments in my Vaccination Database Tech Talk at CMU

By watching this video about Citus and Postgres, you can learn about:

  • When to use Citus to distribute Postgres: Not every Postgres workload necessarily benefits from the ability to scale. And not every workload that needs to scale out has any business with Postgres. In this bit, I cover 4 of the common workloads where it makes sense to use Citus. [watch at 6:54]
  • Using the PostgreSQL extension APIs: What it means to be a PostgreSQL extension and how you can use the extension APIs to create brand new capabilities without changing any Postgres code? [watch at 12:24]
  • Transparent sharding: How Ci
[...]
Posted by Avinash Vallarapu in MigOps on 2021-04-09 at 23:52

A reliable backup tool with the ability to perform point in time recovery (PITR) is a common enterprise need. Postgres includes a backup tool called pg_basebackup for full backups. It is great for a database of a few hundred GB in size, but it currently lacks support for parallel, incremental and differential backups. This makes it slightly challenging for bigger Postgres databases that are terabytes in size. pgBackRest is our savior in such cases. While migrating from Oracle to Postgres, users find pgBackRest to be an Oracle RMAN equivalent for PostgreSQL, with some additional features found only in pgBackRest. This is because pgBackRest supports parallel, incremental and differential backups, including support for backups to the cloud. We will discuss many other features of pgBackRest that make it the best backup tool for PostgreSQL.

Through this article we would firstly like to thank the very active pgBackRest community, who are of course contributors to PostgreSQL too. In one of our recent experiences, we filed a bug about pgBackRest backups resulting in a segmentation fault when thousands of tablespaces are involved (which is highly unusual for Postgres, as tablespaces serve a different purpose than they do in Oracle).

David Steele from Crunchy Data and the community maintaining pgBackRest were very prompt and submitted a patch for this unique feature request in no time. Adding such features may be pretty common for the community. But our customers saved a lot of time and thousands of dollars that would otherwise have gone into building a backup tool supporting their unique requirement. This is an example of how open source and the PostgreSQL community work: transparent communication and discussion make it easier for a user to understand whether a feature request sounds reasonable and whether it can be targeted for a future release. A user gets the opportunity to discuss a feature, contribute to a feature, or test and provide feedback on a feature. This is where open source always wins.

By the way, before

[...]

The pgBackRest team is pleased to announce the introduction of multiple repository support in v2.33. Backups already provide redundancy by creating an offline copy of your PostgreSQL cluster that can be used in disaster recovery. Multiple repositories allow you to have copies of your backups and WAL archives in separate locations to increase your redundancy and provide even more protection for your data. This feature is the culmination of many months of hard work, so let's delve into why we think multiple repositories are so important and how they can help preserve your data.

Posted by Bruce Momjian in EDB on 2021-04-09 at 13:30

I have previously blogged about scaling of read-only and read/write workloads. Inspired by this email thread, I want to be more specific about exactly what scaling you can expect by using read-only replicas. First, as the email poster pointed out, read replicas still have to replay write transactions from the primary server. While the replica servers don't need to execute the write queries when using binary replication, they still need to perform the heap/index writes that the primary performs. (Logical replication does add significant CPU overhead.)

Let's look at rough numbers of the load on binary-replication replicas. Suppose the write load on the primary is 30% — that leaves 70% of the I/O for read-only queries, and even more than 70% of the CPU capacity since binary replication replay uses much less CPU than running the actual write queries. If there are three replicas, that means you have 210% of the primary's read-query capacity available on the replicas.

However, if you have a 75% write load on the primary, you only have an extra 25% capacity on the replicas, so your scaling ability is much more limited — it would take four replicas just to double the read capacity of the primary. This explains why high write workloads only effectively scale read-only processing using sharding.

A hierarchical query is built upon a parent-child relationship that exists within the same table or view. The relationship dictates that each child can have one parent, while a parent can have many children. A hierarchical query is a SQL query that handles data following this hierarchical model, e.g. an organisation structure where every employee has one manager, and a manager who is also an employee can have many employees reporting to them; another example is a family tree, where a person has one parent but can have many children. There are many examples where the data contains a parent-child relationship, i.e. a hierarchical model, and traversing a self-referencing table using standard SQL constructs can be a challenging and difficult task. Simple queries over hierarchical data in the same table or view are possible using table joins etc., but as the relationship becomes deeper, it is difficult to traverse the data to get the expected results.

All major database engines, including PostgreSQL, Oracle, DB2 and MS SQL, provide support for querying this kind of data. The hierarchical query constructs were one of the first features added to Oracle more than 20 years ago. PostgreSQL handles this type of data with the WITH queries clause (common table expressions); using WITH queries along with the optional RECURSIVE clause, a query can refer to its own output. A CTE can also be thought of as defining a temporary table that exists just for one query.

In this blog we will be using an organisation structure in order to demonstrate the usage of hierarchical queries. The blog will show the usage of Hierarchical queries feature in Oracle and then demonstrate how we can use the CTE with Recursive keyword to achieve similar results in PostgreSQL. We will also discuss the ongoing work in the PostgreSQL community for improving the recursive self-referencing queries. We will be using the well known (EMP) table (as shown below) that contains a hierarchical typ

[...]
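As a taste of the PostgreSQL side, here is a minimal sketch of my own, assuming the classic emp table with empno, ename and mgr columns rather than the exact schema from the post:

WITH RECURSIVE org AS (
    SELECT empno, ename, mgr, 1 AS level
    FROM emp
    WHERE mgr IS NULL                 -- start at the top of the hierarchy
  UNION ALL
    SELECT e.empno, e.ename, e.mgr, o.level + 1
    FROM emp e
    JOIN org o ON e.mgr = o.empno     -- walk down to each employee's reports
)
SELECT * FROM org ORDER BY level, empno;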

Visualizing spatial data is an important task. In this mini-tutorial, you will learn how to quickly visualize OpenStreetMap (OSM) data with PostGIS and QGIS. The idea is to get you started as fast as possible, and to make the most out of your data.

Installing QGIS

QGIS is freely available as Open Source software and can be used free of charge. It is a powerful tool to edit and visualize spatial data which is ideally stored in a PostGIS-enabled PostgreSQL database.

First, download QGIS from the official website.

getting started with qgis: Download

Once the package has been downloaded, you can simply follow the installation instructions. The following gallery shows how this works. Installing QGIS should be relatively easy.

Deploying PostgreSQL and loading OSM data

The next thing we have to do is to install PostgreSQL and to load OpenStreetMap (OSM) data. There are various ways to load OSM data. Many people use osm2pgsql, which has become pretty popular over the years. If you want to learn more about loading OSM data into a PostgreSQL database, please check out Florian Nadler’s excellent OSM / PostgreSQL tutorial dealing with this issue.

Since we want to primarily focus on QGIS-related questions here, I have used Scalefield to create a small PostgreSQL instance and to load OSM data directly. Scalefield comes with a built-in mechanism to populate data sets directly during the provisioning process of your PostgreSQL service, as shown in the next gallery.
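If you are provisioning PostgreSQL yourself instead, the only database-side prerequisite before loading OSM data is the PostGIS extension (a quick sketch; the osm2pgsql command line itself is out of scope here):

CREATE EXTENSION IF NOT EXISTS postgis;
SELECT postgis_full_version();   -- confirm PostGIS is installed and which version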

Creating maps with QGIS

Make sure that you have deployed your desired OpenStreetMap data. In my example, I simply loaded OSM data for Andorra because it is the quickest way to get started. However, any other data set will serve the same purpose.

If you want to use QGIS to visualize your spatial data, the first thing is to define the database connection you want to use. Click on “Create a New PostGIS connection” and fill out the connection data in the form which pops up.

connect QGIS and PostGIS

 

The beauty here is that QGIS allows

[...]
Posted by Bruce Momjian in EDB on 2021-04-07 at 14:45

In the 1990's, most server operating systems were closed-source and produced by a few billion-dollar companies. In recent years, Red Hat Enterprise Linux (RHEL) became the default enterprise server operating system, with CentOS filling the need for installations on smaller platforms and in the cloud. (CentOS was sometimes the only operating system used.) With the move of CentOS away from being a stable release tracking RHEL, there is renewed interest in other operating systems.

There are many viable operating system choices, and with so many choices, selection can be difficult. When choosing an operating system to host a database like Postgres, there are some important criteria:

  • Does the operating system have a server focus, or desktop/laptop/mobile focus?
  • How easy is it to administer, and does your staff have the appropriate skills?
  • What support options are available? What is the operating system's support life span?
  • Are upgrades easily handled? How often do operating system upgrades require rebooting?
  • Is the operating system easily installed on developer workstations, laptops, and smaller devices?
  • Is there sufficient hardware support? Is performance acceptable?
  • Is cloud deployment easy?

Once you deploy a database like Postgres on an operating system, changing operating system can require significant testing and disruption, so it is wise to choose the right operating system at the start. While I didn't mention the operating system's license before, the license can often influence the items listed above, like easy installation on smaller hardware and in the cloud, and multiple support providers. Debian, Ubuntu, FreeBSD, Fedora, SUSE, Rocky Linux, and others are all going to have different strengths. I think the above bulleted list is a good start in evaluating those options. Fortunately, Postgres is supported by all popular operating systems, so Postgres should not be a limiting factor in choosing an operating system.

Join us virtually on April 13, 2021 at 12:00PM,

Muffy Barkocy presents, “Tidying Up Your Data”.

Note: You will need to sign up via SFPUG’s Meetup announcement in order to get the link to join the virtual event.

Tidy up your relational databases by moving old data to alternative storage solutions and keeping live data live. A thought experiment based on Instacart’s process for creation, fulfillment, and maintaining history of orders.

Muffy has worked as a software engineer for more than 30 years. She started with packaged software for database interfaces, but latched onto Web development from the outset, excited by how it spoke to her vision of computers as a way to connect people around the world and change their lives. She has worked at Instacart since 2015, where she enjoys building tools for everyone, mentoring other engineers, and helping non-engineers learn more about technical topics.

Schedule:
12:00 PM Announcements, speaker introduction, presentation, Q&A.
1:15 PM Closing announcements, wrap up.
1:30 PM Event ends.

Link and password for the GoToMeeting room will be sent the evening before the event.