Before PostgreSQL 10, Postgres users partitioned their data using inheritance-based partitioning. That method used constraints to define the partitions and rules or triggers to route data to the appropriate partition, and a user had to write and maintain code for all of that. PostgreSQL 10 introduced declarative partitioning, which is much easier to set up and requires almost no maintenance. PostgreSQL 11 adds a number of partitioning-related enhancements that work with declarative partitioning. Users who have implemented inheritance-based partitioning will want to move to declarative partitioning (after upgrading to v11, of course) to benefit from those features. Here's how they can do so.
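
For comparison, the declarative equivalent of the range-partitioned setup shown below looks roughly like this (a sketch modelled on the documentation example, not the migration itself; the partition bounds are illustrative):

CREATE TABLE measurement (
    city_id   integer NOT NULL,
    logdate   date    NOT NULL,
    peaktemp  integer,
    unitsales integer
) PARTITION BY RANGE (logdate);

CREATE TABLE measurement_y2006m02 PARTITION OF measurement
    FOR VALUES FROM ('2006-02-01') TO ('2006-03-01');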

Example setup

You may have created a parent table and several child tables, one per partition, and triggers, rules and constraints as required. Here's an example setup similar to the one described in PostgreSQL documentation.

\d+ measurement
                                Table "inh_part.measurement"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 city_id   | integer |           | not null |         | plain   |              | 
 logdate   | date    |           | not null |         | plain   |              | 
 peaktemp  | integer |           |          |         | plain   |              | 
 unitsales | integer |           |          |         | plain   |              | 
Child tables: measurement_y2006m02,
              measurement_y2006m03,
              measurement_y2006m04,
              measurement_y2006m05,
              measurement_y2006m06

-- here's what a child table looks like
\d+ measurement_y2006m03
                            Table "inh_part.measurement_y2006m03"
  Column   |  Type   | Collation | Nullable | Default | Storage | Stats target | Description 
-----------+---------+-----------+----------+---------+---------+--------------+-------------
 city_id   | integer |           | n[...]

Cost based optimization

A query may be executed in many different ways, modelled as plans in the query optimizer, differing in the resources required and/or execution time. A typical DBMS's query optimizer tries to find all the possible plans for executing a given query and chooses the fastest among them. But it is not possible to calculate the time required by a plan without actually executing the query. Thus an optimizer tries to associate an estimate of execution time with each possible plan and chooses the one with the least estimated value. PostgreSQL is no different: it associates a cost with each possible plan. The cost is a rough estimate of the time required to execute the query, and the plan with the lowest cost is chosen for execution.

The time required to execute a query is the sum of the time required to perform the various operations involved in the plan, e.g. the time required to scan the tables in the query, the time required to compute joins, etc. Thus a plan's cost is the sum of the costs of the operations involved in it. To estimate the cost of a plan efficiently and correctly, PostgreSQL maintains statistics about the sizes of tables and indexes, the values stored in various columns of tables, and so on. In a DBMS, where data keeps changing, these statistics often get stale and need to be updated. PostgreSQL keeps the statistics up to date by frequently sampling the tables. This works reasonably well as long as the tables involved in the query are part of the DBMS.

But nowadays applications often run queries which require data external to the DBMS. PostgreSQL supports querying external data through a Foreign Data Wrapper (FDW for short), a method based on the SQL/MED standard. In this post we will discuss the methods the query optimizer employs to plan such queries, and the methods available to maintain statistics about the external data, especially data residing on other PostgreSQL servers.

Foreign tables and statistics

PostgreSQL allows external data to be represented as "foreign tables". While PostgreSQL scans t[...]
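
For context, this is roughly how external PostgreSQL data is exposed as a foreign table and how local statistics about it can be gathered (a sketch; the server, credentials, column list and table names are assumptions):

CREATE EXTENSION postgres_fdw;

CREATE SERVER remote_pg FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'remote.example.com', dbname 'appdb');

CREATE USER MAPPING FOR CURRENT_USER SERVER remote_pg
    OPTIONS (user 'app', password 'secret');

CREATE FOREIGN TABLE remote_measurement (
    city_id integer,
    logdate date
) SERVER remote_pg OPTIONS (schema_name 'public', table_name 'measurement');

-- collect local statistics about the remote data for the planner to use
ANALYZE remote_measurement;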

PostgreSQL is one of the most popular open-source databases in the world, with successful implementations in mission-critical environments across various domains, including real-time, high-end OLTP applications performing millions and billions of transactions per day. PostgreSQL I/O is quite reliable, stable and performant on pretty much any hardware, including cloud environments.

To ensure that databases perform at the expected scale with the expected response times, some performance engineering is needed. Good database performance depends on various factors, and it can degrade for many reasons, such as poor infrastructure dimensioning, an inefficient database maintenance strategy, poor SQL code, or badly configured database processes that fail to utilize the available resources - CPU, memory, network bandwidth and disk I/O.

What can cause database performance to degrade?

  • Badly written queries with bad joins, logic etc. that take a lot of CPU and memory
  • Queries performing full-table scans on big tables due to improper indexing (a way to spot these is sketched just after this list)
  • Bad database maintenance with no proper statistics in place
  • Inefficient capacity planning resulting in inadequately dimensioned infrastructure
  • Improper logical and physical design
  • No connection pooling in place, which causes applications to make a huge number of connections in an uncontrolled manner
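
As a quick illustration of the full-table-scan point above, the cumulative statistics views can show which tables are read mostly by sequential scans (an illustrative query, not taken from the article):

SELECT relname, seq_scan, idx_scan, n_live_tup
FROM pg_stat_user_tables
WHERE seq_scan > coalesce(idx_scan, 0)
ORDER BY seq_scan DESC
LIMIT 10;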

So that's a lot of potential areas which can cause performance problems. One of the significant areas I would like to focus on in this blog is how to tune PostgreSQL I/O (Input/Output) performance. Tuning the I/O operations of PostgreSQL is essential, especially in a highly transactional environment like OLTP or in a data warehousing environment with complex analysis on huge data sets.

Most of the time, database performance problems are caused by high I/O. This means database processes are spending more time either writing to or reading from the disk. Since any real-time data operation is I/O bound, it is imperative to

[...]
Posted by Paul Ramsey in PostGIS on 2018-06-21 at 20:00

One of the joys of geospatial processing is the variety of tools in the tool box, and the ways that putting them together can yield surprising results. I have been in the guts of PostGIS for so long that I tend to think in terms of primitives: either there's a function that does what you want or there isn't. I'm too quick to ignore the power of combining the parts that we already have.

A community member on the users list asked (paraphrased): “is there a way to split a polygon into sub-polygons of more-or-less equal areas?”

I didn’t see the question and answer, which is lucky, because I would have said: “No, you’re SOL, we don’t have a good way to solve that problem.” (An exact algorithm showed up in the Twitter thread about this solution, and maybe I should implement that.)

PostGIS developer Darafei Praliaskouski did answer, and provided a working solution that is absolutely brilliant in combining the parts of the PostGIS toolkit to solve a pretty tricky problem. He said:

The way I see it, for any kind of polygon:

  • Convert a polygon to a set of points proportional to the area by ST_GeneratePoints (the more points, the more beautiful it will be, guess 1000 is ok);
  • Decide how many parts you’d like to split into, (ST_Area(geom)/max_area), let it be K;
  • Take KMeans of the point cloud with K clusters;
  • For each cluster, take a ST_Centroid(ST_Collect(point));
  • Feed these centroids into ST_VoronoiPolygons, that will get you a mask for each part of polygon;
  • ST_Intersection of original polygon and each cell of Voronoi polygons will get you a good split of your polygon into K parts.
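
Condensed into a single query, that recipe reads roughly as follows (a sketch of mine, not Darafei's exact SQL; it assumes the peru table created below and uses a fixed K of 10):

WITH pts AS (
  SELECT (ST_Dump(ST_GeneratePoints(geom, 1000))).geom AS geom
  FROM peru
),
clustered AS (
  SELECT geom, ST_ClusterKMeans(geom, 10) OVER () AS cid
  FROM pts
),
centroids AS (
  SELECT ST_Centroid(ST_Collect(geom)) AS geom
  FROM clustered
  GROUP BY cid
),
cells AS (
  SELECT (ST_Dump(ST_VoronoiPolygons(ST_Collect(geom)))).geom AS geom
  FROM centroids
)
SELECT ST_Intersection(p.geom, c.geom) AS geom
FROM peru p
CROSS JOIN cells c;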

Let’s take it one step at a time to see how it works.

We’ll use Peru as the example polygon, it’s got a nice concavity to it which makes it a little trickier than an average shape.

CREATE TABLE peru AS 
  SELECT *
  FROM countries
  WHERE name = 'Peru';

Original Polygon (Peru)

Now create a point field that fills the polygon. On average, each randomly placed point ends up “occupying” an equal area within the polygon.

CREATE TABLE peru_pts AS
  SELECT (ST_Dump(ST_Ge
[...]
Posted by Craig Kerstiens in CitusData on 2018-06-21 at 17:15

In our previous Fun with SQL post on the Citus Data blog, we covered window functions. Window functions are a special class of function that allow you to grab values across rows and then perform some logic. By jumping ahead to window functions, we missed so many of the other handy functions that exist within Postgres natively. There are in fact several hundred built-in functions. And when needed, you can also create your own user defined functions (UDFs), if you need something custom. Today we’re going to walk through just a small sampling of SQL functions that can be extremely handy in PostgreSQL.

Arrays

First, arrays are a first-class datatype within Postgres. You can have an array of text or an array of numbers; personally, I love using arrays when dealing with category tags. You can also index arrays, which can make querying extremely fast. But even if you're not putting arrays directly into your database, you may want to build up arrays within your query.
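
As a quick illustration of the tagging use case (the table and column names here are mine, not from the article), an array column can be indexed with GIN and queried with the containment operator:

CREATE TABLE posts (
    id   bigserial PRIMARY KEY,
    body text,
    tags text[]
);

CREATE INDEX posts_tags_idx ON posts USING GIN (tags);

-- containment queries like this one can use the GIN index
SELECT id FROM posts WHERE tags @> ARRAY['postgres'];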

A good example might be when you have a team feature within your application, and you want to group all members that belong to that team into a single row. To do this, we can use the array_agg function which will aggregate some column together into a single row:

SELECT teams.id, 
       teams.name, 
       array_agg(users.email)
FROM teams,
     users
WHERE users.team_id = teams.id
GROUP BY 1, 2;
 id | name  |                          array_agg
----+-------+--------------------------------------------------------------
  2 | ACME  | {jennifer@acmecorp.com,tom@acmecorp.com,peyton@acmecorp.com}
  1 | Citus | {craig@citusdata.com,farina@citusdata.com}
(2 rows)

The above by itself is pretty handy, but we can go even further. If we want to send this directly to some email client, we can go ahead and parse out the array to just be a comma separated list by wrapping our array_agg function with a function to convert it to a string array_to_string:

SELECT teams.id, 
       teams.name, 
       array_to_string(array_agg(users.email), ', ')
FROM teams,
     users
WHE
[...]
Posted by Andrew Dunstan in 2ndQuadrant on 2018-06-21 at 15:04

Recently I have been refining and adding utilities to look after our Perl code.  You might be surprised to learn that as well as 1.3 million or so lines of C code, there are about 30,000 lines of Perl code in our sources. This is a sizeable body of code, even if it's dwarfed by our C code. What does it do? Well, lots of things. It runs some very critical code in building from source; for example, the code to set up our catalogs is created by some Perl code.  All the new data setup for catalogs is in fact Perl code. That's another 20,000 lines or so of code on top of the 30,000 mentioned above. We also use Perl to run TAP tests, such as testing initdb and pg_dump. And it runs building and testing when we're building with the Microsoft tool-sets on Windows.

So, what changes have been made? First, we’ve refined slightly the setup for pgperltidy, our utility for formatting perl code. This utility, based on a well known perl utility, does for perl code what pgindent does for C code.

Second, we've added a script and a profile to run a utility called perlcritic. This utility checks to see if perl code complies with a set of "best practices".  Currently we're only testing for the "worst" practices, but I hope in future to be able to check in a much stricter way. The infrastructure is now there to support it.

Finally, there have been some code adjustments to allow us to check all the perl files for compile time errors and warnings, and a utility to run those checks.

These changes mirror some changes I have made in the buildfarm client and server code.

It's easy to forget about these things, so I've also added a Buildfarm module to run the check on perl. If it finds any policy violations or compile-time errors or warnings, we'll soon know about it.

Posted by Robert Haas in EnterpriseDB on 2018-06-21 at 14:47
I admit it: I invented force_parallel_mode.  I believed then, and still believe now, that it is valuable for testing purposes.  Certainly, testing using force_parallel_mode=on or force_parallel_mode=regress has uncovered many bugs in PostgreSQL's parallel query support that would otherwise have been very difficult to find.  At the same time, it's pretty clear that this setting has caused enormous confusion, even among PostgreSQL experts.  In fact, in my experience, almost everyone who sets force_parallel_mode is doing so for the wrong reasons.
Read more »
Posted by Federico Campoli on 2018-06-21 at 00:00

The ansible is a fictional device made popular by the book Ender's Game, written by Orson Scott Card. The ansible allows instant communication between two points across space, regardless of the distance.

It is no surprise that Red Hat's Ansible shares the same name, as it gives a simple and efficient way to manage servers and automate tasks while remaining very lightweight, with minimal requirements on the target machines.

Businesses and services deliver value based on data. Availability, consistent state, and durability are top priorities for keeping customers and end-users satisfied. Lost or inaccessible data could possibly equate to lost customers.

Database backups should be at the forefront of daily operations and tasks.

We should be prepared for the event that our data becomes corrupted or lost.

I'm a firm believer in an old saying I’ve heard: "It's better to have it and not need it than to need it and not have it."

That applies to database backups as well. Let's face it, without them, you basically have nothing. Operating on the notion that nothing can happen to your data is a fallacy.

Most DBMS's provide some means of built-in backup utilities. PostgreSQL has pg_dump and pg_dumpall out of the box.

Both present numerous customization and structuring options. Covering them all individually in one blog post would be next to impossible. Instead, I'll look at those examples I can apply best, to my personal development/learning environment.

That being said, this blog post is not targeted at a production environment. More likely, a single workstation/development environment should benefit the most.

What are pg_dump and pg_dumpall?

The documentation describes pg_dump as: “pg_dump is a utility for backing up a PostgreSQL database”

And the pg_dumpall documentation: “pg_dumpall is a utility for writing out (“dumping”) all PostgreSQL databases of a cluster into one script file.”

Backing up a Database and/or Table(s)

To start, I'll create a practice database and some tables to work with using the below SQL:

postgres=# CREATE DATABASE example_backups;
CREATE DATABASE
postgres=# \c example_backups
example_backups=# CREATE TABLE students(id INTEGER,
example_backups(# f_name VARCHAR(20),
example_backups(# l_name VARCHAR(20));
CREATE TABLE
example_backups=# CREATE TABLE classes(id INTEGER,
example_backups(# subject VARCHAR(20));
CREATE TABLE
example_backups=# INSERT INTO students(id, f_name, l_name)
example_backups-# VALUES (1, 'John', 'Thorn'), (2, 'Phil', 'Hampt'),
ex
[...]
Posted by REGINA OBE in PostGIS on 2018-06-18 at 21:42

PostGIS 2.5 is just around the corner. One of the new functions coming is ST_OrientedEnvelope. This is something I've been waiting for for years. It is the true minimum bounding rectangle, as opposed to ST_Envelope, which is an axis-aligned bounding rectangle.

Below is a pictorial view showing the difference between the two.

Continue reading "Coming PostGIS 2.5 ST_OrientedEnvelope"

Performance is one of the most important and most complex tasks when managing a database. It can be affected by the configuration, the hardware or even the design of the system. By default, PostgreSQL is configured with compatibility and stability in mind, since the performance depends a lot on the hardware and on our system itself. We can have a system with a lot of data being read but the information does not change frequently. Or we can have a system that writes continuously. For this reason, it is impossible to define a default configuration that works for all types of workloads.

In this blog, we will see how to analyze the workload, or queries, that are running, and then review some basic configuration parameters to improve the performance of our PostgreSQL database. As mentioned, the list of PostgreSQL parameters is extensive, so we will only touch on some of the key ones. However, one can always consult the official documentation to delve into the parameters and configurations that seem most important or useful in a given environment.

EXPLAIN

One of the first steps we can take to understand how to improve the performance of our database is to analyze the queries that are made.

PostgreSQL devises a query plan for each query it receives. To see this plan, we will use EXPLAIN.

The structure of a query plan is a tree of plan nodes. The nodes in the lower level of the tree are scan nodes. They return raw rows from a table. There are different types of scan nodes for different methods of accessing the table. The EXPLAIN output has a line for each node in the plan tree.

world=# EXPLAIN SELECT * FROM city t1,country t2 WHERE id>100 AND t1.population>700000 AND t2.population<7000000;
                               QUERY PLAN                                
--------------------------------------------------------------------------
Nested Loop  (cost=0.00..734.81 rows=50662 width=144)
  ->  Seq Scan on city t1  (cost=0.00..93.19 rows=347 width=31)
        Fil
[...]

When people talk about database performance monitoring, they usually think of inspecting one PostgreSQL database server at a time. While this is certainly useful, it can also be quite beneficial to inspect the status of an entire database cluster, or a set of servers working together, at once. Fortunately there are easy means to achieve that with PostgreSQL; this post outlines how it works.

pg_stat_statements: The best tool to monitor PostgreSQL performance

If you want to take a deep look at PostgreSQL performance, there is really no way around pg_stat_statements. It offers a lot of information and is really easy to use.

To install pg_stat_statements, the following steps are necessary:

  • run “CREATE EXTENSION pg_stat_statements” in your desired database
  • add the following line to postgresql.conf:
    • shared_preload_libraries = ‘pg_stat_statements’
  • restart PostgreSQL

Once this is done, PostgreSQL will already be busy collecting data on your database hosts. However, how can we create a “clusterwide pg_stat_statements” view so that we can inspect an entire set of servers at once?

Using pg_stat_statements to check an entire database cluster

Our goal is to show data from a list of servers in a single view. One way to do that is to make use of PostgreSQL’s foreign data wrapper infrastructure. We can simply connect to all servers in the cluster and unify the data in a single view.
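
In outline, the trick looks like this (a condensed sketch for a single remote server, using the monitoring role created below; the column list is trimmed and the remote schema is assumed to be public):

CREATE EXTENSION postgres_fdw;

CREATE SERVER a_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'a_server', dbname 'postgres');

CREATE USER MAPPING FOR CURRENT_USER SERVER a_server
    OPTIONS (user 'dbmonitoring', password 'abcd');

CREATE FOREIGN TABLE pg_stat_statements_a (
    queryid    bigint,
    query      text,
    calls      bigint,
    total_time double precision
) SERVER a_server OPTIONS (schema_name 'public', table_name 'pg_stat_statements');

CREATE VIEW pg_stat_statements_cluster AS
    SELECT 'local' AS origin, queryid, query, calls, total_time
    FROM pg_stat_statements
    UNION ALL
    SELECT 'a_server', queryid, query, calls, total_time
    FROM pg_stat_statements_a;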

Let us assume we have 3 servers, a local machine, “a_server”, and “b_server”. Let us get started by connecting to the local server to run the following commands:

CREATE USER dbmonitoring LOGIN PASSWORD 'abcd' SUPERUSER;
GRANT USAGE ON SCHEMA pg_catalog TO dbmonitoring;
GRANT ALL ON pg_stat_statements TO dbmonitoring;

In the first step I created a simple user to do the database monitoring. Of course you can handle users and so on differently but it seems like an attractive idea to use a special user for that purpose.

The next command enables the postgres_fdw extension, which is necessary to connect to those remote serve

[...]
A while back, I posted an article titled "More load balancing fine control" to explain a new feature of the upcoming Pgpool-II 4.0. Today I would like to talk about yet another new load balancing feature in Pgpool-II.

Pgpool-II is already able to control read query load balancing in several granularity:
(See the document for more details).
  •  Whether to enable load balancing or not (load_balance_mode)
  • By database (database_redirect_preference_list)
  • By application name (app_name_redirect_preference_list)
  • By functions used in the query (white_function_list, black_function_list)
  • Statement level (/*NO LOAD BALANCE*/ comment)
 The last method allows statement-level control of load balancing, but it requires rewriting the query, which is often impossible if you are using commercial software.

A new configuration parameter called "black_query_pattern_list" allows you to disable load balancing for queries specified by the parameter. The parameter takes a regular expression string. If a query matched with the expression, load balancing is disabled for the query: the query is sent to the primary (master) PostgreSQL server.

Here are some examples.

We have:
black_query_pattern_list = 'SELECT \* FROM t1\;;'
in pgpool.conf. Note that some special characters must be escaped using '\' (backslash) characters.

So far, no SELECTs have been issued to either the primary or the standby.

test=# show pool_nodes;
 node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change 
---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /tmp     | 11002 | up     | 0.500000  | primary | 0          | false             | 0                 | 2018-06-15 11:57:25
 1       | /tmp     | 11003 | up     | 0.500000  | standby | 0          | true              | 0                 | 2018-06-15 11:57:25
(2 rows)


If the following query is issued, then the "select_cnt" column of the standby should be incremented since standb[...]

Many companies generate large volumes of time series data from events happening in their application. It’s often useful to have a real-time analytics dashboard to spot trends and changes as they happen. You can build a real-time analytics dashboard on Postgres by constructing a simple pipeline:

  1. Load events into a raw data table in batches
  2. Periodically aggregate new events into a rollup table
  3. Select from the rollup table in the dashboard

For large data streams, Citus (an open source extension to Postgres that scales out Postgres horizontally) can scale out each of these steps across all the cores in a cluster of Postgres nodes.

One of the challenges of maintaining a rollup table is tracking which events have already been aggregated—so you can make sure that each event is aggregated exactly once. A common technique to ensure exactly-once aggregation is to run the aggregation for a particular time period after that time period is over. We often recommend aggregating at the end of the time period for its simplicity, but you cannot provide any results before the time period is over and backfilling is complicated.

Building rollup tables in a new and different way

In this blog post, we’ll introduce a new approach to building rollup tables which addresses the limitations of using time windows. When you load older data into the events table, the rollup tables will automatically be updated, which enables backfilling and late arrivals. You can also start aggregating events from the current time period before the current time period is over, giving a more real time view of the data.

We assume all events have an identifier which is drawn from a sequence and provide a simple SQL function that enables you to incrementally aggregate ranges of sequence values in a safe, transactional manner.
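
A minimal single-node sketch of that idea looks like the following (table, column and function names are assumptions of mine; the function described in the post additionally has to deal with in-flight transactions whose sequence values have not committed yet):

CREATE TABLE events (
    event_id bigserial,
    ts       timestamptz,
    page     text
);

CREATE TABLE page_rollup (
    page  text PRIMARY KEY,
    views bigint NOT NULL
);

-- bookkeeping: how far aggregation has progressed
CREATE TABLE rollup_state (last_aggregated_id bigint NOT NULL);
INSERT INTO rollup_state VALUES (0);

CREATE FUNCTION do_rollup() RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    start_id bigint;
    end_id   bigint;
BEGIN
    -- lock the bookkeeping row so concurrent rollups serialize
    SELECT last_aggregated_id INTO start_id FROM rollup_state FOR UPDATE;
    SELECT coalesce(max(event_id), start_id) INTO end_id FROM events;

    -- aggregate only the new range of sequence values, exactly once
    INSERT INTO page_rollup
    SELECT page, count(*)
    FROM events
    WHERE event_id > start_id AND event_id <= end_id
    GROUP BY page
    ON CONFLICT (page) DO UPDATE SET views = page_rollup.views + EXCLUDED.views;

    UPDATE rollup_state SET last_aggregated_id = end_id;
END;
$$;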

We tested this approach for a CDN use case and found that a 4-node Citus database cluster can simultaneously:

  • Ingest and aggregate over a million rows per second
  • Keep the rollup table up-to-date within ~10s
  • Answer analytical queries in u
[...]
Posted by gabrielle roth on 2018-06-14 at 13:35

When: 6-8pm Thursday June 21, 2018
Where: iovation
Who: Mark Wong
What: Intro to OmniDB with PostgreSQL

OmniDB is an open source browser-based app designed to access and manage many different Database Management systems, e.g. PostgreSQL, Oracle and MySQL. OmniDB can run either as an App or via Browser, combining the flexibility needed for various access paths with a design that puts security first.

OmniDB's main objective is to offer a unified workspace with all the functionalities needed to manipulate different DBMSs. It is built with simplicity in mind, designed to be a fast and lightweight browser-based application.

Get a tour of OmniDB with PostgreSQL!

 

Mark leads the 2ndQuadrant performance practice as a Performance Consultant for English Speaking Territories, based out of Oregon in the USA. He is a long time Contributor to PostgreSQL, co-organizer of the Portland PostgreSQL User Group, and serves as a Director and Treasurer
for the United States PostgreSQL Association.


If you have a job posting or event you would like me to announce at the meeting, please send it along. The deadline for inclusion is 5pm the day before the meeting.

Our meeting will be held at iovation, on the 3rd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It's right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry! For access to the 3rd floor of the plaza, please either take the lobby stairs to the third floor or take the plaza elevator (near Subway and Rabbit's Cafe) to the third floor. There will be signs directing you to the meeting room. All attendees must check in at the iovation front desk.

See you there!

Posted by Andrew Dunstan in 2ndQuadrant on 2018-06-14 at 12:39

If you have Docker installed on your development machine, there is a simple way to road test your code using the buildfarm client, in a nicely contained environment.

These preparatory steps only need to be done once. First clone the repository that has the required container definitions:

git clone https://github.com/PGBuildFarm/Dockerfiles.git bf-docker

Then build a container image to run the command (in this example we use the file based on Fedora 28):

cd bf-docker
docker build --rm=true -t bf-f28 -f Dockerfile.fedora-28 .

Make a directory to contain all the build artefacts:

mkdir buildroot-f28

That’s all the preparation required. Now you can road test your code with this command:

docker run -v buildroot-f28:/app/buildroot \
  -v /path/to/postgres/source:/app/pgsrc bf-f28 \ 
  run_build.pl --config=build-fromsource.conf

The config file can be customized if required, but this is a pretty simple way to get started.

Posted by Regina Obe in PostGIS on 2018-06-14 at 07:12

Most of our use cases for the built-in json support in PostgreSQL are not about implementing schemaless storage, but about remolding data. Remolding can take the form of restructuring data into json documents suitable for web maps, javascript charting web apps, or datagrids. But it also has uses beyond just outputting data in json form: the functions are useful for unraveling json data into a more meaningful relational form.

One of the common cases where we use json support is what we call UNPIVOTING data. We demonstrated this in our Postgres Vision 2018 presentation, in slide 23. This trick won't work in other relational databases that support JSON, because it also relies on a long-existing PostgreSQL feature: the ability to treat a row as a data field.
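
The shape of the unpivoting trick, on a made-up table, is roughly this (a sketch using row_to_json and json_each_text; the real examples are in the slides):

CREATE TABLE survey_results (id int, q1 int, q2 int, q3 int);
INSERT INTO survey_results VALUES (1, 5, 3, 4);

-- treat each row as a data field and explode it into key/value pairs
SELECT t.id, j.key AS question, j.value AS answer
FROM survey_results AS t,
     json_each_text(row_to_json(t)) AS j
WHERE j.key <> 'id';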

Continue reading "Unpivoting data using JSON functions"

Internet of Things tends to generate large volumes of data at a great velocity. Often times this data is collected from geographically distributed sources and aggregated at a central location for data scientists to perform their magic i.e. find patterns, trends and make predictions.

Let's explore what the IoT Solution using Postgres-BDR has to offer for Data Analytics. Postgres-BDR is offered as an extension on top of PostgreSQL 10 and above. It is not a fork. Therefore, we get the power of the complete set of analytic functions that PostgreSQL has to offer. In particular, I am going to play around with PostgreSQL's Window Functions here to analyze a sample of time series data from temperature sensors.

Let’s take an example of IoT temperature sensor time series data spread over a period of 7 days. In a typical scenario, temperature sensors are sending readings every minute. Some cases could be even more frequent. For the sake of simplicity, however, I am going to use one reading per day. The objective is to use PostgreSQL’s Window Functions for running analytics and that would not change with increasing the frequency of data points.

Here's our sample data. Again, the number of table fields in a real-world IoT temperature sensor would be higher, but our fields of interest in this case are restricted to the timestamp of the temperature recording, the device that reported it, and the actual reading.

CREATE TABLE iot_temperature_sensor_data (
    ts timestamp without time zone,
    device_id text,
    reading float
);

and we add some random timeseries temperature sensor reading data for seven consecutive days

         ts          |            device_id             | reading 
---------------------+----------------------------------+---------
 2017-06-01 00:00:00 | ff0d1c4fd33f8429b7e3f163753a9cb0 |      10
 2017-06-02 00:00:00 | d125efa43a62af9f50c1a1edb733424d |       9
 2017-06-03 00:00:00 | 0b4ee949cc1ae588dd092a23f794177c |       1
 2017-06-04 00:00:00 | 8b9cef086e02930a808ee97b87b07b03 |       3
 2017-06-05 00:00:00 
[...]
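
To give a flavour of the kind of analysis window functions enable on this table, a moving average and a ranking of the readings could be computed like this (an illustrative query of mine, not necessarily the one used later in the article):

SELECT ts, device_id, reading,
       avg(reading) OVER (ORDER BY ts
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS moving_avg,
       rank() OVER (ORDER BY reading DESC) AS reading_rank
FROM iot_temperature_sensor_data
ORDER BY ts;
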
Posted by Pavel Stehule on 2018-06-13 at 08:52
My new article is in Czech, but Google Translate can help.
Posted by Craig Kerstiens in CitusData on 2018-06-12 at 16:44

work_mem is perhaps the most confusing setting within Postgres. work_mem is a configuration within Postgres that determines how much memory can be used during certain operations. On the surface, the work_mem setting seems simple: after all, work_mem just specifies the amount of memory available to be used by internal sort operations and hash tables before writing data to disk. And yet, leaving work_mem unconfigured can bring on a host of issues. What is perhaps more troubling, though, is when you receive an out-of-memory error on your database and you jump in to tune work_mem, only for it to behave in an unintuitive manner.

Setting your default memory

The work_mem value defaults to 4MB in Postgres, and that's likely a bit low. This means that each Postgres activity (each join, some sorts, etc.) can consume 4MB before it starts spilling to disk. When Postgres starts writing temp files to disk, obviously things will be much slower than in memory. You can find out if you're spilling to disk by searching for temporary file within your PostgreSQL logs when you have log_temp_files enabled. If you see temporary file, it can be worth increasing your work_mem.
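
A quick way to experiment with this on a test instance (values are illustrative; the SET only affects the current session):

-- check and raise work_mem for the current session only
SHOW work_mem;
SET work_mem = '64MB';

-- make spills visible: log every temporary file, then reload the configuration
ALTER SYSTEM SET log_temp_files = 0;
SELECT pg_reload_conf();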

On Citus Cloud (our fully-managed database as a service that scales out Postgres horizontally), we automatically tune work_mem based on the overall memory available to the box. Our tuning is based on the years of experience of what we’ve seen work for a variety of production Postgres workloads, coupled with statistics to compute variations based on cluster sizing.

It's tough to get the value of work_mem exactly right, but often a sane default is something like 64 MB, if you're looking for a one-size-fits-all answer.

It’s not just about the memory for queries

Let’s use an example to explore how to think about optimizing your work_mem setting.

Say you have a certain amount of memory, say 10 GB. If you have 100 running Postgres queries, and each of those queries has a 10 MB connection overhead, then 100*10 MB (1 GB) of memory is taken up by the 100 connections—whi

[...]

The first time I got paid for doing PostgreSQL work on the side, I spent most of the proceeds on the mortgage (boring, I know), but I did get myself one little treat: a boxed set of DVDs from a favorite old television show. They became part of my evening ritual, watching an episode while cleaning the kitchen before bed. The show features three military draftees, one of whom, Frank, is universally disliked. In one episode, we learn that Frank has been unexpectedly transferred away, leaving his two roommates the unenviable responsibility of collecting Frank’s belongings and sending them to his new assignment. After some grumbling, they settle into the job, and one of them picks a pair of shorts off the clothesline, saying, “One pair of shorts, perfect condition: mine,” and he throws the shorts onto his own bed. Picking up another pair, he says, “One pair of shorts. Holes, buttons missing: Frank’s.”

The other starts on the socks: “One pair of socks, perfect condition: mine. One pair socks, holes: Frank’s. You know, this is going to be a lot easier than I thought.”

“A matter of having a system,” responds the first.

I find most things go better when I have a system, as a recent query writing task made clear. It involved data from the Instituto Nacional de Estadística y Geografía, or INEGI, an organization of the Mexican government tasked with collecting and managing country-wide statistics and geographical information. The data set contained the geographic outline of each city block in Mexico City, along with demographic and statistical data for each block: total population, a numeric score representing average educational level, how much of the block had sidewalks and landscaping, whether the homes had access to the municipal sewer and water systems, etc. We wanted to display the data on a Liquid Galaxy in some meaningful way, so I loaded it all in a PostGIS database and built a simple visualization showing each city block as a polygon extruded from the earth, with the height and color of the polygon proportional to

[...]

In a previous article of mine I’ve been bitten by the “temporary function issue” (which isn’t an issue at all, of course).

I needed a way to use the now() function in different ways to do a sort of “time travel” over a history table.

I've found a few easy™ ways to accomplish the same task and have now distilled the one that seems to me to be the "best one"™. I will make use of a temporary table.

The trick is that a new function, called mynow(), will access a table without an explicit schema qualifier, relying instead on a controlled search_path. This approach opens the door to controlled table masking thanks to the temporary schema each session has. Let's see this function first.

create or replace function mynow( out ts timestamp )
language plpgsql
as $l0$
begin
  select * into ts from mytime;
  if not found then
    ts := now();
  end if;
end;
$l0$;

If you put a timestamp in the table mytime, then that timestamp will be used as the current time. If there's nothing in there, then the normal function output will be used.

First of all, you need to create an always-empty non-temporary table like this in the public schema:

create table mytime ( ts timestamp );

So the function will start working right away with the default timeline. If I inserted a timestamp straight in there, I'd set the function behavior for all sessions at once. But this isn't normally the objective.

As soon as a user needs a different reference time for her time travelling, she needs to do the following:

set search_path = pg_temp,"$user", public; -- 1st
create table mytime ( like public.mytime including all ); -- 2nd
insert into mytime values ( '2000-01-01' ); -- 3rd
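
In that session, mynow() now reports the masked timestamp, while other sessions still get the real now(). A quick check, assuming the statements above:

select mynow();  -- returns 2000-01-01 00:00:00 in this session only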

That’s it. More or less.

The first statement alters the search_path setting so the pg_temp schema becomes the first one to be searched in and, thus, the “default” schema.

The second one creates a temporary table (re-read the previous sentence, please!) "just like the public one". Please note the schema qualification used with the like predicate.

The third one will insert into that temporary table (you

[...]
A bad query plan ... (image © Laurenz Albe 2018)

We all know that you have to pay a price for a new index you create — data modifying operations will become slower, and indexes use disk space. That’s why you try to have no more indexes than you actually need.

But most people think that SELECT performance will never suffer from a new index. The worst that can happen is that the new index is not used.

However, this is not always true, as I have seen more than once in the field. I’ll show you such a case and tell you what you can do about it.

An example

We will experiment with this table:

CREATE TABLE skewed (
   sort        integer NOT NULL,
   category    integer NOT NULL,
   interesting boolean NOT NULL
);

INSERT INTO skewed
   SELECT i, i%1000, i>50000
   FROM generate_series(1, 1000000) i;

CREATE INDEX skewed_category_idx ON skewed (category);

VACUUM (ANALYZE) skewed;

We want to find the first twenty interesting rows in category 42:

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM skewed
WHERE interesting AND category = 42
ORDER BY sort
LIMIT 20;

This performs fine:

                             QUERY PLAN
--------------------------------------------------------------------
 Limit  (cost=2528.75..2528.80 rows=20 width=9)
        (actual time=4.548..4.558 rows=20 loops=1)
   Buffers: shared hit=1000 read=6
   ->  Sort  (cost=2528.75..2531.05 rows=919 width=9)
             (actual time=4.545..4.549 rows=20 loops=1)
         Sort Key: sort
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=1000 read=6
         ->  Bitmap Heap Scan on skewed
                        (cost=19.91..2504.30 rows=919 width=9)
                        (actual time=0.685..4.108 rows=950 loops=1)
               Recheck Cond: (category = 42)
               Filter: interesting
               Rows Removed by Filter: 50
               Heap Blocks: exact=1000
               Buffers: shared hit=1000 read=6
               ->  Bitmap Index Scan on skewed_category_idx
                        (cost=0.00..19.68 rows=967 width=0)
                   
[...]
Posted by Regina Obe in PostGIS on 2018-06-09 at 08:56

Leo and I attended PostgresVision 2018 which ended a couple of days ago.

We gave a talk on spatial extensions with main focus being PostGIS. Here are links to our slides PostgresVision2018_SpatialExtensions HTML version PDF.

Unfortunately there are no slides of the pgRouting part, except the one that says PGRouting Live Demos because Leo will only do live demos. He has no fear of his demos not working.

Side note: if you are on Windows and use the PostGIS bundle, all the extensions listed in the PostGIS box of the spatial extensions diagram, as well as pointcloud, pgRouting, and ogr_fdw, are included in the bundle.

Continue reading "PostgresVision 2018 Slides and Impressions"

PostgreSQL 11 Beta 1 has been out for more than a couple of weeks. The best way to experience it is to try out the new version and test drive it yourself.

Rather than building it directly from source, I nowadays take the easy way out and deploy it in the cloud. Fortunately, it is already available in Amazon RDS Database Preview Environment.



For this post I am going to use the AWS CLI, since the command line is easy to understand, easy to copy/paste, and easy to script for repetitive testing. To use the Database Preview Environment, the endpoint has to be modified to use https://rds-preview.us-east-2.amazonaws.com/ instead of the default for the region.

Because there might be multiple PostgreSQL 11 beta releases, it is important to understand which build version is being deployed. I can always leave it at the default, which typically would be the latest preferred version, but a lot of times I want to be sure of the version I am deploying. The command to get all the versions of PostgreSQL 11 is describe-db-engine-versions.

$ aws rds describe-db-engine-versions --engine postgres --db-parameter-group-family postgres11 --endpoint-url  https://rds-preview.us-east-2.amazonaws.com/ 
{
    "DBEngineVersions": [
        {
            "Engine": "postgres", 
            "DBParameterGroupFamily": "postgres11", 
            "SupportsLogExportsToCloudwatchLogs": false, 
            "SupportsReadReplica": true, 
            "DBEngineDescription": "PostgreSQL", 
            "EngineVersion": "11.20180419", 
            "DBEngineVersionDescription": "PostgreSQL 11.20180419 (68c23cba)", 
            "ValidUpgradeTarget": [
                {
                    "Engine": "postgres", 
                    "IsMajorVersionUpgrade": false, 
                    "AutoUpgrade": false, 
                    "EngineVersion": "11.20180524"
                }
            ]
        }, 
        {
            "Engine": "postgres", 
            "DBParameterGroupFamily": "postgres11", 
            "SupportsLogExportsToCloudwatch
[...]

Earlier this week as I was waiting to begin a talk at a conference, I chatted with someone in the audience that had a few questions. They led off with this question: is Citus a good fit for X? The heart of what they were looking to figure out: is the Citus distributed database a better fit for analytical (data warehousing) workloads, or for more transactional workloads, to power applications? We hear this question quite a lot, so I thought I’d elaborate more on the use cases that make sense for Citus from a technical perspective.

Before I dig in, if you're not familiar with Citus: we transform Postgres into a distributed database that allows you to scale your Postgres database horizontally. Under the covers, your data is sharded across multiple nodes, while everything still appears as a single node to your application. Because it still looks like a single-node database, your application doesn't need to know about the sharding. We do this as a pure extension to Postgres, which means you get all the power and flexibility that's included within Postgres, such as JSONB, PostGIS, rich indexing, and more.

OLAP - Data warehousing as it was 5 years ago

Once upon a time (when Citus Data was first started ~7 years ago), we focused on building a fast database to power analytics. Analytical workloads often had different needs and requirements around them. Transactions weren’t necessarily needed. Data was often loaded in bulk as opposed to single row inserts and updates. Analytics workloads have evolved and moved from pure-OLAP to a mix of OLTP and OLAP, and so Citus has too.

Data warehousing for slower storage and massive exploration

Going down the data warehousing rabbit hole, you'll find use cases for storing hundreds of terabytes of data. This can range from historical audit logs and analytics data to event systems. Much of the data is seldom accessed, but one day you may need it, so you want to retain it. This data is typically used internally by a data analyst that has some mix of defined reports (maybe they run them month

[...]
Posted by Bruce Momjian in EnterpriseDB on 2018-06-07 at 20:00

I had the opportunity to present an unusual topic at this year's Postgres Vision conference: Will Postgres Live Forever? It is not something I talk about often but it brings out some interesting contrasts in how open source is different from proprietary software, and why innovation is fundamental to software usage longevity. For the answer to the question, you will have to read the slides.

Posted by pgCMH - Columbus, OH on 2018-06-07 at 04:00

The June meeting will be held at 18:00 EST on Tues, the 26th. Once again, we will be holding the meeting in the community space at CoverMyMeds. Please RSVP on MeetUp so we have an idea on the amount of food needed.

What

Our very own Douglas will be presenting this month. He’s going to tell us all about the most common datatypes you’ll see in PostgreSQL. This will be a two-part talk, concluding in July.

Where

CoverMyMeds has graciously agreed to validate your parking if you use their garage so please park there:

You can safely ignore any sign saying to not park in the garage as long as it’s after 17:30 when you arrive.

Park in any space that is not marked ‘24 hour reserved’.

Once parked, take the elevator/stairs to the 3rd floor to reach the Miranova lobby. Once in the lobby, the elevator bank is in the back (West side) of the building. Take a left and walk down the hall until you see the elevator bank on your right. Grab an elevator up to the 11th floor. (If the elevator won’t let you pick the 11th floor, contact Doug or CJ (info below)). Once you exit the elevator, look to your left and right; one side will have visible cubicles, the other won’t. Head to the side without cubicles. You’re now in the community space:

Community space as seen from the stage

The kitchen is to your right (grab yourself a drink) and the meeting will be held to your left. Walk down the room towards the stage.

If you have any issues or questions with parking or the elevators, feel free to text/call Doug at +1.614.316.5079 or CJ at +1.740.407.7043

Hi. My name is Konstantin Evteev, and I'm a DBA Unit Leader at Avito.ru, one of the world's top classifieds sites. In Avito, ads are stored in PostgreSQL databases, and logical replication has been actively used for many years. With its help we successfully handle the growth of data volume and of the number of requests to it, scaling and load distribution, delivery of data to the DWH and to the search subsystems, inter-database and inter-system data synchronization, and so on. But nothing comes for free: the result is a complex distributed system. Hardware failures can happen, and you need to be ready for them. There are plenty of examples of logical replication configuration and lots of success stories about using it, but there is almost nothing about recovering after crashes and data corruption, and there are no ready-made tools for it. Over the years of constantly using PgQ replication, we have gained extensive experience and implemented our own add-ins and extensions to restore and synchronize data after crashes in distributed data systems.

In this report, we would like to show how our recovery use cases around Londiste (PGQ in general) in distributed data processing could be switched to a new logical replication in PostgreSQL 10. This research was done by Mikhail Tyurin (tmihail@bk.ru), Sergey Burladyan (eshkinkot@gmail.com) and me (konst583@gmail.com). And many thanks to Stas Kelvich from Postgres Professional for useful comments, discussions and review. We started making it when PostgreSQL 10 Alpha was released and are still working on it.

I want to highlight one aspect of infrastructure: streaming and logical replication types are set up in asynchronous mode. Synchronous replication may not cope with a huge OLTP load.

Logical replication is a common case of data denormalization. We usually put some data in normal form in one place and then we need to redistribute it to a different place and even to a different

[...]
Simple example:
create table bigtable(id bigint, ...)

...
declare _id numeric;
begin
_id := ...
FOR r IN SELECT * FROM bigtable WHERE id = _id
LOOP
...

In this case, PostgreSQL never uses the index, because the query parameter and the table attribute have different types (the parameter's type is taken from the PL/pgSQL variable). This error is more common after a migration from Oracle: ids in tables are declared as bigint or int, but variables in functions are often declared as numeric.
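
The fix is simply to declare the variable with the same type as the column; a sketch against the bigtable above:

create or replace function test_fixed()
returns void as $$
declare
  r record;
  _id bigint;   -- same type as bigtable.id, so an index on id can be used
begin
  _id := 1;
  for r in select * from bigtable where id = _id
  loop
    null;  -- process r
  end loop;
end;
$$ language plpgsql;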

plpgsql_check can identify a symptom of this issue, an implicit cast inside a predicate, and throw a performance warning. See the commit.

Example:
create table bigtable(id bigint, v varchar);
create or replace function test()
returns void as $$
declare
r record;
_id numeric;
begin
select * into r from bigtable where id = _id;
for r in select * from bigtable where _id = id
loop
end loop;
if (exists(select * from bigtable where id = _id)) then
end if;
end;
$$ language plpgsql;

select * from plpgsql_check_function('test()', performance_warnings => true);
plpgsql_check_function
-------------------------------------------------------------------------------------------------------------------------------
performance:42804:6:SQL statement:implicit cast of attribute caused by different PLpgSQL variable type in WHERE clause
Query: select * from bigtable where id = _id
-- ^
Detail: An index of some attribute cannot be used, when variable, used in predicate, has not right type like a attribute
Hint: Check a variable type - int versus numeric
performance:42804:7:FOR over SELECT rows:implicit cast of attribute caused by different PLpgSQL variable type in WHERE clause
Query: select * from bigtable where _id = id
-- ^
Detail: An index of some attribute cannot be used, when variable, used in predicate, has not right type like a attribute
Hint: Check a variable type - int versus numeric
performance:42804:10:IF:implicit cast of attr
[...]