PostgreSQL
The world's most advanced open source database
Top posters
Number of posts in the past month
Top teams
Number of posts in the past month
Feeds
Planet
  • Policy for being listed on Planet PostgreSQL.
  • Add your blog to Planet PostgreSQL.
  • List of all subscribed blogs.
  • Manage your registration.
Contact
  • Get in touch with the Planet PostgreSQL administrators at planet at postgresql.org.
Frankly, I don't know where the practical limit for the number of rows in a single PostgreSQL table is from experience, but the interwebs seems to agree on 10^9 for narrow tables.

After a lively discussion with a NoSQL afficionado yesterday about the (in)ability to effectively store timeseries data in a RDBMS I made a quick calculation.

Timeseries data is usually a triple of the form key timestamp value, so it can be stored in a pretty narrow table, hence I stick to the 10^9 rows limit.

If we get a data point every second, we can store 10^9 seconds worth of data. 10^9 seconds is 16666666.6667 minutes, which is 277777.777778 hours, which is 11574.0740741 days, which is good for about 31 years of recording.

Every second of 31 years. Per table.

The right answer is of course “Use PostgreSQL”. It’s the main distro and we want you to use that as often as possible.

The Postgres-BDR and Postgres-XL projects are also fully open source projects, using the same copyright and licence as the main PostgreSQL project. So if you’re using PostgreSQL, they are also options to consider if you want extended functionality.

What does Postgres-BDR do? BDR allows you to have a widely distributed cluster of nodes that give you multiple full copies of a database. So you can have copies of the database in London, New York, San Francisco, Rome, Dubai, New Delhi, Hong Kong, Tokyo, Sydney, São Paulo, Buenos Aires and Johannesburg. And all your customers in those places get fast access for read AND write to the database. As long as your application doesn’t update the same data in two different places at once, you’re fine. If it does, then you probably need to rethink what you’re trying to do, but we do provide a conflict resolution scheme and options for logging.

What does Postgres-XL do? XL allows you to scale a database up from one node to many nodes. It’s a shared nothing cluster, with all the nodes in one place, though with HA and DR options. So yes, its scalable, just like MongoDB, Cassandra etc.. though XL runs both schemaless JSON and structured relational data. And most importantly, Postgres-XL does both Big Data analytics (SQL, joins etc..) and fine-grained OLTP with MVCC in the same platform. So you can load your data and then immediately run queries against it, no need to export it to another kind of database.

So both Postgres-BDR and Postgres-XL are very significant enhancements to PostgreSQL core.

Why do we have two? Why Postgres-BDR and Postgres-XL? Each variant has its own use case, so the separation isn’t really an issue. Eventually we’ll be able to take advantage of all that functionality in one system, but that will take a few years while we get that to work. And we are working as actively as we can on returning that all into the main project. pglogical and the p

[...]
Posted by Paul Ramsey in PostGIS on 2016-08-25 at 09:05
Posted by Jim Nasby on 2016-08-24 at 16:38

This post is the first in our series on PostgreSQL, a highly customizable and standards compliant open source object-relational database system.

The modern world isn’t just a sea of data, it’s also awash in data storage options. It used to be a company just had to choose between one of the “Big 3” (DB2, MS-SQL, or Oracle), or MySQL or PostgreSQL. Today there’s a few more relational databases, a bunch of NqSQL databases, as well as specialty databases such as graph databases or time-series databases. Many companies now run multiple different database technologies.

What if there was one solution that satisfied all these needs?

The hidden costs of data

There’s one thing that’s important to understand: data is not easy. If something claims to make it easy, either it’s glossing over details or just ignoring them. A choice that makes one data problem easy makes other problems harder:

  • Lack of schema makes ingestion easy and analysis difficult
  • Strong OLTP performance reduces OLAP performance
  • Consistency, Availability, Performance: Pick 2 (the CAP theorem)

Understanding how your data technologies work under the hood is critical for making informed decisions and designing an appropriate data architecture.

Here’s your magic bullet!

Just kidding. Sort of.

Postgres has a lot of features that most users aren’t aware of. Unfortunately, there’s no make_data_easy = true setting, but there are a lot of features that make various problems easier. In fact, it’s best not to think of Postgres as a database, because it goes far beyond that. Think of it as a data platform.

Features you’ve never heard of

This is certainly not a complete list, but I think it’s some of the more overlooked features.

Arrays and Composite types

Tables are not the only way to store sets of information in Postgres, nor are they the only way to pass information around your systems. Arrays and composite types work together the same way that lists and dictionaries/hashes do in other languages: they allow you to create a single object that represents an arbitrari

[...]
Posted by Yann Larrivee on 2016-08-23 at 16:30
ConFoo Montreal: March 8th-10th 2016

ConFoo Montreal: March 8th-10th 2016

Want to get your web development ideas in front of a live audience? The call for papers for the ConFoo Montreal 2017 web developer conference is open! If you have a burning desire to hold forth about PHP, Java, Ruby, Python, or any other web development topics, we want to see your proposals. The window is open only from August 21 to September 20, 2016, so hurry. An added benefit: If your proposal is selected and you live outside of the Montreal area, we will cover your travel and hotel.

You’ll have 45 minutes to wow the crowd, with 35 minutes for your topic and 10 minutes for Q&A. We can’t wait to see your proposals. Knock us out!

ConFoo Montreal will be held on March 8-10, 2017. For those of you who already know about our conference, be aware that this annual tradition will still be running in addition to ConFoo Vancouver. Visit our site to learn more about both events.

Posted by gabrielle roth on 2016-08-23 at 01:16
We’re currently redesigning our data warehouse, and we’re experimenting with a clock (or time, or time-of-day) dimension to represent the time of day, separate from the date dimension we already use. This table should contain a record for each minute of a day, allowing us to easily determine business hours in various timezones, etc etc. I’m […]

Walbouncer was covered on the blog already when it was first announced, but that was almost 2 years ago – so it would be a good time to echo it out again, especially in light of compatibility update to support PostgresSQL 9.5, addition of a helper script for bootstrapping replicas called Walbouncer-companion and making the […]

The post Walbouncer refreshed – a proxy for selective PostgreSQL physical replication appeared first on Cybertec - The PostgreSQL Database Company.

Posted by Pavel Stehule on 2016-08-21 at 07:40
I finished the patch for XMLTABLE in PostgreSQL. It is working well. Examples what I found on net works:

postgres=#   SELECT t.id, x.*
FROM employees t,
XMLTABLE ('/Employees/Employee[age>40]'
PASSING t.data
COLUMNS firstname VARCHAR(30) PATH 'firstname',
lastname VARCHAR(30) PATH 'lastname',
age VARCHAR(30) PATH 'age') x
WHERE t.id = 1;
┌────┬───────────┬──────────┬─────┐
│ id │ firstname │ lastname │ age │
╞════╪═══════════╪══════════╪═════╡
│ 1 │ Jim │ Moriarty │ 52 │
│ 1 │ Mycroft │ Holmes │ 41 │
└────┴───────────┴──────────┴─────┘
(2 rows)

Time: 1.619 ms
postgres=# SELECT t.id, x.*
FROM employees t,
XMLTABLE ('/Employees/Employee[@emplid=2222]'
PASSING t.data
COLUMNS firstname VARCHAR(30) PATH 'firstname',
lastname VARCHAR(30) PATH 'lastname') x
WHERE t.id = 1;
┌────┬───────────┬──────────┐
│ id │ firstname │ lastname │
╞════╪═══════════╪══════════╡
│ 1 │ Sherlock │ Homes │
└────┴───────────┴──────────┘
(1 row)

Time: 1.606 ms
postgres=# SELECT emp.id, x.*
FROM employees emp,
XMLTABLE ('/Employees/Employee'
PASSING emp.data
COLUMNS firstname VARCHAR(30) PATH 'firstname',
type VARCHAR(30) PATH '@type') x;
┌────┬───────────┬───────┐
│ id │ firstname │ type │
╞════╪═══════════╪═══════╡
│ 1 │ John │ admin │
│ 1 │ Sherlock │ admin │
│ 1 │ Jim │ user │
│ 1 │ Mycroft │ user │
└────┴───────────┴───────┘
(4 rows)

Time: 1.556 ms

Please, test it, check it.
Posted by Shaun M. Thomas on 2016-08-19 at 18:42

All database engines, even Postgres, occasionally present a seemingly intractable problem that will drive everyone insane while attempting to isolate it. All it takes is the perfect storm of situational circumstances, and even a perfectly running stack of software will grind to a screeching halt. It’s situations like this that we must turn to various Postgres forensic tools to track down the issue before management starts firing people in frustration.

We’ve already discussed how connections can be blocked by improper operation ordering. But what if thing’s aren’t so straight-forward? When there are more than two actors involved, the potential for weird interactions increases to a point where replicating the issue in a clean environment is often difficult or impossible. So let’s make sure a few things are in place.

First things first: logging. We should tweak log settings from the defaults, as they provide a wealth of after-the-fact diagnostic information. All of these should be in postgresql.conf:

log_checkpoints = on
log_statement = ddl
log_min_duration_statement = 1000
log_line_prefix = '%p|%u|%d|%r|%t|'

There are a lot of other log settings, but these are the ones I personally consider essential modifications. Why? Let’s examine them one by one.

  • log_checkpoints: A checkpoint happens when Postgres writes pending modifications to table files. If this setting is enabled, we learn when that process started, how long it took, how much data was written, how much time was spent syncing to disk and thus waiting for the underlying OS to flush the data, and so on. Enabling this can single-handedly track down insufficient write performance of a storage device, or reveal times of heavy write activity that may suggest better transaction log coverage, and so on. It’s a crime this isn’t enabled by default.
  • log_statement: The ddl setting will log any modification to a database object. Create a table? Logged. Modify a view? Logged. Drop an index? Logged. This is the minimal level of activity auditing a DBA should expect.
[...]
Posted by Chris Travers on 2016-08-19 at 05:43
So one of the folks I do work with is moving a large database from PostgreSQL to Hadoop.  The reasons are sound -- volume and velocity are major issues for them, and PostgreSQL is not going away in their data center and in their industry there is a lot more Hadoop usage and tooling than there is PostgreSQL tooling for life science analytics (Hadoop is likely to replace both PostgreSQL and, hopefully, a massive amount of data on NFS).  However this has provided an opportunity to think about big data problems and solutions and their implications.  At the same time I have seen as many people moving from Hadoop to PostgreSQL as the other way around.  No, LedgerSMB will never likely use Hadoop as a backend.  It is definitely not the right solution to any of our problems.

Big data problems tend to fall into three categories, namely managing ever increasing volume of data, managing increasing velocity of data, and dealing with greater variety of data structure.  It's worth noting that these are categories of problems, not specific problems themselves, and the problems within the categories are sufficiently varied that there is no solution for everyone.  Moreover these solutions are hardly without their own significant costs.  All too often I have seen programs like Hadoop pushed as a general solution without attention to these costs and the result is usually something that is overly complex and hard to maintain, may be slow, and doesn't work very well.

So the first point worth noting is that big data solutions are specialist solutions, while relational database solutions for OLTP and analytics are generalist solutions.  Usually those who are smart start with the generalist solutions and move to the specialist solutions unless they know out of the box that the specialist solutions address a specific problem they know they have.  No, Hadoop does not make a great general ETL platform.....

One of the key things to note is that Hadoop is built to solve all three problems simultaneously.  This means that you effectively bu[...]
Posted by Simon Riggs in 2ndQuadrant on 2016-08-18 at 15:30

Postgres-BDR has now reached 1.0 production status.

Over the last 2 years, Postgres-BDR has been used daily for mission critical production systems.

As you might imagine, it’s been improved by both bug fixes and feature enhancements that allow it to be used smoothly, so its mature, robust and feature-rich.

The BDR Project introduced logical replication for PostgreSQL, now available as pglogical. In addition, it introduced replication slots, background workers and many other features. But is it still relevant?

Postgres-BDR delivers all of these features that aren’t yet in PostgreSQL 9.6, and likely won’t all be in PostgreSQL 10.0 either

  • Automatic replication of DDL, reducing maintenance costs for DBAs
  • Automatic replication of sequences
  • Conflict resolution and logging
Posted by Ernst-Georg Schmid on 2016-08-18 at 12:10
Did you know that you can make a Hexastore from a RDF triple in just one line of SQL? (This needs PostgreSQL 9.4 or better, because of the multi-array unnest)

CREATE OR REPLACE FUNCTION hexify(
    IN sub text,
    IN pred text,
    IN obj text)
  RETURNS TABLE(ord text, a text, b text, c text) AS
$$select A.t || B.t || C.t as ord, A.v, B.v, C.v from (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as A(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as B(v, t) cross join (select * from unnest(ARRAY[sub, pred, obj],ARRAY['s', 'p', 'o'])) as C(v, t) where a.v != b.v and a.v != c.v and b.v != c.v order by ord desc$$
  LANGUAGE sql IMMUTABLE STRICT
  COST 100
  ROWS 6;

SELECT* FROM hexify('subject','predicate','object');


Sometimes, PostgreSQL SQL is just awesome...

More on Hexastores here and here.

Part six of six in our annual PyCon Must-See Series, a weekly highlight of talks our staff especially loved at PyCon. With so many fantastic talks, it’s hard to know where to start, so here’s our short list.

Coming from a heavy database admin background, I found Craig Kerstiens’s “Postgres Present and Future “to be incredibly well organized and engaging. Of particular interest to me, because I am somewhat new to Postgres (while having more background history with MS SQL), was the deep dive into indexes in Postgres.

Check out 5:44-8:39 to find out when to use different types of indexes, outside of the standard B-Tree. For instance, Gin indexes are helpful when searching multiple values for a single column, ie. an array field or a JSONB field.

Click over to 17:22-19:33 to learn about the new Bloom Filter in Postgres 9.6, which is coming out in a few months. This extension seems like it will be incredibly useful to speed up queries on wide tables with a bunch of different options.


More in the annual PyCon Must-See Talks Series.

PostgreSQL is an awesome project and it evolves at an amazing rate. We’ll focus on evolution of fault tolerance capabilities in PostgreSQL throughout its versions with a series of blog posts. This is the fourth post of the series and we’ll talk about synchronous commit and its effects on fault tolerance and dependability of PostgreSQL.

If you would like to witness the evolution progress from the beginning, please check the first three blog posts of the series below. Each post is independent, so you don’t actually need to read one to understand another.

  1. Evolution of Fault Tolerance in PostgreSQL 
  2. Evolution of Fault Tolerance in PostgreSQL: Replication Phase 
  3. Evolution of Fault Tolerance in PostgreSQL: Time Travel

synchronous

Synchronous Commit

By default, PostgreSQL implements asynchronous replication, where data is streamed out whenever convenient for the server. This can mean data loss in case of failover. It’s possible to ask Postgres to require one (or more) standbys to acknowledge replication of the data prior to commit, this is called synchronous replication (synchronous commit).

With synchronous replication, the replication delay directly affects the elapsed time of transactions on the master. With asynchronous replication, the master may continue at full speed.

Synchronous replication guarantees that data is written to at least two nodes before the user or application is told that a transaction has committed.

The user can select the commit mode of each transaction, so that it is possible to have both synchronous and asynchronous commit transactions running concurrently.

This allows flexible trade-offs between performance and certainty of transaction durability.

Configuring Synchronous Commit

For setting up synchronous replication in Postgres we need to configure synchronous_commit parameter in postgresql.conf.

The parameter specifies whether transaction commit will wait for WAL records to be written to disk before the command returns a success indication to the client. Valid values are onremote_applyremote_w

[...]

I have updated my Rich in the Jungle presentation with new pricing for AWS vs. Softlayer. Things haven't changed much, in terms of raw performance per dollar (which is not the only qualifier) Softlayer is clearly the winner.

During recent years there has been quite a lot of fuzz about “insert-only” approaches and some database-like products (Datomic, Apache Samza) have emerged, being inspired by the idea of having an immutable datastore. In light of cheap storage and powerful hardware, I see the idea definitely having potential for certain use cases. So why not […]

The post Insert-only data modeling with PostgreSQL? appeared first on Cybertec - The PostgreSQL Database Company.

In almost every releases of Pgpool-II, we import the latest version of PostgreSQL's SQL parser (more precisely, the raw parser). This time, our new Pgpool-II developer faced with an interesting problem while importing PostgreSQL 9.6's parser.

In PostgreSQL, the SQL parser is written in bion, a general-purpose parser generator. The particular file including the SQL grammar rules is named "gram.y". gram.y used to include "scan.l", which is a lexical scanner written in flex.

 In reality, gram.y is translated into a C source file by bison, then compiled. Same thing can be said to scan.l, which is translated by flex though.

So the main part of SQL parser  source file was single big file consisted of gram.y and scan.l.

From PostgreSQL 9.6, however, PostgreSQL developers decided to keep gram.y and flex.l separated.

Build backend/parser/scan.l and interfaces/ecpg/preproc/pgc.l standalone.

This gives enough confusion to the developer in charge of the work and took some time before realize the change. I would say it's a fun part of the task when we work on an OSS project:-)  However I cannot stop saying that it would be nice if the SQL parser is exported as a separate library so that we do not need this kind of work in every releases of Pgpool-II.
So for those of you who know, I now spend most of my time doing more general PostgreSQL consulting and a fair bit of time still on LedgerSMB.  One of my major projects lately has been on a large scientific computing platform currently run on PostgreSQL, but due to volume and velocity of data being moved to Hadoop (the client maintains other fairly large PostgreSQL instances with no intention of moving btw).

With this client's permission I have decided to take a lot of the work I have done in optimizing their job queue system and create an extension under PostgreSQL for it..  The job queue currently runs tens of millions of jobs per day (meaning twice that number of write queries, and a fair number of read queries too) and is one of the most heavily optimized parts of the system, so this will be based on a large number of lessons learned on what is a surprisingly hard problem.

It is worth contrasting this to pg_message_queue of which I am also the author.  pg_message_queue is intended as a light-weight, easy to use message queue extension that one can use to plug into other programs to solve common problems where notification and message transfer are the main problems.  This project will be an industrial scale job queuing system aimed at massive concurrency.  As a result simplicity and ease of use take second place to raw power and performance under load.  In other words here I am not afraid to assume the dba and programming teams know what they are doing and has the expertise to read the manual and implement appropriately.

The first version (1.x) will support all supported versions of PostgreSQL and make the following guarantees:


  1. massively multiparallel, non-blocking performance  (we currently use with 600+ connections to PostgreSQL by worker processes.
  2. Partitioning, coalescing, and cancelling of jobs similar in some ways to TheSchwartz
  3. Exponential pushback based on number of times a job has failed
  4. Jobs may be issued again after deletion but that this can always be detected and bad jobs pruned
  5. Optionally jo
[...]
Posted by Shaun M. Thomas on 2016-08-12 at 16:48

“Hey! That row shouldn’t be in that table! How the heck did that get there!? Alright, who wrote the application client filters, because you’re fired!”

Good application developers know never to trust client input, but not all realize that a single app is rarely the only vector into a database. Databases don’t just preserve data with various levels of paranoia, they’re also the central nexus of a constellation of apps, scripts, APIs, GUIs, BMIs, HMOs, and STDs. As such, unless every single one of those share a common ancestor that sanitizes, boils, renders, and formats content before storage, there’s bound to be inconsistencies. That’s just how things work.

One of the major benefits of using an RDBMS, is that engines like Postgres provide several mechanisms for ensuring data integrity beyond crash durability and transaction control. Continuing our discussion on database objects from last week, we vaguely referred to other types of constraint. So, what other mechanisms are available aside from primary keys and unique constraints?

Let’s start with foreign keys, since they illustrate how tables are related and enforce that relation to reject invalid content. Here are two simply related tables and a couple of rows:

CREATE TABLE pet_type
(
  type_id  SERIAL   PRIMARY KEY,
  animal   VARCHAR  UNIQUE NOT NULL
);
 
CREATE TABLE pet
(
  pet_id      SERIAL   PRIMARY KEY,
  type_id     INT      NOT NULL,
  pet_name    VARCHAR  NOT NULL,
  owner_name  VARCHAR  NOT NULL
);
 
ALTER TABLE pet
  ADD CONSTRAINT fk_pet_pet_type FOREIGN KEY (type_id)
      REFERENCES pet_type (type_id);
 
INSERT INTO pet_type (animal) VALUES ('cat'), ('dog');
 
INSERT INTO pet
  (type_id, pet_name, owner_name)
VALUES
  (1, 'Meow Meow Fuzzyface', 'Cedric'),
  (2, 'Mr. Peanutbutter', 'Paul');

Foreign keys provide a buffer between pending modifications by enforcing the relationship. In this case, we can’t remove “dog” as a pet type because at least one pet references it. We also can’t insert a pet fish, because there’s no corresponding type.

T

[...]
Posted by Craig Ringer in 2ndQuadrant on 2016-08-12 at 05:22

I’m pleased to say that we’ve just released Postgres-BDR 1.0, based on PostgreSQL 9.4.9.

This release contains significant improvements to DDL replication locking, global sequences, documentation, performance, and more. It also removes the deprecated UDR component in favour of pglogical.

It’s taken a lot of work to get to this point. This release sets the foundation to port BDR to PostgreSQL 9.6 and to enhance its high-availability capabilities, and I’m excited to be pushing BDR forward.

Posted by gabrielle roth on 2016-08-11 at 23:10

When: 6-8pm Thursday August 18, 2106
Where: iovation
Who: Brian Panulla, Marty Zajac, Gabrielle Roth
What: ETL Throwdown (or up)

For August, we’re having a panel discussion on various ETL tools we’ve tried.
Brian: moving from bulk ETL to near real time ETL with Tungsten Replicator to replicate from MySQL to PostgreSQL and Solr
Marty: Pentaho
Gabrielle: CloverETL, home-grown solution using postgres_fdw, and a brief rant about Informatica if we have time.


If you have a job posting or event you would like me to announce at the meeting, please send it along. The deadline for inclusion is 5pm the day before the meeting.

Our meeting will be held at iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

iovation provides us a light dinner (usually sandwiches or pizza).

Elevators open at 5:45 and building security closes access to the floor at 6:30.

See you there!


PostgreSQL has a bagful of server configuration parameters (249 according to my counting for version 9.5) at your disposal, which mostly is a good thing as it enables to take the maximum out of your hardware if you’re willing to put in the necessary time. But some of the parameters might leave the door open […]

The post Logging of data modifications and the “log_statement” configuration parameter appeared first on Cybertec - The PostgreSQL Database Company.

Posted by Paul Ramsey in PostGIS on 2016-08-10 at 16:05
Before doing a deep dive into the subject, a short outline about PgBouncer, its a lightweight connection pooler for PostgreSQL that dramatically reduces the processing time and resources for maintaining a large number of client connections to one or more databases. Typically used to increase the number of user connections that can be handled in a high performance environment. For more details on Installing/Configuring PgBouncer refer to the documentation here.
Like other tools, PgBouncer has a  stderr/syslog logging architecture to record connection, disconnection, and  pooler_errors with different verbosity levels. As of now, the greater part of logging go to one single file "pgbouncer.log" and grows endlessly. Sometimes, it might be a potential risk of making a system unresponsive due to lack of disk space on the log file location. At present, PgBouncer logging has no in-built configuration to rotate logs on the basis of age or size, hence it forces users to choose alternative methods. IMO, there are two approaches to handle it :-
  1. Configure PgBouncer in "syslog" method to rely on OS log rotation or
  2. Configure log rotation using OS utilities on "pgbouncer.log" file.

Method 1:

Its pretty straightforward to configure syslog in PgBouncer, set "syslog" to 1 (default 0); give a name to begin the log line in OS logs in "syslog_ident" (default 'pgbouncer') and specify the facility details in "syslog_facility" (default daemon). A sample output from my OS logs(/var/log/messages):
Aug 5 16:54:27 raghavt pgbouncer[62549]: C-0x1cdfe60: postgres/postgres@unix(62621):6432 login attempt: db=postgres user=postgres tls=no
Aug 5 16:54:27 raghavt pgbouncer[62549]: S-0x1ce4b10: postgres/postgres@127.0.0.1:5432 new connection to server (from 127.0.0.1:38947)
Aug 5 16:54:27 raghavt pgbouncer[62549]: C-0x1cdfe60: postgres/postgres@unix(62621):6432 closing because: client close request (age=0)
Note: If "syslog" enabled, comment or blank out the "logfile" parameter, else it will be additional logging. 

Method 2:

Logrotate is
[...]
Posted by Tomas Vondra in 2ndQuadrant on 2016-08-08 at 15:05

I’ve published multiple benchmarks comparing different PostgreSQL versions, as for example the performance archaeology talk (evaluating PostgreSQL 7.4 up to 9.4), and all those benchmark assumed fixed environment (hardware, kernel, …). Which is fine in many cases (e.g. when evaluating performance impact of a patch), but on production those things do change over time – you get hardware upgrades and from time to time you get an update with a new kernel version.

For hardware upgrades (better storage, more RAM, faster CPUs, …), the impact is usually fairly easy to predict, and moreover people generally realize they need to assess the impact by analyzing the bottlenecks on production and perhaps even testing the new hardware first.

But for what about kernel updates? Sadly we usually don’t do much benchmarking in this area. The assumption is mostly that new kernels are better than older ones (faster, more efficient, scale to more CPU cores). But is it really true? And how big is the difference? For example what if you upgrade a kernel from 3.0 to 4.7 – will that affect the performance, and if yes, will the performance improve or not?

From time to time we get reports about serious regressions with a particular kernel version, or sudden improvement between kernel versions. So clearly, kernel versions may affects performance.

I’m aware of a single PostgreSQL benchmark comparing different kernel versions, made in 2014 by Sergey Konoplev in response to recommendations to avoid 3.0 – 3.8 kernels. But that benchmark is fairly old (the last kernel version available ~18 months ago was 3.13, while nowadays we have 3.19 and 4.6), so I’ve decided to run some benchmarks with current kernels (and PostgreSQL 9.6beta1).

PostgreSQL vs. kernel versions

But first, let me discuss some significant differences between policies governing commits in the two projects. In PostgreSQL we have the concept of major and minor versions – major versions (e.g. 9.5) are released roughly once a year, and include various new features. Minor versions

[...]
It's possible to use EXPLAIN output in plpgsql. In the default text format, the result comes back as a set of text values, so you can process them in a loop like this:

declare
   exp text;
begin
   for exp in explain myquery
   loop
      raise notice '%', exp;
   end loop;
end;


If you use json format output, however, the result comes back as a single json document rather than a set of lines. You still might need a loop - I haven't found another way yet of getting the output from EXPLAIN into a variable - but the loop will only have one iteration. Here is an example taken from a function I wrote the other day that lets you get the estimated number of rows from the plan:


declare
   exp json;
begin
   for exp in explain (format json) myquery
   loop
      raise notice 'rows: %', exp#>>'{0,Plan,Plan Rows}';
   end loop;
end;

Sometimes it happens that a query slows down dramatically for no obvious reason. In other cases queries go south in case a certain set of input parameters is used. In some of those situations wrong optimizer estimates can be the root cause of all evil. One of my “favorite” issues with wrong planner estimates: Underestimated […]

The post Exploding runtime: How nested loops can destroy speed appeared first on Cybertec - The PostgreSQL Database Company.

This is a technical post !

What if you want to search %asd% pattern and do it fast ? Wildspeed extension provides GIN index support for wildcard search for LIKE operator.


Authors

Oleg Bartunov , Moscow, Moscow University, Russia
Teodor Sigaev , Moscow, Moscow University,Russia

License

Stable version, included into PostgreSQL distribution, released under BSD license. Development version, available from this site, released under the GNU General Public License, version 2 (June 1991)

Downloads

Latest version of wildspeed is available from
git clone git://sigaev.ru/wildspeed

Wildspeed provides opclass (wildcard_ops) and uses partial match feature of GIN, available since 8.4. Also, it supports full index scan.

The size of index can be very big, since it contains entries for all permutations of the original word, see [1] for details. For example, word hello will be indexed as well as its all permutations:

=# select permute('hello');
               permute                
--------------------------------------
 {hello$,ello$h,llo$he,lo$hel,o$hell}


Notice, symbol '$' is used only for visualization, in actual implementation null-symbol '\0' is used.

Search query rewritten as prefix search:

*X  -> X$*
X*Y -> Y$X*
*X* -> X*


For example, search for 'hel*o' will be rewritten as 'o$hel'.

Special function permute(TEXT), which returns all permutations of argument, provided for test purposes.

Performance of wildspeed depends on search pattern. Basically, wildspeed is less effective than btree index with text_pattern_ops for prefix search (the difference is greatly reduced for long prefixes) and much faster for wildcard search.

Wildspeed by default uses optimization (skip short patterns if there are long one), which can be turned off in Makefile by removing define -DOPTIMIZE_WILDCARD_QUERY.

References:

1. http://www.cs.wright.edu/~tkprasad/courses/cs499/L05TolerantIR.ppt, see also, http://nlp.stanford.edu/IR-book/html/htmledition/permuterm-indexes-1.html


Examples:

Table words contains 747358 records, w[...]