PostgreSQL
The world's most advanced open source database
Posted by Hadi Moshayedi in CitusData on 2015-07-29 at 08:03:01

Citus Data is excited to announce the release of cstore_fdw 1.3, which is available on GitHub at github.com/citusdata/cstore_fdw. cstore_fdw is an open source columnar store extension for PostgreSQL created by Citus Data which reduces the data storage footprint and disk I/O for PostgreSQL databases.

cstore_fdw Changes

The changes in this release include:

  • ALTER FOREIGN TABLE ADD/DROP COLUMN ... support. You can now add and drop columns on existing cstore tables. Default values on newly added columns are also supported, with some restrictions.
  • Improved COPY FROM support. Column references in the COPY command are now supported: you can specify the list of columns to be copied from a file to a cstore table (see the sketch after this list).
  • Query performance improvements. Table row count estimation now works better, which allows the query planner to prepare better query plans.
  • (BugFix) Whole row references. You can now use whole-row references in function calls, like SELECT to_json(cstore_table) FROM cstore_table.
  • (BugFix) Insert concurrency. A deadlock issue during concurrent inserts is resolved.
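
A minimal sketch of the two new capabilities (the table name, server name, and file path are hypothetical; the server setup follows the pattern from the cstore_fdw README):

-- assumes: CREATE EXTENSION cstore_fdw; and a foreign server named cstore_server
CREATE FOREIGN TABLE measurements (
    device_id  int,
    reading    numeric
) SERVER cstore_server OPTIONS (compression 'pglz');

-- new in 1.3: add a column, optionally with a default
ALTER FOREIGN TABLE measurements ADD COLUMN recorded_dt date DEFAULT current_date;

-- new in 1.3: COPY with an explicit column list
COPY measurements (device_id, reading) FROM '/tmp/readings.csv' WITH (FORMAT csv);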

For installation and update instructions, you can download the cstore_fdw Quick Start Guide from our website or view them on the cstore_fdw page on GitHub.

To learn more about what’s coming up for cstore_fdw, see our development roadmap.

Got questions?

If you have questions about cstore_fdw, please contact us using the cstore-users Google group.

If you discover an issue when using cstore_fdw, please submit it to cstore_fdw’s issue tracker on GitHub.

Further information about cstore_fdw is available on our website where you can also find information on ways to contact us with questions.

Read more...

Posted by Josh Berkus in pgExperts on 2015-07-28 at 16:52:00
If you've been following the tech news, you might have noticed that we have a new open source PostgreSQL fork called "PipelineDB".  Since I've joined the advisory board of PipelineDB, I thought I'd go over what it is, what it does, and what you'd use it for.  If you're not interested in Pipeline, or Postgres forks, you can stop reading now.

PipelineDB is a streaming database version of PostgreSQL.  The idea of a streaming database, first introduced in the PostgreSQL fork TelegraphCQ back in 2003, is that queries are run against incoming data before it is stored, as a kind of stream processing with full query support.  If the idea of a standard database is "durable data, ephemeral queries" the idea of a streaming database is "durable queries, ephemeral data".  This was previously implemented in StreamBase, StreamSQL, and the PostgreSQL fork Truviso. In the Hadoop world, the concept is implemented in Apache SparkSQL.

On a practical level, what streaming queries do is allow you to eliminate a lot of ETL and redundant or temporary storage.

PipelineDB 0.7.7 is all of PostgreSQL 9.4, plus the ability to create Continuous Views: standing queries that produce different data each time you query them, depending on the incoming stream.  The idea is that you create the queries which filter and/or summarize the data you're looking for in the stream, and store only the data you want to keep, which can go in regular PostgreSQL tables.

As an example of this, we're going to use PipelineDB to do tag popularity counts on Twitter.  Twitter has a nice streaming API, which gives us some interesting stream data to work with.  First I spun up a PipelineDB Docker container.  Connecting to it, I created the "twitter" database and a static stream called "tweets":
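
A minimal sketch of that setup, assuming PipelineDB's CREATE STREAM and CREATE CONTINUOUS VIEW syntax (object names hypothetical):

CREATE DATABASE twitter;
\c twitter
-- a static stream: its columns are declared up front
CREATE STREAM tweets (content json);
-- a continuous view that extracts hashtags as tweets arrive
CREATE CONTINUOUS VIEW tagstream AS
    SELECT json_array_elements(content->'entities'->'hashtags')->>'text' AS tag
      FROM tweets;
-- tag popularity is then an ordinary query against the view:
SELECT tag, count(*) AS hits
  FROM tagstream
 GROUP BY tag
 ORDER BY hits DESC
 LIMIT 10;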

Creating a static stream isn't, strictly speaking, necessary; you can create a Continuous View without one.  As a career DBA, though, implied object names give me the heebie-jeebies.  Also, in some future release of PipelineDB, static streams will have per

[continue reading]

We’re finally at the end of the 10-part PGDB (PostgreSQL) performance series I use to initiate new developers into the database world. To that end, we’re going to discuss something that affects everyone at one point or another: index criteria. Or to put it another way:

Why isn’t the database using an index?

It’s a fairly innocuous question, but one that may have a surprising answer: the index was created using erroneous assumptions. Let’s explore what happens in a hospital environment with a pared-down table of patients.

DROP TABLE IF EXISTS sys_patient;
 
CREATE TABLE sys_patient
(
    patient_id  SERIAL   NOT NULL,
    full_name   VARCHAR  NOT NULL,
    birth_dt    DATE     NOT NULL,
    sex         CHAR     NOT NULL
);
 
INSERT INTO sys_patient (full_name, birth_dt, sex)
SELECT 'Crazy Person ' || a.id,
       CURRENT_DATE - (a.id % 100 || 'y')::INTERVAL
                    + (a.id % 365 || 'd')::INTERVAL,
       CASE WHEN a.id % 2 = 0 THEN 'M' ELSE 'F' END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_patient ADD CONSTRAINT pk_patient_id
      PRIMARY KEY (patient_id);
 
CREATE INDEX idx_patient_birth_dt ON sys_patient (birth_dt);
CREATE INDEX idx_patient_sex ON sys_patient (sex);
 
ANALYZE sys_patient;

This particular hospital has a few queries that operate based on the sex of the patient, so someone created an index on that column. One day, another developer is doing some code refactoring and, being well-trained by the resident DBA, runs the query through EXPLAIN to check the query plan. Upon seeing the result, the dev curses a bit, tries a few variants, and ultimately takes the issue to the DBA.

This is what the developer saw:

EXPLAIN ANALYZE 
SELECT *
  FROM sys_patient
 WHERE sex = 'F';
 
                             QUERY PLAN                             
--------------------------------------------------------------------
 Seq Scan on sys_patient
      (cost=0.00..19853.00 rows=498233 width=29)
      (actual time=0.018..541.738 rows=500000 loops=1)
   Filter: (sex = 'F'::bp
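
The planner is actually behaving sensibly here: sex = 'F' matches roughly half of the million rows, so a sequential scan is cheaper than bouncing through an index. One way to confirm that (a quick aside, not part of the original example) is to look at the column statistics the planner uses:

SELECT attname, n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'sys_patient'
   AND attname = 'sex';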

[continue reading]

Posted by Michael Paquier on 2015-07-24 at 05:24:38

Docker is well-known, used everywhere and by everybody, and is a nice piece of technology; there is nothing to say about that.

Now, before moving on with the real stuff, note that for the sake of this post all the experiments are done on a Raspberry Pi 2, to increase the difficulty of the exercise a bit and to gain a wider understanding of how to manipulate Docker containers and images at a rather low level, for the reasons given in the next paragraph.

So, to move back to Docker... It is a bit sad to see that there are not many container images based on ARM architectures, even though there are many such machines around. Also, a single container image can easily reach a couple of hundred megabytes in its simplest shape (which does not change the fact that some of those images are very popular, so perhaps the author of this blog should not experiment on such small-scale machines to begin with).

Not all container images are that large, though: there is for example one based on the minimalistic distribution Alpine Linux, with a size of less than 5MB. Many packages are available for it as well, which makes it a nice base image for more extended operations. Now, even though Alpine Linux does publish deliverables for ARM, there are no Docker containers around that make use of them, and trying to use a container image compiled for, say, x86_64 would just result in an epic failure.

Hence, by extending a bit a script from the upstream Docker facility of Alpine Linux, it is actually quite easy to create from scratch a container image able to run on ARM architectures (the trick is that Alpine Linux publishes its ARM deliverables under the alias armhf). Note the following things about this script:
  • root rights are needed
  • an ARM environment needs to be used to generate an ARM container
  • the script is here
Roughly, what this script does is fetch a minimal base image of Alpine Linux and then import it in an image using "d

[continue reading]

Posted by Josh Berkus in pgExperts on 2015-07-23 at 20:46:00
By now you've read about PostgreSQL 9.5 and our shiny new UPSERT feature.  And you're thinking, "hey, 9.5 is still in alpha, how do I get me some UPSERT action now?"  While there are some suggestions for workarounds on the PostgreSQL wiki, I'm going to show you one more method for approximating UPSERT while you help us test 9.5 so that it will come out sooner.

The general suggested way for handling UPSERT situations is one of the following:
  1. Try to do an INSERT.  On error, have the application retry with an UPDATE.
  2. Write a PL/pgSQL procedure which does insert-then-update or update-then-insert.
Both of these approaches have the drawback of being very high overhead: the first involves multiple round-trips to the database and lots of errors in the log, and the second involves major subtransaction overhead.  Neither is concurrency-safe, but then the method I'm about to show you isn't either.  At least this method avoids a lot of the overhead, though.

What's the method?  Using writeable WITH clauses.  This feature, introduced in 9.1, allows you to do a multi-step write transaction as a single query.  For an example, let's construct a dummy table with a unique key on ID and a value column, then populate it:

     create table test_upsert ( id int not null primary key, 
        val text );
     insert into test_upsert select i, 'aaa'
       from generate_series (1, 100) as gs(i);


Now, let's say we wanted to update ID 50, or insert it if it doesn't exist.  We can do that like so:

    WITH
    newrow ( id, val ) as (
        VALUES ( 50::INT, 'bbb'::TEXT ) ),
    tryupdate as (
        UPDATE test_upsert SET val = newrow.val
        FROM newrow
        WHERE test_upsert.id = newrow.id
        RETURNING test_upsert.id
    )
    INSERT INTO test_upsert
    SELECT id, val
        FROM newrow
    WHERE id NOT IN ( SELECT id FROM tryupdate );


The above tries to update ID=50; if no row is updated, it inserts one. This also works for multiple rows:

    WITH
    newrow ( id, val ) as (
        VALUES ( 75::INT, 'cc

[continue reading]

pgchem::tigress can generate molecular formulae like C3H6NO2- from chemical structures.

But what if we need access to the elemental composition as a relation, e.g.:

element | count
--------+------
C       | 3
N       | 1
O       | 2

Fortunately, PostgreSQL is awesome:

CREATE OR REPLACE FUNCTION elemental_composition(molformula TEXT)
  RETURNS TABLE(element TEXT, count INTEGER) AS
$BODY$
DECLARE
    token TEXT[];
    elements TEXT[];
BEGIN
    elements := ARRAY['C','N','O','P','S','Cl'];
    molformula := REPLACE(REPLACE(molformula,'-',''),'+','');

    FOREACH element IN ARRAY elements LOOP
        count := 1;
        token := REGEXP_MATCHES(molformula, element || '[\d?]*');

        IF (token[1] IS NOT NULL) THEN
            token := REGEXP_MATCHES(token[1], '[0-9]+');
            IF (token[1] IS NOT NULL) THEN
                count := token[1]::INTEGER;
            END IF;
            RETURN NEXT;
        END IF;
    END LOOP;
    RETURN;
END;
$BODY$
  LANGUAGE plpgsql IMMUTABLE STRICT
  COST 1000;


SELECT * FROM elemental_composition('C3H6NO2-');
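
Given the composition table above, the call should return something like:

 element | count
---------+-------
 C       |     3
 N       |     1
 O       |     2
(3 rows)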

And that's it. Did I already mention that PostgreSQL is awesome? :-)

As a member of the PGConf Silicon Valley Conference Committee, I have been extremely happy with the volume and quality of the talks submitted to the conference. The Committee has been working hard on sorting through the talks, and I am pleased to announce the first 5 of the 24 total breakout sessions:

  • Grant McAlister, Senior Principal Engineer for Amazon.com, on “Cloud Amazon RDS for PostgreSQL - What’s New and Lessons Learned”
  • Kenny Gorman, CTO of Data for Rackspace, on “Cloud PostgreSQL Automation Management with Ansible”
  • Magnus Hagander, Database Architect, Systems Administrator & Developer for Redpill Linpro, on “What’s New in PostgreSQL 9.5”
  • Ryan Lowe, Production Engineer at Square, on “Postgres for MySQL DBAs”
  • Matthew Kelly, In House Postgres Expert for TripAdvisor, on “At the Heart of a Giant: Postgres at TripAdvisor”

PGConf Silicon Valley is November 17-18, 2015 at the South San Francisco Conference Center. It is a technical conference aimed at the local Silicon Valley PostgreSQL community and is an opportunity for leading industry experts and the local PostgreSQL community to discuss and learn about the major new capabilities of PostgreSQL 9.4 (and 9.5!) and how to optimize a PostgreSQL environment.

If you plan to attend, Super Saver pricing is available through July 25, 2015. You can reserve your seat now by visiting the conference website.

In addition to great talks, we're also pleased to see great support from sponsors with Platinum level sponsor 2ndQuadrant joined by additional sponsors EnterpriseDB, VividCortex, PostgreSQL Experts, and Consistent State. We're also happy to welcome our first two media sponsors, Database Trends & Applications and Datanami.

We hope you will register now to attend the conference at the highly discounted Super Saver rates, which end on July 25, 2015. See you in November!

Read more...

Posted by Julien Rouhaud in Dalibo on 2015-07-22 at 10:48:16

Have you ever wished to know what configuration changed during the last weeks, when everything was so much faster, or wanted to check what happened on your beloved cluster while you were on vacation?

pg_track_settings is a simple, SQL-only extension that helps you know all of that and more, very easily. As it's designed as an extension, it requires PostgreSQL 9.1 or later.

Some insights

As with almost any extension, you have to compile it from source or use the pgxn client, since there's no package yet. Assuming you just extracted the tarball of release 1.0.0, with a typical server configuration:

$ cd pg_track_settings-1.0.0
$ sudo make install

Then the extension is available. Create the extension on the database of your choice:

postgres=# CREATE EXTENSION pg_track_settings ;
CREATE EXTENSION

In order to keep a history of the settings, you need to schedule a simple function call on a regular basis: the pg_track_settings_snapshot function. It's really cheap to call and won't have any measurable impact on your cluster. This function does all the smart work of storing whatever parameters changed since the last call.

For instance, if you want to know what changed on your server with five-minute accuracy, a simple cron entry like this for the postgres user is enough:

*/5 *  * * *     psql -c "SELECT pg_track_settings_snapshot()" > /dev/null 2>&1

A background worker could be used on PostgreSQL 9.3 and later, but as we only have to call one function every few minutes, it'd be overkill to add one just for this. If you really want one, you'd better consider setting up PoWA, or another extension that can run tasks, like pgAgent.

How to use it

Let’s call the snapshot function to get ti initial values:

postgres=# select pg_track_settings_snapshot();
 pg_track_settings_snapshot
-----------------------------
 t
(1 row)

A first snapshot with the initial settings values is saved. Now, I’ll just change a setting in the postgresql.conf file (ALTER SYSTEM could also be used on a PostgreSQL 9.4 or more rele

[continue reading]

On 3rd of July, Tom Lane committed patch: Add psql \ev and \sv commands for editing and showing view definitions.   These are basically just like the \ef and \sf commands for functions.   Petr Korobeinikov, reviewed by Jeevan Chalke, some changes by me It's not a huge patch, but it's the first patch for […]
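A quick illustration of the new commands (view name hypothetical; \ev opens the definition in your editor, \sv simply prints it):

postgres=# CREATE VIEW v_demo AS SELECT 1 AS x;
CREATE VIEW
postgres=# \sv v_demo
CREATE OR REPLACE VIEW public.v_demo AS
 SELECT 1 AS x;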
Posted by Ernst-Georg Schmid on 2015-07-21 at 11:04:00
1st impression of pg_shard:

  1. It works as described, documented limitations apply
  2. Status '3' of a shard_placement means inactive, '1' means online
  3. Read the issues list - unless you want to discover them yourself :-)
  4. If a whole worker node goes offline this may go unnoticed, since there seems to be no heartbeat between the head and worker nodes unless you try to write data
PostgreSQL 9.5 is just around the corner and many cool new features have been added to this wonderful release. One of the most exciting ones is definitely SKIP LOCKED. To make sure that concurrent operations don’t lead to race conditions, SELECT FOR UPDATE has been supported for many years now and it is essential to […]
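
A minimal sketch of what SKIP LOCKED enables (the job_queue table here is hypothetical): two workers can each claim a different pending row without blocking one another.

BEGIN;
SELECT *
  FROM job_queue
 WHERE status = 'pending'
 ORDER BY job_id
 LIMIT 1
 FOR UPDATE SKIP LOCKED;
-- process the claimed row, mark it done, then:
COMMIT;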

An easy way to give PGDB (PostgreSQL) a performance boost is to judiciously use indexes based on queries observed in the system. For most situations, this is as simple as indexing columns that are referenced frequently in WHERE clauses. PGDB is one of the few database engines that takes this idea even further with partial indexes. Unfortunately as a consequence of insufficient exposure, most DBAs and users are unfamiliar with this extremely powerful functionality.
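
A partial index is simply an index with a WHERE clause attached. As a sketch (written against the sys_order table built below):

-- index only the ~10% of orders still in flight
CREATE INDEX idx_order_active
    ON sys_order (order_state)
 WHERE order_state IN ('N', 'P');

Queries filtering on that same predicate can use this much smaller index, while the 90% of 'done' rows never enter it at all.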

Imagine we have an order system that tracks order state, such that entries are marked as new, processing, or done. These kinds of transient states are not uncommon in various inventory management systems, so it’s a great example for this use case. Often with such systems, data is distributed in such a way that more than 90% of orders are marked as ‘done’. To make this interesting, let’s just cap the done state at 90%, and distribute another 5% to processing, and 5% to new.

This somewhat complex SQL should emulate the above scenario:

DROP TABLE IF EXISTS sys_order;
 
CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_state  CHAR         NOT NULL DEFAULT 'N',
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (
         account_id, product_id, item_count, order_state,
         order_dt, valid_dt
       )
SELECT (a.id % 100) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       CASE WHEN b.id BETWEEN 1 AND 5 THEN 'N'
            WHEN b.id BETWEEN 6 AND 10 THEN 'P'
            ELSE 'D'
       END,
       now() - (a.id % 1000 || 'd')::INTERVAL
             + (a.id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (a.id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 10000) a(id),
       generate_series(1, 100) b(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (orde

[continue reading]

Posted by Pavel Stehule on 2015-07-16 at 04:08:00
When you run psql without arguments, it tries to connect to a database named like the current user. I dislike this behavior; I certainly have no database "pavel". There is a trivial solution - add this code to your .bashrc:
function psql {
    if [[ $# -eq 0 ]]; then
        env psql postgres
    else
        env psql "$@"
    fi
}

When: 7-8pm Wed July 22
Where: Oregon Convention Center, Room E141

We’re having a Birds of a Feather session at OSCON instead of our usual July meeting.

Registration is not required to attend the BoF. O’Reilly would prefer we register, though – here’s a code for a free pass if you would like to also check out the Expo Hall: https://twitter.com/oscon/status/621013720257269760


Posted by Greg Sabino Mullane in End Point on 2015-07-15 at 18:36:00

Being able to disable Postgres triggers selectively can be an important skill when doing tasks like bulk updates, in which you only want a subset of the triggers on the table to be fired. Read below for the long explanation, but the TL;DR version of the best solution is to set a WHEN clause on the trigger you wish to skip, making it conditional on a variable such as session_replication_role or application_name.


CREATE TRIGGER mytrig AFTER INSERT ON foobar FOR EACH ROW
  WHEN (current_setting('session_replication_role') <> 'local')
  EXECUTE PROCEDURE myfunc();
BEGIN;
SET LOCAL session_replication_role = 'local';
UPDATE foobar SET baz = 123;
COMMIT;

I decided to spin up a free Heroku "Hobby Dev" database to illustrate the solutions. Generating a test table was done by using the Pagila project, as it has tables which contain triggers. Heroku gives you a randomly generated user and database name. To install the Pagila schema, I did:

$ export H="postgres://vacnvzatmsnpre:2iCDp-46ldaFxgdIx8HWFeXHM@ec2-34-567-89.compute-1.amazonaws.com:5432/d5q5io7c3alx9t"
$ cd pagila-0.10.1
$ psql $H -q -f pagila-schema.sql
$ psql $H -q -f pagila-data.sql

Errors appeared on the import, but they can be safely ignored. One error was because the Heroku database does not have a user named "postgres", and the other error was due to the fact that the Heroku user is not a superuser. The data, however, was all intact. The sample data is actually quite funny, as the movie titles were semi auto-generated at some point. For example, seven random movie descriptions:

  • A Brilliant Panorama of a Madman And a Composer who must Succumb a Car in Ancient India
  • A Touching Documentary of a Madman And a Mad Scientist who must Outrace a Feminist in An Abandoned Mine Shaft
  • A Lackluster Reflection of a Eskimo And a Wretch who must Find a Fanny Pack in The Canadian Rockies
  • A Emotional Character Study of a Robot And a A Shark who must Defeat a Technical Writer in A Manhattan Penthouse
  • A Amazing Yarn of a Hunter And a Butler

[continue reading]

Posted by Andrew Dunstan in pgExperts on 2015-07-15 at 13:07:00
Say you have a partitioned table and you want to add a column. There is no problem - you just add the column to the parent table, and it is added to all the children. But what if you want to drop a column? Then things are not so straightforward. If the child's column was created before it was inherited then it won't be dropped just by dropping it on the parent. So it very much depends on how the child is set up. If you do:
create table child () inherits (parent);
then dropping a column in the parent drops it in the child too. But if you do:
create table child (like parent);
alter table child inherit parent;
then dropping a column in the parent won't drop it in the child. The pg_partman package follows this second pattern when setting up child partitions, as I discovered yesterday when a client ran into this problem. In this case you have to delete the column from the children yourself. I devised the following snippet of code to accomplish this after you have deleted the column from the parent:
do $$
declare
    child oid;
begin
    for child in
        select inhrelid
          from pg_inherits
         where inhparent = 'parent'::regclass
    loop
        execute 'alter table ' || child::regclass ||
                ' drop column if exists some_column';
    end loop;
end;
$$;

PostgreSQL 9.5 is coming up with a new feature aimed at simplifying the tracking of GUC parameters when those are spread across multiple configuration files, by introducing a new system view called pg_file_settings:

commit: a97e0c3354ace5d74c6873cd5e98444757590be8
author: Stephen Frost <sfrost@snowman.net>
date: Fri, 8 May 2015 19:09:26 -0400
Add pg_file_settings view and function

The function and view added here provide a way to look at all settings
in postgresql.conf, any #include'd files, and postgresql.auto.conf
(which is what backs the ALTER SYSTEM command).

The information returned includes the configuration file name, line
number in that file, sequence number indicating when the parameter is
loaded (useful to see if it is later masked by another definition of the
same parameter), parameter name, and what it is set to at that point.
This information is updated on reload of the server.

This is unfiltered, privileged, information and therefore access is
restricted to superusers through the GRANT system.

Author: Sawada Masahiko, various improvements by me.
Reviewers: David Steele

In short, pg_file_settings can prove quite useful when a set of configuration files is used to configure the server, for example through include or include_if_exists directives. For example, let's imagine a server with the following, tiny configuration:

$ cat $PGDATA/postgresql.conf
shared_buffers = '1GB'
work_mem = '50MB'
include = 'other_params.conf'
$ cat $PGDATA/other_params.conf
log_directory = 'pg_log'
logging_collector = on
log_statement = 'all'

Then this new system view is able to show where each parameter comes from and the value assigned to it:

=# SELECT * FROM pg_file_settings;
          sourcefile          | sourceline | seqno |       name        | setting | applied | error
------------------------------+------------+-------+-------------------+---------+---------+-------
 /to/pgdata/postgresql.conf   |          1 |     1 | shared_buffers    | 1GB     | t       | null
 /to/pgdata/postgresql.conf   |    

[continue reading]

When I need to execute some statement in every database of a PostgreSQL cluster, I use this script:
for db in `psql -At -c "select datname from pg_database where datname not in ('template0','template1')"`; 
do
psql -At -c "select current_database(), pg_size_pretty(pg_table_size('pg_attribute')) where pg_table_size('pg_attribute') > 100 * 1024 * 1024" $db;
done
Today I needed to run a VACUUM statement on selected databases, so I had to find a way to run this slow statement for each of them. I was surprised how simple this task is thanks to the powerful xargs command. As a nice bonus, I can run these slow queries in parallel, because xargs can run the given command in multiple workers (the -P option):
# find databases with bloated pg_attribute table, and enforce VACUUM
for db in `psql -At -c "select datname from pg_database where datname not in ('template0','template1')"`;
do
psql -At -c "select current_database() where pg_table_size('pg_attribute') > 100 * 1024 * 1024" $db;
done | xargs -P 3 -I % psql % -c "vacuum full verbose analyze pg_attribute"
Posted by Hans-Juergen Schoenig in Cybertec on 2015-07-13 at 13:40:09
When checking out those new features of Sybase 15.7 (yes, from time to time I get to see what commercial databases are up to), I stumbled over an interesting and yet amusing line: “Fully Recoverable DDL”. The details seem to indicate that not just Sybase is still having a hard time handling transactional and […]
I released new versions of these packages: Orafce and plpgsql_check. These are mostly bugfix releases, with PostgreSQL 9.5 support.

I apologize for putting this series on a short hiatus last week for the 4th of July. But worry not, for this week is something special for all the developers out there! I’m going to try to make your life easier for a change. Screw the database!

As a PGDB (PostgreSQL) DBA, it’s easy to get tied up in performance hints, obscure syntax, and mangled queries, but it’s really all about the people. The men and women who pound on our databases all day long trying to meet insane deadlines often stare at the screen in defeat, trying to get something to work because they’re afraid to ask for help. I know, because I used to be one of them in my bygone developer days.

Sometimes performance isn’t just about the database. Queries can be insanely complex for seemingly simple requests. Even developers who are familiar with SQL syntax don’t speak it fluently, and will often come up with rather bizarre—yet functional—solutions to get a specific data set. Nobody wins when this happens, not the poor dev who fought a truculent query all day, nor the DBA who probably doesn’t understand all the business logic that drove it.

Let’s use an example from a company I worked with in the past. This data set should give us something fairly close to what happened back then.

CREATE TABLE sys_order
(
    order_id     SERIAL       NOT NULL,
    account_id   INT          NOT NULL,
    product_id   INT          NOT NULL,
    item_count   INT          NOT NULL,
    order_dt     TIMESTAMPTZ  NOT NULL DEFAULT now(),
    valid_dt     TIMESTAMPTZ  NULL
);
 
INSERT INTO sys_order (account_id, product_id, item_count,
       order_dt, valid_dt)
SELECT (a.id % 1000) + 1, (a.id % 100000) + 1, (a.id % 100) + 1,
       now() - (id % 1000 || 'd')::INTERVAL + (id || 'ms')::INTERVAL,
       CASE WHEN a.id % 499 = 0
            THEN NULL
            ELSE now() - (id % 999 || 'd')::INTERVAL
       END
  FROM generate_series(1, 1000000) a(id);
 
ALTER TABLE sys_order ADD CONSTRAINT pk_order_id
      PRIMARY KEY (order_id);
 
CREATE INDEX idx_order_account_id
  

[continue reading]

This tip works very well for me because of my physical location (Bellingham, WA) but it would also work reasonably well for anyone flying from Denver->West Coast including places such as Houston. It does take a little bit of patience though.

A normal trip for myself would mean driving down to SEA which is 90 minutes to 2 hours. This year, I decided on whim to see what it would take to fly out of YVR (Vancouver, B.C.) which is only 60 minutes driving.

Since I would be flying out of YVR on a non-connecting flight, I paid Canadian Dollars. For those that haven't been paying attention, the U.S. dollar has been doing very well lately (even overtaking the euro on some days). For example, the Canadian dollar is 27% cheaper right now than the U.S. dollar and thus my flight was 27% cheaper.

You can't connect to YVR, you must fly out of YVR. Therefore if you are in the aforementioned areas, you would fly into Seattle or Bellingham and then drive to YVR to connect to a new flight. Be patient and give yourself enough time (to not miss your flight), and you are going to save a lot of money.

Cheerio and make sure you register for my class!

The quick summary of this issue is that the backup_label file is an integral part of your database cluster binary backup, and removing it to allow the recovery to proceed without error is very likely to corrupt your database.  Don't do that.

Note that this post does not attempt to provide complete instructions for how to restore from a binary backup -- the documentation has all that, and it is of no benefit to duplicate it here; this is to warn people about a common error in the process that can corrupt databases when people try to take short-cuts rather than following the steps described in the documentation.

How to Lose Data


The Proximate Cause

If you are not careful to follow the documentation's instructions for archiving, binary backup, and PITR restore the attempt to start the restored database may fail, and you may see this in the log:
FATAL:  could not locate required checkpoint record
HINT:  If you are not restoring from a backup, try removing the file "$PGDATA/backup_label".
... where $PGDATA is the path to the data directory.  It is critically important to note that the hint says to try removing the file "If you are not restoring from a backup".  If you are restoring from a backup, removing the file will prevent recovery from knowing what set of WAL records need to be applied to the copy to put it into a coherent state; it will assume that it is just recovering from a crash "in place" and will be happy to apply WAL forward from the completion of the last checkpoint.  If that last checkpoint happened after you started the backup process, you will not replay all the WAL needed to achieve a coherent state, and you are very likely to have corruption in the restored database.  This corruption could result in anything from the database failing to start to errors about bad pages to silently returning incorrect results from queries when a particular index is used.  These problems may appear immediately or lie dormant for months before causing visible problems.

Note that you might sometimes get lucky and not

[continue reading]

The pg_shard extension provides a transparent, automatic sharding solution for PostgreSQL. It can shard a table across a cluster of PostgreSQL nodes, storing shards in regular tables. All communication between the pg_shard master and worker nodes happens using regular SQL commands, which allows almost any PostgreSQL server to act as a worker node, including Amazon RDS instances.

Apart from simplifying database administration, using Amazon RDS or a similar solution with pg_shard has another benefit. RDS instances have automatic failover using streaming replication, which means that it is not necessary to use pg_shard’s built-in replication for high availability. Without replication, pg_shard can be used in a multi-master / masterless set-up.

At this week’s PGDay UK, we demonstrated a distributed PostgreSQL cluster consisting of 4 worker nodes on RDS and 2 master nodes on EC2 with pg_shard installed (as shown below). We showed how the cluster automatically recovers when you terminate workers or master nodes while running queries. To make it even more interesting, we put the master nodes in an auto-scaling group and put a load-balancer in front of them. This architecture is somewhat experimental, but it can support a very high number of transactions per second and very large data sizes.

Architecture diagram

Now what good is a demo if you can’t play with it yourself? To start your very own state-of-the-art distributed PostgreSQL cluster with 4 worker nodes on Amazon RDS: Launch it using CloudFormation! Make sure to enter a (long) database password and your EC2 keypair in the Parameters screen. You can leave the other settings on their defaults.

Once stack creation completes (~25 minutes), you can find the hostname of the load-balancer in the Outputs tab in the CloudFormation console (may require refresh), the hostnames of the master nodes can be found in the EC2 console, and the worker nodes in the RDS console.

We recommend you start by connecting to one of the master nodes over SSH. On the master node, run psql and enter the following

[continue reading]

Posted by Dinesh Kumar in OpenSCG on 2015-07-09 at 12:51:00

Hi,

I am pretty sure there is a better heading for this post; for now, I am going with this one. If you can suggest a proper heading, I will update it :-)

OK, let me explain the situation; then I will tell you what I am trying to do here, and how I did it.

Situation here is,

We have a table on which we need to run an UPDATE over “R” records. The UPDATE query uses some joins to get the desired result. Processing all “R” records takes “H” hours, and it puts load on the production server. So we planned to run this UPDATE as a batch process, with “N” records per batch; processing one batch UPDATE takes “S” seconds.

With the above batch process, the production server is pretty stable and doing great. So we planned to run these batch updates in parallel: “K” sessions, each running UPDATEs on different records. Of course, we could also increase the batch size here, but we want to use more CPUs to complete all these UPDATEs as soon as possible.

Problem here is,

So, as I said, we need to run multiple UPDATEs on multiple records in parallel. But how is one session going to communicate with the other sessions about which batch of records it is working on?
I mean,
if one session is running updates on records 1 to 1000, how does the second session know that the other session is already processing 1 to 1000?
If the second session knows this, it can start on 1001 to 2000 in parallel. This is the problem I am trying to solve here.

I am not sure whether this is the optimal solution, but for my requirement it's working. :-) Let me know if you see any problems in it.
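
A common way to coordinate sessions in this situation is with advisory locks. Here is a minimal sketch (not necessarily the approach shown below, and using the test table defined next), in which each worker claims only the batches no other session already holds:

DO $$
DECLARE
    b int;
BEGIN
    -- 9 batches of 1000 records; pg_try_advisory_lock() succeeds only
    -- for batch numbers no other session has already locked
    FOR b IN 0..8 LOOP
        IF pg_try_advisory_lock(b) THEN
            UPDATE test
               SET i = true
             WHERE seq BETWEEN b * 1000 + 1 AND (b + 1) * 1000;
        END IF;
    END LOOP;
END;
$$;

Advisory locks are held until the session disconnects, so two sessions running this same block will split the nine batches between them without updating any record twice.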

Object Definitions
                      Table "public.test"
Column | Type | Modifiers
--------+---------+----------------------------------------------------
t | text |
i | boolean |
seq | bigint | not null default nextval('test_seq_seq'::regclass)

postgres=# INSERT INTO test VALUES(generate_series(1, 9000), false, generate_series(1, 9000));
IN

[continue reading]

Version 0.9.2 of the BDR (Bi-Directional Replication) extension for PostgreSQL has been released.

This is a maintenance release in the current stable 0.9.x series, focused on bug fixes, stability and usability improvements. In particular bdr_init_copy, the pg_basebackup-based node bring-up tool, is significantly improved in this update.

This release also updates the BDR-patched version of PostgreSQL itself to version 9.4.4.

Sources and RPMs (for RHEL and CentOS 6 and 7, Fedora 21 and 22) are available. Debian/Ubuntu packages will be pushed to the repository soon.

As before, the sources will build and run fine on any Linux/BSD or on Mac OS X, but do not currently support Windows.

The release notes have more detail on what has changed in this release.

Posted by Paul Ramsey on 2015-07-07 at 00:00:00

Due to a number of bugs capable of crashing a back-end, we have released 2.1.8. If you are running an earlier version on a public site, we recommend that you update as soon as possible.

http://download.osgeo.org/postgis/source/postgis-2.1.8.tar.gz

View all closed tickets for 2.1.8.

This feature finds its origin in a fix for pg_rewind, where the rewind process could fail if a file listed in the source server was removed before its data was fetched (see more details here; this could impact temporary files or relation files, for example). Out of that fix comes a small feature of PostgreSQL 9.5 that may be useful for application developers:

commit: cb2acb1081e13b4b27a76c6b5311115528e49c59
author: Heikki Linnakangas <heikki.linnakangas@iki.fi>
date: Sun, 28 Jun 2015 21:35:46 +0300
Add missing_ok option to the SQL functions for reading files.

This makes it possible to use the functions without getting errors, if there
is a chance that the file might be removed or renamed concurrently.
pg_rewind needs to do just that, although this could be useful for other
purposes too. (The changes to pg_rewind to use these functions will come in
a separate commit.)

The read_binary_file() function isn't very well-suited for extensions.c's
purposes anymore, if it ever was. So bite the bullet and make a copy of it
in extension.c, tailored for that use case. This seems better than the
accidental code reuse, even if it's a some more lines of code.

Michael Paquier, with plenty of kibitzing by me.

Postgres has a set of superuser functions for looking at the files of PGDATA from the SQL interface:

  • pg_ls_dir, to list the files of a given path.
  • pg_read_file, to read a given file and return its data as text.
  • pg_read_binary_file, to read a given file and return its data as bytea.
  • pg_stat_file, to get statistics about a file similarly to the system function stat().

With the above commit, all those functions have gained a missing_ok option that lets them avoid failing should a file (or path, for pg_ls_dir) selected by the user not exist, returning a NULL result instead (or, for pg_ls_dir, an empty record set). In each case, the default is false.

In the case of pg_read_file and pg_read_binary_file, this new boolean flag is available with a new 4-argument version of those functions (for backward-compatibility purposes).
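
For instance, the call below (a sketch; the file name is arbitrary) returns NULL instead of raising an error if the file has vanished in the meantime:

=# SELECT pg_read_file('postmaster.opts', 0, 64, true);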

=# SELEC

[continue reading]

Posted by Barry Jones on 2015-07-03 at 15:04:23
PostgreSQL is becoming the relational database of choice for web development for a whole host of good reasons. That means that development teams have to make a decision on whether to host their own or use a database as a service provider. The two biggest players in the world of PostgreSQL are Heroku PostgreSQL and Amazon RDS for PostgreSQL. Here's a detailed comparison.