
PostgreSQL JDBC Driver Version 9_4_1202 released


Lots of bug fixes and some awesome performance enhancements, including statement caching

Version 9.4-1201 (2015-02-25)

  • ResultSet positioning methods in some particular cases PR #296 (282536b)
Author: Craig Ringer
  • Disable binary xfer on batches returning generated keys PR #273 (763ae84)
  • Add a new test case demonstrating a bug in returning support PR #273 (4d2b046)
  • Always Describe a query in a batch that returns generated keys PR #273 (a6bd36f)
Author: Dave Cramer
  • chore: fix build.xml to allow releasing to maven PR #262 (34f9361)
  • fix: BlobInputStream ignores constructor parameters #263 PR #273 (c1c6edc)
  • don't reset forceBinary transfer if we setPreparedThreshold (937a11c)
  • Revert "perf: Remove expensive finalize methods from Statement and Connection" PR #293 (a0d3997)
  • updated copyright PR #312 (263375c)
  • Revert "Issue 250 -- Adding setURL/getURL to BaseDataSource.java" PR #312 (a1ac380)
  • fixed mailing list href PR #326 (c3e86a6)
  • increment driver version PR #346 (b8ee75d)
Author: David R. Bild
  • feat: add equality support to PSQLState PR #277 (7698cd9)
  • Improve version checking PR #355 (f7a84db)
Author: Eugene Koontz
  • Add support within "private Object buildArray (PgArrayList input, int index, int count)" for array elements whose type is jsonb PR #349 (d313138)
  • Added setter method for logging level. The method exactly matches property name in documentation. PR #282 (d9595d1)
  • Added getter method. PR #282 (65759f0)
  • Adding XML catalog to help unit tests not remote entity resolution. PR #284 (cb87067)
  • Added support to locally resolve dtd or entity files. PR #284 (017970d)
  • Disable verbose logging of the catalog resolver. PR #284 (fcc34f5)
Author: Kris Jurka
  • Improve error message for failure to update multicolumn primary key RSs. PR #284 (05ff811)
  • Remove all JDBC3 code as JDK 1.4/1.5 are no longer supported. PR #284 (f9a956b)
  • Add preliminary support for JDBC4.2. PR #284 (bd05fd2)
Author: Lo

[continue reading]

I've been writing a little provisioning script for a Vagrant CentOS 7 machine that will be a buildfarm client.

Stripped of the Vagrant boilerplate and a few machine-specific details, here is what I did to get it running:

sudo yum install -y wget gcc make flex bison ccache git \
perl-libwww-perl perl-Digest-SHA perl-devel perl-ExtUtils-Embed \
zlib-devel openssl-devel readline-devel python-devel tcl-devel \
libxml2-devel libxslt-devel openldap-devel

mkdir bf
cd bf
HERE=`pwd`
wget http://www.pgbuildfarm.org/downloads/latest-client.tgz
mv latest-client.tgz buildfarm-latest-client.tgz
tar -z --strip-components=1 -xf buildfarm-latest-client.tgz
mkdir root
sed -i "s!build_root =>.*!build_root => '$HERE/root',!" build-farm.conf
At this stage the buildfarm client will run quite successfully, as can be tested using:
perl run_build.pl --test
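
The remaining steps (registering the animal with the buildfarm server and scheduling regular runs) are not shown here; as a hedged sketch, once registered, a crontab entry along these lines keeps the client running (the schedule is arbitrary, and run_branches.pl is the client's wrapper that loops over all configured branches):

# run the buildfarm client for every configured branch, every six hours
43 */6 * * * cd $HOME/bf && ./run_branches.pl --run-all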

While hacking PostgreSQL it’s very useful to know the pid of the backend you are working with. You need the pid of the process to attach a debugger, a profiler, etc. Luckily, .psqlrc provides an elegant way to define shortcuts for psql. Using the config line below, one can find out the backend pid just by typing :pid.

\set pid 'SELECT pg_backend_pid();'

=# :pid
 pg_backend_pid
----------------
          99038
(1 row)

In 9.6 it even becomes possible to include the backend pid in the psql prompt.
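
For example, a minimal sketch using the %p prompt substitution (the rest of the prompt string is an arbitrary choice):

\set PROMPT1 '%p %/%R%# '

After that, every prompt starts with the pid of the connected backend.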

However, it’s possible to automate more complex actions in psql as well. I’ve configured my psql to run gdb attached to the current backend in a new tab of iTerm2 just by typing :gdb.

The :gdb command selects the pid of the current backend and pipes it into the pg_debug script.

\set gdb 'SELECT pg_backend_pid() \\g |pg_debug'

pg_debug extracts the pid from its input and then runs an OSA script which opens gdb in a new tab of iTerm2.

#!/bin/bash

IFS=''

while read line
do
    # Extended display off
    if [[ $line =~ ^\ +([0-9]+) ]]; then
        PID=${BASH_REMATCH[1]}
        break
    fi
    # Extended display on
    if [[ $line =~ ^pg_backend_pid.*\ ([0-9]+) ]]; then
        PID=${BASH_REMATCH[1]}
        break
    fi
done

# Open gdb session
osascript -e "
tell application \"iTerm\"
    activate
    tell the current terminal
        set mysession to (the current session)
        launch session \"Default Session\"
        tell the last session
            write text \"gdb --pid=$PID -x <(echo continue)\"
        end tell
        select mysession
    end tell
end tell"

This script works for Mac OS X and iTerm2, but the same approach should work for other platforms and terminal emulators.
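
For instance, on Linux the osascript block above could be swapped for a terminal multiplexer call; a hedged sketch using tmux (assuming psql itself is running inside a tmux session, so the new pane opens alongside it):

# open gdb attached to the backend in a new tmux pane instead of an iTerm2 tab
tmux split-window -h "gdb --pid=$PID -ex continue"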

Posted by Pavan Deolasee in 2ndQuadrant on 2015-08-26 at 11:12:40

It’s been a busy few months as we work towards merging Postgres-XL with the latest and greatest release of PostgreSQL. Postgres-XL is an open source fork of PostgreSQL that provides a scalable platform for OLTP and Business Intelligence. The current release of Postgres-XL is based on PostgreSQL 9.2, so it lacks all the improvements made to PostgreSQL over the last three years.

2ndQuadrant and other companies are working on bringing distributed scalability into PostgreSQL core as well as building tools and extensions outside the core. As part of that, Postgres-XL has a number of features that we’d like to bring back into core PostgreSQL, so 2ndQuadrant has picked up the task of updating the Postgres-XL code base to the latest PostgreSQL release as the first step. After more than 3 months’ work, and with PostgreSQL 9.5 still in alpha stage, we wanted to give a progress report on how the work is proceeding. I also need to say the magic words: This ongoing work on Postgres-XL is part of the AXLE project, funded by the European Union under grant agreement 318633.

Preparation for the Merge

Since PostgreSQL and Postgres-XL both use Git as their source control system, the merge process is much simpler, as Git provides many tools to assist it. But as soon as we tried the merge, we faced the first hurdle.

We realised that the current Postgres-XL repository is based on an older minor 9.2 release of PostgreSQL. That means there were commits and changes in the Postgres-XL master branch which either never made it to PostgreSQL’s master branch or had different commit ids. So merging with the PostgreSQL master branch threw a lot more conflicts than we would otherwise have expected. The first task we had to accomplish, then, was to rebase the Postgres-XL 9.2 repository on a later commit point. This obviously required careful treading, making sure that nothing broke during the process. Once we had the basic rebase done, we also merged all Postgres-XL bug fixes and enhancements, created a Postgres-XL 9.2 stable branch and merged the 9.2 br

[continue reading]

Posted by Yann Larrivee on 2015-08-26 at 00:50:48

ConFoo is once more seeking passionate speakers for the upcoming conference.

The event is happening in Montreal, Canada, between February 24th and 26th, 2016. It is an exciting conference for web developers with speakers from all over the world. It unites many web programming languages under one roof, as well as other topics related to web development. The call for papers closes on September 20th.

ConFoo renews 50% of its speakers each year. If you’re new to this conference, you should definitely submit.

If you would just like to attend, there is a discount until October 13th.

Posted by Terry Erisman in CitusData on 2015-08-25 at 17:13:49

We are pleased to announce that the PGConf Silicon Valley breakout sessions and tutorials are now posted on the conference website. Our outstanding Conference Committee has worked through the large pool of submissions and selected the sessions which will be presented November 17-18, 2015 at the South San Francisco Conference Center. As the first dedicated PostgreSQL conference in the Bay Area, the conference is a great opportunity for PostgreSQL community members to exchange technical knowledge and make new connections. Early Bird registration prices are available through October 4, 2015.

Breakout Sessions

The breakout sessions are organized into five tracks:

  • DevOps
  • In the Trenches
  • Features
  • Hackery
  • PostgreSQL at Scale

The outstanding list of speakers includes representatives from companies such as TripAdvisor, Urban Airship, Square, Heap, Chartio, AWS, Rackspace, Joyent, and Heroku. The breakout sessions, speakers, and talk descriptions are available on the conference website.

Tutorials

The tutorials will be presented by leading PostgreSQL practitioners and range from beginner to advanced level. Presenters include representatives from 2ndQuadrant, EnterpriseDB, Citus Data, OmniTI, and PostgreSQL Experts. Check the conference website for talk descriptions and speaker information.

More Information Coming Soon

The schedule for the tutorials and breakout sessions will be posted to the PGConf Silicon Valley conference website in the near future. We will also post a series of interviews with conference speakers who will provide insights into their sessions and what they are looking forward to at the conference. 

Take advantage of Early Bird pricing before it ends on October 4th. I hope to see you at PGConf Silicon Valley!

 


ALTER TABLE has been known for many years in the Postgres ecosystem as a command that systematically takes an ACCESS EXCLUSIVE lock on the relation being modified, preventing all parallel operations on the relation. Those locks are being relaxed more and more; in Postgres 9.4, for example, the following commands got improvements:

  • VALIDATE CONSTRAINT
  • CLUSTER ON
  • SET WITHOUT CLUSTER
  • ALTER COLUMN SET STATISTICS
  • ALTER COLUMN SET and ALTER COLUMN RESET for attribute options

In 9.5, the following commands have been improved as well:

  • ENABLE TRIGGER and DISABLE TRIGGER
  • ADD CONSTRAINT FOREIGN KEY

Now, Postgres 9.6, which is currently in development, brings in more lock reduction, with the following commit:

commit: 47167b7907a802ed39b179c8780b76359468f076
author: Simon Riggs <simon@2ndQuadrant.com>
date: Fri, 14 Aug 2015 14:19:28 +0100
Reduce lock levels for ALTER TABLE SET autovacuum storage options

Reduce lock levels down to ShareUpdateExclusiveLock for all
autovacuum-related relation options when setting them using ALTER TABLE.

Add infrastructure to allow varying lock levels for relation options in
later patches. Setting multiple options together uses the highest lock
level required for any option. Works for both main and toast tables.

Fabrízio Mello, reviewed by Michael Paquier, mild edit and additional
regression tests from myself

Code-wise, ALTER TABLE SET has been improved to be able to define different lock levels depending on the parameter touched, and with this commit all the parameters tuning autovacuum and auto-analyze at relation level have been updated to use SHARE UPDATE EXCLUSIVE LOCK. In short, this allows read as well as write operations to occur in parallel with the ALTER TABLE, something that will definitely help when adjusting autovacuum settings on relations facing heavy activity and bloat.
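
As a minimal sketch (the table name is hypothetical), with this commit the following only takes SHARE UPDATE EXCLUSIVE, so concurrent reads and writes on the table keep running while it executes, and the lock held can be observed from another session:

ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0.05,
                          autovacuum_analyze_scale_factor = 0.02);

-- from a second session, with the first transaction still open
SELECT mode, granted FROM pg_locks WHERE relation = 'my_table'::regclass;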

Note as well that when multiple subcommands are used, the strongest lock of the whole set is taken for the duration of the ALTER TABLE command. So, for example, should an update of the parameter fillfactor be mixed with a

[continue reading]

Posted by Gulcin Yildirim in 2ndQuadrant on 2015-08-23 at 16:01:35

PgBouncer is a lightweight connection pooler for PostgreSQL.

The latest release of PgBouncer was announced on the 1st of August 2015. In this blog post we’ll talk about its major new improvements.

Main new features of PgBouncer

Load user password hash from postgres database

With this hot feature, PgBouncer can load a user’s password from the database, using two config parameters: auth_user and auth_query.

Note: These config parameters are defined in the configuration file pgbouncer.ini.

  • auth_user
    If auth_user is set, any user not specified in auth_file will be queried from pg_shadow in the database using auth_user. Auth_user’s password will be taken from auth_file. This parameter can be set per-database too.

    Note: auth_file is the name of the file to load user names and passwords from.

  • auth_query
    This parameter allows us to write a SQL query to load a user’s password from the database. It runs as auth_user. See the default query below:
    SELECT usename, passwd FROM pg_shadow WHERE usename=$1
    

Pooling mode can be configured both per-database and per-user

With this feature, independent of the main pooling mode, clients can now connect to different databases with one of the 3 pooling modes described below. This applies to users as well. For example, if the main pooling mode is session pooling, a specific user can be configured to use transaction pooling. This gives us database-level and user-level flexibility to apply more appropriate pooling options.
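
As a hedged sketch of how auth_user/auth_query and a per-database pool_mode could be combined in pgbouncer.ini (the database name, paths, and the choice of per-database settings are made up for illustration):

[databases]
; passwords fetched via auth_query (run as the pgbouncer role); transaction pooling for this database only
appdb = host=127.0.0.1 dbname=appdb auth_user=pgbouncer pool_mode=transaction

[pgbouncer]
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
auth_query = SELECT usename, passwd FROM pg_shadow WHERE usename=$1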

PgBouncer provides 3 connection pooling modes:

  • Session pooling
    During the lifetime of a client connection, an existing server connection is assigned to the client and after the client disconnection, the assigned server connection is put back to the connection pool.
  • Transaction pooling
    In this mode, a server connection is not assigned to a connected client immediately: it is assigned only for the duration of a transaction. As soon as the transaction is over, the connection is put back into the pool.
  • Statement pooling
    This is similar to the transa

[continue reading]

Posted by Shaun M. Thomas on 2015-08-21 at 16:32:15

PG Phriday: Basic Partitioning

Most PGDB (PostgreSQL) users who are familiar with partitioning use the method described in the partitioning documentation. This architecture comes in a fairly standard stack:

  • One empty base table for structure.
  • At least one child table that inherits the base design.
  • A trigger to redirect inserts based on the partitioning scheme.
  • A constraint on each child table to enforce the partition scheme, and help the planner exclude child partitions from inapplicable queries.

It looks simple, but there’s a lot of SQL involved. Let’s take the sensor_log table we introduced a couple of weeks ago and turn it into a partitioned structure.

TRUNCATE TABLE sensor_log;
 
CREATE TABLE sensor_log_part_2012 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2012-01-01' AND
           reading_date < '2013-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2013 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2013-01-01' AND
           reading_date < '2014-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2014 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2014-01-01' AND
           reading_date < '2015-01-01')    
) INHERITS (sensor_log);
 
CREATE TABLE sensor_log_part_2015 (
    LIKE sensor_log INCLUDING INDEXES,
    CHECK (reading_date >= '2015-01-01' AND
           reading_date < '2016-01-01')    
) INHERITS (sensor_log);
 
CREATE OR REPLACE FUNCTION sensor_partition()
RETURNS TRIGGER AS $$
BEGIN
  CASE EXTRACT(YEAR FROM NEW.reading_date)
    WHEN 2015 THEN
      INSERT INTO sensor_log_part_2015 VALUES (NEW.*);
    WHEN 2014 THEN
      INSERT INTO sensor_log_part_2014 VALUES (NEW.*);
    WHEN 2013 THEN
      INSERT INTO sensor_log_part_2013 VALUES (NEW.*);
    WHEN 2012 THEN
      INSERT INTO sensor_log_part_2012 VALUES (NEW.*);
  END CASE;
 
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;
 
CREATE TRIGGER t_correct_partition
BEFORE INSERT ON sensor_log
   FOR EACH ROW EXECUTE PROCEDURE sensor_partition();

[continue reading]

Posted by Andreas Scherbaum on 2015-08-20 at 20:17:49
Author
Andreas 'ads' Scherbaum

What happens if you ask the Chinese PostgreSQL community for a Meetup-like event with one or two speakers? You end up getting a full-day conference ;-)

 

On Saturday, September 12, a full-day PostgreSQL conference will take place at the Pivotal office in Beijing. If you want to attend, please sign up here.

The full address is: 17/F, South Block, Tower C, Raycom Info Tech Park, NO.2, Kexueyuan South Road, Beijing 100190, China.

 


Continue reading "PostgreSQL Day in Beijing"

At this year's PGCon, we gave a talk on pg_shard that included a futuristic pg_shard demo: a distributed table with JSONB fields, backed by a dynamically changing row and columnar store. The demo is based on the Github archive data, which comprises a log of all events across all public Github repositories with detailed metadata in JSON format.

The pg_shard extension can store very large event logs in a PostgreSQL table by transparently sharding the table across many servers. We envision a use-case where new events are added to the table in real-time and frequent look-ups occur on data less than 7 days old. Data older than 7 days is compressed inside the database using cstore_fdw. A video of the pg_shard talk and demo is available on our website and at the end this post.

In our demo, the events table was range-partitioned by time, such that each shard contains events for a specific hour of the day. One of the advantages of range-partitioning a table by time is that older shards are read-only, which makes it easier to perform certain operations on the shards, such as replacing the shard with an equivalent cstore_fdw table. We consider this demo to be futuristic since pg_shard does not yet have functions to create a range-partitioned table. It can only be set up by manually changing the pg_shard metadata tables, for which we created several PL/pgSQL functions.

The architecture used in the demo resembles a shard assembly line as shown below. In the first stage of the assembly line the shard is filled with data, after that the shard is frequently read by SELECT queries, and after 7 days the shard is compressed for archival. When pg_shard receives an INSERT, it compares the timestamp against the time ranges of the shards in the metadata tables to find the shard for the current hour. Assuming INSERTs only happen for the current time, there is only ever one shard to which new data is being written. At the end of the hour, INSERTs will start going into a new shard. The shard for the next hour needs to be set up in ad

[continue reading]

On 30th of July, Andrew Dunstan committed patch: Add IF NOT EXISTS processing to ALTER TABLE ADD COLUMN   Fabrízio de Royes Mello, reviewed by Payal Singh, Alvaro Herrera and Michael Paquier. Nice. I always like it when there is a new “IF EXISTS” or “IF NOT EXISTS”, because it makes my life as a DBA simpler. This […]
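
A minimal sketch, on a hypothetical table, of what the new clause buys you: the second statement is now a harmless no-op (it just emits a NOTICE and skips the column) instead of an error.

ALTER TABLE my_table ADD COLUMN IF NOT EXISTS note text;
ALTER TABLE my_table ADD COLUMN IF NOT EXISTS note text;
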
Posted by Josh Berkus on 2015-08-18 at 23:58:57

For September, two of the founders of PipelineDB are going to be introducing this streaming PostgreSQL fork. Come learn what PipelineDB is about and how you'd use it alongside PostgreSQL. This meetup will be at a new venue, If(we), a social networking company which uses PostgreSQL together with internal databases. RSVP on meetup.

P.S. we are looking for a lightning talk for this meetup! Contact josh-at-postgresql.org.

Posted by Glyn Astill on 2015-08-18 at 18:52:44
Prior to pg 9.1, you could connect to any database with psql regardless of encoding and you’d get the server encoding as your client encoding unless you set it: That wasn’t quite right; the client_encoding is a lie. On a modern psql version that’s quite rightly prevented: This is not an issue if you want […]
Posted by Julien Rouhaud in Dalibo on 2015-08-18 at 14:03:34

work_mem?

The work memory, or work_mem, is one of the hardest things to configure. It can be used for various purposes: it’s mainly used when sorting data or creating hash tables, but it can also be used by set-returning functions that use a tuplestore, like the generate_series() function. Moreover, each node of a query can use this amount of memory. Set this parameter too low and a lot of temporary files will be used; set it too high and you may encounter errors, or even an Out Of Memory (OOM) condition, depending on your OS configuration.

I’ll focus here on the amount of memory needed when sorting data, to help you understand how much memory is required when PostgreSQL runs a sort operation.
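
As a quick illustration (the reported sizes and the exact cutover point will vary), you can watch the sort method change in EXPLAIN ANALYZE as work_mem changes:

SET work_mem = '4MB';
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM generate_series(1, 1000000) i ORDER BY i;
-- Sort Method: external merge  Disk: ...

SET work_mem = '128MB';
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM generate_series(1, 1000000) i ORDER BY i;
-- Sort Method: quicksort  Memory: ...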

Truth is out

I often hear people say there is a correlation between the size of the temporary files generated and the amount of memory that would have been needed. That’s wrong: you can’t make any assumption about the value of work_mem based on the size of a sort’s temporary file.

That’s because when the data to be sorted doesn’t fit in the allowed memory, PostgreSQL switches to an external sort. In addition to the memory currently in use, a temporary file is reused multiple times to avoid wasting disk space. If you want more details on this, the relevant source code is in tuplesort.c and logtapes.c. As a brief introduction, the header of tuplesort.c says:

[…] This module handles sorting of heap tuples, index tuples, or single Datums (and could easily support other kinds of sortable objects, if necessary). It works efficiently for both small and large amounts of data. Small amounts are sorted in-memory using qsort(). Large amounts are sorted using temporary files and a standard external sort algorithm.

See Knuth, volume 3, for more than you want to know about the external sorting algorithm. We divide the input into sorted runs using replacement selection, in the form of a priority tree implemented as a heap (essentially his Algorithm 5.2.3H), then merge the runs using polyphase merge, Knuth’s Algorithm 5.4.2D. The logical “tapes” used by Algorithm D are imp

[continue reading]

Q: So, what is TABLESAMPLE actually good for?

A: To speed things up that can be answered from a small sample instead of the whole population!

Like so:

CREATE TABLE public.ts_test
(
  id serial NOT NULL,
  val double precision,
  CONSTRAINT ts_test_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

Insert 100 million random values in the range 0.0 <= x < 1.0:

INSERT INTO public.ts_test (val)
SELECT
    random()
FROM
    generate_series(1, 100000000);

And now, some queries to get the standard deviation.

For the whole population:

SELECT round(stddev_pop(val)::numeric,5) FROM ts_test;

0.28869 in 15005 ms
Error: 0%
Speedup: 1x

For a sample of size 20000 (good for 1% error and 99% confidence):

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE BERNOULLI(0.02);

0.28867 in 2387 ms
Error: < 1%
Speedup: 6.3x

SELECT round(stddev_samp(val)::numeric,5) FROM ts_test TABLESAMPLE SYSTEM(0.02);

0.28791 in 25 ms
Error: < 1%
Speedup: 600x

With a table that large, both methods BERNOULLI and SYSTEM are accurate enough to give results well below the 1% error target.

(For potential problems with SYSTEM, see this post)

SYSTEM aces BERNOULLI with a speedup of 600x vs. 6.3x relative to a full population query.

And this is hard to beat...
On 17th of July, Robert Haas committed patch: Add new function pg_notification_queue_usage.   This tells you what fraction of NOTIFY's queue is currently filled.   Brendan Jurd, reviewed by Merlin Moncure and Gurjeet Singh. A few further tweaks by me. LISTEN / NOTIFY are very cool features that don't get enough recognition. For completeness' sake […]
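
A minimal sketch of the new function in action: it returns the fraction of the queue in use as a double between 0 and 1, which is simply 0 when nothing is pending.

SELECT pg_notification_queue_usage();
 pg_notification_queue_usage
------------------------------
                            0
(1 row)
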
Posted by Federico Campoli on 2015-08-17 at 21:19:00
On Friday 14th August we kicked off the Brighton PostgreSQL Meetup.
We had a nice evening with cool people, all together discussing PostgreSQL and how we can run the meetup effectively.

We decided to have a regular monthly meetup hosted around Brighton, possibly by companies or at any other suitable venue.

The next meetup will be on the 25th of September and this time there will be some PostgreSQL talks. General interest favours standby servers and streaming replication. Another topic could be query planning and execution.
I can cover both topics. We just need to decide how many talks will fit in a one-hour session.

After the meetup we’ll head to the nearest pub for a nice beer.

I can only say a big thank you, to all the meetup members.

That’s all for now.  
On 7th of July, Andres Freund committed patch: Add psql PROMPT variable showing the pid of the connected to backend.   The substitution for the pid is %p.   Author: Julien Rouhaud Discussion: 116262CF971C844FB6E793F8809B51C6E99D48@BPXM02GP.gisp.nec.co.jp It was over a month ago, but in the meantime I was on vacation, and I wanted to test […]
Posted by Andrew Dunstan in pgExperts on 2015-08-17 at 18:19:00
Found in the last couple of days.

  1. If you have more than one active database, and you're using text format logs, make sure you have %d (the database name) in your log line prefix; see the sketch after this list. Not knowing which database the log line comes from can be confusing and makes life more difficult.

2. If you report on a possible database error, report from the Postgres logs, not from your client's logs. We need to see things from the database's point of view. In the case in point, the Postgres logs contained no relevant errors at all, i.e. this wasn't a Postgres error.

3. Found when trying to resize a data volume on Amazon: certain Linux disk partitioning utilities don't optimally align partitions by default, notably sfdisk and cfdisk. Stay away from those, or make sure you explicitly choose proper alignment for partitions. Programs that do the right thing include fdisk and parted. In general you are probably best off sticking to parted.
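
For the first point, a hedged postgresql.conf sketch (the surrounding escapes are a common but arbitrary choice; %d is the part that matters here):

log_line_prefix = '%m [%p] %u@%d '
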
Posted by David Fetter in Farmers Business Network on 2015-08-17 at 16:00:49
PostgreSQL has MVCC, which is awesome until you want an approximate count on what could be a long view...until now. Here's how. Continue reading "Approximate Counts"
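
As context, one commonly used approximation (not necessarily the technique in the linked article) is to read the planner's row estimate for a hypothetical table instead of running an exact COUNT(*):

SELECT reltuples::bigint AS approximate_row_count
  FROM pg_class
 WHERE oid = 'my_table'::regclass;
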
Posted by Hans-Juergen Schoenig in Cybertec on 2015-08-17 at 07:44:44
There are still many people out there, who cannot spell the name of their favorite database. “PostgreSQL”, “PostgresSQL”, “Postgre SQL” … the list goes on and on. Can we blame those people? Actually no. Some words are simply pretty tricky. Each of us has failed once in a while. After all database work is not […]

So I had this PostgreSQL database that was getting a bit too big, and since it was really only for analytics, I figured it would be a good fit for putting in Hadoop+Hive instead.

(For those not completely familiar with this: Hadoop is sort of a job tracker and distributed file system. Hive is an SQL-like layer on top of that. I know the cool kids are now using Spark. Maybe for another day.)

The first thing you need to learn about the Hadoop ecosystem is its idiosyncratically fragmented structure. With PostgreSQL, you basically have the community website, the community mailing lists, the community source code distribution, the community binaries, and a handful of binaries made by Linux distributions. If you search the web for a problem with PostgreSQL, you will normally get hits on one or more of: the documentation, the mailing lists, third-party mirrors of the mailing lists, or Stack Overflow. With Hadoop, you have the resources provided by the Apache Software Foundation, including the source distribution, bug tracker, and documentation, and then a bunch of commercial vendors with their parallel universes, including their own mutually incompatible binary distributions, their own copies of the documentation, their own mailing lists, their own bug trackers, etc. When you search for a problem with Hadoop, you will typically get hits from three separate copies of the documentation, about eight mailing lists, fifteen tutorials, and one thousand blog posts. And about 20 unanswered posts on Stack Overflow. Different vendors also favor different technology extensions. So if, say, you read that you should use some storage method, chances are it’s not even supported in a given distribution.

The next thing to know is that any information about Hadoop that is older than about two years is obsolete. Because they keep changing everything from command names to basic architecture. Don’t even bother reading old stuff. Don’t even bother reading anything.

So Hive. The basic setup is actually fairly well documented. You set up a Hadoop clu

[continue reading]

Posted by Josh Berkus in pgExperts on 2015-08-14 at 23:14:00
Since we're still working on the schedule, I wanted to give folks a preview of the talks we've picked for pgConfSV.  We still have a few more talks to pick and the tutorials to finalize.  Regardless, it's looking like a really exciting lineup! 
We'll have folks from Heap, Wanelo, TreasureData, Rackspace, Turnitin.com, EMC, Joyent, Square, and more.   The sad part will be missing three talks in each timeslot.

Note that this set of talks is not final; we're still swapping a couple of things, and some speakers have not confirmed.

pgConfSV is November 17th and 18th in South San Francisco.  If you're going to AnsibleCon, that's the day after, so you can do both!
Posted by Shaun M. Thomas on 2015-08-14 at 16:36:39

What’s a good table to partition? It’s not always a question with an obvious answer. Most often, size and volume determine whether or not a table should be broken into several chunks. However, there are also cases where business or architecture considerations might use partitioning to preserve a shared table structure, or drive aggregate analysis over a common core. In PGDB (PostgreSQL), this is even more relevant because of how partitioning is handled through inheritance.

For now, let’s focus on elements that can be directly quantified, such as size and row count. Luckily, the PGDB system catalog contains this information, so let’s give it a look. Let’s see what the catalog has to say about the sensor_log table introduced in last week’s article on The Case for Partitioning.

SELECT oid::regclass::text AS TABLE_NAME,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       CEIL(reltuples / 1000000) AS million_rows
  FROM pg_class
 WHERE relkind = 'r'
   AND reltuples > 10000000
 ORDER BY relpages DESC;
 
 TABLE_NAME | total_size | million_rows 
------------+------------+--------------
 sensor_log | 12 GB      |          100

With the help of a couple of administration functions, we can focus on any tables that are immediate outliers. In this case, the only criterion that really matters is that we want to see any tables with more than 10 million rows. Why that particular number? Tables that cross this threshold often continue to grow relatively quickly due to a consistent data stream. This volume tends to increase over time, and large tables face a number of problems.

Consider the sensor_log table as a basis for analysis. At 12GB, it probably fits in memory fairly easily. In addition, it’s extremely narrow, having only four columns. Yet even with this going in our favor, anything that needs to read the entire table will encounter occasional issues. Specifically:

  • Rebuilding a single index takes 2-5 minutes on a completely idle server with no memory pressure. Change any of these variables and the

[continue reading]

Today, here are highlights of new features regarding replication slots that have been committed lately and will be present as part of PostgreSQL 9.5:

commit: 6fcd88511f8e69e38defb1272e0042ef4bab2feb
author: Andres Freund <andres@anarazel.de>
date: Tue, 11 Aug 2015 12:34:31 +0200
Allow pg_create_physical_replication_slot() to reserve WAL.

When creating a physical slot it's often useful to immediately reserve
the current WAL position instead of only doing after the first feedback
message arrives. That e.g. allows slots to guarantee that all the WAL
for a base backup will be available afterwards.

Logical slots already have to reserve WAL during creation, so generalize
that logic into being usable for both physical and logical slots.

When 9.4 introduced replication slots, both physical slots (for replication) and logical slots (for logical decoding), one difference between the two slot types was that a logical slot retains WAL from the time of its creation, while a physical slot did not. The commit above reduces that gap by making it possible to retain WAL when creating a physical slot as well, with the addition of a boolean switch in pg_create_physical_replication_slot which is false by default, meaning that no WAL is kept until the slot has been used at least once. This is particularly useful for base backups, which were extended a couple of days before this commit with an additional --slot option to ensure that WAL remains present on the source while a backup is taken. Here is how this feature behaves:

=# SELECT * FROM pg_create_physical_replication_slot('default_slot', false);
  slot_name   | xlog_position
--------------+---------------
 default_slot | null
(1 row)
=# SELECT * FROM pg_create_physical_replication_slot('reserve_slot', true);
  slot_name   | xlog_position
--------------+---------------
 reserve_slot | 0/1738850
=# SELECT slot_name, restart_lsn from pg_replication_slots;
  slot_name   | restart_lsn
--------------+-------------
 default_slot | null
 reserve_slot | 0/173

[continue reading]

Posted by Federico Campoli on 2015-08-12 at 22:12:00
There is just one day left before we start the Brighton PostgreSQL Meetup. I've invested some resources in this project and I truly believe it can be a success.
I still can't believe that in just one month 25 people have already shown interest in being part of the Brighton PostgreSQL Group. And today, another nice surprise: I received the new shiny mascot for our group.

He's Marvin, the sea elephant. The author of this fantastic drawing is Elena, an incredibly talented artist from Italy.

I've also opened a twitter account for our social media presence.

So we are ready to start. I'm very excited and I'll write a detailed report of our first meetup in a few days.


cstore_fdw, the column-store extension for PostgreSQL by CitusData, is a really good way to add compressed storage for archival data, and analytic data intended to be aggregated, to your application.  Because it's a column store, though, cstore wants new data added in batches, the bigger the better.  This means that you need to find some way to batch-up incoming data, preferably one able to accept a lot of new data rapidly. 

This is where PipelineDB comes in.  Since PipelineDB is open source now, and based on 9.4.4, I can add extensions to it, including cstore_fdw.  I've done so with the PipelineDB 0.7.7a Docker container, so if you use Docker it's simply available.

As a demonstration of this, I'll set up some fake clickstream data, archived to a cstore table hourly.  First, I wrote a quick python script to generate it continuously and push it to a Pipeline stream.

Then I created the stream and continuous view in PipelineDB:

    CREATE STREAM seenstream ( user_id int, page_id int, ts timestamptz );

    CREATE CONTINUOUS VIEW seenfeed as select user_id, page_id, ts
    FROM seenstream
    WHERE arrival_timestamp > ( clock_timestamp() - interval '90 minutes' );


Next, I created the cstore table:

    CREATE EXTENSION cstore_fdw;

    CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

    CREATE FOREIGN TABLE seen (
        user_id int,
        page_id int,
        ts timestamptz
    )
    SERVER cstore_server
    OPTIONS (compression 'pglz');


Finally, I added a simple script which ran the following query once per hour:

    INSERT INTO seen
    SELECT user_id, page_id, ts
    FROM seenfeed
    WHERE ts >= ( now() - interval '1 hour' )
    ORDER BY user_id, page_id, ts;


... and then I started everything in motion.

Now, for cstore the ORDER BY is vitally important; it determines how the blocks you create for the column store are organized.  In this particular case, I knew that I would be doing more analysis by user.  But most users would do ORDER BY ts instead.

After a few hours, I checked back,

[continue reading]

While Bucardo is known for doing "multi-master" Postgres replication, it can do a lot more than simple "master to master" replication (better known as "source to source" replication). As people have been asking for simple Bucardo 5 recipes based on pgbench, I decided to present a few here. Since Bucardo allows any number of sources and targets, I will demonstrate a source-source-source-target replication. Targets do not have to be Postgres, so let's also show that we can do source - MariaDB - SQLite replication. Because my own boxes are so customized, I find it easier and more honest when writing demos to start with a fresh system, which also allows you to follow along at home. For this example, I decided to fire up Amazon Web Services (AWS) again.

After logging in at https://aws.amazon.com, I visited the AWS Management Console, selected "EC2", clicked on "Launch Instance", and picked the Amazon Linux AMI (in this case, "Amazon Linux AMI 2015.03 (HVM), SSD Volume Type - ami-1ecae776"). Demos like this require very little resources, so choosing the smallest AMI (t2.micro) is more than sufficient. After waiting a couple of minutes for it to start up, I was able to SSH in and begin. The first order of business is always updating the box and installing some standard tools. After that I make sure we can install the most recent version of Postgres. I'll skip the initial steps and jump to the Major Problem I encountered:

$ sudo yum install postgresql94-plperl
Error: Package: postgresql94-plperl-9.4.4-1PGDG.rhel6.x86_64 (pgdg94)
           Requires: perl(:MODULE_COMPAT_5.10.1)
 You could try using --skip-broken to work around the problem
 You could try running: rpm -Va --nofiles --nodigest

Well, that's not good (and the "You could try" suggestions are useless in this case). Although all the other Postgres packages installed without a problem (postgresql94, postgresql94-server, and postgresql94-libs), there is a major incompatibility preventing Pl/Perl from working. Basically, the

[continue reading]

Posted by Josh Berkus in pgExperts on 2015-08-11 at 23:09:00
Since Oracle was so nice as to remind everyone what software security is really like with closed-source software, I wanted to remind people how finding and reporting security issues works in PostgreSQL:
  1. Feel free to "reverse engineer" the code.  In fact, here it is on github if you want to scrutinize it.
  2. We generally credit security researchers who find real security holes (with limitations for duplicates, etc.).  Over the last few years, some of the most critical bugs in PostgreSQL were found by professional security researchers doing things like fuzz testing.
  3. If you think you've found a security issue, please report it to security@postgresql.org. If it turns out to be a non-issue, we'll tell you, and you can report it as a regular bug.
  4. Be prepared to answer questions about your report.  We showed you our code, you can show us yours.
Our open approach to security is the reason why PostgreSQL was rated by the Database Hacker's Handbook as "the most secure by default": more secure than Oracle.  And why for five Defcons in a row, security hackers have been unable to crack PostgreSQL in the annual Schemaverse competition.

And please ... update your servers! We send out those incremental updates for a reason, and as often as not, there are security patches.   More information is on our Security Page.

Note: the above is my personal opinion and is not the opinion of The PostgreSQL Project or any other organization.