
One of the problems with Postgres-XL 9.2 is that it assigns a global transaction identifier (GXID) for every transaction that is started in the cluster. Since it's hard for the system to know whether a transaction is read-only or not (remember, even SELECT queries can perform write activity), Postgres-XL would always assign a GXID and send that to the datanodes. This is quite bad for system performance because every read-only transaction must now go to the GTM, start a new transaction, grab a GXID, and also finish the transaction on the GTM when it's committed or aborted. For short transactions, like a read-only pgbench workload, this adds a lot of overhead and latency. The overhead is even more noticeable when the actual work done by the transaction is very fast, say because the data is fully cached in shared buffers.

This is now set to change in the upcoming Postgres-XL 9.5 release. We have implemented a lazy GXID assignment technique similar to PostgreSQL's. This required us to enhance the concept of a global session. A transaction which involves more than one node runs in a global session and gets a unique global session identifier. If a node, while executing the query, decides to assign a GXID to the transaction, it sends the global session identifier to the GTM. If the global transaction, identified by the global session, has already been assigned a GXID, the GTM sends back the same GXID to the node; otherwise a new GXID is assigned to the transaction. Further, the node sends the assigned GXID back to the coordinator so that it can decide whether to run a single-phase or two-phase commit for the transaction.

While the lazy GXID assignment resulted in a significant reduction in the GXID consumption rate, it's still not enough because of yet another problem that we must solve. As you're aware, Postgres-XL uses Multi-Version Concurrency Control, or MVCC, to provide various transaction guarantees. But to reclaim dead row versions, every node must check row versions against something called RecentGlobalXmin which, in simple terms,

[...]
Posted by Christophe Pettus in pgExperts on 2016-02-12 at 03:53

I’ve noticed an increasing tendency among PostgreSQL users to over-index tables, often constructing very complex partial indexes to try to speed up very particular queries.

Be careful about doing this. Not only do additional indexes increase the plan time, they greatly increase insert time.

By way of example, I created a table with a single bigint column, and populated it with:

time psql -c "insert into t select generate_series(1, 100000000)"

The run without any indexes took 1 minute, 55 seconds; the run with eight indexes on the same table took 26 minutes, 39 seconds, or almost 14 times slower.

Regularly consult pg_stat_user_indexes and drop indexes that aren’t being used. Your disk space and insert times will thank you.
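A quick way to spot candidates is to look for indexes that have never been scanned. A minimal sketch (the view and columns are standard, but treat a zero idx_scan as a prompt to investigate rather than an automatic drop, since unique and freshly created indexes will also show up here):

SELECT schemaname, relname, indexrelname, idx_scan
  FROM pg_stat_user_indexes
 WHERE idx_scan = 0
 ORDER BY schemaname, relname;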

Today, the PostgreSQL Global Development Group announced new minor releases for all supported versions: 9.5.1, 9.4.6, 9.3.11, 9.2.15 and 9.1.20. As usual, RPMs are out, too: http://yum.PostgreSQL.org

Since the last set of updates, I have been working hard with other members of the community on reorganizing the repo a bit, with the aim of better and easier maintenance.

Below is the list of packages in today's update set. They all include the latest version of each piece of software as of today:

postgresql-tcl
postgresql-jdbc
pgpool
pgpoolAdmin
pg_bulkload
pgmemcache
pg_comparator
check_postgres
multicorn
pgactivity
ora2pg
pgbouncer
pg_jobmon
pam-pgsql
pgsi
pguri
plpgsql_check
psycopg2
ogr_fdw
pg_fkpart
pgcenter
pgbadger
pg_track_settings
tds_fdw
repmgr
powa

Please let us know if you encounter any issues with the packaging.

Thanks!
Posted by Federico Campoli on 2016-02-11 at 12:06
I've been busy recently and failed to post an update after the last meetup.
I apologise for that.

We had a very interesting meetup in January.

Alexey Bashtanov explained how grouping works in Postgres and how to improve, or even re-implement in C, the grouping functions.
The pictures from the meeting are on the meetup page.
The recording of the presentation is available there, and the slides are free to download on Slideshare.

We are already preparing the next meetup, announced for the 18th of March.
Here's the abstract.

The next PostgreSQL meetup will be a technical talk on the ways in which PostgreSQL can communicate with the outside world.

This talk will cover PostgreSQL Foreign Data Wrappers, which let you create tables that
can read from and write to external services.

Efficient ways to run queries from applications will also be covered.

About the speaker:
Matthew Franglen works as a team lead at Brandwatch. He has a long history
of development, with over 10 years of experience.

As usual, the talk will be preceded by pizza and a nice chat. We'll also do a recording.

If you are in Brighton, please join us here:
http://www.meetup.com/Brighton-PostgreSQL-Meetup/events/228747329/


This will likely be my last post on this topic. I would like to revive this blog on technical rather than ideological issues, but there seems to be a real effort to force ideology in some cases. I don't address this in terms of specific rights, but in terms of community function, and I have a few more things to say on this topic before I return to purely technical questions.

I am also going to say at the outset that LedgerSMB adopted the Ubuntu Code of Conduct very early (thanks to the suggestion of Joshua Drake) and this was a very good choice for our community. The code of conduct provides a reminder for contributors, users, participants, and leadership alike to be civil and responsible in our dealings around the commons we create. Our experience is that we have had a very good and civil community with contributions from every walk of life and a wide range of political and cultural viewpoints. I see this as an unqualified success.

Lately I have seen an increasing effort to codify a sort of political orthodoxy around open source participation.  The rationale is usually about trying to make people feel safe in a community, but these are usually culture war issues so invariably the goal is to exclude those with specific political viewpoints (most of the world) from full participation, or at least silence them in public life.  I see this as extremely dangerous.

On the Economic Nature of Open Source


Open source software is economically very different from the sorts of software developed by large software houses.  The dynamics are different in terms of the sort of investment taken on, and the returns are different.  This is particularly true for community projects like PostgreSQL and LedgerSMB, but it is true to a lesser extent even for corporate projects like MySQL.  The economic implications thus are very different.

With proprietary software, the software houses build the software and absorb the costs for doing so, and then later find ways to monetize that effort.  In open source, that is one strategy among many[...]

I have just pushed code for a new version of the codebase for Planet PostgreSQL.

For those of you who just read Planet, hopefully nothing at all should change. There will probably be some bugs early on, but there are no general changes in functionality. If you notice something that is wrong (give it at least a couple of hours from this post), please send an email to planet(at)postgresql.org and we'll look into it!

For those who have your blogs aggregated at Planet PostgreSQL, there are some larger changes. In particular, you will notice the whole registration interface has been re-written. Hopefully that will make it easier to register blogs, and also to manage the ones you have (such as removing a post that needs to be hidden). The other major change is that Planet PostgreSQL will now email you whenever something has been fetched from your blog, to help you catch configuration mistakes more quickly.

By far the largest changes are in the moderation and administration backend. This will hopefully lead to faster processing of blog submissions, and less work for the moderators.

Posted by Richard Yen in EnterpriseDB on 2016-02-09 at 18:25

Postgres supports a variety of data types that allow data architects to store their data consistently, enforce constraints through validation, maximize performance, and make efficient use of space. Recently, I was asked to show someone how to store a UUID (Universally Unique Identifier) in Postgres by way of JPA (Java Persistence API).
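The database side of that is simple enough. A minimal sketch (the table is made up, and uuid-ossp is just one of several ways to generate UUIDs server-side):

CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE account (
  id   uuid PRIMARY KEY DEFAULT uuid_generate_v4(),
  name text NOT NULL
);

Mapping that uuid column to a java.util.UUID field on the JPA side is what the post goes on to cover.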


Posted by Umair Shahid in 2ndQuadrant on 2016-02-09 at 14:35

It is always exciting to meet community members and exchange ideas about PostgreSQL and the ecosystem around the database. I was lucky enough to be a part of FOSDEM PGDay in Brussels on 29th January this year. Three of 2ndQuadrant’s very best spoke at the event. If you missed the talks, you can take a look at their slide decks shared below.

I will be sure to direct any queries you might have to the experts themselves! :-)

The slides from my talk at PG Day at FOSDEM 2016 are now available.

Quite a long time ago (in October), Oskar Liljeblad reported a bug in anonymization. Namely – group keys were not anonymized. You can see an example of such a plan here. I finally got to it, fixed the bug, pushed the new version to the live site, and now such plans will be correctly anonymized. Thanks Oskar, and sorry […]
On 5th of February, Tom Lane committed patch: Add num_nulls() and num_nonnulls() to count NULL arguments.   An example use-case is "CHECK(num_nonnulls(a,b,c) = 1)" to assert that exactly one of a,b,c isn't NULL. The functions are variadic, so they can also be pressed into service to count the number of null or nonnull elements in […]
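A quick illustration of the behaviour described in the commit message (the table in the CHECK example is made up):

SELECT num_nulls(1, NULL, 'a', NULL);     -- returns 2
SELECT num_nonnulls(1, NULL, 'a', NULL);  -- returns 2

-- Enforce "exactly one of a, b, c is set":
CREATE TABLE contact_method (
  a text,
  b text,
  c text,
  CHECK (num_nonnulls(a, b, c) = 1)
);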

The 9th annual “Prague PostgreSQL Developer Day” conference will be held on 17th and 18th February in Prague.


Me with PostgreSQL elephants at Prague Zoo :)

Welcome to Prague PostgreSQL Developers Day

Prague PostgreSQL Developer Day is a two-day conference, and the talks & trainings are mostly in the Czech language. The conference program is also ready; you can check the full schedule.

1st Day: Trainings – 17th of February

On the first day of the conference, there will be trainings and practical lectures for a smaller number of attendees. You can find the detailed trainings program online. All trainings will be in the Czech language.

One of the most interesting ones is the “Reading query execution plans” training. With this training, Tomas Vondra from 2ndQuadrant aims to help trainees understand the basic stages of processing a SQL query (parsing, planning and execution) and how the database tries to find the best plan.
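If you have never looked at a plan before, the starting point is simply EXPLAIN. A minimal sketch (the table and filter are made up):

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- The output shows the plan the planner chose (sequential scan, index scan,
-- join strategies), its row estimates, and the actual time spent in each node.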

If you are more into PostgreSQL database administration you might want to join Pavel Stěhule's “Configuration, administration and monitoring PostgreSQL” training. He will focus on monitoring PostgreSQL to prevent problems and unwanted downtime, as well as figuring out possible bottlenecks in PostgreSQL systems.

If you are interested in the awesome PostGIS you might consider joining Jan Michalek's “Introduction to PostGIS” training. After the training, participants are expected to be able to perform simple spatial queries, import/export data, etc.

The last training that I would like to mention is Aleš Zelený's “Learning PostgreSQL for Oracle DBAs”. He will show differences and similarities between Oracle and PostgreSQL databases from the perspective of an Oracle DBA. If you are interested in hearing about his journey of adapting to PostgreSQL, coming from a different DBMS background and a different set of experiences, you might want to join his training.

With these 4 trainings mentioned above, the first day will end, and the conference will continue with talks in both Czech and English.

2nd Day: Talks – 18th of February

[...]

Databases form the cornerstone of many applications, Web sites, and platforms. A huge amount of time, money, and research has been poured into databases over the last few decades. But our thirst for data, and the quantities that we’re trying to read and analyze, continue to grow. What can and should we do? How can we ensure reliability? How can we communicate with a growing number of other systems? And where does PostgreSQL, an open-source relational database with a growing number of features, fit into this trend? In this talk, Simon Riggs answers all of these questions, and describes how PostgreSQL’s developers are working to keep as many of these questions in mind as they continue to improve their contribution to the world of databases.

Time: 22 minutes

The post [Video 435] Simon Riggs: Databases — The Long View appeared first on Daily Tech Video.

Posted by Shaun M. Thomas on 2016-02-05 at 17:11

Say that three times fast! Joking aside, managing database object access is a headache for users and DBAs alike. Postgres isn’t really an exception in that regard, but it does provide a few tools to greatly reduce the pain involved. The crazy thing is that few people even know this feature exists. I’ve known about it for a while myself, but it always slips my mind because it feels so wrong not to explicitly grant permissions.

What does that mean? Well, suppose we have a schema with some existing tables, and a role we want to have read permission on tables in that schema. Here’s an example:

CREATE SCHEMA acl_app;
 
CREATE TABLE acl_app.t1 (foo INT);
CREATE TABLE acl_app.t2 (bar INT);
 
CREATE ROLE acl_read;
GRANT USAGE ON SCHEMA acl_app TO acl_read;

If this were a preexisting schema, normally we would grant read permission to tables like this:

GRANT SELECT ON acl_app.t1 TO acl_read;
GRANT SELECT ON acl_app.t2 TO acl_read;

And that’s also the usual suggestion for grants after tables are created. Create the table, then grant the permissions. It’s fairly straightforward, and an expected part of database administration. But what about when we have an existing schema with dozens or hundreds of tables? Doing the grants manually as shown above would be monumentally irritating! In fact, for many database systems, the only way forward is to use the system catalogs to generate a script, and then execute the output in the database. Postgres lets you do that:

COPY (
  SELECT 'GRANT SELECT ON acl_app.' || tablename ||
         ' TO acl_read;'
    FROM pg_tables
   WHERE schemaname = 'acl_app'
) TO '/tmp/grant_perms.sql';
 
\i /tmp/grant_perms.sql

But the kind Postgres devs have also provided us with some extremely useful shorthand, because while usable, the script approach is something of an ugly hack. Here’s how that looks:

GRANT SELECT
   ON ALL TABLES IN SCHEMA acl_app
   TO acl_read;
 
\z acl_app.*
 
 Schema  | Name | TYPE  |     Access privileges     
---------+------+-------+----------
[...]
Posted by gabrielle roth on 2016-02-05 at 02:29

When: 6-8pm Thursday Feb 18, 2016
Where: Iovation
Who: Jason Owen
What: Incremental Schema Discovery via JSON Wrangling

Over the last few releases, Postgres has added and expanded support for storing and querying JSON documents. While the simplicity and flexibility of storing free-form data can be appealing, frequently what’s stored is actually fairly structured. Using data from the GitHub Archive, Jason will show how easy it is to load JSON into Postgres, demonstrate some fairly mechanical schema refactorings to extract structure from JSON data, and then compare the resulting normalized tables with the original, with the documentation, and with the data set loaded into Google BigQuery.
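To give a flavour of that workflow, here is a minimal sketch (the tables, columns and JSON keys are made up for illustration and are not taken from the talk):

CREATE TABLE events_raw (
  doc jsonb NOT NULL
);

-- Pull the fields that are always present out into proper columns.
CREATE TABLE events (
  id         bigserial PRIMARY KEY,
  event_type text NOT NULL,
  created_at timestamptz NOT NULL,
  repo_name  text
);

INSERT INTO events (event_type, created_at, repo_name)
SELECT doc->>'type',
       (doc->>'created_at')::timestamptz,
       doc#>>'{repo,name}'
  FROM events_raw;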

Jason has been using Postgres in earnest for two years now. While he’s worn the DBA hat as a full stack engineer from time to time, he is primarily a developer. He works at Tripwire and is an alumnus of Portland State University. Sometimes he says things online as @jasonaowen.


If you have a job posting or event you would like me to announce at the meeting, please send it along. The deadline for inclusion is 5pm the day before the meeting.

Our meeting will be held at Iovation, on the 32nd floor of the US Bancorp Tower at 111 SW 5th (5th & Oak). It’s right on the Green & Yellow Max lines. Underground bike parking is available in the parking garage; outdoors all around the block in the usual spots. No bikes in the office, sorry!

Elevators open at 5:45 and building security closes access to the floor at 6:30.

When you arrive at the Iovation office, please sign in on the iPad at the reception desk.

See you there!


Posted by David Kerr on 2016-02-04 at 23:12
I recently had the pleasure of Amazon telling me that they had to reboot all of my Postgres RDS instances to apply some security patches. When using RDS you generally expect that Amazon is going to do something like this, and I was at least happy that they told me about it and gave me the option to trigger it in a specific maintenance window or on my own schedule (up to a drop-dead date where they'd just do it for me).

One thing that you can't really know is what the impact of the operation is going to be. You know it's a downtime, but for how long? My production instances are, of course, Multi-AZ, but all of my non-production instances are not. Fortunately, both my non-production and my production instances needed to be rebooted, so I could do some up-front testing on the timing.

What I found was that the process takes about 10 to 15 minutes and, in this particular case, it was not impacted by database size. It is, however, impacted by the number of instances you're rebooting at the same time; it seems Amazon queues the instances up, so some instances take longer than others.

The pre-reboot security patches took about 5 minutes to load, and during this time the database was up. This was followed by a shutdown/reboot, during which the database was unavailable. After the reboot, which took less than a minute, the database was immediately available while the system did post-processing. After that a backup is performed, which doesn't impact the system. So total downtime was about a minute, but I scheduled 10 minutes just to be safe.

For the Multi-AZ instances the same process is followed, but the shutdown/reboot is accompanied by an AZ failover which takes place nearly instantly. This is pretty cool as long as your applications are robust enough to re-connect. (Mine were not, so they required a restart.) I timed the reboot to go with a deploy so no additional downtime was required.

In the end it was fairly painless. If you don't trust your applications' ability to reconnect, it's good to babysit them. Otherwise [...]

PgBouncer is a lightweight connection pooler for PostgreSQL. PgBouncer 1.7 was announced on the 18th of December 2015. In this blog post we’ll talk about the major new improvements in PgBouncer.

I enjoy checking their funny release names with every new release, and to celebrate PgBouncer 1.7, the “Colors Vary After Resurrection” release, I used the lovely Dumbo image.


The Most Colorful Features

  • PgBouncer 1.7 supports TLS connections, and I think this is the biggest improvement of the new release. They used the OpenSSL/LibreSSL libraries as the backend implementation of the feature.

Note: TLS (Transport Layer Security) is a protocol that ensures privacy between communicating applications and their users on the Internet. When a server and client communicate, TLS ensures that no third party may eavesdrop or tamper with any message. TLS is the successor to the Secure Sockets Layer (SSL).

  • PgBouncer now supports authentication via TLS client certificate.

Note: Traditionally, TLS Client Authentication has been considered the alternative to bearer tokens (passwords and cookies) for web authentication. In TLS Client Authentication, the client (browser) uses a certificate to authenticate itself during the TLS handshake. Once the TLS connection is established (and authenticated), the client and server run HTTP on top of the TLS layer.

Let’s dig into the details of PgBouncer’s TLS settings. There are 14 config parameters related to the TLS setup (client-side + server-side settings).

To choose which TLS mode to use for connections from clients, we should set the client_tls_sslmode parameter. TLS connections are disabled by default. When enabled, client_tls_key_file and client_tls_cert_file must also be configured to set up the key and certificate PgBouncer uses to accept client connections.

We can assign a root certificate to validate client certificates by setting the client_tls_ca_file parameter; it is unset by default.

We can specify which TLS protocol versions are allowed by setting the client_tls_protocols parameter; the default is all.
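Put together, the client-side part of a TLS setup in pgbouncer.ini might look like this (a sketch only; the file paths are made up, and restricting the protocol list is optional):

[pgbouncer]
client_tls_sslmode = require
client_tls_key_file = /etc/pgbouncer/pgbouncer.key
client_tls_cert_file = /etc/pgbouncer/pgbouncer.crt
; only needed if you want to validate client certificates
client_tls_ca_file = /etc/pgbouncer/root.crt
client_tls_protocols = tlsv1.2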

For more detailed cl

[...]
Posted by gabrielle roth on 2016-02-03 at 22:35
I’ve got some more info on the AWS RDS OS update I mentioned last week. The announcement states that this update will require an outage. In my experience, it’ll be the usual failover time for a Multi-AZ instance (a minute or so), and 8-10 minutes for non-Multi-AZ instances. According to my event logs, the entire update process takes 20-25 minutes. Since I […]

My slides about PostgreSQL Streaming Replication from PgConf 2016 Russia.



The PostgreSQL statistics collector generates a lot of very important statistics about the state of the database. If it’s not working, autovacuum doesn’t work, among other problems. But it does generate a lot of write activity, and by default, that goes back onto the database volume.

Instead, always set stats_temp_directory to point to a RAM disk (which has to be owned by the postgres user, with 0600 permissions). The statistics are written back to the database on shutdown, so in normal operations, you won’t lose anything on a reboot. (You’ll lose the most recent statistics on a crash, but you would anyway; the statistics are reset on recovery operations, including restart from a crash.)
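On a typical Linux box that boils down to a tmpfs mount plus one setting. A minimal sketch (the mount point and size are arbitrary):

# as root: create and mount a small RAM disk owned by postgres
mkdir -p /var/lib/pgsql/stats_temp
mount -t tmpfs -o size=64M tmpfs /var/lib/pgsql/stats_temp
chown postgres:postgres /var/lib/pgsql/stats_temp

# postgresql.conf
stats_temp_directory = '/var/lib/pgsql/stats_temp'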

This can substantially cut down the amount of write I/O the main database storage volume has to receive, and it’s free!

On 22nd of January, Tom Lane committed patch: Add trigonometric functions that work in degrees.   The implementations go to some lengths to deliver exact results for values where an exact result can be expected, such as sind(30) = 0.5 exactly.   Dean Rasheed, reviewed by Michael Paquier The description seems to explain everything, but […]
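A quick taste of the new functions, in the spirit of the example from the commit message:

SELECT sind(30);  -- 0.5, exactly
SELECT cosd(60);  -- 0.5, exactly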
On 20th of January, Robert Haas committed patch: The core innovation of this patch is the introduction of the concept of a partial path; that is, a path which if executed in parallel will generate a subset of the output rows in each process. Gathering a partial path produces an ordinary (complete) path. This allows […]

I recently noted that the COPY command in Postgres doesn’t have syntax to skip columns in source data when importing it into a table. This necessitates using one or more junk columns to capture data we’ll just be throwing away. During that, I completely forgot that friendly devs had contributed alternative file handling methods as Foreign Data Wrappers. Most people think of foreign wrappers as a method for interacting with remote databases. Perusing the full list, however, reveals some surprising data sources. Twitter? Hive? Video cards?!
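One of the simplest members of that list is file_fdw from contrib, which exposes a flat file as a foreign table so that unwanted columns can simply be left out of the SELECT. A sketch against the same CSV used below (assuming file_fdw; the foreign table mirrors the raw table defined in a moment):

CREATE EXTENSION file_fdw;

CREATE SERVER file_server FOREIGN DATA WRAPPER file_fdw;

CREATE FOREIGN TABLE option_quote_fdw (
  ext_quote_id  BIGINT,
  quote_date    DATE,
  quote_time_ns BIGINT,
  seq_no        BIGINT,
  sub_seq_no    BIGINT,
  exchange_id   INT,
  time_us       BIGINT,
  feed_side     VARCHAR,
  flag          INT,
  bid           NUMERIC(16, 8),
  ask           NUMERIC(16, 8),
  bid_size      INT,
  ask_size      INT,
  trash         TEXT
) SERVER file_server
  OPTIONS (filename '/tmp/quote_source.csv', format 'csv');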

Well, let’s take a quick look at a real case. Here’s a table we want to import data into. We can’t insert into it directly, because of that extra column. So we’ll use an intermediate table as a raw import target, then insert into the target table. Let’s use COPY as a first approximation with 500,000 rows:

CREATE UNLOGGED TABLE option_quote_raw
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL,
  quote_time_ns BIGINT NOT NULL,
  seq_no BIGINT NOT NULL,
  sub_seq_no BIGINT NOT NULL,
  exchange_id INT NOT NULL,
  time_us BIGINT NOT NULL,
  feed_side VARCHAR NOT NULL,
  flag INT NOT NULL,
  bid NUMERIC(16, 8) NOT NULL,
  ask NUMERIC(16, 8) NOT NULL,
  bid_size INT NOT NULL,
  ask_size INT NOT NULL,
  trash TEXT
);
 
CREATE UNLOGGED TABLE option_quote
(
  ext_quote_id BIGINT,
  quote_date DATE NOT NULL,
  quote_time_ns BIGINT NOT NULL,
  seq_no BIGINT NOT NULL,
  sub_seq_no BIGINT NOT NULL,
  exchange_id INT NOT NULL,
  time_us BIGINT NOT NULL,
  feed_side VARCHAR NOT NULL,
  flag INT NOT NULL,
  bid NUMERIC(16, 8) NOT NULL,
  ask NUMERIC(16, 8) NOT NULL,
  bid_size INT NOT NULL,
  ask_size INT NOT NULL
);
 
CREATE INDEX idx_quote_id ON option_quote (ext_quote_id);
 
\timing ON
 
COPY option_quote_raw FROM '/tmp/quote_source.csv' WITH CSV;
 
TIME: 1705.967 ms
 
INSERT INTO option_quote
SELECT ext_quote_id, quote_date, quote_time_ns, seq_no,
       sub_seq_no, exchange_id, time_us, feed_side, flag,
       bid, ask, bid_size, ask_size
  FROM option_quote_raw;
 
TIME: 2062.863 ms
[...]

If you already have a working PostgreSQL 9.5 install, and just want to skip to relevant sections, follow this list:

As a general note, these instructions are what I did for CentOS 7. For lower versions there are some differences in the packages you'll get. For example, currently, if you are installing on CentOS 6 (and I presume by extension the rest of the 6 family), you won't get SFCGAL and might have pgRouting 2.0 (instead of 2.1).


Continue reading "An almost idiot's guide to install PostgreSQL 9.5, PostGIS 2.2 and pgRouting 2.1.0 with Yum"

The recent release of Postgres 9.5 has many people excited about the big new features such as UPSERT (docs) and row-level security (docs). Today I would like to celebrate three of the smaller features that I love about this release.

Before jumping into my list, I'd like to thank everyone who contributes to Postgres. I did some quick analysis and found that 85 people, from Adrien to Zeus, have contributed to version 9.5 of Postgres, at least according to the git logs. Of course, that number is actually higher, as it doesn't take into account people helping out on the #postgresql channel, running buildfarm animals, doing packaging work, keeping the infrastructure running, etc. Thanks to you all!

Feature: REINDEX VERBOSE

The first feature is one I've wanted for a long time - a verbose form of the REINDEX command. Thanks to Sawada Masahiko for adding this. Similar to VACUUM, REINDEX gets kicked off and then gives no progress or information until it finishes. While VACUUM has long had the VERBOSE option to get around this, REINDEX gave you no clue as to which index it was working on, or how much work each index took to rebuild. Here is a normal reindex, along with another 9.5 feature, the ability to reindex an entire schema:

greg=# reindex schema public;
## What seems like five long minutes later...
REINDEX

The new syntax uses parentheses to support VERBOSE and any other future options. If you are familiar with EXPLAIN's newer options, you may see a similarity. More on the syntax in a bit. Here is the much improved version in action:

greg=# reindex (verbose) schema public;
INFO:  index "foobar_pkey" was reindexed
DETAIL:  CPU 11.00s/0.05u sec elapsed 19.38 sec.
INFO:  index "foobar_location" was reindexed
DETAIL:  CPU 5.21s/0.05u sec elapsed 18.27 sec.
INFO:  index "location_position" was reindexed
DETAIL:  CPU 9.10s/0.05u sec elapsed 19.70 sec.
INFO:  table "public.foobar" was reindexed
INFO:  index "foobaz_pkey" was reindexed
DETAIL:  CPU 7.04s/0.05u sec elapsed 19.61 sec.
INFO:  index 
[...]
Posted by Joshua Drake in CommandPrompt on 2016-01-28 at 15:57
A lot of people probably don't know this, but PostgreSQL does plan. It is true that we take all contributions and that they are reviewed based on their merit, but it is also true that the community tries very hard to have a road map of some sort. Those road maps are created by the more prolific contributors in the community.

In the past there was a yearly Developer Meeting. That meeting would take place at PgCon. PgCon is held in May at the University of Ottawa, Canada. It is a small but great developer conference.

This year we are going to have two, plus probably an informal one, for a total of three. The first of these is taking place now at FOSDEM. The current list of attendees to that meeting and the discussion points are listed there.

The informal one will likely happen at PgConf.US in April. As it is the largest North American PostgreSQL conference, it makes sense that a lot of the developers you will see on the list at the FOSDEM developer meeting will also be at PgConf.US. Even better, we may see some of the "regrets" that were not able to make FOSDEM. This allows for a great amount of catch-up.

The second formal meeting will happen at PgCon per the usual schedule.

These are invitation-only meetings, so don't plan to just show up. The way to get invited is to be a prolific developer/contributor to PostgreSQL.Org, and we are always looking for more of those. Without them, we couldn't be the leading and most advanced database in the world!

I would be remiss if I didn't also mention that the CFP for PgConf.US 2016 is closing this coming Sunday. Get those presentations in!

Posted by Magnus Hagander in Redpill Linpro on 2016-01-28 at 14:38

The PostgreSQL core team are happy to welcome Dean Rasheed as a new committer to the project.

Dean has been involved with patch review and authorship for a while, and will now get the extra workload of also pushing other patches through the final hurdles!

Further digging into the PL/R documentation shows a way to run code at startup and make it globally available.

Round 4

First we need a table called plr_modules:
CREATE TABLE plr_modules (
  modseq int4,
  modsrc text
);
Then we add the necessary entries:
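Conceptually, each entry just loads the pre-trained SVM into a global object when the R interpreter starts. A minimal sketch (the file path is an assumption; mysvm is the object name used later in this post):

INSERT INTO plr_modules (modseq, modsrc)
VALUES (1, 'mysvm <- readRDS("/tmp/mysvm.rds")');

PL/R executes the modsrc entries in modseq order when it initializes the R interpreter, so every new backend gets mysvm without rebuilding it.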

The SVM is now globally available and the predictor function can be reduced to the following:
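Something along these lines, as a sketch (the return type, argument handling and the feature name x are assumptions, since the actual model and features aren't shown here):

CREATE OR REPLACE FUNCTION r_predict4(integer)
RETURNS float8 AS $$
  # mysvm is already loaded by the plr_modules startup code
  as.numeric(predict(mysvm, data.frame(x = arg1)))
$$ LANGUAGE plr;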

Let's run this statement three times again:

select s.*, r_predict4(s.*) from generate_series(1,1000) s;

541 ms for the first run. 281 ms for each of the following two. Average: 368 ms.

That's only a 73% improvement compared to the original code. predict3() is faster than predict4().

Since the only difference is that the mysvm object is now global, that might explain the gap. Maybe it is more expensive to work with global objects than with local ones in R?

But there is another way, suggested by a reader of the first installment of this series.

Push the whole loop into R and wrap it into an SRF.
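A sketch of what that set-returning variant can look like (again assuming the mysvm global and a single feature; the R code returns a vector, which PL/R turns into a set of rows):

CREATE OR REPLACE FUNCTION r_predict5(integer[])
RETURNS SETOF float8 AS $$
  # arg1 arrives as an R vector, so the model scores the whole set in one call
  as.numeric(predict(mysvm, data.frame(x = arg1)))
$$ LANGUAGE plr;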


Let's see and run this statement three times again:

select * from r_predict5(array(select * from generate_series(1,1000)));

341 ms for the first run. 31 ms for each of the following two. Average: 134 ms.

Now that's a 90% improvement compared to the original code. Ten times faster.
If only the post-initialization runs are taken into account, it's even better: about 45x.

If you process sets anyway and you can live with passing arrays we have a winner here!

Bottom line:
  1. Do expensive initializations only once.
  2. Pre-can R objects that are immutable and expensive to create with saveRDS().
  3. If you process sets, push the whole loop down to R and stay there as long as possible.
The predict2() function initializes the SVM only on the first call, which improves performance significantly. But it still needs to build the model from scratch.

If the training takes a comparatively long time or the training data cannot be provided along with the code, this is a problem.

Round 3

R has the ability to serialize objects to disk and read them back with saveRDS() and readRDS().

Having saved the SVM object like that, we can restore it from disk instead of rebuilding it each time.
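In its simplest form that is just two calls (the path and object name are illustrative):

# once, after training the model
saveRDS(mysvm, "/tmp/mysvm.rds")

# inside the predictor, instead of retraining
mysvm <- readRDS("/tmp/mysvm.rds")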

Let's run this statement three times again:

select s.*, r_predict3(s.*) from generate_series(1,1000) s;

484 ms for the first run. 302 ms for each of the following two. Average: 363 ms.

That's a 75% improvement compared to the original code.

Still, the first call is more expensive than the subsequent ones.

Can we do better?