Posted by Robert Haas in EnterpriseDB on 2020-01-23 at 16:04
Last summer, in a burst of sunny optimism and out of a desire for transparency, I posted a blog post about the then-new PostgreSQL Contributors Team, which was charged with updating the contributors page as required. Having now been on this mailing list for 7 months or so, I have a few - actually quite a few - comments about the whole problem space.

1. Introduction

This is part 3 of the blog “Understanding Security Features in PostgreSQL”, in which I will discuss how to apply TLS to both the PostgreSQL server and client using the principles we learned in part 2. At the end, I will also briefly talk about Transparent Data Encryption (TDE) and security vulnerabilities.

Here is the overview of the security topics that will be covered in all parts of the blog:

Part 1:

  • PostgreSQL Server Listen Address
  • Host-Based Authentication
  • Authentication with LDAP Server
  • Authentication with PAM
  • Role-Based Access Control
  • Assign Table and Column Level Privileges to Users
  • Assign User Level Privileges as Roles
  • Assign Table and Column Level Privileges via Roles
  • Role Inheritance

Part 2:

  • Security Concepts around TLS
  • Symmetrical Encryption
  • Asymmetrical Encryption (a.k.a Public Key Cryptography)
  • Block Cipher Mode of Operation (a.k.a Stream Cipher)
  • Key Exchange Algorithm
  • TLS Certificate and Chain of Trust
  • Data Integrity Check / Data Authentication
  • TLS Cipher Suite and TLS handshake
  • TLS versions

Part 3:

  • Preparing TLS Certificates
  • Enabling Transport Layer Security (TLS) to PostgreSQL Server
  • Enabling Transport Layer Security (TLS) to PostgreSQL Client
  • TLS Connect Examples
  • Transparent Data Encryption (TDE)
  • Security Vulnerability

2. Preparing TLS Certificates

Before we can utilize TLS to secure both the server and the client, we must prepare a set of TLS certificates to ensure mutual trust. Normally, CA (Certificate Authority) certificates are purchased from a trusted organization and used to create more CA-signed certificates for services and applications. In this section, I will show you how to create your own CA certificate and CA-signed certificates with the OpenSSL command line tool for both the PostgreSQL server and client.
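As a sketch of what that preparation can look like, the following creates a root CA and a server certificate signed by it (the file names and subject names here are illustrative, not necessarily the ones used later in this post):

```shell
# 1) create a self-signed root CA key and certificate
openssl req -new -x509 -newkey rsa:2048 -nodes -days 365 \
        -subj "/CN=demo-root-ca" \
        -keyout root.key -out root.crt
# 2) create the server key and a certificate signing request (CSR)
openssl req -new -newkey rsa:2048 -nodes \
        -subj "/CN=db.demo.internal" \
        -keyout server.key -out server.csr
# 3) sign the server CSR with the root CA, producing the server certificate
openssl x509 -req -in server.csr -days 365 \
        -CA root.crt -CAkey root.key -CAcreateserial \
        -out server.crt
# 4) verify the chain of trust
openssl verify -CAfile root.crt server.crt
```

The same pattern, with a different subject name, produces the client certificate.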

You may also have heard the term self-signed certificate. This type of certificate is not s


1. Introduction

This is part 2 of the blog “Understanding Security Features in PostgreSQL”, in which I will discuss TLS in greater detail. I will begin by going over some of the most important security concepts around TLS before jumping into enabling TLS on the PostgreSQL server. I believe it is crucial to have sufficient background information on TLS before tweaking the TLS settings on both the client and server sides.

In part 1 of this blog, we mostly discussed authentication and authorization (AA), which determine which clients are permitted to connect and which tables or columns they are permitted to operate on. Even with the strongest authentication and authorization, the actual communication between client and server will not be encrypted unless Transport Layer Security (TLS) is specifically enabled in the database server. TLS is one of the least understood but most commonly used security protocols, and it ensures the security of many HTTPS sites and other services. TLS is a big protocol, and this blog will describe how it works and how to enable it in your PostgreSQL server.

Here is the overview of the security topics that will be covered in all parts of the blog:

Part 1:

  • PostgreSQL Server Listen Address
  • Host-Based Authentication
  • Authentication with LDAP Server
  • Authentication with PAM
  • Role-Based Access Control
  • Assign Table and Column Level Privileges to Users
  • Assign User Level Privileges as Roles
  • Assign Table and Column Level Privileges via Roles
  • Role Inheritance

Part 2:

  • Security Concepts around TLS
  • Symmetrical Encryption
  • Asymmetrical Encryption (a.k.a Public Key Cryptography)
  • Block Cipher Mode of Operation (a.k.a Stream Cipher)
  • Key Exchange Algorithm
  • TLS Certificate and Chain of Trust
  • Data Integrity Check / Data Authentication
  • TLS Cipher Suite and TLS handshake
  • TLS versions

Part 3:

  • Preparing TLS Certificates
  • Enabling Transport Layer Security (TLS) to PostgreSQL Server
  • Enabl

1. Introduction

PostgreSQL is packed with security features that a database administrator can utilize according to his or her organizational security needs. Security is a very broad concept and can refer to completely different procedures and methodologies in different PostgreSQL components. This blog is divided into parts 1, 2 and 3, and I will explain security with regard to PostgreSQL version 12.1 and how it is practiced in different areas within the system.

In Part 1 of the blog, I will discuss the basic security features that exist in PostgreSQL, with emphasis on host-based authentication methods as well as user-based access control with the concept of roles. If done right, we could have a much more robust database server and potentially reduce the attack surface, protecting the server from attacks like SQL injection. I will also briefly discuss a few of the advanced authentication methods such as LDAP and PAM authentication. Many more advanced authentication methods are supported, and we will be producing more articles in the near future to cover them.

In Part 2 of the blog, I will discuss TLS in greater detail, which I believe is crucial for a database administrator to understand before enabling TLS in the PostgreSQL server. TLS is a fairly large protocol and one of the least understood today; it contains many security components and cryptographic methods that can be quite confusing.

In Part 3 of the blog, I will discuss how to apply TLS configurations to both the PostgreSQL server and client, following the TLS principles discussed in Part 2. I will also briefly discuss Transparent Data Encryption (TDE), which the PostgreSQL community is currently working on and which introduces another layer of security to the database environment.

Below is the overview of the security topics that will be covered in all parts of the blog:

Part 1:

  • PostgreSQL Server Listen Address
  • Host-Based Au
Posted by Kaarel Moppel in Cybertec on 2020-01-22 at 08:30

It’s been exactly half a year since the last major release of the pgwatch2 Open Source PostgreSQL monitoring tool, and I’m glad to announce that another huge set of useful features and improvements have found their way into the pgwatch2 code repository! The new version is incremented to v1.7 and, continuing the naming tradition, I’m also calling it “Feature Pack 6” as it’s mostly about new features mixed with some bugfixes, although it’s been surprisingly error-free for the last year. Note that listed below are only the most significant changes since last major version, v1.6, so please go through the GitHub changelog in detail if you’re upgrading from a previous version.

Highlights of pgwatch2 v.1.7.0 – log parsing, recommendations engine, real-time stats tracking and new dashboards

The biggest feature for me is the log parsing support. This means that when the pgwatch2 agent is deployed locally on the DB server (also called “push” mode), it can “tail” the logs and send event counts to the metric store. And when is that useful? Well, mostly for alerting and health-check purposes. Note that we’re not storing error message contents, to avoid security and performance problems, but only the event counts, grouped by severity.

Other notable features are the “recommendations engine” and some infrastructure changes to treat metrics with “real-time” in their name a bit differently so that they could be executed very often (every second even) thus providing a near real-time glimpse into database activities, with the main practical usage of tracking live (currently executing) server processes conveniently there where all other metrics are viewed – in Grafana. Also included is possible locking info, utilizing “pg_stat_activity” and “pg_locks” data. But not only that – also for example live “explain plan” tracking for longer-lasting queries is now possible.

Call for feedback

And as always, please do let us know on GitHub if you’re still missing something in the tool or are experiencing any operati

On 20th of January 2020, Amit Kapila committed patch: Allow vacuum command to process indexes in parallel.   This feature allows the vacuum to leverage multiple CPUs in order to process indexes. This enables us to perform index vacuuming and index cleanup with background workers. This adds a PARALLEL option to VACUUM command where the … Continue reading "Waiting for PostgreSQL 13 – Allow vacuum command to process indexes in parallel."
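As a minimal illustration of the committed syntax (the table name is made up for the example):

```sql
-- request up to 4 parallel workers for index vacuuming (PostgreSQL 13+)
VACUUM (PARALLEL 4) my_big_table;
```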

One of the great things about PostgreSQL is its reliability: it is very stable and typically “just works.” However, there are certain things that can happen in the environment that PostgreSQL is deployed in that can affect its uptime, such as:

  • The database storage disk fails or some other hardware failure occurs
  • The network on which the database resides becomes unreachable
  • The host operating system becomes unstable and crashes
  • A key database file becomes corrupted
  • A data center is lost

There may also be downtime events that are part of normal operations, such as performing a minor upgrade, security patching of the operating system, a hardware upgrade, or other maintenance.

Fortunately, the Crunchy PostgreSQL Operator is prepared for this.

Crunchy Data recently released version 4.2 of the open source PostgreSQL Operator for Kubernetes. Among the various enhancements included within this release is the introduction of distributed consensus based high-availability (HA) for PostgreSQL clusters by using the Patroni high-availability framework.

What does this mean for running high-availability PostgreSQL clusters in Kubernetes, how does it work, and how do you create a high-availability PostgreSQL cluster? Read on to find out!



The Crunchy PostgreSQL Operator High-Availability Fundamentals

Posted by Andrew Dunstan in 2ndQuadrant on 2020-01-20 at 03:14
Recently there were some complaints about the behaviour of the jsonb_set function. Specifically, the complaint was that if the value argument of the function is null, the result is null. This happens because the function is declared STRICT, like many PostgreSQL functions. STRICT is in fact another way of spelling RETURNS NULL ON NULL INPUT. […]
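A minimal illustration of the STRICT behaviour described above, together with a common coalesce guard (these examples are mine, not from the post):

```sql
-- STRICT: any NULL argument makes the whole result NULL
SELECT jsonb_set('{"a":1}'::jsonb, '{a}', NULL);           -- result is NULL
-- guard: coalesce the value to a JSON null instead
SELECT jsonb_set('{"a":1}'::jsonb, '{a}',
                 coalesce(NULL::jsonb, 'null'::jsonb));    -- {"a": null}
```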
upgrading postgresql python 3

Python 2 has officially completed its life as of Jan 1st, 2020, and the Python 2.x branch will not be maintained anymore. You might be seeing repeated notices of deprecation in log files and terminals like:

DEPRECATION: Python 2.7 will reach the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 won't be maintained after that date.

Python 2 had a wonderful life, coming from a humble background and becoming one of the most popular languages and the most popular scripting language. It’s no wonder Patroni-like tools are developed in Python.

Thanks to the growing popularity of Patroni, there are a large number of Patroni cluster setups across different companies. Now “how to migrate from Python 2 to Python 3 with the least possible downtime” becomes a pressing question for operations teams and DBAs. Additionally, chances are that the cluster will be running an older version of Patroni. If so, the upgrade plan can tackle both of these challenges at the same time:

  1. Upgrade Patroni Software to Latest release
  2. Switch to Python 3 from Python 2

This blog post demonstrates how easy that is; Patroni’s high availability features will be leveraged when we upgrade individual nodes. There are a few key points to note.

  • We should not be disturbing the default Python 2 installation (its binaries and modules) because there could be other programs using them.
  • Python 2 and Python 3 installations will go to different directories so they can coexist in almost all Linux distros.
  • We should avoid multiple Patroni installations (different versions) in a node because they will cause confusion in the long run. So we must remove existing ones.
  • Patroni in each node of the cluster can be using different versions of Python because Patroni is not interfering with PostgreSQL replication.
  • Since Python precompiles scripts it loads to memory for execution, we can replace the Python script while the corresponding program is running in a loop.
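Following the points above, the per-node procedure can be sketched roughly as below; the pip invocations and the patroni[etcd] extra are assumptions to adapt to your distribution and DCS:

```shell
# pre-flight: confirm a side-by-side Python 3 is available
python3 --version
# per node, during that node's maintenance window (sketch only):
#   python2 -m pip uninstall -y patroni      # avoid duplicate Patroni installs
#   python3 -m pip install 'patroni[etcd]'   # reinstall latest Patroni under Python 3
#   patroni --version                        # verify before restarting the node
```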

Verification of the Vers

Posted by Andreas Scherbaum on 2020-01-16 at 23:34

Pavlo recently pointed out that the pg_sleep() function in PostgreSQL has two alternatives: pg_sleep_for() and pg_sleep_until().

What do they do?
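In short (the examples below are mine, not from the post): pg_sleep_for takes an interval, and pg_sleep_until takes a point in time, while plain pg_sleep takes seconds:

```sql
SELECT pg_sleep(1.5);                                -- pause for 1.5 seconds
SELECT pg_sleep_for('5 minutes');                    -- pause for an interval
SELECT pg_sleep_until(current_date + time '03:00');  -- pause until a given moment
```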


Continue reading "SELECT pg_sleep_until('#800Monies');"

As I get ready for the PgDay San Francisco event that is happening next Tue 21 January—a one-day, single-track Postgres community event at the awesome Swedish American Hall in SF—I’m reflecting a bit on how important the speakers are to developer events. Let’s face it, without speakers, there would be no conference.

And because I was on the PgDaySF talk selection committee, I’ve had some good conversations these last few months about CFPs, conference talks, how talk selection committees work, and how you can improve your chances at getting your proposals accepted. So I thought it would be useful to walk through the tips I’ve accumulated on how to get your conference talk accepted—at a Postgres conference, or at any developer conference.

These tips are premised on the notion that a good conference talk requires these 4 things:

  1. interestingness: a topic people will care about—and learn from
  2. a knowledgeable speaker who knows their subject & can communicate effectively with an audience—so people can follow, understand, and learn
  3. a hook: a compelling title & abstract that will hook people and entice them to attend
  4. fits holistically into the rest of the lineup: a talk that complements the rest of the talks at the event, that adds something unique, and doesn’t overlap the other talks in a significant way

So here we go, let’s walk through 16 tips on how to get your conference talk accepted.

1/ Be sure to fill in your bio

The program committee needs it. To create an effective program for an event, the talk selection committee needs to understand more than just your specific talk proposal. It helps them to get a sense of the knowledge and skills of the speakers. Note I said “skills” and not “experience”. This is important, as many conferences welcome first-time speakers with open arms. So it’s often OK if you don’t have a lot of experience giving conference talks. But the program committee still needs to know more about the person who will go up on stage.


It's every developer's nightmare: SQL queries that get large and unwieldy. This can happen fairly quickly with the addition of multiple joins, a subquery and some complicated filtering logic. I have personally seen queries grow to nearly one hundred lines long in both the financial services and health industries. Luckily Postgres provides two ways to encapsulate large queries: Views and Materialized Views. In this article, we will cover in detail how to utilize both views and materialized views…
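As a quick sketch of the two (the orders table and its columns are assumed for illustration):

```sql
-- a view: a stored query, re-executed on every reference
CREATE VIEW recent_orders AS
  SELECT * FROM orders
  WHERE created_at > now() - interval '7 days';

-- a materialized view: the result is stored and must be refreshed explicitly
CREATE MATERIALIZED VIEW order_totals AS
  SELECT customer_id, sum(amount) AS total
  FROM orders GROUP BY customer_id;

REFRESH MATERIALIZED VIEW order_totals;  -- recompute the stored result
```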
Posted by Andreas Scherbaum on 2020-01-15 at 23:35

The PostgreSQL Project has been present with a booth at FOSDEM since 2007. Since 2008 we have organized a Devroom, and since 2013 we have had our own PGDay on the Friday before FOSDEM. This year marks the 8th FOSDEM PGDay.

This blog post presents useful information about the PGDay, the booth and Devroom at FOSDEM.


Continue reading "PostgreSQL @ FOSDEM 2020"

FOSDEM PGDay is looking for volunteers! We rely on you to help us host and run a successful FOSDEM booth and devroom.

The areas covered by volunteers include, but are not limited to:

  • room host (Devroom, Sunday)
  • video host (Devroom, Sunday)
  • door bouncer (Devroom, Sunday)
  • booth volunteers (Saturday, Sunday)

Room volunteers have the advantage that they have a guaranteed place in the devroom during their Sunday shift.

If you are interested in volunteering, please contact us. Please include the areas where you want to help, your PostgreSQL Community account, your previous experience as a volunteer at a conference, your Telegram address, and when you plan to arrive in and leave Brussels. Please note that your submission does not automatically qualify you as a volunteer; you will hear back from us and receive an invitation.

Please also note that we switch from Hangouts to Telegram this year. You will be sent invites for the usual channels.

Quite often when visiting customers or giving trainings, I see that people are quite happy seeing and learning about little time-saving PostgreSQL tricks, commands or SQL snippets on the screen, and they usually ask to slow down and explain, or just say “wow, didn’t know you could do that”. As these tips and tricks are not too much material on their own, I thought I’d just make note of such titbits and write a post on them later – and now it seems that this “later” has arrived 🙂 So below is a list, in no particular order, of some Postgres features and tricks that I’ve gathered over the last year and that might be news to more casual Postgres users. Hope you’ll find something useful, and happy new Postgres year!

Dynamic SQL loops with psql’s “gexec”

This could easily be the most useful one, especially for DBAs doing maintenance operations. For a couple of years now, PostgreSQL’s native query tool psql has supported a kind of “dynamic programming”: you generate some SQL statements based on a query’s output (i.e. the result set is dynamic) and then execute all of the generated SQL in sequence. Typical uses might be: “vacuuming” or gathering statistics for the biggest tables that haven’t gotten any automatic “care” recently, getting row counts for all tables, generating thousands of tables for test purposes, rebuilding the most bloated indexes, etc. – in short, stuff that previously required leaving the psql environment and writing a small script.

-- counting rows from all user tables dynamically
select format('select count(*) from %s.%s',
              quote_ident(table_schema), quote_ident(table_name)) as sql
from information_schema.tables
where table_type ~ 'BASE'
  and not table_schema LIKE ANY(array[E'pg\\_%', 'information_schema'])
order by table_schema, table_name \gexec

Note that gexec is a psql command, not SQL!

Session variables

Officially session level user variables are not really advertised in the documentation and not as convenient as in some other database systems (this might cha
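One way to emulate them, shown here as a small sketch of my own, is the set_config()/current_setting() pair with a custom “namespaced” setting:

```sql
SELECT set_config('myapp.user_id', '42', false);  -- false = lasts for the session
SELECT current_setting('myapp.user_id');          -- returns '42'
```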


For connection pooling in PostgreSQL, one of the best and most popular tools out there is PgBouncer. However, monitoring PgBouncer can be challenging due to its use of SHOW commands, which are only available via a special database connection as opposed to making its statistics available via a standard table or view.

In order to more easily monitor PgBouncer, the team at Crunchy Data developed an open source PgBouncer Foreign Data Wrapper (pgbouncer_fdw). This blog post describes why monitoring PgBouncer is important, and how you can do this easily using pgMonitor and pgbouncer_fdw.

Title: Waiting for PostgreSQL 13 – ALTER TABLE … ALTER COLUMN … DROP EXPRESSION On 14th of January 2020, Peter Eisentraut committed patch: ALTER TABLE ... ALTER COLUMN ... DROP EXPRESSION   Add an ALTER TABLE subcommand for dropping the generated property from a column, per SQL standard.   Discussion: Since PostgreSQL 12 we … Continue reading "Waiting for PostgreSQL 13 – ALTER TABLE … ALTER COLUMN … DROP EXPRESSION"

The 6th edition of the popular PostgreSQL conference pgDay Paris, a PostgreSQL.Org Recognized Community Conference, will be held on March 26, 2020 in the French capital. All of the talks will be in English.

Registration is open, and the EARLYBIRD discount is going fast so make sure you grab that while you can!

We received over 100 submissions from 60 different speakers—double the interest from last year—and for only 8 slots! This allowed us to prepare an exciting schedule for you, with speakers from around the world, including the United States, Sweden, the Netherlands, the Czech Republic, and of course France; speaking on a wide range of topics, sure to please everyone.

Finally, we will have a social event after the conference, made possible by our sponsors! See them (or even join them!) here:

See you all on March 26th!

2020 January 16 Meeting 6pm-8pm


PSU Business Accelerator
2828 SW Corbett Ave · Portland, OR
Parking is open after 5pm.

Speaker: Alex Theodore

As a CTO I saw everything that went into and came out of the web development process, from the “client needs” to the source-controlled code. But being skilled in database development showed me time and time again that databases (and their architecture) were often sidelined, treated as afterthoughts, and oftentimes served as little more than flexible storage spaces. Yet their contents were the very essence of value for the process/client/company being served. This strange inversion of priorities led me to develop a way of looking at databases as “outside” the application, and ultimately to interacting with them as APIs. In my talk I’ll present this concept and show, with simple rules and examples, how it can be done.


I was the product and technology founder of and CTO up until 2018 where I developed the manufacturing processes and ERP software that powered our production process. That got me deeply involved in database work, especially with architecture and process design and optimization. Now I work part time as a database consultant and part time building modern, minimalistic furniture for my new (small) company

1. Overview

PostgreSQL is one of the most popular free open-source relational database management systems in the world. Besides complying with the SQL standard, PostgreSQL provides great extensibility, which allows users to define their own extensions. With such a great feature, PostgreSQL is not only a database but also an application development platform. In this tutorial, we will go through the entire process of creating an extension, running tests against it, and debugging it using Eclipse.

2. Create an extension

To create an extension in PostgreSQL, you need at least two files: a control file and a SQL script file. The control file uses the naming convention extension_name.control. Let’s create an extension called get_sum; the control file should then contain basic information like the following.

$ cat get_sum.control 
# get_sum postgresql extension
comment = 'simple sum of two integers for postgres extension using c'
default_version = '0.0.1'
module_pathname = '$libdir/get_sum'
relocatable = false
  • comment: a description of this extension
  • default_version: the version of this extension, which is also part of the SQL script file name
  • module_pathname: the path of the shared library for this extension

The SQL script file must be named extension_name--version.sql.

$ cat get_sum--0.0.1.sql 
--complain if script is sourced in psql, rather than via CREATE EXTENSION
\echo Use "CREATE EXTENSION get_sum" to load this file. \quit

CREATE FUNCTION get_sum(int, int) RETURNS int
AS '$libdir/get_sum'
LANGUAGE C STRICT;

The second line prevents this SQL script from being loaded directly with psql, and the rest declares that the get_sum function is implemented in a shared library written in C.

Typically, an extension not only defines new objects, but also adds new logic to deal with those new objects. To boost the performance, you should write the logic in C code with a Makefile. PostgreSQL provides a build infrastructure for extension called PGXS, which
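A minimal PGXS Makefile for the get_sum example might look like the following (a sketch; the file names follow the example above, not necessarily the author’s exact Makefile):

```make
MODULES = get_sum                  # build get_sum.so from get_sum.c
EXTENSION = get_sum                # name of the control file
DATA = get_sum--0.0.1.sql          # SQL script(s) to install
PG_CONFIG = pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)
```

With this in place, `make && make install` builds and installs the extension against the PostgreSQL that `pg_config` points to.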


1. Overview

This tutorial provides detailed instructions to help a newbie set up the building and debugging environment with the latest Eclipse IDE for C/C++ Developers for the current Postgres 12.1 release on LTS Ubuntu 18.04.

Below are the key requirements.

  • Linux: ubuntu-18.04.3-desktop-amd64
  • Postgres: REL_12_STABLE branch
  • Eclipse: eclipse-cpp-2019-12-R-linux-gtk-x86_64.tar.gz

2. Install Ubuntu 18.04 Desktop

Go to the Ubuntu official website to download the latest LTS Ubuntu 18.04.3 Desktop (64-bit) and Install it on a virtual machine such as VirtualBox. The option “Minimal Installation” with web browser and basic utilities is good enough for this tutorial.

3. Install git and checkout PostgreSQL source code

PostgreSQL’s source code is available on GitHub. In order to check it out, you need to install git using the command below.

$ sudo apt install git -y

PostgreSQL 12 was released in October 2019 and later updated to 12.1 in November. This tutorial uses the latest PostgreSQL 12 stable branch to explain how to build and debug the source code using the latest Eclipse IDE. Run the commands below to check out version 12.

$ mkdir sandbox
$ cd sandbox/
$ git clone
$ cd postgres/
$ git checkout REL_12_STABLE
$ git branch

Now, we are on PostgreSQL 12 stable release branch.

4. Install PostgreSQL build dependency

In order to build the PostgreSQL source code, we need to install the basic build dependencies with the command below.

$ sudo apt install -y pkg-config build-essential libreadline-dev bison flex

With the command above, the basic libraries and utilities for building C and C++ code will be installed, such as dpkg-dev, gcc, g++, make and libc6-dev. Moreover, the libraries and tools required by PostgreSQL, such as libreadline, zlib, bison and flex, will also be installed.

5. Configure PostgreSQL and generate Makefiles

Before importing PostgreSQL source code into Eclipse


This year's FOSDEM PGDay once again precedes the main FOSDEM event. It will take place on January 31st, 2020, in the Hilton Brussels Grand Place Hotel.

Registration for the PGDay is open, and registration is required to attend the Friday conference. The number of seats is limited. Visit the Registration page for your ticket!

The PostgreSQL Project also participates in FOSDEM itself: we will be present with a booth on both days, and we have a Devroom on Sunday. The schedule for Friday and Sunday can be found here.

Participation in the Devroom on Sunday is free, and no registration is required. However, the available seats usually fill up quickly, so please plan ahead for which talks you want to see.

See you in Brussels!

The latest Barman 2.10 release introduces support for the system identifier of a PostgreSQL instance. In this article, I will answer a few questions explaining what a system identifier is and why it is a good thing that Barman uses it. What is the PostgreSQL system identifier? PostgreSQL gives a unique system identifier to every […]

PSQL_EDITOR question

This was supposed to be just a short TIL entry about the PSQL_EDITOR variable. While trying to figure out all possible pitfalls, I decided to write a full blog post. You may skip to the summary though. All tests were done on Windows 10 with PostgreSQL 12 installed.

As you know, PostgreSQL ships with an interactive console tool named psql, which gives you a lot of enhanced features, like watching query output (\watch), for example. There is lots of information about tweaking psql in POSIX environments, but there is a lack of such tips for Windows users. If you, my dear reader, are a Linux user, I’m sure you will find this information useful anyway. Who knows what the next purchase of Microsoft will be. After LinkedIn, GitHub, and Citus, you cannot be sure anymore! 😉

I’ve chosen Sublime as a target because it’s not a simple console or single window application, but rather an advanced text processor with enhanced file handling, e.g. it will restore the unsaved files on startup.

According to the manual, PSQL_EDITOR, EDITOR and VISUAL are environment variables used by psql to call an editor for input.

Let’s try to set the environment variable then.

$ SET PSQL_EDITOR=C:\Program Files\Sublime Text 3\subl.exe

$ psql

postgres=# \e

postgres=# \e
could not open temporary file "C:\Temp\psql.edit.3288.sql": File exists

As you can see, nothing was executed after the first \e command, even though I edited and saved the content of the query.
What’s wrong here? First of all, I used subl.exe, a special command line tool, instead of sublime.exe:

Sublime Text includes a command line tool, subl, to work with files on the command line. This can be used to open files and projects in Sublime Text, and also works as an EDITOR for unix tools, such as git and subversion.

psql runs the editor and then waits for the editor process to finish. But subl is just a launcher, which in fact starts sublime.exe and then dies. So psql gets the signal that subl has finished and returns, but
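A usual remedy (an assumption on my part; check against your Sublime Text version) is to make the launcher block with its wait flag, so psql only returns when the edited file is closed:

```
$ SET PSQL_EDITOR="C:\Program Files\Sublime Text 3\subl.exe" -w
```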

For a very long time, plans with parallel execution showed bogus values. Not any more. For example, check this: Now, with the new logic, the same plan looks like this: Note the fixed times and the modified values in the loops column. Hope you like it 🙂
Posted by Julien Rouhaud on 2020-01-06 at 12:23

Coming up with good index suggestions can be a complex task. It requires knowledge of both the application queries and database specifics. Over the years multiple projects have tried to solve this problem, one of them being PoWA version 3, with the help of the pg_qualstats extension. It can give pretty good index suggestions, but it requires installing and configuring PoWA, while some users wanted only the global index advisor. For that case, and for simplicity, the algorithm used in PoWA is now available in pg_qualstats version 2 without requiring any additional component.

What is pg_qualstats

A simple way to explain pg_qualstats is to say that it’s like pg_stat_statements working at the predicate level.

The extension saves useful statistics for WHERE and JOIN clauses: which table and column a predicate refers to, the number of times the predicate has been used, the number of executions of the underlying operator, whether it’s a predicate from an index scan or not, its selectivity, the constant values used, and much more.

You can deduce many things from such information. For instance, if you examine the predicates that contain references to different tables, you can find which tables are joined together and how selective those join conditions are.

Global suggestion?

As I mentioned, the global index advisor added in pg_qualstats 2 uses the same approach as the one in PoWA, so the explanation here describes both tools. The only difference is that with PoWA you’ll likely get a better suggestion, as more predicates will be available, and you can also choose for which time interval you want to detect missing indexes.

The important thing here is that the suggestion is performed globally, considering all interesting predicates at the same time. This approach is different to all other approaches I saw that only consider a single query at a time. I believe that a global approach is better, as it’s possible to reduce the total number of indexes, maximizing multi-column indexes usefulne

This is a story about how I found myself trying programming languages. I’ve been running an OLTP type database test (DBT-2, if you’ve heard of it), and noticed the post-processing reporting scripts for analyzing the database test results were running longer than I thought they would. I was running tests on 16 systems at the […]

"out of shared memory": Some of you might have seen this error message in PostgreSQL already. But what does it really mean, and how can you prevent it? The problem is actually not as obscure as it might seem at first glance. max_locks_per_transaction is the critical configuration parameter you need to use to avoid trouble.

"out of shared memory": When it happens

Most of the shared memory used by PostgreSQL is of a fixed size. This is true for the I/O cache (shared buffers) and for many other components as well. One of those components has to do with locking. If you touch a table inside a transaction, PostgreSQL has to track your activity to ensure that a concurrent transaction cannot drop the table you are about to touch. Tracking activity is important because you want to make sure that a DROP TABLE (or some other DDL) has to wait until all reading transactions have terminated. The trouble is, you have to store information about tracked activity somewhere, and that is exactly the point you have to understand.

Let us run a simple script:


BEGIN;
SELECT 'CREATE TABLE a' || id || ' (id int);' 
       FROM generate_series(1, 20000) AS id;
\gexec


What this script does is start a transaction and generate 20,000 CREATE TABLE statements. It simply generates SQL, which is then automatically executed (\gexec treats each row of the previous query's result as an SQL statement and runs it).

Let us see what the SELECT statement produced …

 CREATE TABLE a1 (id int);
 CREATE TABLE a2 (id int);
 CREATE TABLE a3 (id int);
 CREATE TABLE a4 (id int);
 CREATE TABLE a5 (id int);

And now let us see what PostgreSQL does:

ERROR:  out of shared memory
HINT:  You might need to increase max_locks_per_transaction.
ERROR:  current transaction is aborted, commands ignored until end of transaction block
ERROR:  current transaction is aborted, commands ignored until end of transaction block
ERROR:  current transaction is aborted, 
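The size of the lock table is fixed at server start: PostgreSQL reserves room for roughly max_locks_per_transaction * (max_connections + max_prepared_transactions) object locks, shared across all sessions. The fix is therefore to raise the parameter in postgresql.conf and restart; the value below is only an illustration, so size it for your own workload:

```
# postgresql.conf (changing this requires a server restart)
# Default is 64; each unit costs only a little shared memory,
# so some headroom is cheap compared to hitting the error.
max_locks_per_transaction = 256
```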

DigitalOcean is a cloud service provider, more of an IaaS (Infrastructure-as-a-Service) provider, suitable for small to medium-scale businesses. You can get to know more about DigitalOcean here. What it does is a bit different from other cloud vendors like AWS or Azure, and it is not heavily global yet; take a look at this video which compares DigitalOcean with AWS.

They provide a geographically distributed computing platform in the form of virtual machines, wherein businesses can deploy their applications on cloud infrastructure in an easy, fast and flexible manner. Their core focus is to provide cloud environments which are highly flexible, easy to set up, and able to scale for various types of workloads.

What attracted me to DigitalOcean is the “droplets” service. Droplets are Linux-based VMs which can be created standalone or as part of a larger cloud infrastructure, with a choice of Linux-flavoured operating systems like CentOS, Ubuntu, etc.

PostgreSQL on DigitalOcean

With DigitalOcean, PostgreSQL environments can be built in two ways: either manually from scratch using droplets (Linux-based VMs only), or by using managed services.

DigitalOcean started managed services for PostgreSQL with the intention of speeding up the provisioning of database servers in the form of VMs on a large cloud infrastructure; otherwise, the only way to build PostgreSQL environments is manually, using droplets. The capabilities supported by the managed services are high availability, automatic failover, logging, and monitoring. Alerting capability does not exist yet.

The managed services are more or less similar to AWS RDS. The PostgreSQL instances can only be accessed through the UI; there is no access to the host running the database instance. Managing, monitoring, parameter configuration: everything must be done from the UI.

PostgreSQL Compatibility with DigitalOcean

You can build PostgreSQL environments on Di