There are an almost unlimited number of articles on the web about how to find gaps in sequences in SQL. And it doesn't have to be very hard. Doing it in a "partitioned sequence" makes it a bit harder, but still not very hard. But when I turned to a window aggregate to do that, I was immediately told "hey, that's a good example of a window aggregate to solve your daily chores, you should blog about that". So here we go - yet another example of finding a gap in a sequence using SQL.
I have a database that is very simply structured - it's got a primary key made out of (groupid, year, month, seq), all integers. On top of that it has a couple of largish text fields and an fti field for full text search. (Initiated people will know right away which database this is). The sequence in the seq column resets to zero for each combination of (groupid, year, month). And I wanted to find out where there were gaps in it, and how big they were, to debug the tool that wrote the data into the database. This is really easy with a window aggregate:
SELECT * FROM (
SELECT
gropid,
year,
month,
seq,
seq-lag(seq,1) OVER
(PARTITION BY groupid, year, month
ORDER BY seq)
AS gap FROM mytable
) AS t
WHERE NOT
(t.gap=1)
ORDER BY groupid, year, month, seq
One advantage to using a window aggregate for this is that
we actually get the whole row back, and not just the primary key -
so it's easy enough to include all the data you need to figure
something out.
What about performance? I don't really have a big database
to test this on, so I can't say for sure. It's going to be a
sequential scan, since I look at the whole table,and not
just parts of it. It takes about 4 seconds to run over a table of
about a million rows, 2.7Gb, on a modest VM with no actual I/O
capacity to speak of and a very limited amount of memory, returning
about 100 rows. It's certainly by far fast enough for me in this
case.
And as a bonus, it found me two bugs in the loading script
and at least one bug in somebody elses code that I'm now wa
This video presentation by Dwight Merriman (MongoDB) at OSCON Data 2011 explores the role of NoSQL databases. He makes some very interesting points:
What I found most interesting about his presentation was the comparison to dynamically-typed languages. When dynamically-typed scripting languages started to become popular years ago, there was a sense that they were only for trivial applications, compared to compiled language that were for serious work. As CPUs became faster, scripting languages were increasingly used for production applications, particularly web applications that are modified frequently. There might be a similar pattern now in the (improper?) dismissal of NoSQL for serious applications.
In my previous post, I outlined the basics of a configuration-file and dependency-tracking SQL deployment architecture. But I left it off wanting to eliminate the need for such a file and still have it all work. This post outlines just how to do that by relying on VCS history to determine what changes need to be deployed or reverted.
The term "scalability" gets injected into database conversations often, but scalability discussions often end without a clear consensus. I believe this is because "scalability", like "replication", actually means different things to different people.
There are two dimensions to database scalability; the first dimension specifies if additional servers are involved:
I've been thinking a lot about SQL change management. I've written about this before, implemented a dubious implementation of SQL migrations, and used a dependency-tracking solution with its own set of challenges. Nothing has satisfied me. But I think I may finally have cracked this thing wide open.
PostgreSQL has aggregate functions called bool_and and bool_or which it's had for as far back as I can remember. What do they do? given rows of premises (things that resolve to booleans), bool_and will return true if all of the premises are true. Similarly bool_or will return true if any of the premises in the set of rows is true. What if however your boolean expressions are not in rows, but instead passed in as a sequence of arbitrary statements of questionable fact. We want a function like bool_or or bool_and that takes an arbitrary number of boolean arguments. Are there functions that fit the bill. Indeed there are, but they don't scream out and say I work with booleans because they fit into a class of function we discussed in The wonders of Any Element and that also happen to be variadic functions. These are none other than greatest and least and they are old timer functions that you can find in most versions of PostgreSQL. We'll demonstrate how to use all 4 with booleans in this article. It must be said that greatest and least are much more useful when applied to other data types like dates and numbers, but we were amused at the parallel with booleans.
While database software can be the cause of outages, for Postgres, it is often not the software but the hardware that causes failures — and storage is often the failing component. Magnetic disk is one of the few moving parts on a computer, and hence prone to breakage, and solid-state drives (SSDs) have a finite write limit.
While waiting for storage to start making loud noises or fail is an option, a better option is to use some type of monitoring that warns of storage failure before it occurs, e.g. enter SMART. SMART is a system developed by storage vendors that allows the operating system to query diagnostics on the drive and warn of unusual storage behavior before failure occurs. While read/write failures are reported by the kernel, SMART parameters often warn of danger before failure occurs. Below is the SMART output from a Western Digital (WDC) WD20EARX magnetic disk drive:

Joomla, one of the world’s most popular open source content management systems (CMS) used for everything from websites to blogs to Intranets, today announces the immediate availability of Joomla 2.5. Along with new features such as advanced search and automatic notification of Joomla core and extension updates, the Joomla CMS for the first time includes multi-database support with the addition of Microsoft SQL Server. Previous versions of Joomla were compatible exclusively with MySQL databases.
Way to go Joomla! But why don’t you guys mention PostgreSQL database in the main release story? Do you really think that MSSQL is more common choice for the database layer for any CMS? Seriously?
My presentation from SCALE 10x, “PostgreSQL Performance When It’s Not Your Job” is now available for download.
Postgres consists of roughly 1.1 million lines of C code, which is compiled into an executable with millions of CPU instructions. Of the many CPU machine-language instructions in the Postgres server executable, which one is the most important? That might seem like an odd question, and one that is hard to answer, but I think I know the answer.
You might wonder, "If Postgres is written in C, how would we find the most important machine-language instruction?" Well, there is a trick to that. Postgres is not completely written in C. There is a very small file (1000 lines) with C code that adds specific assembly-language CPU instructions into the executable. This file is called s_lock.h. It is an include file that is referenced in various parts of the server code that allows very fast locking operations. The C language doesn't supply fast-locking infrastructure, so Postgres is required to supply its own locking instructions for all twelve supported CPU architectures. (Operating system kernels do supply locking instructions, but they are much too slow to be used for Postgres.)
cd inst.json sudo yum install v8 v8-devel hg clone https://code.google.com/p/plv8js/ cd plv8js PATH=../bin:$PATH make USE_PGXS=1 PATH=../bin:$PATH make USE_PGXS=1 install cd .. bin/createdb testplv8 bin/psql -c 'create extension plv8; create language plv8;' testplv8
The new server is 2-10 times faster than my old 2003 server, but that 10x speedup is only possible for applications that:
Other observations:
CREATE or replace FUNCTION jmember (j json, key text )
RETURNS text
LANGUAGE plv8
IMMUTABLE
AS $function$
var ej = JSON.parse(j);
if (typeof ej != 'object')
return NULL;
return JSON.stringify(ej[key]);
$function$;
In reality we'd want something a bit more sophisticated than this,
but you can get the idea from this. Armed with this function we
could now create our index, using the functional index feature of
PostgreSQL:
Now, when we issue a query likeCREATE INDEX x_in_json ON mytable (jmember(jsonfield,'x'));
It should be able to use the index. This is reasonably analogous to a very simple use of MongoDB's ensureIndex() function.SELECT * FROM mytable WHERE jmember(jsonfield,'x') = 'foo';
SECURITY DEFINER feature of
PostgreSQL)._v2, mark the previous version as deprecated and after
all your software stack is rolled out to use that new function,
just drop the previous version. But if you are rolling
out new version of the whole stack onSlides (as of this moment) are here: Mistakes were made. I changed quite a bit of the beginning and end, given how bit the audience is. Previous talks, we’ve usually ended with a fun “omg, here’s the craziest story I know” session. I imagine we’ll get a little bit of that today.
Postgres folks will note a relevant picture on slide 13.
![]()
This is my first keynote! Thanks so much to SCALE for inviting me. There were at least 1500 registered attendees as of Friday, so looking forward to a big crowd.
11 people showed up for our first meeting of the new year. Thanks to Iovation for providing a comfortable space with pizza.
We will be having another PRP soon, as well as a YAMS hackathon. We may combine the two into one event. Watch this space for details.
Here are Tim’s slides from his Database Trending talk last night. I can’t wait to try this at home!
I converted his .odp slides to .ppt so’s I could a) upload them to WP (no .odp allowed!) and b) include his notes.
A few weeks ago, I finally replaced my eight-year-old home server. The age of my server, and its operating system, (BSD/OS, last officially updated in 2002) were a frequent source of amusement among Postgres community members. The new server is:
Super Micro 7046A-T 4U Tower Workstation 2 x Intel Xeon E5620 2.4GHz Quad-Core Processors Crucial 24GB Dual-Rank PC3-10600 DDR3 SDRAM Intel 160GB 320 Series SSD Drive 4 x Western Digital Caviar Green 2TB Hard Drives
As a followup to my previous blog entry, I want to show queries that allow users to analyze TOAST tables. First, we find the TOAST details about the test heap table:
SELECT oid, relname, reltoastrelid, reltoastidxid FROM pg_class where relname = 'test'; oid | relname | reltoastrelid | reltoastidxid -------+---------+---------------+--------------- 17172 | test | 17175 | 0
I've finally gotten Git support added into pg_extractor. This works pretty much the same as the SVN option did already. One important difference is that there are two options for committing
--git
This just does a local commit to a locally maintained repository
--gitpush
This does a local commit as well as push to an already configured remote repository
You use either one option or the other, not both. The Git options also expects a proper .gitconfig file for your environment to be set up for the user running pg_extractor. There is no option for passing the git username like SVN has (and I don't see a need for one). Remote repositories will also have to be configured in advance of using the push option.
An important thing to note about using svn or git options with pg_extractor is that it does not do any initial VCS setups on the folders it creates and outputs too. It's best to run it first without any VCS options to get an initial dump and perform a manual commit (and/or push with git). Then for any future runs of pg_extractor, use the VCS options to track changes.
As always, please report any bugs or issues!
The source for the pgEdit TextMate bundle is now available on GitHub at https://github.com/desoi/pgedit-textmate.