Articles in the SQL category
Imagine you run two servers hosting structurally identical databases and you need to transfer data from one to the other. Not very often, let’s say once a month. Your tables are neither small nor huge, let’s say millions of rows in general.
You’re going to use pg_dump and pipe it to psql, but the indices on your tables will slow you down a lot. That’s why you’ll want to drop all indices and constraints first (drop_indices_constraints.sql):
SELECT DISTINCT -- the join below yields one row per constraint column, hence the deduplication
    'ALTER TABLE ' ||
    tc.table_schema ||
    '.' ||
    tc.table_name ||
    ' DROP CONSTRAINT ' ||
    tc.constraint_name ||
    ';'
FROM information_schema.table_constraints tc
JOIN information_schema.constraint_column_usage ccu
    ON (tc.constraint_catalog = ccu.constraint_catalog
        AND tc.constraint_schema = ccu.constraint_schema
        AND tc.constraint_name = ccu.constraint_name)
WHERE tc.table_schema IN (SELECT unnest(string_to_array(:'schemas', ',')))
UNION ALL
SELECT
    'DROP INDEX IF EXISTS ' || schemaname || '.' || indexname || ';'
FROM pg_indexes
WHERE schemaname IN (SELECT unnest(string_to_array(:'schemas', ',')));
Then you will transfer the data:
pg_dump -a -t "schema1.*" -t "schema2.*" -O -d source -v | psql -h localhost -d target
And restore the already dropped indices and constraints afterwards (create_indices_constraints.sql):
WITH constraints AS (
    SELECT 'ALTER TABLE ' ||
        tc.table_schema ||
        '.' ||
        tc.table_name ||
        ' ADD CONSTRAINT ' ||
        tc.constraint_name ||
        ' ' ||
        tc.constraint_type ||
        '(' ||
        -- key_column_usage provides ordinal_position, so the column order is preserved
        string_agg(kcu.column_name::text, ', ' ORDER BY kcu.ordinal_position) ||
        ');' def,
        tc.table_schema,
        tc.table_name,
        tc.constraint_name
    FROM information_schema.table_constraints tc
    JOIN information_schema.key_column_usage kcu
        ON (tc.constraint_catalog = kcu.constraint_catalog
            AND tc.constraint_schema = kcu.constraint_schema
            AND tc.constraint_name = kcu.constraint_name)
    WHERE tc.table_schema IN (SELECT unnest(string_to_array(:'schemas', ',')))
        AND tc.constraint_type = 'PRIMARY KEY' -- only primary keys are recreated here
    GROUP BY
        tc.table_schema,
        tc.table_name,
        tc.constraint_name,
        tc.constraint_type
)
SELECT def FROM constraints
UNION ALL
SELECT indexdef || ';'
FROM pg_indexes
WHERE schemaname IN (SELECT unnest(string_to_array(:'schemas', ',')))
AND NOT EXISTS (
SELECT 1 FROM
constraints c
WHERE pg_indexes.schemaname = c.table_schema
AND pg_indexes.tablename = c.table_name
AND pg_indexes.indexname = c.constraint_name
);
A few sidenotes
- Run the second piece of code (the create script) first to store the definitions before you drop anything. If you forget, you can still generate them from the source database, as both databases are structurally identical.
- Notice the :schemas variable. Variable assignment is one of the psql features I really like; see the snippet after this list.
- Notice DROP INDEX IF EXISTS and the CTE used in the create code. That’s due to the fact that dropping a constraint obviously drops its underlying index as well, and you don’t want to drop something that doesn’t exist or create something that already exists.
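To illustrate the variable interpolation (a minimal sketch with made-up schema names - the :'schemas' syntax interpolates the variable as a quoted string literal):
# feed the query through stdin; psql interpolates variables in scripts
echo "SELECT unnest(string_to_array(:'schemas', ','));" | psql -v schemas='schema1,schema2' -d target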
A bash script putting it all together might look as follows:
# store indices and constraint definitions
psql -qAt -d target -v schemas='schema1','schema2' -f create_indices_constraints.sql > create.sql
# drop indices and constraints
psql -qAt -d target -v schemas='schema1','schema2' -f drop_indices_constraints.sql | psql -d target
# load data
pg_dump -a -t "schema1.*" -t "schema2.*" -O -d source -v | psql -h localhost -d target
# renew indices and constraints
psql -qAt -d target -f create.sql
PostgreSQL foreign data wrappers are used to connect a PostgreSQL database to different data sources, e.g. other SQL databases, CSV files, XLS spreadsheets, etc.
The one I’ve been interested in for several months is Paul Ramsey’s OGR FDW - it gives you access to OGR-supported spatial formats directly from your database. No more shapefiles lying around?
Each foreign data wrapper should have three basic components:
- a foreign server object
- a foreign user mapping - not necessary if you’re not connecting to another database
- foreign table(s)
I got some data about rivers and dams from DIBAVOD open datasets to play with.
First define the foreign server object:
CREATE SERVER dibavod
FOREIGN DATA WRAPPER ogr_fdw
OPTIONS (
datasource '/downloads/dibavod',
format 'ESRI Shapefile',
config_options 'SHAPE_ENCODING=CP1250'
);
Note that OGR-specific driver configuration options are available inside config_options. In the case of ESRI Shapefiles, the datasource is the directory your files reside in.
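If you don’t know the layer names beforehand, ogrinfo will list them (a sketch, assuming the same datasource directory as above):
# list the layers available in the datasource
ogrinfo /downloads/dibavod
# print the schema of a single layer without dumping its features
ogrinfo -so /downloads/dibavod A02_Vodni_tok_JU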
Let’s create the PostgreSQL foreign tables (use ogrinfo or Paul’s ogr_fdw_info to list the columns):
CREATE FOREIGN TABLE rivers (
fid integer,
utokj_id numeric,
utokjn_id numeric,
utokjn_f numeric,
prprop_z integer,
ex_jh integer,
pozn text,
shape_leng numeric,
naz_tok text,
idvt integer,
tok_id numeric,
shape_len numeric,
geom geometry(LINESTRING, 5514)
)
SERVER dibavod
OPTIONS (layer 'A02_Vodni_tok_JU');
CREATE FOREIGN TABLE dams (
fid integer,
objectid integer,
naz_na text,
nadr_gid numeric,
kota_hladi numeric,
hloubka numeric,
zatop_ploc numeric,
objem numeric,
kota_hraz numeric,
kota_preli numeric,
kota_vypus numeric,
plocha_m2 numeric,
shape_area numeric,
shape_len numeric,
geom geometry(MULTIPOLYGON, 5514)
)
SERVER dibavod
OPTIONS (layer 'A05_Vodni_nadrze');
Note the fid column - it is required for write access to the underlying datasource.
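A quick sanity check that the foreign table actually reads the shapefile (the columns are those defined above; EPSG:5514 is a planar CRS, so lengths come out in meters):
-- first five river names together with their lengths
SELECT naz_tok, ST_Length(geom) AS length
FROM rivers
LIMIT 5;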
Things to remember:
- foreign tables mean no constraints nor indices
- no indices mean spatial queries are terribly slow compared to native PostGIS tables
- I like the idea of CREATE UNLOGGED TABLE dams2 AS SELECT * FROM dams, though I’m not sure what to use it for yet; one option is sketched below
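One possible use, as a hedged sketch: materialize the foreign table into a local copy and index that, so spatial queries get fast again (dams_local is a made-up name):
-- unlogged tables skip WAL, so the copy is fast (but not crash-safe)
CREATE UNLOGGED TABLE dams_local AS SELECT * FROM dams;
-- the local copy, unlike the foreign table, can be indexed
CREATE INDEX dams_local_geom_idx ON dams_local USING gist (geom);
ANALYZE dams_local;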
Using a JOIN clause
All my GIS life I’ve been using a simple JOIN clause to find the row with id = previous_id + 1. In other words, imagine a simple table with no indices:
CREATE TABLE test (id integer);
INSERT INTO test SELECT i FROM generate_series(1,10000000) i;
Let’s retrieve the next row for each row in that table:
SELECT a.id, b.id
FROM test a
LEFT JOIN test b ON (a.id + 1 = b.id); -- note the LEFT JOIN is needed to get the last row as well
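The plans below show actual times, so they were presumably obtained with EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT a.id, b.id
FROM test a
LEFT JOIN test b ON (a.id + 1 = b.id);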
The execution plan looks like this:
Hash Join (cost=311087.17..953199.41 rows=10088363 width=8) (actual time=25440.770..79591.869 rows=10000000 loops=1)
Hash Cond: ((a.id + 1) = b.id)
-> Seq Scan on test a (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.588..10801.584 rows=10000001 loops=1)
-> Hash (cost=145574.63..145574.63 rows=10088363 width=4) (actual time=25415.282..25415.282 rows=10000001 loops=1)
Buckets: 16384 Batches: 128 Memory Usage: 2778kB
-> Seq Scan on test b (cost=0.00..145574.63 rows=10088363 width=4) (actual time=0.422..11356.108 rows=10000001 loops=1)
Planning time: 0.155 ms
Execution time: 90134.248 ms
If we add an index with CREATE INDEX ON test (id), the plan changes:
Merge Join (cost=0.87..669369.85 rows=9999844 width=8) (actual time=0.035..56219.294 rows=10000001 loops=1)
Merge Cond: (a.id = b.id)
-> Index Only Scan using test_id_idx on test a (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.015..11101.937 rows=10000001 loops=1)
Heap Fetches: 0
-> Index Only Scan using test_id_idx on test b (cost=0.43..259686.10 rows=9999844 width=4) (actual time=0.012..11827.895 rows=10000001 loops=1)
Heap Fetches: 0
Planning time: 0.244 ms
Execution time: 65973.421 ms
Not bad.
Using a window function
Window functions are real fun. They’re great if you’re doing counts, sums or ranks by groups. And, to my surprise, they’re great at finding the next rows as well.
With the same test table, we retrieve the next row for each row with the following query:
SELECT id, lead(id) OVER (ORDER BY id)
FROM test;
How does that score without an index? Better than the JOIN clause.
WindowAgg (cost=1581246.90..1756294.50 rows=10002720 width=4) (actual time=28785.388..63819.071 rows=10000001 loops=1)
-> Sort (cost=1581246.90..1606253.70 rows=10002720 width=4) (actual time=28785.354..40117.899 rows=10000001 loops=1)
Sort Key: id
Sort Method: external merge Disk: 136848kB
-> Seq Scan on test (cost=0.00..144718.20 rows=10002720 width=4) (actual time=0.020..10797.961 rows=10000001 loops=1)
Planning time: 0.242 ms
Execution time: 73391.024 ms
And it works even better if indexed. It’s actually ~1.5× faster than the JOIN way.
WindowAgg (cost=0.43..409770.03 rows=10002720 width=4) (actual time=0.087..35647.815 rows=10000001 loops=1)
-> Index Only Scan using test_id_idx on test (cost=0.43..259729.23 rows=10002720 width=4) (actual time=0.059..11310.879 rows=10000001 loops=1)
Heap Fetches: 0
Planning time: 0.247 ms
Execution time: 45388.202 ms
It reads well and the purpose of such a query is pretty obvious.
Filtering really big (millions of rows) point datasets by distance can get tricky when solved with the wrong PostGIS functions. Without spatial indexes, PostGIS would take ages to finish such a task.
Someone over at StackExchange asked why the following query had been running for 15 hours (!) with no result:
SELECT
a.gid,
b.gid,
ST_Distance(a.geom,b.geom)
FROM
shp1 a,
shp2 b
WHERE
ST_Intersects(
ST_Difference(
ST_Buffer(a.geom,2000),
ST_Buffer(a.geom,500)
),
b.geom
) AND
abs(a.value - b.value) > 400;
The answer is quite simple: because it was using the wrong functions. Let’s see:
- ST_Distance() does not use a spatial index; it’s a plain mathematical calculation. It’s expensive, and for point datasets it can be replaced with a spatial operator.
- ST_Buffer() will definitely take time to build polygons around points. And it’s being run twice!
- ST_Difference() needs even more time to compare the buffers and return just the portion of space they don’t share. Isn’t it a huge waste to create buffers and then throw them away?
- ST_Intersects() finally checks whether the point should be included in the result.
That was a great challenge to test my knowledge of how PostGIS works and here’s my shot at it:
SELECT * FROM (
SELECT
a.gid,
b.gid,
a.geom <-> b.geom distance
FROM
shp1 a, shp2 b
WHERE
abs(a.value - b.value) > 400 AND
ST_DWithin(a.geom, b.geom, 2000)
) sq
WHERE
distance > 500;
- I use <->, a.k.a. the geometry distance centroid operator, instead of ST_Distance(). It’s cheap to compute for points, and in an ORDER BY clause it can even take advantage of a spatial index. And it’s perfectly fine to use with a point dataset, because what is the centroid of a bounding box of a point? That point, exactly. The spatial indexes have to be built beforehand; see the sketch after this list.
- Buffers are not necessary to check whether a geometry lies within a certain distance of another one. That’s what ST_DWithin() was made for. With the inner WHERE clause I get all the points lying at most 2,000 meters from the current one and having an absolute value difference bigger than 400. ST_DWithin() will make use of any spatial index available, unlike ST_Distance().
- The outer WHERE clause throws away all the points closer than 500 meters. Remember, the previous step already left us with only those no further than 2,000 meters.
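For completeness, building the spatial indexes mentioned above might look like this (a sketch, assuming the shp1/shp2 tables from the query):
CREATE INDEX shp1_geom_idx ON shp1 USING gist (geom);
CREATE INDEX shp2_geom_idx ON shp2 USING gist (geom);
-- refresh the statistics so the planner actually uses the indexes
ANALYZE shp1;
ANALYZE shp2;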
It took PostGIS 1,060,545.590 ms (~17 minutes) on my Quad-Core Intel® Core™ i5-4210M CPU @ 2.60GHz with 4 GB RAM and a 500 GB SSD, running PostgreSQL 9.3 and PostGIS 2.1, with two point datasets of 4M and 300K rows, respectively.
At CleverMaps we heavily rely on the cadastre of real estate, which is probably the biggest data source in my country. Armed with their excellent knowledge of this dataset, my teammates often supply me with all kinds of weird challenges.
Give me the biggest land users nearby
The latest one: find the biggest land users in the surrounding cadastral communities for each cadastral community (~13K of them). Here’s the query I tackled it with:
WITH users_ AS (
SELECT
cad_code,
id,
zipcode,
city,
concat_ws(' ',street, concat_ws('/', house_number, street_number)) as street,
name,
'Users with more than 10 ha'::text note,
SUM(acreage) area
FROM land_blocks -- being a table with info about all the agricultural land blocks
JOIN users u ON id_uz = id
GROUP BY cad_code, u.id
HAVING SUM(acreage) > 10
),
ints AS (
SELECT
ku.cad_code as community,
ku2.cad_code as surrounding,
ku2.cad_name
FROM cadastral_community ku
JOIN cadastral_community ku2
ON ST_Intersects(ku.geom, ku2.geom)
WHERE ku.cad_code <> ku2.cad_code
)
SELECT
DISTINCT ON (surrounding, cad_name, u.zipcode, u.city, u.street, u.name)
surrounding,
cad_name,
u.zipcode,
u.city,
u.street,
u.name,
u.note,
u.area
FROM
users_ u
JOIN ints
ON cad_code = community;
A few things to note:
- concat_ws() is a great function for joining values that might be NULL. If such a value is found, it is skipped and the function continues with the next one (if any). Thus, you’ll never get a string ending with a trailing slash (Street name 55/); see the snippet after this list.
- With the users_ CTE I get the list of owners having more than 10 hectares of land in each cadastral community. This gives me the inverse of what I want (but if I know the biggest owners in a cadastral community, I know these are the ones that should be listed for the surrounding c. communities).
- This putting-it-all-together step is done with the ints CTE, which outputs the list of surrounding c. communities for each of them.
- DISTINCT ON cleans up the list so the same owners don’t appear more than once for any given c. community.
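A tiny illustration of the NULL-skipping behavior (with made-up values):
-- the NULL street_number is skipped, so no trailing slash is produced
SELECT concat_ws(' ', 'Street name', concat_ws('/', '55', NULL));
-- returns: Street name 55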
Writing this makes me realize the list should probably be sorted by area, so that only the occurrence with the biggest value is kept for each c. community. A simple ORDER BY should deal with this just fine; a sketch follows. Or, even more sophisticated, use GROUP BY to output the total acreage in all the surrounding c. communities.
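A sketch of that refinement, reusing the users_ and ints CTEs from above to keep just the single biggest user per surrounding c. community:
SELECT DISTINCT ON (surrounding)
    surrounding,
    cad_name,
    u.name,
    u.area
FROM users_ u
JOIN ints ON cad_code = community
-- the biggest user sorts first and DISTINCT ON keeps exactly that row
ORDER BY surrounding, u.area DESC;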