Plotting the Czech Cadastre Land Use with d3: Data Extraction (part I)
This post is the first part of the upcoming series summarizing the process of visualizing landuse data with bash, PostgreSQL and d3.js. Read other parts:
- you’re reading it now
- Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II)
- Plotting the Czech Cadastre Land Use with d3: Data Transformation (part III)
Czech Office for Surveying, Mapping and Cadastre has recently published lot of data via Atom feed. There’s pretty small and a bit boring dataset included, featuring quarterly updated landuse-related values for all 13,091 cadastral areas:
- absolute number of land lots within given category (arable land, forests, etc.)
- absolute area of land lots within given category
Data are published as CSV files linked from the Atom feed. Sadly, they come windows-1250 encoded, using Windows line endings, with trailing semicolons and header rows using diacritics.
ETL process
Before the d3 viz can be crafted, it’s necessary to:
- extract CSV data from the URLs provided via the Atom feed
- transform those data into a relational database, do some math
- load data into a d3.js viz
- profit (as usual)
Extract
#!/bin/bash
# extract.sh -f YYYYMMDD
while [[ $# -gt 1 ]]
do
key="$1"
case $key in
-f|--file)
FILE="$2"
shift # past argument
;;
*)
# unknown option
;;
esac
shift # past argument or value
done
URL=http://services.cuzk.cz/sestavy/UHDP/UHDP-
CSVFILE=$FILE.csv
CSVUTF8FILE=${CSVFILE%.*}.utf.csv
URL+=$CSVFILE
echo "downloading $URL"
wget -q $URL -O $CSVFILE
if [[ $? != 0 ]]; then
rm -f $CSVFILE
echo "download failed"
exit
fi
echo "converting to utf-8"
iconv -f WINDOWS-1250 -t UTF-8 $CSVFILE -o $CSVUTF8FILE && \
echo "modifying ${FILE}"
sed -i 's/^M$//' $CSVUTF8FILE && \
sed -i 's/\r$//' $CSVUTF8FILE && \
sed -i 's/;*$//g' $CSVUTF8FILE && \
sed -i '1d' $CSVUTF8FILE
echo "importing to database"
sed -e "s/\${DATE}/$FILE/g" extract.sql | psql -qAt --no-psqlrc
rm $CSVFILE $CSVUTF8FILE
This script downloads CSV file, deals with all the pitfalls mentioned above and, when done, copy
command within extract.sql
loads the data into a data_YYYYMMDD
table. Putting all the files into the one table would have saved me a lot of transformation SQL, yet it didn’t feel quite right though.
Transform
See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part II).
Load
See Plotting the Czech Cadastre Land Use with d3: Data Transformation (part III).