Articles in the development category
Two years ago I was pretty happy with this little piece of code transforming degrees to a decimal value. Yesterday, I found a neater way to do the same:
let deg = [50, 30, 0]; // [degrees, minutes, seconds]

// dec = deg + min/60 + sec/3600; reduce starts with deg[0] as the accumulator,
// so curIndex is 1 for minutes and 2 for seconds
function degToDec(prev, cur, curIndex) {
    return prev + cur / Math.pow(60, curIndex);
}

deg.reduce(degToDec); // 50.5
Once you have an input array, that’s pretty much it. Love JavaScript.
Disclaimer: I’m an enthusiastic developer, but I do not code for a living. I’m just the ordinary guy who keeps editing the wrong file, wondering why the heck the changes are not being applied.
TL;DR: I do think npm might be the answer.
Wonderful world of JavaScript DevOps
When I first started using JavaScript on the server side with node.js, I felt overwhelmed by the numerous options to automate tasks. There was npm taking care of backend dependencies. Then I would build a frontend and find out about bower for handling frontend dependencies. Then it would be great to have some kind of minification/obfuscation/uglification/you-name-it task. And the build task. And the build:prod task. And how about an eslint task? And then I would end up spending hours doing nothing, just reading blogs about the tools used by others who do code for a living.
Intermezzo: I think my coding is slow. Definitely slower than yours. I’m getting better though.
Using the force
Looking back, I find it a bit stressful - how the heck do I choose the right tools? Where’s Yoda to help me out? Anyway, next to adopt after npm was bower. And I liked it, even though some packages were missing - but who cares as long as there is no better way, right? Except there is… I guess.
Automation was next in line to tackle. So I chose gulp without a bit of hesitation. It was the hype back then, even bigger than grunt. I had even heard of yeoman, but to this day I still don’t know what it actually does. And I’m happy with that.
A short summary so far:
- npm for backend dependencies
- bower for frontend dependencies
- gulp for running tasks
So far, so good.
Is Bower going to die?
Then I stumbled upon this tweet and started panicking. Or rather started to feel cheated. It took me time to set all this up and now it’s useless? Or what?
Seeing it now, I’m glad I read this. And I really don’t know what happened to Bower, if anything at all.
Keeping it simple
So Bower’s dying, what are you going to do about that? You’ll use npm instead! And you’ll have a single source of truth called package.json. You’ll resolve all the dependencies with a single npm install command and feel like a king. We’re down to two now - npm and gulp.
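For illustration, a frontend dependency that used to live in bower moves straight into package.json (the package name here is just an example):

npm install --save leaflet # frontend dependency recorded in package.json
npm install # one command restores everything on a fresh clone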
Gulp, Gulp everywhere!
When you get rid of Bower, the next thing you notice is that your gulpfile.js just got off the leash. It grew to ~160 lines of code and became a nightmare to manage.
So you split it into task files and a config file. What a relief. But you still realize that half of your package.json dependencies start with gulp-. And you hate it.
Webpack for the win
For me, a non-developer, setting up webpack wasn’t easy. I didn’t find the docs very helpful either. Reading the website for the first time, I didn’t even understand what it should be used for. I got it working eventually. And I got rid of gulp, gulp-connect, gulp-less, gulp-nodemon, gulp-rename, gulp-replace, gulp-task-listing and gutil. And the whole gulpfile.js. That was a big win for me.
But how do you run tasks?
Well…
npm run start-dev # which in turn calls the code below
npm run start-webpack & NODE_ENV=development nodemon server.js # where start-webpack does the following
node_modules/webpack-dev-server/bin/webpack-dev-server.js --quiet --inline --hot --watch
That’s it. If I need to build the code, I run npm run build, which calls other tasks from the scripts section of package.json (a sketch follows).
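For the record, here is a minimal sketch of what such a scripts section could look like - the start-dev and start-webpack entries mirror the commands above, while the build entry is my assumption:

# a minimal package.json sketch; "build" is assumed, the rest mirrors the commands above
cat > package.json <<'EOF'
{
  "name": "app",
  "scripts": {
    "start-webpack": "node_modules/webpack-dev-server/bin/webpack-dev-server.js --quiet --inline --hot --watch",
    "start-dev": "npm run start-webpack & NODE_ENV=development nodemon server.js",
    "build": "webpack -p"
  }
}
EOF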
That’s pretty much it. I don’t think it’s a silver bullet, but I feel like I finally found peace of mind for my future JavaScript development. At least for a month or so before some other guy comes to town.
Last time I described the setup of my OpenShift Twitter crawler and left it running to download data. It’s been more than two months since I started, and I’ve collected an interesting amount of data. I also made a simple ETL process to load it into my local PostGIS database, which I’d like to cover in this post.
Extract data
Each day is written to a separate sqlite file with a name like tw_day_D_M_YYYY. Bash is used to gzip all the files before downloading them from OpenShift.
#!/bin/bash
ssh openshift << EOF
cd app-root/data
tar czf twitter.tar.gz *.db
EOF
scp openshift:/var/lib/openshift/55e487587628e1280b0000a9/app-root/data/twitter.tar.gz ./data
cd data &&
tar -xzf twitter.tar.gz &&
cd -
echo "Extract done"
Transform data
The transformation part operates on the downloaded files and merges them into one big CSV file. That’s pretty straightforward with sqlite flags and some sed and tail commands.
#!/bin/bash
rm -rf ./data/csv
mkdir ./data/csv
for db in ./data/*.db; do
    FILENAME=$(basename $db)
    DBNAME=${FILENAME%%.db}
    CSVNAME=$DBNAME.csv
    echo "$DBNAME to csv..."
    sqlite3 -header -csv $db "select * from $DBNAME;" > ./data/csv/$CSVNAME
done
cd ./data/csv
touch tweets.csv
echo $(sed -n 1p $(ls -d -1 *.csv | head -n 1)) > tweets.csv # get column names
for csv in tw_*.csv; do
    echo $csv
    tail -n +2 $csv >> tweets.csv # append all lines except the header
done
Load data
In the last step, the data is loaded with the SQL \copy command.
#!/bin/bash
export PG_USE_COPY=YES
DATABASE=mzi_dizertace
SCHEMA=dizertace
TABLE=tweets
psql $DATABASE << EOF
DROP TABLE IF EXISTS $SCHEMA.$TABLE;
CREATE UNLOGGED TABLE $SCHEMA.$TABLE (id text, author text, author_id text, tweet text, created_at text, lon float, lat float, lang text);
\copy $SCHEMA.$TABLE FROM 'data/csv/tweets.csv' CSV HEADER DELIMITER ','
ALTER TABLE $SCHEMA.$TABLE ADD COLUMN wkb_geometry geometry(POINT, 4326);
UPDATE $SCHEMA.$TABLE SET wkb_geometry = ST_SetSRID(ST_MakePoint(lon, lat), 4326);
CREATE INDEX ${TABLE}_geom_idx ON $SCHEMA.$TABLE USING gist(wkb_geometry);
COMMIT;
EOF
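To check the load went well, a couple of quick sanity queries (same database and table names as above) might look like this:

# row count and how many tweets actually carry a geometry
psql mzi_dizertace -c "SELECT count(*) FROM dizertace.tweets;"
psql mzi_dizertace -c "SELECT count(*) FROM dizertace.tweets WHERE wkb_geometry IS NOT NULL;"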
First statistics
Some interesting charts and numbers follow.
More than a year ago I wrote about analyzing Twitter languages with the Streaming API. Back then I kept my laptop running for a week to download data. Not a comfortable way, especially if you decide to get more data. A one-year uptime doesn’t sound like anything you want to be part of. OpenShift by Red Hat seems to be an almost perfect replacement. Almost.
OpenShift setup
I started with a Node.js application running on one small gear. Once running, you can easily git push the code to your OpenShift repo and log in via SSH. I quickly found that simply copy-pasting my local solution wasn’t going to work, so I fixed it with some minor tweaks. That’s where the fun begins…
I based the downloader on Node.js a year ago, and I still don’t quite get how that piece of software works. Frankly, I don’t really care as long as it works.
Pitfalls
If your application doesn’t generate any traffic, OpenShift turns it off. It wakes up once someone visits again. I had no idea about that and spent some time trying to stop that behavior. Obviously, I could have scheduled a cron job on my laptop pinging it every now and then. Luckily, OpenShift can run cron jobs itself. All you need is to embed a cron cartridge into the running application (and install a bunch of ruby dependencies beforehand).
rhc cartridge add cron-1.4 -a app-name
Then create a .openshift/cron/{hourly,daily,weekly,monthly} folder in the git repository and put your script running a simple curl command into one of those.
curl http://social-zimmi.rhcloud.com > /dev/null
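Putting it together might look roughly like this - the script name keep-alive is made up, but the layout follows the convention above:

mkdir -p .openshift/cron/hourly
cat > .openshift/cron/hourly/keep-alive <<'EOF'
#!/bin/bash
curl http://social-zimmi.rhcloud.com > /dev/null
EOF
chmod +x .openshift/cron/hourly/keep-alive
git add .openshift/cron && git commit -m "Add keep-alive cron job" && git push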
Another problem was just around the corner. Once in a while, the app stopped writing data to the database without saying a word. What helped was restarting it - the only automatic way to do so being a git push command. Sadly, I haven’t found a way to restart the app from within itself; it probably can’t be done.
When you git push, the gear stops, builds, deploys and restarts the app. By using hot deployment you can minimize the downtime. Just put the hot_deploy file into the .openshift/markers folder.
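A minimal sketch of enabling it:

# the marker file is empty; its presence is what matters
mkdir -p .openshift/markers
touch .openshift/markers/hot_deploy
git add .openshift/markers/hot_deploy && git commit -m "Enable hot deployment" && git push

With the marker in place, an empty commit restarts the gear without tearing everything down: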
git commit --allow-empty -m "Restart gear" && git push
This solved the problem until I realized that every restart deleted all the data collected so far. If your data are to stay safe and sound, save them in process.env.OPENSHIFT_DATA_DIR (which is app-root/data).
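In bash terms the idea is simply this (the fallback to ./data for local runs is my own convention):

# $OPENSHIFT_DATA_DIR survives restarts; fall back to ./data when running locally
DB_DIR="${OPENSHIFT_DATA_DIR:-./data}"
touch "$DB_DIR/tw_day_$(date +%-d_%-m_%Y).db" # matches the tw_day_D_M_YYYY naming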
Anacron to the rescue
How do you push an empty commit once a day? With cron of course. Even better, anacron.
mkdir ~/.anacron
cd ~/.anacron
mkdir cron.daily cron.weekly cron.monthly spool etc
cat <<EOT > ~/.anacron/etc/anacrontab
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:$HOME/bin
HOME=$HOME
LOGNAME=$USER
1 5 daily-cron nice run-parts --report $HOME/.anacron/cron.daily
7 10 weekly-cron nice run-parts --report $HOME/.anacron/cron.weekly
@monthly 15 monthly-cron nice run-parts --report $HOME/.anacron/cron.monthly
EOT
cat <<EOT >> ~/.zprofile # I use zsh shell
rm -f $HOME/.anacron/anacron.log
/usr/sbin/anacron -t /home/zimmi/.anacron/etc/anacrontab -S /home/zimmi/.anacron/spool &> /home/zimmi/.anacron/anacron.log
EOT
Anacron is to a laptop what cron is to a 24/7 server. It runs automatic jobs while the laptop is on; if it’s off when a job is due, the job runs once the OS boots. Brilliant idea.
It runs the following code for me to keep the app writing data to the database.
#!/bin/bash
workdir='/home/zimmi/documents/zimmi/dizertace/social'
logfile=$workdir/restart-gear.log
date > $logfile
{
    HOME=/home/zimmi
    cd $workdir && \
    git merge origin/master && \
    git commit --allow-empty -m "Restart gear" && \
    git push && \
    echo "Success" ;
} >> $logfile 2>&1
UPDATE: I spent a long time debugging “Permission denied (publickey).”-like errors. What seems to help is:
- Use id_rsa instead of any other SSH key
- Put a new entry into the ~/.ssh/config file (sketched below)
I don’t know which one did the magic though.
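For reference, such an entry could look roughly like this - the openshift alias matches the one used in the scripts above and the gear ID comes from the scp path, but the rest is an educated guess:

cat >> ~/.ssh/config <<'EOF'
Host openshift
    HostName social-zimmi.rhcloud.com
    User 55e487587628e1280b0000a9
    IdentityFile ~/.ssh/id_rsa
EOF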
I’ve been harvesting Twitter for a month, at about 10-15K tweets a day (I’m only interested in the Czech Republic). Between 1/6 and 1/5 of them are located with latitude and longitude. More on this next time.
I’ve seen a bunch of questions on GIS StackExchange recently related to the SFCGAL extension for PostGIS 2.2. The great news is that it can be installed with one simple query: CREATE EXTENSION postgis_sfcgal;. The not-so-great news is that you have to compile it from source on Ubuntu-based OSes (14.04), as recent versions of the required packages are not available in the repositories.
I tested my solution on elementary OS 0.3.1 based on Ubuntu 14.04. And it works! It installs PostgreSQL 9.4 from repositories together with GDAL and GEOS and some other libs PostGIS depends on. PostGIS itself, CGAL, Boost, MPFR and GMP are built from source.
Here comes the code (commented where needed).
sudo -i
echo "deb http://apt.postgresql.org/pub/repos/apt/ trusty-pgdg main" | tee -a /etc/apt/sources.list
wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
apt-get update
apt-get install -y postgresql-9.4 \
postgresql-client-9.4 \
postgresql-contrib-9.4 \
libpq-dev \
postgresql-server-dev-9.4 \
build-essential \
libgeos-c1 \
libgdal-dev \
libproj-dev \
libjson0-dev \
libxml2-dev \
libxml2-utils \
xsltproc \
docbook-xsl \
docbook-mathml \
cmake \
gcc \
m4 \
icu-devtools
exit # leave root otherwise postgis will choke
cd /tmp
touch download.txt
cat <<EOT >> download.txt
https://gmplib.org/download/gmp/gmp-6.0.0a.tar.bz2
https://github.com/Oslandia/SFCGAL/archive/v1.2.0.tar.gz
http://www.mpfr.org/mpfr-current/mpfr-3.1.3.tar.gz
http://downloads.sourceforge.net/project/boost/boost/1.59.0/boost_1_59_0.tar.gz
https://github.com/CGAL/cgal/archive/releases/CGAL-4.6.3.tar.gz
http://download.osgeo.org/postgis/source/postgis-2.2.0.tar.gz
EOT
cat download.txt | xargs -n 1 -P 8 wget # make wget a little bit faster
tar xjf gmp-6.0.0a.tar.bz2
tar xzf mpfr-3.1.3.tar.gz
tar xzf v1.2.0.tar.gz
tar xzf boost_1_59_0.tar.gz
tar xzf CGAL-4.6.3.tar.gz
tar xzf postgis-2.2.0.tar.gz
CORES=$(nproc)
if [[ $CORES -gt 1 ]]; then
    CORES=$(expr $CORES - 1) # be nice to your PC
fi
cd gmp-6.0.0
./configure && make -j $CORES && sudo make -j $CORES install
cd ..
cd mpfr-3.1.3
./configure && make -j $CORES && sudo make -j $CORES install
cd ..
cd boost_1_59_0
./bootstrap.sh --prefix=/usr/local --with-libraries=all && sudo ./b2 install # there might be some warnings along the way, don't panic
echo "/usr/local/lib" | sudo tee /etc/ld.so.conf.d/boost.conf
sudo ldconfig
cd ..
cd cgal-releases-CGAL-4.6.3
cmake . && make -j $CORES && sudo make -j $CORES install
cd ..
cd SFCGAL-1.2.0/
cmake . && make -j $CORES && sudo make -j $CORES install
cd ..
cd postgis-2.2.0
./configure \
--with-geosconfig=/usr/bin/geos-config \
--with-xml2config=/usr/bin/xml2-config \
--with-projdir=/usr/share/proj \
--with-libiconv=/usr/bin \
--with-jsondir=/usr/include/json \
--with-gdalconfig=/usr/bin/gdal-config \
--with-raster \
--with-topology \
--with-sfcgal=/usr/local/bin/sfcgal-config && \
make && make cheatsheets && sudo make install # deliberately one CPU only
sudo -u postgres createdb spatial_template
sudo -u postgres psql -d spatial_template -c "CREATE EXTENSION postgis;"
sudo -u postgres psql -d spatial_template -c "CREATE EXTENSION postgis_topology;"
sudo -u postgres psql -d spatial_template -c "CREATE EXTENSION postgis_sfcgal;"
sudo -u postgres psql -d spatial_template -c "SELECT postgis_full_version();"