More than a year ago I wrote about analyzing Twitter languages with the Streaming API. Back then I kept my laptop running for a week to download the data. Not a comfortable way to do it, especially if you decide to get more data: a year of laptop uptime doesn’t sound like anything you want to be part of. OpenShift by Red Hat seems to be an almost perfect replacement. Almost.
OpenShift setup
I started with a Node.js application running on one small gear. Once it is running, you can easily git push the code to your OpenShift repo and log in via SSH. I quickly found that simply copy-pasting my local solution wasn’t going to work, and fixed it with some minor tweaks. That’s where the fun begins…
I based the downloader on Node.js a year ago, and I still don’t quite get how that piece of software works. Frankly, I don’t really care as long as it works.
Pitfalls
If your application doesn’t generate any traffic, OpenShift shuts it down and wakes it up once someone visits it again. I had no idea about that and spent some time trying to stop that behavior. Obviously, I could have scheduled a cron job on my laptop to ping it every now and then, but luckily OpenShift can run cron jobs itself. All you need is to embed a cron cartridge into the running application (and install a bunch of Ruby dependencies beforehand).
rhc cartridge add cron-1.4 -a app-name
Then create .openshift/cron/{hourly,daily,weekly,monthly}
folder in the git repository and put your script running a simple curl command into one of those.
curl http://social-zimmi.rhcloud.com > /dev/null
Another problem was just around the corner. Once in a while, the app stopped writing data to the database without saying a word. What helped was restarting it, and the only automatic way to do so is a git push command. Sadly, I haven’t found a way to restart the app from within itself; it probably can’t be done.
When you git push, the gear stops, builds, deploys and restarts the app. By using hot deployment you can minimize the downtime: just put a hot_deploy file into the .openshift/markers folder.
git commit --allow-empty -m "Restart gear" && git push
This solved the problem until I realized that every restart deleted all the data collected so far. If your data are to stay safe and sound, save them in process.env.OPENSHIFT_DATA_DIR (which is app-root/data).
Anacron to the rescue
How do you push an empty commit once a day? With cron, of course. Or even better, anacron.
mkdir ~/.anacron
cd ~/.anacron
mkdir cron.daily cron.weekly cron.monthly spool etc
cat <<EOT > ~/.anacron/etc/anacrontab
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:$HOME/bin
HOME=$HOME
LOGNAME=$USER
1 5 daily-cron nice run-parts --report $HOME/.anacron/cron.daily
7 10 weekly-cron nice run-parts --report $HOME/.anacron/cron.weekly
@monthly 15 monthly-cron nice run-parts --report $HOME/.anacron/cron.monthly
EOT
cat <<EOT >> ~/.zprofile # I use zsh shell
rm -f $HOME/.anacron/anacron.log
/usr/sbin/anacron -t /home/zimmi/.anacron/etc/anacrontab -S /home/zimmi/.anacron/spool &> /home/zimmi/.anacron/anacron.log
EOT
Anacron is to a laptop what cron is to a 24/7 server: it runs scheduled jobs while the machine is up, and if the machine was off when a job was due, it runs the job once the OS boots. Brilliant idea.
It runs the following script for me to keep the app writing data to the database.
#!/bin/bash
workdir='/home/zimmi/documents/zimmi/dizertace/social'
logfile=$workdir/restart-gear.log
date > $logfile
{
HOME=/home/zimmi
cd $workdir && \
git merge origin/master && \
git commit --allow-empty -m "Restart gear" && \
git push && \
echo "Success" ;
} >> $logfile 2>&1
UPDATE: I spent a long time debugging “Permission denied (publickey).”-like errors. What seems to help is:
- using id_rsa instead of any other SSH key,
- putting a new entry into the ~/.ssh/config file.
I don’t know which one did the magic though.
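For illustration, such a ~/.ssh/config entry might look roughly like this (the host pattern and the user value are placeholders for your app’s domain and gear UUID):

```
Host *.rhcloud.com
    User your-gear-uuid
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
```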
I’ve been harvesting Twitter for a month now, collecting about 10-15K tweets a day (I’m only interested in the Czech Republic). Roughly 1/6 to 1/5 of them come with latitude and longitude. More on this next time.
I was recently made to use ArcGIS Server together with OpenLayers 3, as one of the projects I’ve been working on demands that these rather different tools work together.
tl;dr: I hate Esri.
I found myself in need of accessing secured layers published via WMS on ArcGIS Server, using the username and password I was given, so here’s a little how-to for anyone who has to do the same.
Let’s start with a simple ol.layer.Image and pretend this is the secured layer we’re looking for:
var layer = new ol.layer.Image({
    extent: extent,
    source: new ol.source.ImageWMS(/** @type {olx.source.ImageWMSOptions} */ ({
        url: url,
        params: {
            'LAYERS': 'layer',
            'CRS': 'EPSG:3857'
        }
    }))
});
We need to retrieve the token, so we define a function:
function retrieveToken(callback) {
    var req = new XMLHttpRequest();
    req.onload = function() {
        if (req.status === 200) {
            var response = JSON.parse(req.responseText);
            if (response.contents) {
                callback(response.contents); // response.contents is where the token is stored
            }
        }
    };
    req.open("GET", "http://server.address/arcgis/tokens/?request=getToken&username=username&password=password&expiration=60", true);
    req.send();
}
I pass a parameter called callback - that’s a very important step; otherwise you would not be able to retrieve the token when you actually need it (AJAX stands for asynchronous). Now you just pass the token to the layer params like this:
retrieveToken(function(token) {
    layer.getSource().updateParams({
        'token': token
    });
});
When you open Firebug and inspect the Network tab, you should find the token URL parameter passed along with the WMS GetMap request.
A few side notes:
- Although you might be logged in to ArcGIS Server via the web interface, you might still need to pass the token URL param when trying to access the Capabilities document. I don’t know why, though.
- You should probably take care of calling retrieveToken() at a shorter interval than the token expiration is set to. Otherwise you might end up with an invalid token.
- You need to hide the username and password from anonymous users (I guess that’s only possible with a server-side implementation of selective JavaScript loading).
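The periodic refresh mentioned above can be sketched like this; the 0.8 safety margin is my arbitrary choice:

```javascript
// Refresh somewhat earlier than the token expires. The expiration is
// in minutes, as in the getToken request; 0.8 is an arbitrary margin.
function refreshIntervalMs(expirationMinutes) {
    return Math.floor(expirationMinutes * 60 * 1000 * 0.8);
}

// Usage sketch: for a 60-minute token, re-request it every ~48 minutes.
// setInterval(function() {
//     retrieveToken(function(token) {
//         layer.getSource().updateParams({'token': token});
//     });
// }, refreshIntervalMs(60));
```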
I am writing a diploma thesis focused on extracting spatial data from social networks. I have been working mainly with the Twitter API, and the results I have got so far look really promising. This post was written as a reaction to the many retweets I got when I shared one of my visualizations. It aims to make clear how to connect to the Twitter Streaming API using node.js, Leaflet and SQLite, and how to retrieve tweets so you can analyze them later.
If you have any further questions after reading this post, feel free to contact me via Twitter or e-mail. I must say right here that the code will be shared, as well as the map, but there are still some bugs/features I would like to remove/add.
On a side note: I have been studying cartography and GIS for the last five years at Masaryk University in Brno, Czech Republic. I am mostly interested in ways computers can make data handling easier. I like to code in Python.
Using Twitter Streaming API
As you probably know, Twitter offers three different APIs:
- REST API which is obviously RESTful. You can access almost every piece of information on Twitter with this one: tweets, users, places, retweets, followers…
- Search API used for getting search results. You can customize these by sending parameters with your requests.
- Streaming API, which I am going to tell you about. It is really different, as (again, obviously) it keeps streaming tweets from the moment you connect to the server. This means the connection, once made, has to stay open for as long as you want tweets coming to you. The important thing is that you get tweets delivered in real time via this stream, which implies you cannot use this API to get tweets that have already been tweeted.
To sum it up: You get a small sample of tweets in a real time as long as the connection to the server stays open.
What you need
To use any of the Twitter APIs, you need to authenticate yourself (or your app) against Twitter via the OAuth protocol. To be able to do so, you need a Twitter account, because only then can you create apps, obtain access tokens and get authenticated for API use.
And then, obviously, you need something to connect to the server with. I chose node.js because it seemed like a good tool for keeping a connection alive. I had also been interested in this technology for a couple of months but never really had a task to use it for.
The good thing about node.js is that it comes with lots of handy libraries. You get socket.io for streaming, ntwitter for using the Twitter API and sqlite3 for working with SQLite databases.
You also need something to store the data in. As mentioned, I picked SQLite for this task: it is lightweight and needs no server or configuration to run, which is just what I was looking for. Seems we are set to go, right?
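As a sketch of what storing a tweet might look like, an incoming tweet object can be flattened into a row for SQLite; the schema here (id, lang, lon, lat, text) is illustrative, not necessarily the one I used:

```javascript
// Flatten the fields we care about from a raw tweet object into an
// array matching the illustrative schema (id, lang, lon, lat, text).
function tweetToRow(tweet) {
    var coords = tweet.coordinates && tweet.coordinates.coordinates; // [lon, lat]
    return [
        tweet.id_str,
        tweet.lang,
        coords ? coords[0] : null,
        coords ? coords[1] : null,
        tweet.text
    ];
}

// With the sqlite3 module this would feed a prepared statement, e.g.:
// db.run('INSERT INTO tweets VALUES (?, ?, ?, ?, ?)', tweetToRow(tweet));
```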
Filtering the data
I guess none of you is interested in obtaining random tweets from around the world, and neither was I. I live in the Czech Republic and that is the area I want to get tweets from. How?
It is fairly simple: you tell Twitter with the locations parameter of the statuses/filter resource. This parameter specifies a set of bounding boxes to track.
To sum it up: you connect to the server and tell it you just want tweets from the area you specified with the locations parameter. The server understands and keeps you posted.
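With the ntwitter module mentioned earlier, opening such a filtered stream looks roughly like the sketch below; the helper and the bounding box values (a rough rectangle around the Czech Republic) are mine, so adjust them to your area:

```javascript
// Twitter's locations parameter is a comma-separated list of lon/lat
// pairs: south-west corner first, then north-east corner.
function locationsParam(west, south, east, north) {
    return [west, south, east, north].join(',');
}

// A rough bounding box around the Czech Republic (illustrative values).
var czechRepublic = locationsParam(12.09, 48.55, 18.87, 51.06);

// With ntwitter (OAuth credentials omitted), the stream is opened like:
// var twitter = require('ntwitter');
// var twit = new twitter({ consumer_key: '...', consumer_secret: '...',
//                          access_token_key: '...', access_token_secret: '...' });
// twit.stream('statuses/filter', { locations: czechRepublic }, function(stream) {
//     stream.on('data', function(tweet) { /* store the tweet */ });
// });
```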
Is it that simple?
No. Twitter decides whether to send you a tweet or not according to the value of its coordinates field. It goes like this:
- If the coordinates field is not empty, it gets tested against the bounding box. If it matches, the tweet is sent to the stream.
- If the coordinates field is empty but the place field is not, it is the place field that gets checked. If it intersects the bounding box to any extent, the tweet is sent to the stream.
- If both of the fields are empty, nothing is sent.
I decided to throw away the tweets with an empty coordinates field, because the accuracy of the value specified in the place field is generally very low and insufficient for my purposes. You still need to account for position inaccuracies of users’ devices, but that is not something we can deal with; let us just assume that geotagged tweets are accurate.
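That filtering decision, keeping only tweets whose coordinates field is filled in, can be sketched as a tiny predicate (the function name is mine):

```javascript
// Keep a tweet only if Twitter attached exact coordinates to it;
// place-only tweets are dropped as too inaccurate.
function isPreciselyGeotagged(tweet) {
    return !!(tweet.coordinates && tweet.coordinates.coordinates);
}
```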
Figure: Twitter seems not to be very accurate when matching tweets against bounding box.
Although, as you can see in the picture, they are not. Or they are, but Twitter is not good at telling. Besides, no country in the world is shaped like a rectangle, so we would need to clip the data anyway. That is where SQLite comes in, because I have been saving incoming tweets right into the database.
If you use any GUI manager (sqlitebrowser for Linux is just fine), you can easily export your data to a CSV file, load it into QGIS, clip it with the Natural Earth countries shapefile and save it as a GeoJSON file. It is then just a matter of a few lines of JavaScript to put the GeoJSON on a Leaflet map.
Displaying the data
Once a GeoJSON file is ready, it can be used to make an appealing viz and get a sense of what might be called “nationalities’ spatial patterns”. The lang field (stored in the database, remember?) of every tweet is used to colour the marker accordingly. Its value is a two-letter language code as specified in ISO 639-1.
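The colouring can be as simple as looking the lang code up in a palette when styling each marker; the palette and helper below are made up for illustration:

```javascript
// Illustrative palette: ISO 639-1 code -> marker colour.
var langColors = {
    'cs': '#d7191c',
    'en': '#2b83ba',
    'ru': '#abdda4',
    'sk': '#fdae61'
};

function langToColor(lang) {
    return langColors[lang] || '#999999'; // grey for everything else
}

// With Leaflet, roughly:
// L.geoJson(data, {
//     pointToLayer: function(feature, latlng) {
//         return L.circleMarker(latlng, {color: langToColor(feature.properties.lang)});
//     }
// }).addTo(map);
```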
However, as those codes are guessed by Twitter’s language-detection algorithms, they are prone to error. There are actually three scenarios we might be facing:
- The user tweets in the same language as the one set for the Twitter account.
- The user tweets in his/her mother language, but has a different Twitter account language set.
- The user does not tweet in his/her mother language, but has it set as the Twitter account language.
We basically have to deal with 2) and 3), because 1) means we can be pretty sure what nationality the user is. Sadly, I have not found an easy way to tell which of these two cases we came across, and thus which language setting should be prioritized. I made an arbitrary decision to prioritize the language the tweet was written in, based on the assumption that most users tweet in their mother language. No matter what you do, the data is still going to be biased by automatically generated tweets, especially the ones sent by Foursquare saying “I’m at @WhateverBarItIs (http://someurl.co)”. It works fine for languages with distinctive scripts, like Russian and Arabic, though.
From Jan 2 to Jan 4 this year, 5,090 tweets were collected. Leaflet becomes a little sluggish displaying all of them without clustering turned on. The plan is to let the collection run until Jan 7 and then put all the tweets on the map; I guess that might be around 10,000 geotagged tweets by then.
I am definitely willing to share the code and the final viz. Meanwhile, you can have a look at the screenshot below. I have already implemented a nationality switch (legend items are clickable) and I would like to add a day/night switch to see whether there are any differences in people’s behaviour.
Figure: Final map screenshot. A legend is used to turn nationalities on and off. You are looking at Prague by the way.
Obviously, most tweets were sent from the most populated places, e.g. Prague, Brno and Ostrava.