Collection and Processing of Tweets
Tweets contain precise and relatively accurate information about when they were posted. They can also contain precise and accurate location information if the tweet comes from a GPS-enabled device for which the user has opted in to geolocation. Attribute information in tweets is challenging to extract for two reasons: the 140-character limit on tweets prompts extensive use of abbreviations, and tweet content varies dramatically. That content ranges from individuals documenting mundane daily activities in their lives, through professionals alerting followers about events and information (e.g., the Director of FEMA tweeting about the challenges of social media security) and government and non-government organizations making regular announcements (e.g., UNGlobalPulse announcing events and new stories or CrisisMappers.org announcing maps or training webinars), to advertisers using Twitter for marketing purposes.
SensePlace2 uses a crawler to systematically query the Twitter API for tweets on topics deemed to be of interest. The current implementation uses a set of keywords and phrases that our research team has proposed over time and that is extended as new events happen around the world. Queries for each term are run every day, and each can retrieve tweets and auxiliary metadata (e.g., creation time, tweet id, and user id) in JSON format. After parsing, each tweet is stored in a PostgreSQL database. Once tweets are loaded into the database, separate distributed applications analyze them for named entities such as locations, organizations, persons, hashtags, and URLs. These named entities are then written to separate tables, such as an auxiliary location table and an organization table. Extracted locations are then georeferenced using GeoNames. Once entities are identified and organized, a Lucene text index is generated that supports relatively fast full-text querying as well as more advanced retrieval of relevant tweets within a geographic region and date range.
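The parse-and-store step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SensePlace2 implementation: SQLite stands in for PostgreSQL so that the example is self-contained, the table layout and function names are hypothetical, and simple regular expressions for hashtags and URLs stand in for the separate named-entity applications.

```python
import json
import re
import sqlite3
from datetime import datetime

# Simple patterns for hashtags and URLs. Recognizing locations,
# organizations, and persons would need a real named-entity recognizer.
HASHTAG_RE = re.compile(r"#(\w+)")
URL_RE = re.compile(r"https?://\S+")

# Hypothetical schema: one main table plus auxiliary entity tables.
SCHEMA = """
CREATE TABLE tweets (tweet_id INTEGER PRIMARY KEY, user_id INTEGER,
                     created_at TEXT, text TEXT);
CREATE TABLE hashtags (tweet_id INTEGER, tag TEXT);
CREATE TABLE urls (tweet_id INTEGER, url TEXT);
"""


def parse_tweet(raw_json):
    """Pull the fields of interest out of one tweet's JSON text."""
    obj = json.loads(raw_json)
    # Assumed timestamp layout, e.g. "Tue Jan 12 22:10:00 +0000 2010".
    created = datetime.strptime(obj["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "tweet_id": obj["id"],
        "user_id": obj["user"]["id"],
        "created_at": created.isoformat(),
        "text": obj["text"],
    }


def store_tweet(conn, tweet):
    """Insert a tweet and its entities; return False if already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO tweets VALUES (?, ?, ?, ?)",
        (tweet["tweet_id"], tweet["user_id"], tweet["created_at"], tweet["text"]),
    )
    if cur.rowcount == 0:
        return False  # the same tweet was retrieved by an earlier query
    tid = tweet["tweet_id"]
    conn.executemany(
        "INSERT INTO hashtags VALUES (?, ?)",
        [(tid, tag) for tag in HASHTAG_RE.findall(tweet["text"])],
    )
    conn.executemany(
        "INSERT INTO urls VALUES (?, ?)",
        [(tid, url) for url in URL_RE.findall(tweet["text"])],
    )
    return True


# Made-up sample record for illustration only.
sample = json.dumps({
    "id": 101,
    "created_at": "Tue Jan 12 22:10:00 +0000 2010",
    "user": {"id": 7},
    "text": "Relief teams heading to #Haiti after the #earthquake http://example.org/help",
})
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store_tweet(conn, parse_tweet(sample))
```

Keying the main table on the tweet id means that a tweet matched by several keyword queries is stored only once, which matters when overlapping terms are queried every day.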
Visual Interface of SensePlace2
To support geovisual analysis, we designed a coordinated, multiple-view interface for SensePlace2 that supports an understanding of spatial and temporal patterns of activities, events, and attitudes that can be identified through analysis of our growing geo-located Twitter database. A key goal for this interface is to support an analyst’s ability to explore, characterize, and compare the space-time geography associated with topics and authors in tweets. This includes the ability to describe the geographic content associated with tweets as well as the locations from which tweets were sent by users who have enabled that feature. The default SensePlace2 interface includes a query window, map, time-plot/control, relevance-ranked list of tweets, and task list. The primary display views (map, time-plot/control, and tweet list) are dynamically coordinated. Each view is introduced in the screen capture below, which is the result of a short analysis session to explore flooding incidents. Cross-view linking is then discussed.
Users can enter single- or multi-term queries, and these can include place names. Each query retrieves a new set of information that is processed to populate the display views. The session illustrated above began with a query on “earthquake Haiti”.
The time-plot doubles as a compact representation of the frequency distribution of tweets that match the query across the full time span of data in the database and as a control to filter tweets by time. Specifically, both ends of the time range selector can be dragged, and once a time range is set, that range can be shifted along the timeline by clicking on the time snap bar and dragging. Above, the timeline is set to a range of interest (approximately one month after the earthquake, during the recovery phase).
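The two roles of the time-plot, frequency display and time filter, can be illustrated with a small sketch. The daily bin width, the record layout, and the function names are assumptions made for this example; the text does not specify how SensePlace2 bins tweets.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(tweets):
    """Frequency distribution of tweets per day (the plotted bars)."""
    return Counter(t["day"] for t in tweets)


def filter_by_range(tweets, start, end):
    """Keep tweets whose day falls inside the selected range, inclusive."""
    return [t for t in tweets if start <= t["day"] <= end]


def shift_range(start, end, days):
    """Slide a range along the timeline without changing its width."""
    delta = timedelta(days=days)
    return start + delta, end + delta


# Made-up records for illustration only.
tweets = [
    {"id": 1, "day": date(2010, 1, 12)},
    {"id": 2, "day": date(2010, 1, 13)},
    {"id": 3, "day": date(2010, 1, 13)},
    {"id": 4, "day": date(2010, 2, 12)},
]
counts = daily_counts(tweets)

# Set a three-day range, then drag it 30 days later along the timeline.
start, end = date(2010, 1, 12), date(2010, 1, 14)
in_range = filter_by_range(tweets, start, end)
start2, end2 = shift_range(start, end, 30)
later = filter_by_range(tweets, start2, end2)
```

Dragging one end of the selector corresponds to changing only `start` or only `end`, while dragging the whole range corresponds to `shift_range`.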
The 500 most relevant tweets are displayed in a scrolling list. The list can be sorted by relevance (the default), time, or place. Hierarchical sorting is also enabled (e.g., with time as the primary sort and relevance as the secondary to highlight recent and relevant tweets). Above, the tweet list was sorted by time to find those near the end of the time range of interest. Interesting tweets were then explored, with the map panning and zooming to include all places with which a highlighted tweet was associated (based on the computational processing outlined above).
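A hierarchical sort of this kind can be sketched with a stable sort applied from the least significant key to the most significant one. The field names, the relevance scores, and the use of a day-level time key are assumptions for illustration; with a day-level primary key, relevance breaks ties among tweets from the same day.

```python
import heapq


def top_relevant(tweets, limit=500):
    """Return the `limit` most relevant tweets, most relevant first."""
    return heapq.nlargest(limit, tweets, key=lambda t: t["relevance"])


def sort_hierarchical(tweets, keys):
    """Sort by several keys; `keys` is a list of (field, descending)
    pairs with the primary key first."""
    result = list(tweets)
    # Python's sort is stable, so sorting by the secondary key first and
    # the primary key last yields the hierarchical order.
    for field, descending in reversed(keys):
        result.sort(key=lambda t: t[field], reverse=descending)
    return result


# Made-up records for illustration only.
tweets = [
    {"id": 1, "day": "2010-02-10", "relevance": 0.4},
    {"id": 2, "day": "2010-02-12", "relevance": 0.7},
    {"id": 3, "day": "2010-02-12", "relevance": 0.9},
    {"id": 4, "day": "2010-02-11", "relevance": 0.8},
]

# Most recent day first, and most relevant first within each day.
recent_relevant = sort_hierarchical(tweets, [("day", True), ("relevance", True)])
best_two = top_relevant(tweets, limit=2)
```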
The map provides both an overview, in the form of a gridded density surface representing all tweets that match the query, and detail, in the form of a point-based depiction of the 500 most relevant tweets. The density surface is generated for the globe and currently depicts frequency counts for tweets aggregated to 2-degree grid cells (grid resolution is flexible, but that flexibility has not yet been made accessible to users). A quantile classification scheme is applied to allow comparison from one query to the next, and a sequential color scheme is used in which the darkest color represents the highest class. Some locations among the top 500 tweets are likely to have multiple tweets; those locations are depicted with range-graded symbol sizes for 1, 2-5, and >5 tweets from or about a place.
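The aggregation and classification steps can be sketched as follows. The cell indexing scheme, the choice of five classes, the quantile method, and all function names are assumptions for illustration; only the 2-degree cell size, the use of quantile classes, and the 1, 2-5, and >5 size ranges come from the description above.

```python
import bisect
import statistics
from collections import Counter


def cell_for(lat, lon, size=2.0):
    """Map a coordinate to a (row, col) grid cell of `size` degrees."""
    rows, cols = int(180 / size), int(360 / size)
    # Clamp so that lat = 90 or lon = 180 falls in the last cell.
    row = min(int((lat + 90) // size), rows - 1)
    col = min(int((lon + 180) // size), cols - 1)
    return row, col


def grid_counts(points, size=2.0):
    """Count tweets per grid cell; `points` holds (lat, lon) pairs."""
    return Counter(cell_for(lat, lon, size) for lat, lon in points)


def quantile_breaks(cell_values, classes=5):
    """Cut points that split the cell counts into equal-sized classes."""
    return statistics.quantiles(sorted(cell_values), n=classes, method="inclusive")


def classify(value, breaks):
    """Class index for a count: 0 is the lightest, len(breaks) the darkest."""
    return bisect.bisect_left(breaks, value)


def size_class(n):
    """Range-graded symbol class for the number of tweets at a place."""
    if n <= 1:
        return "1"
    if n <= 5:
        return "2-5"
    return ">5"


# Made-up coordinates for illustration only: three near Haiti, one in Florida.
points = [(18.54, -72.34), (18.60, -72.30), (19.76, -72.20), (25.77, -80.19)]
grid = grid_counts(points)

# Classify a made-up set of cell counts into five quantile classes.
breaks = quantile_breaks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```

Because quantile breaks are computed from each query's own distribution, every query fills the color classes about equally, which is one reading of why the scheme helps comparison between queries with very different tweet volumes.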
In the above example, places were also explored by pointing to them on the map; this moved the tweets linked to the place to the top of the list for easy reading. The role of various organizations in the relief effort starts to become apparent, as does the variety of places in the U.S. that are active in relief efforts. In the view below, one highlighted tweet focuses on multi-state efforts by one church to organize relief.
The task list view (not yet fully implemented) will allow users to label the results of a query and store them in a history. These stored queries will retain user-set parameters, including any place and time filtering as well as decisions to promote a tweet to high relevance or hide an irrelevant tweet. As noted above, the display views are dynamically linked. Clicking on a tweet location on the map moves the tweet to the top of the tweet list, highlights that tweet in the list, and highlights its time bin on the time-plot. If the location has multiple tweets, the sorting and highlighting are applied to the full set. When a tweet is selected in the tweet list, the map zooms and pans to bring it into focus.
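The map-to-list-to-time-plot linking can be sketched as a single function that computes what each view should show after a map click. The record layout, the day-level time bins, and the function name are assumptions for illustration and are not taken from the SensePlace2 code.

```python
def select_location(tweets, place):
    """Compute the linked-view state after a place is clicked on the map.

    Tweets associated with the place move to the top of the list in their
    existing order; those tweets and their time bins are highlighted.
    """
    selected = [t for t in tweets if place in t["places"]]
    others = [t for t in tweets if place not in t["places"]]
    return {
        "ordered": selected + others,                      # tweet list order
        "highlighted_ids": {t["id"] for t in selected},    # list highlights
        "highlighted_bins": {t["day"] for t in selected},  # time-plot bins
    }


# Made-up records for illustration only.
tweets = [
    {"id": 1, "day": "2010-02-10", "places": ["Port-au-Prince"]},
    {"id": 2, "day": "2010-02-11", "places": ["Miami", "Port-au-Prince"]},
    {"id": 3, "day": "2010-02-12", "places": ["Miami"]},
    {"id": 4, "day": "2010-02-12", "places": ["Boston"]},
]
state = select_location(tweets, "Miami")
```

Returning one state object that every view reads from is one simple way to keep the views consistent, including the case in which a location has multiple tweets.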