Data Visualization: Identifying Geographical Origins in Wikipedia Data

May 22, 2012 2:42 pm

Mapping Wikipedia image

This view of Mapping Wikipedia shows the number of articles written in French on topics tagged by geography.

Who has the strongest voice in the high-profile public forum of Wikipedia, and on which topics? The answer to that question depends on your address and your language.

“No information is neutral or devoid of power-relations,” says Mark Graham of the Oxford Internet Institute (OII), an academic center devoted to studying the societal implications of the Internet. “Accordingly, we need to understand where information is produced and what it is produced about.”

To investigate who was creating Wikipedia content—particularly content related to the Middle East and North Africa—Graham and his team at OII worked with TraceMedia, a London-based web application firm specializing in visualizations and mapping tools. Together, they created the Mapping Wikipedia project.

The resulting data visualization gave Graham a stark impression of how uneven Wikipedia’s geographic coverage is. “I knew it would be [uneven], but not to the degree which it is,” says Graham. “This doesn’t mean that Wikipedia is inherently flawed; it just means that we can, and should, become more aware of the biases in the knowledge that we use.”

What the visualization shows: Drop-down menus let users choose from seven languages (Arabic, Egyptian Arabic, English, Farsi, French, Hebrew and Swahili) and select one or more global locations, including countries, regions and continents. Users can also view the data by the date each Wikipedia article was created, its word count, its number of authors, its number of images, its number of links to other Wikipedia articles, density (the number of articles falling under a pixel at a given zoom level) and the number of anonymous edits associated with an article posted from a certain location.
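The density metric described above can be illustrated with a minimal Python sketch. It assumes the standard Web Mercator projection with 256-pixel tiles, which is common for web maps but is not confirmed as what Mapping Wikipedia uses; the function names `lonlat_to_pixel` and `pixel_density` are hypothetical.

```python
import math
from collections import Counter

TILE_SIZE = 256           # assumption: 256-pixel tiles, as in common web map tilings
MAX_LAT = 85.05112878     # latitude limit of the Web Mercator projection


def lonlat_to_pixel(lon, lat, zoom):
    """Project a longitude/latitude pair to integer pixel coordinates at a zoom level."""
    world = TILE_SIZE * (2 ** zoom)              # width and height of the whole map in pixels
    lat = max(min(lat, MAX_LAT), -MAX_LAT)       # clamp to the projection's valid range
    lat_rad = math.radians(lat)
    x = (lon + 180.0) / 360.0 * world
    y = (1.0 - math.log(math.tan(lat_rad) + 1.0 / math.cos(lat_rad)) / math.pi) / 2.0 * world
    # Keep points on the far edges inside the pixel grid.
    return min(int(x), world - 1), min(int(y), world - 1)


def pixel_density(points, zoom):
    """Count how many (lon, lat) points fall under each pixel at the given zoom level."""
    return Counter(lonlat_to_pixel(lon, lat, zoom) for lon, lat in points)


# Hypothetical geotags: two nearby points and one far away.
articles = [(0.0, 0.0), (0.1, -0.1), (100.0, 40.0)]
print(pixel_density(articles, 0))    # zoomed out: nearby points share a pixel
print(pixel_density(articles, 10))   # zoomed in: each point gets its own pixel
```

Because the pixel grid gets finer with each zoom level, the same set of articles yields a high density when zoomed out and a low one when zoomed in, which is why the metric is defined relative to a zoom level.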

Based on these selections, Mapping Wikipedia displays all matching geotagged Wikipedia articles as colored dots illuminated against a grey-on-black map. The largest data set (English language geotagged Wikipedia articles for the entire world) can take more than five minutes to load, but other smaller data sets (French language geotagged articles for North Africa, for instance) load in just a few seconds.

Once the sets have loaded onto the map, users can toggle back and forth rapidly among a number of different sorting methods, each with its own color scheme. Sorting the French language North African Wikipedia articles according to the number of images they contain yields mainly blue and purple dots; few articles have photos. Viewing the same data set according to the date when the articles were created lights up the map with green dots showing that many articles were created in 2012.

Hovering over any dot on the map reveals the title of that Wikipedia article. Clicking on the dot opens an information box filled with metrics on the article—the longitude and latitude of its geotag, its creation date, word count, number of authors and so forth. The information box also contains a link to the original Wikipedia article.

Putting the Map in Motion

On May 17, TraceMedia launched a new version of Mapping Wikipedia in which dots light up on the map along a timeline, showing when each article was created.

“In some regions, the animation shows patterns emerge gradually, as you would expect when humans are creating articles one by one,” says Gavin Baily, producer and developer at TraceMedia. “For France and Italy, you see whole blocks of stub articles appear overnight, which is a telltale sign of bot activity.”
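The pattern Baily describes, with many articles appearing on the same day, can be looked for in creation dates with a simple heuristic. The sketch below is only an illustration and not TraceMedia's method: the function name `find_creation_bursts` and the thresholds `factor` and `min_count` are assumptions chosen for the example.

```python
from collections import Counter
from datetime import date
from statistics import median


def find_creation_bursts(creation_dates, factor=10, min_count=50):
    """Return {day: count} for days with far more new articles than a typical day.

    A day is flagged when it has at least `min_count` new articles and at least
    `factor` times the median daily count. Both thresholds are illustrative.
    """
    per_day = Counter(creation_dates)
    if not per_day:
        return {}
    typical = median(per_day.values())
    return {
        day: count
        for day, count in sorted(per_day.items())
        if count >= min_count and count >= factor * typical
    }


# Hypothetical data: a trickle of hand-written articles, then a one-day surge.
dates = [date(2012, 1, d) for d in range(1, 6) for _ in range(2)]
dates += [date(2012, 1, 6)] * 200
print(find_creation_bursts(dates))
```

A flagged day is only a hint of automated activity; an editing campaign by human contributors could produce a similar spike, so the result would still need checking against the articles' edit histories.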

Below is a video clip showing the density of articles in English on the timeline.

Source of the data: The Oxford Internet Institute derived its data from Wikipedia XML dumps for languages relating to the Middle East and Africa, a region of the world where OII researchers were particularly interested in analyzing article production. OII then created a MySQL database of geotagged articles and assigned a country code to each article from the XML dumps.

What the designers did: TraceMedia wrote a middleware application in PHP (PHP: Hypertext Preprocessor) to examine different aspects of articles, including concentrations of author activity and image contributions, to gauge article quality and identify bot-generated article stubs. Technically, the main challenge was figuring out a way to display close to one million interactive points on a web map.

After experimenting with various mapping libraries, TraceMedia decided to use an OpenLayers map with an HTML5 Canvas renderer optimized for large point datasets, on top of a base map made from styled Google Maps tiles.

“Previously, having even 1,000 interactive points on a map would have been a lot, and 20,000 points would have been considered a ridiculous number,” says Baily. “Hundreds of thousands of interactive points were just not possible in any way until the latest browser versions that use HTML5 Canvas and run [JavaScript] so well and so quickly that you can now plot a million points at semi-interactive frame rates.”

Insights from the Mapping Wikipedia project

  • Israelis “are far more active in creating/reproducing knowledge in one of the world’s most used websites than their counterparts in the Middle East and North Africa.”
  • As a proportion of the Internet population, users in Italy, Scandinavia, the Baltic States and Ukraine are more likely to make an edit to Wikipedia than users in Great Britain or Germany.
  • Mapping Wikipedia users have found a higher-than-expected number of Swahili language geotagged articles in Turkey and English articles tagged for Poland. In both cases, it seems that either dedicated Wikipedia editors or bots have gone out of their way to create Wikipedia entries for a large number of relatively obscure towns and localities.

Key Presentation Choice: “The brilliant thing about using a stylized Google base map is that you have access to fantastic APIs, wonderful tools and the ability to easily render the map in any way you want. Rather than having the standard blue sea and grey land masses, we were able to develop the dark Mapping Wikipedia base map,” says Baily. “The [Wikimedia] Foundation tends to be keen on having people use OpenStreetMap, which lets you create tiles, but you have to host them yourself. If you have no budget, you would need a really good reason to pick OpenStreetMap over Google.”

As for the dark map background illuminated with pinpoints of light, the total effect brings to mind nighttime shots of Earth from space with the cities and highways glowing with manmade illumination. “Whether we chose light-on-dark or dark-on-light, the key thing for us was to make sure that the articles were clearly visible,” says Baily. “Ultimately, we made the choice to use the dark background mainly for aesthetic reasons. We found that light dots on a dark background worked best in terms of being able to distinguish a range of colors within each of the Mapping Wikipedia metrics.”

Reaction: The result caught the attention of many data visualization followers, from Information Aesthetics to the Fast Company Co.Design blog, which called Mapping Wikipedia “a magnificently complicated project that only grows more fun at every turn.”

Aaron Dalton is a freelance writer based in Nashville.
