Internet Geographer


Mapping Wikipedia at the global scale in Arabic, English, French, Hebrew and Persian

Building on the maps posted a few days ago, which covered Wikipedia in Arabic, English, French, Hebrew, Persian and Swahili, I wanted to present a few alternative visualisations of the same data.

In the maps below, the number of articles was grouped by country in order to better understand national-level inequalities in Wikipedia’s augmentations of our world. The data were all taken from November 2011 Wikipedia data dumps. Our project team wrote a script to search for coordinate representations in every article (taking into account the language variations between the dumps and the varying ways in which geo-coordinates are expressed). We improved the quality of our coordinates by eliminating or fixing erroneous coordinates, taking coordinates (where sensible) from more than just structured infoboxes, and removing irrelevant coordinates (Wikipedia actually contains a lot of coordinates for extra-terrestrial entities like lunar craters!). We then did some post-processing to make sure each country's count also included articles for entities like lighthouses, coastal shipwrecks and piers that sometimes fell just outside the coastlines in the country-boundary file that we were using.
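For readers curious what such a pipeline might look like, here is a minimal sketch in Python. It is not the project's actual script. It assumes coordinates appear in English-style {{coord|...}} templates, and it uses made-up rectangular "countries" (TOY_COUNTRIES) with a small buffer to stand in for a real country-boundary file. All function names, sample articles and the 0.25-degree buffer are hypothetical.

```python
import re
from collections import Counter

# Finds the body of a {{coord|...}} template (assumption: English template name only).
TEMPLATE_RE = re.compile(r"\{\{\s*coord\s*\|([^{}]*)\}\}", re.IGNORECASE)
HEMISPHERES = {"N": 1, "S": -1, "E": 1, "W": -1}

def _is_number(token):
    try:
        float(token)
        return True
    except ValueError:
        return False

def _to_decimal(parts, hemisphere):
    # parts is [deg], [deg, min] or [deg, min, sec]
    value = sum(float(p) / 60 ** i for i, p in enumerate(parts))
    return HEMISPHERES[hemisphere] * value

def parse_coord(body):
    """Return (lat, lon) in decimal degrees, or None if unusable."""
    fields = [f.strip() for f in body.split("|")]
    positional = [f.upper() for f in fields
                  if f.upper() in HEMISPHERES or _is_number(f)]
    extras = [f.lower() for f in fields
              if not (f.upper() in HEMISPHERES or _is_number(f))]
    # Drop coordinates on other globes, e.g. "globe:moon".
    for extra in extras:
        globe = re.search(r"globe:([a-z]+)", extra)
        if globe and globe.group(1) != "earth":
            return None
    if len(positional) == 2 and all(_is_number(p) for p in positional):
        lat, lon = float(positional[0]), float(positional[1])
    else:
        if len(positional) < 4 or positional[-1] not in ("E", "W"):
            return None
        split = next((i for i, p in enumerate(positional) if p in ("N", "S")), None)
        if split is None:
            return None
        lat_parts, lon_parts = positional[:split], positional[split + 1:-1]
        if not (1 <= len(lat_parts) <= 3 and 1 <= len(lon_parts) <= 3):
            return None
        if not all(_is_number(p) for p in lat_parts + lon_parts):
            return None
        lat = _to_decimal(lat_parts, positional[split])
        lon = _to_decimal(lon_parts, positional[-1])
    # Discard values that cannot be valid Earth coordinates.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon

# Hypothetical bounding boxes (south, north, west, east), not real borders.
TOY_COUNTRIES = {
    "Country A": (10.0, 20.0, 30.0, 40.0),
    "Country B": (-5.0, 5.0, 100.0, 110.0),
}

def country_for(lat, lon, buffer_deg=0.0):
    # The buffer catches features lying just outside the boundary.
    for name, (south, north, west, east) in TOY_COUNTRIES.items():
        if (south - buffer_deg <= lat <= north + buffer_deg
                and west - buffer_deg <= lon <= east + buffer_deg):
            return name
    return None

def count_by_country(articles, buffer_deg=0.25):
    counts = Counter()
    for wikitext in articles.values():
        match = TEMPLATE_RE.search(wikitext)
        coord = parse_coord(match.group(1)) if match else None
        country = country_for(*coord, buffer_deg=buffer_deg) if coord else None
        if country:
            counts[country] += 1
    return counts

# Made-up sample articles for illustration.
SAMPLE_ARTICLES = {
    "Inland town": "Text {{coord|15.5|35.25|display=title}} more text",
    "Offshore lighthouse": "{{coord|20|6|N|35|0|E}}",
    "Lunar crater": "{{coord|15|N|35|E|globe:moon_type:landmark}}",
    "Broken entry": "{{coord|95|35}}",
    "Southern port": "{{Coord|2|30|S|105|0|E}}",
}
```

Calling count_by_country(SAMPLE_ARTICLES) counts the lighthouse towards Country A only because of the buffer, and skips both the lunar crater and the out-of-range entry. A real analysis would need proper country polygons and per-language template handling rather than rectangles and a single regular expression.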

And after all of that, we’re able to tell you how many Wikipedia articles are in each country!

The map above displays the number of articles in English. There are a staggering number of articles in the United States (over 180,000 of them) and tens of thousands in many European countries, Japan, Australia and India. As we saw in our last post, there are also far fewer in much of the rest of the world. In fact, there are only a few countries in Africa that contain more than 1,000 articles.

Below are the maps for the Arabic, French, Hebrew, and Persian Wikipedias. Rather than discussing each individually, I will discuss some key themes at the end of this post.

The most striking pattern that you will probably notice in these maps is the significant self-focus of some languages, including English (an observation that is supported by the work of Brent Hecht and Darren Gergle). There are more Hebrew articles about Israel, French articles about France, and Persian articles about Iran than about anywhere else.

The same pattern, however, doesn’t hold true for the Arabic Wikipedia. In fact, in the top-10 list of countries by total number of articles in the Arabic version of Wikipedia, there are only two countries (Algeria being 8th on the list and Syria being 9th) that can be considered to be predominantly Arabic-speaking. The rest of the list is: #1 USA, #2 Spain, #3 Russia, #4 UK, #5 France, #6 Italy, #7 Greece, and #10 Iran.

The lack of focus on, and contributions by, the Arab world is particularly striking in all of these maps. Arabic is the world’s fifth most spoken language and yet only has the 25th largest Wikipedia. There are just over 24,000 geotagged Arabic Wikipedia articles whilst there are over 691,000 geotagged articles in English. 

The scale of these differences ultimately results in some almost implausible comparisons. For instance, there are more articles in English about North Korea than articles in Arabic about Saudi Arabia, Libya, the UAE and many other countries in the region!

But perhaps most interesting is the question alluded to above. Why is there such a (relatively) large number of Persian articles about Iran, but so few Arabic articles about places like Saudi Arabia (the country in which more than a quarter of all Arabic edits originate) and the UAE? A key goal of our research project is to answer this very question and better understand the barriers to participation.

In future posts, we’ll begin to move beyond these raw data counts and explore patterns of participation and representation in the region. In the meantime, any observations and questions are welcome.
Mark Graham