Internet Geographer


Posts tagged data
Historicizing Big Data and Geo-information

I was asked by my colleague and friend Oliver Belcher to act as a discussant in a session that he put together at the 2016 meeting of the Association of American Geographers: ‘Historicizing Big Data and Geo-information’.

The session contained a set of truly excellent and provocative talks by Louise Amoore, Matthew Hannah, Patrick McHaffie, and Eric Huntley. I’ve now had a chance to type up my discussant notes (although apologies for the hastily-put-together nature of the text).

I think that this has been a much-needed set of papers at a moment in which we’re awash with hype about ‘big data’. We hear that we’re at a significant moment of change; that there’s a big data revolution that will transform science, society, the economy, security, politics, and just about everything else.

And so, it’s important that these sorts of conversations are brought together. To allow us to think about continuities and discontinuities. To allow us to think about what is and what isn’t truly new here. And to do that in order to hone our responses as critical scholars.

One way to start - perhaps - is to recognize, as all of the papers in this session have, that while ‘big data’ may not be new, we’re in, have been in, or at least have long been moving towards, what Danny Hillis refers to as an Age of Entanglement. I think it is maybe useful as a starting point for me to quote him here.

He says “We are our thoughts, whether they are created by our neurons, by our electronically augmented minds, by our technologically mediated social interactions, or by our machines themselves. We are our bodies, whether they are born in womb or test tube, our genes inherited or designed, organs augmented, repaired, transplanted, or manufactured. We are our perceptions, whether they are through our eyes and ears or our sensory-fused hyper-spectral sensors, processed as much by computers as by our own cortex. We are our institutions, cooperating super-organisms, entangled amalgams of people and machines with super-human intelligence, processing, sensing, deciding, acting.”

In other words, while big data may not be new, we do now undoubtedly live in a digitally infused, digitally-augmented world. One in which we’re entangled in complex digital ecosystems; hybrid complex ecosystems in which it is increasingly hard to disentangle agency and intent.

Why’s my phone telling me to go left and not right? Why is the supermarket creating some personalized economic incentives for me and not others? Why is the search engine prioritizing some knowledge and not others? As researchers, it is hard to address questions like these because there is often no straightforward way of knowing the answers. Do we look to the code embedded within algorithms? Do we look to the people or institutions who created the algorithms? Do we look to the databases? Do we look to the people who manage the databases? Or do we look to the people, processes, and places emitting the data?

What today’s talks have all usefully done is point to the fact that we need to be addressing some combination of all of those questions. 

So, let me just pick up with a few general reflections and concerns about what the histories of big data mean for the futures of big data - that emerge from listening to these talks. Like all of the speakers, what I’ll especially focus on here is what geography can bring to the table.

First, is a thought about our role as geographers. Many geographers, me included, spend a lot of time thinking about the geographies of the digital; thinking about how geographies still matter.

We probably do a lot of this to counter some of the ongoing, relentless, Silicon Valley narratives of things like the cloud. Narratives that claim that - provided they are connected - anyone can do anything from anywhere at any time. So, we end up spending a lot of our energy pushing back: arguing that there is no such thing as the cloud. That there are just computers. Computers all in other places.

But I wonder if we’re missing a trick by not also asking more questions about the contextual, specific, but likely present ways in which some facets of geography might actually matter less in a world of ever more digital connectivity. Not as a way of ceding ground to the breathless ‘distance is dead’ narrative - but in a critical and grounded way. Are there economic, social, and political domains in which distance, or spatial context, is actually becoming less relevant?

Second, when we speak about big data, or the cloud, or the algorithms that govern code-spaces, we often envision the digital as a black box. In many cases, that is unavoidable.

But we can also draw more on the work from computational social science, critical internet studies, human computer interaction, and critical GIS. In all of those domains, research is attempting to open the black boxes of cloud computing; of big data; of opaque algorithms. Scholars are asking and answering questions about the geographies of the digital. Where is it; who produces it; who does it represent; who doesn’t it represent; who has control; to whom is it visible.

There is much more that clearly needs to be done, and this work needs to be relentless, ongoing, and - of course - critical. But, one hope for the future is to see more cross-pollination with those communities who are developing tools, methods, and strategies to empirically engage with geography in a more data-infused world. So, yes – there are black boxes. But those boxes are sometimes hackable.

Third, and relatedly: a key critique of ‘big data’ that I see in the critical data studies community concerns correlative reasoning (in other words, if your dataset is big enough, you no longer need theory, no longer need to understand context, and can just look for correlations in the dataset). And, relatedly, a lack of reflexivity within those practices of data analysis. But I wonder if we aren’t also overplaying our hand a little here. Some big data practitioner work does stop at correlations, but a fair amount of it can be quite reflexive and aware of its limitations.

These researchers are still building multi-level models, social network analytics, community detection models, influence analyses, predictive analytics, and machine learning. My point here is that whilst a lot of ‘big data’ work is undoubtedly naïve, let’s also not underestimate the power held by those with access to the right datasets, the computing resources to analyse those datasets, and the methods to analyse them.

My broader point is that we really need to find a balance between understating and overstating the work being done by governments and corporations in the domain of big data. Yes, some big data practices out there are naïve and dumb. And yes, some are terrifyingly precise in the ways that they can anticipate human behavior.

To get that balance, I think we need a few things. The first is to pay attention to what has been called the Fourth Law of Thermodynamics: that the amount of energy required to refute bullshit is an order of magnitude larger than that required to create it. Let’s make sure our energy is wisely spent.

To get the right balance, it also seems clear that all of us need not just to try to better understand the nuances of the key techniques and methods being employed, but also to think about what we can specifically add to the debate as geographers – and on this latter point, this is something that I think the papers in this session did very well.

Fourth, when thinking about the political economy of data, it’s becoming ever more clear that we need a full-throated response to reclaim our digital environments (a point that Agnieszka Leszczynski has been forcefully making). Privacy and security scholars and activists have been especially vocal here. But let’s again think about what our role as geographers can be in this debate.

The way in which my colleague Joe Shaw and I are thinking about this (and – my advertising pitch here – this is something we’re speaking about in three sessions we’ve organized on the topic on Friday morning) is to argue that we need to translate some of the ongoing ‘right to the city’ debate into the digital sphere. The point being that if the places we live in are bricks, mortar, and digital data, we need to think about rights of access, control, and use in our digitally-infused world.

This is just one type of intervention; and I’m sure that building on the foundations of critical historicisations of big data can offer us fertile ground for reimagining what we actually want our data-infused futures to look like.

Fifth, something that I saw, and really appreciated, in all of the papers was a forceful reflection on how data are always political. Too often data, and especially ‘big data’, get presented as a neutral and objective reflection of the object of their measurement. That big data can speak for themselves. That big data are a mirror of the world around them. What a lot of today’s work has done is reflect on not just how data reflect the world, but also how they produce it; how they drive it. As we tread deeper into Danny Hillis’s ‘Age of Entanglement’, this is something we’ll need much more of.

As Trevor Barnes mentioned in the last session, the best kinds of papers leave you hungry for more detail – and a few more things I would have loved to have heard more about are:

From Louise – a bit more about what the vision of the cloud enables beyond the cloud itself. The cloud can in many ways make some facets of life perceptible – it is deployed to study life online. But how much of the cloud vision is about moving beyond that – being deployed to study life offline; to study the facets of life that aren’t directly emitting digital data shadows? Also, the empirical work you spoke about sounds fascinating – and I hope the questions give you some more time to bring out the ways in which you’ve gone behind the algorithms – and underneath the cloud – to look at how these knowledges are created.

From Matthew – it was interesting to see how some of our contemporary concerns about the power of big data to aid the surveillance powers of the powerful – are far from new. So what might protests against contemporary regimes learn from the earlier moments you spoke about? There are many of us who want to opt out; is this now less possible because of the more encompassing nature of contemporary data collection regimes?

From Eric - I wonder whether the idea of the ‘world inventory’ in the 80s – the details of it; what it meant in practice – was similar to the vision of a world inventory of geospatial information held by large tech firms like Google today. Does a world inventory now mean something significantly different from what it used to?

From Patrick - You didn’t use the term ‘smart city’. But I wonder if you’ve looked into any so-called ‘smart city’ initiatives – and if you could say more about how we should be honing our inquiry into the so-called ‘smart city’ based on what you’ve learnt here; based on what we know about the visions that brought the Cartographatron into being?

For all of us – scholars in this field – I wonder if we’re all speaking about the same thing in this session when we talk about ‘big data’. Are we talking about datasets that are a census rather than a sample of a population? Are we just using ‘big data’ as a proxy for ‘digital data’? Are we using that term to refer to the whole contemporary apparatus of data trails, shadows, storage, analysis, and use? Are we using it to refer to digital unknown unknowns – the digital black box? Is the term actually helping us as short-hand for something else? Or do we need more precise language if we want to make sure we’re genuinely having a conversation?

And finally, for all of us, I want to ask why this seems to continue to be such a male-dominated field. In two sessions, seven speakers, and two discussants, we had only one female speaker. Are we reproducing the gendered nature of earlier computational scholarship? One of the dangers of telling these histories is that it can end up being white men speaking about white men. This is not a critique of the organiser, as I know Oliver is well attuned to these issues, but rather a question about how and why we might be (re)producing masculinist knowledges.

So, to end – I want to again thank Oliver and the speakers for putting together this session on historicizing Big Data. We need more conversations like this; and we need more scholarship like this. And this is work that will have impacts beyond the boundaries of geography.

We know that we can’t go backwards; and I think the goal that many of us have is a more just, more democratic, more inclusive data-infused world. And to achieve that, one thing we all need to be doing is participating in ongoing debates about how we govern, understand, and make sense of our digitally-augmented contexts.

And perhaps one thing that we can all take away from this session is that if we want to take part in the debate - to influence it – we’ll need to understand big data’s history if we want to change its futures.
Call for papers: The Data Revolution in International Development (Sri Lanka, May 2015)

Richard Heeks and I are organising a track at WG 9.4: Social Implications of Computers in Developing Countries on the topic of “The Data Revolution in International Development”.


Richard Heeks (University of Manchester, UK)
Mark Graham (Oxford Internet Institute, University of Oxford, UK)

Many have pointed to a “data revolution” occurring in business, science, and politics.  As ever-more and ever-faster information is available about trends, patterns and processes, then related decision/action systems will be significantly affected.  This track focuses on these changes in the context of international development, given the likelihood that the post-2015 development agenda will include a greatly increased role for data.  This was particularly identified in the 2013 High-Level Panel Report – “A New Global Partnership” – one of the foundations for post-2015 discussions.  The report explicitly calls for a data revolution in international development, and suggests data-related targets for inclusion within the new development goals.

In some ways, the High-Level Panel reflects a reality already underway, and this track invites papers on any aspect of the data revolution in international development, such as:
  • Technical research on new techniques specifically required for capture, input, storage and processing of developing country data.
  • Socio-technical research on the specific issues that arise in analysis and presentation/visualisation of developing country data.
  • Socio-organisational research on the developmental value of new data, and on the transformation of development processes and systems that new data can enable.
  • Critical research on the politics and discourses of the data revolution.
We identify four main strands within the data revolution, which papers might address:

Open development data: the greater availability of developing country datasets for general use.  By far the biggest growth area has been open government data which is particularly linked to improvements in transparency, accountability and service delivery.  But open data can apply equally to private sector firms, markets, NGOs, and other development actors and systems.

Big development data: the emergence of very large datasets relating to phenomena within developing countries.  One main source has been mobile phone call records but there are growing numbers of survey-based, transactional and other large datasets that can offer new insights into development.

Real-time development data: the availability of developing country data in real time.  To date, lagged models have been dominant within developing country data and decision-making, with data becoming available months or years after the events that it describes.  The growing diffusion of ICTs within developing countries is reducing this lag significantly as crowdsensing – everything from humans reporting via their mobiles to field-based sensors – becomes a reality.  The use of (near) real-time data for development decisions could enable a move to agile methods in development.

Other data trends: open, big and real-time data are three main elements to the data revolution but there will be others that form part of the post-2015 agenda.  These include increases in geo-locatable data, mobile data, bottom-up data, and qualitative data.

Submissions Due: 3rd October 2014

For more information, please contact richard.heeks[at]
A critique of the Economist's "#AfricaTweets" story

The latest edition of the Economist contains an article titled “#AfricaTweets.” The piece contains a striking map that visualizes the “number of tweets” per country in the “top 20 African countries.”

The only problem is that the article doesn’t do what it promises. 

My problem with the Economist’s article isn’t their whimsical (and quite funny) commentary on the use of Twitter in Africa (e.g. they quote @MorganTsvangirai: “******* **** ******* ****** ******** ****** ** ******* #ZimPolitics” and @Bono: “Africans tweeting each other, not me, about news, not me #sadface”).

The issue is that the Economist makes no attempt whatsoever at qualifying the limitations of these data. 

For instance, the article begins with the statement that “Twenty countries sent over 11m tweets in the last quarter of 2011.” I believe this to be a vast underestimation of the amount of information pushed through the platform in Africa. 

Looking at the source document for the Economist’s data (something they neglect to link to), we see that the data naturally contain only geo-located Tweets (something they neglect to mention). This is important because only a very small proportion of tweets tend to contain any geodata. In June 2011, my team and I collected 19.6 billion tweets using the statuses/sample stream with spritzer access (this was a 19 day sample collecting both geocoded and non-geocoded tweets globally), and we found that only 0.7% of tweets contained geographic coordinates.
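To illustrate the kind of filtering involved, here is a minimal sketch (not the code we actually used) of how one might measure the share of geocoded tweets in a newline-delimited JSON dump from the statuses/sample stream. The `coordinates` field follows Twitter's v1.1 tweet payload, where it is null unless the user attached an exact GPS point; the file layout is an assumption for illustration.

```python
import json

def geocoded_share(path):
    """Return (total, geocoded, share) for a file with one JSON tweet per line."""
    total = geocoded = 0
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            tweet = json.loads(line)
            total += 1
            # 'coordinates' is null in the payload unless the tweet
            # carries an exact longitude/latitude point.
            if tweet.get("coordinates"):
                geocoded += 1
    share = geocoded / total if total else 0.0
    return total, geocoded, share
```

Run over a large enough sample, a ratio of this kind is what produces figures like the 0.7% above.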

Original map created by Portland Communications above

This matters because it is conceivable that people in some countries are more likely to geolocate their tweets than others due to either social norms or access to the requisite devices (such as smartphones). In other words, by looking at geocoded tweets we’re only seeing a tiny fraction of the content that passes through the platform.

This isn’t to say that there aren’t other ways to geolocate information on Twitter. In a recent paper, Scott Hale, Devin Gaffney and I analysed whether locations in user profiles (descriptions such as “Oxford, UK” or “Barad-dûr, Mordor, Middle Earth” that can be grabbed from the vast majority of tweets) can be used as a proxy for (much rarer) geocoded content. It turns out that profile information isn’t a great substitute for actual latitude and longitude coordinates.
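A toy version of that comparison might look like the sketch below: match free-text profile locations against a small gazetteer, then check how often the inferred country agrees with the country derived from actual coordinates. The gazetteer, the substring matching, and the agreement measure here are illustrative stand-ins, not the method from the paper.

```python
# Illustrative gazetteer: profile-location substrings mapped to country codes.
GAZETTEER = {
    "oxford": "GB",
    "london": "GB",
    "nairobi": "KE",
    "lagos": "NG",
}

def country_from_profile(profile_location):
    """Guess a country code from a free-text profile location, or None."""
    text = (profile_location or "").lower()
    for place, country in GAZETTEER.items():
        if place in text:
            return country
    return None

def proxy_agreement(records):
    """records: (profile_location, country_from_coordinates) pairs.

    Returns the fraction of resolvable profile locations whose inferred
    country agrees with the country derived from the tweet's coordinates."""
    resolved = agreed = 0
    for profile, coord_country in records:
        guess = country_from_profile(profile)
        if guess is None:
            continue  # places like "Middle Earth" simply don't resolve
        resolved += 1
        if guess == coord_country:
            agreed += 1
    return agreed / resolved if resolved else 0.0
```

Even this toy version makes the two failure modes visible: many profiles don't resolve at all, and those that do can disagree with where the tweet was actually sent from.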

Time zone settings are another approach to figuring out where information comes from, but our research shows that many users (especially within Africa) don’t seem to set their time zone.

Furthermore, there is no attempt to account for prolific users in these samples. Looking at the Economist’s map (and even the source document) doesn’t tell us if Gabon is in the top-20 list because a lot of people use the service in that country, or because a small number of Twitter addicts have their smartphone GPS buttons turned on. Knowing the answer to this question fundamentally changes how we should interpret the map.
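Given per-tweet user IDs, one quick way to probe that question is to measure how concentrated a country’s geocoded tweets are among its most prolific accounts. A minimal sketch, assuming records of (user_id, country_code) pairs, one per tweet:

```python
from collections import Counter

def top_user_concentration(records, country, k=10):
    """Share of a country's tweets sent by its k most prolific users.

    records: iterable of (user_id, country_code) pairs, one per tweet."""
    counts = Counter(uid for uid, c in records if c == country)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum the tweet counts of the k heaviest users in this country.
    top = sum(n for _, n in counts.most_common(k))
    return top / total
```

A value close to 1.0 for small k would suggest a country’s ranking is driven by a handful of accounts rather than broad adoption; a low value would suggest the opposite.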

None of this means that the original maps produced by Portland Communications are fundamentally flawed. Geocoded tweets are both insightful and useful. It just shows that there are crucially important details to be aware of whenever analysing Twitter data (I won’t even get started on the different types of sampling methods in this post).  

Hundreds of millions of short messages are passed through Twitter every day, and this content has been used by researchers from fields as diverse as epidemiology, politics, marketing and geography to better understand, map and measure large-scale social, economic, and political trends and patterns. However, much of this analysis is carried out with only limited understandings of how best to work with the spatial and linguistic contexts in which that information was produced. 

Maps are powerful tools: they influence how we understand, enact, produce, and re-produce our world. This means that cartographers bear a significant amount of public responsibility. 

And any geographer will tell you that no map is a true representation of anything. With the advent of easy-to-access Internet-based data, we therefore need, more than ever, to constantly ask critical questions about how online data are collected, analysed, and presented.
Geographies of the World's Knowledge

Our project titled “Geographies of the World’s Knowledge” has just gone live on the new Oxford Internet Institute data visualisation site. In the project, we use a range of visualisation techniques to map literacy, Internet penetration, the world’s newspapers, academic knowledge, Flickr, Wikipedia, and user-generated content indexed in Google. A sample of three of our maps are below, or a full PDF of the publication can be downloaded at the following link:

Graham, M., Hale, S. A. and Stephens, M. (2011) Geographies of the World’s Knowledge. London, Convoco! Edition.