Agent Based Models Offer Better Predictions

Statistical forecasting is a familiar technique: many data points are gathered, assessed, and aggregated, and the analysis of that aggregated information is used to predict everything from elections to global warming.

This method, as pioneered by W. Edwards Deming, was used as an industrial tool in helping to create a viable manufacturing industry, virtually from scratch, amidst the burnt-out wreckage of post-war Japan. Now nearly all modern industry uses statistical methods in its manufacturing and engineering processes to great effect, as long as precision is not confused with accuracy.

However, statistical approaches that are ideal for maximizing the performance of machines encounter serious issues when it comes to working with humans. That is because even when we are given a highly restricted set of options from which to choose, it doesn't follow that the optimal decision gets made. This source of unpredictability is not, by itself, such a big deal when properly trained and competent people are working in a team. But even within clearly defined parameters the possibility always exists that, even with good intentions and sound thinking, decisions get made which have disastrous consequences.

In any system where humans are required to make a decision, options and methodologies can vary from person to person. They can vary because of influences from environmental conditions and also because of how the information presents itself.

Technology, through the use of robots, automated systems, and computerization, has at every stage of its growing capability been used to take humans out of the industrial process, but human involvement can never be entirely eliminated.

Even in vast, well-resourced global systems, whether commercial, political, military or environmental, it is the individual doing the unpredictable thing that can invalidate complex, deeply wrought models of statistical forecasting based on petabytes or more of data.

We know humans are unreliable. We all think it is wise to get a second opinion on important medical matters, even if we have faith in the training and competence of the doctor we are dealing with. So the question becomes: can we account for our unpredictability in predictable ways?

Michal Kosinski, Operations Director at the Cambridge University Psychometrics Centre, thinks it is possible. In an article soon to be published by the Proceedings of the National Academy of Sciences he suggests that the minor or incomplete digital information that exists about each of us, when collated and aggregated with that of hundreds of thousands of people roughly like us, can serve to make accurate predictions about what we will do next.

The testbed for his theory was Facebook. With a sample of 58,000 users, and measuring their activity through the Like button alone, the researchers were able to make predictions of such accuracy that the information could be, in their words, "worthwhile for advertisers."
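As a purely hypothetical illustration (this is not the researchers' actual method, and all the data below is invented), the underlying idea of predicting something about a person from the traces of people roughly like them can be sketched in a few lines of Python:

```python
# Toy sketch: predict an attribute of a user from their Likes by
# comparing them with other, roughly similar users. Here this is a
# simple nearest-neighbour majority vote over invented data; the real
# study's methods were considerably more sophisticated.

def jaccard(a, b):
    """Similarity between two sets of Likes, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def predict(target_likes, known_users, k=3):
    """Majority vote of the k users whose Likes best match the target's."""
    ranked = sorted(known_users,
                    key=lambda u: jaccard(target_likes, u["likes"]),
                    reverse=True)
    votes = [u["attribute"] for u in ranked[:k]]
    return max(set(votes), key=votes.count)

known_users = [
    {"likes": {"jazz", "chess", "opera"},   "attribute": "over-40"},
    {"likes": {"jazz", "opera", "golf"},    "attribute": "over-40"},
    {"likes": {"chess", "golf"},            "attribute": "over-40"},
    {"likes": {"skateboards", "memes"},     "attribute": "under-40"},
    {"likes": {"memes", "gaming", "anime"}, "attribute": "under-40"},
]

print(predict({"jazz", "chess"}, known_users))  # prints "over-40"
```

The point of the sketch is the aggregation step: the prediction about you comes entirely from people who are merely somewhat like you.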

[NB: When the PNAS article is published Technology Voice will be covering the findings in greater detail.]


But the application of this work goes beyond selling stuff on a social network site. Being able to make predictions in the very area where predictive power is weakest could save us from all sorts of disasters. Apart from natural disasters, nearly all catastrophes can be traced to human error of some sort.

The classic failure of statistical forecasting as a predictive tool was seen in the recent financial meltdown. Financial services was, and still is, a massively data-driven industry. Yet the gazillion bits of intensely analyzed data and the resulting highly thought-through prognostications were rendered illusory when the economy disappeared.

The idea of econometrics as a trustworthy and reliable tool for policy-makers evaporated along with it. But this attitude only shows a lack of understanding of the limits of what can be gleaned from an approach based purely on statistical forecasting. The failure was not in the data but in how people used, interacted with, or ignored the data.

However, by focusing on what people are likely to do, rather than on what the information on its own does, we can perhaps model intensely complex systems such as economies in a far more realistic and useful manner.

Approaching the same issue from another angle is the research taking place in what are called agent-based models, which in turn have been developed from complexity theory (the theory that enables companies to move things from A to B around the world, though often, counter-intuitively, not by the shortest route). It looks very promising, and the work being done by projects such as CRISIS (Complexity Research Initiative for Systemic InstabilitieS) is beginning to attract wider attention and is receiving major funding.

The idea of developing agent-based models is to predict what human decision-makers in a given system will do. That agent could be a pilot in a plane, a manager tasked with a project or a customer picking up something from the supermarket on the way home from work. All humans in a system – all bringing their own rationales and temperament to the process.
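The basic mechanics can be shown with a toy model. The Python sketch below is purely illustrative (it is not the CRISIS model or any real project's code): fifty agents each decide to "buy" or "sell" mostly by copying their neighbours, with a small chance of acting independently. Even rules this simple can produce herding and sudden aggregate swings that a statistical forecast of the average would smooth over.

```python
import random

def step(states, independence=0.1, rng=random):
    """Advance the model one tick: each agent mostly imitates its
    neighbours, but occasionally makes an independent choice."""
    new = []
    for i, s in enumerate(states):
        if rng.random() < independence:
            new.append(rng.choice(["buy", "sell"]))  # independent decision
        else:
            left = states[i - 1]                  # ring topology
            right = states[(i + 1) % len(states)]
            votes = [s, left, right]
            # copy the majority of self plus two neighbours
            new.append(max(set(votes), key=votes.count))
    return new

rng = random.Random(42)  # seeded for reproducibility
states = [rng.choice(["buy", "sell"]) for _ in range(50)]
for _ in range(20):
    states = step(states, rng=rng)

print(states.count("buy"), "buyers out of", len(states))
```

Running the model repeatedly with different seeds shows how the same individual rules can settle into very different collective outcomes, which is exactly the kind of behaviour that aggregate statistics alone cannot anticipate.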

Apart from the obvious benefit to marketers, being able to predict how individuals will make decisions in a given situation, while working within the confines of a process or a system, would help to reduce randomness and increase reliability.

However, it seems we now have the very odd situation where the abstracted you, formulated by the aggregation of data derived in part or in whole from vast numbers of people who are somewhat like you, is more likely to behave like you than the real you. Where’s Douglas Adams when you need him?

DataSift: New Frontiers in Search


Sarah Blow is a computer science graduate of Manchester University in England. She is now working on bringing DataSift through its alpha testing phase in preparation for wider use. The filtering technology underpinning DataSift allows highly refined searches across a selection of social networking platforms.

Five years ago, in her spare time, Sarah founded Girl Geek Dinners to alleviate the isolation that many women may feel while working in what is still a male-dominated tech industry.

How did DataSift come to be?

“DataSift is pretty much the back-end system that powers TweetMeme. We wanted to rebuild that engine because we knew we could do more with it. Rather than changing TweetMeme, we created this new brand called DataSift.

Who is it for?

“The types of people we are bringing in are big financial services companies, anyone looking to do marketing and marketing analysis, agencies that are looking after brands and so on. Pretty much anything you can think of where you want to find deeper layers of information.

“While it is in alpha we are looking mainly for developers but with the mindset of bringing anyone in who is filtering and creating content. We want to find out what their needs are beyond the basics of what we’ve got there at the moment.

“We have basically asked, ‘What do people want to do with the system?’ And now we are looking at ways of packaging it so it can work in the right ways for different groups of users.”

What do people use it for?

“We’ve seen people use it for geo-targeting and geo-mapping content in order to find out about particular brands and track them. Also, people use it to find out where their users are based. They can also find the most influential users within their particular market.

  • Use example: San Francisco 49ers

    “For TechCrunch Disrupt we demonstrated the capability of DataSift by using publicly available information from a San Francisco 49ers game that was happening that weekend.

    “There were three rules which we set up 24 hours in advance.”

    • Data Collection: “We first created a base rule. That was pulling in information from everyone at Candlestick Park: anyone who mentioned the name San Francisco 49ers and anyone who mentioned any of the players from the 49ers or the opposing team.

      “There was no geo data in that filtered stream.”

    • Geo-location: “From there we built a second rule on top. Taking all the output from the first rule we said, ‘Right, now we want anything in San Francisco.’ So, if someone has set their Twitter location to San Francisco we could pick them up, including their tweets about that particular subject but only that subject.”
    • Geo-targeting: “Then we decided, that’s good, but it’s not perfect. What we really want to know is who is in the Park that’s really seeing the cool stuff. Can we manage to get some twitpics, for example, from inside the Park from people who were actually there?

      “The only way you can verify they are really in the Park is if they are actually geo-located in the Park. So we have an option on DataSift to do a single point and set a radius around it. We found the geo-target for Candlestick Park and set a one-kilometre radius around it, which pretty much covers that area and anyone just outside the stadium.

      “But it didn’t come back with much at all. It literally came back with one user and they hadn’t done any photographs. They had just tweeted that they were there.”

    That only one user was returned according to the parameters that were set up is very interesting in itself. It would have been reasonable to have expected far more returns. It is always worthwhile to remember that what one assumes about a situation and what really happens may be two entirely different things.

    This is why having better tools to be able to really drill down into the data and to refine and define the results is so vitally important.

  • Use example: Starbucks

    “We were based in San Francisco at the time. So we tried a different exercise using DataSift where we basically said, ‘We want to find anyone in Starbucks who has got a PeerIndex score of over 40.’ Let’s see who the influential people were in San Francisco at that time and find out which Starbucks they are in today. That was a fun one to do.”

“You could do something similar with breaking news. If you knew there was a story breaking in a particular location, and you are a news organisation wanting to filter down to find the legitimate sources who were actually in that location, using DataSift would certainly be one way of doing it.”
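The layered-rule idea Sarah describes can be sketched in ordinary Python. This is illustrative only, not DataSift's actual FSDL: the tweets are invented, the Candlestick Park coordinates are approximate, and the geo-radius rule is implemented with the standard haversine great-circle formula.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

PARK = (37.7136, -122.3861)  # Candlestick Park, approximate

def mentions_team(tweet):            # base rule: content only, no geo data
    return "49ers" in tweet["text"]

def in_san_francisco(tweet):         # second rule: profile location
    return tweet.get("location") == "San Francisco"

def inside_park(tweet, radius_km=1.0):  # third rule: geo-located point
    if tweet.get("geo") is None:
        return False
    lat, lon = tweet["geo"]
    return haversine_km(lat, lon, *PARK) <= radius_km

tweets = [  # invented sample data
    {"text": "Go 49ers!", "location": "San Francisco", "geo": (37.714, -122.386)},
    {"text": "Go 49ers!", "location": "San Francisco", "geo": (37.7749, -122.4194)},
    {"text": "Nice weather", "location": "San Francisco", "geo": (37.714, -122.386)},
]

hits = [t for t in tweets if mentions_team(t) and in_san_francisco(t) and inside_park(t)]
print(len(hits))  # prints 1
```

Each rule narrows the output of the one before it, which mirrors the base rule, geo-location and geo-targeting stages of the 49ers demonstration.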

What is the next step?

“We are aiming to have a drag-and-drop interface, which we haven’t finished yet. Users who don’t necessarily have a strong technical ability or an understanding of the technicalities shouldn’t need that level of detail to use the system.

“We only really expect the FSDL language that we have got there to be used by developers. It is not really aimed at the general user. But while DataSift is in alpha we’ll teach general users how to use it, in case we take a bit longer doing the other side of it.”

The work being done by Sarah and the DataSift team promises to be a cutting-edge development in information retrieval. If you want to help with their alpha testing you can still sign up at DataSift.

ScraperWiki: Hacks and Hackers Day comes to Dublin


Hacks and Hackers Hack Day is taking place in Ireland on the 16th of November during Dublin Innovation Week. The organiser of the day-long event is ScraperWiki. Their aim is to provide the resources that allow anyone with any kind of programming ability to develop, store, and maintain software tools for the purposes of extracting and linking data.

By making data more accessible, ScraperWiki allows interested parties such as journalists to take advantage of initiatives such as the UK Government’s policy of making its data more available to the public. Since the UK expenses scandal, in which certain British parliamentarians were found to have abused their statutory allowances, journalists have become increasingly aware of the wealth of potential stories that lie in databases around the world. However, this data has usually been stored in a random, unstructured and relatively inaccessible manner.

According to Aine Mcguire, in charge of sales and marketing for ScraperWiki, change has only come recently: “In 2003, a gentleman called Julian Todd contacted the UK Government to find out how various MPs had voted on the war. When he tried to get this information in order to do some analysis on it, he was advised by the Cabinet Office that all this information was published in Hansard, the official record of the UK parliament. But it was difficult [to access]. It was deep down inside a website and he couldn’t do anything with it.

“So Julian went and scraped all that information from Hansard and…then fed it into a website in the UK called The Public Whip, which shows you the voting record of all of the MPs in the UK.

“But it was very controversial: he risked imprisonment for doing this because of Crown copyright. They didn’t imprison him, though, and it was Julian Todd who came up with the idea for ScraperWiki.”

ScraperWiki has been active since March 2010, and Aine says it aims to “build the largest community-supported public data store in the world.

“You’ve got Wikipedia, which is predominantly for text content, and OpenStreetMap, which is for maps. What we want to do is create a wiki for data. We’re taking data that is in a very unstructured style and putting it into our structured data store. Where appropriate we’re adding longitude and latitude tags. We’re geo-tagging it, which means that the data can be mapped.”

In line with its aim of being a worldwide data resource, the ScraperWiki project has had datasets submitted from countries such as the UK, Brazil, Germany, Estonia, Greece and France, to name a few. These datasets cover such subjects as the 11,000 deep-sea oil wells around the UK, public transport traffic incidents in London, oil rig accidents and so on.

“As well as being a datastore, it’s a wiki for code,” Aine explains. “At the moment, if you want to do some programming, you would go out onto the web somewhere, download some tools, and install them on a server. ScraperWiki allows you to program directly in the browser, so in effect we’re giving you lots of libraries to program with.

“You can write a screen scraper that uses any of the libraries we’ve got in our browser technology. You can use Python, PHP, or Ruby. So you can go off and scrape without having to install anything on your PC or server.”
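As a rough illustration of what a screen scraper does (this uses only Python's standard library, not ScraperWiki's actual browser environment or API, and the HTML page is invented), the following sketch pulls the rows out of an HTML table so that an unstructured page becomes structured records:

```python
from html.parser import HTMLParser

# Invented stand-in for a scraped page, e.g. a voting-record table.
PAGE = """
<table>
  <tr><td>Member A</td><td>Aye</td></tr>
  <tr><td>Member B</td><td>No</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []          # start a fresh row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self.row:
            self.rows.append(self.row)  # row complete, keep it
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)  # prints [['Member A', 'Aye'], ['Member B', 'No']]
```

A real scraper would fetch the page over HTTP first, but the parsing step, turning markup meant for eyeballs into rows a program can query, is the essence of the technique.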

An added benefit is that, because of the inherently collaborative nature of wikis, the possibility exists for code to be updated, improved, and shared by other programmers.

Aine describes what to expect from the Hacks and Hackers Hack Day: “At the beginning of the day we have a little presentation about what a Hacks and Hackers Hack Day is all about. Then we give a little presentation on ScraperWiki, although we don’t prescribe that they use it. Then we let the journalists and developers gravitate together to form teams over datasets of interest. Then they go off and hack all day. At six o’clock we ask the project groups to come back and present, for three minutes each, their particular visualisation of the data set that they have worked on.”

Prizes are then awarded and there is a reception for the participants to attend. At a previous event in Liverpool in July eight projects were produced by journalists and programmers working together using open data.

For data-driven journalism to flourish, even with the maximum reasonable amount of access granted by governments around the world, the problem still exists of data being stored in silos. Information has to be accessible not only by people other than those who made the original entries but by other machines as well. Structuring information for greater accessibility is not going to happen all by itself. It will take the sort of co-ordinated and collaborative effort that organisations such as ScraperWiki offer to really make our world a more open and transparent place to live and work in.

At the time of writing, the Hacks and Hackers day taking place in Dublin is fully subscribed, but tickets are still available for the Belfast event on the 13th of November.

It is a free event and ScraperWiki is a not-for-profit organisation. Please contact Aine through their website if you would be interested in sponsoring a part of the event.

Infographics: Communicating The Essence Of A Tidal Wave Of Information

As databases around the world begin to share and compare their data with ever-greater meaning and relevance through the rapid roll out of Linked Data implementations, it is going to become more and more challenging to corral that data and make it into something user-friendly and practical. After all, data isn’t worth anything unless it is usable in some form.

We do know that there is a tidal wave of data coming right at us, just over the horizon. According to one source we passed 3 zettabytes (21 zeroes) in 2008. So how do we begin to make sense of it?

One answer to the need for increased intelligibility lies in the nascent field of information graphics, or infographics for short. Up until now it has been a geeky/arty sub-genre of the Internet, regarded as something quite separate from the hard-core, often macho world of ‘real’ coding. But researchers, artists, statisticians and folks from all sorts of other fields are realising that not everyone wants to plough through all those numbers and data tables, and why should they, when a simple picture can tell the whole story?

But infographics has the possibility of being something far more than the mere prettifying of data. Assembling data in this manner to produce an infographic, a chart, or some other means of communicating an idea visually is really content production. The most important rule of content production is to tell a story. That is the secret of all the most interesting infographics.

The all-time master (so far) has to be Hans Rosling: it’s worth taking a break now and watching his TED presentation where he sets the record straight on widespread notions concerning the ‘developing world’. There is even more of his work over at gapminder.org.

In his historical graphs not only can you view a data subject over time, you can also compare it to other neighbouring data subjects. Plus, in the graphs at Gapminder, you can set your own parameters to achieve a very fine degree of tuning. It is impossible to play with this data without garnering some very interesting insights into how the world has developed over the last hundred and fifty years or so. Until very recently, for Hans to have been able to communicate this knowledge and information, which he presents in such an understandable and approachable way, would have required hours, days, weeks, even months, to assemble. Then there would be the time spent writing the book or making the film so that he could share the findings and insights with others.

Not only do we have the chance to make data sensible and easy to use, we can – through the application of Linked Data and various new applications – do it in relatively short periods of time.

What arises is not just a new and important channel of communication but the exciting possibility of a new art form. There will be a great need for more practitioners in this field with the creativity and talent to make huge swathes of data intelligible and useful.

In the same way that IT departments devolved into separate computer services and web services departments, and social media functions are now devolving further into a professional sector in their own right, I can see data representation becoming an entire skillset and profession as well.

Further reading:

  • This article from the Association for Computing Machinery was helpful in laying out the land and the terms involved.
  • Protovis is an open-source project from Stanford University. It is free and probably a good place to start building your own infographics.
  • Google Fusion Tables is another place you can try building something for free.
  • Nathan Yau (@yfd on Twitter) has been leading the way for some time with, for instance, the use of UN data to create a world progress report. There is a nice interview with him in the magazine for the School of the Art Institute in Chicago in which he lays out some of his thinking.