DataSift: New Frontiers in Search

Sarah Blow is a computer science graduate of Manchester University in England. Sarah is now working on bringing DataSift through its alpha testing phase in preparation for wider use. The filtering technology underpinning DataSift allows highly refined searches across a selection of social networking platforms.

Five years ago, in her spare time, Sarah founded Girl Geek Dinners to alleviate the isolation that many women may feel while working in what is still a male-dominated tech industry.

How did DataSift come to be?

“DataSift is pretty much the back end system that powers TweetMeme. We wanted to rebuild the engine because we knew we could do more with it. Rather than changing TweetMeme, we created this new brand called DataSift.”

Who is it for?

“The types of people we are bringing in are big financial services companies, anyone looking to do marketing and marketing analysis, agencies that are looking after brands and so on. Pretty much anything you can think of where you want to find deeper layers of information.

“While it is in alpha we are looking mainly for developers but with the mindset of bringing anyone in who is filtering and creating content. We want to find out what their needs are beyond the basics of what we’ve got there at the moment.

“We have basically asked, ‘What do people want to do with the system?’ And now we are looking at ways of packaging it so it can work in the right ways for different groups of users.”

What do people use it for?

“We’ve seen people use it for geo-targeting and geo-mapping content in order to find out about particular brands and track them. Also, people use it to find out where their users are based. They can also find the most influential users within their particular market.”

  • Use example: San Francisco 49ers

    “For TechCrunch Disrupt we demonstrated the capability of DataSift by using publicly available information from a San Francisco 49ers game that was happening that weekend.

    “There were three rules which we set up 24 hours in advance.”

    • Data Collection: “We first created a base rule. That was pulling in information from everyone at Candlestick Park: anyone who mentioned the name San Francisco 49ers and anyone who mentioned any of the players from the 49ers or the opposing team.

      “There was no geo data in that filtered stream.”

    • Geo-location: “From there we built a second rule on top. Taking all the output from the first rule we said, ‘Right, now we want anything in San Francisco.’ So, if someone has set their Twitter location to San Francisco we could pick them up, including their tweets about that particular subject, but only that subject.”

    • Geo-targeting: “Then we decided, that’s good, but it’s not perfect. What we really want to know is who is in the Park and really seeing the cool stuff. Can we manage to get some twitpics, for example, from inside the Park from people who were actually there?

      “The only way you can verify they are really in the Park is if they are actually geo-located in the Park. So we have an option on DataSift to do a single point and set a radius around it. We found the geo-target for Candlestick Park and set a one kilometre radius around it which pretty much covers that area and anyone just outside the stadium.

      “But it didn’t come back with much at all. It literally came back with one user and they hadn’t done any photographs. They had just tweeted that they were there.”

    That only one user matched the parameters that were set up is interesting in itself. It would have been reasonable to expect far more results. It is always worth remembering that what one assumes about a situation and what really happens may be two entirely different things.

    This is why better tools for drilling down into the data, and for refining and defining the results, are so important.

  • Use example: Starbucks

    “We were based in San Francisco at the time. So we tried a different exercise using DataSift where we basically said, ‘We want to find anyone in Starbucks who has got a PeerIndex score of over 40.’ Let’s see who the influential people were in San Francisco at that time and find out which Starbucks they are in today. That was a fun one to do.”
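The layering described in the two use examples, where each rule filters the output of the one before it, can be sketched in a few lines of Python. This is not FSDL and not how DataSift itself is implemented. The `Post` fields, the helper names, the sample posts and the approximate Candlestick Park coordinates are all assumptions made for illustration, and the geo rule is applied to the base rule's output because the interview does not say which stream it was built on. Detecting "in Starbucks" by a text mention is likewise a stand-in for whatever signal the real rule used.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Post:
    """Hypothetical, simplified view of a social media post."""
    text: str
    profile_location: str = ""                     # free-text location from the user's profile
    coords: Optional[Tuple[float, float]] = None   # (lat, lon), only if the post is geo-tagged
    influence: int = 0                             # stand-in for a PeerIndex-style score


Rule = Callable[[Post], bool]


def mentions_any(terms: Iterable[str]) -> Rule:
    """Base rule: keep posts whose text mentions any of the terms."""
    lowered = [t.lower() for t in terms]
    return lambda p: any(t in p.text.lower() for t in lowered)


def profile_in(city: str) -> Rule:
    """Geo-location rule: keep posts whose profile location names the city."""
    return lambda p: city.lower() in p.profile_location.lower()


def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def within_radius(centre: Tuple[float, float], km: float) -> Rule:
    """Geo-targeting rule: keep geo-tagged posts within km of the centre point."""
    return lambda p: p.coords is not None and haversine_km(p.coords, centre) <= km


def min_influence(threshold: int) -> Rule:
    """Keep posts whose author's influence score is over the threshold."""
    return lambda p: p.influence > threshold


def apply_rules(posts: Iterable[Post], *rules: Rule) -> List[Post]:
    """Keep only the posts that satisfy every rule."""
    return [p for p in posts if all(rule(p) for rule in rules)]


# Approximate coordinates for Candlestick Park (an assumption for this sketch).
CANDLESTICK = (37.7136, -122.3861)

# Made-up sample posts, not real data.
posts = [
    Post("Go 49ers!", "San Francisco, CA", (37.7140, -122.3860), 10),
    Post("49ers looking good today", "San Francisco", None, 55),
    Post("Watching the 49ers from home", "London", None, 20),
    Post("Lovely weather", "San Francisco", (37.7136, -122.3861), 70),
    Post("Coffee at Starbucks on Market St", "San Francisco", None, 45),
]

base = apply_rules(posts, mentions_any(["49ers"]))           # rule 1: data collection
in_city = apply_rules(base, profile_in("San Francisco"))     # rule 2: built on rule 1
in_park = apply_rules(base, within_radius(CANDLESTICK, 1.0)) # rule 3: one kilometre radius

# Starbucks-style query: a mention plus an influence score of over 40.
influential = apply_rules(
    posts, mentions_any(["starbucks"]), profile_in("San Francisco"), min_influence(40)
)
```

Because each stage only narrows the stream it is given, a post with no geo tag can never pass the radius rule, which is consistent with the very small result set described for the stadium.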

“You could do something similar with breaking news. If you knew there was a story breaking in a particular location, and you are a news organisation that wants to filter down to the legitimate sources who were actually in that location, using DataSift would certainly be one way of doing it.”

What is the next step?

“We are aiming to have a drag and drop interface, which we haven’t finished yet. Users who don’t necessarily have strong technical ability or an understanding of the technicalities don’t need to. They shouldn’t need that level of detail to use the system.

“The FSDL language that we have got there we only really expect to be used by developers. It is not really aimed at the general user. But while DataSift is in alpha we’ll teach the general users how to use it in case we take a bit longer doing the other side of it.”

The work being done by Sarah and the DataSift team promises to be a cutting-edge development in information retrieval. If you want to help with their alpha testing you can still sign up at DataSift.