Sindice: New Approach to Online Data Management

Sindice is a semantic web index that lets you access and leverage the “web of data”: the rapidly expanding number of websites that are semantically marked up, that is, tagged with RDF, RDFa, Microformats or Microdata, which identify online content as belonging to different categories.

This week Sindice, in partnership with Hepp Research and Openlink Software, launched Sindice Ltd, a new startup which will manage Sindice’s intellectual property, and oversee the commercial drive of its products.

Giovanni Tummarello is the CEO of Sindice, which originated at the Digital Enterprise Research Institute in Galway, Ireland. He explains how the web of data will revolutionise online data management, and how it is, “set to explode”, in the coming months. Once it does, he enthuses, the web of data, “all becomes a big graph which one can join with a single query”.

“Semantic mark-up is basically a markup that you put on the page to express what you have on the page, so if you have the name of a movie, because you are discussing that movie in a blog article for example, you might want to tag the title of the movie, the director of the movie, whatever data makes that page recognisable to a search engine for exactly what it is, which is a page which talks specifically about that movie.

“In a regular search engine, it’s just the keywords that are being searched, so you’re looking for the title of the movie, which could be, “The Blue Tomato”, but there are all sorts of pages which can contain these two words, for all sorts of reasons. On the other hand, if you put a mark up saying that it is a movie, you will be stating that you are talking about a movie.”
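To make the difference concrete, here is a minimal sketch of how a program might read such markup. The page fragment, the director's name and the `MicrodataParser` class are invented for illustration, and the parser handles only the simplest case of Microdata; it is not how Sindice itself processes pages.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: the first block is marked up as a movie,
# the last paragraph merely contains the same two words.
PAGE = """
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">The Blue Tomato</h1>
  <span itemprop="director">Jane Doe</span>
</div>
<p>I bought a blue tomato at the market.</p>
"""

class MicrodataParser(HTMLParser):
    """Collects the item type and itemprop values from a page fragment."""

    def __init__(self):
        super().__init__()
        self.item_type = None      # value of itemtype on the item
        self.current_prop = None   # itemprop whose text we are about to read
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            self.item_type = attrs.get("itemtype")
        if "itemprop" in attrs:
            self.current_prop = attrs["itemprop"]

    def handle_data(self, data):
        # Store the first non-blank text that follows an itemprop tag.
        if self.current_prop and data.strip():
            self.properties[self.current_prop] = data.strip()
            self.current_prop = None

def extract_item(html):
    parser = MicrodataParser()
    parser.feed(html)
    return parser.item_type, parser.properties

item_type, props = extract_item(PAGE)
print(item_type, props)
```

The unmarked paragraph about the market contributes nothing to the result, whereas a keyword search would match it just as readily as the movie page.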

As Giovanni points out, Sindice acts like a search engine for the 270 million or so pages which currently have semantic markup, but its real utility is greater than that: “OK, you can put in a keyword and search it, it’s fine, but that’s not really the point”.

“Sindice is basically a search engine which is not just a search engine. Really it’s an infrastructure for leveraging all the web data out there. We have 270 million pages or so at the moment; they are not normal web pages, they are only web pages which have semantic markup on them. What Sindice does is it has a very powerful engine that can correlate information from one website to another.

“You can basically use the entire web as if it is your playground by merging information here and there. You can get the name of a movie from a page which is marked up, and the name of the movie can be looked up on Wikipedia, where you can see what the director is, and then go on Rotten Tomatoes and get the rating, and all together it can be queried with a single query which goes all over the web and returns the information all ready to be consumed to enhance websites, anywhere where you want content aggregated from multiple sources on the web.

“Sindice provides these services where you can make these queries and combine content coming from all of the web of marked up data.”
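The idea of one query spanning several sites can be sketched with a toy graph of subject, predicate, object statements. Everything here is an assumption made for illustration: the identifiers, the sample data and the tiny pattern matcher. Sindice's own query engine and interface are not described in this article.

```python
# Hypothetical statements gathered from three different marked-up sites.
TRIPLES = [
    ("blog:post1", "mentions", "movie:blue-tomato"),   # a blog article
    ("movie:blue-tomato", "director", "Jane Doe"),     # an encyclopedia page
    ("movie:blue-tomato", "rating", "87%"),            # a review site
]

def match(pattern, triple, bindings):
    """Unify one pattern with one triple; terms starting with '?' are variables."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None   # variable already bound to something else
        elif term != value:
            return None       # constant does not match
    return new

def query(patterns, triples):
    """Return every set of variable bindings that satisfies all patterns."""
    results = [{}]
    for pattern in patterns:
        next_results = []
        for bindings in results:
            for triple in triples:
                extended = match(pattern, triple, bindings)
                if extended is not None:
                    next_results.append(extended)
        results = next_results
    return results

answers = query(
    [
        ("blog:post1", "mentions", "?movie"),
        ("?movie", "director", "?director"),
        ("?movie", "rating", "?rating"),
    ],
    TRIPLES,
)
print(answers)
```

The shared variable `?movie` is what joins the three sources: once every site refers to the same movie identifier, the data behaves like a single graph.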

The formation of the company is a sign that Sindice are ready to commercialise their technology. Sindice Ltd solidifies the “very important” partnerships with Hepp Research and Openlink Software, and also manages the intellectual property the partners now share, keeping it, “nice and tidy”, so that Sindice can seek further investment.

“There are two main markets we are pursuing, the first one is customised cloud hosted data spaces”, continues Giovanni.

“We’re going to be allowing people to have their own data spaces, we call them, and that comes for a price of course, it’s kind of data as a service.

“You want to have the data that comes from the web of data but you also want to have your own data, you also want to have your own correlation of the data. So you want subsets of Sindice data that need to be live and fresh, that need to be combined in the way that you want, to solve your problems.”

The second main service, which Giovanni describes as, “much more concrete and immediate” is something called Sindice site services.

Not yet available, this is, he says, “something that will basically appeal to anybody that has a website that they want to enhance with information coming from multiple websites at the same time. This is good for everybody, because the websites which are providing the information, they get traffic and they get links so it becomes a syndication network. That obviously has a value in terms of the possibility of sharing revenue and advertisement”.

Giovanni is confident that following the common approach taken by the three major search engines to create what they call, “a shared markup vocabulary”, the time is right for Sindice to capitalise on the expected flurry of markup activity in the near future.

“It’s exploding as we speak. Microsoft, Google and Yahoo! are telling everybody at the same time to do this! In search engine optimisation circles they are raving about this stuff, and everybody’s implementing it so there’s no alternative.

“This means that there will be a lot of people who want to do services on top of this, so if there’s a market, it’s right now.”

Weedle: Creating a Fabric of Credibility

Iain MacDonald started Weedle with the desire to provide a fundamentally better way of connecting people who have skills to people who need them within a trusted environment. Weedle employs 18 people at present, most of whom are mathematically and software orientated, and has users in over 160 countries. Not bad for a company that is only about a year old.

Weedle began when Iain needed to solve an arboreal problem at his home. He explains further, “I needed to find somebody who could help me cut down some trees at the bottom of my garden. I spent a bit of time asking my friends and my family if they knew somebody but unfortunately they didn’t. Then I went to Google and I found people who said they were very good tree surgeons but they were strangers to me. I found it very hard to trust them in terms of their competence, their value for money, their reliability or their trustworthiness.

“I was pretty sure that the right person was out there. That they were out there looking to connect with me and I’m looking for someone with a skill I need and I am trying to connect with them. But the reality is that it is currently very difficult for us to find each other.”

Iain designed Weedle so that people who are searching for someone with a skill can go beyond just depending on how the suppliers have presented themselves, either via advertising or simply having membership of a professional or trade body. Using social networks as a means of verification there is now a way to assess other factors such as reliability and suitability in terms of being able to work with them.

“When you are looking for a lawyer or someone like that, often when a friend recommends them to you and you end up going to the lawyer they can turn out to be not the sort of person you are looking for. We can circumvent that waste of time by being able to see all the details of the person who has the skill before you contact them.

“What we have is a fabric of credibility. Say, I go to Weedle and I make my skill page. In order to be found when someone searches for me there are a couple of things we take into account in the context of our search algorithm. So, the first one is the content of my skill page and whether it matches what someone would be searching for. The next element we take into account is who this person is actually connected to. We may have someone who has created a skill page and is connected to fifty people but they may be less credible than someone who has created a skill page that may be connected to only five people.”
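Iain does not say how Weedle weighs these two factors, so the following is only a guess at the general shape of such a ranking. The scoring formula, the function names and the sample data are all hypothetical; the one assumption carried over from the interview is that who you are connected to matters more than how many connections you have.

```python
def content_match(query, page_text):
    """Fraction of the query's words that appear in the skill page text."""
    words = query.lower().split()
    if not words:
        return 0.0
    page_words = set(page_text.lower().split())
    return sum(word in page_words for word in words) / len(words)

def credibility(connections, searcher_network):
    """Hypothetical measure: connections the searcher already knows."""
    return len(set(connections) & set(searcher_network))

def rank(query, pages, searcher_network):
    """Order skill pages by content match weighted by credibility."""
    def score(page):
        return content_match(query, page["text"]) * (
            1 + credibility(page["connections"], searcher_network)
        )
    return sorted(pages, key=score, reverse=True)

my_network = {"ann", "bob", "cara"}
pages = [
    {"owner": "tree-co", "text": "tree surgeon and garden work",
     "connections": [f"stranger{i}" for i in range(50)]},
    {"owner": "local-tom", "text": "tree surgeon",
     "connections": ["ann", "bob", "cara", "dan", "eve"]},
]
ranked = rank("tree surgeon", pages, my_network)
print([page["owner"] for page in ranked])
```

Under this assumed formula the page with five connections comes first, because three of them are already in the searcher's network while none of the other page's fifty are.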

How does Weedle compare to Facebook and Linkedin?

“Facebook is very good for communicating with your friends and Linkedin is very good for managing your white-collar network of contacts. But it is not so good if you are looking for a plumber or a carpenter.

“Even if you were looking for a corporate lawyer to float your company on the Nasdaq you’ll get a resume or some type of CV. What people are really looking for is what projects has he or she been involved in, what role did he play, how long did it take him to do it and to see examples of the expertise that he has.

“It’s not just about say, a yoga teacher who simply states they have worked in ABC Yoga for the last five years. On Weedle you will see; this is where they trained, this is where they worked and here is a video of her giving a yoga class. Here are photographs of the yoga studio, here is a list of ten people that are in your network that went to their class.”

How much of a role does Semantic Web technology have in your system?

“It’s really very significant: A lot of sites using search have gone down the hierarchical directory structure route. A person would have to pick from a drop-down box and choose ‘telecom industry’ and then ‘mobile telephony’ and then ‘mobile network.’ It’s very hierarchical and pigeon-holes people into specific positions.

“The particular benefit of using a semantic ontology is that we have no hierarchy to the classification of our user skills. If you go on to the site declaring that you are a carpenter then all you need to do is say “I am a carpenter.” We know that we need to present that search result in a population of search results generated when others search for terms like carpenter, woodworker or joiner.

“We can apportion levels of relevance to the skill pages we have versus search strings. Machine learning combined with Semantic Web technology creates a much better user experience.”
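The contrast with a drop-down hierarchy can be shown in a few lines. This sketch assumes a flat table of equivalent skill terms; the groups, names and the `search` function are invented for illustration, and a real ontology combined with machine learning would also assign degrees of relevance rather than a plain yes or no.

```python
# Hypothetical flat groups of equivalent skill terms; no parent/child levels.
SKILL_GROUPS = [
    {"carpenter", "woodworker", "joiner"},
    {"lawyer", "solicitor", "attorney"},
]

def expand(term):
    """Return every term treated as equivalent to the search term."""
    term = term.lower()
    for group in SKILL_GROUPS:
        if term in group:
            return group
    return {term}

def search(term, skill_pages):
    """Find owners whose declared skill matches the term or an equivalent."""
    wanted = expand(term)
    return [owner for owner, skill in skill_pages.items()
            if skill.lower() in wanted]

skill_pages = {"Ann": "carpenter", "Bob": "plumber", "Cara": "Joiner"}
print(search("woodworker", skill_pages))
```

Someone who only ever typed “carpenter” is still found by a search for “woodworker”, and nobody had to pick a category from a tree.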

The underlying idea that determines credibility and trustworthiness both offline and online is social proof. Any claim you may make about yourself personally or professionally is validated, or not, against how you are perceived by your social network. Professional bodies may declare you competent and award you some sort of certification and send you out the door to ply your trade but it is how you handle your day-to-day dealings that really counts for most people.

Iain has come up with a system that allows you to access the layers of social trust that surround us all and enables access to the sort of vital information on someone that would only normally become available over time and after, possibly, a number of encounters.

Henry Story And WebID

Henry Story was until recently a Social Web Architect at Sun Microsystems. Previously, he worked on Babel Fish, a machine translation service at AltaVista. The babel fish was a small creature featured in “The Hitch-Hiker’s Guide to the Galaxy”. When placed in the ear, it could translate all known languages. The author of the book, Douglas Adams, was also involved in the project.

Henry is the creator of WebID, and on a recent visit to 091 Labs in Galway, Ireland, he took some time out to tell us more about it.

Why is WebID important?

“Currently social networks are closed systems. You have to be part of a social network to friend or communicate with anybody on that social network.

“This isn’t the case for telephones. You notice the oddness of this when you start thinking about previous technologies. You can have a telephone number from any company. You can call anybody in the world in whatever telecommunications network they are part of. You don’t even know what provider people are using. The same with e-mail. You can e-mail anyone, anywhere in the world. It’s a cross-organisational method of communication.

“We have this situation with social networking. You can’t leave your social network. You can’t make a friend on another social network. You are tied. Whenever you want to communicate with someone you have to join their social network or you have to convince them to join your social network.

“So there’s a centralisation process. But centralisation creates lots of problems. If a social network goes down, and one third of the social networks have disappeared completely, then all your work and all your relationships disappear too.

“Also, Facebook can’t provide for all the different needs of social networking. What we are really looking for is how we can create a distributed social web and that’s what WebID solves.”

So how does it do that?

“WebID is inspired by OpenID. OpenID allows you to type the URL of essentially what should be your homepage and use that to create an account. You then use that to log into any provider using one password which you never send to the relying parties – the people you are trying to log into. The only person who knows your password is your OpenID provider.

“OpenID was initially inspired by the Friend of a Friend (FOAF) project. The FOAF project allows me on my homepage to describe who my friends are and link to them in the Semantic Web way. Your Facebook is marked up with something like microformats. A successor to microformats is RDFa, and that would allow you to link to your friends. All your friends would have a profile on their server and you could link to them via their profile and say that you know them.

“That’s very nice because they can describe who they are, where they are, what their interests are on their page and keep that up to date. You can always be up-to-date with their information because all you have to do is have a robot fetch that information, read it and give you the latest version.

“The problem with the semantic linking of profiles is mainly that it is completely open. The information is visible to everybody. There are people who have a bit of a problem with that. People want a bit of privacy, a bit of intimacy, so they can develop new ideas and play around without having to think about the critical eye of society.”

You were working on developing a protocol for an address book to solve this problem when you made a remarkable discovery. What was that?

“It turns out, amazingly enough, that HTTPS has all that built in.”

HTTPS allows for secure transactions over the Web by means of cryptography and the use of digital certificates. A certificate binds a public key to the identity of its holder, who proves that identity by showing possession of the matching private key.

“In a usual HTTPS session you connect to a site and the site through cryptography tells you who it is.

“In the X.509 certificate, there is a subject alternative name ‘field’ that was placed in the certificate. Nobody has ever used it, so we just put the WebID in there. So when you click on the certificate, the certificate is sent with your WebID to the server. The server then fetches the document at that WebID and it verifies that you own the private key of the public key with which you just authenticated to the service. So in two HTTPS connections you get to do exactly what OpenID does [in seven] and in a web friendly manner.”
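The server-side check Henry describes can be outlined as follows. This is a simplified sketch, not the WebID specification: the certificate is modelled as a plain dictionary, the profile lookup is an in-memory table standing in for the second HTTPS request, and the URIs, key strings and function names are all invented for illustration.

```python
# Hypothetical stand-in for the web: profile documents keyed by WebID URI,
# each listing the public keys its owner claims.
PROFILES = {
    "https://example.org/henry#me": {"public_keys": {"key-AAA"}},
}

def fetch_profile(webid):
    """Pretend to dereference the WebID (the second HTTPS connection)."""
    return PROFILES.get(webid)

def verify_webid(client_cert):
    """Return the WebID if its profile lists the certificate's public key.

    Assumes the TLS layer has already proved that the client holds the
    private key matching client_cert["public_key"] (the first connection).
    """
    webid = client_cert.get("subject_alt_name")
    if webid is None:
        return None
    profile = fetch_profile(webid)
    if profile is None:
        return None
    if client_cert["public_key"] in profile["public_keys"]:
        return webid
    return None

genuine = {"subject_alt_name": "https://example.org/henry#me",
           "public_key": "key-AAA"}
forged = {"subject_alt_name": "https://example.org/henry#me",
          "public_key": "key-ZZZ"}
print(verify_webid(genuine), verify_webid(forged))
```

Anyone can write any WebID into a certificate they generate themselves; the claim only holds if the document at that address, which the real owner controls, lists the same public key.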

Henry will be speaking at Open Coffee Galway this Friday at 11 AM.

Interview (Part 2): Nova Spivack On The Fragmentation Of The Semantic Web

In the first part of our interview with Nova Spivack, we talked about the struggle for dominance between Facebook, Google and Microsoft. In the second part, we discuss the current state of Semantic Web technology. Nova started Twine with the intention of it being the first consumer Semantic Web application.

Semantic Web technologies hold the promise of delivering a web of Linked Data where information is understandable to computers. So rather than simply moving data around, machines can derive meaning from the requests made of them, and return searches and so on that are more relevant to us than what we are used to at present.

Nova claims that Twine had potential, “The next version, had we been allowed to finish it, would have been a candidate for a killer app. It would have provided a social plus semantic search engine. Basically, the kind of thing that Facebook will probably build in the future.

“We had a system where people could say what they liked. They could share things, they could join communities around interests. We were also indexing the Web, pulling out the semantics, building a faceted semantic index, and combining all that data so that people could say – “here’s what I like” – and basically build their own semantic search engines with each other.”

But that was not to be. Unfortunately, when the time came to raise the next series of funding that would have enabled Nova and his team to really build out the project, the recession hit and Twine subsequently ended up being sold to Evri.

But the vagaries of economic cycles are not the only forces at work that hinder the widespread adoption of Semantic Web technologies. There is an issue with the technologies themselves.

Nova states, “They are technically difficult to understand and use. It’s hard for mainstream developers to really adopt them. There are still a lot of challenges to building applications that use those technologies because of the relative lack of commercial grade tools, particularly around storage. Triple stores have largely been R&D technologies, not really mainstream, primetime-ready products. Traditional databases don’t really handle triples in a performant way. The semantic data storage issue continues to be a huge bottleneck for the whole field.

“You can’t build big semantic applications if you can’t store and manipulate large graphs in an efficient way. It’s one thing if you can wait a week to load your triples into your triple base and then process them. That’s fine, but most real applications need to do that every hour.”
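One way to see why traditional databases struggle with triples is to store a few in a single relational table and follow a path through them. The sketch below uses Python's built-in `sqlite3` module; the table layout and sample data are assumptions made for illustration, not a description of any product Nova mentions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("movie:1", "title", "The Blue Tomato"),
        ("movie:1", "director", "person:1"),
        ("person:1", "name", "Jane Doe"),
    ],
)

# Each extra hop in the graph adds another self-join on the same table.
rows = conn.execute(
    """
    SELECT t1.o AS title, t3.o AS director_name
    FROM triples t1
    JOIN triples t2 ON t2.s = t1.s AND t2.p = 'director'
    JOIN triples t3 ON t3.s = t2.o AND t3.p = 'name'
    WHERE t1.p = 'title'
    """
).fetchall()
print(rows)
```

Finding a movie's title and its director's name already takes two self-joins over one table. With three rows that is trivial, but every additional hop in a query adds another join against the same, ever larger, table.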

Such unwieldy technologies seem to be collapsing under their own weight, and the only practical and expedient way to move forward is by fragmenting the space and using Semantic Web developments in an ad-hoc, piecemeal, and proprietary form.

In a recent article on this site, Open Graph: A Cosy Corner Of The Web?, we indicated that this would be a very likely outcome. We pointed out that the Open Graph Protocol which Facebook uses is very much a subset of the technological power available.

Nova remarks, “I think semantic [the Semantic Web] is no longer a frontier. I think, possibly, it is a missed opportunity, but I think it will surface in various ways, probably non-ideal ways, on Google, Facebook and Bing.

“I think that the big search engines could still do it, but they’ve made their bets. Google with Metaweb, which is not compliant with the Semantic Web. Microsoft with Powerset, which is not compliant with the Semantic Web. And Facebook, at the moment, is doing non-compliant things as well. Facebook is probably not storing things in a triple form. They’re probably using some other mechanism. It may not have the complexity of a true graph.”

It may be that the vision of a Semantic Web of Linked Data is fading somewhat. So what can be done to resurrect the dream?

Nova goes on to say, “They’ve [Facebook, Google, Microsoft, and so on] built or bought themselves into positions which don’t indicate that they’re going to be using W3C standards. I think that may be deliberate because by doing proprietary things you get advantages over competitors. But it’s old-fashioned thinking. I think the benefits of being able to leverage a truly open semantic graph – based on open standards and well thought-out semantics that can eventually support reasoning – would enable a generation of smarter applications across the web, and everyone would benefit from that.

“I don’t think semantics can be hoarded. I don’t think they should be hoarded. I think semantics only work when they are shared.

“If your e-mail protocols don’t work with anyone else’s, you really can’t win the e-mail wars. Similarly, with semantics. If your semantics don’t work with anyone else’s semantics, you can’t win the semantic wars.”

Nova Spivack also spoke at the BlogTalk event held in Cork in 2008. BlogTalk is returning to Ireland on 26-27 August; check out the speaker list.