Forums, the Semantic Web, and SEO (Version 2)

Why add semantic metadata to forums?

  • To increase the relevance of search results
  • To make search results more useful (embedding information on the numbers of posts or authors in a discussion, the last post date, ratings, topics, etc.)
  • To help link together topics across forums
  • To collect a person’s contributions across sites
  • To go beyond what is possible with RSS (limited items, cannot fully describe forum structures)

Search engines are starting to identify structures

Google have implemented some message board parsing algorithms to determine how many posts are on a thread, how many users posted on that thread and when the last post was made. This is very useful when you are deciding which forum result to click on, because you can see which thread is the freshest and which has the most posts by distinct authors. This can be seen in the search result for ‘irish pubs’ shown below. The feature is not complete, and probably relies on identifying certain HTML structures for non-Google discussion sites, e.g. there is a blog discussion and a forum thread in the middle of the results that do not display the total posts or commenters. However, this is moving towards the Semantic Web vision of providing more metadata about objects (in this case discussions) on the Web to help you find more relevant information. I propose that forum software incorporate these structure definitions by default, targeting (and contributing to) recent metadata recommendations from Google Rich Snippets and Yahoo! SearchMonkey.

What is the Semantic Web (in brief)?

The Semantic Web, also called the Web of Data, is an extension of the current Web in which pages are augmented with metadata that identify the types of things (called classes) described in pages on the Web and their properties (e.g. a Person class could have properties like name or location). The Semantic Web allows us to create a Web of Data as opposed to a Web of Documents. Instead of having a lot of pages linked together with hyperlinks that don’t describe the relationships between these pages, we can have various types of things described and linked together (e.g. a Forum for Music may be the container_of a Post about Lily Allen). These classes and properties are usually described in data schemas or “vocabularies”. Two popular vocabularies for social websites are FOAF (for describing personal profiles and social networks) and SIOC (for describing online community content such as discussions in forums, blogs, etc.).

RDF is the main Semantic Web format used for representing metadata. RDFa is a means of embedding semantic metadata or microcontent within HTML. RDFa metadata from one page may be standalone or it can be combined with that from another page or site. For example, metadata on John’s page may say that he knows Mary and Mike. Mike and Mary may have metadata on their pages about where they live, the posts that they have written and the topics that these posts are on. Combining statements from pages by John, Mary, Mike and John’s other friends, we can answer questions for John like “show me posts from my friends in Galway talking about food”.
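
To make the idea of combining statements more concrete, here is a minimal sketch in Python. It is not a real RDF toolkit: statements are plain (subject, property, value) tuples, the identifiers and data for John, Mary and Mike are invented for illustration, and places and topics are simplified to plain strings where real FOAF and SIOC data would normally point to resources.

```python
# Each statement is a (subject, property, value) triple, as in RDF.
# All identifiers and data below are made up for illustration.
johns_page = [
    ("john", "foaf:knows", "mary"),
    ("john", "foaf:knows", "mike"),
]
marys_page = [
    ("mary", "foaf:based_near", "Galway"),
    ("post1", "sioc:has_creator", "mary"),
    ("post1", "sioc:topic", "food"),
]
mikes_page = [
    ("mike", "foaf:based_near", "Dublin"),
    ("post2", "sioc:has_creator", "mike"),
    ("post2", "sioc:topic", "food"),
]


def merge(*pages):
    """Combine the statements found on several pages into one graph."""
    merged = set()
    for page in pages:
        merged.update(page)
    return merged


def subjects(graph, prop, value):
    """All subjects that have the given property with the given value."""
    return {s for (s, p, v) in graph if p == prop and v == value}


def values(graph, subject, prop):
    """All values of the given property for the given subject."""
    return {v for (s, p, v) in graph if s == subject and p == prop}


def posts_from_friends(graph, person, place, topic):
    """Posts on a topic written by friends of `person` based near `place`."""
    friends = values(graph, person, "foaf:knows")
    local_friends = friends & subjects(graph, "foaf:based_near", place)
    on_topic = subjects(graph, "sioc:topic", topic)
    return sorted(
        post for post in on_topic
        if values(graph, post, "sioc:has_creator") & local_friends
    )


graph = merge(johns_page, marys_page, mikes_page)
result = posts_from_friends(graph, "john", "Galway", "food")
print(result)
```

No single page holds enough information to answer the question; only the merged graph links John's friends to their locations and posts.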

Search engines and the semantic structures they recommend

Search engine companies have recently started to publish recommended vocabularies [data schemas] (Yahoo!, Google) so that site owners and publishers can use these to annotate their content with microcontent / metadata using RDFa or microformats. For example, profiles of people can be annotated with semantic tags identifying names, addresses, etc. The result is that more relevant content will be displayed in search results, and the results themselves become more visually appealing (by using metadata to show people or organisations on a map, to display stars for review ratings, etc.), thereby encouraging clickthroughs.

Yahoo! SearchMonkey has published a list of recommended vocabularies [data schemas] (including FOAF, GoodRelations, hReview, SIOC, vCalendar and vCard) that publishers can use to create structured data and thereby drive more traffic to their sites.

Google’s ‘Rich Snippets’ initiative (introduced in May 2009) has a similar aim, albeit using Google’s own RDF vocabularies rather than popular existing ones like FOAF. Rich Snippets also promotes the use of the hReview, hProduct and hCard microformats for annotating reviews, products and people or organisations respectively.

There are obvious SEO implications in providing metadata for user-generated content, especially as it accounts for a majority of what we are now accessing on the Web. As search engines settle on a set of metadata terms that they will support, SEO experts will have to ensure that metadata corresponding to these terms is being properly generated by the websites that they are in charge of.

Providing RDFa-enabled HTML templates for popular social software applications, with metadata relevant for search results (and leveraging experience gained from creating RDF/XML exporters for WordPress, vBulletin, etc.), would be a very important step towards the formation of “Web 3.0” (Web 2.0 with added semantics). The RDFa in Drupal code sprint was one of the first efforts to do this for a widely-used social software application. There are good reasons why something similar should be done for vBulletin, phpBB, WordPress, etc.

Drupal’s provision of semantic metadata

Drupal creator Dries Buytaert wrote a very interesting post in October 2008 entitled ‘Drupal, the semantic web and search’ in which he said:

On a social networking site built with Drupal, [semantic technology] opens up the possibility to do all sorts of deep social searches – searching by types and levels of relationships while simultaneously filtering by other criteria. I was talking with David Peterson the other day about this, and if Drupal core supported FOAF and SIOC out of the box, you could search within your network of friends or colleagues. This would be a fundamentally new way to take advantage of your network or significantly increase the relevance of certain searches.

The structured data that is available in many Drupal deployments (but is difficult to leverage due to HTML representations) can be made available as RDFa through modules created for Drupal 6 and through recent additions to the core of Drupal 7 (now in testing mode). The possibilities are exciting, as Dries has described further. A video was created to demonstrate some deep searches of Drupal RDFa data using Yahoo! SearchMonkey, and it also showed some visual navigations of this linked data.

What do we need for forums and bulletin board systems?

I think forum and bulletin board software needs to see some Semantic Web developments now that Yahoo! SearchMonkey and Google Rich Snippets are already indexing semantic content from the Web. If recent developments are anything to go by, you can be sure that Bing will go the same way.

Drupal is about to get the jump on vBulletin and phpBB by becoming one of the first forum (content) systems to have semantic markup out of the box, allowing search engines to know what is a post, a reply, a user / person, a topic, etc. And it’s not just for the last 15 posts either à la RSS, but for all site content that can have embedded metadata through the corresponding HTML pages. On October 19th, Dries Buytaert committed RDFa additions to Drupal 7 core into the code repository (Twitter announcement).

I propose that when vBulletin 4 comes out, someone grabs the default set of templates, and marks them up with RDFa just as was done for Drupal. Then, whoever wants to benefit from having semantic markup identifying the posts, replies, users, topics and forums on their multi-forum website can do so. I’ve included a picture of what could be marked up for a thread.

What kind of pages could be marked up, and what would this enable?

Thread pages could be marked up with information about the number of authors, the number of posts/items, the last reply date, the rating, the topics (using tags), etc. This could all be shown in a search result (as Google are doing for some discussions already).
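
As a rough illustration of what a marked-up thread template might emit, the sketch below builds an HTML fragment with RDFa attributes using SIOC terms. The function name `thread_rdfa`, the example URL and the sample values are hypothetical, the use of `dcterms:title` for the thread title is my own choice, and I am not claiming that any search engine consumes exactly this markup.

```python
from html import escape

SIOC_NS = "http://rdfs.org/sioc/ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"


def thread_rdfa(url, title, num_authors, num_items, last_reply_date):
    """Return an HTML fragment describing one thread with RDFa attributes.

    Hypothetical helper for illustration; a real forum template would
    produce this markup from its own template variables.
    """
    return "\n".join([
        f'<div xmlns:sioc="{SIOC_NS}" xmlns:dcterms="{DCTERMS_NS}"',
        f'     about="{escape(url)}" typeof="sioc:Thread">',
        f'  <h1 property="dcterms:title">{escape(title)}</h1>',
        f'  <span property="sioc:num_authors">{int(num_authors)}</span> authors,',
        f'  <span property="sioc:num_items">{int(num_items)}</span> posts,',
        '  last reply on <span property="sioc:last_reply_date"',
        f'      content="{escape(last_reply_date)}">{escape(last_reply_date)}</span>',
        '</div>',
    ])


# Example values are invented.
markup = thread_rdfa(
    url="http://forum.example.org/thread/42",
    title="Best Irish pubs?",
    num_authors=3,
    num_items=7,
    last_reply_date="2009-10-19T12:30:00Z",
)
print(markup)
```

The visible text stays readable for people, while the `property` attributes tell a crawler which number is the author count and which is the post count.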

Forum pages could have summaries of their posts, the last post date, the main topics, titles, descriptions, etc. This could be shown in a search result to help you decide where to talk about a particular topic.

Annotating pages about people could allow you to, as Dries mentioned above, search for relevant content from your friends or colleagues. User profile pages in forums could be marked up with relevant properties identifying homepages of people, names, projects, interests, etc.

By editing a few template files that are in use across hundreds or even thousands of discussion websites, the potential impact for search and data reuse is huge.

To this end, I’ve added some new statistical / summary properties to the SIOC vocabulary to aid with marking up forums and bulletin boards: num_authors, num_items, num_threads, last_item_date, last_reply_date. I believe we have a fairly complete set of terms that can be used to annotate forum content for the Semantic Web.
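
As a sketch of how forum software might compute some of these summary values for a single thread, consider the following. The post structure (dictionaries with `author` and `created` keys) and the function name `thread_summary` are assumptions made for this example, as is my reading that last_item_date covers every post in the thread while last_reply_date covers only the posts after the thread starter.

```python
from datetime import datetime


def thread_summary(posts):
    """Compute SIOC-style summary values for one thread.

    `posts` is a list of dicts with "author" and "created" (datetime) keys;
    this shape is hypothetical, not taken from any particular forum package.
    The earliest post is treated as the thread starter, the rest as replies.
    """
    if not posts:
        raise ValueError("a thread needs at least one post")
    ordered = sorted(posts, key=lambda post: post["created"])
    replies = ordered[1:]
    return {
        "sioc:num_authors": len({post["author"] for post in ordered}),
        "sioc:num_items": len(ordered),
        "sioc:last_item_date": ordered[-1]["created"].isoformat(),
        "sioc:last_reply_date": (
            replies[-1]["created"].isoformat() if replies else None
        ),
    }


# Sample data is invented.
posts = [
    {"author": "alice", "created": datetime(2009, 10, 1, 9, 0)},
    {"author": "bob", "created": datetime(2009, 10, 2, 9, 0)},
    {"author": "alice", "created": datetime(2009, 10, 3, 9, 0)},
]
summary = thread_summary(posts)
print(summary)
```

These values could then be written into the thread page as RDFa properties, so that they are available for every thread rather than only the most recent items.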

I look forward to your comments!