A reflection on the Semantic Web
The Semantic Web is about classifying the information on webpages. When a search engine crawls a webpage, it can’t do much more than index the words and gather other data to ensure that when someone searches for a combination of words, the search hits are ranked in order of relevance to that particular person.
I don’t know about other search engines, but Google does try to go beyond this, for example on Google News, where the searches are restricted to current news articles. But this is achieved by someone configuring which sites are news sites – which with the Semantic Web would be detected automatically.
There’s also Google Glossary, which can determine which parts of webpages define terms, so that when you search for something, the search hits are pages that in Google’s opinion define the meaning of that word – and it works amazingly well.
However, with the Semantic Web, you would define “metadata” for your published information, classifying each element – for example, “this is a news article written by John Doe on 2002-06-11 09:21 +0200 written in English” (with reservations about whether the timestamp format follows the standard) or “this is an English glossary entry for the word ‘panda’”.
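To make the idea concrete, here is a minimal sketch in Python of what such metadata could look like once a program has it in hand. The `Metadata` class, its field names, and the `find` helper are all made up for illustration; they are not taken from any Semantic Web standard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Metadata:
    """Hypothetical description of one published element."""
    kind: str                            # e.g. "news-article" or "glossary-entry"
    language: str                        # language code such as "en"
    author: Optional[str] = None
    published: Optional[datetime] = None
    term: Optional[str] = None           # the word a glossary entry defines


# The two examples from the text, written as structured data.
news = Metadata(
    kind="news-article",
    language="en",
    author="John Doe",
    published=datetime.fromisoformat("2002-06-11T09:21:00+02:00"),
)
glossary = Metadata(kind="glossary-entry", language="en", term="panda")

documents = [news, glossary]


def find(items, **criteria):
    """Return the items whose fields equal every given criterion."""
    return [m for m in items
            if all(getattr(m, key) == value for key, value in criteria.items())]


# Ask for "an English glossary entry for the word 'panda'".
matches = find(documents, kind="glossary-entry", language="en", term="panda")
print(matches)
```

The point is only that once the classification is explicit, the search becomes a simple comparison of fields instead of guesswork based on the words on the page.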
If services on the Internet classified their information semantically, and there were one service providing weather information for the entire world, one providing information about parks in the major cities of the world, several shops listing their inventory online, and finally a super-Google that knows how to deal with the Semantic Web metadata, you could search for “a day in the forthcoming week which would be good for a picnic (i.e., no rain) in a park in Stockholm where you can buy the ingredients for mozzarella and tomato paninis within 500 meters”.
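As a rough sketch of how such a query could be answered once the data is classified, the Python below combines three small data sets. Everything in it is an assumption made for illustration: the forecast, the park and shop names, the coordinates, the ingredient list, and the `picnic_options` function are invented, and real services would of course look nothing like three hard-coded lists.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Place:
    name: str
    lat: float
    lon: float


def distance_m(a, b):
    """Great-circle distance in meters between two places (haversine formula)."""
    earth_radius_m = 6_371_000
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_m * asin(sqrt(h))


# Made-up answers from the three imagined services.
forecast = {"Mon": "rain", "Tue": "sunny", "Wed": "rain"}
parks = [Place("Park A", 59.3400, 18.0500), Place("Park B", 59.3000, 18.1000)]
shops = [
    (Place("Shop 1", 59.3410, 18.0510), {"mozzarella", "tomato", "bread"}),
    (Place("Shop 2", 59.3005, 18.1005), {"bread"}),
]
needed = {"mozzarella", "tomato", "bread"}  # assumed panini ingredients


def picnic_options(forecast, parks, shops, needed, max_m=500):
    """Pair every rain-free day with every park that has all ingredients nearby."""
    dry_days = [day for day, weather in forecast.items() if weather != "rain"]
    options = []
    for park in parks:
        # Pool the stock of all shops within walking distance of the park.
        nearby_stock = set()
        for shop, stock in shops:
            if distance_m(park, shop) <= max_m:
                nearby_stock |= stock
        if needed <= nearby_stock:
            options.extend((day, park.name) for day in dry_days)
    return options


print(picnic_options(forecast, parks, shops, needed))
```

The hard part is not this logic but getting every service to publish its data in a form that can be combined this way, which is exactly what the metadata is for.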
Walking to work this morning, I realized that HTML was, from the beginning, about semantically classifying information – although not at the level of detail that the Semantic Web is about. From the beginning, HTML was intended to mark up text and leave the formatting to the receiver – “this is the headline, this is a paragraph, here’s another paragraph, this word should be emphasized, this is a code snippet, this is a citation,” and so on.
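A small sketch shows how much a program can already read out of plain structural markup. It uses `HTMLParser` from Python's standard library; the `ROLES` table, the `RoleExtractor` class, and the sample page are made up for this example, and the parser is deliberately naive (it only looks at the innermost open tag).

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML elements to the roles listed above.
ROLES = {
    "h1": "headline",
    "p": "paragraph",
    "em": "emphasis",
    "code": "code snippet",
    "cite": "citation",
}


class RoleExtractor(HTMLParser):
    """Collect (role, text) pairs based on the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, innermost last
        self.found = []   # (role, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the tag matches; malformed markup is ignored.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] in ROLES:
            self.found.append((ROLES[self.stack[-1]], text))


page = "<h1>A headline</h1><p>This word is <em>important</em> here.</p>"

parser = RoleExtractor()
parser.feed(page)
parser.close()
print(parser.found)
```

Because the markup says what each piece of text is, rather than how it should look, the receiver is free to decide on the formatting, or to skip formatting entirely and index the content.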
Instead, most webpages are like this: “text text text, newline newline, italic text with this font, bold text with that font” nested in 10 tables with lots of invisible images to ensure it looks proper in the popular browsers. Had the creators of HTML anticipated that webpage owners would want control over the layout, they would have made CSS part of HTML from the beginning, so that information and format would be separated from each other.
It will be interesting to see what happens with the Semantic Web effort, and whether people will start separating content from design when (and if) it catches on. (I try to do that with this site, but there’s still work to do.)