I’ve been thinking about how the explosion of user generated content that has characterized web 2.0 could be made more useful by adding structure – i.e., meaning = data + structure.
The obvious way that structure can be added to user generated content is by asking users to do it – user generated structure.
There are at least four ways that I can think of to get at user generated structure.
Tagging is the first approach, and its use has been endemic to web 2.0. Sometimes tagging is limited to the author of the content; other times any user can add tags, creating a folksonomy. Most if not all social media companies employ some form of tagging, including Flickr for photos, Stylehive for fashion and furniture (Stylehive is a Lightspeed company) and Kongregate for games. Tagging is a great first step, but it has well known limitations, as Wikipedia notes:
Folksonomy is frequently criticized because its lack of terminological control makes it more likely to produce unreliable and inconsistent results. If tags are freely chosen instead of taken from a given vocabulary, synonyms (multiple tags for the same concept), homonymy (same tag used with different meanings), and polysemy (same tag with multiple related meanings) lower the efficiency of indexing and searching of content. Other causes of inaccurate or irrelevant tags (also called meta noise) are the lack of stemming (normalization of word inflections) and the heterogeneity of users and contexts.
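The synonym problem Wikipedia describes is easy to see in code. Here is a minimal sketch (not any site’s actual implementation) of a free-text tag index, showing how synonymous tags fragment search results:

```python
# Illustrative sketch of a naive folksonomy index: tag -> item ids.
# Site names and item ids are made up for the example.
from collections import defaultdict

index = defaultdict(set)

def tag_item(item_id, *tags):
    for tag in tags:
        # Lowercasing helps a little, but without a controlled
        # vocabulary, synonyms still land in separate buckets.
        index[tag.strip().lower()].add(item_id)

# Three users tag the same kind of photo with synonymous tags.
tag_item("photo1", "NYC")
tag_item("photo2", "New York")
tag_item("photo3", "newyork")

# A search on any one tag misses two-thirds of the relevant items.
print(index["nyc"])  # {'photo1'}
```

Stemming, synonym rings, or a suggested vocabulary can mitigate this, but each one trades away some of the freedom that makes tagging so low-friction in the first place.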
The second approach is to solicit structured data from users. Examples of sites that do this include wikihow (which breaks down each how to entry into sections such as Introduction, Steps, Tips, Warnings and Things You’ll Need), CitySearch (which asks you for Pros and Cons and for specific ratings on dimensions such as Late Night Dining, Prompt Seating, Service and Suitability for Kids) and Powerreviews (which powers product reviews at partner sites that prompt for Pros, Cons, Best Uses and User Descriptions, including both common responses as check boxes and a freeform text field with autocomplete).
This can be a powerful tool for adding structure to data, but as one of the commenters on my last post points out:
From a UGC perspective, site administrators can force structure by requiring every site contribution to have a parent category, or descriptive tags. The problem is that the more obstacles you put in place before content can be submitted, the less participation you are going to get.
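The structured prompts described above amount to imposing a schema on contributions. A sketch of the kind of record a Pros/Cons-style review form implies (field names here are illustrative, not any site’s actual data model):

```python
# Hypothetical schema for a structured product review, in the spirit
# of the Pros / Cons / Best Uses prompts described above.
from dataclasses import dataclass, field

@dataclass
class ProductReview:
    pros: list = field(default_factory=list)       # check boxes plus free text
    cons: list = field(default_factory=list)
    best_uses: list = field(default_factory=list)
    rating: int = 0                                # e.g. 1-5 stars

review = ProductReview(
    pros=["Durable"], cons=["Heavy"], best_uses=["Travel"], rating=4
)
print(review)
```

Every required field is a point of friction, which is exactly the participation trade-off the commenter describes.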
The third approach is the semantic web. The core idea is to create the metadata describing the data, which will enable computers to process the meaning of things. Once computers are equipped with semantics, they will be capable of solving complex semantic optimization problems. For example, as John Markoff describes in his article, a computer will be able to instantly return relevant search results if you tell it to find a vacation on a 3K budget.
In order for computers to be able to solve problems like this one, the information on the web needs to be annotated with descriptions and relationships. Basic examples of semantics consist of categorizing an object and its attributes. For example, books fall into a Books category where each object has attributes such as the author, the number of pages and the publication date.
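The books example above can be sketched as a typed record: once the category and attributes are explicit, software can answer questions about the object rather than just display it. (All sample values below are made up.)

```python
# Illustrative only: an object annotated with a category ("type")
# and typed attributes, so a program can reason about it.
book = {
    "type": "Book",
    "attributes": {
        "author": "Jane Doe",          # hypothetical sample data
        "pages": 246,
        "publication_date": "1999-09-01",
    },
}

def describe(obj):
    """Answer a simple question once the semantics are explicit."""
    attrs = obj["attributes"]
    return f"A {obj['type']} by {attrs['author']}, {attrs['pages']} pages."

print(describe(book))
```

The hard part, of course, is not the data model – it is getting millions of page authors to agree on it and fill it in.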
Ideally, each web site creator would use an agreed format to mark up the meaning of each statement made on the page, much as they mark up the presentation of each element of a webpage in HTML. In a subsequent article, Iskold also notes some of the challenges with a bottom-up approach to building the Semantic Web, which can be summarized at a high level as “it’s too complicated” and “no one wants to do the work”.
The fourth approach to user generated structure is to build a central authority of meaning. Metaweb appears to be trying to do this with Freebase, a sort of “Wikipedia for structured data” which describes itself as follows:
Freebase is an open database of the world’s information. It’s built by the community and for the community – free for anyone to query, contribute to, build applications on top of, or integrate into their websites.
Already, Freebase covers millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available via an open API. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.
By structuring the world’s data in this manner, the Freebase community is creating a global resource that will one day allow people and machines everywhere to access information far more easily and quickly than they can today.
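To make the Freebase idea concrete: querying a structured store like this is query-by-example, where you describe the shape of the answer and leave blanks for the fields you want filled in. The sketch below is a toy in-memory illustration of that idea, not Freebase’s actual API; the records and type names are made up in its style.

```python
# Toy query-by-example over structured topics, in the spirit of a
# structured database like Freebase. None means "fill this field in".
topics = [
    {"type": "/film/film", "name": "Blade Runner", "directed_by": "Ridley Scott"},
    {"type": "/film/film", "name": "Alien", "directed_by": "Ridley Scott"},
    {"type": "/music/artist", "name": "Vangelis"},
]

def query_by_example(query, data):
    """Return a filled-in copy of the query for every matching record."""
    results = []
    for record in data:
        # A record matches if it agrees on every non-None field.
        if all(record.get(k) == v for k, v in query.items() if v is not None):
            results.append({k: record.get(k) for k in query})
    return results

films = query_by_example(
    {"type": "/film/film", "directed_by": "Ridley Scott", "name": None}, topics
)
print([f["name"] for f in films])
```

Because the data is structured rather than free text, the query can be answered exactly – no keyword matching or ranking heuristics required.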
There are clearly both advantages and disadvantages to a single authoritative source of user generated structured data, and criticisms similar to those leveled at Wikipedia (potential for systemic bias, false information, vandalism and lack of accountability, any of which could make some data unreliable) could be leveled at Freebase. Wikipedia has largely combated these problems through a robust community of Wikipedians; it isn’t clear whether Freebase has yet developed a similar protective community.
All four of these approaches offer the opportunity to substantially improve the usefulness of user generated content, and data in general, by adding elements of structure. They are without doubt a meaningful improvement. However, I’m not sure that any of these will offer a universal solution. They all suffer from the same problem that user generated anything suffers from – inertia.
A body at rest needs some external force to put it into motion. Similarly, a user needs some motivation to contribute structure. Some people do it out of altruism, others to earn the respect of a community, others for recognition, and yet others for personal gain. Unless these motivations are well understood and either designed into the system or adjusted for, there is a risk that the user generated structure will be either sparse or unreliable.
I would love to hear readers’ thoughts on these four models of user generated structure, and on other models as well.
I’ll talk about other ways to generate structure in subsequent posts. [Update: thoughts on inferring structure from domain knowledge now up.]