
I’ve been thinking about how the explosion of user generated content that has characterized web 2.0 can be made more useful by the addition of structure, i.e. meaning = data + structure.

The obvious way that structure can be added to user generated content is by asking users to do it – user generated structure.

There are at least four ways that I can think of to get at user generated structure.

Tagging is the first approach, and its use has been endemic to web 2.0. Sometimes the tagging is limited to the author of the content, and other times any user can add tags to create a folksonomy. Most if not all social media companies employ some form of tagging, including Flickr for photos, Stylehive for fashion and furniture (Stylehive is a Lightspeed company) and Kongregate for games. Tagging is a great first step, but it has well known limitations, as Wikipedia notes:

Folksonomy is frequently criticized because of its lack of terminological control that it seems to be more likely to produce unreliable and inconsistent results. If tags are freely chosen instead of taken from a given vocabulary, synonyms (multiple tags for the same concept), homonymy (same tag used with different meaning), and polysemy (same tag with multiple related meanings) the efficiency of indexing and searching of content is lower.[3] Other reasons for inaccurate or irrelevant tags (also called meta noise) are the lack of stemming (normalization of word inflections) and the heterogeneity of users and contexts.
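
To make a couple of these limitations concrete, here is a minimal Python sketch of the kind of cleanup a site might run over freely chosen tags. The synonym table and the crude plural-stripping rule are my own illustrative assumptions, not something any of the sites above is known to use. It only touches synonyms and the lack of stemming; homonymy and polysemy need context, not string cleanup.

```python
import re
from collections import Counter

# Hypothetical synonym table: maps variant tags to one canonical tag.
SYNONYMS = {
    "nyc": "new york",
    "new york city": "new york",
    "photo": "photograph",
}


def naive_stem(word: str) -> str:
    """Strip a trailing plural 's' (a crude stand-in for real stemming)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize_tag(tag: str) -> str:
    """Lowercase, collapse whitespace, stem each word, then map synonyms."""
    words = re.sub(r"\s+", " ", tag.strip().lower()).split(" ")
    stemmed = " ".join(naive_stem(w) for w in words)
    return SYNONYMS.get(stemmed, stemmed)


def count_tags(raw_tags: list[str]) -> Counter:
    """Count tags after normalization so that variants pool together."""
    return Counter(normalize_tag(t) for t in raw_tags)


if __name__ == "__main__":
    print(count_tags(["NYC", "New York", "new york city", "Photos", "photo"]))
```

With this sketch, the five raw tags in the usage example collapse into two canonical tags, which is the sort of consistency a controlled vocabulary would give you up front.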

The second approach is to solicit structured data from users. Examples of sites that do this include wikihow (which breaks down each how to entry into sections such as Introduction, Steps, Tips, Warnings and Things You’ll Need), CitySearch (which asks you for Pros and Cons and for specific ratings on dimensions such as Late Night Dining, Prompt Seating, Service and Suitability for Kids) and Powerreviews (which powers product reviews at partner sites that prompt for Pros, Cons, Best Uses and User Descriptions, including both common responses as check boxes and a freeform text field with autocomplete).

This can be a powerful tool to add structure to data, but as one of the commenters on my last post points out:

From a UGC perspective, site administrators can force structure by requiring every site contribution to have a parent category, or descriptive tags. The problem is that the more obstacles you put in place before content can be submitted, the less participation you are going to get.
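
As a rough illustration of this second approach, here is a Python sketch of a review record with prompted fields like the Pros, Cons and per-dimension ratings described above. The field names, the 1 to 5 scale and the validation rules are assumptions of mine, not the actual schema of CitySearch or Powerreviews.

```python
from dataclasses import dataclass, field

# Hypothetical rating dimensions, loosely modeled on the CitySearch example.
RATING_DIMENSIONS = ("service", "prompt_seating", "late_night_dining")


@dataclass
class Review:
    pros: list[str] = field(default_factory=list)
    cons: list[str] = field(default_factory=list)
    ratings: dict[str, int] = field(default_factory=dict)  # dimension -> 1..5
    comments: str = ""  # freeform text stays optional

    def problems(self) -> list[str]:
        """Return validation problems; an empty list means the review is acceptable."""
        issues = []
        for dim, score in self.ratings.items():
            if dim not in RATING_DIMENSIONS:
                issues.append(f"unknown rating dimension: {dim}")
            elif not 1 <= score <= 5:
                issues.append(f"{dim} rating must be between 1 and 5")
        if not (self.pros or self.cons or self.ratings or self.comments.strip()):
            issues.append("review is empty")
        return issues


if __name__ == "__main__":
    print(Review(pros=["great fries"], ratings={"service": 4}).problems())
```

Every field in the sketch is optional, and only an entirely empty review is rejected. That is a deliberate choice given the commenter's point: each required field is one more obstacle between the user and the submit button.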

The third approach to user generated data is the traditional approach to the Semantic Web. As Alex Iskold notes in ReadWriteWeb:

The core idea is to create the meta data describing the data, which will enable computers to process the meaning of things. Once computers are equipped with semantics, they will be capable of solving complex semantical optimization problems. For example, as John Markoff describes in his article, a computer will be able to instantly return relevant search results if you tell it to find a vacation on a 3K budget.

In order for computers to be able to solve problems like this one, the information on the web needs to be annotated with descriptions and relationships. Basic examples of semantics consist of categorizing an object and its attributes. For example, books fall into a Books category where each object has attributes such as the author, the number of pages and the publication date.

Ideally, each web site creator would use an agreed format to mark up the meaning of each statement made on the page, in a similar way that they mark up the presentation of each element of a webpage in HTML. In a subsequent article, Iskold also notes some of the challenges with a bottom-up approach to building the Semantic Web, which can be summarized at a high level as “it’s too complicated” and “no one wants to do the work”.
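
To show what such annotations buy you, here is a small Python sketch that stores the book example as subject/predicate/value statements and answers a constrained query over them. The identifiers, attribute names and values are all hypothetical, and this is a toy stand-in for the idea, not RDF or any real Semantic Web format.

```python
from typing import Callable, NamedTuple, Union


class Triple(NamedTuple):
    subject: str
    predicate: str
    value: Union[str, int]


# Hypothetical annotations; the identifiers and values are illustrative only.
TRIPLES = [
    Triple("book:1", "type", "Book"),
    Triple("book:1", "author", "Author A"),
    Triple("book:1", "pages", 250),
    Triple("book:2", "type", "Book"),
    Triple("book:2", "author", "Author B"),
    Triple("book:2", "pages", 480),
    Triple("film:1", "type", "Film"),
    Triple("film:1", "minutes", 120),
]


def describe(triples: list[Triple], subject: str) -> dict:
    """Collect every attribute recorded for one subject."""
    return {t.predicate: t.value for t in triples if t.subject == subject}


def find(triples: list[Triple], type_name: str, predicate: str,
         test: Callable[[Union[str, int]], bool]) -> list[str]:
    """Return subjects of the given type whose attribute passes the test."""
    typed = {t.subject for t in triples
             if t.predicate == "type" and t.value == type_name}
    return sorted(t.subject for t in triples
                  if t.subject in typed and t.predicate == predicate and test(t.value))


if __name__ == "__main__":
    # Books under 300 pages, analogous to "a vacation on a 3K budget".
    print(find(TRIPLES, "Book", "pages", lambda n: n < 300))
```

The query only works because someone labeled each value with what it means. That labeling is exactly the work that, per Iskold, no one wants to do.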

The fourth approach to user generated structure is to build a central authority of meaning. Metaweb appears to be trying to do this with Freebase, a sort of “Wikipedia for structured data” which describes itself as follows:

Freebase is an open database of the world’s information. It’s built by the community and for the community – free for anyone to query, contribute to, build applications on top of, or integrate into their websites.

Already, Freebase covers millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available via an open API. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.

By structuring the world’s data in this manner, the Freebase community is creating a global resource that will one day allow people and machines everywhere to access information far more easily and quickly than they can today.

My friend Thomas Layton recently took the CEO job at Metaweb.

There are clearly both advantages and disadvantages with a single authoritative source of user generated structured data; and criticisms similar to those leveled at Wikipedia (potential for systemic bias, false information, vandalism and lack of accountability could cause some data to be unreliable) could be leveled at Freebase. Wikipedia has combated these problems largely successfully through a robust community of Wikipedians – it isn’t clear if Freebase has yet developed a similar protective community.

Conclusion

All four of these approaches offer the opportunity to substantially improve the usefulness of user generated content, and data in general, by adding elements of structure. They are without doubt a meaningful improvement. However, I’m not sure that any of these will offer a universal solution. They all suffer from the same problem that user generated anything suffers from – inertia.

A body at rest needs some external force to put it into motion. Similarly, a user needs some motivation to contribute structure. Some people do it out of altruism, others to earn the respect of a community, others for recognition, and yet others for personal gain. Unless these motivations are well understood and either designed into the system, or adjusted for, there is a risk that the user generated structure could be either sparse or unreliable.

I would love to hear readers’ thoughts on these four models of user generated structure, and on other models of user generated structure.

I’ll talk about other ways to generate structure in subsequent posts. [Update: thoughts on inferring structure from domain knowledge now up.]

Update: Comments well worth reading, as well as the follow up guest post

  • http://www.mkbergman.com Mike Bergman

    I share your enthusiasm for the structured Web (http://www.mkbergman.com/?cat=23) and applaud your recent posts to bring it greater attention. I also think it is helpful to describe the four approaches to ‘user generated structure.’

    But I question whether UGS is really in the same mold as user-generated content or even user-generated data. Besides making for an ugly acronym, UGS (“ughs”) will never be of interest or of value to users in the same way that content is. Face it, alone, structure is simply not sexy nor identifiable to a given user.

    The real approach — the fifth approach — is to mine the value of the massive amounts of structured data (granted, mostly semi-structured data) that already exists. And much of that structure is from Web 1.0 sites, as well.

    Structure is the root crossing the trail that we stumble over at night while gazing up at the stars. It is here, right now, and is in our metadata, our tables, our HTML, our tags, our content, our DOMs and our links. While user-generated structure is welcomed, and we will see rapid growth in linked structured data from publishers as has occurred from the Linked Open Data movement in the past few months, the trick is to figure out how to gain value from the structure already at hand.

    The fifth way, then, will not be users generating anything special for structure at all. It will be background tools and extractors that grab the valuable structure already locked in our documents to connect with the linked data value growing daily across the Web.

    Methinks the issue is not generating structure, but freeing and linking what already exists.

  • Nik

    Jeremy,

    Great job trying to figure this out.

    I think that anything “user generated” necessarily also has to mean “non-coordinated”. The whole confusion is because there are a bunch of folks acting independently, which moves away from structure (i.e. the old taxonomy or library index approaches) and towards confusion (or conversely better outcomes, i.e. wisdom of the crowds). A Godfather movie could imply Marlon Brando or Oscar movie or both. Every user acting independently leads to confusion about labelling and the rest of the community COULD be worse off.

    My thoughts on the approach:
    1) Tagging is useful but not really a great value addition. Tagging works best for data that does not have a whole lot of text for “auto generating” tags, e.g. it works great for links (Del.icio.us), images (Flickr) and videos (YouTube). But does a tag really add that much value when the content is all text (e.g. a blog post)? Shouldn’t we be able to figure out the tag based on the content (i.e. similar to jiglu)?

    2) Structured Data- This I think has by far the greatest promise from a user generated perspective, since the system is shepherding users to act in a co-ordinated way, i.e. thou shalt give your comments/responses in only A.B.C format.

    3) Bottom Up- No need to say anything more on this. Alex has covered the issues of who does the work, spam etc. and why the “semantic web” promise seems to be always 2 years away.

    4) A central authority of meaning to me implies more of a “domain knowledge” approach highlighted in your previous post. You are trusting the central authority (based on a trusted community) to give meaning. You are essentially depending on the community to show a difference between Manhattan, NY or Manhattan, Kansas.

  • http://kalio.net vruz

    more like…. User-Generated Semantically Structured Content…

    which is the same as saying:

    Meaningful User-generated Content. “MUG”

    sounds a lot more… meaningful to me. :-)

  • http://FacebookEconomy.com Mark

    interesting post

  • http://dannyayers.com Danny
  • Mike Veytsel

    I think that the best approach will be one that employs the best facets of the other approaches. Here is my recipe:

    1) There are already many major hubs of fairly reliable structured data, openly available through APIs. Create a user tagging system that bypasses the ambiguity and redundancy of folksonomies by using only existing, structured data objects as endpoints for tag associations.

    2) Use a centralized system to crossmap tags between structured objects on various data hubs. The issues of systemic bias, false information, and vandalism can be resolved much more easily than in a system like Wikipedia, since one of the major benefits of a tagging system is that, unlike entire articles, tags are atomic in nature, so their accuracy can be measured by weighting. By weighting I mean that, just like in del.icio.us, the more users who’ve tagged a keyword to a site, the more visible and reliable it is.

    3) User inertia is by far the hardest problem to overcome, because a worthwhile system will need user scale to be useful. A good system would attract users and scale by starting with tag associations which have innate incentives for the end user. Facebook already does this, quite successfully, with photo tagging of friends. Users don’t consider this work or altruism, but a natural extension of how they use the app. Likewise, delicious users tag sites as a means of organizing their online life. These types of actions are what Jakob Nielsen calls ‘participation as a side effect’ in his interesting article about user participation: http://www.useit.com/alertbox/participation_inequality.html

    The same guiding principles could be applied to a more universal system.

    Also, another huge advantage of a tagging system is that a) users are largely already familiar with tagging (according to a recent Pew Internet study), and b) tagging is a microtransaction that requires little effort on the user’s part (as opposed to, say, adding to a Wikipedia article, or even filling in the structured data for an object).

    The bottom-up, high level approach is only as complex for users as the system makes it. I think that at this stage, collective intelligence is still smarter (even if slower) than artificial intelligence. We just need a way to harness it usefully and effectively.

  • http://keyingredient.com David Goodman

    We’ve come at the structured data problem by offering users value in return for their effort. We’ve targeted recipe as our social object .. don’t leave me now, recipes aren’t just for mom!

    The idea is structuring the recipe and offering a suite of value-added services in return. Example: once the recipe is entered into Key Ingredient, you can blog it with a widget. The widget allows readers to save the recipe to their own collection, email it, print it, etc. But they can also assemble a collection of recipes and buy a print-on-demand cookbook.

    The nice part of the structure is that we can parse out metadata (like ingredients, servings) to offer more services like nutritional estimates and recipe scaling on-the-fly. So I agree with Mike that further meaning can be implied from the basic structure of data in “unstructured” form.

    Will users be bothered to structure their data? They will structure if the quid pro quo is there. Look at the success of photo scanning services. That is a huge investment, but literally millions of images have crossed that barrier into digital form. People have paid for it, and that says a lot today. Once digital, Shutterfly and the like offer the services that fulfill the promises that made the transition worthwhile.

    Our challenge is to add value beyond the obvious, low-hanging fruit and make the transition into structured data worth the hassle.

  • http://500hats.typepad.com/ Dave McClure

    really excellent post jeremy.

    i wrote about some related ideas a few months ago for the O’Reilly Release 2.0 report on prediction markets, re: mining / structuring user-generated content for creating prediction markets in vertical communities.

    lots of interesting potential around creating structure from user data, particularly where the domain of conversation can be used to filter / define specific datatypes & ratings / tags / reviews.

    – dave mc

  • http://www.sexywidget.com lawrence

    I think I’d break out the structuring of UGC as follows:

    The Users apply structure – add tags, annotate, drop content into the appropriate buckets, etc. This involves a heck of a lot of education (anybody else ever try to teach their users to SEO their content?), tapping into selfish interest (Delicious, Squidoo), and balancing the need for structure with the need to keep obstacles for participation low. Scales well, but tough to maintain consistency / quality.

    The Site Administrators provide structure – aka, brute force, put those interns to work, aka the Mahalo, Yahoo! Directory, model. Preferably managed by complementing what’s done by the users. Doesn’t scale, but quality / consistency control is more enforceable.

    The “System” provides the structure automatically based on implicit clues in the content, where the content is coming from, who’s linking to the content, etc. (what Mike says, and what Google is already doing in terms of structuring web site content into a search index).

    Of these, to me, only the third method is next generation stuff.

  • http://www-webhosting.cz Jan Horna

    How about microformats (http://microformats.org/wiki/hreview)? Does not this solve the problem of meta data structure?

  • http://kickstand.typepad.com Jordan Mitchell

    Yesterday you spoke of three sources for structure: UGC, inferred from knowledge of domain, and inferred from user behavior. I actually kinda think of it more simply: explicit and implicit.

    So much of the structure today is explicit, whereas I think the bigger opportunities lie in implicit — that way, no one has to do all kinds of “work” and we don’t suffer the effects of participation inequality (where <1% of the population is driving). Maybe we’re already “voting” with our attention and “structuring” with our personal attributes (location, interests, behaviors, etc.) — layered on top of existing content attributes, of course.

  • fewquid

    A few thoughts…

    a) IDC reckon that data is growing at a compound annual rate of 56%. Most of that growth is in unstructured data.

    b) In a corporate environment 85% or more of the data is unstructured. I don’t have numbers for the consumer world, but I’d guess it is even more skewed to unstructured data.

    So my first point: In general, isn’t it clearly a losing battle to make the fastest growing majority of data try and behave like a shrinking minority??

    Second point: the idea of a central repository of “meaning” is absurd, even at its most general. “Meaning” is highly personal. On almost any given topic, what has “meaning” for me will have no meaning for you (and vice versa). Meaning is fundamentally about personal relevance. The reason today’s search engines can be frustrating is because they have no concept of personal relevance (and because they mostly don’t consider meta-data).

    Third point: the traditional concepts of structure really only work when data is relatively sparse. When it becomes superabundant, rigid “classical” ideas of structure break down. The concept of structure needs to evolve into something new, user driven and fundamentally transient.

    Last point: There’s a ton of evolutionary potential in tagging that folks are barely scratching the surface of…

    As you can hopefully tell, I think about this stuff a great deal (it’s my day job)…

  • http://alexiskold.wordpress.com/ alexiskold

    Hi Jeremy,

    This is a well researched and thought through post! Here is my take on each of these:

    1) Tagging is an awesome way of adding light structure/semantics on top of the content. The pure fact that it is engaging, well-spread and people do it is a huge plus for it.

    The problem with tagging is that it is not precise. The tag is in the eye of the beholder and a collective set of tags may not amount to a consensus. Another problem is better seen through an example. If you tag the book “Road” like this: [book], [cormac mccarthy], [science fiction], the tags do not reflect the structure of the object. The thing is that book defines the object type, cormac mccarthy is the author and science fiction is the genre. For an algorithm this is hugely important information that gets lost.

    2) Soliciting structure from the users is really about art of the interface. People hate forms. People hate long forms even more. An interface which engages the users and drives her to reveal the structure over time is the one that is likely to succeed.

    3) I said a ton already on the bottom up annotation approach. It could work in theory but is hard in practice on the web-wide scale. If anything, I would bet on microformats as a lighter and simpler approach.

    4) The silos are not the answer, because squeezing the web into 1 site is not possible. It’s too rich. We look for a web-wide solution, that’s what we really need.

    Will see how things unfold. I look forward to your next posts.

    Alex

  • http://www.kango.com Yen

    Jeremy, excellent job summarizing and explaining the different approaches. As you and a number of your readers point out, organic approaches to creating a critical mass of usable, structured data are challenging.

    At Kango, we believe it has to be approach four, and have invested to create an ontology to focus our efforts analyzing unstructured content (e.g. reviews and articles) & data (e.g. product attributes and location) to derive weighted tags (e.g. 80th percentile for kid-friendly) for travel products. The explosion of traveler-generated opinions (blogs, ratings, reviews, journals, articles, trip plans…) has been a gold mine for deriving those weighted tags, and enables consumers to search based on both subjective and objective criteria. For example, you can search for romantic hotels and activities in Monterey and get a different set of results than if you search for kid-friendly hotels and activities.

    Look forward to hearing about how other ventures are generating structured data.

  • gerel

    Well, I think you already gave the problem and the solution in this quote:

    ##
    From a UGC perspective, site administrators can force structure by requiring every site contribution to have a parent category, or descriptive tags. The problem is that the more obstacles you put in place before content can be submitted, the less participation you are going to get.
    ###

    You see ?, If people don’t like/want to give more structure to what they say, it’s likely they’re not saying anything worth hearing.
    They don’t really have an argument then. And if they do have an argument, please think again before writing it !!!

    The reality is that we have millions of trolls hanging around.

    cheers.

  • http://innonate.com/ Nate Westheimer

    There are some excellent comments here!

    The only thing I could add, from my self-interested side of the world, is that the concept of UGC being separated from UG-meta-data comes from the fact that blogs and wikis are the only mainstream publishing tools.

    BricaBox will be one way folks can start to use the “right tool for the job.” When content is better published with structure (not always the case!) we hope they use our platform.

    But advertisement aside, the key is using the right publishing tool for the job. It’s a lot simpler than using the wrong tool and then trying to use underdeveloped semantic technologies to clean up the mess.

  • http://www.smoothspan.com/index.html smoothspan

    I love the idea of users contributing structure. There are so many ways users can help. A completely different example is having users contribute accuracy for business data:

    http://smoothspan.wordpress.com/2007/10/29/user-contributed-data-auditing/

  • http://www.intelcapital.com Eghosa

    Jeremy, you probably should come attend my VC Taskforce panel on Semantic Search & Discovery coming up next week (Nov 6).
    We will touch upon some of the issues you raised.

    Cheers, Eghosa

  • amisare

    There are two main considerations when setting up User Generated Structures for UGC’s:

    1. How to group key data ie Categorisation Rules
    2. Who sets up the Rules/Compliance ie Categorisation Control

    Rules can be:

    • Discrete: logical; either/or; black/white; pros/cons
    • Non-discrete or Analogue: fuzzy; gaps & overlap ;shades of gray; relationships

    Control can be private/local or communal/central.

    Thus Jeremy’s Four Ways (Approaches) may be approximately fitted into the following matrix which is formed by combining the above two main considerations:

    Control \ Categorisation | Non-discrete               | Discrete
    -------------------------+----------------------------+-------------------------------------
    Private/Local            | 1st Approach: Tagging      | 2nd Approach: Wikihow; Powerreviews
    Communal/Central         | 3rd Approach: Semantic web | 4th Approach: Metaweb; Freebase

    The 4th approach requires hard work but may provide good search results.

  • http://bentrem.sycks.net bentrem

    Something I’m waiting for is to find “structuration” in context of an article like this. (I can’t expect “tensegrity” … something just too too precious about that one.)

    What I’ve been trying to say about conventional forum flow (hard to be critical without being read as cynical or put-downie or sour-grapes) is that it imposes a very primitive structure. Typically, a far too sweeping subject followed by responses of all sorts arrayed by nothing more than chronology. (Threaded systems like LiveJournal introduce a sophistication that can be very beneficial. “LiquidThreads” for MediaWiki goes some distance in improving that platform.)

    My own project is, well, I’m tempted to say “orthogonal” to these, and to those you’ve mentioned … something like comparing ToC and FootNotes.

    –bentrem
