My post claiming that Meaning = Data + Structure, and the follow-up post exploring how User Generated Structure is one way that structure can be added to data, have generated some great comments from readers. Additionally, a knowledgeable friend, Peter Moore, sent me an email critique so thoughtful that I asked his permission to turn it into a guest post, which you’ll find below.
To rephrase your topic at the risk of going afield, perhaps what one wants is computer assistance in answering questions, including search queries and browsing operations, that require semantic understanding rather than just filtering by keywords and numerical operators.
For example, if you expect growing US concern about high-carb diets to produce material long-term declines in US consumption of things that make high-carb diets, you might wonder whether that might influence US imports of sugar, and thereby whether it might impact the exchange rate between the Brazilian real and US dollar. You might ask, “What percent of the value of Brazil’s exports to the US are things that go into making high-carb diets? What other products consumed in the US might those Brazilian things go into making instead of high-carb diets? What might drive changes in US consumption of those other things?”
I think that your four approaches are a good way to explore if and how users might add structure sufficient to help us obtain some computer assistance with questions like this, especially where the data being queried is generated by a community of users.
Approach 2: Solicit Structured Data From Users
Perhaps you should have addressed this first, because it seems to me that these efforts by websites to provide templates for their users to fill in are really just structured databases. When users fill in these templates, their entries might go directly into fields in a database. These fields have semantics, so these sites can answer questions with some semantics, but only using the site’s internal knowledge base and only for pre-defined types of questions. The problem is not just the tendency of structure requirements to discourage participation but, more fundamentally, the application’s inability to use information from other sources that didn’t require this same structure, or to answer other kinds of questions for which this particular structure is not productive.
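To make the "template is just a database" point concrete, here is a minimal sketch in Python using an in-memory SQLite table. The review site, its field names, and the sample rows are all hypothetical, invented only for illustration.

```python
import sqlite3

# Hypothetical template for a restaurant-review site. The table and field
# names are illustrative assumptions, not taken from any real site.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE review (restaurant TEXT, cuisine TEXT, city TEXT, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO review VALUES (?, ?, ?, ?)",
    [
        ("Casa Luna", "Mexican", "Austin", 5),
        ("Noodle Bar", "Thai", "Austin", 4),
        ("Taqueria Sol", "Mexican", "Austin", 3),
    ],
)

def top_rated(cuisine, city):
    """A pre-defined question that the template's fields can answer."""
    rows = conn.execute(
        "SELECT restaurant FROM review WHERE cuisine = ? AND city = ? "
        "ORDER BY rating DESC",
        (cuisine, city),
    )
    return [name for (name,) in rows]

print(top_rated("Mexican", "Austin"))
```

The fields carry meaning, so the site can answer "best Mexican food in Austin," but a question the schema did not anticipate, or data from a site with different fields, has nowhere to go.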
Approach 1: Tagging
The first approach will rarely satisfy this desire, however. Tagging might only work for the knowledge base addressed by a single person’s tags, and probably not even that, because many of us don’t tag things consistently even with ourselves. So as you describe in your blog, a question might trip on hypernyms or polynyms, and it certainly won’t understand that a bottle is a kind of container, and so is a can, much less that bottles are typically made of glass or plastic and cans are typically made of metal, so it will miss opportunities to use that kind of semantic understanding to answer questions.
Approach 3: Traditional Approach to the Semantic Web
The third approach is more promising. But unfortunately, it’s not as easy as suggested in your blog by Mike Veytsel, who said:
1) There are already many major hubs of fairly reliable structured data, openly available through APIs. Create a user tagging system that bypasses the ambiguity and redundancy of folksonomies by using only existing, structured data objects as endpoints for tag associations.
2) Use a centralized system to crossmap tags between structured objects on various data hubs. The issues of systemic bias, false information, and vandalism can be resolved much more easily than in a system like Wikipedia, since one of the major benefits of a tagging system is that, unlike entire articles, tags are atomic in nature, so their accuracy can be measured by weighting. By weightings I mean that, just like in del.icio.us, the more users who’ve tagged a keyword to a site, the more visible and reliable it is.
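The weighting idea in the second point of that proposal might be sketched as follows. This is only one possible reading: the event format, the object identifiers, and the choice to weight a tag by its share of distinct users are my assumptions, not anything del.icio.us or the commenter specified.

```python
from collections import Counter

# Hypothetical tagging events: (user, structured object id, tag).
events = [
    ("ann", "obj:sugar", "commodity"),
    ("bob", "obj:sugar", "commodity"),
    ("cat", "obj:sugar", "commodity"),
    ("dan", "obj:sugar", "spam"),
    ("ann", "obj:sugar", "commodity"),  # repeat by the same user, counted once
]

def tag_weights(events, obj):
    """Return each tag's share of the distinct users who tagged obj."""
    distinct = {(user, tag) for user, o, tag in events if o == obj}
    counts = Counter(tag for _, tag in distinct)
    users = {user for user, _ in distinct}
    return {tag: n / len(users) for tag, n in counts.items()}

weights = tag_weights(events, "obj:sugar")
print(weights)
```

A lone vandal's tag gets a low weight, which is the appeal of the scheme. Note that it only scores tags on one object and says nothing yet about mapping that object to its counterpart on another data hub.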
The problem is that the structured data sources do each use different structures, and mapping between them is very hard. To consider why, let’s first cover some “semantics”:
A “flat list,” such as a list of tags, suffers from all of the issues that you outline in your blog. Many terms may be the same for certain purposes, but without additional information, a computer will not recognize that similarity.
One can provide some of the requisite additional information by organizing a flat list into a “taxonomy,” which organizes entities via “subsumption”: This “is a kind of” that. But even armed with this, a computer will fail to see additional connections between entities, such as this “is a part of” that, or this “causes” that, etc.
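As a small illustration of what subsumption buys and what it leaves out, here is a sketch of a taxonomy as a child-to-parent map, reusing the bottle and can example from above. The terms and the helper names (`IS_A`, `ancestors`, `is_kind_of`) are hypothetical.

```python
# Hypothetical "is a kind of" links: child -> parent.
IS_A = {
    "bottle": "container",
    "can": "container",
    "wine bottle": "bottle",
    "container": "artifact",
}

def ancestors(term):
    """Walk the subsumption chain upward from term."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def is_kind_of(term, category):
    """True if term equals category or is subsumed by it."""
    return term == category or category in ancestors(term)

print(is_kind_of("wine bottle", "container"))  # subsumption is transitive
print(is_kind_of("bottle", "glass"))           # "made of" is not expressible here
```

The first query succeeds because the chain can be followed upward. The second fails not because the answer is no, but because a pure taxonomy has no way to state what a bottle is made of.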
One can arm a computer to consider these relations too, transforming a taxonomy into an “ontology,” which uses these additional relations to describe more about a topic. RDF-S is a computer-readable format for describing not only subsumption but also other relations. But it turns out that even when one can describe things with numerous relations, when one tries to move from a small ontology focused on one “domain” to a larger ontology that subsumes more than one “domain,” it becomes very difficult to maintain consistency between items in one area and those in another, albeit related, area.
For example, one may author a nice categorization of “products,” including medical treatments by both their mechanisms of action, such as “beta-blocker” or “implantable defibrillator”, and by the medical conditions that each is sold to treat, such as “congestive heart failure” or “cardiac arrest”, in various regulatory jurisdictions. But then one might also want to create a list of companies and organize it by products sold by those companies, such as “Cardiology company” or “Medical device company” or “Cardiology medical device company.” Even with RDF-S, one cannot re-use the ontology of products to organize the ontology of companies, so one will need to create a redundant list of companies, not only creating redundant work but also risking inconsistency, especially if one evolves these ontologies over time.
Furthermore, one might also want to create a list of securities, organized by position in capital structure, such as common stock or corporate bond, and by products sold by the issuing companies. So there again one may have another need to re-use the company ontology, which should have already re-used the product ontology. This re-use and nested re-use is a common need.
One can address this need by using Description Logics to create expressions that define some entities in an ontology in terms of other elements in the ontology, such as “Common stock AND issuedBy (Company AND sells (Medical device AND treats Cardiology disease))”. Description Logics are languages that come in many flavors and that have been developed in academia for decades. They matter now more than ever, because in 2004, the W3C “recommended” the Web Ontology Language (OWL), which starts with RDF-S and adds Description Logics elements to enable construction of such expressions, all in a machine-readable format that is encoded in XML. But this is no good unless a computer can use the expressions that it reads to infer from them knowledge that was asserted for the elements from which the expressions were built.
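To show how such a nested expression re-uses the product and company definitions, here is a toy sketch that evaluates the example expression against a handful of asserted facts. It is not a Description Logics reasoner and does not use OWL: it only checks explicitly listed facts, and every entity, class assignment, and helper name (`named`, `both`, `some`) is a hypothetical stand-in.

```python
# Asserted facts (all hypothetical): class membership and relations.
TYPES = {
    "ACME common": {"Common stock"},
    "ACME bond": {"Corporate bond"},
    "ACME": {"Company"},
    "PulseGuard": {"Medical device"},
    "arrhythmia": {"Cardiology disease"},
    "SnackCo common": {"Common stock"},
    "SnackCo": {"Company"},
    "SugarPop": {"Food product"},
}
RELATIONS = {
    ("ACME common", "issuedBy"): {"ACME"},
    ("ACME bond", "issuedBy"): {"ACME"},
    ("ACME", "sells"): {"PulseGuard"},
    ("PulseGuard", "treats"): {"arrhythmia"},
    ("SnackCo common", "issuedBy"): {"SnackCo"},
    ("SnackCo", "sells"): {"SugarPop"},
}

def named(cls):
    """Concept: x is asserted to belong to class cls."""
    return lambda x: cls in TYPES.get(x, set())

def both(a, b):
    """Concept: conjunction (the AND in the expression)."""
    return lambda x: a(x) and b(x)

def some(rel, concept):
    """Concept: x has at least one rel-filler that satisfies concept."""
    return lambda x: any(concept(y) for y in RELATIONS.get((x, rel), set()))

# Common stock AND issuedBy (Company AND sells (Medical device AND treats Cardiology disease))
cardio_device_stock = both(
    named("Common stock"),
    some("issuedBy", both(
        named("Company"),
        some("sells", both(
            named("Medical device"),
            some("treats", named("Cardiology disease")),
        )),
    )),
)

matches = sorted(x for x in TYPES if cardio_device_stock(x))
print(matches)
```

The security category is built from the company category, which is built from the product category, so none of them has to be listed twice. A real reasoner goes further than this lookup, for instance by classifying the expressions themselves and checking them for consistency.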
That’s what one can do with Description Logics Inference Engines (or “DL Reasoners”), which can use these DL expressions to check consistency of an ontology and infer additional knowledge. Unfortunately, in many circumstances this inference can be computationally challenging, if not intractable, and the field stresses a trade-off between “expressivity” and “efficiency.” That’s why the Description Logics community has developed numerous flavors of languages. If one limits oneself to simple expressions in Description Logics, then one has a chance of answering questions that require some semantic understanding without waiting too long.
I think that long detour is necessary to understand the infrastructure that the (mostly academic and government-sponsored until recently) “semantic web community” has felt it necessary to build to enable curation of ontologies that describe wide-ranging topics, enabling some semantically-aware question assistance from computers outside the domain of a single, narrowly-focused site.
Having established some of this infrastructure, the “semantic web community” is indeed excited about Approach 3, as you suggest in your blog. Among this crowd, the mantra is that the 2004 standardization of OWL as the lingua franca for ontology authoring will enable bottom-up authoring, trading, and reuse of ontologies. This is appealingly democratic and webby, and Swoogle has become a popular place to search for ontologies created by others in the OWL format.
Hopefully, if each content creator can structure his content with ontologies that are reused by others, there may be greater potential for reasoning over the resulting “chimeric” ontology to produce answers that consider semantics. But big challenges remain for Approach 3, because it remains very difficult to take multiple separate ontologies and merge them into one that is consistent and meaningful for question-answering. The DL Inference Engines can help check consistency of a merged ontology, and the field has recently focused on layering onto these Inference Engines additional tools to support mapping concepts from one ontology to another, identifying homonyms and the like. Two separate groups at Stanford have developed two separate tools for this, called Chimaera and PROMPT.
I remain skeptical that this bottom-up aggregation of separate ontologies will produce reasoning robust enough to support the effort involved in merging ontologies. I think that it’s most likely to bear fruit in fairly narrow areas, where one can merge a couple or a few small ontologies to extend one’s reasoning power beyond a single data source to a few data sources. In a way, the merging work may be akin to the next approach, because it asserts a central meaning to broker between the merged ontologies.
Approach 4: Build a Central Authority of Meaning
That may be why some groups continue to try to produce a massive, universal ontology. The Metaweb approach sounds very interesting, partly because it is modest in its reasoning ambitions, so it can pursue a format that does not look like Description Logics and that should feel somewhat familiar to its prospective community of “Wikipedians.” In fact, like most current “semantic web” applications, it relies on a structure like RDF-S rather than aiming for the broad inference supported by Description Logics. I’m frankly somewhat optimistic about that effort. And I’m curious about Radar Networks’ new service Twine, which started invitation-only beta this week.
But in this area, the Cyc project (www.cyc.com) is also very interesting. It was started by Stanford professor Doug Lenat in the 1980s, and he moved to Texas on the invitation of the US military to fund a team big enough to build a comprehensive ontology in only . . . about 20 years (so far). In a speech at Google recently, Lenat said that the Cyc team has been “dragged kicking and screaming” into creating over 100 types of relations. That’s in addition to the vast numbers of entities that those relations connect. They appear to have found difficulty in achieving meaningful inferences without being comprehensive, and difficulty in being comprehensive without this “relation creep.”
But the payoff may be big. Cyc claims to have some success in producing answers that do leverage very general semantics.
At Clados, we’re using a combination of approaches 3 and 4 to discover imbalances in supply and demand that will produce great investment opportunities. We can discover more imbalances with the aid of some automated organization and cross-referencing across disparate areas via numerous perspectives. To achieve this, we need a reasonably consistent ontology but, unlike the Cyc people, we don’t seem to need anything close to a complete one. So we are able to sacrifice completeness in favor of consistency by manually curating everything that goes in, even if some of it comes courtesy of others’ ontologies.
I suppose that this approach may work within a community of user-generated content. A team of curators (with centralized coordination, I’m afraid!) can structure some of the content, and users may be happy to just get some of the inferences that are latent in the unstructured content itself, even if the centrally-coordinated curators will never structure everything.
UPDATE: Related post on inferring structure from domain knowledge now up.