Posted: September 20, 2016

Six Large-scale Knowledge Bases Interact to Help Automate Machine Learning Setup

Fred Giasson and I today announced the unveiling of a new venture, Cognonto. We have been working on this venture very hard for at least the past two years. But, frankly, Cognonto represents bringing into focus ideas and latent opportunities that we have been seeing for much, much longer.

The fundamental vision for Cognonto is to organize the information in large-scale knowledge bases so as to efficiently support knowledge-based artificial intelligence (KBAI), a topic I have been writing about much over the past year. Once such a vision is articulated, the threads necessary to bring it to fruition come into view quickly. First, of course, the maximum amount of information possible in the source knowledge bases needs to be made digital and represented with semantic Web technologies such as RDF and OWL. Second, since no source alone is adequate, the contributing knowledge bases need to be connected and made to work with one another in a logical and consistent manner. And, third, an overall schema needs to be put in place that is coherent and geared specifically to knowledge representation and machine learning.

The result of achieving these aims is to greatly lower the time and cost of preparing inputs to machine learning, and to improve its accuracy. This result applies particularly to supervised machine learning for knowledge-related applications. But, if achieved, the resulting rich structure and extensive features also lend themselves to unsupervised and deep learning, and provide a powerful substrate for schema mapping and data interoperability.

Today, we’ve now made sufficient progress on this vision to enable us to release Cognonto, and the KBpedia knowledge structure at its core. Combined with local data and schema, there is much we can do with the system. But another exciting part is that the sky is the limit in terms of honing the structure, growing it, and layering more AI applications upon it. Today, with Cognonto’s release, we begin that process.

Screen Shot of the Entry Point for the Cognonto Demo

You can begin to see the power and the structure yourself via Cognonto’s online demo, as shown above, which showcases a portion of the system’s functionality.

Problem and Opportunity

Artificial intelligence (AI) and machine learning are revolutionizing knowledge systems. Improved algorithms and faster graphics chips have been contributors. But the most important factor in knowledge-based AI’s renaissance, in our opinion, has been the availability of massive digital datasets for the training of machine learners.

Wikipedia and data from search engines are central to recent breakthroughs. Wikipedia is at the heart of Siri, Cortana, the former Freebase, DBpedia, Google’s Knowledge Graph and IBM’s Watson, to name just a prominent few AI question answering systems. Natural language understanding is showing impressive gains across a range of applications. To date, all of these examples have been the result of bespoke efforts. It is very expensive for standard enterprises to leverage these knowledge resources on their own.

Today’s practices pose significant upfront and testing effort. Much latent knowledge remains unexpressed and not easily available to learners; it must be exposed, cleaned and vetted. Further upfront effort needs to be spent on selecting the features (variables) used and then to accurately label the positive and negative training sets. Without “gold standards” — at still more cost — it is difficult to tune and refine the learners. The cost to develop tailored extractors, taggers, categorizers, and natural language processors is simply too high.

So recent breakthroughs demonstrate the promise; now it is time to systematize the process and lower the costs. The insight behind Cognonto is that existing knowledge bases can be staged to automate much of the tedium and reduce the costs now required to set up and train machine learners for knowledge purposes. Cognonto’s mission is to make knowledge-based artificial intelligence (KBAI) cheaper, repeatable, and applicable to enterprise needs.

Cognonto (a portmanteau of ‘cognition’ and ‘ontology’) exploits large-scale knowledge bases and semantic technologies for machine learning, data interoperability and mapping, and fact and entity extraction and tagging. Cognonto puts its insight into practice through a knowledge structure, KBpedia, designed to support AI, and a management framework, the Cognonto Platform, for integrating external data to gain the advantage of KBpedia’s structure. We automate away much of the tedium and reduce costs in many areas, but three of the most important are:

  • Pre-staging labels for entity and relation types, essential for supervised machine learning training sets and reference standards; KBpedia’s structure-rich design is further useful for unsupervised and deep learning;
  • Fine-grained entity and relation type taggers and extractors; and
  • Mapping to external schema to enable integration and interoperability of structured, semi-structured and unstructured data (that is, everything from text to databases).

The KBpedia Knowledge Structure

KBpedia is a computable knowledge structure resulting from the combined mapping of six, large-scale, public knowledge bases — Wikipedia, Wikidata, OpenCyc, GeoNames, DBpedia and UMBEL. The KBpedia structure separately captures entities, attributes, relations and topics. These are classed into a natural and rich diversity of types, with their meaning and relationships logically and coherently organized. This diagram, one example from the online demo, shows the topics captured for the main Cognonto page in relation to the major typologies within KBpedia:

Example Network Graph as One Part of the Demo Results

Each of the six knowledge bases has been mapped and re-expressed into the KBpedia Knowledge Ontology. KKO follows the universal categories and logic of the 19th century American mathematician and philosopher, Charles Sanders Peirce, the subject of my last article. KKO is a computable knowledge graph that supports inference, reasoning, aggregations, restrictions, intersections, and other logical operations. KKO’s logic basis provides a powerful way to represent individual things, classes of things, and how those things may combine or emerge as new knowledge. You can inspect the upper portions of the KKO structure on the Cognonto Web site. Better still, if you have an ontology editor, you can download and inspect the open source KKO directly.

KBpedia contains nearly 40,000 reference concepts (RCs) and about 20 million entities. The combination of these and KBpedia’s structure results in nearly 7 billion logical connections across the system, as these KBpedia statistics (current as of today’s version 1.02 release) show:

Measure                                              Value
No. KBpedia reference concepts (RCs)                38,930
No. mapped vocabularies                                 27
  Core knowledge bases                                   6
  Extended vocabularies                                 21
No. mapped classes                                 138,868
  Core knowledge bases                             137,203
  Extended vocabularies                              1,665
No. typologies (SuperTypes)                             63
  Core entity types                                     33
  Other core types                                       5
  Extended                                              25
Typology assignments                               545,377
No. aspects                                             80
Direct entity assignments                       88,869,780
Inferred entity aspects                        222,455,858
No. unique entities                             19,643,718
Inferred no. of entity mappings              2,772,703,619
Total no. of “triples”                       3,689,849,726
Total no. of inferred and direct assertions  6,482,197,063

First Release KBpedia Statistics

About 85% of the RCs are themselves entity types — that is, 33,000 natural classes of similar entities such as ‘astronauts’ or ‘breakfast cereals’ — which are organized into about 30 “core” typologies that are mostly disjoint (non-overlapping) with one another. KBpedia has extended mappings to a further 20 other vocabularies, including schema.org, Dublin Core, and others; client vocabularies are typical additions. The typologies provide a flexible means for slicing-and-dicing the knowledge structure; the entity types provide the tie-in points to KBpedia’s millions of individual instances (and for your own records). KBpedia is expressed in the semantic Web languages of OWL and RDF. Thus, most W3C standards may be applied against the KBpedia structure, including for linked data, a standard option.

KBpedia is purposefully designed to enable meaningful splits across any of its structural dimensions — concepts, entities, relations, attributes, or events. Any of these splits — or other portions of KBpedia’s rich structure — may be the computable basis for training taggers, extractors or classifiers. Standard NLP and machine learning reference standards and statistics are applied during the parameter-tuning and learning phases. Multiple learners and recognizers may also be combined as different signals to an ensemble approach to overall scoring. Alternatively, KBpedia’s slicing-and-dicing capabilities may drive export routines to use local or third-party ML services under your own control.
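As a rough illustration of how a split of the structure might seed training sets, the Python sketch below takes positive examples from a target entity type and negative examples only from typologies asserted to be disjoint with the target's typology. The typology names, the toy entities, and the `training_sets` function are hypothetical, chosen for illustration; they are not part of KBpedia or the Cognonto Platform.

```python
# Hypothetical toy data: entity -> entity type, and entity type -> typology.
ENTITY_TYPE = {
    "Neil Armstrong": "Astronaut",
    "Sally Ride": "Astronaut",
    "Cheerios": "BreakfastCereal",
    "Corn Flakes": "BreakfastCereal",
    "Lake Geneva": "Lake",
}
TYPE_TYPOLOGY = {
    "Astronaut": "Persons",
    "BreakfastCereal": "FoodDrink",
    "Lake": "NaturalFeatures",
}
# Pairs of typologies asserted to be disjoint (assumed for this sketch).
DISJOINT = {frozenset(("Persons", "FoodDrink"))}


def training_sets(target_type):
    """Return (positives, negatives) for one entity type.

    Positives are instances of the target type. Negatives are instances
    whose typology is asserted disjoint with the target's typology;
    everything else is left out as an unsafe negative.
    """
    target_typology = TYPE_TYPOLOGY[target_type]
    positives, negatives = [], []
    for entity, etype in sorted(ENTITY_TYPE.items()):
        if etype == target_type:
            positives.append(entity)
        elif frozenset((TYPE_TYPOLOGY[etype], target_typology)) in DISJOINT:
            negatives.append(entity)
    return positives, negatives


positives, negatives = training_sets("Astronaut")
print(positives)
print(negatives)
```

In this toy setup "Lake Geneva" is deliberately left out of the negatives, because no disjointness is asserted between its typology and that of the target; only asserted disjointness yields safe negative examples.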

Though usable in a standalone mode, only slices of KBpedia may be applicable to a given problem or domain, which then most often need to be extended with local data and schema. Cognonto has services to incorporate your own domain and business data, critical to fulfill domain purposes and to respond to your specific needs. We transform your external and domain data into KBpedia’s canonical forms for interacting with the overall structure. Such data may include other public databases, but also internal, customer, product, partner, industry, or research information. Data may range from unstructured text in documents to semi-structured tags or metadata to spreadsheets or fully structured databases. The formats of the data may span hundreds of document types to all flavors of spreadsheets and databases.

Platform and Technology

Cognonto’s modular technology is based on Web-oriented architectures. All functionality is exposed via Web services and programmatically in a microservice design. The technology for Cognonto resides in three inter-related areas:

  • Cognonto Platform – the technology for storing, accessing, mapping, visualizing, querying, managing, analyzing, tagging, reporting and machine learning using KBpedia;
  • KBpedia Structure – the central knowledge structure of organized and mapped knowledge bases and their millions of instances; and
  • Build Infrastructure – repeatable and modifiable build and coherence and consistency testing scripts, including reference standards.

The Cognonto Web services may be manipulated directly from the command line or via cURL calls, or by simple HTML interfaces, by SPARQL, or programmatically. The Web services are written in Clojure and follow literate programming practices.

The base KBpedia knowledge graph may be explored interactively across billions of combinations with sample exports of its content. Here is the example for automobile.

Example Screen Shot for a Portion of the Knowledge Graph Results

There is a lot going on, with many results panels and links throughout the structure. There is a ‘How to’ for the knowledge graph if you really want to get your hands dirty.

These platform, technology, and knowledge structure capabilities combine to enable us to offer services across the full spectrum of KBAI applications, including:

Cognonto is a foundation for doing serious knowledge-based artificial intelligence.

Today and Tomorrow

Despite the years we have been working on this, it very much feels like we are at the beginning. There is so much more that can be done.

First, we need to continue to wring out errors and mis-assignments in the structure. We estimate an error rate of 1-2% currently, but that still represents millions of potential errors. The objective is not to be more accurate than alternatives, which we already are, but to be the most effective foundation possible for training machine learners. Further cleaning will result in still better standards and mappings. Throughout the interactive knowledge graph we have a button for submitting errors; please do submit any problems you see!

Second, we are seeing the value of exposing structure, and the need to keep doing so. Each iteration of structure gets easier, because prior ones may be applied to automate much of the testing and vetting effort for the subsequent ones. Structure provides the raw feature (variable) grist used by machine learners. We have a very long punch list of where we can effectively add more structure to KBpedia.

And, last, we need to extend the mappings to more knowledge bases, more vocabularies, and more schema. This kind of integration is really what smooths the way to data integration and interoperability. Virtually every problem and circumstance requires including local and external information.

We know there are many important uses — and an upside of potential — for codifying knowledge bases for AI and machine learning purposes. Drop me a line if you’d like to discuss how we can help you leverage your own domain and business data using knowledge-based AI.

Posted: September 21, 2015

Practical and Reusable Designs to Make Knowledge Bases Computable

Wikipedia is a common denominator in question answering and commercial natural language applications that leverage artificial intelligence. Witness Siri, Watson, Cortana and Google Now, among others. DBpedia is a structured data representation of Wikipedia that makes much of this content machine readable. Wikidata is a multilingual repository of 18 million structured entities now feeding the Wikipedia ecosystem. The availability of these sources is remaking and accelerating the role of knowledge bases in powering the next generation of artificial intelligence applications. But much, much more is possible.

All of these noted knowledge bases lack a comprehensive and coherent knowledge structure. They are not computable, nor able to be reasoned over or inferenced. While they are valuable resources for structured data and content, the vast potential in these storehouses remains locked up. Yet the relevance of these sources to drive an artificial intelligence platform geared to data and content is profound.

And what makes this potential profound? Well, properly structured, knowledge bases can provide the features and generation of positive and negative training sets useful to machine learning. Coherent organization of the knowledge graph within the KB’s domain enables various forms of reasoning and inference, further useful to making fine-grained recognizers, extractors and classifiers applicable to external knowledge. As I have pointed out before with regard to knowledge-based artificial intelligence (or KBAI) [1], these capabilities can work to extract still more accurate structure and knowledge from the knowledge base, creating a virtuous circle of still further improvements.

In all fairness, the Wikipedia ecosystem was not designed to be a computable one. But the free and open access to content in the Wikipedia ecosystem has sparked an explosion of academic and commercial interest in using this knowledge, often in DBpedia machine-readable form. Yet, despite this interest and more than 500 research papers in areas leveraging Wikipedia for AI and NLP purposes [2], the efforts remain piecemeal and unconnected. Yes, there is valuable structure and content within the knowledge bases; yes, they are being exploited both for high-value bespoke applications and for small research projects; but, across the board, these sources are not being used or leveraged in anything approaching a systematic nature. Each distinct project requires anew its own preparations and staging.

And it is not only Wikipedia that is neglected as a general resource for AI and semantic technology applications. One is hard-pressed to identify any large-scale knowledge base, available in electronic form, that is being sufficiently and systematically exploited for AI or semantic technology purposes [3]. This gap is really rather perplexing. Why the huge disconnect between potential and reality? Could this gap also somehow be related to why the semantic technology community continues to bemoan the lack of “killer apps” in the space? Is there something possibly even more fundamental going on here?

I think there is.

We have spent eight years so far on the development and refinement of UMBEL [4]. It was intended initially to be a bridge between unruly Web content and reasoning capabilities in Cyc to enable information interoperability on the Web [5]; an objective it still retains. Naturally, Wikipedia was the first target for mapping to UMBEL [6]. Through our stumbling and bumbling and just serendipity, we have learned much about the specifics of Wikipedia [6], aspects of knowledge bases in general, and the interface of these resources to semantic technologies and artificial intelligence. The potential marriage between Cyc, UMBEL and Wikipedia has emerged as a first-class objective in its own right.

What we have learned is that it is not any single thing, but multiple things, that are preventing knowledge bases from living up to their potential as resources for artificial intelligence. As I trace some of the sources of our learning below, note that it is a combination of conceptual issues, architectural issues, and terminological issues that needs to be overcome in order to see our way to a simpler and more responsive approach.

The Learning Process Began with UMBEL’s SuperTypes

Shortly after the initial release of UMBEL we undertook an effort in 2009 to split it into a number (initially 33) of mostly disjoint “super types” [7]. This logical segmentation was done for practical reasons of efficiency and modularity. It forced us to consider what is a “concept” and what is an “entity”, among other logical splits. It caused us to inspect the entire UMBEL knowledge space, and to organize and arrange and inspect the various parts and roles of the space.

We began to distinguish “attributes” and “relations” as separate from “concepts” and “entities”. Within the clustering of “entities” we could also see that some things were distinct individuals or entity instances, while other terms represented “types” or classes of like entities. At that time, “named entity” was a more important term of art than is practiced today. In looking at this idea we noted [7]:

The intuition surrounding “named entity” and nameable “things” was that they were discrete and disjoint. A rock is not a person and is not a chemical or an event. … some classes of things could also be treated as more-or-less distinct nameable “things”: beetles are not the same as frogs and are not the same as rocks. While some of these “things” might be a true individual with a discrete name, such as Kermit the Frog, or The Rock at Northwestern University, most instances of such things are unnamed. . . . The “nameability” (or logical categorization) of things is perhaps best kept separate from other epistemological issues of distinguishing sets, collections, or classes from individuals, members or instances.

Because we were mapping Cyc and Wikipedia using UMBEL as the intermediary, we noticed that some things were characterized as a class in one system, while being an instance in the other [8]. In essence, we were learning the vocabulary of knowledge bases, and beginning to see that terminology was by no means consistent across systems or viewpoints.

This reminds me of my experience as an undergraduate, learning plant taxonomy. We had to learn literally hundreds of strange terms such as glabrous or hirsute or pinnate, all terms of art for how to describe leaves, their shapes, their hairiness, fruits and flowers and such. What happens, though, when one learns the terminology of a domain is that one’s eyes are opened to see and distinguish more. What had previously been for me a field of view merely of various shades of green and shrubs and trees, emerged as distinct species of plants and individual variants that I could clearly discern and identify. As I learned nuanced distinctions I began to be able to see with greater clarity. In a similar way, the naming and distinguishing of things in our UMBEL SuperTypes was opening up our eyes to finer and more nuanced distinctions in the knowledge base. All of this striving was in order to be able to map the millions and millions of things within Wikipedia to a different, coherent structure provided by Cyc and UMBEL.

ABox – TBox and Architectural Basics

One of the clearest distinctions that emerged was the split between the TBox and the ABox in the knowledge base, the difference between schema and instances. Through the years I have written many articles on this subject [9]. It is fundamental to understand the differences in representation and work between these two key portions of a knowledge base.

Instances are the specific individual things in the KB that are relevant to the domain. Instances can be many or few, as in the millions within Wikipedia, accounting for more than 90% of its total articles. Instances are characterized by various types of structured data, provided as key attribute-value pairs, and which may be explained through long or short text descriptions, may have multiple aliases and synonyms, may be related to other instances via type or kind or other relations, and may be described in multiple languages. This is the predominant form of content within most knowledge bases, perhaps best exemplified by Wikipedia.

The TBox, on the other hand, needs to be a coherent structural description of the domain, which expresses itself as a knowledge graph with meaningful and consistent connections across its concepts. Somewhat irrespective of the number of instances (the ABox) in the knowledge base, the TBox is relatively constant in size given a desired level of descriptive scope for the domain. (In other words, the logical model of the domain is mostly independent from the number of instances in the domain.)

For a reference structure such as UMBEL, then, the size of its ontology (TBox) can be much smaller and defined with focus, while still being able to refer to and incorporate millions of instances, as is the case for Wikipedia (or virtually any large knowledge base). Two critical aspects for the TBox thus emerge. First, it must be a coherent and reasonable “brain” for capturing the desired dynamics and relationships of the domain. And, second, it must provide a robust, flexible, and expandable means for incorporating instance records. This latter “bridging” purpose is the topic of the next sub-section.
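A minimal Python sketch can make the split concrete, assuming a toy schema invented for illustration (the class names and the `inferred_types` helper are hypothetical, not taken from UMBEL). The TBox is a small, fixed set of class-to-parent links; the ABox is the list of instance records that attach to it. Adding instances grows the ABox while the TBox stays the same size.

```python
# TBox: the schema, as child class -> parent class (hypothetical toy classes).
TBOX = {
    "Astronaut": "Person",
    "Person": "Entity",
    "City": "Place",
    "Place": "Entity",
}

# ABox: the instance records, as instance -> directly asserted class.
ABOX = {
    "Neil Armstrong": "Astronaut",
    "Paris": "City",
}


def inferred_types(instance):
    """Walk up the TBox from the asserted class to collect all types."""
    types = []
    current = ABOX[instance]
    while current is not None:
        types.append(current)
        current = TBOX.get(current)  # None once the top class is reached
    return types


print(inferred_types("Neil Armstrong"))

# Adding more instances enlarges the ABox only; the TBox is untouched.
tbox_size_before = len(TBOX)
ABOX["Sally Ride"] = "Astronaut"
ABOX["Lyon"] = "City"
print(len(ABOX), len(TBOX) == tbox_size_before)
```

The only tie-in point an instance record needs is its entity type; everything above that type comes from reasoning over the TBox.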

The TBox-ABox segregation, and how it should work logically and pragmatically, requires much study and focus. It is easy to read the words and even sometimes to write them, but it has taken us many years of practice and much thought and experimentation to derive workable information architectures for realizing and exploiting this segregation.

I have previously spelled out seven benefits from the TBox-ABox split [10], but there is another one that arises from working through the practical aspects of this segregation. Namely, an effective ABox-TBox split compels an understanding of the roles and architecture of the TBox. It is the realization of this benefit that is the central basis for the insights provided in this article.

We’ll be spelling out more of these specifics in the sections below. These understandings help us define the composition and architecture of the TBox. In the case of the current development version of UMBEL [11], here are the broad portions of the TBox:



Distribution of Types in the UMBEL TBox

Structures (types) for organizing the entities in the domain constitute nearly 90% of the work of the TBox. This reflects the extreme importance of entity types to the “bridging” function of the TBox to the ABox.

Probing the Concept of ‘Entities’

Most of the instances in the ABox are entities, but what is an “entity”? Unfortunately, that is not universally agreed. In our parlance, an “entity” and related terms are:

  • The basic, real things in our domain of interest: entities
  • The way we characterize and describe those individual things: attributes
  • The way we describe connections between two or more of those things: relations, and
  • Aggregations or collections or classes of similar entities, which also share some essence: entity types.

We no longer use the term named entities, though nouns with proper names are almost always entities. By definition, entities can not be topics or types and entities are not data types. Some of the earlier typologies by others, such as Sekine [12], also mix the ideas of attributes and entities; we do not. Lastly, by definition, entity types have the same attribute “slots” as all type members, even if no data is assigned in many or most cases. The glossary presents a complete compilation of terms and acronyms used.

The role for the label “entity” can also refer to what is known as the root node in some systems such as SUMO [13]. In the OWL language and RDF data model we use, the root node is known as “thing”. Clearly, our use of the term “entity” is much different than SUMO and resides at a subsidiary place in the overall TBox hierarchy. In this case, and frankly for most semantic matches, equivalences should be judged with care, with context the crucial deciding factor.

Nonetheless, most practitioners do not use “entity” in a root manner. Some of the first uses were in the Message Understanding Conferences, especially MUC-6 and MUC-7 in 1995 and 1997, where competitions for finding “named entities” were begun, as well as the practice of in-line tagging [14]. However, even the original MUC conferences conflated proper names and quantities of interest under “named entities.” For example, MUC-6 defined person, organization, and location names, all of which are indeed entity types, but also included dates, times, percentages, and monetary amounts, which we define as attribute types.

It did not take long for various groups and researchers to want more entity types, more distinctions. BBN categories, proposed in 2002, were used for question answering and consisted of 29 types and 64 subtypes [15]. Sekine put forward and refined over many years his Extended Entity Types, which grew to about 200 types [12]. But some of these accepted ‘named entities’ are also written in lower case, with examples such as rocks (‘gneiss’) or common animals or plants (‘daisy’) or chemicals (‘ozone’) or minerals (‘mica’) or drugs (‘aspirin’) or foods (‘sushi’) or whatever. Some deference was given to the idea of Kripke’s “rigid designators” as providing guidance for how to identify entities; rigid designators include proper names as well as certain natural kind terms such as biological species and substances. Because of these blurrings, the nomenclature of “named entities” began to fade away.

But it took only a few years before the demand shifted to “fine-grained entity” recognition, and the scope and number of types continued to creep up. Here are some additional data points beyond those already mentioned:

  • DBpedia Ontology: 738 types [16]
  • schema.org: 636 types [17]
  • YAGO: 505 types; see also HYENA [18]
  • Lee et al.: 147 types [19]
  • FIGER: 112 types [20]
  • Gillick: 86 types [21]
  • OpenCalais: 42 types [22]
  • GeoNames: 654 “feature codes” [23]
  • Nadeau: ~100 types [24].

Lastly, the new version of UMBEL has 25,000 entity types, in keeping with this growth trend and for the “bridging” reasons discussed below.

We can plot this out over time on log scale to see that the proposed entity types have been growing exponentially:



Growth in Recognition of Entity Types

This growth in entity types comes from wanting to describe and organize things with more precision. No longer do we want to talk broadly about people, but we want to talk about astronauts or explorers. We don’t just simply want to talk about products, but categories of manufactured goods like cameras or sub-types like SLR cameras or further sub-types like digital SLR cameras or even specific models like the Canon EOS 7D Mark II (skipping over even more intermediate sub-types). With sufficient instances, it is possible to train recognizers for these different entity types.

What is appropriate for a given domain, enterprise or particular task may vary the depth and scope of what entity types should be considered, which we can also refer to as context. For example, the toucan has sometimes been used as an example of how to refer to or name a thing on the Web [25]. When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we display is of but one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how physically divergent these various “toucans” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture of the toucan is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point is not a lesson on toucans, but an affirmation that distinctions between what we think we may be describing occur over multiple levels. Just as there are no self-evident criteria as to what constitutes an “entity type”, there is also not a self-evident and fully defining set of criteria as to what the physical “toucan” bird should represent. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the accompanying referential information that places the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

Both in terms of historical usage and trying to provide a framework for how to relate these various understandings of entities and types, we can thus see this kind of relationship:



Evolving Sophistication of Entity Types

What we see is an entities typology that provides the “bridging” interface between specific entity records and the UMBEL reasoning layer. This entities typology is built around UMBEL’s existing SuperTypes. The typology is the result of the classification of things according to their shared attributes and essences. The idea is that the world is divided into real, discontinuous and immutable ‘kinds’. Expressed another way, in statistics, typology is a composite measure that involves the classification of observations in terms of their attributes on multiple variables. In the context of a global KB such as Wikipedia, about 25,000 entity types are sufficient to provide a home for the millions of individual articles in the system.

Each SuperType related to entities has its own typology, and is generally expressed as a taxonomy of 3-4 levels, though there are some cases where the depth is much greater (9-10 levels) [26]. There are two flexible aspects to this design. First, because the types are “natural” and nested [27], coarse entity schema can readily find a correspondence. Second, if external records have need for more specificity and depth, that can be accommodated through a mapped bridging hierarchy as well. In other words, the typology can expand and contract like a squeezebox to map a range of specificities.
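The squeezebox idea can be sketched in a few lines of Python. The sketch assumes a nested typology stored as child-to-parent links and an external schema given as a plain set of type names; the camera types echo the example earlier in this article, while the `nearest_match` helper and the data layout are hypothetical. A fine-grained type is mapped to the closest ancestor that the external schema actually knows about.

```python
# A nested typology as child type -> parent type (toy example).
PARENT = {
    "DigitalSLRCamera": "SLRCamera",
    "SLRCamera": "Camera",
    "Camera": "Product",
}


def nearest_match(entity_type, external_types):
    """Return the closest type at or above entity_type that also
    appears in the external schema, or None if there is no match."""
    current = entity_type
    while current is not None:
        if current in external_types:
            return current
        current = PARENT.get(current)  # climb one level; None past the top
    return None


# A coarse external schema only knows about products.
print(nearest_match("DigitalSLRCamera", {"Product", "Person"}))
# A finer external schema gets a tighter correspondence.
print(nearest_match("DigitalSLRCamera", {"Product", "Camera"}))
```

The same hierarchy serves both schemas: the coarse one contracts the match up to "Product", while the finer one stops at "Camera".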

The internal process to create these typologies also has the beneficial effect of testing placements in the knowledge graph and identifying gaps in the structure as informed by fragments and orphans. The typologies should optimally be fully connected in order to completely fulfill their bridging function.

Extending the Mindset to Attributes and Relations

As with our defined terminology above [28], we can apply this same mindset to the characterizations (attributes) of entities and the relations between entities and TBox concepts or topics. Numerically, these two categories are much less prevalent than entity types. But, the construction and use of the typologies are roughly the same.

Since we are using RDF and OWL as our standard model and languages, one might ask why we are not relying on the distinction of object and datatype properties for these splits. Relations, it is true, by definition need to be object properties, since both subject and object need to be identified things. But attributes, in some cases such as rating systems or specific designators, may also refer to controlled vocabularies, which can (and, under best practice, should) be represented as object properties. So, while most attributes are built around datatype properties, not all are. Relations and attributes are a better cleaving, since we can use relations as patterns for fact extractions and the organization of attributes gives us a cross-cutting means to understand the characteristics of things independent of entity type. These all become valuable potential features for machine learning, in addition to the standard text structure.

Though, today, UMBEL is considerably more sophisticated in its entities typologies, we already have a start on an attributes typology by virtue of the prior work on the Attributes Ontology [29], which will be revised to conform to this newer typology model. We also have a start on a relations typology based on existing OWL and RDF predicates used in UMBEL, plus many candidates from the Activities SuperType. As with the entities typology, relation types and attribute types may also have hierarchy, enabling reasoning and the squeezebox design. As with the entities typology, the objective is to have a fully connected hierarchy, of perhaps no more than 3-4 levels depth, with no fragments and no orphans.

A Different Role for Annotations

Annotations about how we label things and how we assign metadata to things reside at a different layer than what has been discussed to this point. Annotations cannot be reasoned over, but they can and do play pivotal roles. Annotations are an important means for tagging, matching and slicing-and-dicing the information space. Metadata can perform those roles, but also may be used to analyze provenance and reasoning, if the annotations are mapped to object or datatype properties.

Labels are the means to broaden the correspondence of real-world reference to match the true referents or objects in the knowledge base. This enables the referents to remain abstract; that is, not tied to any given label or string. In best practice we recommend annotations reflect all of the various ways a given object may be identified (synonyms, acronyms, slang, jargon, all by language type). These considerations improve the means for tagging, matching, and slicing-and-dicing, even if the annotations are not able to be reasoned over.
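
A minimal sketch of how such labels can support matching while the referent itself stays abstract follows. The `kb:` identifiers, the label lists, and both helper functions are hypothetical, made up for illustration; a real system would also track the language of each label.

```python
from collections import defaultdict

# Hypothetical annotations: referent identifier -> alternative labels.
# The identifiers and labels are invented for illustration.
LABELS = {
    "kb:Toucan": ["toucan", "Ramphastidae", "tucán"],
    "kb:UnitedNations": ["United Nations", "UN", "U.N."],
}

def build_label_index(labels):
    """Map each case-folded label to the set of referents it may denote."""
    index = defaultdict(set)
    for referent, names in labels.items():
        for name in names:
            index[name.casefold()].add(referent)
    return index

def match(text, index):
    """Return the referents whose labels equal the given string, ignoring case."""
    return sorted(index.get(text.casefold(), set()))

index = build_label_index(LABELS)
print(match("un", index))
print(match("Tucán", index))
```

The lookup is pure string matching; nothing here is reasoned over, which is consistent with the role of annotations described above.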

As a mental note for the simple design that follows, imagine a transparent overlay, not shown, upon which best-practice annotations reside.

A Simple Design Brings it All Together

The insights provided here have taken much time to discover; they have arisen from a practical drive to make knowledge bases computable and useful to artificial intelligence applications. Here is how we now see the basics of a knowledge base, properly configured to be computable, and open to integration with external records:

Boiling KBs Down to Basics


At the broadest perspective, we can organize our knowledge-base platform into a “brain” or organizer/reasoner, and the instances or specific things or entities within our domain of interest. We can decompose a KB to become computable by providing various type groupings for our desired instance mappings, and an upper reasoning layer. An interface layer of “types”, organized into three broad groupings, provides the interface, or “bridging” layer, between the TBox and ABox. We thus have an architectural design segregating:

  • Topics and upper level — the general organization and “brains” of the domain
  • Entity types — categorizations of the actual things in the space
  • Relation types — the ways that different things are related to, or act upon, one another
  • Attribute types — a structured organization of the ways that individual entities can be described
  • Instances — the individual entities of the domain, and
  • Properties — the source grist for annotations, relation types and attribute types.

Becoming familiar with this terminology helps to show how the conventional understanding of these terms and structure has led to overlooking key insights and focusing (sometimes) on the wrong matters. That is in part why so much of the simple logic of this design has escaped the attention of most practitioners. For us, personally, at Structured Dynamics, it had eluded us for years, and we were actively seeking it.

Irrespective of terminology, the recognition of the role of types and their bridging function to actual instance data (records) is central to the computability of the structure. It also enables integration with any form of data record or record stores. The ability to understand relation types leads to improved relation extraction, a key means to mine entities and connections from general content and to extend assertions in the knowledge base. Entity types provide a flexible means for any entity to connect with the computable structure. And, the attribute types provide an orthogonal and inferential means to slice the information space by how objects get characterized.

Because of this architecture, the reference sources guiding its construction, its typologies, its ability to generate features and training sets, and its computability, we believe this overall design is suited to provide an array of AI and enterprise services:

Machine Intelligence Apps and Services
  • Entity recognizers
  • Relation extractors
  • Event extractors
  • Phrase identification
  • Classifiers
  • Q & A systems
  • Cognitive computing
  • Semantic publishing
  • Knowledge base mappings
  • Sub-graph extraction
  • Ontology development
  • Ontology mappers
  • Entity dictionaries
  • Entity linkers
  • Data conversion and mapping
  • Master data management
  • KB improvements
  • Attribute “slot filling”
  • Disambiguators
  • Duplicates removal
  • Inference and reasoning
  • Sentiment analysis
  • Semantic relatedness
  • Recommendation systems
  • Bespoke analysis
  • Bespoke platforms

By cutting through the clutter — conceptually and terminologically — it has been possible to derive a practical and repeatable design to make KBs computable. Being able to generate features and positive and negative training sets, almost at will, is proving to be an effective approach to machine learning at mass-produced prices.


[1] See M. K. Bergman, 2014. “Knowledge-based Artificial Intelligence,” from AI3:::Adaptive Information blog, November 17, 2014. For additional writings in the series, see https://www.mkbergman.com/category/kbai/.
[2] See Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014) and M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010; the listing as of its last update included 246 articles. Also, see Wikipedia’s own “Wikipedia in Academic Studies.”
[3] A possible exception to this observation is the biomedical community through its Open Biological and Biomedical Ontologies (OBO) initiative.
[4] See M.K. Bergman, 2007. “Announcing UMBEL: A Lightweight Subject Structure for the Web,” AI3:::Adaptive Information blog, July 12, 2007. Also see http://umbel.org.
[5] See M.K. Bergman, 2008. “Basing UMBEL’s Backbone on OpenCyc,” AI3:::Adaptive Information blog, April 2, 2008.
[6] See M.K. Bergman, 2015. “Shaping Wikipedia into a Computable Knowledge Base,” AI3:::Adaptive Information blog, March 31, 2015.
[7] M.K. Bergman, 2009. “‘SuperTypes’ and Logical Segmentation of Instances,” AI3:::Adaptive Information blog, September 2, 2009.
[8] This possible use of an item as both a class and an instance through “punning” is a desirable feature of OWL 2, which is the language basis for UMBEL. You can learn more on this subject in M.K. Bergman, 2010. “Metamodeling in Domain Ontologies,” AI3:::Adaptive Information blog, September 20, 2010.
[9] For a listing of these, see the Google query https://www.google.com/search?q=tbox+abox+site%3Amkbergman.com. One of the 40 articles with the most relevant commentary to this article is M.K. Bergman, 2014. “Big Structure and Data Interoperability,” AI3:::Adaptive Information blog, August 14, 2014.
[10] M.K. Bergman, 2009. “Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, February 11, 2009.
[11] The current development version of UMBEL is v 1.30. It is due for release before the end of 2015.
[12] See the Sekine Extended Entity Types; the listing also includes attributes info at bottom of source page.
[14] N. Chinchor, 1997. “Overview of MUC-7,” MUC-7 Proceedings, 1997.
[15] Ada Brunstein, 2002. “Annotation Guidelines for Answer Types”. LDC Catalog, Linguistic Data Consortium. Aug 3, 2002.
[16] Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann, 2009. “DBpedia-A Crystallization Point for the Web of Data.” Web Semantics: science, services and agents on the world wide web 7, no. 3 (2009): 154-165; 170 classes in this paper. That has grown to more than 700; see http://mappings.dbpedia.org/server/ontology/classes/ and http://wiki.dbpedia.org/services-resources/datasets/dataset-2015-04/dataset-2015-04-statistics.
[17] The listing is under some dynamic growth. This is the official count as of September 8, 2015, from http://schema.org/docs/full.html. Current updates are available from Github.
[18] Joanna Biega, Erdal Kuzey, and Fabian M. Suchanek, 2013. “Inside YAGO2: A Transparent Information Extraction Architecture,” in Proceedings of the 22nd international conference on World Wide Web, pp. 325-328. International World Wide Web Conferences Steering Committee, 2013. Also see Mohamed Amir Yosef, Sandro Bauer, Johannes Hoffart, Marc Spaniol, Gerhard Weikum, 2012. “HYENA: Hierarchical Type Classification for Entity Names,” in Proceedings of the 24th International Conference on Computational Linguistics, Coling 2012, Mumbai, India, 2012.
[19] Changki Lee, Yi-Gyu Hwang, Hyo-Jung Oh, Soojong Lim, Jeong Heo, Chung-Hee Lee, Hyeon-Jin Kim, Ji-Hyun Wang, and Myung-Gil Jang, 2006. “Fine-grained Named Entity Recognition using Conditional Random Fields for Question Answering,” in Information Retrieval Technology, pp. 581-587. Springer Berlin Heidelberg, 2006.
[20] Xiao Ling and Daniel S. Weld, 2012. “Fine-Grained Entity Recognition,” in AAAI. 2012.
[21] Dan Gillick, Nevena Lazic, Kuzman Ganchev, Jesse Kirchner, and David Huynh, 2014. “Context-Dependent Fine-Grained Entity Type Tagging.” arXiv preprint arXiv:1412.1820 (2014).
[24] David Nadeau, 2007. “Semi-supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision.” PhD Thesis, School of Information Technology and Engineering, University of Ottawa, 2007.
[25] M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” AI3:::Adaptive Information blog, January 24, 2012.
[26] A good example of description and use of typologies is in the archaeology description on Wikipedia.
[27] M.K. Bergman, 2015. “‘Natural’ Classes in the Knowledge Web,” AI3:::Adaptive Information blog, July 13, 2015.
[28] Also see my Glossary for definitions of specific terminology used in this article.
[29] M.K. Bergman, 2015. “Conceptual and Practical Distinctions in the Attributes Ontology,” AI3:::Adaptive Information blog, March 3, 2015.
Posted:November 17, 2014

Manifold Learning Methods, from http://www.skybluetrades.net/blog/posts/2011/10/30/machine-learning/

What Goes Around, Comes Around, Only Now with Real Knowledge

A recent interview with a noted researcher, IEEE Fellow Michael I. Jordan, Pehong Chen Distinguished Professor at the University of California, Berkeley, offered a deflating view of recent AI hype. Jordan was particularly critical of metaphors likening AI to real brain function and took the air out of the balloon about algorithm advances, pointing out that most current methods have roots that go back decades [1]. In fact, the roots of knowledge-based artificial intelligence (KBAI), the subject of this article, also extend back decades.

Yet the real point, only briefly touched upon by Jordan in his lauding of Amazon’s recommendation service, is that the dynamo in recent AI progress has come from advances in the knowledge and statistical bases driving these algorithms. The improved digital knowledge bases behind KBAI have been the power behind these advances.

Knowledge bases are finally being effectively combined with AI, a dynamic synergy that is only now being recognized, let alone leveraged. As this realization increases, many forms of useful information structure in the wild will begin to be mapped to these knowledge bases, which will further extend the benefits we are now seeing from KBAI.

Knowledge-based artificial intelligence, or KBAI, is the use of large statistical or knowledge bases to inform feature selection for machine-based learning algorithms used in AI. The use of knowledge bases to train the features of AI algorithms improves the accuracy, recall and precision of these methods. This improvement leads to perceptibly better results to information queries, including pattern recognition. Further, in a virtuous circle, KBAI techniques can also be applied to identify additional possible facts within the knowledge bases themselves, improving them further still for KBAI purposes.

It is thus, in my view, the combination of KB + AI that has led to the notable AI breakthroughs of the past, say, decade. It is in this combination that we gain the seeds for sowing AI benefits in other areas, from tagging and disambiguation to the complete integration of text with conventional data systems. And, oh, by the way, the structure of all of these systems can be made inherently multi-lingual, meaning that context and interpretation across languages can be brought to our understanding of concepts.

Structured Dynamics is working to democratize a vision of KBAI that brings its benefits to any enterprise, using the same approaches that the behemoths of the industry have used to innovate knowledge-based artificial intelligence in the first place. How and where the benefits of such KBAI may apply is the subject of this article.

A Brief History of Knowledge-based Systems

Knowledge-based artificial intelligence is not a new idea. Its roots extend back perhaps to one of the first AI applications, Dendral. In 1965, nearly a half century ago, Edward Feigenbaum initiated Dendral, which became a ten-year effort to develop software to deduce the molecular structure of organic compounds using scientific instrument data. Dendral was the first expert system and used mass spectra or other experimental data together with a knowledge base of chemistry to produce a set of possible chemical structures. This set the outline for what came to be known as knowledge-based systems, which are one or more computer programs that reason and use knowledge bases to solve complex problems.

Indeed, it was in the area of expert systems that AI first came to the attention of most enterprises. According to Wikipedia,

Expert systems were designed to solve complex problems by reasoning about knowledge, represented primarily as if–then rules rather than through conventional procedural code. The first expert systems were created in the 1970s and then proliferated in the 1980s. Expert systems were among the first truly successful forms of AI software.

Expert systems spawned the idea of knowledge engineers, whose role was to interview and codify the logic of the chosen experts. But, expert systems proved to be expensive to build and difficult to maintain and tune. As the influence of expert systems waned, another branch emerged, that of knowledge-based engineering and its support for CAD- and CASE-type systems. Still, overall penetration to date of most knowledge-based systems can most charitably be described as disappointing.

The specific identification of “KBAI” was (to my knowledge) first made in a Carnegie-Mellon University report to DARPA in 1975 [2]. The source knowledge bases were broadly construed, including listings of hypotheses. The first known patent citing knowledge-based artificial intelligence is from 1992 [3]. Within the next ten years there were dedicated graduate-level course offerings on KBAI at many universities, including at least Indiana University, SUNY Buffalo, and Georgia Tech.

In 2007, Bossé et al. devoted a chapter to KBAI in their book on information fusion, but still, at that time, the references were more generic [4]. However, by 2013, the situation was changing fast, as this quote from Hovy et al. indicates [5]:

“Recently, however, this stalemate [the so-called ‘knowledge acquisition bottleneck’] has begun to loosen up. The availability of large amounts of wide-coverage semantic knowledge, and the ability to extract it using powerful statistical methods, are enabling significant advances in applications requiring deep understanding capabilities, such as information retrieval and question-answering engines. Thus, although the well-known problems of high cost and scalability discouraged the development of knowledge-based approaches in the past, more recently the increasing availability of online collaborative knowledge resources has made it possible to tackle the knowledge acquisition bottleneck by means of massive collaboration within large online communities. This, in turn, has made these online resources one of the main driving forces behind the renaissance of knowledge-rich approaches in AI and NLP – namely, approaches that exploit large amounts of machine-readable knowledge to perform tasks requiring human intelligence.” (citations removed)

The waxing and waning of knowledge-based systems and their evolution over fifty years have led to a pretty well-defined space, even if not all component areas have achieved their commercial potential. Besides areas already mentioned, knowledge-based systems also include:

  • Knowledge models — formalisms for knowledge representation and reasoning, and
  • Reasoning systems — software that generates conclusions from available knowledge using logical techniques such as deduction and induction.

We can organize these subdomains as follows. Note particularly that the branch of KBAI (knowledge-based artificial intelligence) has two main denizens: recognized knowledge bases, such as Wikipedia, and statistical corpora. The former are familiar and evident around us; the latter are largely proprietary and not (generally) publicly accessible:

Some prominent knowledge bases and statistical corpora are identified below. Knowledge bases are coherently organized information with instance data for the concepts and relationships covered by the domain at hand, all accessible in some manner electronically. Knowledge bases can extend from the nearly global, such as Wikipedia, to very specific topic-oriented ones, such as restaurant reviews or animal guides. Some electronic knowledge bases are designed explicitly to support digital consumption, in which case they are fairly structured with defined schema and standard data formats and, increasingly, APIs. Others may be electronically accessible and highly relevant, but the data is not staged in an easily consumable way, thereby requiring extraction and processing prior to use.

The use and role of statistical corpora are harder to discern. Statistical corpora are organized statistical relationships or rankings that facilitate the processing of (mostly) textual information. Uses can range from entity extraction to machine language translation. Extremely large sources, such as search engine indexes or massive crawls of the Web, are most often the sources for these knowledge sets. But, most are applied internally by those Web properties that control this big data.

The Web is the reason these sources — both statistical corpora and knowledge bases — have proliferated, so the major means of consuming them is via Web services with the information defined and linked to URIs.

My major thesis has been that it is the availability of electronically accessible knowledge bases, exemplified and stimulated by Wikipedia [6], that has been the telling factor in recent artificial intelligence advances. For example, there are at least 500 different papers that cite using Wikipedia for various natural language processing, artificial intelligence, or knowledge base purposes [7]. These papers began to stream into conferences about 2005 to 2006, and have not abated since. In turn, the various techniques innovated for extracting more and more structure and information from Wikipedia are being applied to other semi-structured knowledge bases, resulting in a true renaissance of knowledge-based processing for AI purposes. These knowledge bases are emerging as the information substrate under many recent computational advances.

Knowledge Bases in Relation to Overall Artificial Intelligence

A few months ago I pulled together a bit of an interaction diagram to show the relationships between major branches of artificial intelligence and structures arising from big data, knowledge bases, and other organizational schema for information:

What we are seeing is a system emerging whereby multiple portions of this diagram interact to produce innovations. Take, for example, Apple’s Siri [8], or Google’s Google Now or the many similar systems that have emerged on smartphones. Spoken instructions are decoded to text, which is then parsed and evaluated for intent and meaning and then posed to a general knowledge base. The text results are then modulated back to speech with the answer in the smartphone’s speakers. The pattern recognition at the front and back end of this workflow has been made better through statistical datasets derived from phonemes and text. The text understanding is processed via natural language processing and semantic technologies [16], with the question understanding and answer formulation coming from one or more knowledge bases.

This remarkable chain of processing is now almost taken for granted, though its commercial use is less than five years old. For different purposes with different workflows we see effective question answering and diagnosis with systems like IBM’s Watson [9] and structured search results from Google’s Knowledge Graph [10]. Try posing some questions to Wolfram Alpha and then stand back and be impressed with the data visualization. Behind the scenes, pattern recognition from faces to general images or thumbprints is further eroding the distinction between man and machine. Google Translate now covers language translation between 60 human languages [11], and pretty damn effectively, too. All major Web players are active in these areas, from Amazon’s recommendation system [12] to Facebook [42], Microsoft [13], Twitter [14] or Baidu [15].

Though not universal, most all recent AI advances leveraging knowledge bases have utilized Wikipedia in one way or another. Even Freebase, the core of Google’s Knowledge Graph, did not really blossom as a separate data crowdsourcing concern until its former owner, Metaweb, decided to bring Wikipedia into its system. Many other knowledge bases, as noted below, are also derivatives or enhancements to Wikipedia in one way or another.

I believe the reasons for Wikipedia’s influence have arisen from its nearly global scope, its mix of semi-structured data and text, its nearly 200 language versions, and its completely open and accessible nature. Regardless, it is also certainly true that techniques honed with Wikipedia are now being applied to a diversity of knowledge bases. We are also seeing an appreciation start to grow in how knowledge bases can enhance the overall AI effort.

Useful Statistical and Knowledge Sources

The diagram on knowledge-based systems above shows two kinds of databases contributing to KBAI: statistical corpora or databases and true knowledge bases. The statistical corpora tend to be hidden behind proprietary curtains, and also more limited in role and usefulness than general knowledge bases.

Statistical Corpora

The statistical corpora or databases tend to be of a very specific nature. While lists of text corpora and many other things may contribute to this category, the ones actually in commercial use tend to be quite focused in scope, very large, and designed for bespoke functionality. A good example, and one that has been contributed for public use, is the Web 1T 5-gram data set [17]. This data set, contributed by Google for public use in 2006, contains English word n-grams and their observed frequency counts. N-grams capture word tokens that often coincide with one another, from single words to phrases. The length of the n-grams ranges from unigrams (single words) to five-grams. The database was generated from approximately 1 trillion word tokens of text from publicly accessible Web pages.
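
For readers unfamiliar with n-gram frequency tables, the following is a minimal sketch of how such counts are built, from unigrams through five-grams. The `ngram_counts` function and the sample sentence are my own illustration; plain whitespace tokenization and the absence of any frequency cutoff are simplifying assumptions, not a description of how the Web 1T data set was actually produced.

```python
from collections import Counter

def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of word tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy input; a real corpus would hold billions of tokens.
text = "the cat sat on the mat and the cat slept"
counts = ngram_counts(text.split())

print(counts[("the",)])        # frequency of the unigram "the"
print(counts[("the", "cat")])  # frequency of the bigram "the cat"
```

At Web scale the same tallying is done over roughly a trillion tokens, which is why tables of this kind take so much input to build.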

Another example of statistical corpora is the kind used in Google’s Translate capabilities [11]. According to Franz Josef Och, who was the lead manager at Google for its translation activities and an articulate spokesperson for statistical machine translation, a solid base for developing a usable language translation system for a new pair of languages should consist of a bilingual text corpus of more than a million words, plus two monolingual corpora each of more than a billion words. Statistical frequencies of word associations form the basis of these reference sets. Google originally seeded its first language translators with multiple language texts from the United Nations [18].

Such lookup or frequency tables in fact can shade into what may be termed a knowledge base as they gain more structure. NELL, for example (and see below), contains a relatively flat listing of assertions extracted from the Web for various entities; it goes beyond frequency counts or relatedness, but does not have the full structure of a general knowledge base like Wikipedia [19]. We thus can see that statistical corpora and knowledge bases in fact reside on a continuum of structure, with no bright line to demark the two categories.

Nonetheless, most statistical corpora will never be seen publicly. Building them requires large amounts of input information. And, once built, they can offer significant commercial value to their developers to drive various machine learning systems and for general lookup.

Knowledge Bases

There are literally hundreds of knowledge bases useful to artificial intelligence, most of a restricted domain nature. Listed below, partially informed by Suchanek and Weikum’s work [20], are some of the broadest ones available. Note that many leverage or are derivatives of or extensions to Wikipedia:

  • BabelNet — is a multilingual lexicalized semantic network and ontology automatically created by linking Wikipedia to WordNet [21]
  • Biperpedia — is an ontology with 1.6M (class, attribute) pairs and 67K distinct attribute names, a totally unique resource, but one that is not publicly available [22]
  • ConceptNet — is a semantic network with concepts as nodes and edges that are assertions of common sense about these concepts [23]
  • Cyc — is an artificial intelligence project that attempts to assemble a comprehensive ontology and knowledge base of everyday common sense knowledge, with the goal of enabling AI applications to perform human-like reasoning [24]
  • DBpedia — extracts structured content from the information created as part of the Wikipedia, principally from its infoboxes [25]
  • DeepDive — employs statistical learning and inference to combine diverse data resources and best-of-breed algorithms in order to construct knowledge bases from hundreds of millions of Web pages [26]
  • EntityCube — is a knowledge base built from the statistical extraction of structured entities, named entities, entity facts and relations from the Web [27]
  • Freebase — is a large collaborative knowledge base consisting of metadata composed mainly by its community members, but centered initially on Wikipedia; Freebase is a key input component to Google’s Knowledge Graph [28]
  • GeoNames — is a geographical database that contains over 10,000,000 geographical names corresponding to over 7,500,000 unique features [29]
  • ImageNet — is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images [30]
  • KnowItAll — is a variety of domain-independent systems that extract information from the Web in an autonomous, scalable manner [31]
  • Knowledge Vault (Google) — is a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories; it is not publicly available [32]
  • LEVAN (Learning About Anything) — is a fully-automated approach for learning extensive models for a wide range of variations (e.g., actions, interactions, attributes and beyond) from images for any concept, leveraging the vast resources of online books [43]
  • NELL (Never-Ending Language Learning system) — is a semantic machine learning system developed at Carnegie Mellon University that identifies a basic set of fundamental semantic relationships between a few hundred predefined categories of data, such as cities, companies, emotions and sports teams [19]
  • Probase — is a universal, probabilistic taxonomy that contains 2.7 million concepts harvested from a corpus of 1.68 billion web pages [33]
  • UMBEL — is an upper ontology of about 28,000 reference concepts and a vocabulary for aiding that ontology mapping, including expressions of likelihood relationships [34]
  • Wikidata — is a common source of certain data types (for example, birth dates) which can be used by Wikimedia projects such as Wikipedia [35]
  • WikiNet — is a multilingual extension of the facts found in the multiple language versions of Wikipedia [36]
  • WikiTaxonomy — is a large-scale taxonomy derived from the category relationships in Wikipedia [37]
  • Wolfram Alpha — is a computational knowledge engine made available by subscription as an online service that answers factual queries directly by computing the answer from externally sourced “curated data”
  • WordNet — is a lexical database for the English language that groups words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members [38]
  • YAGO — is extracted from Wikipedia (e.g., categories, redirects, infoboxes), WordNet (e.g., synsets, hyponymy), and GeoNames [39].

What Work is Being Done and the Future

It is instructive to inspect what kinds of work or knowledge these bases are contributing to the AI enterprise. The most important contribution, in my mind, is structure. This structure can relate to the subsumption (is-a) or part of (mereology) relationships between concepts. This form of structure contributes most to understanding the taxonomy or schema of a domain; that is, its scaffolding of concepts. This structure helps orient the instance data and other external structures, generally through some form of mapping.

The next rung of contribution from these knowledge bases is in the nature of the relations between concepts and their instances. These form the predicates or nature of the relationships between things. This kind of contribution is also closely related to the attributes of the concepts and the properties of the things that populate the structure. This kind of information tends to be the kind of characteristics that one sees in a data record: a specific thing and the values for the fields by which it is described.

Another contribution from knowledge bases comes from identity and disambiguation. Identity works in that we can point to authoritative references (with associated Web identifiers) for all of the individual things and properties in our relevant domain. We can use these identities to decide the “canonical” form, which also gives us a common reference for referring to the same things across information sources or data sets. We also gain the means for capturing the various ways that anything can be described, that is, the synonyms, jargon, slang, acronyms or insults that might be associated with something. That understanding helps us identify the core item at hand. When we extend these ideas to the concepts or types that populate our relevant domain, we can also begin to establish context and other relationships to individual things. When we encounter the person “John Smith” we can use co-occurring concepts to help distinguish John Smith the plumber from John Smith the politician or John Smith the policeman. As more definition and structure is added, our ability to discriminate and disambiguate goes up.
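To make the “John Smith” case concrete, here is a minimal sketch in Python of disambiguation by co-occurring concepts. The candidate senses, their concept sets and the simple overlap score are all invented for illustration; a real knowledge base would supply far richer context and a better scoring method.

```python
# Toy candidate senses for the ambiguous name "John Smith".
# The concept sets are hypothetical and exist only for this illustration.
CANDIDATES = {
    "John Smith (plumber)": {"pipe", "leak", "wrench", "repair"},
    "John Smith (politician)": {"election", "senate", "vote", "campaign"},
    "John Smith (policeman)": {"arrest", "patrol", "badge", "precinct"},
}


def disambiguate(context_concepts, candidates=CANDIDATES):
    """Return the candidate sharing the most concepts with the context, or None."""
    context = set(context_concepts)
    scores = {name: len(concepts & context) for name, concepts in candidates.items()}
    best = max(scores, key=scores.get)
    # No overlap at all means the context gives no basis for choosing.
    return best if scores[best] > 0 else None


print(disambiguate(["campaign", "vote", "speech"]))
```

Adding more concepts to each candidate (more “definition and structure”) gives the overlap score more to work with, which is the effect the paragraph above describes.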

Some of this work resides in the interface between schema (concepts) and the instances (individuals) that populate that schema, what I elsewhere have described as the work between the T-Box and A-Box of knowledge bases [40]. In any case, with richer understandings of how we describe and discern things, we can now begin to do new work, not possible when these understandings were lacking. We can now, for example, do semantic search where we can relate multiple expressions for the same things or infer relationships or facets that either allow us to find more relevant items or better narrow our search interests.

With true knowledge bases and logical approaches for working with them and their structure, we can begin doing direct question answering. With more structure and more relationships, we can also do so in rather sophisticated ways, such as identifying items with multiple shared characteristics or within certain ranges or combinations of attributes.
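As a small illustration of that kind of query, the sketch below filters records by several shared characteristics and by a range on one attribute. The records, field names and the `query` helper are hypothetical; they stand in for whatever structure a real knowledge base exposes.

```python
# Hypothetical instance records; names and values are invented for illustration.
RECORDS = [
    {"name": "Org A", "type": "Company", "region": "North", "founded": 1995},
    {"name": "Org B", "type": "Company", "region": "North", "founded": 2008},
    {"name": "Org C", "type": "University", "region": "North", "founded": 1990},
    {"name": "Org D", "type": "Company", "region": "South", "founded": 2001},
]


def query(records, shared=None, ranges=None):
    """Names of records matching every shared value and every inclusive range."""
    shared = shared or {}
    ranges = ranges or {}
    hits = []
    for rec in records:
        same = all(rec.get(key) == value for key, value in shared.items())
        within = all(
            key in rec and low <= rec[key] <= high
            for key, (low, high) in ranges.items()
        )
        if same and within:
            hits.append(rec["name"])
    return hits


# "Which northern companies were founded between 1990 and 2000?"
print(query(RECORDS,
            shared={"type": "Company", "region": "North"},
            ranges={"founded": (1990, 2000)}))
```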

Structured information and the means to query it now give us a powerful, virtuous circle whereby our knowledge bases can drive the feature selection of AI algorithms, while those very same algorithms can help find still more features and structure in our knowledge bases. The interaction between AI and the KBs means we can add still further structure and refinement to the knowledge bases, which then makes them still better sources of features for informing the AI algorithms.

Once this threshold of feature generation is reached, we now have a virtuous dynamo for knowledge discovery and management. We can use our AI techniques to refine and improve our knowledge bases, which then makes it easier to improve our AI algorithms and incorporate still further external information. Effectively utilized KBAI thus becomes a generator of new information and structure.

This virtuous circle has not yet been widely applied beyond the early phases of, say, adding more facts to Wikipedia, as some of our examples above show. But these same basic techniques can be applied to the very infrastructural foundations of KBAI systems in such areas as data integration, mapping to new external structure and information, hypothesis testing, diagnostics and predictions, and the myriad of other uses to which AI has been hoped to contribute for decades. The virtuous circle between knowledge bases and AIs does not require us to make leaps and bounds improvements in our core AI algorithms. Rather, we need only stoke our existing AI engines with more structure and knowledge fuel in order to keep the engine chugging.

The vision of a growing nexus of KBAI should also prove that efficiencies and benefits increase through a power function of the network effect, similar to what I earlier described in the Viking algorithm [41]. We know how we can extract further structure and benefit from Wikipedia. We can see how such a seed catalyst can also be the means for mapping and relating more specific domain knowledge bases and structure. The beauty of this vision is that we already can see the threshold benefits from a decade of KBAI development. Each new effort — and there are many — will only act to add to these benefits, with each new increment contributing more than the increment that came before. That sounds to me like productivity, and a true basis for wealth creation.


[2] Lee D. Erman and Victor R. Lesser, 1975. A Multi-level Organization for Problem Solving Using Many Diverse, Cooperating Sources of Knowledge, DARPA Report AD-AO12-919, Carnegie-Mellon University, Pittsburgh, Pa. 15213, March, 1975, 24 pp. See http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA012916. The authors followed this up with a 1978 book, System engineering techniques for artificial intelligence systems. Academic Press, 1978.
[4] Éloi Bossé, Jean Roy, Steve Wark, Eds., 2007. Concepts, Models, and Tools for Information Fusion, Artech House Publishers, 2007-02-28, SKU-13/ISBN: 9781596939349. Chapter 11 is devoted to “Knowledge-Based and Artificial Intelligence Systems.”
[5] Eduard Hovy, Roberto Navigli, and Simone Paolo Ponzetto, 2013. “Collaboratively Built Semi-Structured Content and Artificial Intelligence: The Story So Far,” Artificial Intelligence 194 (2013): 2-27. See http://wwwusers.di.uniroma1.it/~navigli/pubs/AIJ_2012_Hovy_Navigli_Ponzetto.pdf.
[6] See M.K. Bergman, “SWEETpedia,” listing of Wikipedia research articles, on AI3:::Adaptive Information blog, January 25, 2010. The listing as of its last update included 246 articles.
[7] See the combined references between [6] and [20]; also, see Wikipedia’s own “Wikipedia in Academic Studies.”
[8] Wade Roush, 2010. “The Story of Siri, from Birth at SRI to Acquisition by Apple–Virtual Personal Assistants Go Mobile.” Xconomy.com 14 (2010).
[9] Special Issue on “This is Watson”, 2012. IBM Journal of Research and Development 56(3/4), May/June 2012. See also this intro to Watson video.
[10] Amit Singhal, 2012. “Introducing the Knowledge Graph: Things, not Strings,” Google Blog, May 16, 2012.
[11] Thomas Schulz, 2013. “Google’s Quest to End the Language Barrier,” September 13, 2013, Spiegel Online International. See http://www.spiegel.de/international/europe/google-translate-has-ambitious-goals-for-machine-translation-a-921646.html. Also see, Alon Halevy, Peter Norvig, and Fernando Pereira, 2009. “The Unreasonable Effectiveness of Data,” in IEEE Intelligent Systems, March/April 2009, pp 8-12.
[12] Greg Linden, Brent Smith, and Jeremy York, 2003. “Amazon.com Recommendations: Item-to-item Collaborative Filtering,” Internet Computing, IEEE 7, no. 1 (2003): 76-80.
[13] Athima Chansanchai, 2014. “Microsoft Research Shows off Advances in Artificial Intelligence with Project Adam,” on Next at Microsoft blog, July 14, 2014. Also, see [33].
[14] Jimmy Lin and Alek Kolcz, 2012. “Large-Scale Machine Learning at Twitter,” SIGMOD, May 20–24, 2012, Scottsdale, Arizona, US. See http://www.dcs.bbk.ac.uk/~DELL/teaching/cc/paper/sigmod12/p793-lin.pdf.
[15] Robert Hof, 2014. “Interview: Inside Google Brain Founder Andrew Ng’s Plans To Transform Baidu,” Forbes Online, August 28, 2014.
[16] Alok Prasad and Lee Feigenbaum, 2014, “How Semantic Web Tech Can Make Big Data Smarter,” in CMSwire, Oct 6, 2014.
[17] The Web 1T 5-gram data set is available from the Linguistic Data Consortium, University of Pennsylvania.
[18] Franz Josef Och, 2005. “Statistical Machine Translation: Foundations and Recent Advances,” The Tenth Machine Translation Summit, Phuket, Thailand, September 12, 2005.
[19] Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka Jr, and Tom M. Mitchell, 2010. “Toward an Architecture for Never-Ending Language Learning,” in AAAI, vol. 5, p. 3. 2010.
[20] Fabian M. Suchanek and Gerhard Weikum, 2014. “Knowledge Bases in the Age of Big Data Analytics,” Proceedings of the VLDB Endowment 7, no. 13 (2014).
[21] Roberto Navigli and Simone Paolo Ponzetto, 2012. “BabelNet: The Automatic Construction, Evaluation and Application of a Wide-coverage Multilingual Semantic Network,” Artificial Intelligence 193 (2012): 217-250.
[22] Rahul Gupta, Alon Halevy, Xuezhi Wang, Steven Euijong Whang, and Fei Wu, 2014. “Biperpedia: An Ontology for Search Applications,” Proceedings of the VLDB Endowment 7(7), 2014.
[23] Robert Speer and Catherine Havasi, “Representing General Relational Knowledge in ConceptNet 5,” LREC 2012, pp. 3679-3686.
[24] Douglas B. Lenat and R. V. Guha, 1990. Building Large Knowledge-Based Systems: Representation and Inference in the Cyc Project, Addison-Wesley, 1990 ISBN 0-201-51752-3.
[25] Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives 2007. “DBpedia: A Nucleus for a Web of Open Data,” presented at ISWC 2007, Springer Berlin Heidelberg, 2007.
[26] Feng Niu, Ce Zhang, Christopher Ré, and Jude W. Shavlik, 2012. “DeepDive: Web-scale Knowledge-base Construction using Statistical Learning and Inference,” in VLDS, pp. 25-28. 2012.
[27] Zaiqing Nie, J-R. Wen, and Wei-Ying Ma, 2012. “Statistical Entity Extraction From the Web,” in Proceedings of the IEEE 100, no. 9 (2012): 2675-2687.
[28] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor, 2008. “Freebase: a Collaboratively Created Graph Database for Structuring Human Knowledge,” in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247-1250. ACM, 2008.
[29] Mark Wick and Bernard Vatant, 2012. “The GeoNames Geographical Database,” available from the World Wide Web at http://geonames.org (2012).
[30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, 2009. “ImageNet: A Large-scale Hierarchical Image Database,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255.
[31] Oren Etzioni, Michael Cafarella, Doug Downey, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates, 2005. “Unsupervised Named-entity Extraction from the Web: An Experimental Study,” Artificial Intelligence 165, no. 1 (2005): 91-134.
[32] Xin Luna Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun and Wei Zhang, 2014. “Knowledge Vault: a Web-scale Approach to Probabilistic Knowledge Fusion,” KDD 2014.
[33] Wentao Wu, Hongsong Li, Haixun Wang, and Kenny Q. Zhu, 2012. “Probase: A Probabilistic Taxonomy for Text Understanding,” in Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 481-492. ACM, 2012.
[34] M.K. Bergman, 2008. “UMBEL: A Subject Concepts Reference Layer for the Web”, Slideshare, July 2008.
[35] Denny Vrandečić and Markus Krötzsch, 2014. “Wikidata: A Free Collaborative Knowledgebase,” Communications of the ACM 57, no. 10 (2014): 78-85.
[36] Vivi Nastase and Michael Strube. 2013. “Transforming Wikipedia into a Large Scale Multilingual Concept Network,” Artificial Intelligence 194 (2013): 62-85.
[37] Simone Paolo Ponzetto, and Michael Strube, 2007. “Deriving a Large Scale Taxonomy from Wikipedia,” in AAAI, vol. 7, pp. 1440-1445. 2007.
[38] Christiane Fellbaum and George Miller, eds., 1998. WordNet: An Electronic Lexical Database, MIT Press, 1998.
[39] Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, and Gerhard Weikum, 2013. “YAGO2: a Spatially and Temporally Enhanced Knowledge Base from Wikipedia,” Artificial Intelligence 194 (2013): 28-61; and Fabian M. Suchanek, Gjergji Kasneci, Gerhard Weikum, 2007. “YAGO: a Core of Semantic Knowledge,” in Proceedings of the 16th international Conference on World Wide Web, pp. 697-706. ACM, 2007.
[40] See the T-Box and A-Box discussion with particular reference to artificial intelligence on M.K. Bergman, 2014. “Big Structure and Data Interoperability,” on AI3:::Adaptive Information blog, August 18, 2014.
[41] M.K. Bergman, 2014. “The Value of Connecting Things – Part II: The Viking Algorithm,” on AI3:::Adaptive Information blog, September 14, 2014.
[42] Cade Metz, 2013. “Facebook Taps ‘Deep Learning’ Giant for New AI Lab,” on Wired.com, December 9, 2013.
[43] Santosh Divvala, Ali Farhadi and Carlos Guestrin, 2014. “Learning Everything about Anything: Webly-Supervised Visual Concept Learning,” in CVPR 2014.
Posted:September 4, 2014

Connections Attract Value

Benefits Can be Gained Incrementally, and Are Cumulative

In the earlier installments of this article series we first described how to estimate the value of connections amongst Big Data datasets, premised on the network effect. We then went into detail in the second part about the Viking algorithm (VKG) derived to capture the Value of Knowledge Graphs.

In this concluding part of the series we summarize the use and implications of the Viking algorithm for your Big Data planning. We do so by offering ten guidelines for how Big Structure may be leveraged in the context of a knowledge graph. We then conclude the series with some caveats on the interpretation of these results and a discussion of possible future directions.

Ten Guidelines for Big Structure

As you think through your Big Data initiatives, we recommend you keep these ten guidelines relating to data structure in mind:

  1. More structure always provides benefits — adding structure always provides a multiplier effect in value
  2. Making connections is more valuable than adding more data — Big Data alone is practically worthless if not connected. Adding more records has an additive effect on the value of the datasets in comparison to the multiplier effects of structure
  3. Benefits of structure increase with increasing dataset sizes (scale) — the multiplier effect of more structure (connections) increases with scale. Big Data projects are thus perfect candidates for consciously “connecting the dots”
  4. Particular kinds of structure — such as types or categorization — have higher benefit than annotations — structural characteristics at the record level that enable cross-dataset selections and comparisons are inherently more valuable than record-specific annotations. Typing of records into entity types is a very powerful lever
  5. The potential value of a knowledge graph depends on the nature of the domain — knowing what kind of knowledge graph is in play is an important metric for being able to estimate potential value from connections. Further, by adding connections (correct and coherent) it may also be possible to move the entire structure to a lower average degree of separation (D), with further multiplier benefits
  6. Structure can be added incrementally, and is cumulative — because these additions to structure are based on the open world assumption (OWA), it is possible to add structure incrementally. OWA enables connection and structuring efforts to be accomplished as budgets allow. But, because these structural benefits are cumulative (and with multipliers), later contributions can have increasing benefits over earlier ones
  7. Data wrangling is justified as a means to increase the accuracy of fact assertions — data wrangling should not be viewed as an overall “cost” to the effort, but as a key means for achieving the multiplier benefits arising from structure and connections
  8. Adding structure at the time of data wrangling is a cost-effective approach — the corollary to standard data wrangling is the wisdom of explicitly including structuring and connection to the efforts. The multiplier benefits that accrue are a means to markedly lower the marginal costs of data wrangling in relation to realized benefits
  9. Ontologies provide inferred capabilities — though many kinds of structure can contribute to the Big Structure effort, ontologies, because of their logical structure, can be used to derive inferred “facts” in essence “for free.” (And remember, more “facts” are the basis for the multiplier benefits.) Inference provides a powerful means to leverage existing connections without explicitly needing to assert new ones
  10. Ontologies are the most preferred means for Big Structure — besides inference, ontologies set a structural framework of relationships (schema) very useful to helping to guide the nature of connections made. Also, ontologies can provide the conceptual and descriptive richness useful for tagging and other structure-adding activities [1]. Because of these advantages and their testable nature based in logic, ontologies represent the pinnacle of structural forms to achieve these value benefits.
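Guideline 9 can be illustrated with a minimal sketch of subsumption (is-a) inference. The tiny class hierarchy and the single typed instance below are invented for illustration, and the code only follows single-parent links; real ontology reasoners handle far more than this.

```python
# Asserted subclass (is-a) links between types: child -> parent.
# This toy hierarchy is hypothetical.
SUBCLASS_OF = {
    "Plumber": "Tradesperson",
    "Tradesperson": "Person",
    "Person": "Agent",
}

# One asserted "fact": the instance john_smith is typed as a Plumber.
ASSERTED_TYPES = {"john_smith": "Plumber"}


def ancestors(cls, subclass_of=SUBCLASS_OF):
    """All superclasses reachable by following is-a links upward."""
    found = []
    seen = {cls}
    while cls in subclass_of:
        cls = subclass_of[cls]
        if cls in seen:  # guard against an accidental cycle
            break
        seen.add(cls)
        found.append(cls)
    return found


def inferred_facts(asserted=ASSERTED_TYPES):
    """Type assertions that follow from the hierarchy but were never stated."""
    return [(inst, "type", sup)
            for inst, cls in asserted.items()
            for sup in ancestors(cls)]


for fact in inferred_facts():
    print(fact)
```

Here one asserted typing yields three additional inferred typings, which is the sense in which inference supplies more “facts” without new explicit assertions.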

Of course, making connections merely for structure’s sake is silly. It is important that the structure added and connections made are correct, consistent and coherent. Even then, not all types of connections are created equal, with typing the most important, annotations the least.

The good thing is that Big Structure can be added as a slight increase over standard data wrangling efforts, and with much greater impact than standard wrangling. Further, the structures themselves, preferably guided by domain ontologies, are a means of testing these factors for subsequent structure additions. Not only does adding structure get easier with a foundation of existing structure, but it increases the value of the information by orders of magnitude.

Not the Last Word

Roughly twenty years ago Metcalfe’s law triggered a gold rush in trying to achieve network effects at Internet scale. Though the algorithm proved too optimistic at larger scales, the idea of the benefit from connections was firmly established. Ten years ago it was clear that some form of diminishing returns needed to be applied to connections at scale. Zipf’s law was not a bad guess, though we have subsequently learned that more graph-centric measures are more appropriate and accurate for estimating value. Now, the Viking algorithm has emerged as the best estimator of the value of connections within Big Data.

I suspect we will see further improvements to the Viking algorithm as: 1) we come to better understand graph structures (including the effects of clusters and cliques); and 2) we learn to distinguish the different value of different types of connections [2]. We can already see that typing and categorization have better structural effects than annotations. We further can see that the correctness of asserted “facts” is a key to realizing the multiplier benefits of connections and structures. Thus, we should see improved means for screening and testing assertions for their accuracy at scale.

At this stage, what the Viking algorithm gives us is a defensible means for assessing the value of adding structure (through connections) to our datasets. We see these multiplier effects to be huge, and to compound into still further benefits with scale. We also see that the most developed forms of structure — namely, ontologies — bring still further benefits in inference and testable coherence. All Big Structure efforts should be aiming to express all of the structural insights for the organization and its datasets in these ontological forms [1].

While our current proxy for value — namely, asserted “facts” — is useful, it would also be helpful to be able to translate these “fact” assertions into a monetary value. As we move down this path we will discover, again, that not all “facts” are created equal, and some have more monetary value than others. Transitioning our estimates of value to a monetary basis will help set parameters for the cost-benefit analysis of data collection and structurizing that is the ultimate basis for planning Big Data and Big Structure initiatives.

In the end, many things need to be analyzed to understand the impacts of each connection and structure metric on the value of the resulting graph. But what today’s understanding of the network effect and the Viking algorithm bring us is a better means to understand and quantify the benefits of connected information. By any measure, these value benefits are multiples of what we see for unconnected data, and those multiples grow massively with the scale of the data and their connections.

Big Structure is fertile ground for bringing in the sheaves. Let the harvesting begin.


[1] Though not further discussed here, the ontologies also provide the means for tagging (providing structure) to unstructured documents, which also brings the multiplier benefits from structure. On the retrieval side, such structure also aids faceting and filtered “slicing and dicing” of underlying datasets, thereby improving retrieval efficacy.
[2] As one of the first approaches to capture these nuances, see Mischa Dohler, Thomas Watteyne, Fabrice Valois and Jia-Liang Lu, 2008. Kumar’s, Zipf’s and Other Laws: How to Structure a Large-Scale Wireless Network?, published in Annales des Telecommunications – Annals of Telecommunications 63, 5-6 pp. 239-251. See http://hal.archives-ouvertes.fr/docs/00/40/58/67/PDF/Large_Scale_Networks_journal_FINAL.pdf.
Posted:September 3, 2014

Part II of The Value of Connecting Things: Big Structure Improves Big Data by Orders of Magnitude

Yesterday in the first part of this series we raised the important question of how to value connections made between data. At the Big Data scales represented, we prepared a Basic Facts case of up to 2 billion assertions. We are using asserted “facts” as our value proxy. We’ll talk more about value and caveats in the third part of our series, tomorrow.

We saw that early estimates of network effects, such as Metcalfe’s law, overestimate value at scale. We looked at Zipf’s law as a means to capture the diminishing value of connections given the distance between facts. In today’s article we will focus on these factors of interaction and potential value in the specific context of knowledge graphs. Knowledge graphs are Big Structure representations that capture the schematics, concepts and measures in any given knowledge domain (that is, any domain of human activity).

Since I first tried to address the value of knowledge networks some five years ago [1], I have been disturbed about a couple of things [2]. First, I felt that the exponential or geometric bases for estimating the value of information connections were not correct, both because they fail at scale and because they do not recognize that some connections work and are more important, while others are trivial or do not work. Capturing this law of diminishing value in a context that makes sense for knowledge bases was, I felt, the key to answering the value riddle.

I believe we have now, in this series, provided a compelling basis for solving that riddle, which also points the way to further improvements. This assertion is an exciting statement, in that we now may have a quantitative basis in hand for determining where and how to spend our monies for Big Data and Big Structure initiatives. Such quantitative tools are a huge boon to bring analytic rigor to the data collection and integration challenge.

Adding connections (“Big Structure“) to Big Data can increase the value of enterprise information from one to three orders of magnitude; the value also scales linearly with added structure (attributes).

This article shows that adding connections (“Big Structure”) at Big Data scales can increase the value of enterprise information by one to three orders of magnitude (that is, from tenfold to a thousandfold). The magnitude of the value scales linearly with each added structure (attribute). These value multipliers from adding Big Structure are a tremendously cost-effective addition to standard data wrangling efforts.

The Value of Knowledge Graphs (VKG) Formulation

The recognition of the need for a law of diminishing returns to reflect the distance between facts or assertions is a central argument in the Briscoe-Odlyzko-Tilly formulation (see [3] directly, and the prior Part I discussion). Not all information is connected, and not all connected information is of equivalent worth. The implied question in these statements, however, is how to capture those differences.

The B-O-T (or sometimes, O-T) formulation does not choose a bad starting proxy for this diminishment law. Zipf’s law reflects many observed distributions in human objects, roughly equivalent to power-law, Pareto (“80:20”) distributions or n log (n) diminishing returns with long-tail characteristics. Examples include Internet distributions (such as popularity of Web sites or search terms), human language distributions, income rankings, population distributions, etc. There is no question that Zipf’s law distributions are common and frequent.

The only problem with picking the Zipf’s law basis, however, is that there was absolutely no evidence that such distributions occur in information networks or knowledge graphs. Zipf’s law distributions tend to be statements across single types for a single attribute distribution. Graphs, we can safely say, are anything but this distribution. Connections and multiple types are the rule, not the exception.

So, maybe the B-O-T formulation was correct, and maybe it was not. There was no empirical evidence to support this assertion for knowledge graphs. And, there did not appear to be a compelling logic argument for relating Zipf’s law to graphs other than they are artifacts of human endeavor.

My discomfort in adopting this arbitrary B-O-T basis, even though solidly embedded in human experience, caused me to seek alternative ideas and explanations, but also ones that fulfilled the key structural insights of diminishing returns and non-equivalent assertions that were the focal points of B-O-T, all within a graph context.

The Starting Basis

The breakthrough occurred when I discovered an obscure, un-cited paper by Yaakov Stein [4]. Stein, a network and signals processing researcher of the first rank [4], wrote his paper as a means to understand and quantify his experience of joining LinkedIn and expanding his network. He began without an account and documented his experience as he joined and expanded his network of contacts on LinkedIn. He charted direct links, and then meticulously looked at and recorded secondary and tertiary links.

His formulation recognized that the value to an individual user equals that user’s access to the entire network (a power factor of 1) plus the diminishing benefit contributed by the graph’s other participants, as measured by the average degree of separation (d). d is an inherent measure of the graph type.

Though his context was a social network, the basic observation obtains: relations diminish by distance within a graph, with average link distance (directly related to degree of separation) being the key relevance metric. Connected “facts” or “friends” are essentially the same thing. It is all about what is shared amongst graph nodes.

The usefulness of this approach is that it grounds the multiplier effect in an inherent characteristic of the source graph, its average degree of separation [5]. Like Zipf’s law, the degree of separation is a distance measure, but one grounded specifically in graphs. Here is the Stein formulation:

[Equation: Stein formulation]

A graph with a degree of separation of 4, then, would exhibit a network-wide power factor of 5/4 (4/4 plus 1/4).
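The power-factor arithmetic can be sketched in a few lines of Python. This assumes, from the sentence above, that network value scales as N raised to (1 + 1/d); the function names are mine, and the exact form of the original equation image is not reproduced here.

```python
def stein_exponent(d: float) -> float:
    """Power factor 1 + 1/d for an average degree of separation d (assumed form)."""
    if d <= 0:
        raise ValueError("degree of separation must be positive")
    return 1.0 + 1.0 / d


def stein_value(n: float, d: float) -> float:
    """Network value for n nodes (or facts), assuming value scales as n ** (1 + 1/d)."""
    return n ** stein_exponent(d)


def multiplier_over_unconnected(n: float, d: float) -> float:
    """Ratio of connected value to the unconnected baseline n, i.e. n ** (1/d)."""
    return stein_value(n, d) / n


print(stein_exponent(4))                     # 1.25, i.e. 5/4
print(multiplier_over_unconnected(10_000, 4))
```

Under this reading, a lower degree of separation gives a larger exponent, so tighter graphs gain more from the same number of nodes.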

The Viking Algorithm

As applied to knowledge graphs, however, this formulation still has two problems. The first, minor one is that the degree of separation parameter should be D (the average across structure) rather than d. The second, substantive one is that a correction factor needs to be included that accounts for the probability that an assertion may be false. This factor, F, is 1 – the measured error rate.

The resulting algorithm we term the Value of Knowledge Graph formulation, or the VKG (Viking) algorithm. It is expressed like this:

[Equation: Value of Knowledge Graphs (VKG) formulation]

F is meant to be analogous to F-measure, the combined precision and recall statistic for information retrieval and NLP tasks. F in the case of the Viking algorithm is also meant to be a combined statistic that represents the “accuracy” (verifiable truthfulness) of statements asserted in the graph. F is essentially one minus the estimated residual falsity of the average statement in a graph, after removal of all assertions that do not meet existing coherency, consistency or completeness tests. F is determined by sampling statements across the graph and manually testing for truthfulness (or in a logical sense, validity given the existing statements in the graph). An F of 1 signifies complete truthfulness (accuracy); an F of 0 represents complete falsity [6].
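The sampling step for F can be sketched as follows. The `is_true` callback stands in for the manual review described above, and the function name, sample size and seed are illustrative assumptions; how F then enters the VKG formulation is not shown here.

```python
import random


def estimate_f(statements, is_true, sample_size, seed=0):
    """Estimate F as 1 minus the error rate measured on a random sample.

    `is_true` is a stand-in for manual checking of a single statement.
    """
    pool = list(statements)
    if not pool:
        raise ValueError("no statements to sample")
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    sample = rng.sample(pool, min(sample_size, len(pool)))
    errors = sum(1 for statement in sample if not is_true(statement))
    return 1.0 - errors / len(sample)


# Hypothetical graph of ten statements, two of which a reviewer marks false.
statements = [f"statement-{i}" for i in range(10)]
known_false = {"statement-3", "statement-7"}
print(estimate_f(statements, lambda s: s not in known_false, sample_size=10))
```

With the whole toy graph sampled, two errors in ten statements give an F of 0.8; on a real graph the sample would be a small fraction of the statements and F only an estimate.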

Viking in Relation to Other Network Estimators

Now, with this explanation of basis, we can again look at the value of the Viking (VKG) algorithm in comparison to those discussed in the first part of this series. Again on a logarithmic scale, here are those results:

Figure 1. Knowledge Network Estimates

Excluding the exponential and geometric multipliers (namely, the “laws” of Metcalfe and Reed) in the top two curves, this shows the Viking (VKG) algorithm to have higher value than the B-O-T (O-T) algorithm, both of which are considerably higher than the Basic Facts. However, because Figure 1 above has a logarithmic scale, these differences are harder to discern.
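For readers without the figure, the rough ordering of these estimators can be reproduced on a log10 scale. The sketch below uses the commonly cited forms (n squared for Metcalfe, 2 to the n for Reed, n log n for B-O-T) and, as a stand-in for Viking, the degree-of-separation exponent without the F correction. The choice of D = 4 and of the natural log for B-O-T are illustrative assumptions, not the inputs behind Figure 1.

```python
import math


def log10_basic(n):
    """Unconnected baseline: value equals n."""
    return math.log10(n)


def log10_metcalfe(n):
    """Metcalfe: value proportional to n ** 2."""
    return 2 * math.log10(n)


def log10_reed(n):
    """Reed: value proportional to 2 ** n (computed in log space to avoid overflow)."""
    return n * math.log10(2)


def log10_bot(n):
    """Briscoe-Odlyzko-Tilly: value proportional to n * log(n)."""
    return math.log10(n * math.log(n))


def log10_separation(n, d):
    """Degree-of-separation form n ** (1 + 1/d), with no F correction applied."""
    return (1 + 1 / d) * math.log10(n)


n, d = 1_000_000, 4  # illustrative scale and degree of separation
for label, value in [
    ("Basic Facts", log10_basic(n)),
    ("B-O-T", log10_bot(n)),
    ("Separation (D=4)", log10_separation(n, d)),
    ("Metcalfe", log10_metcalfe(n)),
    ("Reed", log10_reed(n)),
]:
    print(f"{label:18s} log10(value) = {value:,.2f}")
```

At this scale the ordering matches the description above, but with other choices of D or smaller n the B-O-T and separation curves can swap places.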

Viking Benefits Over the Basic Facts

With values now corrected by our assumed F factor, we can begin to tease out the value benefits of connecting “facts” versus the unconnected Basic Facts. As with any power function, the value benefits from connections keep increasing at larger scales. For example, as Figure 2 below shows, at a level of 1,000 records the benefits from connections are 7x greater than for unconnected data. By the time the scale grows to 1 million or 500 million records, the value benefits of connections grow to 44x and 215x, respectively:

Figure 2. Percent Improvement from Connections Scales with Records Size

Benefits from connections increase as a power function at increasing scales.

Setting the VKG Factor D

But the potential value of connectedness is also a function of the general degree of information separation for the given domain. We are still in the early phases of gathering statistics for such things, but the table below summarizes what is known about the “standard” level of connectivity in various domains and applications. Note, in general, most any knowledge graph would have a D factor ranging from 2 to 8:

Category                            Degrees of Separation (D)   Notes
Food webs                           ~ 2                         [7]
Genetic differences                 ~ 3                         [8]
LinkedIn                            ~ 3                         [4]
Twitter                             3.435 – 4.67                [9]
Facebook                            3.74                        [10]
Potential research collaborators    ~ 4                         [11]
UMBEL                               ~ 5.2                       [12]
Social networks (general)           ~ 6
Mobile ad hoc networks              ~ 7                         [13]
Small-world networks (max)          ~ 8                         [14]
Table 1. Degrees of Separation for Various Knowledge Networks

More tightly linked, cohesive domains tend to have the lower degrees of separation. It is also interesting to note that some social networks, like Twitter and Facebook, are also able to lower degrees of separation (in comparison to their nominal “social network” benchmark) by virtue of the nature of their service.

As experience is gained and with more research, I expect more estimates and more refined ones. Depending on the nature of the domain at hand, it should then be possible to pick the closest analog to use in the Viking valuation algorithm. Nonetheless, we already have a range and respective values to provide meaningful value estimates today.

Using the values in Table 1, we are thus able to plot the effects (again, log scale) of these various degrees of separation in terms of the “fact” assertions that can be made for our Big Data test dataset:

Potential Network Value Varies by Domain

Figure 3. Nature of Knowledge Graph Affects Potential Network Value

At the nominal Big Data scales of 100,000 and 1,000,000 records, the value of data connections in comparison to the unconnected Basic Facts case shows the following value improvement multipliers:

Domain                              100,000 Records   1,000,000 Records
Food webs                           203x              611x
Genetic differences                 38x               84x
Twitter                             23x               46x
Facebook                            17x               33x
Potential research collaborators    14x               26x
UMBEL                               8x                12x
Social networks (general)           5x                8x
Mobile ad hoc networks              3x                5x
Table 2. Multiplier (X) Improvements by Domain from Connections Over the Basic Facts

Of course, our “Big Data Example” from Part I was silent about the exact nature of its knowledge graph. Based on empirical experience to date, the benefits from connecting data that was previously unconnected should fall somewhere within the limits of Table 2. Even at rather low scales and in more loosely connected domains, the value improvements from making connections with data are many-fold. At larger scales for tighter networks, the multipliers can become astounding.
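
To make the scale effect in Table 2 a little more concrete, here is a small sketch that computes, for each domain, how much the multiplier grows when the dataset goes from 100,000 to 1,000,000 records. The dictionary simply transcribes Table 2; the names `TABLE_2` and `scale_gain` are hypothetical and exist only for this illustration.

```python
# Multipliers transcribed from Table 2: (at 100,000 records, at 1,000,000 records).
TABLE_2 = {
    "Food webs": (203, 611),
    "Genetic differences": (38, 84),
    "Twitter": (23, 46),
    "Facebook": (17, 33),
    "Potential research collaborators": (14, 26),
    "UMBEL": (8, 12),
    "Social networks (general)": (5, 8),
    "Mobile ad hoc networks": (3, 5),
}


def scale_gain(domain):
    """Ratio of the 1,000,000-record multiplier to the 100,000-record one."""
    at_100k, at_1m = TABLE_2[domain]
    return at_1m / at_100k


# List domains from the largest scale gain to the smallest.
for domain in sorted(TABLE_2, key=scale_gain, reverse=True):
    print(f"{domain}: {scale_gain(domain):.2f}x growth for 10x the records")
```

The tightest domain (food webs) roughly triples its multiplier for a tenfold increase in records, while the mid-table domains about double theirs. At the loose end of the table the ratios are noisy because the published multipliers are rounded to whole numbers.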

Adding Structure to the Underlying Data

Another implication that the Viking algorithm allows us to test is the comparative benefit from adding structure to our datasets. Actually, “adding structure” is not strictly correct; it is “structurizing” the data via characterizations, attributes and categorizations. Of course, not all structure is created equal. Assigning or classifying our records into types, for example, applies to all records across the datasets and provides powerful cross-record linkages. Adding annotations or metadata to single records provides much lower benefits.

When we add structure across datasets, the value improves roughly linearly with the number of structures added, as this figure shows:

Figure 4. Adding Structure has a Linear Effect on Value

For our Big Data example, each across-dataset structure characterization adds about 25% to 30% value per structure. Adding four structural characterizations, for example, more than doubles the “facts” assertion value (~ 140%) to the datasets.
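
The arithmetic behind these percentages can be checked with a short sketch. The text does not spell out how the per-structure gains combine, so the code below simply compares two hypothetical readings, a strictly additive one and a compounding one, for the quoted 25% to 30% range. Both function names are assumptions made for this example.

```python
def additive_gain(rate, structures):
    """Total fractional gain if each structure adds `rate` of the base value."""
    return rate * structures


def compounding_gain(rate, structures):
    """Total fractional gain if each structure adds `rate` of the running value."""
    return (1 + rate) ** structures - 1


# Compare both readings for four structural characterizations.
for rate in (0.25, 0.30):
    add = additive_gain(rate, 4)
    comp = compounding_gain(rate, 4)
    print(f"rate {rate:.0%}: additive {add:.0%}, compounding {comp:.0%}")
```

Under the additive reading, four structures yield a gain of 100% to 120%; under the compounding reading at 25% per structure, the gain is about 144%, which is close to the ~ 140% quoted above. Over a small number of structures the two readings stay fairly close, so either would plot as a near-straight line.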

Preview of Last Part

The last part, forthcoming tomorrow, will summarize the implications of the Viking algorithm for the role and importance of Big Structure to your organization’s Big Data efforts. Some caveats and future directions will conclude the series.


[1] The two articles written at that time were, M.K. Bergman, 2009. Structure the World, in AI3:::Adaptive Information blog, August 3, 2009, and M.K. Bergman, 2009, The Law of Linked Data, AI3:::Adaptive Information blog, October 11, 2009.
[2] See also the then-current state of analysis by Eric Hellman, 2009. Normal and Inverse Network Effects for Linked Data, published in his blog, October 15, 2009; see http://go-to-hellman.blogspot.com/2009/10/normal-and-inverse-network-effects-for.html.
[3] Bob Briscoe, Andrew Odlyzko, and Benjamin Tilly, 2006. “Metcalfe’s Law is Wrong,” in IEEE Spectrum, July 2006. A copy may be viewed at http://www.cse.unr.edu/~yuksem/teaching/nae/reading/2006-briscoe-metcalfes.pdf. Odlyzko and Tilly had published an earlier version (sometimes the approach is shown as O-T in addition to B-O-T), and the basic form of the algorithm appears in a single Odlyzko paper.
[4] Yaakov (J) Stein, 2009. The Value of Being Linked In, on his personal Web site, April 2009; see http://www.dspcsp.com/pubs/linkedin.pdf. Note that his empirical tests suggested a degree of separation for LinkedIn of 3.
[5] The average degree of separation is simply the graph’s average path distance – 1. For an explanation of average path distance, see [12].
[6] F is a summed average value across all assertions within a knowledge graph. In information retrieval, F-measures are now being achieved that exceed 0.90 (90%). For the cases used herein, F is estimated at 0.85. Again, this parameter is measured after all standard coherency, consistency, and completeness tests are applied to the ontology. These tests routinely remove many false assertions and establish the basic integrity of the graph. This acceptance threshold is itself constantly improving as experience is gained with basic graph integrity tests. In other words, tomorrow’s thresholds will be higher than today’s.
[7] Richard J. Williams, Eric L. Berlow, Jennifer A. Dunne, Albert-László Barabási, and Neo D. Martinez, 2002. Two Degrees of Separation in Complex Food Webs, in Proceedings of the National Academy of Sciences, 99 (20):12913-12916, September 16, 2002, doi:10.1073/pnas.192448799; see http://www.pnas.org/content/99/20/12913.full.
[9] Reza Bakhshandeh, Mehdi Samadi, Zohreh Azimifar and Jonathan Schaeffer, 2011. Degrees of Separation in Social Networks, in Proceedings, The Fourth International Symposium on Combinatorial Search (SoCS-2011), 6 pp.; see http://www.aaai.org/ocs/index.php/SOCS/SOCS11/paper/viewFile/4031/4352; and Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon, 2010. What is Twitter, a Social Network or a News Media?, in Proceedings of the 19th International Conference on World Wide Web, April 26–30, 2010, Raleigh, North Carolina, pp. 591-600, ACM; see http://snap.stanford.edu/class/cs224w-readings/kwak10twitter.pdf.
[10] Lars Backstrom et al., 2012. Four Degrees of Separation, arXiv.org, January 6, 2012; see http://arxiv.org/pdf/1111.4570.pdf.
[11] Paweena Chaiwanarom, Ryutaro Ichise, and Chidchano Lursinsap, 2010. Finding Potential Research Collaborators in Four Degrees of Separation, pp. 399-410, in Longbing Cao, Jiang Zhong, and Yong Feng, eds., Advanced Data Mining and Applications, Springer Berlin Heidelberg, http://dx.doi.org/10.1007/978-3-642-17313-4_39.
[13] Maria Papadopouli and Henning Schulzrinne, 2000. Seven Degrees of Separation in Mobile ad hoc Networks, presented at Global Telecommunications Conference, 2000 (GLOBECOM’00), IEEE. Vol. 3; see http://www.huaxiaspace.net/academic/classes/wi02/cse294/20020222globecom2000.pdf.
[14] Paolo Pin, 2006. Eight Degrees of Separation, in Nota di Lavoro, Fondazione Eni Enrico Mattei, No. 78.2006; see http://www.econstor.eu/bitstream/10419/74249/1/NDL2006-078.pdf.