Enhancing relationship extraction from the Web using symmetry and inversion
Publication Date: 2011-Jul-13
The IP.com Prior Art Database
Abstract
Several systems today parse textual information on the Web or in other repositories and build a fact base, a large set of relations that can be queried. For instance, by parsing the Web, facts like "President(Lincoln, United States)" would be added to the fact base. A query of the form "Who was a president of the United States?", expressed as "President(?, United States)", will retrieve "Lincoln" and other presidents from the fact base. To be accurate, the system must have confidence in the facts, which can be achieved by counting how often the facts are found on the Web. In this article, we show how a relation R can be determined to be a symmetric relation. This enables one to infer that the fact R(x,y) is the same as the fact R(y,x). Similarly, we show how to infer that the relations R and Z are inverse relations. This enables one to infer that the fact R(x,y) is the same as the fact Z(y,x). The information that a relation is symmetric, or that two relations are inverses of one another, can be used for several purposes. For instance, it can be used to answer queries better. It can also be used to boost the confidence of facts.
There is currently research into how to greatly improve the ability to search and extract information from Web data.
One idea, pioneered for instance by Textrunner (
is to parse each sentence of each document, and to build relations of the form Predicate(Noun1, Noun2).
An example might be:
One can then ask queries such as "What chemicals have FDA approval?"
The system would retrieve all tuples of the above form, such as approved(FDA, aspartame).
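As a rough illustration of such a query, the following minimal Python sketch stores tuples of the form Predicate(Noun1, Noun2) and matches them against a pattern in which None plays the role of the "?" wildcard. The names Fact and query, and the saccharin tuple, are hypothetical and chosen only for illustration; this is not the interface of Textrunner or of any other system.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    """One extracted relation of the form Predicate(Noun1, Noun2)."""
    predicate: str
    arg1: str
    arg2: str


def query(facts, predicate, arg1=None, arg2=None):
    """Return the facts matching the pattern; None acts as the '?' wildcard."""
    return [
        f for f in facts
        if f.predicate == predicate
        and (arg1 is None or f.arg1 == arg1)
        and (arg2 is None or f.arg2 == arg2)
    ]


# Illustrative fact base; the saccharin tuple is a made-up example.
facts = [
    Fact("approved", "FDA", "aspartame"),
    Fact("approved", "FDA", "saccharin"),
    Fact("President", "Lincoln", "United States"),
]

# "What chemicals have FDA approval?" becomes approved(FDA, ?)
approved_by_fda = [f.arg2 for f in query(facts, "approved", arg1="FDA")]

# "Who was a president of the United States?" becomes President(?, United States)
presidents = [f.arg1 for f in query(facts, "President", arg2="United States")]
```

A linear scan is enough for a sketch; a real fact base would presumably index tuples by predicate and argument.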
See, for instance, Michele Banko and Oren Etzioni, "The Tradeoffs Between Open and Traditional Relation Extraction," 2008, pages 28-36, which can be found at: http://
Other related work can be found in the references given in this paper.
Hence one important direction today is to extract not just keywords from web pages but relations, which can then be used to answer queries more intelligently and to mine the Web for information.
In order to scale, no predefined ontology or taxonomy is used. The confidence of an answer is based purely upon the number of times the system finds instances of a particular form. Hence, if in searching web pages it finds 100 sentences that cause it to build the relation approved(FDA, aspartame) and two sentences that cause it to build the tuple approved(FDA, dynamite), it would give a lot more credibility to the former than to the latter. I.e., approved(FDA, aspartame) has "support" from 100 sentences and approved(FDA, dynamite) from just 2.
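The counting described above can be sketched in a few lines of Python. The extraction stream below is simulated, and the relative_confidence function is only one assumed way of turning raw counts into a score; the text itself says only that confidence is based on the number of supporting instances.

```python
from collections import Counter

# Simulated extraction output: one tuple per sentence that produced it.
extractions = (
    [("approved", "FDA", "aspartame")] * 100
    + [("approved", "FDA", "dynamite")] * 2
)

# "Support" of a fact = number of sentences from which it was built.
support = Counter(extractions)


def relative_confidence(fact, support):
    """Assumed scoring: the fact's share of all support for its predicate."""
    total = sum(n for (pred, _, _), n in support.items() if pred == fact[0])
    return support[fact] / total if total else 0.0


aspartame_score = relative_confidence(("approved", "FDA", "aspartame"), support)
dynamite_score = relative_confidence(("approved", "FDA", "dynamite"), support)
```

Under this assumed scoring, the aspartame tuple receives 100/102 of the support for the predicate and the dynamite tuple 2/102, which matches the intuition that the former is far more credible.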
However, since there is no deep ontology or semantics behind this approach, and the relations are built simply by parsing sentences, the word order and terms used are of critical importance.
(1) Hence sentences of the form "John and Jack are brothers"
would produce a relation of the form:
But sentences of the form "Jack and John are brothers"
would produce a relation of the form:
Since the order of Jack and John in these two relations is reversed, they would be treated as separate instances of the relation. If we discover 5 instances of one and 100 of the other, the system would give less credibility to the one with 5 instances and more credibility to the one with 100 instances, instead of treating them the same and giving them both credibility commensurate with 105 instances.
Hence it would be beneficial to treat all the instances of a symmetric relation as the same.
This will allow a system to answer queries and process relations more accurately. It will also allow the system to give better confidences to results, as it will be able to aggregate the statistics of symmetric relations together, independent of the order of the arguments.
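One way to aggregate the statistics of a symmetric relation is to rewrite every instance into a canonical argument order before counting. The Python sketch below assumes that the set of symmetric relations is already known, and it uses the predicate name brothers as a hypothetical label for the relation built from "John and Jack are brothers"; which ordering has 5 instances and which has 100 is chosen arbitrarily.

```python
from collections import Counter

# Assumption: these relations have already been determined to be symmetric.
SYMMETRIC = {"brothers"}


def canonical(fact):
    """For symmetric relations, order the two arguments alphabetically."""
    pred, a, b = fact
    if pred in SYMMETRIC and b < a:
        return (pred, b, a)
    return fact


# Raw support as counted before symmetry is taken into account.
raw_support = Counter({
    ("brothers", "John", "Jack"): 5,
    ("brothers", "Jack", "John"): 100,
    ("approved", "FDA", "aspartame"): 100,
})

# Merge counts of tuples that differ only in argument order.
merged = Counter()
for fact, n in raw_support.items():
    merged[canonical(fact)] += n

brothers_support = merged[canonical(("brothers", "John", "Jack"))]
```

Here both orderings of the brothers tuple end up with the combined support of 105, while the non-symmetric approved relation is left untouched. Queries should be passed through the same canonical function so that either argument order finds the merged fact.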
would produce predicates of the form:
But sentences of the form "Isaac was the father of Jaco...