Abstract: Mining Semantic Information From Unst..., By Michael Ross
Abstract posted for the May 20-21 AGI Workshop:

Mining Semantic Information From Unstructured Data
By Michael Ross, Science Applications International Corporation

PPT: Mining Semantic Information From Unstructured Data

MICCE is a general-purpose algorithm, based on Clustering-By-Committee (CBC), for mining semantic data. The algorithm discovers semantic classes by grouping together items that have similar features or appear in similar contexts. Instances of ambiguous items can then be assigned to classes. MICCE is designed to work in any domain that contains structural and classification ambiguities. With domain-specific plugins, it can be applied to linguistic, visual, auditory, spatial, or mixed data. In addition, MICCE can be applied to the results of other algorithms and can make use of external ontologies. This flexibility allows it to use its own classifications or other new information in a feedback loop that can iteratively improve results.

As an example of the algorithm's behavior in a text domain, MICCE may discover two semantic classes for the noun 'plant.' These classes correspond to different meanings or senses of 'plant,' and are represented by sets of words that occur in similar contexts, i.e., {plant, flower, shrub, tree} and {plant, factory, mill, warehouse}. Each sense is associated with a representation of the contexts in which it occurs.

Once these senses are discovered, occurrences of the word 'plant' can be disambiguated. MICCE compares an occurrence's context with each candidate sense's context representation, and the occurrence is then assigned to the best-matching candidate. Thus, the documents (and individual words in the documents) may be tagged with semantic data. Furthermore, because MICCE can handle multiple types of context representation, the disambiguation data can be fed back into subsequent iterations of the algorithm.
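The disambiguation step described above (compare an occurrence's context with each candidate sense's context representation, then pick the best match) can be illustrated with a minimal sketch. This is not MICCE's actual implementation: the abstract does not specify how contexts are represented or scored, so the sketch assumes bag-of-words counts and cosine similarity. The function names `cosine` and `disambiguate`, the sense labels, and the context words and weights in `SENSE_CONTEXTS` are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty context matches nothing
    return dot / (norm_a * norm_b)


def disambiguate(context_words, sense_contexts):
    """Return the sense label whose context representation is most similar."""
    occurrence = Counter(context_words)
    return max(sense_contexts, key=lambda s: cosine(occurrence, sense_contexts[s]))


# Hypothetical context representations for the two senses of 'plant'
# (assumed data, not taken from the source).
SENSE_CONTEXTS = {
    "plant/flora": Counter({"soil": 3, "garden": 3, "water": 2, "leaf": 2}),
    "plant/facility": Counter({"worker": 3, "production": 3, "steel": 2, "shift": 2}),
}

sentence = "she moved the plant to richer soil in the garden".split()
context = [w for w in sentence if w != "plant"]  # words surrounding the target
best = disambiguate(context, SENSE_CONTEXTS)
print(best)  # plant/flora
```

In this toy setup the occurrence shares 'soil' and 'garden' with the first sense and nothing with the second, so it is assigned to the flora sense. A real system would need richer context features and some way to handle ties or occurrences that match no sense.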
For instance, if an occurrence of 'plant' is disambiguated to the sense {plant, flower, shrub, tree}, this provides a more detailed description of the context for nearby words such as 'soil' or 'garden.' Such words can then be disambiguated more accurately, and syntactic parse ambiguities involving these words may be resolved more accurately as well.

In non-text domains, such as visual/spatial data, the algorithm could similarly discover classes of objects based on their contexts and/or features, and assign instances to classes. Thus, although MICCE is currently being tested only on sequences of words, the algorithm can be understood more generally as a method for bootstrapping high-level representations from low-level data in any domain with structural and classification ambiguities.

At SAIC, MICCE is being developed to discover and disambiguate referents of names (e.g., determining whether "Michael Jordan" refers to the famous basketball player or to another individual with the same name). Interim results are presented along with a discussion of applications and future development directions.

Important Links:
Main Workshop Website: http://www.agiri.org/workshop
Directions/Hotel: http://www.agiri.org/directions.htm
Workshop Schedule: http://www.agiri.org/schedule.htm
Printable Version / Handout: http://www.agiri.org/workshop/AGIRI_Workshop_2006.pdf
AGIRI - Artificial General Intelligence Research Institute. Copyright © 2001-2007