EXTRACTION OF CONTEXTUAL KNOWLEDGE AND AMBIGUITY HANDLING FOR ONTOLOGY IN VIRTUAL ENVIRONMENT

A Dissertation by HYUN SOO LEE
“autonomous/intelligent execution”. Suppose that a process consists of several subprocesses. An initially designed model and its input data (e.g., system parameters) may change after each subprocess is executed. Whenever the parameters or data change at one subprocess, extra effort, such as modifying the data or parameters, may be needed for the next subprocess. Such effort can become a significant burden, limiting rapid execution and autonomous processing. If, however, a knowledge extraction process exists between two consecutive subprocesses, knowledge can be extracted from the previous subprocess and used to reduce the effort required for subsequent ones. Furthermore, knowledge extraction is a basic process for constructing intelligence: a system with knowledge extraction functions can greatly increase its degree of knowledge and handle dynamic changes inside and outside its environment.
This dissertation introduces a more efficient methodology for constructing large-scale virtual environments with virtual ontology from 2D images. A hierarchy augmented with contexts is referred to as a virtual ontology (the detailed definition appears in Section 3).
The overall procedure is termed “Large Scale Virtual Environment (LSVE) construction through extracting virtual ontology”. An LSVE with virtual ontology can be used for virtual interaction analysis, automatically or semi-automatically. A 3D traffic simulation is one example of a virtual interaction analysis, since 2D photos gathered from multiple cameras or active vision systems along a road cannot describe a car accident in full detail, owing to occluded or uncaptured features.
The dissertation is organized as follows: Section 1.1 provides an overview of the methodology and introduces the problem statement. Section 2 describes the background and related research studies. Detailed definitions of virtual model, context, ontology and their relationships with the effective data structure for LSVE are given in Section 3.
Section 4 describes the construction of LSVE using one 2D image and Section 5 describes the construction process using multi-view scenes. Sections 6 and 7 describe a context mapping process for LSVE and related issues.
This dissertation focuses on the generation of a virtual environment with virtual ontology, using the Metaearth architecture as a data structure for describing relationships among virtual components and capturing common patterns in them. Figure 2 illustrates the methodology.
The Metaearth architecture can be generated in two ways. The first method uses one 2D image while the second uses multi-view 2D images. The first method is related to scene decomposition or scene understanding techniques in the existing literature.
Scene understanding can be summarized as follows: “a 2D scene without any structure or context is analyzed and merged into related groups”. Figure 2 shows a level-1 IDEF-0 diagram of the first method.
Figure 3. IDEF-0 diagram of “virtual ontology extraction from a scene”.
This process is called “virtual ontology extraction from a scene” in this dissertation. Figure 3 shows the IDEF-0 diagram for “virtual ontology extraction from a scene”. As input, a 2D scene such as a photo or a sketch is considered and converted into a relational model. In this dissertation, a virtual model or generated architecture is classified as an atomic (virtual) model, a relational (virtual) model, or a (virtual) model with structural knowledge, as follows:
Definition 1. Atomic (virtual) model
A (reconstructed virtual) model which has no relationships among its model components. For example, a 3D model reconstructed using general stereo vision techniques is an atomic model.
Definition 2. Relational (virtual) model
A (virtual) model with relationships among its model components, which can be represented as a graph, structure, hierarchy, or architecture. For example, a 3D model with its scene graph is a relational model.
Definition 3. (Virtual) model with structural knowledge
A model with a hierarchy among its model components and contextual knowledge in that hierarchy, which can be used directly in virtual interaction analysis and can also interact with other models with structural knowledge.
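The three definitions above can be viewed as successive enrichments of the same data structure. The following minimal Python sketch illustrates this; the class and field names are purely illustrative and are not part of the dissertation's implementation:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class AtomicModel:
    # Definition 1: model components only, no relationships among them.
    components: List[str] = field(default_factory=list)

@dataclass
class RelationalModel(AtomicModel):
    # Definition 2: adds relationships, representable as a graph or hierarchy.
    edges: List[Tuple[str, str]] = field(default_factory=list)

@dataclass
class StructuralModel(RelationalModel):
    # Definition 3: adds contextual knowledge attached to the hierarchy,
    # e.g. a mapping from each component to a context label.
    contexts: Dict[str, str] = field(default_factory=dict)

m = StructuralModel(
    components=["region1", "region2"],
    edges=[("region1", "region2")],
    contexts={"region1": "road", "region2": "car"},
)
```

Each level strictly contains the information of the previous one, which is why an atomic model cannot be used directly in a virtual interaction analysis while a model with structural knowledge can.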
Figure 3 (and Section 4) illustrates the generation of a relational model. Since its only input is a 2D image, Z depth cannot be extracted in general; the resulting model is therefore a “relational model” with a hierarchy. Like other scene understanding techniques, the suggested technique uses “over-segmentation”, but it differs in that it converts the 2D image into a graph structure and then modifies that graph. The detailed procedures and explanations are discussed in Section 4.
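The “image to graph” step can be illustrated with a simplified stand-in for the dissertation's procedure: given an over-segmented label grid, connect the labels of 4-neighbouring pixels to obtain a region adjacency graph (the helper name is hypothetical):

```python
def region_adjacency_graph(labels):
    """Build {region: set(neighbouring regions)} from a 2D label grid."""
    h, w = len(labels), len(labels[0])
    graph = {}
    for y in range(h):
        for x in range(w):
            a = labels[y][x]
            graph.setdefault(a, set())
            # Look right and down only; symmetry covers left and up.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < h and nx < w:
                    b = labels[ny][nx]
                    if a != b:
                        graph.setdefault(b, set())
                        graph[a].add(b)
                        graph[b].add(a)
    return graph

# Two over-segmented regions (0 and 1) side by side, region 2 below both.
grid = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 2, 2],
]
rag = region_adjacency_graph(grid)
```

The resulting graph, rather than the raw pixel grid, is what subsequent steps modify and enrich with contexts.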
Figure 4. IDEF-0 diagram for generating a virtual model with virtual ontology.
The second method generates Z-depth information using stereo vision or multi-view vision approaches, producing a hierarchy with X, Y, and Z coordinates. Figure 4 illustrates an input-control-output-mechanism (ICOM) model of this approach. The detailed processes and implementation are discussed in Section 5.
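In standard rectified stereo vision, the Z depth recovered in this step follows the triangulation relation Z = f·B/d for focal length f, baseline B, and disparity d. A minimal sketch, with parameter values chosen purely for illustration:

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Classic rectified-stereo triangulation: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# A point with 50 px disparity, seen by a 700 px focal-length rig
# with a 0.5 m baseline, lies 7 m away.
z = depth_from_disparity(focal_px=700.0, baseline_m=0.5, disparity_px=50.0)
```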
Using either method, we can now generate a relational model. The remaining procedure is to map contexts into the obtained hierarchy. This objective is achieved by comparing the generated Metaearth architecture with a context library. Since both the generated Metaearth architecture and the context library are represented as types of graphs, we classify the comparison as a graph-subgraph isomorphism matching problem. Figure 5 shows the ICOM model of the context mapping procedure. The differences between general subgraph isomorphism matching and this problem, together with an algorithm for mapping contexts using the Metaearth architecture, are explained in Section 6.
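To make the matching problem concrete, a brute-force flavour of graph-subgraph isomorphism can be sketched as naive backtracking over node assignments. This is only a baseline for small graphs, not the algorithm of Section 6:

```python
from itertools import permutations

def has_subgraph_iso(pattern_edges, pattern_nodes, host_edges, host_nodes):
    """Return True if the pattern graph embeds in the host graph, i.e.
    every pattern edge maps to a host edge under some injective node
    mapping. Exponential time; suitable for small graphs only."""
    host = {frozenset(e) for e in host_edges}
    for perm in permutations(host_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, perm))
        if all(frozenset((mapping[u], mapping[v])) in host
               for u, v in pattern_edges):
            return True
    return False

# A triangle pattern embeds in a 4-cycle with one chord (a-c)...
host_nodes = ["a", "b", "c", "d"]
host_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a"), ("a", "c")]
tri = has_subgraph_iso([(0, 1), (1, 2), (2, 0)], [0, 1, 2],
                       host_edges, host_nodes)
```

In the context mapping of Section 6, the pattern graphs come from the context library and the host graph is the generated Metaearth architecture, so a more structured matching strategy than exhaustive search is required.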
This dissertation incorporates knowledge and issues from areas such as computer vision, pattern classification, optimization, data modeling, and semantics/ontology. As a result, many theories and implementations are required. In general, state-of-the-art research studies refer to virtual ontology extraction (A2 in Figure 2) as scene understanding, image parsing, or image decomposition.
The virtual model generation procedure (A1 in Figure 2) is related to many 3D reconstruction techniques, and virtual ontology intensification (A3) is related to ontology/semantics studies. Table 1 shows several related research fields and the stages in which the various algorithms are used. The following sections classify the related research studies by 3D reconstruction (A1), scene understanding (A2), and context mapping in a graph structure (A3).
Existing 3D reconstruction algorithms have been classified according to these criteria:

System input and source type
Preprocessing criteria
Modeling methodologies
Post-processing tasks
Types of virtual model
Format of virtual model
Usage/application of virtual model
Format of ontology/knowledge

Figure 6 shows a classification for generating a virtual model using existing algorithms, which can be used both for identifying the characteristics of each algorithm and for comparing algorithms.
For example, Cornelis et al. reconstruct a 3D urban scene using stereo vision matching with images from a streaming video. The model has been used for identifying vehicles on the road. Their algorithm can be classified according to the blue procedures in Figure 7; many existing research studies using stereo vision have similar frameworks.
The main characteristic of stereo vision-based 3D reconstruction is that the generated model is usually an atomic model (see Section 1.1), meaning that it has no hierarchy. This limitation makes it difficult to use the generated virtual environment in a virtual interaction analysis.
Other algorithms can construct a relational model, or a model with structural knowledge having virtual ontology. In general, an active virtual model can be generated using expert systems or predefined rules. Grimsdale and Lambourn employ an expert system to identify types of roads using parameters such as average length, curvature, width, junction type, and so on. Even though the generated virtual road model is a model with structural knowledge, the use of parameter-based identification methods limits effectiveness and fails to identify more complex shapes and models. This approach is shown in blue in Figure 8.
Figure 8. Grimsdale and Lambourn’s reconstruction method using an expert system [11, 12].
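Parameter-based identification of the kind Grimsdale and Lambourn describe can be pictured as a rule table over measured parameters. The sketch below is a toy stand-in; the thresholds and road types are invented for illustration and are not taken from the cited work:

```python
def classify_road(avg_length_m, curvature, width_m):
    """Toy rule-based classifier in the spirit of expert-system road
    typing. All thresholds and categories here are illustrative only."""
    if width_m > 10 and curvature < 0.01:
        return "highway"
    if width_m > 6:
        return "arterial"
    if curvature > 0.1:
        return "winding local road"
    return "local road"

kind = classify_road(avg_length_m=1200, curvature=0.005, width_m=12)
```

The weakness noted above is visible even in this toy: any road whose parameters fall outside the hand-written thresholds is misclassified, which is why such methods fail on more complex shapes.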
Lechner et al. use an agent-based system to describe a city model. However, these algorithms are unsuitable for generating an LSVE. To overcome this limitation, a vision-based virtual model generation approach is described in Section 5.
Input data and recognized data can be considered as criteria, such as:
Geometry-based construction
Image-based construction
Hybrid construction

Geometry-based algorithms construct a 3D model from point clouds. The points are measured and usually generated using 3D scanners. These methods are reviewed in detail in Biggers et al. Their main characteristic is that the points already contain depth information (e.g., Z values). Using this depth information, a mesh model can be generated and converted to a 3D model format such as B-rep. However, these methods are also not suitable for reconstructing LSVEs such as buildings or cities.
Another approach uses 2D images as input. The images can be stereo vision/multi-view images, sketches, or other general images. At the source level, these images can be classified into ground-based, airborne, or hybrid imagery. The level of the input images strongly influences the shape and characteristics of the reconstructed 3D volume. For example, ground-based imagery methods create a 3D virtual model without roof information. Some research studies have used hybrid approaches considering both geometry information and image information, i.e., Light Detection and Ranging (LiDAR) data [15-17]. With similar data, digital elevation models (DEM), digital terrain models (DTM), and digital surface models (DSM) have been used [18, 19]. These data contain Z-depth information, making them useful for acquiring more accurate virtual models and generating large-scale virtual models. However, these methods depend on the measuring devices.
Regarding the issue of data format, Prusinkiewicz et al. have introduced the L-system and the Chomsky grammar to describe plant shapes. While their data format can track changes in design, it is only useful for 2D models and cannot describe complex 3D volumes.
The suggested approach in Section 5 generates a virtual model from several 2D images, overcoming these limitations. The algorithm uses computer vision techniques and fuzzy logic for handling ambiguities in 3D reconstruction. The details are provided in the following sections.
One research trend is a new and important topic in image processing. It is known by several similar names: scene understanding [21, 22], generation of scene structure, scene decomposition, image parsing, image labeling [26, 27], and multiclass image segmentation [28-30]. The main objective is to obtain a scene structure from an input image, such as a photograph, or from multiple images, such as image frames or stereo images. The acquired structure is related to the knowledge acquisition process, as discussed in Section 1. The research is relatively new, and there are comparatively fewer examples than for other image processing techniques.
An early stage, however, was initiated by studies in image segmentation [31]. Using visual cues, extracted edge information, and shape information, an image is first over-segmented, and the over-segmented regions are then merged using predefined similarity measures. Since these image segmentation techniques fail to describe the meaning of each region and the relationship between two regions, additional research has examined representing relationships among segmented regions and generating a structure. Various pattern classification techniques are applied and merged with image segmentation techniques. For example, some predefined patterns are detected using detection algorithms, and the detected image regions are converted into a structure. Tu et al. detect faces and text in an image using a Bayesian approach and the AdaBoost method. However, their approach can parse only the predefined patterns.
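The over-segment-then-merge step described above can be sketched as greedy merging of adjacent regions whose mean intensities fall within a similarity threshold. This is a deliberately simplified stand-in; real systems use richer visual cues and learned similarity measures:

```python
def merge_similar(regions, adjacency, threshold):
    """regions: {region_id: mean_intensity}; adjacency: set of
    frozenset pairs of adjacent region ids. Greedily merge adjacent
    regions whose mean intensities differ by less than `threshold`,
    using a union-find over region ids."""
    parent = {r: r for r in regions}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for pair in adjacency:
        a, b = tuple(pair)
        if abs(regions[a] - regions[b]) < threshold:
            parent[find(a)] = find(b)

    # Group region ids by their union-find representative.
    groups = {}
    for r in regions:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

# Regions 1 and 2 are similar and adjacent; region 3 is much brighter.
regions = {1: 0.10, 2: 0.12, 3: 0.90}
adjacency = {frozenset((1, 2)), frozenset((2, 3))}
merged = merge_similar(regions, adjacency, threshold=0.05)
```

The shortcoming noted in the text is evident here as well: the output is just groups of regions, with no label or meaning attached to any group.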
Another avenue of research uses color, shape, or other information [41, 42] to first over-segment an image and then merge the segments using supervised learning methods. Gould et al. use a specific energy function for merging over-segmented regions and train it on predefined patterns such as sky, tree, road, grass, etc. In general, this approach depends heavily on the quality of the trained model and the input data, which significantly influences the performance of each algorithm. Another characteristic of these research trends is that the output of these algorithms is only a segmentation. Even though grouped segmentations may carry several meanings, they are not useful for capturing and reorganizing higher-level semantics. The limitations and disadvantages of current scene understanding approaches are summarized as:
Heavy dependence on predefined patterns and training algorithms: most approaches use supervised learning techniques
Most over-segmentation techniques rely on a visual cue such as color and cannot generate more accurate structures