CHAPTER 7
SCOPE OF THE PRESENT THESIS: The Role Of Depth Cues On Visual Object Recognition And Naming
The present thesis focuses on how we recognize visual objects in three dimensions, and on how recognition might be differentially affected by binocular and monocular viewing.
Our ability to perceive the depth of objects in three‑dimensions (3D) is helped by our having two eyes and not just one. This is because information about the depth of objects can be derived from the degree of difference between the images in the two eyes (their binocular disparity).
Our normal conscious perception of the visual world is based on information integrated from both our eyes. This integration process enhances our perception of depth by coding disparities between the two retinal images, and by translating these disparities into information about depths of surfaces relative to the viewer.
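The relation between disparity and depth described above can be made concrete with the standard small‑angle geometry. This is a textbook approximation offered as an illustration, not a result of the present thesis. The symbols are assumptions introduced here: $I$ is the separation between the two eyes, $D$ is the distance of the fixated point, and $\Delta D$ is the additional distance of a second point lying behind it on the midline.

```latex
% Vergence angles subtended at the two eyes by the fixated point and the farther point
\alpha = 2\arctan\!\left(\frac{I}{2D}\right), \qquad
\beta  = 2\arctan\!\left(\frac{I}{2(D+\Delta D)}\right)

% Binocular disparity is the difference between them; for I \ll D and \Delta D \ll D
\delta = \alpha - \beta \;\approx\; \frac{I}{D} - \frac{I}{D+\Delta D}
       \;=\; \frac{I\,\Delta D}{D\,(D+\Delta D)} \;\approx\; \frac{I\,\Delta D}{D^{2}}
```

Under this approximation, the disparity produced by a fixed depth interval falls off with the square of the viewing distance, which is one reason binocular information is most useful for near objects.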
The problems of visual perception have attracted the curiosity of scientists for centuries. One problem that has long intrigued philosophers and psychologists is how we see a three‑dimensional world given only a two‑dimensional visual image. William James (1890) pointed out that “we can perceive only what we have perceived”. This means that visual perception involves the interaction of two sources of information: on the one hand, the visual stimulus available to the visual sensory system and, on the other, the knowledge of the perceiver. Both sources of information are essential.
The idea that feature integration in vision might pose a special problem for the perceptual system dates back at least to the 1960s. Neisser (1967), following Minsky (1961), claimed that “to deal with the whole visual input at once, and make discriminations based on any combinations of features in the field, would require too large a brain to be plausible”. The knowledge of the perceiver takes advantage of visual features, or cues in the visual scene, to perceive a three‑dimensional world.
How these two sources of information interact is illustrated by one of Gregory’s patients. Gregory and Wallace (1963) report the experience of a man who, blind from the age of ten months, had his sight restored at the age of fifty‑two. When he was shown a simple lathe, he could not recognise it or see it clearly, although he knew what a lathe was and how it functioned. When he was allowed to touch it, he closed his eyes, ran his hands over the parts of the lathe, and said, “Now that I’ve felt it I can see it”. Gregory’s patient saw the world by touch, not by sight. However, after learning about the shape and function of a particular object by touch, he was able to use this information to see the object as it should have been seen. Without this type of information, the object could not be perceived. The patient could not perceive the visual input accurately in three dimensions, but when he used his knowledge about the object he recognized it. This shows that, in some sense, our ability to recognize our visual environment requires the integration of various types of information in order to apprehend the meaning of objects, our prior associations with them and their uses.
Various proposals have been put forward towards understanding the problem of how we recognise visual objects in three dimensions. Human adults can perceive the three dimensions of an object from single views or from the continuously transforming two‑dimensional projections of an object rotating in depth.
Models of object recognition have made two important assumptions about the time taken to recognize objects as a function of orientation. First, objects are represented as structured sets of parts (Biederman, 1987). Second, the visual features of an object are matched to stored features in the corresponding relative positions. Thus, the parts of an object are matched to structured representations of parts in long‑term memory (Marr and Nishihara, 1978; Pentland, 1986; Biederman, 1978). It is interesting to note that Biederman’s theory differs from Marr and Nishihara’s in that it does not posit a full 3‑D model for most objects. As different geons and relations come into view, different object models will be needed.
Theories of object recognition offer two different proposals regarding how people recognize an object from different views. One proposal involves viewpoint‑dependent representations. According to viewpoint‑dependent theories (Jolicoeur, 1990; Tarr & Pinker, 1989), orientation invariance in the recognition of an object depends on what is observed during familiarization. A viewer‑centred approach to object recognition proposes that objects are represented in long‑term memory at one or several specific orientations relative to the observer. Some viewpoint‑dependent theories postulate that the input object is mentally transformed to a familiar or standard orientation, where it can be matched to a stored representation. For example, Tarr (1989) used computer‑generated shapes rotated in depth about the x, y and z axes. The objects, composed of cubes, were formed into arms of varying length at right angles to each other. The objects were three‑dimensional versions of those used by Tarr and Pinker (1989) and were similar to some of those used by Shepard and Cooper (1984) in their mental rotation research. The results replicated those of Tarr and Pinker (1989) in suggesting that view‑specific representations of objects were stored. Because of the required transformation process, viewpoint‑dependent theories predict that recognition time will depend on the difference in orientation between the input image and the stored representation (or the nearest of several familiar orientations).
Viewpoint‑independent theories, on the other hand, predict that recognition time will be invariant across different orientations. This assumes that the assignment of a coordinate system to an input image takes the same time regardless of the object’s orientation. Results of experiments assessing these hypotheses have been mixed: some have supported the viewpoint‑independent approach (e.g., Cooper, Biederman & Hummel, 1992; Biederman, 1987; Biederman & Gerhardstein, 1993; Corballis, 1988), while others have supported the viewpoint‑dependent approach (e.g., Jolicoeur, 1985; Tarr & Pinker, 1989).
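The contrasting predictions of the two classes of theory can be written down as a minimal sketch. The linear form of the viewpoint‑dependent prediction, the function names, and the parameter values (`base_ms`, `ms_per_deg`, the stored orientations) are illustrative assumptions introduced here. They are not taken from any of the studies cited.

```python
def angular_difference(a_deg, b_deg):
    """Smallest absolute difference between two orientations, in degrees (0 to 180)."""
    diff = (a_deg - b_deg) % 360.0
    return min(diff, 360.0 - diff)


def rt_viewpoint_dependent(view_deg, stored_views_deg, base_ms=600.0, ms_per_deg=2.5):
    """Predicted RT grows with the rotation needed to reach the nearest stored view.

    base_ms and ms_per_deg are hypothetical values chosen only for illustration.
    """
    nearest = min(angular_difference(view_deg, s) for s in stored_views_deg)
    return base_ms + ms_per_deg * nearest


def rt_viewpoint_independent(view_deg, base_ms=600.0):
    """Predicted RT is constant: orientation does not enter the prediction."""
    return base_ms


if __name__ == "__main__":
    stored = [0.0, 120.0]  # hypothetical familiar orientations
    for view in (0, 30, 60, 90, 120, 180):
        print(view, rt_viewpoint_dependent(view, stored), rt_viewpoint_independent(view))
```

With several stored views, the viewpoint‑dependent prediction is not a single rising line but a function that dips at each familiar orientation, which is the pattern that distinguishes it from a flat viewpoint‑independent prediction.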
7.2 Mode of visual presentation: sequential vs simultaneous
A point relevant to the present study is the mode of stimulus presentation. Some recent evidence demonstrates that the rate of mental rotation can be influenced systematically by the mode of stimulus presentation, either simultaneous or sequential. In most previous mental rotation experiments using depictions of three‑dimensional shapes, the stimuli have been presented simultaneously; in most studies using two‑dimensional shapes, the stimuli have been presented sequentially, or they have involved a comparison with a long‑term memory representation of the shape (see Shepard & Cooper, 1982).
Steiger and Yuille (1983) have demonstrated that the apparent rate of mental rotation for three‑dimensional shapes is much slower when two shapes are presented simultaneously than when one shape is compared with a long‑term memory representation.
Steiger and Yuille proposed two possible explanations for this difference between sequential and simultaneous presentation. One possibility hinges on the fact that only one shape is present in the visual field when a shape is compared with a long‑term memory representation. Perhaps operating on only one shape does not require as much processing capacity as operating on two shapes. Another possibility is that the reduced slope in the sequential mode of presentation results from superior encoding of the shape, which reduces the time needed to compare the two representations (for example, because shapes encoded in long‑term memory may have been parsed into an optimal number of sub‑parts, thereby reducing the number of comparisons that must be made). Therefore, any theory of mental rotation will need to explain the difference in mental rotation results obtained with depictions of two‑dimensional and three‑dimensional patterns.
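The “rate of mental rotation” and the “slope” referred to above are conventionally obtained by fitting a straight line to reaction time as a function of angular disparity. The rate is the reciprocal of the slope. The following sketch shows that calculation. The reaction times are invented numbers used only to make the arithmetic visible. They are not Steiger and Yuille’s data, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rotation_rate_deg_per_s(slope_ms_per_deg):
    """Convert an RT slope (ms per degree) into a rotation rate (degrees per second)."""
    return 1000.0 / slope_ms_per_deg


if __name__ == "__main__":
    angles = [0, 60, 120, 180]                # angular disparity in degrees
    simultaneous = [1000, 2000, 3000, 4000]   # invented mean RTs in ms
    sequential = [800, 1100, 1400, 1700]      # invented mean RTs in ms
    for label, rts in (("simultaneous", simultaneous), ("sequential", sequential)):
        slope, intercept = fit_line(angles, rts)
        print(label, round(slope, 2), "ms/deg,", round(rotation_rate_deg_per_s(slope), 1), "deg/s")
```

A steeper slope therefore corresponds to a slower apparent rate of rotation, which is the sense in which simultaneous presentation is described as slower.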
7.2.1 The issue of recognizing familiar objects
According to Edelman and Bülthoff (1992), Edelman and Weinshall (1991), and Tarr (1989), familiarity is critical for the apparent ease demonstrated in everyday recognition of objects presented at novel orientations.
An important question bearing on the objects used in various studies concerned with the effect of orientation on recognizing familiar and unfamiliar objects is: are these objects representative of the objects we normally recognize in daily life? Generally, the objects used in these experiments had an unusual structure, lacking symmetry and other regularities. Rock and DiVita (1987) used unfamiliar objects, as was the case in the other experiments which showed that subjects have extreme difficulty in perceiving depth‑rotated images.
For all of Rock and DiVita’s (1987) objects, the relative depth of each point on the object could be accurately determined (Rock et al., 1989), so it was not the case that the difficulty was a consequence of an input that was initially indeterminate with respect to its three‑dimensional structure. Rock and DiVita’s (1987) demonstrations are important in that they show that even with accurately perceived depth, a viewpoint‑invariant representation may not be possible for an object. Rock and DiVita (1987) have argued that we encounter objects, such as clouds and rocks, that are similar in geometric properties to their wire objects. However, Farah and Rochlin (1990) and Gerhardstein and Biederman (1991) have criticised the results obtained by Rock and his associates on the grounds that their results may not be indicative of how we recognize objects with well‑defined and readily representable structure. Farah and Rochlin (1990) obtained results consistent with those of Rock when wire‑frame objects were used, but found that subjects used an object‑centred reference frame when presented with objects in which the regions between the wire contours were filled with clay. Thus the presence of surfaces facilitated orientation‑invariant recognition.
Gerhardstein and Biederman (1991) have pointed out that many of the studies that obtained significant viewpoint‑dependent results have used objects in which the stimuli are not distinguishable by geon type or by first‑order relations such as “top‑of” or “side‑connected”. As Biederman’s theory (1987) deals with basic‑level categorizations (Rosch et al., 1976), such as distinguishing between a cup and a telephone, in which the basic geon types differ, a more appropriate test of orientation invariance for the recognition of novel objects would employ a set of objects that were readily distinguishable by their geons. When this was done, Gerhardstein and Biederman (1991) found that effects of depth rotation for novel objects were significantly reduced.
Marr (1982) confronts both the question of how visual information is represented at successive stages, from 2‑D sensory information transmitted by the retina to a 3‑D object that can be identified from multiple viewpoints, and the more difficult question of the mechanisms whereby transcoding from one stage to another is achieved. Marr’s key distinction concerns the transformation of an image with viewer‑centred coordinates into a representation with an object‑centred coordinate frame. Marr and Nishihara (1978) proposed that the primitive units of visual information used to derive the 3‑D model representation are the principal axis of the shape itself, together with the component axes and their attached volumetric values. Marr made the general point that it is possible to code visual stimuli in various ways, and that these different codings might be needed to support different behaviours. For instance, one purpose of vision is to enable us to manipulate objects in space. Another purpose is to enable objects to be recognised when they are seen from different viewpoints. Different information is required for these purposes. For the first purpose, the visual system needs to preserve information which is specific to the viewpoint from which the object is seen. For the second purpose (to achieve recognition across differing viewpoints), the visual system needs to use information which is constant to the object irrespective of the viewpoint.
For example, Warrington and James (1986) investigated the ability of normal subjects and patients with right‑hemisphere lesions to identify 3‑D objects from different viewpoints. Object recognition thresholds were measured in terms of the angle of rotation (through the horizontal or vertical axis) required for correct identification.
The results obtained by Warrington and James (1986) showed that the effects of axial rotation were very variable, and no evidence was found of a typical recognition threshold function relating angle of view to object identification. The findings were discussed in relation to Marr’s theory that the 2.5‑D sketch can be derived directly from the visible structure, without reference to stored information, although “knowledge” of the object’s geometry is not thereby achieved. Warrington and James (1986) suggested that object identification is achieved by matching the 2.5‑D sketch with stored descriptions, and they argued that the results are consistent with a distinctive‑features model of object recognition.
However, there are a number of limitations to using rotated objects. Warrington and Taylor’s (1973) study, which tested a group of normal subjects and a group with right‑hemisphere brain damage using angle of view, illustrates several of them. First, the manipulation of angle of view was not satisfactory because performance on the conventional‑view photographs was at ceiling for both groups. Second, because only two arbitrary views were used, the function relating the prototypical and unconventional viewpoints was not considered; considering it might have allowed the subject to describe the properties of a prototypical view more adequately. Third, most object recognition research has been carried out with artificial stimuli, which give data about the perception of representations of objects and not about the perception of objects themselves.
The present study addresses the problem of object recognition in the following way: the subject takes one flat image of an object from one eye and combines it with another flat image of the same object, viewed from a different angle, from the other eye. The subject is required to produce a single percept with the added quality of depth (which neither image possesses in isolation) and to produce the name of that object.
Sternberg (1969a), following Donders (1868‑1869), noted that reaction‑time data allow us to make some additional assumptions about the temporal relations among the component processes of two different tasks, such as the recognition versus the naming of objects.
Sternberg assumed (a) that only one component process may be active at any one time, and (b) that the amount of time taken up by one component process does not interfere with the time required for another component.
Accordingly, I expected in the present study that the naming and recognition tasks would differ in one particular component process. The difference in the time taken to perform the two tasks then indicates the duration of the process that differs between them (Donders, 1868‑1869). These assumptions about component processes have been embodied in a model of performance in which the sub‑processes are identified as successive temporal stages, each of which occupies a separate interval of time (the discrete stage model; McClelland, 1979).
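The subtractive logic described above can be sketched in a few lines. The sketch assumes strictly serial, non‑overlapping stages, as in the two assumptions attributed to Sternberg. The stage names and durations are hypothetical values chosen only for illustration and are not findings of the present study.

```python
from statistics import mean


def stage_model_rt(stage_durations_ms):
    """Discrete stage model: total RT is the sum of serial, non-overlapping stage durations."""
    return sum(stage_durations_ms.values())


def subtractive_estimate(rts_with_stage, rts_without_stage):
    """Donders' subtraction: mean RT difference between two tasks that differ in one stage."""
    return mean(rts_with_stage) - mean(rts_without_stage)


if __name__ == "__main__":
    # Hypothetical stage durations in ms (assumptions for illustration only).
    recognition = {"encoding": 150, "matching": 300, "response": 200}
    naming = dict(recognition, name_retrieval=250)  # same stages plus one extra

    print(stage_model_rt(recognition))  # total RT for the recognition task
    print(stage_model_rt(naming))       # total RT for the naming task
    print(stage_model_rt(naming) - stage_model_rt(recognition))  # duration of the extra stage
```

The subtraction recovers the duration of the added stage only if inserting that stage leaves the other stages unchanged, which is exactly the assumption that the discrete stage model makes explicit.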
7.3 The goal of the research
A fundamental goal of the present research is to determine how the visual system identifies common objects rotated in depth.
By varying the experimental method and stimuli, it is possible to investigate in some detail the problem of how people can achieve stable recognition of everyday common objects rotated in depth.
All theories of the internal representation of the shape of objects make assumptions about the nature of the coordinate system used to describe these shapes. For example, in Kosslyn and Shwartz’s computer simulation model (see section 4.7.3, p. 87) the shapes of objects are represented in long‑term memory using an explicit polar coordinate system. In contrast, structural descriptions of objects usually make use of implicit gravitational coordinates and do not actually specify the coordinates of the various parts in the description.
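To make concrete what an explicit polar coordinate description involves, the sketch below stores a shape as a list of (r, theta) pairs and rotates it about the origin. It is a generic illustration written for this purpose. It is not the Kosslyn and Shwartz simulation, and the function names and the example shape are assumptions.

```python
import math


def polar_to_cartesian(r, theta_deg):
    """Convert a polar point (radius, angle in degrees) to Cartesian (x, y)."""
    theta = math.radians(theta_deg)
    return (r * math.cos(theta), r * math.sin(theta))


def rotate_polar(points, angle_deg):
    """Rotate a shape about the origin: only the angle of each point changes."""
    return [(r, (theta + angle_deg) % 360.0) for r, theta in points]


if __name__ == "__main__":
    # Hypothetical shape: three points given as (radius, angle in degrees).
    shape = [(1.0, 0.0), (2.0, 90.0), (1.5, 350.0)]
    rotated = rotate_polar(shape, 20.0)
    for r, theta in rotated:
        print((r, theta), polar_to_cartesian(r, theta))
```

In such a description, rotation in the picture plane leaves every radius unchanged and simply adds a constant to every angle, whereas a structural description records relations among parts without assigning explicit coordinates to them.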