The Role of Visual Depth Cues in Visual Object Recognition and Naming
By Dr. Fawzy Osman, Ph.D., Senior Consultant Clinical Neuropsychologist
CHAPTER 1
1.0 A general introduction to visual object recognition and naming behaviour
The term perception refers to the means by which information acquired from the environment via the sense organs is transformed into experiences of objects, events, sounds, tastes, and so on. Several modalities (hearing, taste, smell, touch, and the kinaesthetic sense) contribute to perception; the kinaesthetic modality in particular gives a person information about his or her own movement, bodily position, and orientation in space.
1.1 Visual object recognition
Object recognition is one of the most important, yet least understood, aspects of visual perception. Why is object recognition difficult?
1.2 Outline of the problem
Visual object recognition is not a single problem. There are several different paths leading to visual object recognition: shape, texture, colour, location, and so on. Recognition can be said to be primarily visual when the recognition process proceeds mainly on the basis of visual data. There are also situations in which the recognition process uses sources that are better classified as not primarily visual in nature, namely prior knowledge, expectations, and temporal continuity (Morton, 1969; Palmer, 1975). Finally, in some cases, visual recognition employs processes that may be described as reasoning.
Most common objects can be recognized in isolation, without the use of context or expectations. For many objects, their colour, texture and motion play only a secondary role. In these cases the objects are recognized by their shape.
1.2.1. Shape recognition
Shape is the most common and important basis of visual recognition. Shape is defined as that set of static spatial features of an object which remains invariant under similarity transformations. An image is normalized when it is transformed to a standard location, orientation, and size on the retina. The task of recognizing whether or not two objects have the same shape is thus reduced to deciding whether their outlines are congruent. Normalization would also explain how certain common objects are stored in memory so that an object can be recognized on a subsequent occasion: each shape could be stored as a template consisting of the converged inputs from all those receptors which the outline of a given shape stimulates. The template with the greatest number of active units would then indicate the shape of the stimulus. Normalization of orientation is the easiest type of normalization to achieve because most objects in the visual world are mono-oriented (i.e., they remain the same way up relative to gravity). Animals, particularly more primitive ones, partially normalize their visual inputs by moving their eyes or heads so as to bring the image into a given location and orientation on the retina, which reduces the complexity of the neural coding processes responsible for shape comparison and recognition.
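The normalization-and-template account can be made concrete with a small computational sketch. The following is illustrative only, not a model from the literature: it normalizes a shape (given as an array of outline points) to a standard location, size, and orientation, after which deciding whether two shapes have the same shape reduces to comparing the normalized point sets. The function names, and the use of the covariance principal axis to fix orientation, are my own assumptions for the example.

```python
import numpy as np

def normalize(points):
    """Normalize a shape (an N x 2 array of outline points) to a
    standard location, size, and orientation, as the template account
    assumes: centroid to the origin, RMS radius to 1, and the principal
    axis of the point covariance rotated onto the x-axis."""
    pts = points - points.mean(axis=0)              # standard location
    pts = pts / np.sqrt((pts ** 2).sum(axis=1).mean())  # standard size
    eigvals, eigvecs = np.linalg.eigh(np.cov(pts.T))
    axis = eigvecs[:, np.argmax(eigvals)]           # principal axis
    theta = np.arctan2(axis[1], axis[0])
    rot = np.array([[np.cos(-theta), -np.sin(-theta)],
                    [np.sin(-theta),  np.cos(-theta)]])
    return pts @ rot.T                              # standard orientation

def same_shape(a, b, tol=1e-3):
    """After normalization, congruence reduces to comparing point sets."""
    na, nb = normalize(a), normalize(b)
    # The principal-axis direction is ambiguous by 180 degrees, so try both.
    for cand in (nb, -nb):
        if all(np.linalg.norm(cand - p, axis=1).min() < tol for p in na):
            return True
    return False
```

On this sketch, a rectangle that has been translated, scaled, and rotated in the image plane still matches its stored "template", while a differently proportioned quadrilateral does not.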
In general, the results of several studies show that human pattern perception and recognition is often quite sensitive to the orientation of the stimuli (see Rock, 1956; Rock and Heimer, 1975; Ghent, 1960; Braine, 1965; Kolers and Perkins, 1969a,b; Yin, 1969, 1970; Shinar and Owen, 1973; Cavanagh, 1977; Navon, 1978; Pylyshyn, 1979; Jolicoeur and Kosslyn, 1983; Humphreys, 1984; Koriat and Norman, 1985; Maki, 1986; Humphrey and Jolicoeur, 1988; Tarr and Pinker, 1989; Jolicoeur, 1990).
If objects are not in their normal orientations, it takes much longer to decide whether or not the shapes in each different orientation are the same, and we find ourselves having to make allowances for this disorientation when comparing them.
1.3. Object constancy
How do we recognize objects as being the same despite differences in their retinal projections when they are seen at different orientations? This is the problem of visual object constancy. Object constancy (Humphreys and Riddoch, 1988, 1989) is defined as the ability to recognize the structure of the object despite various transformations of the retinal image. The biological function of object constancy is to emphasize the permanent characteristics of objects.
Much of the laboratory research on object constancy has focused on orientation invariance. There are many reasons for this focus, one of which is that it is relatively easy to manipulate the orientation of visual stimuli by rotating them in the image plane (e.g. Dearborn, 1899). Furthermore, work on orientation effects has generated a number of theoretical positions, some of which have led to current debate. For instance, the manipulation of stimulus orientation has been fundamental to research in “mental rotation” which enriched the imagery versus proposition debate (see Kosslyn, 1981; Pylyshyn, 1981).
Experimentally, object constancy may be investigated by instructing a subject to match one object, presented at different angles to the line of vision, with another presented in a familiar orientation. Typically, the response time will be shorter for an object appearing in the usual view than would be expected from a prediction based on the geometry of the retinal image. Various mechanisms have been proposed to account for this phenomenon among which are visual cues such as those for binocular depth, texture, slant, etc. (Beck & Gibson, 1955; Hake, 1957).
1.3.1. Neurological Impairments of Human Vision
Studies of neurological impairments to object constancy have examined brain-damaged patients who have particular difficulty in matching different views of objects (Warrington and Taylor, 1973, 1978; Humphreys & Riddoch, 1988, 1989).
Warrington and Taylor (1973) found that right hemisphere damaged patients were poor at identifying photographs of objects taken from unusual views, relative to left hemisphere damaged patients and age‑matched control subjects. Furthermore, this deficit was confined to patients with damage to posterior areas of the right hemisphere, a result confirmed by Warrington and Taylor (1978). These results suggest that damage to posterior regions of the right hemisphere can bring about marked deficits in identifying objects not presented under prototypical viewing conditions (from a canonical viewpoint and under even lighting conditions; Palmer et al., 1981).
Marr (1982) emphasised the importance of deriving the major and minor axes of an object. For example, in recognising the structure of an object photographed from the side view, one might use information about the major axis (length) to retrieve information about the geometry and volume of the object. Marr interpreted the “unusual views” findings in these terms and argued that they were difficult for right‑hemisphere damaged patients because a major axis had been foreshortened or obscured. Therefore, the right posterior lesion patients were unable to retrieve the necessary information to access a stored “catalogue” of object structures.
Recent research by Warrington and James (1986) has questioned Marr’s (1982) analysis of the impairment of right-hemisphere damaged patients on unconventional views. Warrington and James (1986) noted that some unusual views which patients found difficult to perceive did not foreshorten a major axis. They attempted to specify the properties of an “unusual” view empirically by exploring the function relating an “unusual” to a “usual” viewpoint. They rotated three-dimensional shadow images in a series of steps from a starting point which was 90 degrees to the longest axis of the object. The subjects’ task was to identify the object, and thresholds were measured in terms of the angle of rotation required for successful recognition. Warrington and James (1986) did not find that views with a foreshortened principal axis produced poorer performance in the right-hemisphere-damaged patients, relative to controls, than did views that did not foreshorten that axis. Warrington and James suggested that their data could be explained by a distinctive-features model of object recognition.
Single case studies of patients with impairments in identifying and matching objects across different viewpoints have been reported by Humphreys and Riddoch (1984) and Riddoch and Humphreys (1986).
Humphreys and Riddoch (1984) considered the unusual-views deficit observed in their four patients to be a primary deficit in axis transformation, or a “transformational agnosia”. The fact that foreshortening disrupted the right-hemisphere-damaged patients is consistent with the idea that identification and matching across viewpoints depended on an axis-based structural description, so that performance suffered when the principal axis was made difficult to derive (by foreshortening). Indeed, the naming responses of these patients suggested that they often misidentified foreshortened objects because they failed to perceive that the object was oriented in depth, and instead interpreted the form information as being oriented in the plane. That is, the patients failed to derive the principal axis of the objects. A similar argument was made by Layman and Greene (1988), who noted that there was a close relationship between failure on the unusual-views task and impairment on a test of mental rotation. They suggested that at least part of the difficulty that the right-hemisphere-damaged patients have in the perception of unconventional views of objects is in the use of depth cues that would lead to a correct object description. This interpretation is consistent with the theorizing of Marr (1982) and with the results of Humphreys and Riddoch’s (1984) manipulation of depth cues.
1.4 Sources of information for object recognition
Object recognition is often taken to mean the visual recognition of objects based on their shape properties.
The data from experiments on the recognition of simple 2-D shapes by normal subjects suggest the importance of structural descriptions based on a spatial reference frame for representing shapes independently of their retinal projections.
1.4.1. Axis-based structural descriptions
Marr and Nishihara (1978) proposed that shapes are represented in memory as structural descriptions in object-centred co-ordinate systems, so that an object is represented identically regardless of its orientation on the retina. A common strategy in computer vision (e.g., Hinton, 1981; Marr, 1980; Marr and Nishihara, 1978) is to use a reference frame intrinsic to the object. For instance, Marr (1980) proposes that an intrinsic reference frame could be derived from one of the major axes of the object, such as its principal axis of elongation or symmetry. The description taken about this axis is object-centred and will not differ following rotation of the object about any given axis, thus enabling object constancy to be achieved. Objects are not perceived as collections of parts, though parts must be represented (Rock, 1986). These parts must have stability, i.e., they are marked by place (Marr terms this being viewer-centred), and they must be able to compensate for changes in orientation and other conditions, e.g. lighting, necessary to maintain constancy (Marr terms this being object-centred). Object recognition presumably must also rely on pictorial details such as structure, colour and texture. Only when colour, texture and structure are combined can object recognition be said to reflect the object in the real world. Yet the role of colour and texture within object recognition is difficult to establish.
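The object-centred idea can be illustrated with a minimal sketch of my own (it is not Marr and Nishihara’s implementation): if part positions are re-expressed in a frame whose origin is the object’s centroid and whose x-axis is the principal axis of elongation, the resulting description is unchanged by translation and by rotation in the image plane. The function name and the covariance-based derivation of the axis are assumptions for illustration.

```python
import numpy as np

def object_centred_description(points):
    """Describe part positions (an N x 2 array) in a frame intrinsic
    to the object: origin at the centroid, x-axis along the principal
    axis of elongation (the largest-eigenvalue direction of the point
    covariance). Coordinates in this frame do not change when the
    object is translated or rotated in the image plane."""
    pts = points - points.mean(axis=0)
    _, eigvecs = np.linalg.eigh(np.cov(pts.T))
    axis = eigvecs[:, -1]        # eigh sorts eigenvalues ascending
    if axis[0] < 0:              # crude sign convention; a real system
        axis = -axis             # would need a principled way to orient it
    perp = np.array([-axis[1], axis[0]])
    return np.stack([pts @ axis, pts @ perp], axis=1)
```

The point of the sketch is simply that the same description is recovered from any image-plane view of the object, which is what an object-centred representation requires for object constancy.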
Objects are represented by a set of dimensions, and any particular object will be defined by its values on this list of dimensions. There is a debate as to whether elements represented on one dimension (Garner, 1970; Dykes, 1981; Ward, 1985) or on several (Treisman & Gelade, 1980) are processed integrally, together, or separately. However, these dimensions must somehow be used to form the structure of objects. Object structure may be independent of object colour and texture, or it may be intrinsically linked to them. This distinction is important to the object recognition process because, if shape and colour represent separable sources of information, object recognition must also require procedures which can combine separable sources of information. If shape, colour, and texture form an integral source of information, then no such procedures are required. The role of colour and texture in object recognition is uncertain.
Experimental investigation of the role of colour in object perception confirms that spatial features and surface detail are important at different stages. It appears that coloured objects are named, but not recognised, more quickly than monochrome objects (Ostergaard & Davidoff, 1985; Davidoff & Ostergaard, 1988). This conclusion is strengthened by evidence that patients with achromatopsia are not visually impaired (see Humphreys & Riddoch, 1987b).
There has been a continuing debate over the role of colour in the recognition of objects.
Humphreys, Riddoch & Quinlan (1988) have argued that naming is retarded for objects from categories with structurally similar exemplars, because extra time is then required to differentiate any given object from its structurally related competitors. Reaction times are slowed for structurally similar objects with high name frequencies. This result suggests that the effects of surface information on object naming may be stronger on objects from categories with structurally similar exemplars than on objects from categories with structurally dissimilar exemplars.
In contrast to naming, classification tasks (e.g. distinguishing natural from man-made objects) may be performed on the basis of general physical characteristics. Objects from structurally similar categories may be assigned quickly to their superordinate categories because of the structural similarity between category members. In such cases classification may be based on structural rather than semantic information (Snodgrass & McCullough, 1986).
1.4.2. Feature Accounts
An alternative approach, however, argues that the distinctive features of objects are used to achieve object constancy independent of their relation to other parts of the objects. Hence, objects have certain invariant properties that are common to all of their views so rotation will not affect recognition.
After the pioneering work of Hubel and Wiesel (1959), it became reasonable to suppose that object recognition is based on the processing of features such as edges, line orientations, and angles in objects (Neisser, 1967).
It is usually assumed that the local visual features of an object are orientation invariant (e.g. Selfridge, 1959; White, 1980). Furthermore, it has been argued that only distinctive features need be stored (Chase, 1986). Marr (1982) suggested that recognition is achieved through a library of canonical descriptions, while Hoffman and Richards (1984) and Biederman (1987) considered it to be achieved through an analysis of parts segmented at extrema of curvature.
Objects may contain certain distinctive features which directly specify their structural identity independent of orientation. However, Vania (1978) pointed out that some objects, such as pineapples and sheep, have characteristic surface textures which could allow us to recognize the object in the absence of visible distinctive features or salient axes. These observations suggest that different routes to object recognition may be more or less useful for recognizing the same object from different viewpoints or for recognizing different kinds of object.
It is unlikely that the whole of the pattern recognition process could be achieved by analysing distinctive features, since the features that are distinctive for any object are governed by the group of objects from which it must be discriminated. The distinctiveness of features, therefore, depends upon the context in which they are presented. Palmer (1975) demonstrated that recognition of a facial feature was better when it was presented within the context of a face rather than either in a scrambled face or alone.
Thus, recognition of detail or individual parts may normally be preceded and influenced by recognition of the global figural context within which the details are embedded. This suggestion gains some independent support from experiments on shape perception. Palmer et al. (1989) examined whether the effects of configural orientation on shape perception operate on two-dimensional or three-dimensional representations of space. They showed that if pictorial depth cues were added to displays in which, for example, squares were arranged such that the axis of the configuration was diagonal, then the effect of this configural axis was dramatically reduced, suggesting that perceptual organization is based primarily on reference axes constructed within a three-dimensional frame of reference. Such findings are consistent with theories such as Marr and Nishihara’s (1978), which see object recognition operating on representations derived from the 2.5-D sketch level, in which the layout of surfaces in space has already been described.
The invariant hypothesis assumes that shape perception is mediated by detecting those geometrical properties of an object that do not change (are invariant) when the object is transformed in orientation. When an object is rotated, for example, through 90 degrees anti-clockwise away from the orientation of the standard (usual) view, the orientations of its lines change, but the number of lines and the sizes of the angles between them do not. Line number and angle size are therefore features invariant under the group of rotations.
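This distinction between properties that change and properties that do not can be demonstrated directly. In the illustrative sketch below (the function names are my own), a polygon is rotated in the plane: the orientations of its edges change, while the number of edges and the interior angles remain the same.

```python
import numpy as np

def edge_orientations(poly):
    """Orientation in degrees of each edge of a closed polygon
    (an N x 2 array of vertices in order); changes under rotation."""
    edges = np.roll(poly, -1, axis=0) - poly
    return np.degrees(np.arctan2(edges[:, 1], edges[:, 0]))

def interior_angles(poly):
    """Angle in degrees at each vertex; invariant under rotation."""
    prev = np.roll(poly, 1, axis=0) - poly    # vector to previous vertex
    nxt = np.roll(poly, -1, axis=0) - poly    # vector to next vertex
    cos = (prev * nxt).sum(axis=1) / (
        np.linalg.norm(prev, axis=1) * np.linalg.norm(nxt, axis=1))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def rotate(poly, deg):
    """Rotate the polygon anti-clockwise by the given angle in degrees."""
    th = np.radians(deg)
    rot = np.array([[np.cos(th), -np.sin(th)],
                    [np.sin(th),  np.cos(th)]])
    return poly @ rot.T
```

Rotating a triangle through 90 degrees leaves its edge count and interior angles unchanged while altering every edge orientation, which is exactly the pattern of invariance the hypothesis appeals to.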
A human can find an arbitrary object visually in a cluttered environment and proceed to grasp the object and move it at will, avoiding obstacles along the way and not damaging the object if it is fragile. In this sense recognition means understanding an object’s position and orientation in space in a viewpoint-independent manner. Although object recognition has been thought to be a unitary process, experimental evidence (Humphreys & Riddoch, 1987b) suggests that there are distinct stages in the processing of objects at which perceptual classification, semantic classification, and name retrieval are achieved (e.g. Warren & Morton, 1982; Ratcliff & Newcombe, 1982; Riddoch & Humphreys, 1987a). Problems in recognizing objects can arise through deficits within or between any of the stages, and the patterns of impairment in object recognition which have been observed in brain-damaged patients are revealing about the relationship between these different stages, and the internal organization of each.
In the next chapter, I selectively review some of the empirical and theoretical work that has focused on the role of stereopsis in recognizing and naming photographs of common objects rotated in depth. The first four sections of chapter two review studies that have looked for the anatomical and neuropsychological basis of depth perception.
A number of studies have revealed the advantage of binocular vision in perceiving depth information. These studies are reviewed in the first section. Theoretical accounts of stereopsis are reviewed in the second section. The third section examines results that show some significant impairment to stereopsis in brain damaged patients. In the last section I review computational models of stereopsis and discuss how these models can be used to integrate a wide range of empirical and theoretical work pertaining to the perception of depth.