C H A P T E R 4
RECOGNISING AN OBJECT FROM DIFFERENT VIEWS
4.0 The Role of Canonical Views in Object Recognition
Earlier, in Chapter 3, I reviewed the theories of RBC (Biederman, 1987) and of Marr & Nishihara (1978). Although both theories emphasize the recognition of objects from any viewpoint, there is evidence that not all views of objects are equally easy to recognize. Palmer et al. (1981) described how each of the different objects they examined appears to have a “canonical” viewpoint. For instance, Palmer & Rosch (1981) proposed that an automobile is better recognized as an automobile when seen in side view than when perceived from the front or back. The authors’ explanation is that the canonical orientation of a three-dimensional object is the one that “reveals the most information of greatest salience about it” (p. 147).
Furthermore, Palmer (1981) demonstrated that the latency to name familiar objects decreases as the view of the objects becomes more canonical.
Although the theories of Marr & Nishihara (1978) and Biederman (1987) stress the recognition of objects independent of viewpoint, each could readily accommodate such canonical view effects. For example, Biederman emphasizes that certain viewpoints may conceal the non-accidental properties which define the “geons”, while other viewpoints may better reveal them. Therefore, the finding that certain aspects of the process are view-dependent does not undermine “view-independent” models of object recognition.
Neuropsychological studies support a role for “canonical views” in object recognition, and the ability to identify an object from an unusual perspective has been linked to the right hemisphere. Warrington and Taylor (1973, 1978) showed that patients with posterior lesions of the right cerebral hemisphere (usually involving the parietal lobe) have great difficulty recognizing common objects depicted from unconventional points of view, yet make few errors when the same objects are depicted from conventional (canonical, upright) views.
Warrington and Taylor (1973) investigated the role of viewing angle in the identification of common objects. Their “unusual views” task consisted of black-and-white photographs of common objects pictured from an unusual or unconventional angle.
FIGURE 4.1 Examples from the Unusual Views Test (Warrington & Taylor, 1973).
Although the view of each object was one which would be relatively familiar in everyday life, it was considered unusual as a photographic representation of the stimulus; Warrington & Taylor (1973) point out that they chose the unusual views so that they were not necessarily unfamiliar views. Patients were selectively impaired at identifying the “unusual-view” objects as compared with the same items pictured from a conventional viewpoint. For example (see Figure 4.1), the iron might be called “a cap on its side”, the bucket “a fried egg”, the basket “a table mat”, and the goggles “a belt”. The errors appear to reflect the visual properties of the stimulus, either over-emphasizing some detail in the stimulus or ignoring a major detail.
Since Warrington and Taylor’s (1973, 1978) research was conducted, a number of studies have shown that patients with right-hemisphere damage have difficulty in recognizing unconventional views of objects, but generally perform well with conventional views of the same objects (Humphreys & Riddoch, 1984; Ratcliff & Newcombe, 1982; Warrington & James, 1986, 1988).
4.1 Cerebral lesions and object recognition
Neuropsychological studies of vision attempt to understand the visual impairments caused by lesions to specific areas of the brain. Current neuropsychological research assumes that dissociations between processes are more informative than associated deficits. It is well established that a disorder of visual object recognition occurs in patients with post-Rolandic lesions of the right hemisphere. Milner (1958) compared patients who had undergone temporal lobectomy for the treatment of epilepsy. She found that the right-hemisphere group were significantly poorer than a comparable left-hemisphere group on the McGill Anomalies test. This test requires patients to identify an anomaly in a sketchily drawn scene, and she suggested that the right temporal lobe “facilitates rapid visual identification”.
Warrington & James (1967a) tested a consecutive series of patients with verified localised unilateral cerebral lesions on Collin’s picture test, a graded-difficulty task of identifying incomplete outline drawings of objects. They found no significant deficit in the right temporal group compared with the left temporal group, but a highly significant deficit in the right parietal group compared with a comparable left parietal group.
This result was later replicated by Warrington & Taylor (1973) who reported findings of a significant right posterior deficit, but no trace of an impairment in the right temporal group compared with the left temporal group.
In all the group studies cited, the right hemisphere lesion groups were compared with comparable left hemisphere groups. It is commonly found that the incidence of visual field defects is much the same in the two groups. However, it has been well documented that the recognition deficit cannot be accounted for by raised sensory thresholds in supposedly intact parts of the visual field. Warrington & Rubin (1970) found that while tachistoscopic threshold measurements for detecting the presence or absence of a ‘black dot’ or identifying a letter (not degraded) in the central visual field were impaired in all brain‑damaged groups, the critical right parietal group were not selectively impaired.
4.2 Disorders of visual object recognition
The occipital lobes at the back of the brain receive the major projections from the retina, and are conventionally regarded as the primary centre for visual processing in the brain.
Humphreys & Riddoch (1984) and Riddoch & Humphreys (1986) found that four patients were impaired at recognizing photographs of foreshortened objects (see Figure 4.2). Such views can occlude an object’s parts or features as well as obscure its main axis of elongation. One group of patients showed impaired performance only when the major axis of a target object was foreshortened; another patient was impaired only when the primary distinctive features of the target object were obscured. This double dissociation within the group of patients suggests two functionally independent routes to object constancy: one specifying the object’s structure with reference to its major axis, the other characterized by something like a feature list of the object’s distinguishing properties or parts.
FIGURE 4.2 Examples of the foreshortened and minimal-feature conditions of Humphreys and Riddoch’s (1984) matching task.
Humphreys and Riddoch (1987) documented the case of an integrative agnosic patient, HJA, whose impairment lay at the perceptual organization stage. HJA attempted to identify objects by their salient local features; for example, he recognised a pig “because of the curly tail”. However, the authors conclude that recognising more complicated objects requires a different process than matching salient local features. A patient described by Pallis (1955), who appears similar to HJA in many respects, said of faces: “I can see the eyes, nose and mouth quite clearly but they just don’t add up.” The ability to relate the local parts of a figure to its global form is particularly necessary for recognizing and distinguishing between items which are visually similar to other items from the same category. For instance, the local parts of one face (the nose, eyes, etc., considered in isolation) are very similar to those of many others, as is the general global shape of each face. This proposal concerning faces also generalizes to other stimuli from categories with visually similar exemplars, such as buildings. Face recognition appears to depend on some emergent property arising out of the specific combination of global and local features: it is this specific combination that defines an individual’s facial identity. Consequently, patients such as HJA should show poor recognition and exploration of the environment because they fail to identify visual landmarks (Humphreys & Riddoch, 1987a).
For Riddoch and Humphreys (1987a), then, HJA’s perception is impaired, but it is impaired at the highest level of visual analysis. HJA can pick up local features, shape cues, depth cues, and so on, but Riddoch and Humphreys think that he does not readily integrate these into a coherent representation of what he is looking at. Young and Deregowski (1981) suggested that a similar process is quite generally implicated in picture perception: under certain conditions children will pick up local depth and feature cues correctly but fail to integrate them into a coherent representation of the depicted object, leading to problems strikingly similar to those experienced permanently by HJA.
According to Ellis and Young’s (1988) model of object recognition, the idea of an integrative agnosic deficit follows from the view that constructing an adequate object-centred representation involves at least two steps: (1) finding the object’s axis of elongation; and (2) integrating local details correctly with respect to this axis. Patients with posterior lesions of the right hemisphere would then be impaired at the first step, whereas HJA was impaired only at the second. It remains possible that HJA also had some further, subtle deficit in perceptual integration that applied to the construction of effective viewer-centred representations as well as to object-centred representations.
The fact that foreshortened views disrupted the right-hemisphere-damaged patients is consistent with the idea that identification and matching across viewpoints depend on an axis-based structural description, so that performance suffered when the principal axis was more difficult to derive (because of foreshortening). The naming performance of these patients suggested that they often misidentified foreshortened objects because they failed to perceive that the object was oriented in depth, and instead interpreted the form information as oriented in the plane.
Three-dimensional (3-D) vision depends on a number of independent cues, both monocular and binocular (e.g., the derivation of 3-D information from the disparities between the images in the left and right eyes). Patients with complete loss of 3-D vision seem to lose the ability to use several of these different cues, but there are also cases in which patients selectively lose the ability to use only some of the cues. This means that lesions to particular parts of the brain can selectively impair different visual functions. The term “visual agnosia” refers to one such class of selective deficit. Lissauer (1890) proposed that there might be two different kinds of agnosia: “apperceptive visual agnosia” and “associative visual agnosia”. Apperceptive agnosic patients are thought to have difficulty in perceptual processing (i.e. the perceptual abilities that enable us to tell one shape from another); associative agnosic patients are thought to have intact perceptual processing, but difficulty in linking the products of this processing with their stored memories about objects.
4.3 Anatomical basis of the visual agnosias
Cases of visual agnosia are rare, although they have been documented since the late nineteenth century (e.g., Charcot 1883). This rarity can be explained on anatomical grounds. In some cases, agnosia is linked to bilateral brain damage (i.e., damage to both the left and right cerebral hemispheres) to regions bordering the occipital and temporal lobes (to the rear of the brain; see Mack & Boller, 1977; Ratcliff & Newcombe, 1982). The posterior parts of the brain, especially the occipital lobes, are supplied by two posterior cerebral arteries, which come from a common basilar artery. It is therefore possible that obstruction to the blood supply to that part of the brain could produce bilateral brain damage. However, any moderate to large bilateral lesion produced in this way is likely to damage major portions of the striate cortex (V1) and so render the patient blind to visual forms. Thus agnosia must be due to a relatively small lesion of this type. Moreover, vascular damage tends to occur more frequently with the middle rather than the posterior cerebral arteries, because the middle cerebral arteries follow a more convoluted course. Thus posterior brain damage of this sort tends in any case to be infrequent.
The recognition of other classes of stimulus material has also been shown to be impaired by a right posterior lesion.
4.4 Human faces as a class of visual object
An issue of considerable interest is the similarity in functional organization between the recognition of visual objects and of faces. Patients who are agnosic for objects also typically have problems recognizing faces. Agnosic patients often have more difficulty identifying objects from categories with visually similar exemplars (such as living things) than objects from categories with visually dissimilar exemplars (e.g. man-made objects).
The nature of the stored information for faces may differ from that for other objects: for example, in the particular local and global features involved, and in the role of three-dimensional coding, which may be especially important for face recognition compared with object recognition.
An impairment in the perception of photographs of faces was first noted by De Renzi & Spinnler (1966), and extended by Warrington & James (1967b) to the perception of letters. More specifically, the latter authors maintain that letters degraded along a perceptual dimension give rise to a right-hemisphere deficit (Warrington & James, 1967a). A similar result was reported by Faglioni et al. (1967). The hallmark of this syndrome appears to be a difficulty in perceiving meaningful visual stimuli when the redundancy normally present within the figure is reduced or degraded. Moreover, Bruner & Potter (1964) presented subjects with pictorial images that were degraded by projecting them out of focus. They found that verbal cueing aided recognition, and also noted an interference phenomenon: the more defocused the object was initially, the higher the identification threshold when it was finally recognised. This may be because subjects’ incorrect hypotheses from earlier exposures interfered with later ones.
4.4.1 Failure to recognise familiar faces: prosopagnosia
Prosopagnosia is an impairment of the ability to process visual information derived from faces. Several different forms of prosopagnosia have been proposed (e.g. De Renzi, Scotti, & Spinnler, 1969; De Renzi, Faglioni, Grossi, & Nichelli, 1991).
The anatomical basis of these forms, associative and amnesic associative (Damasio, Tranel, & Damasio, 1990), provides useful clues as to how and where information about faces is processed within the ventral stream.
A number of models of face recognition have been proposed, using methods and models developed in cognitive psychology, to explain the different forms of prosopagnosia. The proposed sub‑components of these models are based almost entirely on behavioural data from human subjects.
In fact, there is a strong association between the inability of brain-damaged patients to recognize familiar faces and the occurrence of topographical impairments (Ajuriaguerra & Chiarelli, 1957; Meadows, 1974), and in many cases patients with topographical problems seem reliant on specific cues to orient themselves and to identify their surroundings.
The findings of Warrington and Taylor (1973, 1978) appear to have been an important factor in the development of Marr’s theory of object recognition (Marr & Nishihara, 1978) in which the axis of elongation plays a crucial role.
Marr and Nishihara (1978) argued that many important kinds of object could be considered to be constructed from generalized cone components, and that the lengths and arrangement of the axes of these components, relative to the major axis of the object as a whole, could be used to distinguish between different object classes. Axis-based representations thus allow descriptions to be constructed at different spatial scales in a hierarchically organized way.
Most computational theories of object recognition emphasize that problems in recognizing unusual views of an object can arise if the descriptions constructed from an image of the object are not suitable for matching against stored representations of object appearances. According to Marr, visual object recognition can be described as the outcome of a comparison between two information structures: one a representation of the visually presented object, the other a long-term representation of the object or object class.
Object recognition can thus be considered to involve a comparison of the structure of a seen object with the structures of objects that are already known. This comparison will often demand knowledge of the three-dimensional structure of the objects concerned. Marr suggested that the long-term representation of an object consists of a structural description of the object; to make contact with such a long-term representation, he argued, the representation of the visually presented object must be a similar structural description based on an object-centred coordinate system.
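Marr’s proposed comparison of a seen object’s structural description against stored long-term descriptions can be illustrated with a toy sketch. The part names, relation labels, and the Jaccard-overlap measure below are my own illustrative assumptions, not Marr’s actual scheme:

```python
# Toy sketch of Marr-style recognition: the seen object's structural
# description (a set of part/relation pairs defined relative to the
# main axis) is compared with stored descriptions, and the best match
# is returned. All labels here are hypothetical.

STORED = {
    "human": {("head", "top-of-axis"), ("arm", "upper-side"),
              ("leg", "lower-end"), ("torso", "main-axis")},
    "quadruped": {("head", "front-of-axis"), ("leg", "lower-end"),
                  ("tail", "rear-of-axis"), ("body", "main-axis")},
}

def similarity(seen, stored):
    """Jaccard overlap between two sets of (part, relation) pairs."""
    return len(seen & stored) / len(seen | stored)

def recognize(seen):
    """Return the stored label whose structural description best matches."""
    return max(STORED, key=lambda label: similarity(seen, STORED[label]))

seen = {("head", "top-of-axis"), ("arm", "upper-side"),
        ("leg", "lower-end"), ("torso", "main-axis")}
print(recognize(seen))  # → human
```

Because the stored pairs are defined relative to the object’s own main axis, the same description matches regardless of the object’s orientation in the image, which is the point of an object-centred coordinate system.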
Ellis & Young’s (1988) model makes use of Marr’s idea that three levels of representation of visual input can be distinguished: initial, viewer-centred, and object-centred representations. Neuropsychological evidence is consistent with this model.
Problems in recognizing objects can arise through deficits within or between any of the levels, and the patterns of impairment in object recognition which have been observed in brain damaged patients are revealing about the relationship between these different levels, and the internal organization of each.
Warrington (1978, 1982) has argued that object recognition requires some means of assigning equivalent stimuli to the same perceptual category, in order to cope with transformations of orientation, lighting, distance and so on. It is this perceptual categorization that Warrington thinks is defective in patients with posterior injuries to the right cerebral hemisphere.
Warrington’s idea of perceptual categorisation involves the combined action of the functional components described as viewer-centred representations, object-centred representations, and object recognition units (i.e. stored descriptions of the structures of familiar objects). The key feature of many of the unusual views used by Warrington & Taylor (1973) is most likely the foreshortening of the object’s principal axis of elongation, which would make it particularly difficult to derive an object-centred representation (Marr & Nishihara, 1978). Marr and Nishihara (1978) proposed that shapes are represented in memory as structural descriptions in object-centred co-ordinate systems, so that an object is represented identically regardless of its orientation.
Ellis and Young (1988) suggest that at least part of this defect in recognizing an object from an unconventional view is due to an impairment in constructing object‑centred representations. Moreover, the intact performance of the patients with right posterior injuries on conventional views suggests that the viewer‑centred representations and object recognition units are relatively unimpaired.
4.5 Object categories
There is evidence that object recognition can be based on quite general descriptions of objects. For example, Rosch et al. (1976) found that objects could be classified as members of their basic-level category more quickly than as members of superordinate or subordinate categories. Thus you would be quicker to identify a friend’s dog as a dog than as an animal (superordinate category) or as a collie (subordinate category). The existence of category-based representations could explain why pictures are categorized very quickly. However, there are exceptions for poor category examples: we identify a penguin as a penguin faster than we identify it as a bird (Jolicoeur, Gluck, & Kosslyn, 1984).
Problems in matching foreshortened objects might arise for a number of reasons: (i) patients may be impaired at deriving axis-based descriptions; or (ii) patients may rely on viewpoint-dependent information, since foreshortening produces large changes in any viewpoint-dependent representation of an object.
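The geometric cost of foreshortening can be made concrete. Assuming orthographic projection (a simplifying assumption, not a claim about the studies above), an axis slanted in depth by an angle θ projects to only cos θ of its true length, so an elongated object viewed nearly end-on offers very little image evidence for its principal axis:

```python
import math

def projected_axis_length(length, slant_deg):
    """Under orthographic projection, an axis of the given length,
    slanted in depth by slant_deg away from the image plane,
    projects to length * cos(slant)."""
    return length * math.cos(math.radians(slant_deg))

# As the slant increases, the projected principal axis shrinks,
# making the axis progressively harder to recover from the image.
for slant in (0, 30, 60, 85):
    print(f"{slant:2d} deg -> {projected_axis_length(10.0, slant):.2f}")
```

At 85 degrees of slant the 10-unit axis projects to under one unit, illustrating why axis-based descriptions (reason i) and viewpoint-dependent representations (reason ii) are both stressed by foreshortened views.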
Neuropsychological data from brain-damaged patients, such as visual agnosic and prosopagnosic patients, suggest that these patients suffered ancillary deficits concerning the representation of visual information in 3-D space. These results shed some light on disorders of stereopsis, in which the perceptual complaints may be exacerbated by visual motion (Zihl et al., 1983).
Object recognition requires viewer-centred and object-centred representations of seen objects to be matched to stored descriptions of the structures of known objects, which then allows access to semantic representations. However, neuropsychological impairments do not affect object recognition as if it were a single function. There are, for instance, no reports of patients who show impaired processing of shape information and yet recognize seen objects without difficulty. One patient (HJA) can perceive shape but fails to form an effective integrated representation combining local and global features. Complex abilities such as object recognition therefore seem to be organized into a number of separable functional components or modules, any one of which may be selectively impaired.
Computer-vision theorists are concerned with perceptual problems; they try to solve problems of edge extraction and of resolving stereoscopic images rather than of modelling human object recognition. Thus current algorithms for object recognition concentrate on the formation of the initial representations at the pictorial register. Nevertheless, computer-vision models must propose some sort of permanent representation of an object (a template) against which the temporary representation is matched.
A satisfactory model needs to explain what functional components are involved in object recognition and object naming, and how these are organised with respect to each other. It should account not only for the patterns of impairment observed, but also for those that are not found.
The study of mental processes in brain-damaged individuals has provided insights into the processing mechanisms involved in a range of cognitive tasks. For example, most people are able to recognize the visually presented shapes of objects regardless of how they are oriented in space. To do so, however, apparently requires an act of “mental rotation” that takes place in real time, just as actual physical rotation does.
Generally, our ability to identify objects successfully is not contingent upon viewing them in a canonical orientation. However, identifying disoriented objects is not entirely without cost. For a disoriented object to be recognized, the input representation has to be aligned with the stored representation through a normalization process (Ullman, 1989). Moreover, the normalization process underlying the linear portion of the naming function is held to reflect the same analogue transformation process assumed to underlie the orientation effect observed in tasks requiring judgements of left-right reflection, namely mental rotation (Jolicoeur, 1985, 1988, 1990; Tarr & Pinker, 1989).
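The linear relation between naming latency and angular disparity that an analogue rotation process implies can be sketched as follows. The intercept and rotation rate below are illustrative assumptions, not empirical estimates from any of the studies cited:

```python
def predicted_latency(angle_deg, intercept_ms=500.0, rate_ms_per_deg=2.5):
    """Linear normalization model: naming latency grows with the
    shortest angular distance between the seen view and the stored
    upright view. Parameter values are illustrative, not measured."""
    disparity = min(angle_deg % 360, 360 - (angle_deg % 360))
    return intercept_ms + rate_ms_per_deg * disparity

# Latency rises linearly up to 180 degrees and then falls again,
# because rotation can proceed in whichever direction is shorter.
for angle in (0, 60, 120, 180, 240, 300):
    print(angle, predicted_latency(angle))
```

The characteristic signature of such a model is the symmetry about 180 degrees: a view rotated 240 degrees is predicted to take as long to normalize as one rotated 120 degrees, since the shorter rotation path is used.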
The next three sections review studies that have looked for evidence bearing on the mental rotation transformation hypothesis. The question posed is how long it takes subjects to recognize an object seen from different views.