HCNN_2024_2-1-30.pdf

Deep Convolutional Neural Networks for Object Recognition: Which is the most Brain-like?

Giuseppe di Pellegrino
Department of Psychology, University of Bologna
Cognition and Neuroscience
Second cycle Degree in Artificial Intelligence – 2023/24

Convolutional neural networks

In primates, the ventral visual pathway is critically involved in object recognition and extends from V1 to the inferotemporal (IT) cortex in the temporal lobe.

The ventral visual pathway gradually “untangles” information about object identity.
DiCarlo et al., Neuron, 2012

Readout of object identity from primate inferotemporal (IT) cortex

Using a classifier-based readout technique, Hung et al. (2005) showed that the activity of small populations of IT neurons (~300 units), over very short time intervals (as short as 12.5 ms), contains accurate and robust information about both object “identity” and “category”.
Hung et al., Science, 2005

Deep convolutional neural networks (DCNN)

DCNNs are good candidates for models of the ventral visual pathway and have achieved near-human-level performance on challenging object categorization tasks.

Core object recognition

The ability to rapidly identify objects in the central visual field, in a single natural fixation (~200 ms), despite various image transformations (e.g., changes in viewpoint) and changes in background.

Nonhuman and human primates show similar invariant visual object recognition when performing the same binary object recognition tasks.

Monkey performance shows a pattern of object confusions that is highly correlated with the human confusion pattern (consistency = 0.78). Importantly, low-level visual representations (pixels) do not share these confusion patterns (consistency = 0.37). These results are in line with the hypothesis that rhesus monkeys and humans share a common neural shape representation that directly supports object recognition.
Rajalingham et al., J.
Neurosci, 2015

Deep convolutional neural networks (DCNNs), optimized by supervised training on large-scale category-labeled image sets (for instance, ImageNet), display internal feature representations similar to neuronal representations along the primate ventral visual stream, and they exhibit behavioral patterns similar to the pairwise object confusion patterns of primates.

However, several studies have shown that DCNN models can diverge drastically from humans in object recognition behavior. Such failures of the current DCNN models would likely not be captured using low-resolution behavioral measures (i.e., object level) but could be revealed at higher resolution (image level).

A recent study employed both low- and high-resolution measurements of behavior (over a million behavioral trials) from 1,472 anonymous humans and five male macaque monkeys, with 2,400 images across 276 binary object discrimination tasks.

Two example images for each of the 24 objects. To enforce invariant object recognition behavior, each image included one object, with randomly chosen viewing parameters (e.g., position, rotation and size), placed onto a randomly chosen natural background.
Rajalingham et al., J. Neurosci, 2018

For monkeys, each trial was initiated when they held fixation on a central point for 200 ms, after which a test image (6° of visual angle) appeared at the center for 100 ms. Immediately after extinction of the test image, two choice images, each displaying the canonical view of a single object with no background, were shown to the left and right. The monkey was allowed to freely view the response images for up to 1,500 ms and responded by holding fixation over the selected image for 700 ms.
Rajalingham et al., J. Neurosci, 2018

Each behavioral metric computes a sensitivity (discriminability) index: d′ = Z(HitRate) − Z(FalseAlarmRate), where Z is the z-score transform (the inverse of the standard normal cumulative distribution function).
Rajalingham et al., J.
Neurosci, 2018

Object-level (across all images and distractors) behavioral comparison

B.O1 signatures (discriminability measures) for the human (n=1472), monkey (n=5), and several DCNN models, shown as 24-dimensional vectors using a color scale (warm colors = lower discriminability). Each element of the vector corresponds to the system’s discriminability of one object against all others that were tested (i.e., all other 23 objects).

Human consistency was used to quantify the similarity between a model visual system and the human visual system with respect to a given behavioral metric (signature).

Image-level behavioral comparison

The one-versus-all image-level signature (B.I1) is shown as a 240-dimensional vector (a subset of 240 images, 10 images/object) using a color scale, where each colored bin corresponds to the system’s discriminability of one image against all distractor objects.

When behavior is examined at the higher resolution of individual images, all leading DCNN models fail to replicate the image-level behavioral signatures of primates.

Rhesus monkeys are more consistent with the archetypal human than any of the tested DCNN models (at the image level).

Synthetic image-optimized models were no more similar to primates than ANN models optimized only on ImageNet, suggesting that the tested ANN architectures have one or more fundamental flaws that cannot be readily overcome by manipulating the training environment.

DCNN models diverge from primates in their core object recognition behavior. This suggests that the model architecture (e.g., convolutional, feedforward) and/or the optimization procedure (including the diet of visual images) that define this model subfamily are fundamentally limiting.
Rajalingham et al., J.
Neurosci, 2018

Recurrent neural networks

Deep CNNs trained on object categorization are the best predictors of primate behavioral patterns across multiple core object recognition tasks. These networks are also the best predictors of individual responses of macaque IT neurons.

Unlike the primate ventral stream, the neural networks in this family are almost entirely feedforward and lack cortico-cortical, subcortical, and intra-areal recurrent circuits.
Kar et al., Nat. Neurosci., 2019

The short duration (~200 ms) needed to accomplish accurate object identity inferences in the ventral stream suggests that recurrent circuit-driven computations may not be critical for these inferences. In addition, it has been argued that recurrent circuits might operate at much slower time scales, being more relevant for processes such as regulating synaptic plasticity (learning). One hypothesis is therefore that core object recognition behavior does not require recurrent processing.

However, feedforward DCNNs fail to accurately predict primate behavior in many situations. Specific images (e.g., blurred, cluttered, occluded) whose object identities are difficult for DCNNs, but are nevertheless easily solved by primates, might involve recurrent computations. The impact of recurrent computations on the ventral stream might be most relevant at later times in the object recognition process.
Kar et al., Nat. Neurosci., 2019

To compare the behavioral performance of primates (humans and macaques) and current DCNNs image by image, a binary object discrimination task was used, with 1,320 images (132 images per object) in which the object belonged to 1 of 10 different categories. Macaques and humans outperform AlexNet (2012).
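An image-by-image comparison of this kind needs a performance score for every image. The following is a minimal sketch, assuming that the score is the sensitivity index d′ = Z(HitRate) − Z(FalseAlarmRate) defined earlier for the behavioral metrics. The function name `d_prime` and the clipping constant are illustrative assumptions, not details taken from the cited studies.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate, clip=1e-3):
    """Sensitivity index d' = Z(hit rate) - Z(false-alarm rate).

    Z is the inverse of the standard normal cumulative distribution.
    """
    # Keep both rates strictly inside (0, 1) so the inverse CDF stays finite.
    # The clipping value is an assumption made for this sketch.
    h = min(max(hit_rate, clip), 1.0 - clip)
    f = min(max(false_alarm_rate, clip), 1.0 - clip)
    z = NormalDist().inv_cdf
    return z(h) - z(f)


# Hypothetical rates for a single image (made-up numbers for illustration).
score = d_prime(hit_rate=0.9, false_alarm_rate=0.1)
```

A system that responds at chance (hit rate equal to false-alarm rate) gets d′ = 0, and larger values indicate better discrimination of the target object from its distractors.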
There were 266 challenge images (red dots) and 149 control images (blue dots).

Reaction times (RTs) for both humans and macaques were significantly longer for the challenge images than for the control images (monkeys: ΔRT = 11.9 ms; humans: ΔRT = 25 ms), suggesting that additional processing time is required for the challenge images.
Kar et al., Nat. Neurosci., 2019

To determine the time at which object identities are formed in the IT cortex, the neural decode accuracy (NDA) was estimated for each image every 10 ms (from stimulus onset), by training and testing linear classifiers per object independently at each time bin.

The term object solution time (OST) refers to the time at which the NDA measured for each image reached the level of the behavioral accuracy of each subject (pooled monkey).
Kar et al., Nat. Neurosci., 2019

First, for both the control and the challenge images, the accuracy of the IT decodes becomes equal to the behavioral accuracy of the monkeys at some time point after image onset.

Second, the IT decode solutions for challenge images emerge slightly later than the solutions for the control images (average difference ~30 ms).

The challenge images required an additional ~30 ms to reach a full solution compared with the control images, regardless of whether the animal was actively performing the task or passively viewing the images.
Kar et al., Nat. Neurosci., 2019

IT predictivity across time from feedforward DCNNs

If the late-emerging IT solutions for challenge images depend on recurrent computations, then purely feedforward DCNNs should accurately predict IT neural responses for control images but fail to predict IT neural responses for challenge images.

To test this idea, the study investigated how well the DCNN features could predict the time-evolving IT population response, using a partial least squares analysis.
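The object solution time (OST) defined above can be sketched for a single image as the first time bin at which the neural decode accuracy reaches the behavioral accuracy. This is a simplified illustration: the first-crossing rule, the function name `object_solution_time`, and the accuracy values are assumptions made for the example, and the original analysis may estimate the crossing point differently.

```python
def object_solution_time(nda_by_bin, behavioral_accuracy, bin_ms=10, onset_ms=0):
    """Return the start time (ms) of the first bin whose neural decode
    accuracy (NDA) reaches the behavioral accuracy, or None if it never does.

    nda_by_bin: NDA values for consecutive time bins, starting at onset_ms.
    bin_ms: bin width in ms (10 ms in the text above).
    """
    for i, nda in enumerate(nda_by_bin):
        if nda >= behavioral_accuracy:
            return onset_ms + i * bin_ms
    return None


# Made-up NDA time courses for one control and one challenge image.
control_nda = [0.50, 0.55, 0.70, 0.85, 0.90]
challenge_nda = [0.50, 0.52, 0.60, 0.70, 0.75, 0.78, 0.82]

ost_control = object_solution_time(control_nda, behavioral_accuracy=0.80)
ost_challenge = object_solution_time(challenge_nda, behavioral_accuracy=0.80)
```

Comparing the two values per image gives the kind of latency difference between challenge and control images that the slides report at the population level.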
Predicting IT neural responses with DCNN features

Data collection: Neural responses are collected for each of the 1,320 images (50 repetitions); the example shown is neural site #3, across 10 ms time bins.

IT predictivity

Data collection: Neural responses are collected for each of the 1,320 images (R_TRAIN) across 10 ms time bins for each recorded IT neuron.

Mapping: For the training images, the image-evoked activations (F_TRAIN) of the DCNN model from a specific layer were computed. Partial least squares regression was used to estimate the set of weights (w) and biases (β) that best predict R_TRAIN from F_TRAIN.

Test predictions: Given the best set of weights (w) and biases (β) that linearly map the model features onto the neural responses, the predictions (M_PRED) of this synthetic neuron were generated from the test image-evoked activations of the model (F_TEST). These predictions were then compared with the test image-evoked neural responses (R_TEST) to compute the IT predictivity of the model.

IT predictivity across time from feedforward DCNNs

The fc7 layer of AlexNet predicted 44.3 ± 0.7% of the explainable IT neural response variance during the early (putatively largely feedforward) response phase (90–110 ms). However, the ability of the DCNN to predict the IT population pattern significantly worsened (
