HCNN_2024_2-31-67-20-37.pdf
Unsupervised neural networks

49 Unsupervised neural network models of the ventral visual stream
Today's best models of visual cortex are trained on ImageNet, a dataset that contains millions of images organized into thousands of labeled categories. Such supervision is, however, highly implausible: human infants and nonhuman primates simply do not receive millions of category labels during development, so supervised DCNNs cannot provide a correct explanation of how such representations are learned in the brain. Substantial effort has therefore been devoted to unsupervised learning algorithms, with the goal of learning representations from natural statistics without high-level labeling.

50 Human visual input is continuous and egocentric, which is not the case for standard image databases; model input is most often unimodal, whereas human input is multimodal. Humans may also rely on different inductive biases (constraints or assumptions in place prior to training or learning) that allow for more data-efficient learning (e.g., objects obey the laws of physics and behave in a causally predictable way). Finally, humans may enlarge their initial dataset by using already encountered instances to create new instances during offline states (e.g., imagination, dreaming).

51 Unsupervised learning could support the continuous adaptation of cortical sensory representations to sensory input statistics, acting as a bridge between the largely hard-wired, evolutionarily determined processing circuits of low-level areas (e.g., the retina) and the categorical/conceptual representations learned under supervision in higher-order memory/decision centers.

52 Unsupervised learning algorithms

53 Local Aggregation (LA) method. For each input image, a DCNN was used to embed it into a lower-dimensional space ("Embedding Space"). Its close neighbors (blue dots) and background neighbors (black dots) were identified.
The optimization seeks to push the current embedding vector (red dot) closer to its close neighbors and further from its background neighbors. The blue arrow and black arrow are examples of influences from different neighbors on the current embedding during optimization. The "After Optimization" panel illustrates the typical structure of the final embedding after training.

54 The multidimensional scaling (MDS) algorithm was used to visualize the embedding space: classes with high validation accuracy (left) and classes with low validation accuracy (right). For each class, 100 images of that class were randomly chosen from the training set, and the MDS algorithm was applied to the resulting 600 images; dots represent individual images in each color-coded category. The top three rows show images that were successfully classified using a weighted K-nearest-neighbor (KNN) classifier in the embedding space (K top nearest neighbors), while the bottom three rows show unsuccessfully classified examples.

55 Contrastive embedding methods yield high-performing neural networks. A standard ResNet-18 network architecture was used, with training data drawn from ImageNet, a large-scale database of hand-labeled natural images. Across all evaluated objective functions, contrastive embedding objectives showed substantially better transfer than other unsupervised methods. The best of the unsupervised methods (SimCLR and Local Aggregation) equaled or even outperformed the category-supervised model on several tasks, including object position and size estimation. Unsurprisingly, all unsupervised methods were still somewhat outperformed by the category-supervised model on the object categorization task.

56 Unsupervised neural networks were compared to neural data from macaque V1, V4, and IT cortex. A previously established technique was used to map artificial network responses onto real neural response patterns.
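The mapping procedure of slide 56 can be illustrated with a toy sketch: fit a linear map from model features to neural responses on a set of training images, then correlate predicted and measured responses on held-out images. Ridge regression is an assumption made here for simplicity (published work typically uses regularized PLS-style regression), and all data below are synthetic.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    # Closed-form ridge regression: W = (X^T X + alpha*I)^-1 X^T Y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def neural_predictivity(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit a linear map from model features to neural responses on training
    images, then compute the Pearson correlation between predicted and
    measured responses on held-out images (one r per recorded unit;
    the median across units is returned)."""
    W = ridge_fit(X_train, Y_train, alpha)
    Y_pred = X_test @ W
    rs = [np.corrcoef(Y_pred[:, j], Y_test[:, j])[0, 1]
          for j in range(Y_test.shape[1])]
    return float(np.median(rs))

# Toy demo: simulated "neural" responses are a noisy linear readout of the
# model features, so the fitted map should predict held-out responses well.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))        # 200 images x 50 model units
W_true = rng.normal(size=(50, 10))    # 10 simulated neurons
Y = X @ W_true + 0.1 * rng.normal(size=(200, 10))
score = neural_predictivity(X[:150], Y[:150], X[150:], Y[150:])
print(score)
```

With this low noise level the median held-out correlation is close to 1; for real recordings it is far lower, which is why scores are reported relative to the data's noise ceiling.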
The figure in the next slide shows the correlation between model and neural responses across held-out images, for the best-predicting layer of each model. Area V1: all unsupervised methods were significantly better than the untrained baseline at predicting neural responses, although none were statistically better than the category-supervised model on this metric. Area V4: only a subset of methods achieved parity with the supervised model in predicting responses. Area IT: only the best-performing contrastive embedding methods achieved neural prediction parity with supervised models.

57 Unsupervised neural networks compared to neural data from macaque V1, V4, and IT cortex.

58 Deep contrastive learning on first-person video data from children. The ImageNet dataset used to train unsupervised networks diverges significantly from real biological data streams: ImageNet contains single images of a large number of distinct object instances per category, presented cleanly from stereotypical angles, whereas human infants receive images from a much smaller set of object instances under much noisier, continuous viewing conditions; ImageNet consists of statistically independent static frames, whereas human infants receive streams of temporally correlated inputs.

59 Is deep contrastive unsupervised learning sufficiently robust to handle real-world developmental video streams such as SAYCam? A better proxy for the real infant data stream is the recently released SAYCam dataset, which contains head-mounted video-camera data from three children (about 2 h/wk, spanning ages 6 to 32 mo). To test whether contrastive unsupervised learning is sufficiently robust to handle such streams, the video instance embedding (VIE) algorithm was used. VIE is an extension of LA to video and achieves state-of-the-art results on a variety of dynamic visual tasks, including action recognition.
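The contrastive objectives behind LA (slide 53) and its video extension VIE (slide 59) share the same attraction/repulsion structure. Below is a minimal NumPy sketch of an LA-style loss for a single embedding, assuming unit-normalized vectors, cosine similarity with a temperature tau, and neighbor sets given as index lists; the published method additionally uses clustering and a memory bank to define close neighbors.

```python
import numpy as np

def local_aggregation_loss(emb, i, close, background, tau=0.07):
    """LA-style objective for embedding i: attraction to close neighbors
    relative to the larger background-neighbor set (close is a subset of
    background), L_i = -log(sum_{j in close} exp(v_i.v_j / tau)
                           / sum_{k in background} exp(v_i.v_k / tau))."""
    v = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-norm rows
    sims = v[i] @ v.T / tau                               # scaled cosine sims
    num = np.sum(np.exp(sims[close]))
    den = np.sum(np.exp(sims[background]))
    return -np.log(num / den)

rng = np.random.default_rng(1)
emb = rng.normal(size=(10, 8))
close, background = [1, 2], list(range(1, 10))  # self (index 0) excluded
loss_far = local_aggregation_loss(emb, 0, close, background)

# Pull the close neighbors next to point 0: the loss should drop.
emb2 = emb.copy()
emb2[1] = emb2[2] = emb2[0] + 0.01 * rng.normal(size=8)
loss_near = local_aggregation_loss(emb2, 0, close, background)
print(loss_near < loss_far)
```

Minimizing this quantity over all embeddings is what drives the red dot of slide 53 toward its blue close neighbors and away from its black background neighbors.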
60 Representations learned by VIE are highly robust, approaching the neural predictivity of those trained on ImageNet.

61 Partial supervision. Semisupervised learning seeks to leverage small numbers of labeled datapoints in the context of large amounts of unlabeled data. One semisupervised learning algorithm, local label propagation (LLP), embeds datapoints into a compact embedding space while additionally taking into account the embedding properties of the sparse labeled data. The algorithm first uses a label-propagation method to infer pseudolabels for unlabeled images from those of nearby labeled images; the network is then jointly optimized to predict these inferred pseudolabels while maintaining contrastive differentiation between embeddings with different pseudolabels.

62 The embedding of an unlabeled input is used to infer its pseudolabel from its labeled neighbors, with voting weights determined by their distances from the input embedding and by their local density (the highlighted areas). The DCNN is jointly optimized to predict these inferred pseudolabels while maintaining contrastive differentiation between embeddings with different pseudolabels.

63 Pearson correlations between human behavior and different models' behavior on the same object recognition task, over 2,400 images of 24 different objects. Using just 36,000 labels (corresponding to 3% supervision), semisupervised models lead to representations that are substantially more behaviorally consistent than those of purely unsupervised methods, although a gap to the supervised models remains.

64 Unsupervised models represent high-performing yet biologically plausible visual learning systems. The neural predictivity of the best unsupervised method only slightly surpasses that of supervised categorization models. On both neural response pattern and behavioral consistency metrics, a substantial gap remains between all models (supervised and unsupervised) and the noise ceiling (explainable variance) of the data.
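The label-propagation step of LLP (slides 61-62) can be sketched with a simplified inverse-distance voting rule. The function name and the pure inverse-distance weighting are illustrative assumptions; the published algorithm additionally weights votes by the local density of the labeled neighbors.

```python
import numpy as np

def propagate_pseudolabel(emb_u, labeled_emb, labels, k=5):
    """Infer a pseudolabel for an unlabeled embedding from its k nearest
    labeled neighbors, with votes weighted by inverse distance (a simplified
    stand-in for LLP's density-weighted voting)."""
    d = np.linalg.norm(labeled_emb - emb_u, axis=1)
    nearest = np.argsort(d)[:k]
    votes = {}
    for idx in nearest:
        w = 1.0 / (d[idx] + 1e-8)          # closer neighbors vote more
        votes[labels[idx]] = votes.get(labels[idx], 0.0) + w
    return max(votes, key=votes.get)

# Toy demo: two labeled clusters; an unlabeled point near cluster 0
# should receive pseudolabel 0.
rng = np.random.default_rng(2)
labeled_emb = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),   # class 0 cluster
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),   # class 1 cluster
])
labels = [0] * 20 + [1] * 20
pseudo = propagate_pseudolabel(np.array([0.2, -0.1]), labeled_emb, labels)
print(pseudo)
```

In the full LLP pipeline these inferred pseudolabels become training targets for the DCNN, alongside the contrastive term that keeps embeddings with different pseudolabels separated.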