Bio-inspired Model of Spatial Cognition
We developed a biologically inspired unsupervised connectionist architecture for grounding spatial terms. This two-layer architecture integrates information from visual and auditory inputs. In the first layer, it employs separate visual ‘what’ and ‘where’ subsystems to represent spatial relations between two objects in 2D space. Images are presented to an artificial retina, and phonologically encoded five-word sentences describing each image serve as auditory inputs. The visual scene is represented by several self-organizing maps (SOMs), and the auditory description is processed by a Recursive SOM that learns to represent sequences topographically. Primary representations from the first layer are unambiguously integrated in a multimodal module (implemented by SOM or ``neural gas'' algorithms) in the second layer. The simulations reveal that separate processing and representation of spatial location and object shape significantly improves the performance of the model. The system is able to bind the proper lexical and visual features without any prior knowledge. The results confirm theoretical assumptions about the different nature of visual and auditory coding, which become efficiently integrated at the multimodal layer.
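The SOMs in the first layer learn topographic representations of their inputs. As a minimal illustration of the kind of self-organization involved (not the model's actual implementation; grid size, decay schedules and the toy 2D "where" data are our assumptions), a basic SOM can be sketched as:

```python
import numpy as np

def train_som(data, grid_w=5, grid_h=5, epochs=50, lr0=0.5, sigma0=2.0):
    """Minimal SOM: one weight vector per grid unit, Gaussian neighborhood."""
    rng = np.random.default_rng(0)
    dim = data.shape[1]
    weights = rng.random((grid_h * grid_w, dim))
    # Grid coordinates of each unit, used by the neighborhood function.
    coords = np.array([(i // grid_w, i % grid_w) for i in range(grid_h * grid_w)])
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data:
            t = step / n_steps
            lr = lr0 * (1 - t)                 # decaying learning rate
            sigma = sigma0 * (1 - t) + 0.5     # shrinking neighborhood radius
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))  # best-matching unit
            d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-d2 / (2 * sigma ** 2))  # Gaussian neighborhood around the BMU
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

# Toy "where"-style input: 2D object positions in the unit square.
rng = np.random.default_rng(1)
positions = rng.random((200, 2))
w = train_som(positions)
```

After training, nearby units respond to nearby positions, which is the topographic property the model relies on.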
In our model, the representations take advantage of the two or three unimodal layers of units. The auditory layer represents unique labels (linguistic terms), whereas the ‘where’ part of the visual system represents fuzzy information about the spatial locations of objects in the external world, and the ‘what’ subsystem captures the shapes and colors of objects in a fixed foveal position. The multimodal level integrates the outputs of these unimodal layers. The grounded meaning is simultaneously represented by all layers (auditory, visual and multimodal), which makes this approach resemble the theory of Peirce, who defined three components of a sign -- representamen, interpretant and the sign itself. Our model represents the sign hierarchically, which guarantees better processing and storage of representations, because the sign (the multimodal level) is modifiable from both modalities (the sequential ``representamen'' auditory level and the parallel ``interpretant'' visual level). This feature makes the units in the higher layer bimodal (i.e.~they can be stimulated by either of the primary layers), and their activation can be forwarded for further processing. Bimodal (and multimodal) neurons are known to be ubiquitous in the association areas of the brain. The multimodal layer is formed by exploiting the concept of self-organized conjunctive representations, which have been hypothesized to exist in the brain for the purpose of binding features such as the various perceptual properties of objects. Here we extend the concept of grounding by linking subsymbolic and symbolic information. Hence, each output unit learns to represent a unique combination of perceptual and symbolic information (which could be forwarded to another, higher module).
Our model proposes an unsupervised solution to visual binding, based on the integration of the ‘what’ and ‘where’ pathways. With respect to the visual binding problem, the model relies on convergent hierarchical coding, also called combination coding. The neurons react only to combinations of features, that is, to an object of a particular shape and color at a particular retinal position (a localist representation). Hierarchical processing implies that increasingly complex features are represented at higher levels of the hierarchy; complex objects and situations are constructed by combining simpler elements. On the other hand, convergent hierarchical coding requires as many binding units as there are distinguishable objects, which can lead to a combinatorial explosion in large-scale simulations. Our model is able to represent 840 combinations, but it can also suffer from combinatorial explosion because we represent pairs of objects, instead of separate entities, in the primary layers. In the case of 10 objects in 5 colors at 4 spatial locations, we would need to represent 2450 object pairs in the primary ‘what’ system, instead of 50 separate objects. It is also possible to add a separate layer for color processing, in which case only 10 objects would be represented in the ‘what’ system (we plan to test this architecture in the future).
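The unit-count arithmetic behind this example can be checked directly; reading the 2450 figure as the number of ordered pairs of distinct object types:

```python
# Unit counts for the pair-based 'what' system (our reading of the figures above).
shapes, colors = 10, 5
types = shapes * colors        # 50 distinct object types (shape x color)
pairs = types * (types - 1)    # 2450 ordered pairs of distinct types
# With a separate color layer, the 'what' system would hold only the 10 shapes.
print(types, pairs)
```

This makes the scaling cost of combination coding concrete: moving from single objects to object pairs multiplies the number of required binding units by roughly the number of types.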
Some authors have raised the question whether the combinatorial explosion is really a problem. The number of objects, scenarios, colors and other features represented in the brain is estimated at approximately 10 million items. This is obviously beyond the limits of current cognitive systems, but it is below the number of neurons in the mammalian visual cortex, so combination coding could be a sufficient method. We could also adopt Neural Modeling Fields (Perlovsky, 2001), an unsupervised learning method based on Gaussian mixture models that arguably does not suffer from combinatorial complexity.
Our model is able to map the words of a sentence with a fixed grammar to the objects in the environment without any prior knowledge (lexical binding). The ability of lexical binding should be considered an extension of symbol grounding. We present sentences as linguistic inputs to be bound with the proper features from the visual subsystem (shape, color, location). Compared to the classic sensorimotor toil experiments based on the grounding of two features, our system is able to ground 5 features simultaneously, which speeds up the process of symbol grounding (faster acquisition of the grounded lexicon). Tikhanoff (2009) proposed an architecture (implemented in the iCub robot) that was able to understand basic sentences, but it was based on supervised learning. Our model is a proof of concept that unsupervised architectures can also find the proper mapping between visual and lexical features. We are able to build representations solely from sensory inputs, arguing that the co-occurrence of inputs from the environment is a sufficient source of information to create an intrinsic representational system.
Physical Model of Mind -- Extension of Neural Modeling Fields Theory
We are developing an extension of Neural Modeling Fields (NMF) theory for an unknown number of models. NMF is a physical model of the mind that uses fuzzy dynamic logic as its learning algorithm. An essential role is played by the knowledge instinct, the ability to find patterns in information without any external supervisor; this knowledge instinct can be formalized as unsupervised cluster analysis. The goal of NMF is to create an architecture similar to the brain. At each level, and at any moment, many concepts compete for evidence. The process is based on adaptive convergence from vague, highly fuzzy concepts to crisp, deterministic ones. NMF has been successfully tested in pattern recognition, tracking, and language acquisition in cognitive robotics.
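The fuzzy-to-crisp convergence can be illustrated on a one-dimensional toy problem. The sketch below is our own simplification of the dynamic-logic idea (two models, Gaussian similarity, a geometric annealing schedule; all function names and parameters are assumptions, not the NMF implementation): model centers start nearly indistinguishable under a large (vague) width, and sharpen as the width anneals.

```python
import numpy as np

def dynamic_logic_1d(x, centers, sigma_start=5.0, sigma_end=0.5, iters=30):
    """Fuzzy-to-crisp association: anneal the model width from vague to crisp,
    re-estimating each model's center from its fuzzy association weights."""
    centers = np.array(centers, float)
    for sigma in np.geomspace(sigma_start, sigma_end, iters):
        # Fuzzy membership of each sample in each model (Gaussian similarity).
        d2 = (x[:, None] - centers[None, :]) ** 2
        f = np.exp(-d2 / (2 * sigma ** 2))
        f /= f.sum(axis=1, keepdims=True) + 1e-12
        # Re-estimate centers as fuzzy-weighted means of the data.
        centers = (f * x[:, None]).sum(axis=0) / (f.sum(axis=0) + 1e-12)
    return centers

# Two clusters; the two models start almost on top of each other.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3, 0.4, 100), rng.normal(3, 0.4, 100)])
c = dynamic_logic_1d(x, centers=[-0.1, 0.1])
```

Early on, with a large sigma, both models respond almost equally to everything (vague concepts); as sigma shrinks, the small initial asymmetry is amplified and each model locks crisply onto one cluster.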
In general NMF theory, it is assumed that we have a rough idea of how many components are in the mixture. Initialization is performed by setting the covariance matrices to large values, while each initialized cluster has a different center. To find the optimal number of components, the algorithm forms a new concept or eliminates an old one after a fixed number of iterations. After the optimal number of clusters has been found, the parameters describing a concept can be changed from a general Gaussian distribution to a more precise one (e.g. a parabolic shape), which further increases the overall log-likelihood.
In our project we propose a novel greedy GMM heuristic method with merging (gmGMM), which is able to deal with an unknown number of components. As part of the method we have designed an estimation-based initialization algorithm that takes into account the automatic attention mechanisms of the human brain and therefore keeps the execution time low. The method is designed so that it can be integrated into NMF theory. We further focus on comparison with other initialization techniques, compare different stopping criteria, and propose enhancements to them.
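To make the "grow, then merge" idea concrete, here is a simplified one-dimensional sketch of a greedy GMM with a merge step. This is not the gmGMM method itself: quantile initialization, BIC as the stopping criterion, and a fixed mean-distance merge threshold are all our illustrative assumptions.

```python
import numpy as np

def em_1d(x, k, iters=60):
    """Plain EM for a 1-D Gaussian mixture with quantile-based initialization."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread-out initial centers
    var = np.full(k, x.var())                      # vague start: large variances
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = p / (p.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means and variances.
        n = r.sum(axis=0) + 1e-12
        w = n / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    p = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return w, mu, np.log(p.sum(axis=1) + 1e-300).sum()

def greedy_gmm(x, k_max=6, merge_tol=0.5):
    """Grow the number of components while BIC improves, then merge close means."""
    best = None
    for k in range(1, k_max + 1):
        w, mu, ll = em_1d(x, k)
        bic = -2 * ll + (3 * k - 1) * np.log(len(x))  # 3k-1 free parameters
        if best is None or bic < best[0]:
            best = (bic, w, mu)
        else:
            break                                     # stop at first BIC increase
    _, w, mu = best
    # Merge step: pool components whose means lie within merge_tol of each other.
    order = np.argsort(mu)
    groups = [[order[0]]]
    for i in order[1:]:
        if mu[i] - mu[groups[-1][-1]] < merge_tol:
            groups[-1].append(i)
        else:
            groups.append([i])
    return np.array([np.average(mu[g], weights=w[g]) for g in groups])

# Toy data: three well-separated clusters.
rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(m, 0.5, 200) for m in (-4.0, 0.0, 4.0)])
means = greedy_gmm(x)
```

The merge step makes the procedure robust to overshooting: if the greedy growth places two components on one cluster, they collapse back into a single concept, which is the role merging plays in gmGMM.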