Grounding of spatial language

Abstract

We propose a bio-inspired unsupervised connectionist architecture and apply it to grounding spatial phrases. The two-layer architecture integrates visual and phonological information by concatenation. In the first layer, the visual pathway employs separate ‘what’ and ‘where’ subsystems that represent the identity and the spatial relations, respectively, of two objects in 2D space. Bitmap images are presented to an artificial retina, and phonologically encoded five-word sentences describing the image (e.g., ‘blue ball above red cup’) serve as the phonological input. The visual scene is thus represented by several self-organizing maps (SOMs), and the phonological description is processed by a Recursive SOM that learns to represent the spatial phrases topographically. The primary representations from the first-layer modules are unambiguously integrated in a multimodal second-layer module, implemented by either the SOM or the ‘neural gas’ algorithm. The system learns to bind the appropriate lexical and visual features without any prior knowledge. The simulations reveal that separate processing and representation of spatial location and object shape significantly improve the performance of the model. We provide quantitative experimental results comparing three models in terms of accuracy.
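The second-layer integration by concatenation can be sketched as an ordinary SOM trained on joined unimodal activation vectors. The following is a minimal illustrative sketch, not the authors' implementation: the `SOM` class, the dimensionalities of the ‘what’, ‘where’, and phonological codes, and the random input data are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

class SOM:
    """Minimal 2D self-organizing map (illustrative sketch only)."""
    def __init__(self, rows, cols, dim):
        self.w = rng.random((rows * cols, dim))
        # Grid coordinates of the units, used by the neighborhood function.
        self.coords = np.array(
            [(r, c) for r in range(rows) for c in range(cols)], dtype=float
        )

    def bmu(self, x):
        # Best-matching unit: index of the weight vector closest to x.
        return int(np.argmin(np.linalg.norm(self.w - x, axis=1)))

    def train(self, data, epochs=20, lr0=0.5, sigma0=2.0):
        for t in range(epochs):
            lr = lr0 * (1 - t / epochs)            # decaying learning rate
            sigma = sigma0 * (1 - t / epochs) + 0.5  # decaying radius
            for x in rng.permutation(data):
                b = self.bmu(x)
                d2 = np.sum((self.coords - self.coords[b]) ** 2, axis=1)
                h = np.exp(-d2 / (2 * sigma ** 2))   # Gaussian neighborhood
                self.w += lr * h[:, None] * (x - self.w)

# Hypothetical unimodal activations for one scene: 'what' (shape identity),
# 'where' (spatial relation), and the phonological code; dimensions invented.
what_dim, where_dim, phon_dim = 8, 4, 12
scenes = rng.random((100, what_dim + where_dim + phon_dim))  # concatenated

multimodal = SOM(6, 6, scenes.shape[1])
multimodal.train(scenes)

# After training, each scene maps to a winner unit on the integrating map.
winners = [multimodal.bmu(x) for x in scenes]
```

Replacing the second-layer `SOM` with a neural-gas variant would change only the neighborhood computation (ranking units by distance to the input instead of using grid coordinates); the concatenated input format stays the same.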


Conclusion

We have created an unsupervised connectionist system that is able to extract invariant attributes and regularities from the environment and link them with abstract symbols. Meaning is represented non-arbitrarily at the conceptual level, which guarantees the correspondence of the internal representational system with the external environment. We can also conclude that it is advantageous to follow the biologically inspired hypothesis that visual information is processed in separate subsystems. A question for future research is how to properly code the outputs of the unimodal layers so as to increase system accuracy and scale up the model. The main advantage of our model is the hierarchical representation of the sign components.