New Manuscript: Resolving References to Objects in Photographs

We have released a new manuscript, “Resolving References to Objects in Photographs using the Words-As-Classifiers Model” (Schlangen, Zarrieß, Kennington, 2015). It takes the model introduced in (Kennington et al. 2015) and (Kennington and Schlangen 2015) and applies it to a (pre-processed) corpus of real-world images. The idea of this model is that (a certain aspect of) the meaning of words can be captured by modelling them as classifiers on perceptual data, where each classifier specifies how well a perceptual representation “fits” the word. In previous work, we tested this on our puzzle domain. With this paper, we’ve moved to a more varied domain (with much more data), which also makes it possible to explore some other things. (For example, it turns out that you can easily derive from the model a representation with which you can do similarity computations as in distributional semantics.)
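To make the "words as classifiers" idea concrete, here is a minimal sketch. It assumes a logistic-regression classifier per word, trained by plain gradient descent on hand-made two-dimensional features; the class name `WordClassifier`, the method `fit_score`, and the toy feature values are all hypothetical and are not taken from the manuscript.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class WordClassifier:
    """Hypothetical per-word classifier: logistic regression over object features."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features
        self.bias = 0.0

    def fit_score(self, features):
        """Return a value in (0, 1): how well the features 'fit' the word."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def train(self, examples, epochs=500, lr=0.5):
        """Stochastic gradient descent on (features, label) pairs.

        label is 1 if the word was used for the object, 0 otherwise.
        """
        for _ in range(epochs):
            for features, label in examples:
                error = self.fit_score(features) - label
                self.weights = [
                    w - lr * error * x for w, x in zip(self.weights, features)
                ]
                self.bias -= lr * error


# Toy, made-up features: [redness, greenness] of an object.
red_examples = [
    ([0.9, 0.1], 1),
    ([0.8, 0.2], 1),
    ([0.1, 0.9], 0),
    ([0.2, 0.7], 0),
]

red = WordClassifier(n_features=2)
red.train(red_examples)

score_red_object = red.fit_score([0.85, 0.15])
score_green_object = red.fit_score([0.15, 0.85])
```

In this sketch, one such classifier would be trained for every word in the vocabulary, so that the "meaning" of a word is whatever its classifier has learned about the perceptual features.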

This being on arXiv is also a bit of an experiment for us, as this is the first time that we’ve released something before it has been “officially” published. We hope that we get some comments this way that we can perhaps incorporate into a version that will be submitted to more traditional venues. So, comment away!

Demo of Words-as-Classifiers Model of Reference Resolution

The short video below shows how the Words-as-Classifiers (WAC) [2,3] model of reference resolution (RR) resolves verbal references made to an object that is visually present and tangible.

The setup is as follows: several Pentomino puzzle tiles (i.e., geometric shapes) are on a table, and a camera is placed above them. The video is fed to computer vision software, which segments the objects and passes a set of low-level features for each object to the WAC model of RR, implemented as a module in InproTK [4]. Another module uses Kaldi to recognize speech from a microphone, and its output is also fed to the WAC module. Using the visual features and the speech, the WAC module determines which object is being referred to by producing a probability distribution over the objects; the object with the highest probability is chosen as the referent.

The module works incrementally, i.e., it processes the utterance word by word, as the words are recognized. The WAC model is trained on examples of similar interactions (i.e., geometric Pentomino tiles and corresponding referring expressions).
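The word-by-word behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the InproTK module itself: the per-word classifiers are stand-in functions, the feature names and values are made up, and combining evidence by multiplying each word's score into the running distribution is one simple choice assumed here.

```python
def normalise(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    if total == 0:
        return {obj: 1.0 / len(scores) for obj in scores}
    return {obj: s / total for obj, s in scores.items()}


def resolve_incrementally(words, objects, classifiers):
    """Yield (word, distribution over objects) after each recognized word.

    objects:     dict mapping object id -> feature dict
    classifiers: dict mapping word -> function(features) -> score in [0, 1]
    Words without a classifier (e.g. "the") leave the distribution unchanged.
    """
    belief = {obj: 1.0 / len(objects) for obj in objects}
    for word in words:
        classifier = classifiers.get(word)
        if classifier is not None:
            belief = normalise(
                {obj: belief[obj] * classifier(feats) for obj, feats in objects.items()}
            )
        yield word, dict(belief)


# Made-up features for three tiles, standing in for the vision output.
objects = {
    "tile_1": {"red": 0.9, "cross": 0.2},
    "tile_2": {"red": 0.8, "cross": 0.9},
    "tile_3": {"red": 0.1, "cross": 0.8},
}

# Stand-in classifiers; in practice these would be trained from data.
classifiers = {
    "red": lambda feats: feats["red"],
    "cross": lambda feats: feats["cross"],
}

steps = list(resolve_incrementally(["the", "red", "cross"], objects, classifiers))
best_after_each_word = [max(dist, key=dist.get) for _, dist in steps]

for word, dist in steps:
    print(word, {obj: round(p, 3) for obj, p in dist.items()})
```

With these toy values, the top-ranked object changes from `tile_1` after "red" to `tile_2` after "cross", which is the kind of revision an incremental resolver makes as more of the utterance arrives.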

This demo corresponds to the one presented by Soledad Lopez at the SemDial conference, which was held in Gothenburg, Sweden, in 2015 [1]. Also part of that demo, but not showcased here, was the module's ability to speak some simple utterances, including a confirmation of whether the object selected by WAC was the intended one.


[1] Kennington, C., Lopez Gambino, M. S., & Schlangen, D. (2015). Real-world Reference Game using the Words-as-Classifiers Model of Reference Resolution. In Proceedings of SemDial 2015 (pp. 188–189).

[2] Kennington, C., & Schlangen, D. (2015). Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution. In Proceedings of ACL. Beijing, China: Association for Computational Linguistics.

[3] Kennington, C., Dia, L., & Schlangen, D. (2015). A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution. In Proceedings of IWCS. Association for Computational Linguistics.

[4] Baumann, T., & Schlangen, D. (2012). The InproTK 2012 Release. In Proceedings of NAACL-HLT.