New Manuscript: Resolving References to Objects in Photographs

We have released a new manuscript, “Resolving References to Objects in Photographs using the Words-As-Classifiers Model” (Schlangen, Zarrieß, and Kennington, 2015). It uses the model introduced in (Kennington et al. 2015) and (Kennington and Schlangen 2015) and applies it to a (pre-processed) corpus of real-world images. The idea of this model is that (a certain aspect of) the meaning of words can be captured by modelling them as classifiers on perceptual data, specifying how well the perceptual representation “fits” the word. In previous work, we tested this on our puzzle domain. With this paper, we’ve moved to a more varied domain (with much more data), which also makes it possible to explore some other things. (For example, it turns out that you can easily derive a representation from this with which you can do similarity computations as in distributional semantics.)
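To make the idea concrete, here is a minimal sketch of the words-as-classifiers setup. Everything in it is invented for illustration: the words, the 3-dimensional “visual features”, and the scorer itself (the paper trains a discriminative classifier per word; this toy stand-in scores by distance to the mean feature vector of the regions a word was observed with).

```python
import math
import random

random.seed(0)

# Hypothetical 3-dim visual features (e.g. mean R, G, B of an object region);
# words and data here are invented for illustration, not taken from the paper.
def sample(mean, n=50):
    return [[random.gauss(m, 0.05) for m in mean] for _ in range(n)]

examples = {"red": sample([0.9, 0.1, 0.1]), "green": sample([0.1, 0.9, 0.1])}

# One scorer per word. The paper trains a classifier per word on positive and
# negative examples; as a toy stand-in, score by distance to the mean feature
# vector of the regions the word was used with.
prototypes = {w: [sum(col) / len(col) for col in zip(*feats)]
              for w, feats in examples.items()}

def fit_score(word, feats):
    """How well does this perceptual representation 'fit' the word? (0..1)"""
    return math.exp(-math.dist(feats, prototypes[word]))

new_obj = [0.88, 0.12, 0.10]  # a reddish object
print(fit_score("red", new_obj) > fit_score("green", new_obj))  # prints True
```

The point is only the interface: each word maps a perceptual representation to a graded “fit” score, and reference resolution then compares these scores across the candidate objects in a scene.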

This being on arXiv is also a bit of an experiment for us, as this is the first time that we’ve released something before it has been “officially” published. We hope that we get some comments this way that we can perhaps incorporate into a version that will be submitted to more traditional venues. So, comment away!

2 thoughts on “New Manuscript: Resolving References to Objects in Photographs”

  1. Very enjoyable paper — thanks!

    Just a few quick comments for now:

    The “words as classifiers” model — you should probably take a look at this paper: D. Skočaj, M. Kristan, A. Vrečko, M. Mahnič, M. Janíček, G.-J. M. Kruijff, M. Hanheide, N. Hawes, T. Keller, M. Zillich, and K. Zhou, “A system for interactive learning in dialogue with a tutor,” in IROS 2011, San Francisco, CA, USA, 25–30 September 2011, pp. 3387–3394.

    - this also tries to learn word meanings as perceptual classifiers.
    There is also a bunch of work in the tradition of visual attribute classification/learning (maybe you know it) that you might need to discuss as related work, e.g.:
    Attribute-Based Classification for Zero-Shot Visual Object Categorization
    CH Lampert, H Nickisch, S Harmeling
    IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

    Describing objects by their attributes
    A Farhadi, I Endres, D Hoiem, D Forsyth, CVPR 2009.

    Learning grounded meaning representations with autoencoders
    C Silberer, M Lapata, Proceedings of ACL 2014, pp. 721–732

    I found figure 3 to be counter-intuitive. I would expect referents to be picked out more accurately as you accumulate more evidence about them. The fact that much of this graph has a negative gradient is a bit of a worry for the model, isn’t it?

    In section 8, figure 4, evaluating the classifiers individually, wouldn’t it be better to look at Precision/Recall/F-score and area under the ROC curve, as in e.g. work on attribute-based vision systems?

    For figure 1, the classifier scores are not probability distributions over the objects – does that matter when you take the average of them to model composition?

    Could you run this sort of model in reverse to generate referring expressions?

    thanks for a very stimulating paper — looking forward to updates!

    • Great comments, thanks a lot, Oliver! Will need to think about them for a bit.

      Re: figure 3: I agree, it is a bit worrying that the quality goes down, and by so much, when more words are added. My theory so far is that adding more words increases the chances that you hit one that is actively hurtful and not just unhelpful. Also, some of these longer expressions should be given more structure, and be composed in other ways. (Averaging really is only supposed to be done for simple NPs.) But all this needs to be investigated more systematically.
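      (For readers following along, the averaging composition under discussion can be sketched as follows. The per-word scores here are made-up numbers standing in for the trained word classifiers’ outputs; the scene and expression are invented for illustration.)

```python
# Hypothetical per-word fit scores for three candidate objects in a scene;
# in the actual model these would come from the trained word classifiers.
word_scores = {
    "green": [0.9, 0.2, 0.7],
    "mug":   [0.8, 0.9, 0.1],
}

def resolve(expression, n_objects=3):
    """Simple-NP composition: average each word classifier's score per
    candidate object, then pick the best-scoring object as the referent."""
    words = [w for w in expression.split() if w in word_scores]
    avg = [sum(word_scores[w][i] for w in words) / len(words)
           for i in range(n_objects)]
    return max(range(n_objects), key=avg.__getitem__)

print(resolve("the green mug"))  # prints 0: best average of "green" and "mug"
```

      This also makes the failure mode visible: one bad word score drags the whole average down, so a single “actively hurtful” classifier in a long expression can flip the referent.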

      Re: generation: stay tuned! :-)
