news 2017 / Q3

To get back (?) into the habit of posting here, a quick couple of news items:

  • Soledad Lopez received the best paper award at SIGdial for her paper “Beyond On-Hold Messages: Conversational Time-Buying in Task-Oriented Dialogue”. (Lopez Gambino MS, Zarrieß S, Schlangen D. Beyond On-Hold Messages: Conversational Time-Buying in Task-Oriented Dialogue. In: Proceedings of SIGdial 2017. 2017. [pdf])
  • Sina Zarrieß will present a demo at INLG (“Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images”, [pdf]), and our paper “Deriving continuous grounded meaning representations from referentially structured multimodal contexts” at EMNLP [pdf].
  • Ting Han had two papers accepted at IJCNLP (upcoming in November): “Natural Language Informs the Interpretation of Iconic Gestures: a Computational Approach”, and “Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task”.
  • .. and finally, Julian Hough has just returned from presenting “Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems” (together with Angelika Maier, on whose MA thesis this is based; [pdf]) at Interspeech and “Grounding Imperatives to Actions is Not Enough: A Challenge for Grounded NLU for Robots from Human-Human Data” at the “Grounding Language Understanding” workshop there ([pdf]).

Papers on grounding and uncertainty for robots presented at HRI conference and HRI Intentions workshop

Robots face a lot of uncertainty about what users want them to do based on their speech, and they also have difficulty knowing when they’ve made their own goals common knowledge with the user. We recently presented a model of how best to make a robot’s goal, and its uncertainty about the user’s goal, common knowledge with the user at the 2017 Conference on Human-Robot Interaction (HRI2017) main conference:

Julian Hough and David Schlangen. It’s Not What You Do, It’s How You Do It: Grounding Uncertainty for a Simple Robot

and in the HRI workshop on intentions:

Julian Hough and David Schlangen. A Model of Continuous Intention Grounding for HRI.

Any comments on this are welcome.

2016 Round-up: Lots of papers, including 2 ‘best paper’ awards

2016 was a successful year for the DSG, with papers accepted at multiple international venues; see our 2016 publications.

We also had 2 ‘best paper’ awards, one at INLG, and one at the EISE workshop at ICMI:

Sina Zarrieß and David Schlangen. Towards Generating Colour Terms for Referents in Photographs: Prefer the Expected or the Unexpected?
In: Proceedings of the 9th International Natural Language Generation conference. Edinburgh, UK: Association for Computational Linguistics: 246–255.

Birte Carlmeyer, David Schlangen, Britta Wrede. Exploring self-interruptions as a strategy for regaining the attention of distracted users
In: Proceedings of the 1st Workshop on Embodied Interaction with Smart Environments – EISE ’16. Association for Computing Machinery (ACM). 

Check out the papers and let us know your thoughts in comments!

Papers at ACL 2016

We are going to present two papers at ACL this year:

  • Zarrieß S, Schlangen D. “Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). [pdf]
  • Schlangen D, Zarrieß S, Kennington C. “Resolving References to Objects in Photographs using the Words-As-Classifiers Model”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). [pdf]

Both are in the area of grounded semantics, and both use our variant of the “words-as-classifiers” model, in which the (perceptual part of the) semantics of a word is modelled as a classifier on perceptual input. These classifiers are trained on referring expressions. The first paper applies the model to generation and shows that we can work around the relative noisiness of the classifiers by following a strategy of generating what Herb Clark calls installments (trial NPs). The second paper is on resolution and shows that this simple model reaches relatively decent performance, even compared to (more) end-to-end deep-learning-based ones.
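To make the “word as classifier” idea a bit more concrete, here is a minimal sketch rather than the code used in the papers. It assumes a logistic-regression classifier per word and toy two-dimensional colour features; the function names, the features and the training settings are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_word_classifier(positives, negatives, epochs=500, lr=0.5):
    """Fit one word's classifier with plain stochastic gradient descent.

    positives: feature vectors of objects the word was used to refer to
    negatives: feature vectors of other objects from the same scenes
    (Hypothetical training setup, chosen only to keep the sketch short.)
    """
    weights = [0.0] * len(positives[0])
    bias = 0.0
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = p - y  # gradient of the log loss w.r.t. the activation
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def score(word_model, features):
    """How well does this perceptual input fit the word (0..1)?"""
    weights, bias = word_model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, features)))


# Toy features: (redness, greenness) of an object.
red = train_word_classifier(
    positives=[(0.9, 0.1), (0.8, 0.2)],
    negatives=[(0.1, 0.9), (0.2, 0.7)],
)
print(score(red, (0.85, 0.1)), score(red, (0.1, 0.8)))
```

The trained classifier for “red” should give a high score to the reddish object and a low score to the greenish one; a full system would train one such classifier per word in the vocabulary.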

Let us know what you think in the comments!

Goodbye, Dr. Kennington!

Last week saw us say goodbye to Casey, who after a short stint as a postdoctoral researcher here has now moved back closer to home, on to a tenure-track position at Boise State, Idaho, USA. Congratulations!

Casey was the first to join the Bielefeld incarnation of the Dialogue Systems Group, so with him and Spyros Kousidis (the second member) now having moved on, it does feel like the start-up phase has concluded. Casey’s had quite the run here, with almost 20 peer-reviewed publications (including 3 full *ACL papers). We’re excited to see what he’ll do next!

(From left to right: Casey, Iwan, Ting, David, Sina, Julian, Soledad; Birte & Simon couldn’t make it on that day.)

We’re going to LREC 2016 with 2 papers!

We’re going to the 10th edition of the Language Resources and Evaluation Conference (LREC) in Portorož (Slovenia) this month, where we will present the following two papers:

Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, Raquel Fernández, and David Schlangen. PentoRef: A Corpus of Spoken References in Task-oriented Dialogues (PUB

Julian Hough, Ye Tian, Laura de Ruiter, Simon Betz, David Schlangen and Jonathan Ginzburg. DUEL: A Multi-lingual Multimodal Dialogue Corpus for Disfluency, Exclamations and Laughter (PUB:

It should be a lot of fun to interact with the linguistic resources community and share the cool stuff we’ve built up over the last few years!

Feel free to ask us (see the corresponding authors’ email addresses at the top of the papers) if you’re interested or need help getting the data.

Tutorial: ASR using InproTK

This tutorial is meant to help you use speech recognition (ASR) in your applications. The kind of ASR we use follows the Incremental Unit dialogue processing model, which produces ASR output that is more timely than traditional approaches. We have modules for Sphinx4, Kaldi and Google ASR systems.
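As a rough illustration of what “incremental” means here: in the Incremental Unit model, a recognizer does not wait for the end of the utterance but keeps revising its partial hypothesis, and downstream modules are told which words were added and which were revoked. The sketch below shows that diffing step in plain Python. It is not InproTK code, and the function name and edit labels are hypothetical.

```python
def diff_hypotheses(previous, current):
    """Return the edits that turn the previous word list into the current one."""
    # Length of the prefix on which both hypotheses agree.
    common = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        common += 1
    # Revoke everything after the shared prefix, most recent word first,
    # then add the words of the new hypothesis.
    edits = [("revoke", word) for word in reversed(previous[common:])]
    edits += [("add", word) for word in current[common:]]
    return edits


print(diff_hypotheses(["take", "the", "bread"], ["take", "the", "red", "cross"]))
# [('revoke', 'bread'), ('add', 'red'), ('add', 'cross')]
```

A consumer of these edits can start working on “take the” right away and only has to undo the part of its analysis that depended on the revoked word.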

If you aren’t sure what InproTK is or how to install it, please refer to the instructions here. Once you have completed that tutorial, you can continue with this one.

(Note that some of the functionality in this tutorial may only exist in the DSG-Bielefeld BitBucket fork of InproTK and will appear in the main fork in the development branch shortly.)

New Manuscript: Resolving References to Objects in Photographs

We have released a new manuscript, “Resolving References to Objects in Photographs using the Words-As-Classifiers Model” (Schlangen, Zarrieß, Kennington, 2015). This uses the model introduced in (Kennington et al. 2015) and (Kennington and Schlangen 2015) and applies it to a (pre-processed) corpus of real-world images. The idea of this model is that (a certain aspect of) the meaning of words can be captured by modelling them as classifiers on perceptual data, specifying how well the perceptual representation “fits” the word. In previous work, we tested this on our puzzle domain. With this paper, we’ve moved to a more varied domain (with much more data), which also makes it possible to explore some other things. (For example, it turns out that you can easily get a representation from this with which you can do similarity computations as in distributional semantics.)
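For the similarity computations mentioned in the parenthesis, one simple possibility is sketched below. It assumes that each word is represented by the weight vector of its classifier and that similarity is measured by cosine; both are assumptions made for illustration, and the manuscript may derive its representations differently. The vectors are invented toy values.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, vectors):
    """Return the other word whose vector is closest to that of `word`."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


# Hypothetical classifier weight vectors, one per word.
word_vectors = {
    "red": [2.0, -1.5, 0.1],
    "crimson": [1.8, -1.2, 0.0],
    "green": [-1.6, 2.1, 0.2],
}
print(most_similar("red", word_vectors))
```

With these toy values, “crimson” comes out as the nearest neighbour of “red”, because its weights point in nearly the same direction.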

This being on arXiv is also a bit of an experiment for us, as this is the first time that we’ve released something before it has been “officially” published. We hope that we get some comments this way that we can perhaps incorporate into a version that will be submitted to more traditional venues. So, comment away!

Demo of Words-as-Classifiers Model of Reference Resolution

The short video below shows how the Words-as-Classifiers (WAC) [2,3] model of reference resolution (RR) resolves verbal references made to an object that is visually present and tangible.

The setup is as follows: several Pentomino puzzle tiles (i.e., geometric shapes) are on a table, and a camera is placed above them. The video is fed to computer vision software, which segments the objects and provides a set of low-level features for each object to the WAC reference resolution module, which is implemented as a module in InproTK [4]. Another module uses Kaldi to recognize speech from a microphone, and its output is also fed to the WAC module. Using the visual features and the speech, the WAC module determines which object is being referred to by producing a probability distribution over the objects; the object with the highest probability is chosen as the referent.

The module works incrementally, i.e., it processes the utterance word by word as the words are recognized. The WAC model is trained on examples of similar interactions (i.e., Pentomino tiles and corresponding referring expressions).
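The word-by-word updating of the distribution over objects can be sketched roughly as follows. This is a toy illustration, not the InproTK module: the classifiers are hand-written stand-ins for trained ones, the symbolic features replace the low-level visual features, and combining the word scores by multiplication is just one possible composition, assumed here for simplicity.

```python
def resolve_incrementally(words, objects, classifiers):
    """Yield (word, distribution over object ids) after each incoming word.

    objects: dict mapping object id -> feature dict
    classifiers: dict mapping word -> function from features to a score in (0, 1)
    """
    scores = {obj_id: 1.0 for obj_id in objects}
    for word in words:
        classifier = classifiers.get(word)
        if classifier is not None:  # words without a classifier change nothing
            for obj_id, features in objects.items():
                scores[obj_id] *= classifier(features)
        total = sum(scores.values())
        yield word, {obj_id: s / total for obj_id, s in scores.items()}


# Hypothetical scene and hand-set classifiers.
objects = {
    "tile1": {"colour": "red", "shape": "cross"},
    "tile2": {"colour": "red", "shape": "L"},
    "tile3": {"colour": "green", "shape": "cross"},
}
classifiers = {
    "red": lambda f: 0.9 if f["colour"] == "red" else 0.1,
    "cross": lambda f: 0.9 if f["shape"] == "cross" else 0.1,
}

steps = list(resolve_incrementally("the red cross".split(), objects, classifiers))
for word, distribution in steps:
    print(word, distribution)

final = steps[-1][1]
print("referent:", max(final, key=final.get))
```

After “the” the distribution is still uniform, after “red” the green tile has dropped out, and after “cross” most of the probability mass sits on tile1, which is then chosen as the referent.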

This demo corresponds to the one presented by Soledad Lopez at the SemDial conference, which was held in Gothenburg, Sweden, in 2015 [1]. Also part of that demo, but not showcased here, was the module’s ability to speak some simple utterances, including asking for confirmation that the object selected by WAC was the intended one.


[1] Kennington, C., Lopez Gambino, M. S., & Schlangen, D. (2015). Real-world Reference Game using the Words-as-Classifiers Model of Reference Resolution. In Proceedings of SemDial 2015 (pp. 188–189).

[2] Kennington, C., & Schlangen, D. (2015). Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution. In Proceedings of ACL. Beijing, China: Association for Computational Linguistics.

[3] Kennington, C., Dia, L., & Schlangen, D. (2015). A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution. In Proceedings of IWCS. Association for Computational Linguistics.

[4] Baumann, T., & Schlangen, D. (2012). The InproTK 2012 Release. In Proceedings of NAACL-HLT.