news 2017 / Q3

To get back (?) into the habit of posting here, a quick couple of news items:

  • Soledad Lopez received the best paper award at SIGdial for her paper “Beyond On-Hold Messages: Conversational Time-Buying in Task-Oriented Dialogue”. (Lopez Gambino MS, Zarrieß S, Schlangen D. Beyond On-Hold Messages: Conversational Time-Buying in Task-Oriented Dialogue. In: Proceedings of SIGdial 2017. 2017. [pdf])
  • Sina Zarrieß will present a demo at INLG (“Refer-iTTS: A System for Referring in Spoken Installments to Objects in Real-World Images”, [pdf]), and our paper “Deriving continous grounded meaning representations from referentially structured multimodal contexts” at EMNLP [pdf].
  • Ting Han had two papers accepted at IJCNLP (upcoming in November): “Natural Language Informs the Interpretation of Iconic Gestures: a Computational Approach”, and “Draw and Tell: Multimodal Descriptions Outperform Verbal- or Sketch-Only Descriptions in an Image Retrieval Task”.
  • .. and finally, Julian Hough has just returned from presenting “Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems” (together with Angelika Maier, on whose MA thesis this is based; [pdf]) at Interspeech and “Grounding Imperatives to Actions is Not Enough: A Challenge for Grounded NLU for Robots from Human-Human Data” at the “Grounding Language Understanding” workshop there ([pdf]).

Papers at ACL 2016

We are going to present two papers at ACL this year:

  • Zarrieß S, Schlangen D. “Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). [pdf]
  • Schlangen D, Zarrieß S, Kennington C. “Resolving References to Objects in Photographs using the Words-As-Classifiers Model”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). [pdf]

Both are in the area of grounded semantics, and both make use of our variant of the “words-as-classifiers” model, where the (perceptual part of the) semantics of a word is modelled as a classifier on perceptual input. These models are trained from referring expressions. The first paper uses this for generation and shows that we can work around the relative noisiness of the classifiers by following a strategy of generating what Herb Clark calls installments (trial NPs). The other paper is on resolution and shows that we can reach relatively decent performance with this simple model, even compared to (more) end-to-end, deep-learning-based ones.
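To make the idea a bit more concrete, here is a minimal sketch of a words-as-classifiers setup. It is not the code from the papers: the two-dimensional toy features, the plain logistic regression trained by SGD, and the averaging of word scores in `resolve` are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_word_classifier(positives, negatives, epochs=200, lr=0.5):
    """Fit one logistic-regression classifier for one word with plain SGD.

    positives: feature vectors of objects the word was used to refer to.
    negatives: feature vectors of objects it was not used for.
    """
    dim = len(positives[0])
    weights, bias = [0.0] * dim, 0.0
    data = [(x, 1.0) for x in positives] + [(x, 0.0) for x in negatives]
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - p
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
            bias += lr * err
    return weights, bias


def word_score(classifier, features):
    """How well does this object 'fit' the word? Returns a value in (0, 1)."""
    weights, bias = classifier
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, features)))


def resolve(expression, classifiers, candidates):
    """Pick the candidate with the highest mean word score (assumed composition)."""
    def fit(features):
        scores = [word_score(classifiers[w], features)
                  for w in expression.split() if w in classifiers]
        return sum(scores) / len(scores) if scores else 0.0
    return max(candidates, key=lambda name: fit(candidates[name]))


# Hypothetical toy features: [redness, horizontal position (0 = left, 1 = right)]
training = {
    "red": ([[0.9, 0.2], [0.8, 0.7], [0.95, 0.5]],
            [[0.1, 0.3], [0.2, 0.8], [0.05, 0.5]]),
    "left": ([[0.9, 0.1], [0.1, 0.2], [0.5, 0.15]],
             [[0.9, 0.9], [0.2, 0.8], [0.5, 0.85]]),
}
classifiers = {w: train_word_classifier(pos, neg)
               for w, (pos, neg) in training.items()}

scene = {"obj_a": [0.9, 0.1], "obj_b": [0.9, 0.9], "obj_c": [0.1, 0.1]}
referent = resolve("red left", classifiers, scene)
print(referent)
```

Each word gets its own classifier, so individual classifiers can be noisy without the whole expression failing; that is roughly the property both papers build on.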

Let us know what you think in the comments!

Goodbye, Dr. Kennington!

Last week saw us say goodbye to Casey, who after a short stint as a postdoctoral researcher here has now moved back closer to home, on to a tenure-track position at Boise State, Idaho, USA. Congratulations!

Casey was the first to join the Bielefeld incarnation of the Dialogue Systems Group, so with him and Spyros Kousidis (the second member) now having moved on, it does feel like the start-up phase has concluded. Casey’s had quite the run here, with almost 20 peer-reviewed publications (including 3 full *ACL papers). We’re excited to see what he’ll do next!

(From left to right: Casey, Iwan, Ting, David, Sina, Julian, Soledad; Birte & Simon couldn’t make it on that day.)

New Manuscript: Resolving References to Objects in Photographs

We have released a new manuscript, “Resolving References to Objects in Photographs using the Words-As-Classifiers Model” (Schlangen, Zarrieß, Kennington, 2015). This uses the model introduced in (Kennington et al. 2015) and (Kennington and Schlangen 2015) and applies it to a (pre-processed) corpus of real-world images. The idea of this model is that (a certain aspect of) the meaning of words can be captured by modelling them as classifiers on perceptual data, specifying how well a perceptual representation “fits” the word. In previous work, we tested this on our puzzle domain. With this paper, we’ve moved to a more varied domain (with much more data), which also makes it possible to explore some other things. (For example, it turns out that you can easily get a representation from this with which you can do similarity computations as in distributional semantics.)
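As a rough illustration of the similarity idea in the parenthesis above: one plausible reading (an assumption here, not something this post spells out) is to treat each word classifier's weight vector as that word's representation and compare words by cosine similarity. The vectors below are made-up values, not trained ones.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, vectors):
    """Return the other word whose vector is closest to that of `word`."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))


# Hypothetical classifier weight vectors, one per word (invented for illustration).
word_vectors = {
    "red": [2.1, -0.3, 0.1],
    "crimson": [1.8, -0.1, 0.2],
    "left": [0.0, 0.2, -2.5],
}

print(most_similar("red", word_vectors))
```

Words whose classifiers respond to similar perceptual properties end up with similar vectors, which is what makes this comparable to similarity in distributional semantics.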

This being on arXiv is also a bit of an experiment for us, as this is the first time that we’ve released something before it has been “officially” published. We hope that we get some comments this way that we can perhaps incorporate into a version that will be submitted to more traditional venues. So, comment away!

5 more 2013 papers!

Timo Baumann and David Schlangen, Interactional Adequacy as a Factor in the Perception of Synthesized Speech, to appear in Proceedings of the 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, August 2013


Speaking as part of a conversation is different from reading out aloud. Speech synthesis systems, however, are typically developed using assumptions (at least implicitly) that are more true of the latter than the former situation. We address one particular aspect, which is the assumption that a fully formulated sentence is available for synthesis. We have built a system that does not make this assumption but rather can synthesize speech given incrementally extended input. In an evaluation experiment, we found that in a dynamic domain where what is talked about changes quickly, subjects rated the output of this system as more ‘naturally pronounced’ than that of a baseline system that employed standard synthesis, despite the synthesis quality objectively being degraded. Our results highlight the importance of considering a synthesizer’s ability to support interactive use-cases when determining the adequacy of synthesized speech.

Casey Kennington and David Schlangen, Situated Incremental Natural Language Understanding Using Markov Logic Networks, to appear in Computer Speech and Language


We present work on understanding natural language in a situated domain in an incremental, word-by-word fashion.
We explore a set of models specified as Markov Logic Networks and show that a model that has access to information about the visual context during an utterance, its discourse context, the words of the utterance, as well as the linguistic structure of the utterance performs best and is robust to noisy speech input. We explore the incremental properties of the models and offer some analysis. We conclude that MLNs provide a promising framework for specifying such models in a general, possibly domain-independent way.

Casey Kennington, Spyros Kousidis and David Schlangen, Interpreting Situated Dialogue Utterances: an Update Model that Uses Speech, Gaze, and Gesture Information, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013


In situated dialogue, speakers share time and space. We present a statistical model for understanding natural language that works incrementally (i.e., in real, shared time) and is grounded (i.e., links to entities in the shared space). We describe our model with an example, then establish that our model works well on non-situated, telephony application-type utterances, show that it is effective in grounding language in a situated environment, and further show that it can make good use of embodied cues such as gaze and pointing in a fully multi-modal setting.

Spyros Kousidis, Casey Kennington and David Schlangen, Investigating speaker gaze and pointing behaviour in human-computer interaction with the collection, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013


Can speaker gaze and speaker arm movements be used as a practical information source for naturalistic conversational human–computer interfaces? To investigate this question, we’ve recorded (with eye tracking and motion capture) a corpus of interactions with a (wizarded) system. In this paper, we describe the recording and analysis infrastructure that we’ve built for such studies, and the particular analyses we performed on these data.
We find that with some initial calibration, a “minimally invasive”, stationary camera-based setting provides data of sufficient quality to support interaction.

Timo Baumann and David Schlangen, Open-ended, Extensible System Utterances Are Preferred, Even If They Require Filled Pauses, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013


In many environments (e.g. sports commentary), situations incrementally unfold over time and often the future appearance of a relevant event can be predicted, but not in all its details or precise timing. We have built a simulation framework that uses our incremental speech synthesis component to assemble in a timely manner complex commentary utterances. In our evaluation, the resulting output is preferred over that from a baseline system that uses a simpler commenting strategy. Even in cases where the incremental system overcommits temporally and requires a filled pause to wait for the upcoming event, the system is preferred over the baseline.

PDFs to follow (or, should I forget to add links here, will be available under “Publications” in any case).

Two more recent papers: Virtual Agents, and Dysfluencies

Paper at DiSS 2013: The 6th Workshop on Disfluency in Spontaneous Speech: Ginzburg, J., Fernández, R., & Schlangen, D. (2013) Self Addressed Questions in Dysfluencies, In: Proceedings of the 6th Workshop on Disfluency in Spontaneous Speech, Stockholm, 2013

Short paper at IVA 2013: van Welbergen, H., Baumann, T., Kopp, S., Schlangen, D. (2013) “Incremental, Adaptive and Interruptive Speech Realization for Fluent Conversation with ECAs”, In: Proceedings of the Thirteenth International Conference on Intelligent Virtual Agents (IVA 2013), Edinburgh, August 2013.

Links to pdfs to follow.

Paper at Interspeech 2013, II: Tools for Multimodal Data Recording, Handling, and Analysis

And the other paper: Tools and Adaptors Supporting Acquisition, Annotation and Analysis of Multimodal Corpora; Spyros Kousidis, Thies Pfeiffer, David Schlangen.


This paper presents a collection of tools (and adaptors for existing tools) that we have recently developed, which support acquisition, annotation and analysis of multimodal corpora. For acquisition, an extensible architecture is offered that integrates various sensors, based on existing connectors (e.g. for motion capturing via VICON, or ART) and on connectors we contribute (for motion tracking via Microsoft Kinect as well as eye tracking via Seeingmachines FaceLAB 5). The architecture provides live visualisation of the multimodal data in a unified virtual reality (VR) view (using Fraunhofer Instant Reality) for control during recordings, and enables recording of synchronised streams. For annotation, we provide a connection between the annotation tool ELAN (MPI Nijmegen) and the VR visualisation. For analysis, we provide routines in the programming language Python that read in and manipulate (aggregate, transform, plot, analyse) the sensor data, as well as text annotation formats (Praat TextGrids). Use of this toolset in multimodal studies proved to be efficient and effective, as we discuss. We make the collection available as open source for use by other researchers.

Bibtex, pdf: here. There’s also a dedicated website for the tool set.

Paper at Interspeech 2013: Cross-Linguistic Study on Turn-Taking

We will be presenting two papers this year at Interspeech (in Lyon). The first is

A cross-linguistic study on turn-taking and temporal alignment in verbal interaction; Spyros Kousidis, David Schlangen, Stavros Skopeteas


That speakers take turns in interaction is a fundamental fact across languages and speaker communities. How this taking of turns is organised is less clearly established. We have looked at interactions recorded in the field using the same task, in a set of three genetically and regionally diverse languages: Georgian, Cabécar, and Fongbe. As in previous studies, we find evidence for avoidance of gaps and overlaps in floor transitions in all languages, but also find contrasting differences between them on these features. Further, we observe that interlocutors align on these temporal features in all three languages. (We show this by correlating speaker averages of temporal features, which has been done before, and further ground it by ruling out potential alternative explanations, which is novel and a minor methodological contribution.) The universality of smooth turn-taking and alignment despite potentially relevant grammatical differences suggests that the different resources that each of these languages make available are nevertheless used to achieve the same effects. This finding has potential consequences both from a theoretical point of view as well as for modeling such phenomena in conversational agents.

BibTex, pdf here.
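For readers wondering what “correlating speaker averages of temporal features” can look like in practice, here is a small sketch. The gap durations are invented, the speaker labels `A` and `B` are placeholders, and the choice of Pearson's r is an assumption; the paper's actual procedure (including the controls for alternative explanations) is not reproduced here.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical gap durations (in seconds) before each speaker's turns,
# one entry per recorded dialogue.
dialogues = [
    {"A": [0.20, 0.30, 0.25], "B": [0.22, 0.28, 0.31]},
    {"A": [0.50, 0.60, 0.55], "B": [0.45, 0.65, 0.52]},
    {"A": [0.10, 0.15, 0.12], "B": [0.18, 0.11, 0.16]},
    {"A": [0.35, 0.40, 0.42], "B": [0.38, 0.33, 0.41]},
]

# One average per speaker per dialogue, then correlate across dialogues.
a_means = [mean(d["A"]) for d in dialogues]
b_means = [mean(d["B"]) for d in dialogues]
r = pearson(a_means, b_means)
print(f"correlation of partner averages: r = {r:.2f}")
```

A high positive r across dialogues is the kind of pattern read as interlocutors aligning on a temporal feature; on its own it does not rule out other explanations, such as both speakers simply reacting to the same task conditions.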