Accepted Papers: SemDial 2014

We have 4 new accepted short papers for the SemDial 2014 conference, which will be in Edinburgh:

Title: Towards Automatic Understanding of ‘Virtual Pointing’ in Interaction
Authors: Ting Han, Spyros Kousidis, David Schlangen

Title: Multimodal Incremental Dialogue with InproTKs
Authors: Casey Kennington, Spyros Kousidis, David Schlangen
Abstract: We present extensions of the incremental processing toolkit InproTK which make it possible to plug in sensors and to achieve situated, real-time, multimodal dialogue. We also describe a new module which enables the use in InproTK of the Google Web Speech API, which offers speech recognition with a very large vocabulary and a wide choice of languages. We illustrate the use of these extensions with a description of two systems handling different situated settings.

Title: Dialogue Structure of Coaching Sessions
Authors: Iwan de Kok, Julian Hough, Cornelia Frank, David Schlangen and Stefan Kopp
Abstract: We report initial findings of the ICSPACE (‘Intelligent Coaching Space’) project on virtual coaching. We describe the gathering of a corpus of dyadic squat coaching interactions and initial high-level models of the structure of these sessions.

Title: The Disfluency, Exclamation and Laughter in Dialogue (DUEL) Project
Authors: Jonathan Ginzburg, David Schlangen, Ye Tian and Julian Hough

Accepted Paper: CoLing 2014

We have a recently accepted paper at the CoLing 2014 conference, which will take place in Dublin, Ireland.
Title: Situated Incremental Natural Language Understanding using a Multimodal, Linguistically-driven Update Model

Authors: Casey Kennington, Spyros Kousidis, David Schlangen

A common site of language use is interactive dialogue between two people
situated together in shared time and space. In this paper, we present a
statistical model for understanding natural human language that works
incrementally (i.e., does not wait until the end of an utterance to begin
processing), and is grounded by linking semantic entities with objects in a
shared space. We describe our model, show how a semantic meaning representation
is grounded with properties of real-world objects, and further show that it can
ground with embodied, interactive cues such as pointing gestures or eye gaze.

Accepted Paper: AutomotiveUI 2014

We have a recently accepted paper at the upcoming AutomotiveUI 2014 conference, which will take place in Seattle, U.S.A.

Title: Better Driving and Recall When In-car Information Presentation Uses Situationally-Aware Incremental Speech Output Generation

Authors: Spyros Kousidis, Casey Kennington, Timo Baumann, Hendrik Buschmeier, Stefan Kopp, David Schlangen

It is by now established that driver distraction is the result of
sharing cognitive resources between the primary task (driving)
and any other secondary task. In the case of holding conversations,
a human passenger who is aware of the driving
conditions can choose to interrupt his/her speech in situations
potentially requiring more attention from the driver; but in-car
information systems typically do not exhibit such sensitivity.
We have designed and tested such a system in a driving
simulation environment. Unlike other systems, our system
delivers information via speech (calendar entries with scheduled
meetings) but is able to react to signals from the environment
to interrupt and subsequently resume its delivery when
the driver needs to be fully attentive to the driving task. Distraction
is measured by a secondary short-term memory task.
In both tasks, drivers perform significantly worse when the
system does not adapt its speech, while they perform equally
well to control conditions (no concurrent task) when the system
intelligently interrupts and resumes.

DUEL project launch- Laughing all the way!

Last month, DUEL (“Disfluencies, exclamations and laughter in dialogue”), a joint project between the Bielefeld DSG and Université Paris Diderot (Paris 7), was launched in Paris.

The project aims to investigate how and why people’s talk is filled with disfluent material such as filled pauses (“um”, “uh”), repairs (e.g. “I, uh, I really want to go”), exclamations such as “oops”, and laughter of all different kinds, from the chortle to the titter.

Traditionally in theoretical linguistics, such phenomena have been regarded as lying outside the human linguistic faculty, a view held since the dawn of the modern field and owing particularly to Chomsky’s early distinction between performance and competence (Chomsky, 1965). However, as Jonathan Ginzburg and our own group head David Schlangen argue in their recent paper, disfluency is analogous to friction in physics: while an idealized theory of language can do without it, one that purports to model what actually happens in dialogue cannot set these frequent phenomena aside.

The project will investigate the interactive contribution of the disfluency and laughter that fill our every conversation through a three-fold attack: empirical observation, theory building and, of course, dialogue system implementation. We will examine how the phenomena vary across languages, and use the insights gained from data analysis and formal modelling to incorporate them into the interpretation and generation components of a working spoken dialogue system. We aim to build a system that can be disfluent in a natural way and is also capable of interactionally appropriate laughter when interacting with users. These are milestones on the way to more natural spoken conversations between humans and machines, which, despite recent questionable press claims of a leap forward, remain a far-from-solved problem.

You can follow the progress of the DUEL project on its new website. Which- uh, I mean, haha, watch this space..


We announce the release of InproTKs, which is a set of extension modules for InproTK that allow for easier integration of multimodal sensors for situated (hence the s) dialogue. Also included in this release is a module that can directly use the Google Web Speech interface, allowing a dialogue system to have access to a large-domain speech recognition engine.

You can read about it in our 2014 SigDial paper. An explanation is given below. Download instructions can be found at the end of this page.

InproTK is an implementation of the incremental unit processing architecture described in Schlangen & Skantze (2011). It comes with modules for several components, including an incremental version of Sphinx ASR and incremental speech synthesis using Mary TTS, as well as some example modules to get you up and running with your own incremental dialogue system.
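
To give a rough feel for what incremental unit processing means, here is a minimal Python sketch. It is not InproTK code (InproTK itself is a Java project) and does not mirror its API. It only assumes that a module's output is built up from small units that can be added, taken back (revoked) while still tentative, and finally committed. The names EditType, IncrementalUnit, Buffer and demo are made up for this illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class EditType(Enum):
    ADD = "add"        # a new unit is hypothesized
    REVOKE = "revoke"  # the latest hypothesis is withdrawn
    COMMIT = "commit"  # current hypotheses are declared final


@dataclass
class IncrementalUnit:
    payload: str
    committed: bool = False


@dataclass
class Buffer:
    """Holds the current hypothesis as a list of incremental units."""
    units: List[IncrementalUnit] = field(default_factory=list)

    def apply(self, edit: EditType, payload: str = "") -> None:
        if edit is EditType.ADD:
            self.units.append(IncrementalUnit(payload))
        elif edit is EditType.REVOKE:
            # Only the most recent, still uncommitted unit can be taken back.
            if self.units and not self.units[-1].committed:
                self.units.pop()
        elif edit is EditType.COMMIT:
            for unit in self.units:
                unit.committed = True

    def text(self) -> str:
        return " ".join(unit.payload for unit in self.units)


def demo() -> str:
    buf = Buffer()
    buf.apply(EditType.ADD, "take")
    buf.apply(EditType.ADD, "the")
    buf.apply(EditType.ADD, "red")    # first guess of a (pretend) recognizer
    buf.apply(EditType.REVOKE)        # guess withdrawn as more audio arrives
    buf.apply(EditType.ADD, "green")
    buf.apply(EditType.COMMIT)
    return buf.text()                 # "take the green"
```

A consumer of such a buffer can start acting on "take the" before the utterance is finished, and only has to undo work for the single revoked word.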

The extensions released with InproTKs make it possible to get information into InproTK from outside sources (even from network devices on various platforms), as well as to send information from InproTK out to other modules.

Below is an example. A human user can interact with the system using speech and gestures. InproTK provides the ASR (speech recognition) as a native module. Using the extensions in InproTKs, information from a motion sensor (such as a Microsoft Kinect) can be fed into an external processing module which can, for example, detect certain gesture types, such as the user pointing to an object. That information can then be sent to InproTKs using the extension modules (denoted in the figure as a Listener), where it can help the dialogue system make decisions. Information can also be output to external modules (such as the logger in the figure).


There are three methods which one can use to get information into and out of InproTK: XmlRpc, the Robotics Service Bus, and InstantReality. InproTKs has modules for each method: modules for getting data into InproTKs, known as Listeners, and modules for getting data out of InproTKs, known as Informers. Each method is briefly explained below.

XmlRpc is a remote procedure call protocol that can be used to send information across processes (potentially running on different machines). Libraries exist for most programming languages.
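
As a rough illustration of the idea, the sketch below sends one key/value update from one piece of code to another over XML-RPC, using only Python's standard xmlrpc modules. It is not the InproTKs interface: the method name push, the key/value payload and the function run_demo are hypothetical, and the receiving side here is a stand-in written in Python rather than an InproTKs Listener. The README in the package describes the real setup.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def run_demo() -> list:
    received = []  # stands in for whatever the receiving module does with the data

    def push(key, value):
        """Hypothetical remote method: store one key/value update."""
        received.append((key, value))
        return True

    # Receiving side. Port 0 asks the operating system for any free port.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_function(push, "push")
    port = server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()

    try:
        # Sending side, e.g. an external gesture detector in another process.
        client = ServerProxy(f"http://127.0.0.1:{port}/")
        client.push("gesture", "pointing")
    finally:
        server.shutdown()
        server.server_close()

    return received
```

In a real setup the two sides would run in separate processes, possibly on different machines, and the address and port would be fixed in the configuration of both.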

Robotics Service Bus (from their website) “The Robotics Service Bus (RSB) is a message-oriented, event-driven middleware aiming at scalable integration of robotics systems in diverse environments. Being fundamentally a bus architecture, RSB structures heterogeneous systems of service providers and consumers using broadcast communication over a hierarchy of logically unified channels instead of a large number of point-to-point connections. Nevertheless RSB comes with a collection of communication patterns and other tools for structuring communication, but does not require a particular functional or decomposition style.”

RSB has bindings for Java, Python, and C/C++.

InstantReality (from their website) “The instantreality-framework is a high-performance Mixed-Reality (MR) system, which combines various components to provide a single and consistent interface for AR/VR developers.”

InstantReality only handles Java and C++, but can be used directly with the InstantReality player where one can create a virtual reality scene.

Download Instructions

InproTKs can be found in the InproTK git repository, on the “develop” branch.

Open a terminal and run the following two commands:
>git clone https://[your-user-name]
>git fetch && git checkout develop

This will reveal the package in the src folder. The package contains a README file that explains the examples. You will need the instantreality.jar from the InstantReality website (simply download and install the software and you will find the jar included) as well as the protobuf jar from the Google protobuf website. (Eclipse users can import InproTK as a project; note that the package might be excluded from the build because it relies on a jar that we cannot distribute; please add the two jars mentioned above to the classpath and include the package.)

Addendum to SigDial 2013 Long Paper

We noticed a mistake in the final derivation of the model in our 2013 SigDial paper, Interpreting Situated Dialogue Utterances: an Update Model that Uses Speech, Gaze, and Gesture Information. In the paper, equation 2 stands as:


However, P(U|R) can be rewritten as P(U,R) / P(R), which would cancel out P(U|R) entirely and leave no mention of U in the model. That was certainly not our intention, nor is it how our model actually works.

Instead, R should have been kept on the left-hand side and then marginalized, resulting in an accurate depiction of our model:


This changes neither the arguments made in the paper nor the results. We apologize for any confusion this may have caused.
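
For readers who want to see the general shape of such a step: the following is only a generic sketch of keeping R on the left-hand side and then marginalizing it out, not a reproduction of either equation from the paper. The symbol I (standing for whatever is being inferred from U) and the assumption that U depends on I only through R are introduced purely for this illustration.

```latex
\begin{align*}
P(I, R \mid U) &= \frac{P(U \mid I, R)\, P(R \mid I)\, P(I)}{P(U)} \\
               &= \frac{P(U \mid R)\, P(R \mid I)\, P(I)}{P(U)}
                  && \text{(assuming } U \perp I \mid R\text{)} \\
P(I \mid U)    &= \sum_{r} P(I, R = r \mid U)
                = \frac{P(I)}{P(U)} \sum_{r} P(U \mid R = r)\, P(R = r \mid I)
\end{align*}
```

In this form U still appears inside the sum through P(U | R = r), so nothing cancels it out.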


Error on Ubuntu 12.04 with InstantPlayer and FaceLab

If you are attempting to use InstantReality’s InstantPlayer as a 3D environment, and you are using Seeing Machines’s FaceLab as an eye tracker (via a FaceLab plugin), then you may have run into the “undefined symbol” problem while running InstantPlayer.

This is a mangled C++ symbol name, which can be demangled with “c++filt” (pass the mangled string as an argument). That gives the following:


That is, at runtime the above constructor cannot be found in the dynamically linked library, even if that library is on the correct path. The problem is that when the FaceLab node was compiled, it required the coredata sdk.h file, which pointed to the other .h files. More specifically, at compile time the FaceLab plugin for InstantPlayer only saw a declaration of the constructor and not its definition, so at runtime the constructor is never found. A fix for this is to do the following.

1. Open and edit the coredata/include/eod/io/sdk.h file
2. Change the #include references for three .h files, namely:
3. Instead, use their .cpp counterparts:
i.e., #include "../src/eod/io/datagramsocket.cpp"
and do the same for the other two files.
4. Recompile the coredata (you may need to run “clean” first):
>make -f Makefile.linux32_ia32 all
This will build new libraries in which the constructor is accessible. This is our proposed fix until the Seeing Machines developers find a better solution.

5 more 2013 papers!

Timo Baumann and David Schlangen, Interactional Adequacy as a Factor in the Perception of Synthesized Speech, to appear in Proceedings of the 8th ISCA Speech Synthesis Workshop, Barcelona, Spain, August 2013


Speaking as part of a conversation is different from reading out aloud.
Speech synthesis systems, however, are typically developed using assumptions
(at least implicitly) that are more true of the latter than the former situation.
We address one particular aspect, which is the assumption that
a fully formulated sentence is available for synthesis.
We have built a system that does not make this assumption
but rather can synthesize speech given incrementally extended input.
In an evaluation experiment, we found that in a dynamic domain
where what is talked about changes quickly,
subjects rated the output of this system as more ‘naturally pronounced’ than that
of a baseline system that employed standard synthesis,
despite the synthesis quality objectively being degraded.
Our results highlight the importance of considering a synthesizer’s ability to
support interactive use-cases when determining the adequacy of synthesized speech.

Casey Kennington and David Schlangen, Situated Incremental Natural Language Understanding Using Markov Logic Networks, to appear in Computer Speech and Language


We present work on understanding natural language in a situated domain in an incremental, word-by-word fashion.
We explore a set of models specified as Markov Logic Networks and show that a model that has access to information about the visual context during an utterance, its discourse context, the words of the utterance, as well as the linguistic structure of the utterance performs best and is robust to noisy speech input. We explore the incremental properties of the models and offer some analysis. We conclude that MLNs provide a promising framework for specifying such models in a general, possibly domain-independent way.

Casey Kennington, Spyros Kousidis and David Schlangen, Interpreting Situated Dialogue Utterances: an Update Model that Uses Speech, Gaze, and Gesture Information, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013


In situated dialogue, speakers share time and space. We present a statistical model for understanding natural language that works incrementally (i.e., in real, shared time) and is grounded (i.e., links to entities in the shared space). We describe our model with an example, then establish that our model works well on non-situated, telephony application-type utterances, show that it is effective in grounding language in a situated environment, and further show that it can make good use of embodied cues such as gaze and pointing in a fully multi-modal setting.

Spyros Kousidis, Casey Kennington and David Schlangen, Investigating speaker gaze and pointing behaviour in human-computer interaction with the collection, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013


Can speaker gaze and speaker arm movements be used as a practical information source for naturalistic conversational human–computer interfaces? To investigate this question, we’ve recorded (with eye tracking and motion capture) a corpus of interactions with a (wizarded) system. In this paper, we describe the recording and analysis infrastructure that we’ve built for such studies, and the particular analyses we performed on these data.
We find that with some initial calibration, a “minimally invasive”, stationary camera-based setting provides data of sufficient quality to support interaction.

Timo Baumann and David Schlangen, Open-ended, Extensible System Utterances Are Preferred, Even If They Require Filled Pauses, to appear in Proceedings of SIGdial 2013, Metz, France, August 2013


In many environments (e.g. sports commentary), situations incrementally unfold over time and often the future appearance of a relevant event can be predicted, but not in all its details or precise timing. We have built a simulation framework that uses our incremental
speech synthesis component to assemble in a timely manner complex commentary
utterances. In our evaluation, the resulting output is preferred over that from a baseline system that uses a simpler commenting strategy. Even in cases where the incremental
system overcommits temporally and requires a filled pause to wait for
the upcoming event, the system is preferred over the baseline.

PDFs to follow (or, should I forget to add links here, will be available under “Publications” in any case).

Two more recent papers: Virtual Agents, and Dysfluencies

Paper at DiSS 2013: The 6th Workshop on Disfluency in Spontaneous Speech: Ginzburg, J., Fernández, R., & Schlangen, D. (2013) Self Addressed Questions in Dysfluencies, In: Proceedings of the 6th Workshop on Disfluency in Spontaneous Speech, Stockholm, 2013

Short paper at IVA 2013: van Welbergen, H., Baumann, T., Kopp, S., Schlangen, D. (2013) “Incremental, Adaptive and Interruptive Speech Realization for Fluent Conversation with ECAs”, In: Proceedings of the Thirteenth International Conference on Intelligent Virtual Agents (IVA 2013), Edinburgh, August 2013.

Links to pdfs to follow.