Several DSG Members are off to Gothenburg for the SemDial 2015 conference

Several long papers have been accepted:

Hough, Julian; de Kok, Iwan; Schlangen, David; Kopp, Stefan. Timing and Grounding in Motor Skill Coaching Interaction: Consequences for the Information State

Han, Ting; Kennington, Casey; Schlangen, David. Building and Applying Perceptually-Grounded Representations of Multimodal Scene Descriptions

As well as some demo papers:

Kennington, Casey; Lopez Gambino, Maria Soledad; Schlangen, David. Real-world Reference Game using the Words-as-Classifiers Model of Reference Resolution

de Kok, Iwan; Hough, Julian; Hülsmann, Felix; Waltema, Thomas; Botsch, Mario; Schlangen, David; Kopp, Stefan. Demonstrating the Dialogue System of the Intelligent Coaching Space

Forced Alignment with InproTK (and Sphinx)

Forced alignment is the task of determining start and end times for words within an audio file, given a reference text. For example, if I record myself saying the words “hello world” to a wav file, I can then pass that wav file and the text “hello world” to a forced-alignment program, and it will tell me at what point in the file each of the two words starts and ends.

In this tutorial, we will use InproTK (which in turn uses CMU’s Sphinx4 speech recognizer) to perform the forced alignment (henceforth, FA). We will see how it is done with InproTK’s SimpleReco, then we will see how one can do FA in other languages. Below, two common problems are addressed.

A previous post explained how to “install” InproTK in Eclipse. Please refer to that post for downloading InproTK and getting it to run (it might not hurt to check out the develop branch). Once you have it working as a project in Eclipse, continue with the steps below.

FA in InproTK is quite simple. In Eclipse, navigate to src, then inpro.apps.SimpleReco, and open it. The Java source file should appear in the editor. Click on the little arrow next to the green play button at the top, then Run As -> Java Application. You should see a bit of output in the console explaining what the command line parameters should be.

The command line parameters that are necessary for FA are:
-F <URI to audio file>
-fa "<reference text>"
-L -Lp <path to inc_reco output file>

In order to test FA on your system, please download this: audio. Put it in a folder that you know the path to.

In Eclipse, click on the arrow next to the Run icon, then click on “Run Configurations”. Find SimpleReco under Java Applications and click on it. Then click on the “Arguments” tab. In the Program arguments box, copy in the following, then change the two paths within the brackets to reflect your own setup:

-F file://<path to audio>/audio.wav -fa "der rote stricht oben rechts richtig" -L -Lp <path for output file>

Click “Apply” then “Run”.  You should see output in the command line. You should also see an inc_reco file appear in the <path for output file>. Open that file and you will see lines that look like this:

Time: 6.80
0.000    2.930    <sil>
2.930    3.450    der
3.450    3.990    rote
3.990    4.200    <sil>
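The inc_reco output shown above is easy to post-process. Here is a minimal sketch in Python (the function name and skipping rules are my own, assuming the simple three-column start/end/word layout shown, ignoring `Time:` header lines and `<sil>` fillers):

```python
def parse_inc_reco(text):
    """Collect (start, end, word) triples from inc_reco-style text.

    Lines that do not have exactly three columns (e.g. 'Time: 6.80'
    headers) are skipped, as are <sil> filler tokens.
    """
    alignment = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue
        start, end, word = parts
        try:
            start, end = float(start), float(end)
        except ValueError:
            continue
        if word == "<sil>":
            continue
        alignment.append((start, end, word))
    return alignment

example = """Time: 6.80
0.000    2.930    <sil>
2.930    3.450    der
3.450    3.990    rote
"""
print(parse_inc_reco(example))  # → [(2.93, 3.45, 'der'), (3.45, 3.99, 'rote')]
```

Note that an inc_reco file may contain several incremental `Time:` blocks; the sketch simply collects every aligned word line it sees, so you may want to split the file into blocks first and parse only the final one.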

What if I want to perform FA on many files? You can call InproTK from the command line, e.g. from a shell or Python script. Make sure all the necessary jars in the lib folder are on the classpath. For example:

java -classpath inprotk/lib/oaa2/*:inprotk/lib/common-for-convenience/*:inprotk/lib/*:inprotk/lib/sphinx4/* inpro.apps.SimpleReco -c file:inprotk/config.xml -F file:<path to audio file> -fa "<reference text>" -L -Lp <path to inc_reco output file>
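To script this over many files, the command above can be wrapped in a few lines of Python (the classpath, config path, and file names are placeholders for your own setup):

```python
import subprocess
from pathlib import Path

# Placeholders: adjust to your own InproTK checkout.
CLASSPATH = "inprotk/lib/oaa2/*:inprotk/lib/common-for-convenience/*:inprotk/lib/*:inprotk/lib/sphinx4/*"
CONFIG = "file:inprotk/config.xml"

def build_fa_command(wav_path, reference_text, out_path):
    """Assemble the SimpleReco forced-alignment command line."""
    return [
        "java", "-classpath", CLASSPATH, "inpro.apps.SimpleReco",
        "-c", CONFIG,
        "-F", Path(wav_path).resolve().as_uri(),  # SimpleReco expects a URI
        "-fa", reference_text,
        "-L",
        "-Lp", str(out_path),
    ]

def force_align(wav_path, reference_text, out_path):
    """Run one forced-alignment pass; raises if SimpleReco fails."""
    subprocess.run(build_fa_command(wav_path, reference_text, out_path),
                   check=True)

# e.g., align every wav in a folder, given a dict of reference texts:
# for wav in Path("recordings").glob("*.wav"):
#     force_align(wav, transcripts[wav.stem], wav.with_suffix(".inc_reco"))
```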

FA with InproTK in English

The above example used audio from a German speaker. What if I want to use this in another language? No problem. You will, of course, need three things for that language: an acoustic model, a language model, and a dictionary. All three kinds of models exist for various languages. We will not look at how to create models here; rather, we will see an example of how an already existing model for English can be used in InproTK.

In the inpro.apps folder, you will see a config.xml. In that file, there are references to other xml configuration files. Note the sphinx-de.xml. Replacing the information in that file with the information for the English model will tell Sphinx/InproTK where to find it. To do so, please change sphinx-de.xml to sphinx-en.xml and save the config.xml file.

Fortunately, InproTK already has the WSJ acoustic models and dictionary. You will also need a language model file, which you can download here. Download the wsj5kc.Z.DMP file and put it into InproTK’s res folder. Next, create a file called sphinx-en.xml and paste the following contents into it:

<property name="wordInsertionProbability" value="0.7"/>
<property name="languageWeight" value="11.5"/>

<!-- ******************************************************** -->
<!-- The Grammar configuration                                -->
<!-- ******************************************************** -->
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
    <property name="dictionary" value="dictionary"/>
    <property name="grammarLocation" value="file:src/inpro/domains/pentomino/resources/"/>
    <property name="grammarName" value="pento"/>
    <property name="logMath" value="logMath"/>
</component>

<component name="ngram" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="dictionary" value="dictionary"/>
    <property name="logMath" value="logMath"/>
    <!-- this can be overridden with the -lm switch in SimpleReco -->
    <property name="location" value="file://res/wsj5kc.Z.DMP"/>
    <property name="maxDepth" value="3"/>
    <property name="unigramWeight" value=".7"/>
</component>

<!-- this second ngram model can be used to blend multiple language models using interpolatedLM defined in sphinx.xml -->
<component name="ngram2" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
    <property name="dictionary" value="dictionary"/>
    <property name="logMath" value="logMath"/>
    <property name="location" value="file://res/wsj5kc.Z.DMP"/>
    <property name="maxDepth" value="3"/>
    <property name="unigramWeight" value=".7"/>
</component>

<!-- ******************************************************** -->
<!-- The Dictionary configuration                             -->
<!-- ******************************************************** -->
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FullDictionary">
    <property name="dictionaryPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
    <property name="fillerPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
    <!--
    <property name="addSilEndingPronunciation" value="false"/>
    <property name="allowMissingWords" value="true"/>
    <property name="createMissingWords" value="true"/>
    <property name="g2pModelPath" value="file:///home/timo/uni/projekte/inpro-git/res/Cocolab_DE-g2p-4.fst.ser"/>
    <property name="g2pMaxPron" value="2"/>
    -->
    <property name="unitManager" value="unitManager"/>
</component>

<!-- ******************************************************** -->
<!-- The acoustic model and unit manager configuration        -->
<!-- ******************************************************** -->
<component name="sphinx3Loader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
    <property name="logMath" value="logMath"/>
    <property name="unitManager" value="unitManager"/>
    <property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
    <property name="modelDefinition" value="etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef"/>
    <property name="dataLocation" value="cd_continuous_8gau/"/>
</component>

Now InproTK can perform FA on English audio. Go back to the SimpleReco run configuration (arrow next to the Run button -> Run Configurations -> SimpleReco), point the command line arguments at an English audio file, and change the reference text to reflect that audio. You can change back to German by opening the config.xml file and replacing sphinx-en.xml with sphinx-de.xml.

Problem! I don’t see the inc_reco output file anywhere! It could be that your LabelWriter isn’t configured properly. In inpro.apps you will find iu-config.xml. Open it, find labelWriter2, and make sure its component definition looks like this:

<component name="labelWriter2" type="inpro.incremental.sink.LabelWriter">
    <property name="writeToStdOut" value="false"/>
    <property name="writeToFile" value="true"/>
</component>

If not, copy the above text over the original labelWriter2 definition and save the iu-config.xml file.

Problem! My output says “Can’t find pronunciation for <word>”. That means one of the words in your reference text does not exist in the dictionary. The dictionary for the German model is in the lib/Cocolab_DE_8gau_13dCep_16k_40mel_130Hz_6800Hz jar, in the Cocolab_DE_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict folder. You can view its contents in Eclipse by going to the InproTK project, then Referenced Libraries, then Cocolab_DE_…, then dict, then Cocolab_DE.lex. Here is an example of the contents:

zog    t s o: k
Sekretär    z e k r e t E: 6
tanzen    t a n t s n
tanzen(1)    t a n t s @ n
Ladentür    l a: d n t y: 6
kochten    k O x t n

where each line has a word, a tab, then the phonetic transcription. To add words to the dictionary (ones that appear in the above-mentioned console warning), add the word at the bottom, press tab, then add a phonetic transcription. (We will not cover how a phonetic transcription is made here; for now, find other words that look similar to the one you are adding and form the transcription based on those.) You may need to open the jar with an archive tool, add the missing word and transcription, save the dictionary file, then re-make the jar and replace the old one. Alternatively, you can make your own dictionary file and set the dictionary path to that file in sphinx-de.xml.
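To find out in advance which words of a reference text will trigger this warning, you can scan the lexicon file before running FA. A sketch in Python (the function name and the handling of `tanzen(1)`-style variant entries are my own, assuming the tab-separated two-column format shown above; note the lookup is case-sensitive, like the dictionary itself):

```python
def missing_words(reference_text, dict_lines):
    """Return the words of reference_text with no dictionary entry.

    Variant entries like 'tanzen(1)' are reduced to their base form.
    """
    known = set()
    for line in dict_lines:
        if not line.strip():
            continue
        # first tab-separated column is the word; strip '(n)' variants
        word = line.split("\t", 1)[0].split("(", 1)[0]
        known.add(word)
    return [w for w in reference_text.split() if w not in known]

dict_lines = ["zog\tt s o: k", "tanzen\tt a n t s n", "tanzen(1)\tt a n t s @ n"]
print(missing_words("tanzen zog kochten", dict_lines))  # → ['kochten']
```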

2 Interspeech 2015 Papers accepted!

We had two papers accepted to the Interspeech Conference in Dresden in August:

Title: Micro-Structure of Disfluencies: Basics for Conversational Speech Synthesis

Authors: Simon Betz, Petra Wagner and David Schlangen

Abstract: Incremental dialogue systems can produce fast responses and can interact in a human-like fashion. However, these systems occasionally produce erroneous material or run out of things to say. Humans in such situations use disfluencies to remedy their ongoing production and signal this to the listener. We devised a new model for inserting disfluencies into synthesis and evaluated this approach in a perception test. It showed that lengthenings and silent pauses can be built for speech synthesis with low effort and high output quality. Synthesized word fragments and filled pauses, while potentially useful in incremental dialogue systems, appear more difficult to handle for listeners. While we were able to get consistently high ratings for certain types of disfluencies, the need for more basic research on their micro structure became apparent in order to be able to synthesize the fine phonetic detail of disfluencies. For this, we analysed corpus data with regard to distributional and durational aspects of lengthenings, word fragments and pauses. Based on these natural speaking strategies, we explored further to what extent speech can be delayed using disfluency strategies, and how to handle difficult disfluency elements by determining the appropriate amount of durational variation applicable.


Title: Recurrent Neural Networks for Incremental Disfluency Detection

Authors: Julian Hough and David Schlangen

Abstract: For dialogue systems to become robust, they must be able to detect disfluencies accurately and with minimal latency. To meet this challenge, here we frame incremental disfluency detection as a word-by-word tagging task and, following their recent success in Spoken Language Understanding tasks, we test the performance of Recurrent Neural Networks (RNNs). We experiment with different inputs for RNNs to explore the effect of context on their ability to detect edit terms and repair disfluencies effectively, and also experiment with different tagging schemes. Although not eclipsing the state of the art in terms of utterance-final performance, RNNs achieve good detection results, requiring no feature engineering and using simple input vectors representing the incoming utterance as their training input. Furthermore, RNNs show very good incremental properties with low latency and very good output stability, surpassing previously reported results in these measures.


Accepted Paper: ACL 2015

We have a paper accepted at the ACL conference, which will take place in Beijing, China this year.

Title: Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution

Authors:  Casey Kennington & David Schlangen

Abstract: An elementary way of using language is to refer to objects. Often, these objects are physically present in the shared environment and reference is done via mention of perceivable properties of the objects. This is a type of language use that is modelled well neither by logical semantics nor by distributional semantics, the former focusing on inferential relations between expressed propositions, the latter on similarity relations between words or phrases. We present an account of word and phrase meaning that is perceptually grounded, trainable, compositional, and “dialogue-plausible” in that it computes meanings word-by-word. We show that the approach performs well (with an accuracy of 65% on a 1-out-of-32 reference resolution task) on direct descriptions and target/landmark descriptions, even when trained with less than 800 training examples and automatically transcribed utterances.

Accepted Paper: NAACL 2015

We have a paper accepted at the upcoming NAACL 2015 conference, which will be held in Denver, CO, USA.

Title:  Incrementally Tracking Reference in Human/Human Dialogue Using Linguistic and Extra-Linguistic Information

Authors: Casey Kennington, Ryu Iida, Takenobu Tokunaga, David Schlangen

Abstract: A large part of human communication involves referring to entities in the world, and often these entities are objects that are visually present for the interlocutors. A system that aims to resolve such references needs to tackle a complex task: objects and their visual features need to be determined, the referring expressions must be recognised, and extra-linguistic information such as eye gaze or pointing gestures need to be incorporated. Systems that can make use of such information sources exist, but have so far only been tested under very constrained settings, such as WOz interactions. In this paper, we apply to a more complex domain a reference resolution model that works incrementally (i.e., word for word), grounds words with visually present properties of objects (such as shape and size), and can incorporate extra-linguistic information. We find that the model works well compared to previous work on the same data, despite using fewer features. We conclude that the model shows potential for use in a real-time interactive dialogue system.

Accepted Papers: IWCS 2015

We have two papers accepted at the IWCS conference, which will take place in London, UK.

Title: Incremental Semantics for Dialogue Processing: Requirements, and a Comparison of Two Approaches
Authors: Julian Hough, Casey Kennington, David Schlangen and Jonathan Ginzburg
Abstract: Truly interactive dialogue systems need to construct meaning on at least a word-by-word basis. We propose desiderata for incremental semantics for dialogue models and systems, a task not heretofore attempted thoroughly. After laying out the desirable properties we illustrate how they are met by current approaches, comparing two incremental semantic processing frameworks: Dynamic Syntax enriched with Type Theory with Records (DS-TTR) and Robust Minimal Recursion Semantics with incremental processing (RMRS-IP). We conclude these approaches are not significantly different with regards to their semantic representation construction, however their purported role within semantic models and dialogue models is where they diverge.


Title: A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution
Authors: Casey Kennington, Livia Dia, David Schlangen
Abstract: A large part of human communication involves referring to entities in the world, and often these entities are objects that are visually present for the interlocutors. A computer system that aims to resolve such references needs to tackle a complex task: objects and their visual features need to be determined, the referring expressions must be recognised, extra-linguistic information such as eye gaze or pointing gestures need to be incorporated — and the intended connection between words and world must be reconstructed. In this paper, we introduce a discriminative model of reference resolution that processes incrementally (i.e., word for word), is perceptually-grounded in the world, and improves when interpolated with information from gaze and pointing gestures. We evaluated our model and found that it performed robustly in a realistic reference resolution task, when compared to a generative model.

Intro to InproTK

InproTK has been around for several years and is becoming more widely used in dialogue processing research. It follows the Incremental Unit framework of incremental dialogue processing.

The toolkit has been written in Java, primarily by Timo Baumann. In this intro, we will show you how to get InproTK up and running with speech recognition (incremental Sphinx, part of InproTK), a dialogue manager (OpenDial, by Pierre Lison; OpenDial can also be used as a fully-functional end-to-end dialogue system, with recently added incremental capabilities, but we will use only its dialogue management capabilities here), and a text-to-speech interface (incremental MaryTTS, also part of InproTK).

This intro has two parts. The first part is a rough guide on getting everything “installed” so you can see a working InproTK project called myds. The following video steps you through the myds project code and explains how it all works together. You do not need to install everything to watch the video, but it certainly helps.

Installing InproTK

InproTK is written in Java, so “installing” it means getting the jar (and necessary libraries), or getting the source code. It is advisable to get the source code.

You can get the source code from Bitbucket by running the following command (sign up for a free Bitbucket account if you have not already done so):

  • git clone https://[your username]

The rest of this post will explain how to use InproTK within Eclipse using a sample Eclipse project.

  • Download Eclipse, unzip, run (you may need to install a Java JDK; make sure the version is above 1.5 but lower than 1.8), and pick a workspace

  • Import the InproTK project:

    • in Eclipse: File -> Import -> Existing Projects into Workspace -> Browse -> (find where you cloned the InproTK project)

    • You should see inprotk as a project in Eclipse. Make sure there are no errors (errors are denoted by red marks); you can click on the “Problems” tab to see where they are. Make sure all the libraries are included in the project: click Project -> Properties -> Java Build Path -> Libraries -> (select the jar files in the inprotk lib folder)

  • Download, unzip, and copy the myds project, then import it in the same way as you did with inprotk

  • right-click on myds, go to Properties -> Java Build Path -> Projects -> Add … -> (select the inprotk project)

You will also need the OpenDial project and its libraries. You can check out the project from svn:

  • svn checkout opendial-read-only

    • either import it into Eclipse, or just use the Eclipse svn client and create a new project

  • right-click on myds, go to Properties -> Java Build Path -> Projects -> Add … -> (select the opendial project)

  • Check myds for errors.

Now, you need to get MaryTTS working in InproTK. It’s not required, but getting it working will allow you to use myds. Follow the instructions found at

  • be sure to use Mary 4 for the myds example

  • you can also set the mary.base property in the myds project.

You should be able to open apps.Main in myds/src and run it by clicking Run -> Run as -> Java Application. It should run without throwing any errors. You will need to change some of the paths in

Next, you are ready for the instruction video:

We will post more tutorials on InproTK and incremental dialogue processing in the future.


Baumann T, Schlangen D. The InproTK 2012 release. In: Proceedings of the NAACL-HLT Workshop on Future directions and needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012). ACL; 2012: 29–32.

Kennington C, Kousidis S, Schlangen D. InproTKs: A Toolkit for Incremental Situated Processing. In: Proceedings of SIGdial 2014: Short Papers.; 2014: 84–88.

Schlangen, D., & Skantze, G. (2011). A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2(1), 83–111.

Lison, P. (2014). Structured Probabilistic Modelling for Dialogue Management. University of Oslo.

Sync your videos using reference audio

Reference audio is a common method for synchronizing videos from different cameras when more expensive equipment that does this automatically is not available. The most common use case is of course filming a scene from several different angles, but other setups may require the same technique. Recently, I had to film several scenes in succession, over which the same audio was playing. I needed to extract a part from each scene so that exactly the same audio would be playing over each part (all parts have the same duration, of course). So far, nothing new. What is surprising, however, is how easy this is to do with a handful of open source tools, as I will show here. In particular, we will be using Praat, ffmpeg and Python.
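The core of reference-audio synchronization is finding the offset at which the two audio tracks match best, which is what cross-correlation (as done by Praat) gives you. The idea can be sketched in pure Python on toy sample lists (a sketch of my own; for real recordings you would use Praat or numpy, as in the tutorial):

```python
def best_offset(reference, recording):
    """Return the sample offset at which `reference` best matches
    `recording`, i.e. the offset maximizing the cross-correlation."""
    best, best_score = 0, float("-inf")
    for offset in range(len(recording) - len(reference) + 1):
        # dot product of the reference with the recording window
        score = sum(r * s for r, s in
                    zip(reference, recording[offset:offset + len(reference)]))
        if score > best_score:
            best, best_score = offset, score
    return best

# toy example: the reference audio appears 3 samples into the recording
ref = [0.0, 1.0, -1.0, 0.5]
rec = [0.0, 0.1, 0.0] + ref + [0.0, 0.2]
print(best_offset(ref, rec))  # → 3
```

Once the offset in samples is known, dividing by the sampling rate gives the time offset, which can then be passed to ffmpeg’s `-ss` option to cut the matching part out of each video.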

The tutorial is available in PDF format:


Accepted Paper: ICMI 2014

We have a recently accepted paper at the upcoming ICMI conference, which will be held in Istanbul, Turkey this year.

Title: A Multimodal In-Car Dialogue System That Tracks The Driver’s Attention

Authors: Spyros Kousidis, Casey Kennington, Timo Baumann, Hendrik Buschmeier, Stefan Kopp, David Schlangen

Abstract: When a driver speaks to a passenger, that passenger is co-located with the driver, is generally aware of the situation, and can stop speaking to allow the driver to focus on the driving task. In-car dialogue systems ignore this important fact, making them more distracting than even cell-phone conversations. We developed and tested a “situationally-aware” dialogue system that can interrupt its speech when a situation is detected which requires more attention of the driver, and can resume when normal driving conditions return. Furthermore, our system allows resumption of interrupted speech via verbal or visual cues (such as head nods) from the driver. We tested whether giving the driver such control is a hindrance or helps in a driving and memory task.

Accepted Papers: RefNet Workshop

We have three papers accepted at the RefNet workshop, which will take place in Edinburgh.

Title: A Corpus of Virtual Pointing Gestures
Authors: Ting Han, Spyros Kousidis, David Schlangen

Title: Comparing Listener Gaze with Predictions of an Incremental Reference Resolution Model
Authors: Casey Kennington, Spyros Kousidis, David Schlangen
Abstract: In situated dialogue, listeners resolve referring expressions incrementally (on-line) and their gaze often attends to objects in the context as those objects are being described. In this work, we have looked at how listener gaze compares to a statistical reference resolution model that works incrementally. We found that listeners gaze at referred objects even before a referring expression begins, suggesting that salience and prior information is important in reference resolution models.

Title: Lattice Theoretic Relevance in Incremental Reference Processing

Authors: Julian Hough and Matthew Purver

Abstract: We build on Hough and Purver (2014)’s integration of Knuth (2005)’s lattice theoretic characterization of probabilistic inference to model incremental interpretation of repaired instructions in a small reference domain.