Tutorial: ASR using InproTK

This tutorial is meant to help you use speech recognition (ASR) in your applications. The kind of ASR we use follows the Incremental Unit dialogue processing model, which produces ASR output that is more timely than traditional approaches. We have modules for Sphinx4, Kaldi and Google ASR systems.

If you aren’t sure what InproTK is or how to install it, please refer to the instructions here. Once you have completed that tutorial, you can continue with this one.

(Note that some of the functionality in this tutorial may only exist in the DSG-Bielefeld BitBucket fork of InproTK and will appear in the main fork in the development branch shortly.)

Demo of Words-as-Classifiers Model of Reference Resolution

The short video below shows how the Words-as-Classifiers (WAC) [2,3] model of reference resolution (RR) resolves verbal references made to an object that is visually present and tangible.

The setup is as follows: several Pentomino puzzle tiles (i.e., geometric shapes) are on a table, and a camera is placed above them. The video is fed to computer vision software, which segments the objects and provides a set of low-level features for each object to the WAC module of RR, implemented as a module in InproTK [4]. Another module uses Kaldi to recognize speech through a microphone, and that output is also fed to the WAC module. Using the visual features and the speech, the WAC module determines which object is being referred to by producing a probability distribution over the objects; the object with the highest probability is chosen as the referred object.

The module works incrementally, i.e., it processes word for word as they are recognized. The WAC model is trained on examples of similar interactions (i.e., geometric pentomino tiles and corresponding referring expressions).
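To make the word-by-word idea concrete, here is a minimal toy sketch in Python. It is not the InproTK module or the trained WAC model: the two features, the classifier weights, and the choice to combine word scores by multiplying and renormalising are all assumptions made for illustration.

```python
import math

# Hypothetical per-word classifiers: each word maps to (weights, bias) of a
# logistic-regression classifier over object features [redness, roundness].
# These numbers are made up for illustration; they are not trained values.
WORD_CLASSIFIERS = {
    "red": ([6.0, 0.0], -3.0),
    "round": ([0.0, 6.0], -3.0),
}


def word_fit(word, features):
    """Return how well `word` fits an object, as a probability in (0, 1)."""
    weights, bias = WORD_CLASSIFIERS[word]
    activation = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-activation))


def resolve_incrementally(words, objects):
    """Yield (word, distribution over objects) after each incoming word.

    The distribution starts uniform, is multiplied by each word's fit and
    then renormalised. Words without a classifier leave it unchanged.
    """
    dist = {obj: 1.0 / len(objects) for obj in objects}
    for word in words:
        if word in WORD_CLASSIFIERS:
            scores = {o: dist[o] * word_fit(word, f) for o, f in objects.items()}
            total = sum(scores.values())
            dist = {o: s / total for o, s in scores.items()}
        yield word, dist


objects = {
    "tile_a": [0.9, 0.1],  # red, angular
    "tile_b": [0.9, 0.8],  # red, round
    "tile_c": [0.1, 0.9],  # not red, round
}
for word, dist in resolve_incrementally(["the", "red", "round", "one"], objects):
    best = max(dist, key=dist.get)
    print(word, best, round(dist[best], 2))
```

The point of the sketch is only that a usable distribution is available after every word, so a system does not have to wait for the end of the utterance before picking a candidate object.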

This demo corresponds to the demo presented by Soledad Lopez at the SemDial conference, which was held in Gothenburg, Sweden, in 2015 [1]. Also part of that demo, but not showcased here, was the ability of the module to speak some simple utterances, including asking for confirmation that the object selected by WAC was the intended one.


[1] Kennington, C., Lopez Gambino, M. S., & Schlangen, D. (2015). Real-world Reference Game using the Words-as-Classifiers Model of Reference Resolution. In Proceedings of SemDial 2015 (pp. 188–189).

[2] Kennington, C., & Schlangen, D. (2015). Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution. In Proceedings of ACL. Beijing, China: Association for Computational Linguistics.

[3] Kennington, C., Dia, L., & Schlangen, D. (2015). A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution. In Proceedings of IWCS. Association for Computational Linguistics.

[4] Baumann, T., & Schlangen, D. (2012). The InproTK 2012 Release. In Proceedings of NAACL-HLT.

Several DSG Members are off to Gothenburg for the SemDial 2015 conference

Several long papers have been accepted:

Hough, Julian; de Kok, Iwan; Schlangen, David; Kopp, Stefan. Timing and Grounding in Motor Skill Coaching Interaction: Consequences for the Information State

Han, Ting; Kennington, Casey; Schlangen, David. Building and Applying Perceptually-Grounded Representations of Multimodal Scene Descriptions

As well as some demo papers:

Kennington, Casey; Lopez Gambino, Maria Soledad; Schlangen, David. Real-world Reference Game using the Words-as-Classifiers Model of Reference Resolution

de Kok, Iwan; Hough, Julian; Hülsmann, Felix; Waltema, Thomas; Botsch, Mario; Schlangen, David; Kopp, Stefan. Demonstrating the Dialogue System of the Intelligent Coaching Space

Forced Alignment with InproTK (and Sphinx)

Forced alignment is the task of determining start and end times for words within an audio file, given a reference text. For example, if I record myself on a microphone to a wav file saying the words “hello world” I can then pass that wav file and the text “hello world” to a program that can do forced alignment and it will be able to tell me at what point in the file each of the two words started and ended.

In this tutorial, we will use InproTK (which in turn uses CMU’s Sphinx4 speech recognizer) to perform the forced alignment (henceforth, FA). We will see how it is done with InproTK’s SimpleReco, then we will see how one can do FA in other languages. Below, two common problems are addressed.

A previous post explained how to “install” InproTK in eclipse. Please refer to that explanation for downloading InproTK and getting it to run (it might not hurt to check out the develop branch). Once you have it working as a project in eclipse, please continue with the steps below.

FA in InproTK is quite simple. In eclipse, navigate to src, then inpro.apps.SimpleReco and open it. The java source file should appear in the editor. Click on the little arrow next to the green play button on the top, then Run As -> Java Application. You should see a bit of output in the console explaining what the command line parameters should be.

The command line parameters that are necessary for FA are:
-F <URI to audio file>
-fa "<reference text>"
-Lp <path to inc_reco output file>

In order to test FA on your system, please download this: audio. Put it in a folder that you know the path to.

In eclipse, click on the arrow next to the Run icon, then click on “Run Configurations”. Find SimpleReco under Java Applications and click on it. Then click on the tab “Arguments”. In the Program arguments box, copy in the following, then change the two paths within the brackets to reflect your own setup:

-F file://<path to audio>/audio.wav -fa "der rote stricht oben rechts richtig" -L -Lp <path for output file>

Click “Apply” then “Run”.  You should see output in the command line. You should also see an inc_reco file appear in the <path for output file>. Open that file and you will see lines that look like this:

Time: 6.80
0.000    2.930    <sil>
2.930    3.450    der
3.450    3.990    rote
3.990    4.200    <sil>
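If you want to use these times in your own scripts, the lines are easy to parse. The following Python sketch assumes the file consists of one or more blocks like the one above: a "Time:" line followed by lines with a start time, an end time, and a label, separated by whitespace. The function name is our own.

```python
def parse_inc_reco(text):
    """Parse label blocks into a list of (time, [(start, end, word), ...])."""
    blocks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("Time:"):
            # A new block starts; remember its timestamp.
            blocks.append((float(line.split(":", 1)[1]), []))
        elif blocks:
            fields = line.split()
            if len(fields) == 3:
                start, end, word = fields
                blocks[-1][1].append((float(start), float(end), word))
    return blocks


sample = """Time: 6.80
0.000    2.930    <sil>
2.930    3.450    der
3.450    3.990    rote
3.990    4.200    <sil>
"""

# Take the last block and drop the silence labels.
_, labels = parse_inc_reco(sample)[-1]
words = [(start, end, word) for start, end, word in labels if word != "<sil>"]
print(words)
```

Taking the last block is a design choice for this sketch; if your file contains only a single block, it is simply that block.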

What if I want to perform FA on many files? You can call InproTK from the command line. Make sure all the necessary jars in the lib folder are on the classpath; InproTK can then be called from, e.g., a shell or Python script. For example:

java -classpath inprotk/lib/oaa2/*:inprotk/lib/common-for-convenience/*:inprotk/lib/*:inprotk/lib/sphinx4/* inpro.apps.SimpleReco -c file:inprotk/config.xml -F file:<path to audio file> -fa "<reference text>" -L -Lp <path to inc_reco output file>
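As a rough sketch of how such a call could be wrapped in a Python script, the code below builds the same command for each audio file and runs it. The classpath entries, config location, and flags are copied from the command above. The function names, the ".inc_reco" file extension, and the assumption that -Lp takes the path of a single output file are ours, so adjust them to your setup.

```python
import os
import subprocess
from pathlib import Path

# Classpath entries copied from the command above; adjust to your checkout.
CLASSPATH = os.pathsep.join([
    "inprotk/lib/oaa2/*",
    "inprotk/lib/common-for-convenience/*",
    "inprotk/lib/*",
    "inprotk/lib/sphinx4/*",
])


def build_command(audio_path, reference_text, output_path):
    """Build the SimpleReco argument list for one audio file."""
    return [
        "java", "-classpath", CLASSPATH, "inpro.apps.SimpleReco",
        "-c", "file:inprotk/config.xml",
        "-F", "file:" + str(audio_path),
        "-fa", reference_text,
        "-L", "-Lp", str(output_path),
    ]


def align_all(jobs, output_dir):
    """Run forced alignment for each (audio_path, reference_text) pair."""
    for audio_path, reference_text in jobs:
        # Hypothetical naming scheme: one output file per audio file.
        output_path = Path(output_dir) / (Path(audio_path).stem + ".inc_reco")
        command = build_command(audio_path, reference_text, output_path)
        subprocess.run(command, check=True)
```

Passing the command as a list (rather than one shell string) means the reference text does not need any extra quoting, even when it contains spaces.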

FA with InproTK in English

The above example used audio from a German speaker. What if I want to use this in another language? No problem. You will, of course, need three things for that language: an acoustic model, a language model, and a dictionary. All three kinds of models are available on voxforge.org for various languages. We will not look at how to create models here. Rather, we will see an example of how an already existing model for English can be used in InproTK.

In the inpro.apps folder, you will see a config.xml. In that file, there are references to other xml configuration files. Note the sphinx-de.xml. Replacing the information in that file with the information for the English model will tell Sphinx/InproTK where to find it. To do so, please change sphinx-de.xml to sphinx-en.xml and save the config.xml file.

Fortunately, InproTK already has the WSJ acoustic models and dictionary. You will also need a language model file which you can download here. Download the wsj5kc.Z.DMP file and put it into InproTK’s res folder. Next, create a file in inpro.apps called sphinx-en.xml and paste the following contents into it:

<property name="wordInsertionProbability" value="0.7"/>
<property name="languageWeight" value="11.5"/>

<!-- ******************************************************** -->
<!-- The Grammar  configuration                               -->
<!-- ******************************************************** -->
<component name="jsgfGrammar" type="edu.cmu.sphinx.jsgf.JSGFGrammar">
  <property name="dictionary" value="dictionary"/>
  <property name="grammarLocation" value="file:src/inpro/domains/pentomino/resources/"/>
  <property name="grammarName" value="pento"/>
  <property name="logMath" value="logMath"/>
</component>

<component name="ngram" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
  <property name="dictionary" value="dictionary"/>
  <property name="logMath" value="logMath"/>
  <!-- this can be overridden with the -lm switch in SimpleReco -->
  <property name="location" value="file://res/wsj5kc.Z.DMP"/>
  <property name="maxDepth" value="3"/>
  <property name="unigramWeight" value=".7"/>
</component>

<!-- this second ngram model can be used to blend multiple language models using interpolatedLM defined in sphinx.xml -->
<component name="ngram2" type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
  <property name="dictionary" value="dictionary"/>
  <property name="logMath" value="logMath"/>
  <property name="location" value="file://res/wsj5kc.Z.DMP"/>
  <property name="maxDepth" value="3"/>
  <property name="unigramWeight" value=".7"/>
</component>

<!-- ******************************************************** -->
<!-- The Dictionary configuration                             -->
<!-- ******************************************************** -->
<component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FullDictionary">
  <property name="dictionaryPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/cmudict.0.6d"/>
  <property name="fillerPath" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
  <!--
  <property name="addSilEndingPronunciation" value="false"/>
  <property name="allowMissingWords" value="true"/>
  <property name="createMissingWords" value="true"/>
  <property name="g2pModelPath" value="file:///home/timo/uni/projekte/inpro-git/res/Cocolab_DE-g2p-4.fst.ser"/>
  <property name="g2pMaxPron" value="2"/>
  -->
  <property name="unitManager" value="unitManager"/>
</component>

<!-- ******************************************************** -->
<!-- The acoustic model and unit manager configuration        -->
<!-- ******************************************************** -->
<component name="sphinx3Loader" type="edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader">
  <property name="logMath" value="logMath"/>
  <property name="unitManager" value="unitManager"/>
  <property name="location" value="resource:/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz"/>
  <property name="modelDefinition" value="etc/WSJ_clean_13dCep_16k_40mel_130Hz_6800Hz.4000.mdef"/>
  <property name="dataLocation" value="cd_continuous_8gau/"/>
</component>

Now InproTK can perform FA on English audio. Go back to the SimpleReco configuration (arrow next to Play button, configurations, SimpleReco) and change the command line arguments to refer to an English audio file and change the reference text to reflect that audio. You can change back to German by opening up the config.xml file and replacing sphinx-en.xml with sphinx-de.xml.
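If you switch between the two languages often, the edit to config.xml can be scripted. The sketch below is a plain text replacement in Python; the function names are ours, and it assumes that the file name sphinx-de.xml (or sphinx-en.xml) appears literally in config.xml, as described above.

```python
from pathlib import Path


def switch_language_config(config_text, old="sphinx-de.xml", new="sphinx-en.xml"):
    """Return the config text with every reference to `old` replaced by `new`."""
    if old not in config_text:
        raise ValueError(f"{old} is not referenced in the configuration")
    return config_text.replace(old, new)


def switch_config_file(path, old="sphinx-de.xml", new="sphinx-en.xml"):
    """Rewrite the config file at `path` in place."""
    config_path = Path(path)
    original = config_path.read_text(encoding="utf-8")
    config_path.write_text(switch_language_config(original, old, new), encoding="utf-8")
```

Calling switch_config_file with the old and new names swapped changes the configuration back to German.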

Problem! I don’t see the inc_reco output file anywhere!! It could be the case that your LabelWriter isn’t working properly. In inpro.apps you will find iu-config.xml. Open that and find labelWriter2. Make sure the component definition for that looks like this:

<component name="labelWriter2" type="inpro.incremental.sink.LabelWriter">
  <property name="writeToStdOut" value="false"/>
  <property name="writeToFile" value="true"/>
</component>

If not, then copy the above text over the original labelWriter2 definition. Save the iu-config.xml file.
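The same check can be done with a few lines of Python instead of by eye. This sketch assumes that iu-config.xml parses as ordinary XML without namespaces; the function name and the wording of the messages are ours.

```python
import xml.etree.ElementTree as ET


def label_writer_problems(config_xml, name="labelWriter2"):
    """Return a list of problems found in the named LabelWriter component."""
    expected = {"writeToStdOut": "false", "writeToFile": "true"}
    root = ET.fromstring(config_xml)
    component = next(
        (c for c in root.iter("component") if c.get("name") == name), None
    )
    if component is None:
        return [f"no component named {name}"]
    # Collect the properties that are actually set on the component.
    actual = {p.get("name"): p.get("value") for p in component.findall("property")}
    return [
        f"{key} should be {value}, found {actual.get(key)}"
        for key, value in expected.items()
        if actual.get(key) != value
    ]
```

An empty list means the component is set up as shown above; read the contents of iu-config.xml into a string and pass it to the function.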

Problem! My output has some “Can’t find pronunciation for <word>”.  That means one of the words in your reference text does not exist in the dictionary. The dictionary for the German model is in the lib/Cocolab_DE_8gau_13dCep_16k_40mel_130Hz_6800Hz jar in the Cocolab_DE_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict folder. You can view its contents in eclipse by going to the InproTK project, then Referenced Libraries, then Cocolab_DE_ …. then dict, then Cocolab_DE.lex. Here is an example of the contents:

zog    t s o: k
Sekretär    z e k r e t E: 6
tanzen    t a n t s n
tanzen(1)    t a n t s @ n
Ladentür    l a: d n t y: 6
kochten    k O x t n

where each line has a word, a tab, then the phonetic transcription. To add words to the dictionary (ones that appear in the above-mentioned warning in the console), add the word to the bottom, press tab, then add a phonetic transcription (we will not see how a phonetic transcription is made here; for now just find other words that look similar to the one you are adding and form the transcription based on those). You may need to open the jar with an archive tool, add the missing word+transcription, save the dictionary file, then re-make the jar and replace the old one. Alternatively, you can make your own dictionary file and set the dictionary path to that file in sphinx-de.xml.
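Before running FA on many files, it can save time to check the reference texts against the dictionary up front. The Python sketch below assumes the format described above (word, tab, transcription, with pronunciation variants marked like "tanzen(1)") and that words must match the dictionary entries exactly, including case. The function names are ours.

```python
import re


def load_lexicon(text):
    """Map each word to its transcriptions; variants like 'tanzen(1)' are merged."""
    lexicon = {}
    for line in text.splitlines():
        if "\t" not in line:
            continue
        word, transcription = line.split("\t", 1)
        # Strip a trailing variant marker such as "(1)".
        word = re.sub(r"\(\d+\)$", "", word.strip())
        lexicon.setdefault(word, []).append(transcription.strip())
    return lexicon


def missing_words(reference_text, lexicon):
    """Return reference words without a dictionary entry, in order, no duplicates."""
    missing = []
    for word in reference_text.split():
        if word not in lexicon and word not in missing:
            missing.append(word)
    return missing


def format_entry(word, phones):
    """Format one dictionary line: the word, a tab, then space-separated phones."""
    return word + "\t" + " ".join(phones)


sample = "zog\tt s o: k\ntanzen\tt a n t s n\ntanzen(1)\tt a n t s @ n\n"
lexicon = load_lexicon(sample)
print(missing_words("zog tanzen kochten", lexicon))
print(format_entry("kochten", ["k", "O", "x", "t", "n"]))
```

The lines produced by format_entry can be appended to your own dictionary file; the transcription itself still has to come from you.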

Accepted Paper: ACL 2015

We have a paper that has been accepted to the ACL conference which will take place in Beijing, China this year.

Title: Simple Learning and Compositional Application of Perceptually Grounded Word Meanings for Incremental Reference Resolution

Authors:  Casey Kennington & David Schlangen

An elementary way of using language is to refer to objects. Often, these objects are physically present in the shared environment and reference is done via mention of perceivable properties of the objects. This is a type of language use that is modelled well neither by logical semantics nor by distributional semantics, the former focusing on inferential relations between expressed propositions, the latter on similarity relations between words or phrases. We present an account of word and phrase meaning that is perceptually grounded, trainable, compositional, and ‘dialogue-plausible’ in that it computes meanings word-by-word. We show that the approach performs well (with an accuracy of 65% on a 1-out-of-32 reference resolution task) on direct descriptions and target/landmark descriptions, even when trained with less than 800 training examples and automatically transcribed utterances.

Accepted Paper: NAACL 2015

We have a recently accepted paper to the upcoming NAACL 2015 conference which will be held in Denver, CO, U.S.A.

Title:  Incrementally Tracking Reference in Human/Human Dialogue Using Linguistic and Extra-Linguistic Information

Authors: Casey Kennington, Ryu Iida, Takenobu Tokunaga, David Schlangen

A large part of human communication involves referring to entities in the world, and often these entities are objects that are visually present for the interlocutors. A system that aims to resolve such references needs to tackle a complex task: objects and their visual features need to be determined, the referring expressions must be recognised, and extra-linguistic information such as eye gaze or pointing gestures need to be incorporated. Systems that can make use of such information sources exist, but have so far only been tested under very constrained settings, such as WOz interactions. In this paper, we apply to a more complex domain a reference resolution model that works incrementally (i.e., word for word), grounds words with visually present properties of objects (such as shape and size), and can incorporate extra-linguistic information. We find that the model works well compared to previous work on the same data, despite using fewer features. We conclude that the model shows potential for use in a real-time interactive dialogue system.

Accepted Papers: IWCS 2015

We have 2 recently accepted papers to the IWCS conference which will take place in London, UK.

Title: Incremental Semantics for Dialogue Processing: Requirements, and a Comparison of Two Approaches
Authors: Julian Hough, Casey Kennington, David Schlangen and Jonathan Ginzburg
Truly interactive dialogue systems need to construct meaning on at least a word-by-word basis. We propose desiderata for incremental semantics for dialogue models and systems, a task not heretofore attempted thoroughly. After laying out the desirable properties we illustrate how they are met by current approaches, comparing two incremental semantic processing frameworks: Dynamic Syntax enriched with Type Theory with Records (DS-TTR) and Robust Minimal Recursion Semantics with incremental processing (RMRS-IP). We conclude these approaches are not significantly different with regards to their semantic representation construction, however their purported role within semantic models and dialogue models is where they diverge.


Title: A Discriminative Model for Perceptually-Grounded Incremental Reference Resolution
Authors: Casey Kennington, Livia Dia, David Schlangen
A large part of human communication involves referring to entities in the world, and often these entities are objects that are visually present for the interlocutors. A computer system that aims to resolve such references needs to tackle a complex task: objects and their visual features need to be determined, the referring expressions must be recognised, extra-linguistic information such as eye gaze or pointing gestures need to be incorporated — and the intended connection between words and world must be reconstructed. In this paper, we introduce a discriminative model of reference resolution that processes incrementally (i.e., word for word), is perceptually-grounded in the world, and improves when interpolated with information from gaze and pointing gestures. We evaluated our model and found that it performed robustly in a realistic reference resolution task, when compared to a generative model.

Intro to InproTK

InproTK has been around for several years and is becoming more widely used in dialogue processing research. It follows the Incremental Unit framework of incremental dialogue processing.

The toolkit has been written in Java primarily by Timo Baumann. In this intro, we will show you how to get InproTK up and running with speech recognition (Incremental Sphinx, part of InproTK), a dialogue manager (opendial, by Pierre Lison; note that opendial can also be used as a fully-functional end-to-end dialogue system, with recently added incremental capabilities, but we will use its dialogue management capabilities here), and a text-to-speech interface (incremental MaryTTS, also part of InproTK).

This intro has two parts. The first part is a rough guide on getting everything “installed” so you can see a working InproTK project called myds. The following video steps you through the myds project code and explains how it all works together. You do not need to install everything to watch the video, but it certainly helps.

Installing InproTK

InproTK is written in Java, so “installing” it means getting the jar (and necessary libraries), or getting the source code. It is advisable to get the source code.

You can get the source code from bitbucket by running the following command (sign up for a free bitbucket account if you have not already done so):

  • git clone https://[your username]@bitbucket.org/inpro/inprotk.git

The rest of this will explain how to use InproTK within eclipse using a sample eclipse project.

  • Download eclipse, unzip, run (you may need to install a Java JDK, make sure it is above 1.5 but lower than 1.8), pick a workspace

  • Import the InproTK project ….

    • in Eclipse, file -> import -> Existing Projects into Workspace -> Browse -> (find where you cloned the InproTK project)

    • You should see inprotk as a project in Eclipse. Make sure there are no errors (an error is denoted by red marks). You can click on the “Problems” tab to see where the problems are. Make sure all the libraries are included in the project. To do that, click on Project -> Properties -> Java Build Path -> Libraries -> (select jar files in the inprotk lib folder)

  • Download, unzip, and copy the myds project, import in the same way as you did with inprotk

  • right-click on myds, go to Properties -> Java Build Path -> Projects -> Add … -> (select the inprotk project)

You will also need the opendial project and its libraries. You can checkout the project from svn:

  • svn checkout http://opendial.googlecode.com/svn/trunk/ opendial-read-only

    • either import it into eclipse, or just use the eclipse svn client and create a new project

  • right-click on myds, go to Properties -> Java Build Path -> Projects -> Add … -> (select the opendial project)

  • Check myds for errors.

Now, you need to get MaryTTS working in InproTK. It’s not required, but getting it working will allow you to use myds. Follow the instructions found at http://sourceforge.net/p/inprotk/wiki/Setup/

  • be sure to use Mary 4 for the myds example

  • you can also set mary.base in apps.Main.run() in the myds project.

You should be able to open apps.Main in myds/src and run it by clicking Run -> Run as -> Java Application. You will need to change some of the paths in Main.run() first; after that, it should run without throwing any errors.

Next, you are ready for the instruction video:

We will post more tutorials on InproTK and incremental dialogue processing in the future.


Baumann, T., & Schlangen, D. (2012). The InproTK 2012 Release. In Proceedings of the NAACL-HLT Workshop on Future Directions and Needs in the Spoken Dialog Community: Tools and Data (SDCTD 2012) (pp. 29–32). ACL.

Kennington, C., Kousidis, S., & Schlangen, D. (2014). InproTKs: A Toolkit for Incremental Situated Processing. In Proceedings of SIGdial 2014: Short Papers (pp. 84–88).

Schlangen, D., & Skantze, G. (2011). A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2(1), 83–111.

Lison, P. (2014). Structured Probabilistic Modelling for Dialogue Management. University of Oslo.

Accepted Paper: ICMI 2014

We have a recently accepted paper at the upcoming ICMI conference, which will be held in Istanbul, Turkey this year.

Title: A Multimodal In-Car Dialogue System That Tracks The Driver’s Attention

Authors: Spyros Kousidis, Casey Kennington, Timo Baumann, Hendrik Buschmeier, Stefan Kopp, David Schlangen

Abstract: When a driver speaks to a passenger, that passenger is co-located with the driver, is generally aware of the situation, and can stop speaking to allow the driver to focus on the driving task. In-car dialogue systems ignore this important fact, making them more distracting than even cell-phone conversations. We developed and tested a “situationally-aware” dialogue system that can interrupt its speech when a situation is detected which requires more attention of the driver, and can resume when normal driving conditions return. Furthermore, our system allows resumption of interrupted speech via verbal or visual cues (such as head nods) from the driver. We tested whether giving the driver such control is a hindrance or helps in a driving and memory task.

Accepted Papers: RefNet Workshop

We have 3 recently accepted papers to the RefNet workshop which will take place in Edinburgh.

Title: A Corpus of Virtual Pointing Gestures
Authors: Ting Han, Spyros Kousidis, David Schlangen

Title: Comparing Listener Gaze with Predictions of an Incremental Reference Resolution Model
Authors: Casey Kennington, Spyros Kousidis, David Schlangen
In situated dialogue, listeners resolve referring expressions incrementally (on-line) and their gaze often attends to objects in the context as those objects are being described. In this work, we have looked at how listener gaze compares to a statistical reference resolution model that works incrementally. We found that listeners gaze at referred objects even before a referring expression begins, suggesting that salience and prior information is important in reference resolution models.

Title: Lattice Theoretic Relevance in Incremental Reference Processing

Authors: Julian Hough and Matthew Purver

Abstract: We build on Hough and Purver (2014)’s integration of Knuth (2005)’s lattice theoretic characterization of probabilistic inference to model incremental interpretation of repaired instructions in a small reference domain.