Robots face considerable uncertainty about what users want them to do, based on the users' speech, and they also have difficulty knowing when they have made their own goals common knowledge with the user. At the main conference of the 2017 Conference on Human-Robot Interaction (HRI2017), we recently presented a model of how a robot can best make its goal, and its uncertainty about the user's goal, common knowledge with the user.
We are going to present two papers at ACL this year:
Zarrieß S, Schlangen D. “Easy Things First: Installments Improve Referring Expression Generation for Objects in Photographs”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). pdf
Schlangen D, Zarrieß S, Kennington C. “Resolving References to Objects in Photographs using the Words-As-Classifiers Model”. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). pdf
Both are in the area of grounded semantics, and both make use of our variant of the “words-as-classifiers” model, in which the (perceptual part of the) semantics of a word is modelled as a classifier on perceptual input. These classifiers are trained from referring expressions. The first paper uses the model for generation and shows that we can work around the relative noisiness of the classifiers by following a strategy of generating what Herb Clark calls installments: trial noun phrases that are offered for confirmation. The second paper is on resolution and shows that we can reach fairly decent performance with this simple model, even compared to (more) end-to-end deep-learning-based ones.
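Since both papers build on the same model, a compact illustration may help. Below is a minimal sketch, in Java, of our reading of the words-as-classifiers idea; it is not the actual training code behind the papers, and the class and method names are made up for illustration. Each word gets its own binary logistic-regression classifier over an object's low-level visual features, with the referent of a referring expression serving as a positive example for every word in the expression, and the other objects in the scene as negatives.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal sketch of the words-as-classifiers idea: one binary
 * logistic-regression classifier per word, over visual features of objects.
 * Illustrative only; not the training code used in the papers.
 */
public class WacTrainer {

    /** Logistic-regression weights per word (last entry is the bias). */
    private final Map<String, double[]> wordWeights = new HashMap<>();
    private final int numFeatures;
    private final double learningRate = 0.1;

    public WacTrainer(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    private double[] weightsFor(String word) {
        return wordWeights.computeIfAbsent(word, w -> new double[numFeatures + 1]);
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** P(word fits object | features): the perceptual "meaning" of the word. */
    public double apply(String word, double[] features) {
        double[] w = weightsFor(word);
        double z = w[numFeatures]; // bias term
        for (int i = 0; i < numFeatures; i++) z += w[i] * features[i];
        return sigmoid(z);
    }

    /**
     * One training example: the referent is a positive instance for every
     * word in the referring expression; the other objects in the scene
     * serve as negative instances.
     */
    public void train(List<String> expression, double[] referent, List<double[]> distractors) {
        for (String word : expression) {
            update(word, referent, 1.0);
            for (double[] d : distractors) update(word, d, 0.0);
        }
    }

    /** One stochastic-gradient-descent step on the logistic loss. */
    private void update(String word, double[] features, double label) {
        double[] w = weightsFor(word);
        double error = apply(word, features) - label;
        for (int i = 0; i < numFeatures; i++) w[i] -= learningRate * error * features[i];
        w[numFeatures] -= learningRate * error;
    }
}
```

Calling apply(word, features) on each object in a scene then gives a per-word judgement of how well the word fits each object; a later sketch below shows how such judgements can be combined for resolution.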
Last week saw us say goodbye to Casey, who after a short stint as a postdoctoral researcher here has now moved back closer to home, on to a tenure-track position at Boise State University in Idaho, USA. Congratulations!
Casey was the first to join the Bielefeld incarnation of the Dialogue Systems Group, so with him and Spyros Kousidis (the second member) now having moved on, it does feel like the start-up phase has concluded. Casey’s had quite the run here, with almost 20 peer-reviewed publications (including 3 full *acl papers). We’re excited to see what he’ll do next!
(From left to right: Casey, Iwan, Ting, David, Sina, Julian, Soledad; Birte & Simon couldn’t make it on that day.)
Sina Zarrieß, Julian Hough, Casey Kennington, Ramesh Manuvinakurike, David DeVault, Raquel Fernández, and David Schlangen. “PentoRef: A Corpus of Spoken References in Task-oriented Dialogues”. (PUB: https://pub.uni-bielefeld.de/publication/2903076)
This tutorial is meant to help you use automatic speech recognition (ASR) in your applications. The kind of ASR we use follows the Incremental Unit (IU) model of dialogue processing, which produces ASR output word by word as it is recognized, rather than only after an utterance is complete, and is thus more timely than traditional approaches. We have modules for the Sphinx4, Kaldi, and Google ASR systems.
If you aren’t sure what InproTK is or how to install it, please refer to the instructions here. Once you have completed that tutorial, you can continue with this one.
(Note that some of the functionality in this tutorial may only exist in the DSG-Bielefeld BitBucket fork of InproTK and will appear in the main fork in the development branch shortly.)
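To make the incremental style of output concrete, here is an illustrative sketch of the edit-message scheme the IU model defines: the recognizer emits ADD, REVOKE, and COMMIT edits on word hypotheses rather than a single final transcript. The class and method names below are hypothetical stand-ins, not InproTK’s actual API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative consumer of incremental ASR output under the IU model.
 * Hypothetical interface, for exposition only; InproTK's real modules
 * exchange incremental units and edit messages through its own classes.
 */
public class IncrementalAsrConsumer {

    enum EditType { ADD, REVOKE, COMMIT }

    /** A word-level edit message, as in the IU dialogue processing model. */
    record WordEdit(EditType type, String word) {}

    // Current best hypothesis, maintained incrementally.
    private final List<String> hypothesis = new ArrayList<>();

    /** Apply one edit from the recognizer as soon as it arrives. */
    public void onEdit(WordEdit edit) {
        switch (edit.type()) {
            case ADD -> hypothesis.add(edit.word());                  // new word hypothesized
            case REVOKE -> hypothesis.remove(hypothesis.size() - 1);  // newest word retracted
            case COMMIT -> { }                                        // word is final; safe to act on
        }
        System.out.println("current hypothesis: " + String.join(" ", hypothesis));
    }

    public static void main(String[] args) {
        IncrementalAsrConsumer consumer = new IncrementalAsrConsumer();
        // A plausible edit stream: "the grey" gets revised mid-recognition.
        consumer.onEdit(new WordEdit(EditType.ADD, "the"));
        consumer.onEdit(new WordEdit(EditType.ADD, "grey"));
        consumer.onEdit(new WordEdit(EditType.REVOKE, "grey"));
        consumer.onEdit(new WordEdit(EditType.ADD, "great"));
    }
}
```

The REVOKE case is what distinguishes this from simply streaming partial results: the recognizer may retract a word it hypothesized earlier, and downstream modules must be able to undo whatever they did with it.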
This being on arXiv is also a bit of an experiment for us, as this is the first time that we’ve released something before it has been “officially” published. We hope that we get some comments this way that we can perhaps incorporate into a version that will be submitted to more traditional venues. So, comment away!
The short video below shows how the Words-as-Classifiers (WAC) [2,3] model of reference resolution (RR) resolves verbal references made to an object that is visually present and tangible.
The setup is as follows: several Pentomino puzzle tiles (i.e., geometric shapes) lie on a table, with a camera placed above them. The video is fed to computer vision software, which segments the objects and provides a set of low-level features for each object to the WAC RR module, implemented as a module in InproTK. Another module uses Kaldi to recognize speech from a microphone, and that output is also fed to the WAC module. Using the visual features and the speech, the WAC module determines which object is being referred to by producing a probability distribution over the objects; the object with the highest probability is chosen as the referred object.
The module works incrementally, i.e., it processes words one by one as they are recognized. The WAC model is trained on examples of similar interactions (i.e., geometric Pentomino tiles and corresponding referring expressions).
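As a rough illustration of that resolution step (a simplified sketch, not the module’s actual code), the per-word classifiers from the WacTrainer sketch above can be folded into a distribution over the objects one word at a time, which is what makes the word-by-word processing possible:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Sketch of combining per-word classifiers into a distribution over the
 * objects on the table. Simplified for exposition; reuses the WacTrainer
 * sketch from earlier in this post.
 */
public class WacResolver {

    private final WacTrainer wac;          // trained per-word classifiers
    private final List<double[]> objects;  // visual features of the segmented objects
    private final double[] scores;         // unnormalized score per object

    public WacResolver(WacTrainer wac, List<double[]> objects) {
        this.wac = wac;
        this.objects = objects;
        this.scores = new double[objects.size()];
        Arrays.fill(scores, 1.0);
    }

    /** Incremental update: fold in one newly recognized word. */
    public double[] addWord(String word) {
        double sum = 0.0;
        for (int i = 0; i < objects.size(); i++) {
            scores[i] *= wac.apply(word, objects.get(i));
            sum += scores[i];
        }
        double[] distribution = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            distribution[i] = scores[i] / sum;  // normalize into a probability distribution
        }
        return distribution;
    }

    /** The referred object is the one with the highest score so far. */
    public int currentBest() {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) best = i;
        }
        return best;
    }
}
```

Each call to addWord rescales every object’s score by how well the new word fits it, so the distribution sharpens as the referring expression unfolds.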
This demo corresponds to the one presented by Soledad Lopez at the SemDial conference held in Gothenburg, Sweden, in 2015. Also part of that demo, but not showcased here, was the ability of the module to speak some simple utterances, including a confirmation of whether the object selected by WAC was the intended one.