Our Mission

The mission of the OVAL lab is to create open, advanced technology so developers can create robust, friendly, multilingual voice assistants cost-effectively. Our goal is to enable every organization to add voice-based assistance to their services as easily as creating a website. We envision an open World Wide Voice Web where voice agents are created once and are accessible in every language and on every device (smart speakers, smart and feature phones, cars).

How are we different from commercial assistants?

Commercial chatbots today are fragile, requiring developers to hardcode possible conversational paths in anticipation of what the user may say. Large language models, such as GPT-3, while versatile and general, are not grounded and cannot be controlled to provide correct and reliable information or to perform actions with side effects. Our approach is to develop novel and effective voice assistant authoring tools that combine programming-language concepts with deep learning and large language models.

What is our technology?

Specifically, we are creating GenieScript, a voice assistant language that lets developers create task-oriented agents simply by providing a high-level script of the conversational flow and access to their existing databases of domain knowledge. The GenieScript system follows the script while automatically responding to unanticipated user inputs and helping users navigate the knowledge bases. More importantly, GenieScript is sample-efficient: only a small amount of annotated data is needed for few-shot learning.

The voice agents we create are the first to use contextual neural semantic parsing, in which a formal representation of the conversation so far is fed into a neural network to determine the semantics of the incoming sentence. The semantics is represented formally in an executable programming language called ThingTalk. ThingTalk grounds the conversation by mapping the user's request to precise database queries and API executions, which is significantly richer than the intents typically used in dialogue flows. The GenieScript system automatically synthesizes large quantities of conversational data to fine-tune large language models. By including a small sample of hand-annotated data, a developer can expect to produce a first basic agent that understands users' statements with about 60-70% accuracy in just a few days.
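
As a concrete, deliberately simplified illustration of contextual semantic parsing, the sketch below feeds the formal state of the conversation so far together with the new utterance into a single seq2seq model and decodes a formal program. The checkpoint name, the "context ; user" input format, and the example query string are assumptions for illustration only; this is not the GenieScript implementation.

```python
# Minimal sketch of contextual semantic parsing (not the GenieScript code).
# The model name, the "context ; user" input format, and the target program
# string are hypothetical; Genie fine-tunes its own seq2seq checkpoints.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def parse(context_program: str, user_utterance: str) -> str:
    """Feed the formal representation of the conversation so far plus the new
    utterance into one seq2seq model and decode a formal program."""
    source = f"context: {context_program} ; user: {user_utterance}"
    inputs = tokenizer(source, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example call. An untuned checkpoint produces meaningless output;
# a fine-tuned parser would emit an executable query instead.
print(parse("query restaurants filter cuisine == 'italian'",
            "only ones rated above 4 please"))
```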

What is our goal?

Our overall goal is to create tools that enable developers to easily create deployable agents. We aim to understand the gap between current academic research and practice by experimenting with real-life use cases, and to devise methodologies and tools that can be used in practice.

Ongoing Research Projects

Starting with database queries issued one command at a time, our current research takes on these challenges:

  • Provide factually accurate answers in a conversation, drawing from the Wikidata knowledge base.
  • Lead conversations to fulfill the developers’ purpose while responding to user-initiated statements.
  • Recover from errors when the agent misunderstands the user.
  • Converse in different languages, including code-mixed language.
  • Communicate with the user through multimodal interfaces, including GUIs, body position, and gaze direction.
  • Apply common-sense knowledge and reasoning to understand users’ statements.
  • Converse with the user on a wide collection of general topics.
  • Converse with basic human qualities such as not repeating itself, giving consistent answers, and showing empathy.

Ongoing Collaborations

We are collaborating with the following groups on these projects.

  • Multilingual agents with Tianjin University (China), CEA (France), IIIT (Hyderabad, India), Microsoft (India), and Hanyang University (Korea). We create multilingual agents by automatically translating training data into the target languages; manual annotation is needed only for the few-shot training, validation, and test sets. We are creating a Chinese, English, French, Hindi, Hindi/English, and Korean dataset based on Wizard-of-Oz conversations with Deyi Xiong, Gaël de Chalendar, Nasredine Semmar, Ponnurangam Kumaraguru, Kalika Bali, Monojit Choudhury, and Jiwon Seo.
  • Social Skill Training Assistants for Individuals with Autism Spectrum Disorder with Stanford Medicine. We are creating an empathetic agent that helps individuals with Autism Spectrum Disorder or social anxiety improve their social skills. This is a joint research project with Prof. Lynn Koegel.
  • Conversational Agents for Building Information Management (BIM) with Stanford Civil Engineering. Voice interfaces make the massive amount of information in 3D BIM software easily accessible to blue-collar workers on the job. This is a joint research project with Prof. Martin Fischer.

We invite students, researchers, and corporations to join us in advancing the state of the art in conversational agents and applying the technology to real-world use cases.

Recent Talk

Cost-Effective Multi-Lingual Task-Oriented Dialogue Agents

Monica Lam
AI Workshop, Stanford Computer Forum 2022, April 4-6, 2022

Demonstrations

We have now created a GenieScript system that can be used for experimental prototypes. Our experience shows that the GenieScript system has made it possible for small teams to create useful prototypes quickly.

We have developed the Genie Assistant, which can perform the ten most popular skills. Genie can play songs, podcasts, radio stations, and news; help find restaurants; answer questions; forecast the weather; set timers and reminders; and control IoT devices. It can control eight kinds of appliances (thermostats, switches, lights, fans, doors, locks, window covers, and vacuum cleaners) and read seven kinds of sensors (temperature, motion, illuminance, humidity, flood, ultraviolet light, and battery).

  1. Genie runs on the Baidu smart speaker with the full audio stack, which includes acoustic echo cancellation, voice activity detection, wake-word detection (donated by Picovoice), ducking, and speech-to-text and text-to-speech (Microsoft Azure Cognitive Services).

    The First Workshop on the World Wide Voice Web, Nov 20, 2021.

  2. Genie also runs on a Raspberry Pi. It has been distributed as the voice interface to Home Assistant, an open-source home gateway that can connect to over 1000 different IoT devices.

    State of the Open Home Workshop, Home Assistant, Dec 11, 2021.

  3. Genie is used to create a voice interface for 3D Building Information Management (BIM) software. This enables blue-collar workers to use voice to easily access digital information on the job.

    AI Workshop, Stanford Computer Forum 2022, April 4-6, 2022

Open-Source Software & Datasets

Software
  • Genie: A toolkit to synthesize, train, and build conversational virtual assistants cost-effectively.
  • ThingTalk: An extensible, executable representation language for task-oriented dialogues.
  • Genie NLP: A library of NLP models for the Genie toolkit.
  • Genie Cloud: A multi-user, Kubernetes-enabled Genie runtime with embedded NLP.
  • Genie Server: A single-user version of Genie for home servers, suitable for running on low-power devices and smart speakers.
  • Genie Devices: A repository of skills created by Genie developers.
Datasets & Tutorials
  • Coming soon.

Presentations & Interviews

Accomplishments to Date

System Structure

Genie Toolset

Natural Language Programming

An award-winning socialbot capable of emotionally engaging mixed-initiative conversations

The 1st conversational agent that learns from open-ended human feedback

The 1st context-aware assistant that knows where you are

Upcoming Events & News


An Open Virtual Assistant 2.0 Platform

Affordability

Virtual Assistant 2.0 improves the quality & lowers the cost of dialogue agents with automatic synthesis of high-quality training data

Key Technology and Available Software

  • Thingpedia: an open, crowdsourced repository with over 150 skills and 1000+ IoT devices
  • Genie Semantic Parser Generator: Automatic generation of contextual neural semantic parsers from Thingpedia entries, trained with synthesized data at 1% of the traditional manual annotation cost
  • ThingTalk: The first (executable) virtual assistant programming language, with formal semantics to enable worldwide collaboration through extensibility, common libraries, neural models, datasets, and tools
  • Almond Assistant: the first assistant that protects privacy

Publication

Architecture diagram of Almond. The Thingpedia Open Repository of Skills connects to the Genie Neural Semantic Parser Generator, which produces a Genie-generated neural semantic parser. The parser takes natural language as input and produces ThingTalk, the virtual assistant programming language, which is passed to the Almond Assistant.

The 1st Federated Virtual Assistant that Protects Privacy

Privacy

  • Almond protects privacy by allowing execution on user devices
  • A federated architecture offers interoperability & choice
  • Users can share digital assets with each other privately
  • Distributed with Home Assistant to 100,000+ users

Technology

  • Versatile access control: Natural-language policy specifications with formal ThingTalk semantics (a toy sketch follows the examples below).
  • Communication protocol: Remote ThingTalk programs allow users to share any asset accessible to their virtual assistant, with privacy and security.

Publication

Diagram of communicating assistants. Two assistants are shown. Each assistant communicates with one user in natural language. The natural language is passed to the Genie Neural Parser to produce ThingTalk, which is passed to the Almond assistant. The two assistants communicate over a standard communication protocol.
Examples of Access Control
Allow my daughter to watch Netflix only before 8pm.
Allow my son to purchase any household item under $10 on Amazon.
My dad can access my security camera, only when I am not home.
Whenever I am out of town, let my secretary read email messages whose subject is marked ‘urgent’.
Allow colleagues to add GitHub issues to my to-do list.
Authors can only read those files with their names in the title.
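
To show how such policies can be grounded in a formal representation, here is a toy Python analogue of an access-control rule as a (principal, action, condition) triple, mirroring the first example above. This is not ThingTalk syntax; the principal name, the action identifier, and the time-based condition are hypothetical stand-ins.

```python
# Toy analogue of a formal access-control policy (not ThingTalk syntax).
# The principal, the action identifier, and the condition are hypothetical,
# mirroring "Allow my daughter to watch Netflix only before 8pm."
from dataclasses import dataclass
from datetime import time
from typing import Callable

@dataclass
class Policy:
    principal: str                      # who is allowed
    action: str                         # which skill/function they may invoke
    condition: Callable[[dict], bool]   # predicate over the request context

policies = [
    Policy(principal="daughter",
           action="com.netflix.play",
           condition=lambda ctx: ctx["now"] < time(20, 0)),
]

def is_allowed(principal: str, action: str, ctx: dict) -> bool:
    """A request is permitted only if some policy matches and its condition holds."""
    return any(p.principal == principal and p.action == action and p.condition(ctx)
               for p in policies)

print(is_allowed("daughter", "com.netflix.play", {"now": time(19, 30)}))  # True
print(is_allowed("daughter", "com.netflix.play", {"now": time(21, 0)}))   # False
```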

Virtual Assistant 2.0 Methodology

Diagram of Genie. Genie takes as input a database schema and data, and produces dialogues with ThingTalk annotations. An example of a dialogue between a user and the agent conversing about restaurants is shown. The generated dialogues are used to train the NL-to-ThingTalk semantic parser and the dialogue agent. The dialogue agent connects back to Genie and the schemas for iterative refinement.

Prior State of the Art

  • Today’s assistants rely heavily on annotating real user utterances with a formal representation.
    • Problems: expensive, poor coverage, error-prone, privacy-invading

Our Approach Reduces Data Acquisition by 2 Orders of Magnitude

  • Train question-answering and dialogue agents with
    • Mostly synthesized data from database schemas and API signatures (a toy sketch of this synthesis follows this list)
      • Teaches neural networks compositionality, with high coverage of complex queries
    • A small amount (a few shots) of real data to teach the network natural language
  • Engineers can refine performance by improving
    • Domain-independent questions and dialogue models to support reuse
    • Domain-specific annotations
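
Below is a minimal sketch of the synthesis idea: domain-independent templates are expanded against a database schema to produce (utterance, formal query) training pairs. The schema, the templates, and the query strings are invented for illustration and are far simpler than Genie's actual grammar.

```python
# Toy sketch of schema-driven data synthesis (far simpler than Genie's grammar).
# The schema, template phrasings, and formal-query strings are hypothetical.
import itertools

schema = {"Restaurant": {"rating": "rating", "reviewCount": "review count"}}

templates = [
    # (natural-language template, formal-query template)
    ("show me {table}s with a {phrase} of at least {value}",
     "{table}() filter {field} >= {value}"),
    ("which {table}s have a {phrase} above {value}?",
     "{table}() filter {field} > {value}"),
]

def synthesize(table, fields, values=(4, 100)):
    """Expand every template against every field and value in the schema."""
    for (nl, query), (field, phrase), value in itertools.product(
            templates, fields.items(), values):
        yield (nl.format(table=table.lower(), phrase=phrase, value=value),
               query.format(table=table, field=field, value=value))

for utterance, program in synthesize("Restaurant", schema["Restaurant"]):
    print(utterance, "->", program)
```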

Publication

High-Quality & Low-Cost Question Answering Agents

Accuracy

  • 12% better than commercial assistants on crowdsourced long-tail complex restaurant questions

Affordability

  • 1% of the original manual annotation cost, for validation

Key technology

  • Generic domain-agnostic grammar templates
  • Pre-trained networks (BERT and BART)
  • A novel BERT-LSTM neural semantic parser (sketched below)
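
A minimal sketch of the BERT-encoder / LSTM-decoder idea follows. It omits the attention and copy mechanisms of the actual parser; the checkpoint name, hidden size, and vocabulary size are assumptions, and this is not the released Genie model.

```python
# Minimal sketch of a BERT-encoder / LSTM-decoder semantic parser
# (not the released Genie model; attention and copying are omitted).
import torch
import torch.nn as nn
from transformers import BertModel

class BertLSTMParser(nn.Module):
    def __init__(self, target_vocab_size: int, hidden_size: int = 768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")
        self.embed = nn.Embedding(target_vocab_size, hidden_size)
        self.decoder = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, target_vocab_size)

    def forward(self, input_ids, attention_mask, target_ids):
        # Encode the utterance with BERT; use the [CLS] vector to
        # initialize the LSTM decoder state (teacher forcing on targets).
        enc = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        h0 = enc.last_hidden_state[:, 0, :].unsqueeze(0)    # (1, batch, hidden)
        c0 = torch.zeros_like(h0)
        dec_out, _ = self.decoder(self.embed(target_ids), (h0, c0))
        return self.out(dec_out)                            # logits over target tokens

# target_vocab_size would be the size of the formal-language vocabulary.
model = BertLSTMParser(target_vocab_size=500)
```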

Available Software

  • Genie toolset and datasets for schemas in Schema.org.

Publications

Results of evaluating Schema2QA on accuracy on long tail restaurant questions. Genie is compared to Alexa, Google, and Siri. Genie achieves over 60% accuracy while the other assistants are below 52%.

Examples of Long-Tail Questions (evaluated on Alexa, Google, Siri, and Genie)
Show me restaurants rated at least 4 stars with at least 100 reviews.
Show restaurants in San Francisco rated higher than 4.5.
What is the highest rated Chinese restaurant near Stanford?
How far is the closest 4 star restaurant?
Find a W3C employee that went to Oxford.
Who worked for Google and lives in Palo Alto?
Who graduated from Stanford and won a Nobel prize?
Who worked for at least 3 companies?
Show me hotels with checkout time later than 12PM.
Which hotel has a pool in this area?

The First Contextual Neural Dialogue Agent

Accuracy

  • Multi-domain MultiWOZ dataset: the first system to demonstrate 70% turn-by-turn accuracy with just 2% of the real training data.

Affordability

  • Needs only 2% of the original annotated training data

Key technology

  • Training data synthesis with an abstract dialogue state machine (a toy sketch follows this list)
  • A unified contextual dialogue-state-tracking neural network that is more robust than intent classification
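
Below is a toy sketch of synthesizing annotated dialogue turns by walking an abstract state machine. The states, slots, and utterance templates are invented and much simpler than Genie's dialogue model.

```python
# Toy sketch of dialogue synthesis from an abstract state machine
# (far simpler than Genie's dialogue model; states and slots are hypothetical).
import random

# Each transition: (next state, user utterance template, slot annotation update)
TRANSITIONS = {
    "start":       [("ask_cuisine", "i want to find a restaurant", {})],
    "ask_cuisine": [("ask_area", "something that serves {cuisine} food", {"cuisine": None}),
                    ("ask_area", "any cuisine is fine", {"cuisine": "dontcare"})],
    "ask_area":    [("done", "in the {area} part of town", {"area": None})],
}
VALUES = {"cuisine": ["italian", "chinese", "thai"], "area": ["north", "centre"]}

def synthesize_dialogue(seed=0):
    """Walk the state machine once, emitting (utterance, dialogue_state) turns."""
    rng, state, belief, turns = random.Random(seed), "start", {}, []
    while state != "done":
        next_state, template, update = rng.choice(TRANSITIONS[state])
        filled = {k: (v if v is not None else rng.choice(VALUES[k]))
                  for k, v in update.items()}
        belief.update(filled)
        turns.append((template.format(**filled), dict(belief)))
        state = next_state
    return turns

for utterance, dialogue_state in synthesize_dialogue(seed=1):
    print(utterance, "->", dialogue_state)
```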

Available Software

  • Genie parser generator, Almond assistant.

Publications

Model Accuracy
Joint Accuracy (MultiWOZ 2.1)
  TRADE (Wu et al., 2019): 45.6
  SUMBT (Lee et al., 2019): 46.7
  DSTQA (Zhou and Small, 2019): 51.2
  DST-Picklist (Zhang et al., 2019): 53.3
  SST (Chen et al., 2020): 55.2
  TripPy (Heck et al., 2020): 55.3
  SimpleTOD (Hosseini-Asl et al., 2020): 55.7
Turn-By-Turn Accuracy (Cleaned Test Set)
  Genie: 71.1

Localize QA Agents for Other Languages in a Day

Accuracy

  • 75-82% accuracy on long-tail restaurant questions.

Affordability

  • Requires no manually annotated data, only human translation for test utterances.

Key technology

  • Train with translations of synthesized English data, keeping named entities in the target language (a simplified sketch follows this list)
  • A new alignment-based translation method using pre-trained Marian models
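
The sketch below shows a simplified placeholder-based variant of this idea with a public Marian English-to-German checkpoint: the entity slot is protected during translation and then filled with the target-language entity. The actual method uses word alignment rather than a placeholder, and the placeholder trick is not guaranteed to survive every translation.

```python
# Simplified sketch of translating synthesized English data while keeping the
# named entity in the target language. The paper's method uses word alignment;
# here we cheat with a placeholder token. The checkpoint is a public Marian model.
from transformers import MarianMTModel, MarianTokenizer

MODEL = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(MODEL)
model = MarianMTModel.from_pretrained(MODEL)

def localize(template_en: str, entity_target: str) -> str:
    """Translate the sentence around a protected entity slot, then drop in
    the target-language entity (e.g. a local dish name)."""
    protected = template_en.replace("{entity}", "XYZ")   # token Marian usually copies through
    batch = tokenizer([protected], return_tensors="pt")
    translated = tokenizer.decode(model.generate(**batch)[0], skip_special_tokens=True)
    return translated.replace("XYZ", entity_target)

print(localize("look for 5 star restaurants that serve {entity}", "Maultaschen"))
```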

Available Software

  • Genie toolset and a restaurant training dataset in 10 languages.

Publications

Restaurant Queries with Localized Entities, by Language
US: look for 5 star restaurants that serve burgers
SA: ابحث عن مطاعم 5 نجوم التي تقدم الشاورما
DE: suchen sie nach 5 sterne restaurants, die maultaschen servieren
ES: busque restaurantes de 5 estrellas que sirvan paella valenciana
IR: به دنبال رستوران‌های 5 ستاره باشید که جوجه کباب سرو می‌کنند
FI: etsi 5 tähden ravintoloita, joissa tarjoillaan karjalanpiirakkaa
IT: cerca ristoranti a 5 stelle che servono bruschette
JP: 寿司を提供する5つ星レストランを探す
PL: poszukaj 5 gwiazdkowych restauracji, które serwują kotlet
TR: köfte servis eden 5 yıldızlı restoranları arayın
CN: 搜索卖北京烤鸭的5星级餐厅
Results of evaluating Genie compared to the state of the art on long-tail restaurant questions in 10 different languages. Genie improves the state of the art across all languages.

Event-Driven Commands in Natural Language

Capability

  • The 1st assistant to support event-driven cross-service commands.

Accuracy

  • 68% accuracy on crowdsourced event-driven commands, using no real training data

Affordability

  • Developer supplies API signatures and annotations on parameters
  • Real annotated data needed only for validation

Key technology

  • ThingTalk: a formal language for trigger-action commands (a toy analogue is sketched after this list)
  • Compositionality: synthesized data teach our neural network to understand unseen combinations.
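
As a toy analogue of the trigger-action structure (not ThingTalk syntax), the sketch below represents a command as a (trigger, filter, action) rule; the weather and reminder functions are hypothetical stand-ins for Thingpedia skills.

```python
# Toy analogue of a trigger-action ("when ... do ...") command, not ThingTalk syntax.
# The stream, filter, and action functions are hypothetical stand-ins for skills.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class Rule:
    trigger: Callable[[], Iterable[dict]]   # event stream to monitor
    condition: Callable[[dict], bool]       # filter over each event
    action: Callable[[dict], None]          # what to do when the rule fires

def weather_forecast():                     # pretend skill: tomorrow's forecast events
    yield {"when": "tomorrow", "condition": "rain"}

def remind(event):                          # pretend skill: create a reminder
    print(f"Reminder: bring an umbrella ({event['condition']} forecast {event['when']})")

# "Remind me to bring an umbrella when rain is forecast tomorrow"
rule = Rule(trigger=weather_forecast,
            condition=lambda e: e["condition"] == "rain",
            action=remind)

for event in rule.trigger():
    if rule.condition(event):
        rule.action(event)
```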

Available

  • Almond assistant, Genie, Thingpedia skill repository, with 100+ popular web services & 1000 IoT devices.

Publications

Examples of Event-Driven Commands, by Area
Weather: Remind me to bring an umbrella when rain is forecast tomorrow.
Finance: When the Microsoft stock drops to $200 and my checking balance is greater than $2000, buy 5 shares.
Home Automation: Email me if my car is not plugged in when parked at home.
Social Media: Whenever I post my profile to Twitter, post it to Facebook.
Security: Send images from my security camera to Dad if motion is detected when I am not home.
Work: Forward emails to my secretary if they are marked urgent.

Chirpy Cardinal: Emotionally Engaging Mixed-Initiative Conversations

An open-domain socialbot based on neural generative models

  • Responsive, personalized user experience, capable of:
    • Talking knowledgeably about a wide variety of topics
    • Chatting empathetically about ordinary life, by prioritizing user interests, feelings, and autonomy.

Results

  • 2nd prize in Alexa Prize Grand Challenge 3, 2020
  • Average rating by real users: 3.6/5.0; median conversation duration: 2.3 minutes

Publication

Multimodal Virtual Assistant Commands on Mobile Devices

Diagram of DoITHere, a multimodal assistant on mobile devices. The user can issue a Query command by selecting an entry box and saying "Insert my Duo code here". They can issue a Do command by selecting the name of a video game and saying "Show me the review of this game". They can issue a Keep command by selecting a portion of the screen and saying "Keep this".

Novelty

  • Minimizes context switching on mobile devices with intuitive multimodal interaction.

Features

  • Query: Ask for the information verbally and point to the destination of the answer
  • Do: Point to some data on the screen and issue a command on the data
  • Keep: Point to a portion of the screen and ask to keep it on top of another app, like a “post-it note”

Extensibility

  • Built on top of the Almond virtual assistant

Results

  • A user study shows that it reduces cognitive load and task completion time

Publication

DIYA: End-User Web Task Automation with Demonstration

Diagram of DIYA: (a) A user sees a cookie recipe on a popular food blog and wants to see how much the ingredients cost. (b) He then enters DIYA’s recording mode using his voice and searches for one of the ingredients on Walmart’s website. (c) He clicks on the first search result and highlights the price, telling DIYA via voice that it should be returned. (d) A few weeks later, he is interested in the "Spaghetti Carbonara" recipe on another food blog. He highlights the ingredients and asks DIYA to run the previously defined program with them. (e) DIYA returns the prices of the items, but also knows to notify him that one of the ingredients is not available on Walmart.com.

Novelty

  • Users automate web tasks with voice commands as they browse.

Key Technology

  • Programming by demonstration: Users describe filters and function applications on selected data items by voice.
  • ThingTalk: a formal language combining web operations with control constructs

Results

  • A user study shows that DIYA is easy to learn and useful for crowdsourced tasks

Publication

The First Conversational Agent Able to Learn from Open-Ended Human Feedback

Novelty

  • If conversational agents want to improve, they need to learn from human interaction
  • We introduce the first technique for learning from open-ended conversations, and an agent that interacted with 236k people online to learn new visual concepts

Key Technology

  • Interaction manifold: identification of a low-dimensional manifold of the larger action space

Available

  • Open source to be released

Publication

  • Socially Situated Artificial Intelligence: Learning to Interact and Interacting to Learn (In preparation)
    Krishna et al.

The First Smart Speaker to Know Where You Are

New Capability

  • Detect head position and orientation with the microphone array in regular smart speakers

Uses

  • Controlling IoT devices, verbal reference resolution in meeting rooms, turn-by-turn indoor guidance

Accuracy

  • Average error: 0.31 m in position, 34° in orientation

Key technology

  • Scalable data collection with virtual reality software
  • A neural network to predict the user’s head position and orientation (sketched below)
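
A minimal sketch of the kind of regression network involved is shown below: flattened multi-channel microphone features in, head position (x, y) and yaw out. The feature dimension, layer sizes, and output units are assumptions; this is not the Soundr model.

```python
# Minimal sketch of a position/orientation regressor (not the Soundr model).
# The input size (flattened multi-channel microphone features) and layer widths
# are hypothetical; the output is (x, y) position plus yaw.
import torch
import torch.nn as nn

class HeadPoseNet(nn.Module):
    def __init__(self, feature_dim: int = 7 * 256):   # e.g. 7 mic channels x 256 features
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 3),                         # x, y, yaw
        )

    def forward(self, mic_features: torch.Tensor) -> torch.Tensor:
        return self.net(mic_features)

model = HeadPoseNet()
fake_batch = torch.randn(4, 7 * 256)                   # stand-in for real audio features
print(model(fake_batch).shape)                         # torch.Size([4, 3])
```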

Publication

Example of using Soundr. A user faces a specific light and says "Turn on this light".

Publications

See here for the paper abstracts.

2022

2021

2020

2019

2018

2017

Team

Senior Members

PhD Students

Master & Undergraduate Students

PhD Alumni

Former Students and Collaborators

We thank them for their valuable contributions.

Acknowledgement

OVAL is supported by the National Science Foundation under Grant No. 1900638, the Alfred P. Sloan Foundation under Grant No. G-2020-13938, and the Verdant Foundation. We also want to thank our partners Alpha Vantage, Baidu, Picovoice, SmartNews, and Yelp for their support.