Workshop Banner: Welcome To The World Wide Voice Web (WWvW). Wednesday, 10 November 2020 09:00-17:00 Stanford University (face-to-face and virtual participation). Workshop Organizers: Prof. Monica Lam, Prof. James Landay, Prof. Chris Manning.

We have built the first "browser" to the World Wide Voice Web, our Genie open-source virtual assistant. To scale up cost-effectively, we have created a Pretrained Agent Generator that can produce transactional dialogue agents from just database schemas, API signatures, and a few samples of natural language utterances.

We are now ready to apply it to the WWW. The idea is to standardize on APIs and provide open pretrained agents that interface to the APIs. For example, restaurants provide a menu and an ordering API in the standardized format, and they get a voice agent for ordering food. We believe an open, decentralized voice web (WWvW) will surpass any proprietary walled gardens.

Decentralization of the voice web promotes equal opportunity, global inclusion and accessibility, and consumer privacy.


Open-Source Software & Datasets

  • Genie: A toolkit to synthesize, train, and build conversational virtual assistants cost-effectively.
  • ThingTalk: An extensible, executable representation language for task-oriented dialogues.
  • Genie NLP: A library of NLP models for the Genie toolkit.
  • Genie Cloud: A multi-user, Kubernetes-enabled Genie runtime, with embedded NLP.
  • Genie Server: A single-user version of Genie for home servers, suitable for running on low-power devices and smart speakers.
  • Genie Devices: A repository of skills created by Genie developers.
Datasets Tutorials
  • Coming soon.

Presentations & Interviews

Accomplishments to Date

System Structure

Genie Toolset

Natural Language Programming

An award-winning socialbot capable of emotionally engaging mixed-initiative conversations

The 1st conversational agent that learns from open-ended human feedback

The 1st context aware assistant that knows where you are

Upcoming Events & News


An Open Virtual Assistant 2.0 Platform


Virtual Assistant 2.0 improves the quality & lowers the cost of dialogue agents with automatic synthesis of high-quality training data

Key Technology and Available Software

  • Thingpedia: an open crowdsourced repository of skills with over 150 skills and 1000+ IoT devices
  • Genie Semantic Parser Generator: Automatic generation of contextual neural semantic parsers from Thingpedia entries. Trained with synthesized data plus manual annotation at 1% of the traditional cost
  • ThingTalk: The first (executable) virtual assistant programming language. A language with formal semantics to enable worldwide collaboration with extensibility, common libraries, neural models, datasets, and tools
  • Almond Assistant: the first assistant that protects privacy


Architecture diagram of Almond. The Thingpedia Open Repository of Skills connects to the Genie Neural Semantic Parser Generator, which produces a Genie-generated Neural Semantic Parser. This parser takes natural language as input and produces ThingTalk, the Virtual Assistant Programming Language. ThingTalk is passed to the Almond Assistant.

The 1st Federated Virtual Assistant that Protects Privacy


  • Almond protects privacy by allowing execution on user devices
  • A federated architecture offers interoperability & choice
  • Users can share digital assets with each other privately
  • Distributed with Home Assistant to 100,000+ users


  • Versatile access control: Natural language specification with formal ThingTalk semantics.
  • Communication protocol: Remote ThingTalk programs allow sharing of all assets accessible to the virtual assistant, with privacy and security.


Diagram of communicating assistants. Two assistants are shown. Each assistant communicates with one user in natural language. The natural language is passed to the Genie Neural Parser to produce ThingTalk, which is passed to the Almond assistant. The two assistants communicate over a standard communication protocol.
Examples of Access Control
Allow my daughter to watch Netflix only before 8pm.
Allow my son to purchase any household item under $10 on Amazon.
My dad can access my security camera, only when I am not home.
Whenever I am out of town, let my secretary read email messages, whose subject is marked ‘urgent’.
Allow colleagues to add GitHub issues to my to-do list.
Authors can only read those files with their names in the title.
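
To make the idea of access control with formal semantics concrete, here is a minimal Python sketch of how two of the examples above could be represented as checkable rules. It is not ThingTalk and not Almond's implementation; the `Policy` class, the `is_allowed` helper, and the function names such as `netflix.watch` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import time
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]


@dataclass(frozen=True)
class Policy:
    """Hypothetical rule: who may invoke which function, under what condition."""
    requester: str
    function: str
    condition: Callable[[Context], bool]


def is_allowed(policies: List[Policy], requester: str, function: str,
               context: Context) -> bool:
    """Grant access only if some policy matches and its condition holds."""
    return any(
        p.requester == requester and p.function == function and p.condition(context)
        for p in policies
    )


policies = [
    # "Allow my daughter to watch Netflix only before 8pm."
    Policy("daughter", "netflix.watch", lambda ctx: ctx["now"] < time(20, 0)),
    # "Allow my son to purchase any household item under $10 on Amazon."
    Policy("son", "amazon.purchase",
           lambda ctx: ctx["category"] == "household" and ctx["price"] < 10),
]

print(is_allowed(policies, "daughter", "netflix.watch", {"now": time(19, 30)}))
print(is_allowed(policies, "son", "amazon.purchase",
                 {"category": "household", "price": 12.0}))
```

Requests that match no policy are denied by default, which is one reasonable design choice for a sketch like this; the source does not say how Almond handles unmatched requests.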

Virtual Assistant 2.0 Methodology

Diagram of Genie. Genie takes as input a database schema and data, and produces dialogues with ThingTalk annotations. An example of a dialogue between a user and an agent conversing about restaurants is shown. The generated dialogues are used to train the NL-to-ThingTalk semantic parser and the dialogue agent. The dialogue agent connects back to Genie and the schemas for iterative refinement.

Prior State of the Art

  • Today’s assistants rely heavily on annotating real user utterances with a formal representation.
    • Problems: expensive, poor coverage, error prone, privacy invading

Our Approach Reduces Data Acquisition by 2 Orders of Magnitude

  • Train question-answering and dialogue agents with
    • Mostly synthesized data from database schemas and API signatures
      • Teaches neural networks compositionally with high coverage of complex queries
    • A few shots of real data to teach the network natural language
  • Engineers can refine performance by improving
    • Domain-independent questions and dialogue models to support reuse
    • Domain-specific annotations
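
The synthesis step can be illustrated with a small Python sketch that expands a toy schema and per-field templates into paired utterances and formal queries, including two-filter compositions. This is a simplification under stated assumptions: `SCHEMA`, `TEMPLATES`, and the `restaurant filter ...` query notation are hypothetical and are neither Genie's template language nor ThingTalk syntax.

```python
from itertools import combinations, product

# Hypothetical toy schema: one table, two fields with sample values.
SCHEMA = {
    "table": "restaurant",
    "fields": {
        "cuisine": ["Chinese", "Italian"],
        "rating": [4, 4.5],
    },
}

# Hypothetical per-field annotations: (natural-language phrase, formal filter).
TEMPLATES = {
    "cuisine": ("that serve {value} food", "cuisine == {value!r}"),
    "rating": ("that are rated at least {value} stars", "rating >= {value}"),
}


def filter_phrases(schema):
    """Map each field to its (phrase, formal filter) pairs, one per sample value."""
    out = {}
    for field, values in schema["fields"].items():
        phrase_t, filter_t = TEMPLATES[field]
        out[field] = [(phrase_t.format(value=v), filter_t.format(value=v))
                      for v in values]
    return out


def synthesize(schema):
    """Return (utterance, formal query) pairs for one- and two-filter queries."""
    table = schema["table"]
    per_field = filter_phrases(schema)
    examples = []
    # Single-filter queries.
    for options in per_field.values():
        for phrase, formal in options:
            examples.append((f"show me {table}s {phrase}",
                             f"{table} filter {formal}"))
    # Compositions of two different fields.
    for f1, f2 in combinations(per_field, 2):
        for (p1, q1), (p2, q2) in product(per_field[f1], per_field[f2]):
            examples.append((f"show me {table}s {p1} and {p2}",
                             f"{table} filter {q1} and {q2}"))
    return examples


for utterance, query in synthesize(SCHEMA):
    print(utterance, "=>", query)
```

Even this toy version shows why synthesis gives broad coverage cheaply: the number of composed examples grows with the product of field values, while the developer only writes one template per field.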


High-Quality & Low-Cost Question Answering Agents


  • 12% better than commercial assistants on crowdsourced long-tail complex restaurant questions


  • 1% of the original manual annotation cost, for validation

Key technology

  • Generic domain-agnostic grammar templates
  • Pre-trained networks (BERT and BART)
  • A novel BERT-LSTM neural semantic parser

Available Software

  • Genie tool set, datasets for schemas in


Results of evaluating Schema2QA on accuracy on long tail restaurant questions. Genie is compared to Alexa, Google, and Siri. Genie achieves over 60% accuracy while the other assistants are below 52%.

Examples of Long-Tail Questions Alexa Google Siri Genie
Show me restaurants rated at least 4 stars with at least 100 reviews.
Show restaurants in San Francisco rated higher than 4.5.
What is the highest rated Chinese restaurant near Stanford?
How far is the closest 4 star restaurant?
Find a W3C employee that went to Oxford
Who worked for Google and lives in Palo Alto?
Who graduated from Stanford and won a Nobel prize?
Who worked for at least 3 companies?
Show me hotels with checkout time later than 12PM
Which hotel has a pool in this area?

The First Contextual Neural Dialogue Agent


  • Multi-domain MultiWOZ dataset: first system to demonstrate 70% turn-by-turn accuracy with just 2% of real training data.


  • Needs only 2% of the original annotated training data

Key technology

  • Training data synthesis with an abstract dialogue state machine
  • A unified contextual dialogue-state-tracking neural network is more robust than intent-classification
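
As a rough illustration of synthesizing dialogues from an abstract state machine, the Python sketch below enumerates every sequence of abstract dialogue acts that reaches an end state within a length bound. The states and transitions are hypothetical and far simpler than anything a real system would use; the source does not describe Genie's actual state machine.

```python
# Hypothetical abstract dialogue acts and the transitions allowed between them.
TRANSITIONS = {
    "greet": ["search"],
    "search": ["refine", "ask_detail", "book"],
    "refine": ["ask_detail", "book"],
    "ask_detail": ["book"],
    "book": ["end"],
}


def enumerate_dialogues(state="greet", max_len=6):
    """Return all act sequences from `state` to "end" with at most max_len acts."""
    if state == "end":
        return [["end"]]
    if max_len <= 1:
        # No room left to reach "end" from a non-final state.
        return []
    paths = []
    for nxt in TRANSITIONS[state]:
        for tail in enumerate_dialogues(nxt, max_len - 1):
            paths.append([state] + tail)
    return paths


for path in enumerate_dialogues():
    print(" -> ".join(path))
```

Each enumerated path is a dialogue skeleton. In a synthesis pipeline, every abstract act would then be expanded into concrete utterances and formal annotations, which is the part this sketch leaves out.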

Available Software

  • Genie parser generator, Almond assistant.


Model                                    Accuracy
Joint Accuracy (MultiWOZ 2.1)
TRADE (Wu et al., 2019)                  45.6
SUMBT (Lee et al., 2019)                 46.7
DSTQA (Zhou and Small, 2019)             51.2
DST-Picklist (Zhang et al., 2019)        53.3
SST (Chen et al., 2020)                  55.2
TripPy (Heck et al., 2020)               55.3
SimpleTOD (Hosseini-Asl et al., 2020)    55.7
Turn-By-Turn Accuracy (Cleaned Test Set)
Genie                                    71.1

Localize QA Agents for Other Languages in a Day


  • 75-82% accuracy on long-tail restaurant questions.


  • Requires no manually annotated data, only human translation for test utterances.

Key technology

  • Train with translations of synthesized English data with named entities in target language
  • New alignment-based translation method using pre-trained Marian models
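
The idea of pairing translated sentences with named entities in the target language can be sketched in Python as follows. This is a deliberate simplification: it assumes the translated templates already exist with an entity placeholder preserved, and it does not implement the alignment-based method or call any Marian model. The dictionaries and the `serves == ...` query notation are hypothetical; the sentences are taken from the examples on this page.

```python
# Translated templates with an entity placeholder (assumed to be given).
TRANSLATED_TEMPLATES = {
    "en": "look for 5 star restaurants that serve {dish}",
    "de": "suchen sie nach 5 sterne restaurants, die {dish} servieren",
    "es": "busque restaurantes de 5 estrellas que sirvan {dish}",
}

# Hypothetical per-language entity lists.
LOCAL_DISHES = {
    "en": ["burgers"],
    "de": ["maultaschen"],
    "es": ["paella valenciana"],
}


def localize(lang):
    """Fill the translated template with entities local to the target language."""
    template = TRANSLATED_TEMPLATES[lang]
    return [
        (template.format(dish=dish),
         f'restaurant filter rating == 5 and serves == "{dish}"')
        for dish in LOCAL_DISHES[lang]
    ]


for lang in TRANSLATED_TEMPLATES:
    for utterance, query in localize(lang):
        print(lang, "|", utterance, "|", query)
```

The formal query keeps the same structure in every language; only the utterance and the entity values change, which is why no manually annotated data is needed in the target language.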

Available Software

  • Genie tool set, restaurant training data set in 10 languages.


Lang Restaurant Queries with Localized Entities
US look for 5 star restaurants that serve burgers
SA ابحث عن مطاعم 5 نجوم التي تقدم الشاورما
DE suchen sie nach 5 sterne restaurants, die maultaschen servieren
ES busque restaurantes de 5 estrellas que sirvan paella valenciana
IR به دنبال رستوران‌های 5 ستاره باشید که جوجه کباب سرو می‌کنند
FI etsi 5 tähden ravintoloita, joissa tarjoillaan karjalanpiirakkaa
IT cerca ristoranti a 5 stelle che servono bruschette
JP 寿司を提供する5つ星レストランを探す
PL poszukaj 5 gwiazdkowych restauracji, które serwują kotlet
TR köfte servis eden 5 yıldızlı restoranları arayın
CN 搜索卖北京烤鸭的5星级餐厅
Results of evaluating Genie compared to the state of the art on long-tail restaurant questions in 10 different languages. Genie improves the state of the art across all languages.

Event-Driven Commands in Natural Language


  • The 1st assistant to support event-driven cross-service commands.


  • 68% of crowdsourced event-driven commands, using no real training data


  • Developer supplies API signatures and annotations on parameters
  • Real annotated data needed only for validation

Key technology

  • ThingTalk: a formal language for trigger-action commands
  • Compositionality: Synthesized data teach our neural network to understand unseen combinations.
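
A trigger-action command pairs an event source and a condition with an action. The Python sketch below shows that shape for two commands in the style of this section. It is not ThingTalk; the `Rule` class, the `run` loop, and the event source names such as `weather.forecast` are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


@dataclass
class Rule:
    """Hypothetical trigger-action rule: when `trigger` fires and `condition` holds, run `action`."""
    trigger: str
    condition: Callable[[Event], bool]
    action: Callable[[Event], str]


def run(rules: List[Rule], events: Iterable[Event]) -> List[str]:
    """Process events in order and collect the result of every action that fires."""
    log = []
    for event in events:
        for rule in rules:
            if event["source"] == rule.trigger and rule.condition(event):
                log.append(rule.action(event))
    return log


rules = [
    # "Remind me to bring an umbrella, when rain is forecast tomorrow"
    Rule("weather.forecast",
         lambda e: e["condition"] == "rain",
         lambda e: "remind: bring an umbrella"),
    # "Forward emails to my secretary if they are marked urgent"
    Rule("email.received",
         lambda e: e.get("urgent", False),
         lambda e: f"forward to secretary: {e['subject']}"),
]

events = [
    {"source": "weather.forecast", "condition": "sunny"},
    {"source": "email.received", "subject": "Budget", "urgent": True},
    {"source": "weather.forecast", "condition": "rain"},
]

print(run(rules, events))
```

Because the trigger, the condition, and the action are independent parts, new commands can be formed by recombining parts that were seen separately, which is the compositionality the synthesized training data is meant to teach.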


  • Almond assistant, Genie, Thingpedia skill repository, with 100+ popular web services & 1000 IoT devices.


Areas            Examples of Event-Driven Commands
Weather          Remind me to bring an umbrella, when rain is forecast tomorrow
Finance          When the Microsoft stock drops to $200, and my checking balance is greater than $2000, buy 5 shares
Home Automation  Email me if my car is not plugged in, when parked at home.
Social Media     Whenever I post my profile to Twitter, post it to Facebook
Security         Send images from my security camera to Dad, if motion is detected when I am not home
Work             Forward emails to my secretary if they are marked urgent

Chirpy Cardinal: Emotionally Engaging Mixed-Initiative Conversations

An open-domain socialbot based on neural generative models

  • Responsive, personalized user experience, capable of:
    • Talking knowledgeably about a wide variety of topics
    • Chatting empathetically about ordinary life, by prioritizing user interests, feelings, and autonomy.


  • 2nd prize in Alexa Prize Grand Challenge 3, 2020
  • Average rating by real users: 3.6/5.0; median conversation duration: 2.3 minutes


Multimodal Virtual Assistant Commands on Mobile Devices

Diagram of DoITHere, a multimodal assistant on mobile devices. The user can issue a Query command by selecting an entry box and saying "Insert my Duo code here". They can issue a Do command, for example by selecting the name of a videogame and saying "Show me the review of this game". They can issue a Keep command by selecting a portion of the screen and saying "Keep this".


  • Minimizes context switching on mobile devices with intuitive multimodal interaction.


  • Query: Ask for the information verbally and point to the destination of the answer
  • Do: Point to some data on the screen and issue a command on the data
  • Keep: Point to a portion of the screen and ask to keep it on top of another app like a “post-it note”


  • Built on top of the Almond virtual assistant


  • A user study shows that it reduces cognitive load and task completion time


DIYA: End-User Web Task Automation with Demonstration

Diagram of DIYA: (a) A user sees a cookie recipe on a popular food blog and wants to see how much the ingredients cost. (b) He then enters DIYA’s recording mode using his voice and searches for one of the ingredients on Walmart’s website. (c) He clicks on the first search result and highlights the price, telling DIYA via voice that it should be returned. (d) A few weeks later, he is interested in the "Spaghetti Carbonara" recipe on another food blog. He highlights the ingredients and asks DIYA to run the previously defined program with them. (e) DIYA returns the prices of the items, but also knows to notify him that one of the ingredients is not available on


  • Users automate web tasks with voice commands as they browse.

Key Technology

  • Programming by demonstration: Users describe filters and function applications on selected data items by voice.
  • ThingTalk: a formal language combining web operations with control constructs


  • A user study shows that DIYA is easy to learn and useful for crowdsourced tasks


The First Conversational Agent Able to Learn from Open-Ended Human Feedback


  • If conversational agents want to improve, they need to learn from human interaction
  • We introduce the first technique for learning from open-ended conversations, and an agent that interacted with 236k people online to learn new visual concepts

Key Technology

  • Interaction manifold: identification of a low-dimensional manifold of the larger action space


  • Open source to be released


  • Socially Situated Artificial Intelligence: Learning to Interact and Interacting to Learn (In preparation)
    Krishna et al.

The First Smart Speaker to Know Where You Are

New Capability

  • Detect head position and orientation with the microphone array in regular smart speakers


  • Controlling IoT devices, verbal reference inference in meeting room, turn-by-turn indoor guidance


  • Average error: 0.31 m in position, 34° in orientation

Key technology

  • Scalable data collection with virtual reality software
  • Neural network to predict user’s head position and orientation


Example of using Soundr. A user faces a specific light and says "Turn on this light".


See here for the paper abstracts.







Senior Members

PhD Students

Master & Undergraduate Students

PhD Alumni

Former Students and Collaborators

We thank them for their valuable contribution.


OVAL is supported by the National Science Foundation under Grant No. 1900638, and by the Alfred P. Sloan Foundation under Grant No. G-2020-13938.