We are hiring! We are looking for a couple of talented engineers to join the team. Job announcement

Interested in getting involved with the project? Let us know here!

Mission of the OVAL Lab

Voice is the next frontier in computer interfaces.

The mission of the OVAL lab is to advance voice interface technology and make it openly available to the public. Here are the four main themes of our research.

Knowledge in Voice

All knowledge in the world will be available by voice, and in all languages. This represents the next big leap in the evolution of search engines.

Dialogues as Intuitive Software Interfaces

All consumer-facing software will have a dialogue interface. To achieve this goal, we need to reduce the cost of dialogue agent development by 2 orders of magnitude. The answer is tools, tools that empower millions of natural language interface developers.

End-User Task Automation

Professionals and consumers will be able to automate their digital tasks by using a multimodal interface combining voice with graphical user interfaces.

Human-Centered AI

As we aim to fulfill the above vision, we ensure that our technology is accessible to all and that user privacy is protected.

The Open Virtual Assistant Initiative

We launched an initiative in July 2020 to create an open-source virtual assistant infrastructure to support experimentation in research and to provide a basis for collaboration in the industry. This is made possible by a grant from the Alfred P. Sloan Foundation, with the goals of protecting open access to knowledge and protecting privacy. We are seeking collaboration and contributions toward these three specific objectives:

  1. Democratizing NLP with the Genie tool set. Our recent technical breakthrough enables robust dialogue agents to be developed without labor-intensive annotation of training data. Our tool set, called Genie, can create effective dialogue agents given the schema of a domain. All the Genie tools, training data sets, neural models, and libraries are publicly available so companies can easily create and own their voice interfaces.
  2. Creating an open assistant skill repository: Thingpedia. Thingpedia is a crowdsourced repository of assistant skills, which are voice interfaces to IoT devices, web services, and question answering. Contributors specify the domain schemas and API calls, and neural-network based interfaces will be automatically generated using the Genie toolkit. All the skill specifications and neural parsers are non-proprietary and made publicly available to all assistants; like Wikipedia, Thingpedia has the potential to surpass proprietary systems.
  3. Privacy protection through inter-operating Genie assistants. Open-source Genie-based virtual assistants support privacy by allowing individuals to keep their data private. Assistants can inter-operate using a sophisticated and secure communication protocol, similar to email. Users can use natural language to share whatever, and with whomever, they want, and no centralized third party sees the shared data.

We will be making beta releases throughout, with the goal of delivering an open virtual assistant covering the 10 most popular domains by Q3 2021. Here is a stable version of our research prototype, Almond.

Please see our roadmap. This endeavor needs support and contributions from funding agencies, companies, researchers, developers, and individuals. Please contact us.

Our current partners include:

  • Alpha Vantage Open Stock API provides programmatic access to global financial market and currency data.
  • Home Assistant provides a local gateway to over 1000 different IoT devices. Almond is bundled as a voice assistant interface.
  • Smartnews, a news aggregator, is collaborating in creating a news skill.
  • Yelp is providing access to APIs to answer questions about restaurants.

Presentations & Interviews

Accomplishments to Date

System Structure

Genie Toolset

Natural Language Programming

An award-winning socialbot capable of emotionally engaging mixed-initiative conversations

The 1st conversational agent that learns from open-ended human feedback

The 1st context aware assistant that knows where you are

Upcoming Events & News


An Open Virtual Assistant 2.0 Platform


Virtual Assistant 2.0 improves the quality & lowers the cost of dialogue agents with automatic synthesis of high-quality training data

Key Technology and Available Software

  • Thingpedia: an open crowdsourced repository of skills with over 150 skills and 1000+ IoT devices
  • Genie Semantic Parser Generator: Automatic generation of contextual neural semantic parsers from Thingpedia entries. Trained with synthesized data plus 1% of the traditional manual annotation cost
  • ThingTalk: The first (executable) virtual assistant programming language. A language with formal semantics to enable worldwide collaboration with extensibility, common libraries, neural models, datasets, and tools
  • Almond Assistant: the first assistant that protects privacy


Architecture diagram of Almond. The Thingpedia Open Repository of Skills connects to the Genie Neural Semantic Parser Generator, which produces a Genie-generated Neural Semantic Parser. This parser takes natural language as input and produces ThingTalk, the Virtual Assistant Programming Language. ThingTalk is passed to the Almond Assistant.

The 1st Federated Virtual Assistant that Protects Privacy


  • Almond protects privacy by allowing execution on user devices
  • A federated architecture offers interoperability & choice
  • Users can share digital assets with each other privately
  • Distributed with Home Assistant to 100,000+ users


  • Versatile access control: Natural language specification with formal ThingTalk semantics.
  • Communication protocol: Remote ThingTalk programs. Allows sharing of all assets accessible to the virtual assistant with privacy and security.


Diagram of communicating assistants. Two assistants are shown. Each assistant communicates with one user in natural language. The natural language is passed to the Genie Neural Parser to produce ThingTalk, which is passed to the Almond assistant. The two assistants communicate over a standard communication protocol.
Examples of Access Control
Allow my daughter to watch Netflix only before 8pm.
Allow my son to purchase any household item under $10 on Amazon.
My dad can access my security camera, only when I am not home.
Whenever I am out of town, let my secretary read email messages, whose subject is marked ‘urgent’.
Allow colleagues to add GitHub issues to my to-do list.
Authors can only read those files with their names in the title.
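
The access control examples above can be read as rules of the form "who may do what, under which condition". The following is a minimal Python sketch of that idea, not the ThingTalk semantics used by Almond. The names `Request`, `Policy`, `is_allowed`, and the action string `netflix.watch` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import time
from typing import Callable, List


@dataclass(frozen=True)
class Request:
    requester: str   # who is asking (hypothetical identifier)
    action: str      # what they want to do (hypothetical action name)
    when: time       # local time of the request


@dataclass(frozen=True)
class Policy:
    requester: str
    action: str
    condition: Callable[[Request], bool]  # extra constraint on the request


def is_allowed(request: Request, policies: List[Policy]) -> bool:
    """Default deny: a request passes only if some policy matches it."""
    return any(
        p.requester == request.requester
        and p.action == request.action
        and p.condition(request)
        for p in policies
    )


# "Allow my daughter to watch Netflix only before 8pm."
POLICIES = [
    Policy("daughter", "netflix.watch", lambda r: r.when < time(20, 0)),
]

early = is_allowed(Request("daughter", "netflix.watch", time(19, 30)), POLICIES)
late = is_allowed(Request("daughter", "netflix.watch", time(20, 30)), POLICIES)
other = is_allowed(Request("son", "netflix.watch", time(19, 30)), POLICIES)
```

In this sketch anything not explicitly permitted is refused, which is one possible design choice; the source does not state which default the real system uses.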

Virtual Assistant 2.0 Methodology

Diagram of Genie. Genie takes as input a database schema and data, and produces dialogues with ThingTalk annotations. An example of a dialogue between a user and an agent conversing about restaurants is shown. The generated dialogues are used to train the NL to ThingTalk Semantic Parser and the dialogue agent. The dialogue agent connects back to Genie and the schemas for iterative refinement.

Prior State of the Art

  • Today’s assistants rely heavily on annotating real user utterances with a formal representation.
    • Problems: expensive, poor coverage, error prone, privacy invading

Our Approach Reduces Data Acquisition by 2 Orders of Magnitude

  • Train question-answering and dialogue agents with
    • Mostly synthesized data from database schemas and API signatures
      • Teaches neural networks compositionality with high coverage of complex queries
    • A few shots of real data to teach the network natural language
  • Engineers can refine performance by improving
    • Domain-independent questions and dialogue models to support reuse
    • Domain-specific annotations
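
To make the idea of synthesizing training data from a schema concrete, here is a small Python sketch. It enumerates combinations of filters over a toy schema and emits pairs of utterance and formal query. It is not Genie's implementation: the schema layout, the phrase templates, and the target query syntax are all assumptions made for illustration, and the query strings are not real ThingTalk.

```python
from itertools import product

# Hypothetical schema: each field carries a phrase template (a stand-in for
# a domain-specific annotation), a comparison operator, and sample values.
SCHEMA = {
    "table": "restaurant",
    "fields": {
        "cuisine": {
            "phrase": "that serve {value} food",
            "op": "==",
            "values": ["Chinese", "Italian"],
        },
        "rating": {
            "phrase": "rated at least {value} stars",
            "op": ">=",
            "values": [4, 5],
        },
    },
}


def synthesize(schema):
    """Yield (utterance, program) pairs for every non-empty filter combination."""
    table = schema["table"]
    fields = schema["fields"]
    names = sorted(fields)
    # For each field, pick either None (filter absent) or one of its values.
    choices = [[None] + fields[n]["values"] for n in names]
    for combo in product(*choices):
        active = [(n, v) for n, v in zip(names, combo) if v is not None]
        if not active:
            continue  # skip the combination with no filters at all
        phrases = " ".join(fields[n]["phrase"].format(value=v) for n, v in active)
        utterance = f"show me {table}s {phrases}"
        conditions = " && ".join(f"{n} {fields[n]['op']} {v!r}" for n, v in active)
        yield utterance, f"query {table} where {conditions}"


pairs = list(synthesize(SCHEMA))
```

Because the pairs are generated compositionally, adding one field to the schema multiplies the number of covered filter combinations without any manual annotation of utterances.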


High-Quality & Low-Cost Question Answering Agents


  • 12% better than commercial assistants on crowdsourced long-tail complex restaurant questions


  • 1% of the original manual annotation cost, for validation

Key technology

  • Generic domain-agnostic grammar templates
  • Pre-trained networks (BERT and BART)
  • A novel BERT-LSTM neural semantic parser

Available Software

  • Genie tool set, datasets for schemas in schema.org.


Results of evaluating Schema2QA on accuracy on long tail restaurant questions. Genie is compared to Alexa, Google, and Siri. Genie achieves over 60% accuracy while the other assistants are below 52%.

Examples of Long-Tail Questions Alexa Google Siri Genie
Show me restaurants rated at least 4 stars with at least 100 reviews.
Show restaurants in San Francisco rated higher than 4.5.
What is the highest rated Chinese restaurant near Stanford?
How far is the closest 4 star restaurant?
Find a W3C employee that went to Oxford
Who worked for Google and lives in Palo Alto?
Who graduated from Stanford and won a Nobel prize?
Who worked for at least 3 companies?
Show me hotels with checkout time later than 12PM
Which hotel has a pool in this area?

The First Contextual Neural Dialogue Agent


  • Multi-domain MultiWoz dataset: first system to demonstrate 70% turn-by-turn accuracy with just 2% of real training data.


  • Needs only 2% of the original annotated training data

Key technology

  • Training data synthesis with an abstract dialogue state machine
  • A unified contextual dialogue-state-tracking neural network is more robust than intent-classification

Available Software

  • Genie parser generator, Almond assistant.


Model: Accuracy
Joint Accuracy (MultiWOZ 2.1)
TRADE (Wu et al., 2019): 45.6
SUMBT (Lee et al., 2019): 46.7
DSTQA (Zhou and Small, 2019): 51.2
DST-Picklist (Zhang et al., 2019): 53.3
SST (Chen et al., 2020): 55.2
TripPy (Heck et al., 2020): 55.3
SimpleTOD (Hosseini-Asl et al., 2020): 55.7
Turn-By-Turn Accuracy (Cleaned Test Set)
Genie: 71.1

Localize QA Agents for Other Languages in a Day


  • 75-82% accuracy on long-tail restaurant questions.


  • Requires no manually annotated data, only human translation for test utterances.

Key technology

  • Train with translations of synthesized English data with named entities in target language
  • New alignment-based translation method using pre-trained Marian models
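
The idea of pairing translated sentences with named entities from the target language can be sketched as follows. This Python example is only an illustration under stated assumptions: the translated templates are hard-coded (taken from the example queries on this page) rather than produced by a translation model, and the `{dish}` slot, the `localize` function, and the query syntax are hypothetical.

```python
# Hypothetical pre-translated templates with the entity slot kept intact.
# In practice a translation model would produce these from synthesized English data.
TEMPLATES = {
    "en": "look for 5 star restaurants that serve {dish}",
    "es": "busque restaurantes de 5 estrellas que sirvan {dish}",
    "it": "cerca ristoranti a 5 stelle che servono {dish}",
}

# Entities drawn from each target locale instead of translated English ones.
LOCAL_DISHES = {
    "en": ["burgers"],
    "es": ["paella valenciana"],
    "it": ["bruschette"],
}


def localize(lang):
    """Return (utterance, program) pairs for one language with local entities."""
    template = TEMPLATES[lang]
    return [
        (
            template.format(dish=dish),
            # Made-up formal query; the entity appears verbatim in both halves.
            f'query restaurant where rating == 5 && serves == "{dish}"',
        )
        for dish in LOCAL_DISHES[lang]
    ]


spanish = localize("es")
```

Keeping the entity as a slot during translation means the formal query and the sentence always mention the same string, so no manual re-annotation is needed in the target language.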

Available Software

  • Genie tool set, restaurant training data set in 10 languages.


Lang Restaurant Queries with Localized Entities
US look for 5 star restaurants that serve burgers
SA ابحث عن مطاعم 5 نجوم التي تقدم الشاورما
DE suchen sie nach 5 sterne restaurants, die maultaschen servieren
ES busque restaurantes de 5 estrellas que sirvan paella valenciana
IR به دنبال رستوران‌های 5 ستاره باشید که جوجه کباب سرو می‌کنند
FI etsi 5 tähden ravintoloita, joissa tarjoillaan karjalanpiirakkaa
IT cerca ristoranti a 5 stelle che servono bruschette
JP 寿司を提供する5つ星レストランを探す
PL poszukaj 5 gwiazdkowych restauracji, które serwują kotlet
TR köfte servis eden 5 yıldızlı restoranları arayın
CN 搜索卖北京烤鸭的5星级餐厅
Results of evaluating Genie compared to the state of the art on long-tail restaurant questions in 10 different languages. Genie improves the state of the art across all languages.

Event-Driven Commands in Natural Language


  • The 1st assistant to support event-driven cross-service commands.


  • 68% of crowdsourced event-driven commands, using no real training data


  • Developer supplies API signatures and annotations on parameters
  • Real annotated data needed only for validation

Key technology

  • ThingTalk: a formal language for trigger-action commands
  • Compositionality: Synthesized data teach our neural network to understand unseen combinations.
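
A trigger-action command combines an event from one service, an optional condition from another, and an action. The Python sketch below illustrates that structure for the stock example on this page. It is not ThingTalk and not Almond's runtime: the `Rule` class, the state keys, and the choice to fire only when the trigger changes from false to true are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]


@dataclass
class Rule:
    trigger: Callable[[State], bool]    # the monitored event
    condition: Callable[[State], bool]  # extra check, possibly on another service
    action: Callable[[State], str]      # what to do when the rule fires
    _was_true: bool = field(default=False, init=False)

    def on_update(self, state: State) -> List[str]:
        """Fire the action only on the update where the trigger becomes true."""
        now_true = self.trigger(state)
        fired = []
        if now_true and not self._was_true and self.condition(state):
            fired.append(self.action(state))
        self._was_true = now_true
        return fired


# "When the Microsoft stock drops to $200, and my checking balance is
# greater than $2000, buy 5 shares"
rule = Rule(
    trigger=lambda s: s["msft_price"] <= 200,
    condition=lambda s: s["checking_balance"] > 2000,
    action=lambda s: "buy 5 shares of MSFT",
)

log: List[str] = []
for update in [
    {"msft_price": 210, "checking_balance": 2500},  # trigger not met
    {"msft_price": 199, "checking_balance": 2500},  # trigger becomes true: fires
    {"msft_price": 198, "checking_balance": 2500},  # still true: no repeat
    {"msft_price": 205, "checking_balance": 2500},  # trigger resets
    {"msft_price": 200, "checking_balance": 1500},  # trigger true, condition fails
]:
    log.extend(rule.on_update(update))
```

Firing on the transition rather than on every matching update avoids repeating the purchase while the price stays low; whether the real system behaves this way is not stated in the source.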


  • Almond assistant, Genie, Thingpedia skill repository, with 100+ popular web services & 1000 IoT devices.


Areas: Examples of Event-Driven Commands
Weather: Remind me to bring an umbrella, when rain is forecast tomorrow
Finance: When the Microsoft stock drops to $200, and my checking balance is greater than $2000, buy 5 shares
Home Automation: Email me if my car is not plugged in, when parked at home.
Social Media: Whenever I post my profile to Twitter, post it to Facebook
Security: Send images from my security camera to Dad, if motion is detected when I am not home
Work: Forward emails to my secretary if they are marked urgent

Chirpy Cardinal: Emotionally Engaging Mixed-Initiative Conversations

An open-domain socialbot based on neural generative models

  • Responsive, personalized user experience, capable of:
    • Talking knowledgeably about a wide variety of topics
    • Chatting empathetically about ordinary life, by prioritizing user interests, feelings, and autonomy.


  • 2nd prize in Alexa Prize Grand Challenge 3, 2020
  • Average rating by real users: 3.6/5.0; median conversation duration: 2.3 minutes


Multimodal Virtual Assistant Commands on Mobile Devices

Diagram of DoITHere, a multimodal assistant on mobile devices. The user can issue a Query command by selecting an entry box and saying "Insert my Duo code here". They can issue a Do command by selecting the name of a videogame and saying "Show me the review of this game". They can issue a Keep command by selecting a portion of the screen and saying "Keep this".


  • Minimizes context switching on mobile devices with intuitive multimodal interaction.


  • Query: Ask for the information verbally and point to the destination of the answer
  • Do: Point to some data on the screen and issue a command on the data
  • Keep: Point to a portion of the screen and ask to keep it on top of another app like a “post-it note”


  • Built on top of the Almond virtual assistant


  • A user study shows that it reduces cognitive load and task completion time


DIYA: End-User Web Task Automation with Demonstration

Diagram of DIYA: (a) A user sees a cookie recipe on a popular food blog and wants to see how much the ingredients cost. (b) He then enters DIYA’s recording mode using his voice and searches for one of the ingredients on Walmart’s website. (c) He clicks on the first search result and highlights the price, telling DIYA via voice that it should be returned. (d) A few weeks later, he is interested in the "Spaghetti Carbonara" recipe on another food blog. He highlights the ingredients and asks DIYA to run the previously defined program with them. (e) DIYA returns the prices of the items, but also knows to notify him that one of the ingredients is not available on Walmart.com.


  • Users automate web tasks with voice commands as they browse.

Key Technology

  • Programming by demonstration: Users describe filters and function applications on selected data items by voice.
  • ThingTalk: a formal language combining web operations with control constructs
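
The recipe scenario described for DIYA amounts to recording a function once and later applying it to every selected item. The Python sketch below shows that shape only. It does not drive a browser and is not the program DIYA generates: `CATALOG` is a hypothetical stand-in for a store's search results, and `price` and `run_on_selection` are illustrative names.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for what a store search would return.
CATALOG: Dict[str, float] = {"spaghetti": 1.28, "eggs": 2.97}


def price(item: str) -> Optional[float]:
    """Stand-in for the recorded steps: search the item, open the first
    result, and return the highlighted price (None if nothing is found)."""
    return CATALOG.get(item)


def run_on_selection(items: List[str]) -> Tuple[Dict[str, float], List[str]]:
    """Apply the recorded function to each selected item and collect
    the prices together with the items that were not available."""
    prices: Dict[str, float] = {}
    unavailable: List[str] = []
    for item in items:
        found = price(item)
        if found is None:
            unavailable.append(item)
        else:
            prices[item] = found
    return prices, unavailable


prices, unavailable = run_on_selection(["spaghetti", "eggs", "guanciale"])
```

Returning the unavailable items separately mirrors step (e) of the scenario, where the user is notified that one ingredient could not be found.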


  • A user study shows that DIYA is easy to learn and useful for crowdsourced tasks


The First Conversational Agent Able to Learn from Open-Ended Human Feedback


  • If conversational agents want to improve, they need to learn from human interaction
  • We introduce the first technique for learning from open-ended conversations, and an agent that interacted with 236k people online to learn new visual concepts

Key Technology

  • Interaction manifold: identification of a low-dimensional manifold of the larger action space


  • Open source to be released


  • Socially Situated Artificial Intelligence: Learning to Interact and Interacting to Learn (In preparation)
    Krishna et al.

The First Smart Speaker to Know Where You Are

New Capability

  • Detect head position and orientation with the microphone array in regular smart speakers


  • Controlling IoT devices, verbal reference inference in meeting room, turn-by-turn indoor guidance


  • Average error: 0.31 m, 34°

Key technology

  • Scalable data collection with virtual reality software
  • Neural network to predict user’s head position and orientation


Example of using Soundr. A user faces a specific light and says "Turn on this light".


See here for the paper abstracts.







Senior Members

PhD Students

Master & Undergraduate Students

PhD Alumni

Former Students and Collaborators

We thank them for their valuable contributions.


OVAL is supported by the National Science Foundation under Grant No. 1900638, and by the Alfred P. Sloan Foundation under Grant No. G-2020-13938.