Our Research Vision

Computers will transform into effective, personalized, conversational assistants for everybody, including the pre-literate and the non-literate. Commercial chatbots today are notoriously brittle as they are hardcoded to handle a few possible choices of user inputs. Recently introduced large language models (LLMs), such as GPT-3, are remarkably fluent, but they are often erroneous and prone to hallucinations. Our goal is not just to create chit-chat bots, but assistants with a purpose. We focus on studying the science in conversational agents and specifically how to:

  • tame LLMs into robust, trustworthy, and effective conversational agents;
  • ensure that LLMs conform to human values;
  • create socially intelligent agents that can e.g. provide companionship to the elderly and emotional support to improve mental health;
  • make the technology easily used and deployed by non-AI experts.
We envision an open World Wide Voice, as prevalent as the World Wide Web, where voice agents are created once and are accessible via every language and on every device (smart speakers, smart and feature phones, cars).

Our Approach

This research is inspired by the role of the prefrontal cortex (PFC) in the human brain: it performs high cognitive functions (e.g. planning, inhibitory control, attention, long-term memory) by sending top-down control signals to the perceptual, motor, and language cortices.

Our approach is to give developers high-level programmatic control over the LLMs to create a purposeful assistant, using a novel language called GenieScript. This approach combines the best of deep learning and programming systems to produce assistants that can make an impact on our society.

GenieScript makes it possible for non-AI developers to quickly script the high-level conversational flow, along with the specification of grounding functions such as knowledge base and API schemas. The complexities in natural language processing are handled by the underlying Genie system, which incorporates the research results so far:

  • ThingTalk: a formal, expressive, executable meaning representation for task-oriented dialogues
  • Contextual neural semantic parser for translating dialogues into ThingTalk.
  • Automatic generation of training data for dialogues from high-level schema and API specification with the help of LLMs.
  • Multilingual, mixed-initiative, multimodal assistants.
  • Federated privacy-protecting assistants.

Ongoing Research Projects

  • Correctness. Neural semantic parsers today have been built by fine-tuning language models such as BART. Semantic parsing is useful for grounding LLMs, allowing them to answer factual questions correctly by querying external knowledge bases. How can we improve the sample efficiency of semantic parsing with LLMs? How can we leverage common knowledge and common sense reasoning to translate a natural language query into one supported by the external knowledge base? How can we use conversations to disambiguate a query? How do we combine task-oriented conversations with chitchat?
  • Self-learning. Virtual assistants today just follow single commands, and they simply reject any commands they fail to understand. The capability to converse can allow assistants to learn from their mistakes and grow with experience. Is it possible to provide error correction generically once and for all to support all possible applications without burdening the developer?
  • Human values. The traditional approach to teaching LLMs human values (anti-toxicity, elimination of stereotypical biases, promotion of prosocial behavior) is to fine-tune them with hand-annotated data. Our PFC-inspired approach is to control the LLM output externally by adding thoughtfulness to its output. We ask LLMs (1) to generate multiple answers prompted with a desirable behavior such as being empathetic, (2) to reflect upon each candidate generation by assessing how well it is aligned with the desirable behavior, and (3) to choose the highest-rated answer according to their own assessment. We will also experiment with fine-tuning LLMs on these generations, so that generating desirable outputs becomes second nature to them.
  • Social intelligence. To be effective as a communicator, the assistant must also have social intelligence, which includes good conversational skills, knowledge of social norms, listening skills, attunement to others' feelings, social self-efficacy, and impression management skills. LLMs have a lot of embedded knowledge about social intelligence. Can thoughtfulness be applied to hone their skills? How do we use LLMs to gather important user profile and history information in external long-term memory, and how do we make LLMs generate text that reflects the content stored in memory?
  • Multimodal assistants. Future assistants will naturally be multimodal, making the most of audio and graphical user interfaces. Our research is to create a development framework that will automatically support mixed-mode operations with only a minor increase in programming complexity.
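
The three-step thoughtfulness procedure described under Human values (generate several answers, reflect on each, keep the highest rated) can be sketched as follows. This is a minimal illustration, not the lab's implementation: the function names, the prompt wording, and the stub generator and rater are all assumptions standing in for real LLM calls.

```python
from typing import Callable, List, Tuple


def thoughtful_reply(
    user_turn: str,
    behavior: str,
    generate: Callable[[str], str],
    rate: Callable[[str], float],
    num_candidates: int = 3,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Return the candidate reply that the model itself rates highest."""
    # Step 1: sample several answers, each prompted with the desired behavior.
    prompt = f"Reply to the user while being {behavior}.\nUser: {user_turn}\nAssistant:"
    candidates = [generate(prompt) for _ in range(num_candidates)]

    # Step 2: ask the model to reflect on each candidate and score it.
    scored = []
    for candidate in candidates:
        reflection = f"From 0 to 10, how {behavior} is this reply?\nReply: {candidate}"
        scored.append((candidate, rate(reflection)))

    # Step 3: keep the highest-rated candidate (ties go to the earliest one).
    best, _ = max(scored, key=lambda pair: pair[1])
    return best, scored


# Toy stand-ins for LLM calls (hypothetical, for demonstration only).
def make_stub_generator(replies: List[str]) -> Callable[[str], str]:
    remaining = iter(replies)
    return lambda prompt: next(remaining)


def stub_rate(reflection: str) -> float:
    # Counts empathy cue words in the reply; a real system would query the LLM.
    reply = reflection.rsplit("Reply: ", 1)[1].lower()
    cues = ("sorry", "understand", "here for you")
    return float(sum(cue in reply for cue in cues))


replies = [
    "Try going to bed earlier.",
    "I am sorry to hear that. I understand how exhausting it can be.",
    "That happens to everyone.",
]
best, scored = thoughtful_reply(
    "I have not been sleeping well lately.",
    "empathetic",
    make_stub_generator(replies),
    stub_rate,
)
print(best)
```

In a real system, `generate` and `rate` would both call the same LLM, so the selection reflects the model's own assessment rather than an external classifier.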

Ongoing Collaborations

  • Multilingual agent with Tianjin University (China), CEA (France), IIIT (Hyderabad, India), Microsoft (India), Han Yang University (Korea). Create multilingual agents by automatically translating training data into target languages; manual annotation is needed only for few-shot training, validation, and test. We are creating a Chinese, English, French, Hindi, Hindi/English, and Korean dataset based on Wizard-of-Oz conversations with Deyi Xiong, Gaël de Chalendar, Nasredine Semmar, Ponnurangam Kumaraguru, Kalika Bali, Monojit Choudhury, and Jiwon Seo.
  • Social Skill Training Assistants for Individuals with Autism Spectrum Disorder with Stanford Medicine. Creating an empathetic agent that can help individuals on the autism spectrum or with social anxiety improve their social skills. This is a joint research project with Prof. Lynn Koegel.
  • Conversational Agents for Building Information Management (BIM) with Stanford Civil Engineering. Voice interfaces make the massive amount of information in 3D BIM software easily accessible by blue-collar workers on the job. This is a joint research project with Prof. Martin Fischer.

We invite students, researchers, and corporations to join us in advancing the state of the art in conversational agents and applying the technology to real-world use cases.

Recent talk

Cost-Effective Multi-Lingual Task-Oriented Dialogue Agents

Monica Lam
AI Workshop, Stanford Computer Forum 2022, April 4-6, 2022


We have now created a GenieScript system that can be used for experimental prototypes. Our experience shows that the GenieScript system has made it possible for small teams to create useful prototypes quickly.

We have developed a Genie Assistant, which can perform the ten most popular skills. Genie can play songs, podcasts, radio, and news; help find restaurants; answer questions; forecast weather; set timers and reminders; and control IoT devices. It can control 8 kinds of appliances (thermostats, switches, lights, fans, doors, locks, window covers, and vacuum cleaners) and 7 kinds of sensors (temperature, motion, illuminance, humidity, flood, ultra-light, and battery).

  1. Genie runs on the Baidu smart speaker; it runs the full audio stack, which includes acoustic echo cancellation, voice activity detection, wakeword detection (donated by Picovoice), ducking, and speech-to-text and text-to-speech (Microsoft Azure Cognitive Services).

    The First Workshop on the World Wide Voice Web, Nov 20, 2021.

  2. Genie also runs on a Raspberry Pi. It has been distributed as the voice interface to Home Assistant, an open-source home gateway that can connect to over 1000 different IoT devices.

    State of the Open Home Workshop, Home Assistant, Dec 11, 2021.

Genie is used to create a voice interface for 3D Building Information Management software. This enables blue-collar workers to use voice to easily access digital information on the job.

AI Workshop, Stanford Computer Forum 2022, April 4-6, 2022

Open-Source Software & Datasets

  • Genie: A toolkit to synthesize, train, and build conversational virtual assistants cost-effectively.
  • ThingTalk: An extensible, executable representation language for task-oriented dialogues.
  • Genie NLP: A library of NLP models for the Genie toolkit.
  • Genie Cloud: A multi-user, Kubernetes-enabled Genie runtime, with embedded NLP.
  • Genie Server: A single-user version of Genie for home servers, suitable for running on low-power devices and smart speakers.
  • Genie Devices: A repository of skills created by Genie developers.
Datasets Tutorials
  • Coming soon.

Presentations & Interviews

Accomplishments to Date

System Structure

Genie Toolset

Natural Language Programming

An award-winning socialbot capable of emotionally engaging mixed-initiative conversations

The 1st conversational agent that learns from open-ended human feedback

The 1st context aware assistant that knows where you are

Upcoming Events & News


An Open Virtual Assistant 2.0 Platform


Virtual Assistant 2.0 improves the quality & lowers the cost of dialogue agents with automatic synthesis of high-quality training data

Key Technology and Available Software

  • Thingpedia: an open crowdsourced repository of skills with over 150 skills and 1000+ IoT devices
  • Genie Semantic Parser Generator: Automatic generation of contextual neural semantic parsers from Thingpedia entries. Trained with synthesized data plus 1% of the traditional manual annotation cost.
  • ThingTalk: The first (executable) virtual assistant programming language. A language with formal semantics to enable worldwide collaboration with extensibility, common libraries, neural models, datasets, and tools
  • Almond Assistant: the first assistant that protects privacy


Architecture diagram of Almond. The Thingpedia Open Repository of Skills connects to the Genie Neural Semantic Parser Generator, which produces a Genie-generated Neural Semantic Parser. This takes natural language as input and produces ThingTalk, the Virtual Assistant Programming Language. ThingTalk is passed to the Almond Assistant.

The 1st Federated Virtual Assistant That Protects Privacy


  • Almond protects privacy by allowing execution on user devices
  • A federated architecture offers interoperability & choice
  • Users can share digital assets with each other privately
  • Distributed with Home Assistant to 100,000+ users


  • Versatile access control: Natural language specification with formal ThingTalk semantics.
  • Communication protocol: Remote ThingTalk programs allow sharing of all assets accessible to the virtual assistant with privacy and security.


Diagram of communicating assistants. Two assistants are shown. Each assistant communicates with one user in natural language. The natural language is passed to the Genie Neural Parser to produce ThingTalk, which is passed to the Almond assistant. The two assistants communicate over a standard communication protocol.
Examples of Access Control
Allow my daughter to watch Netflix only before 8pm.
Allow my son to purchase any household item under $10 on Amazon.
My dad can access my security camera, only when I am not home.
Whenever I am out of town, let my secretary read email messages, whose subject is marked ‘urgent’.
Allow colleagues to add GitHub issues to my to-do list.
Authors can only read those files with their names in the title.
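
To make the idea of access control with formal semantics concrete, here is a minimal sketch of how three of the example sentences above could be represented and checked once parsed. It is written in Python rather than ThingTalk, and every name in it (`Request`, `Policy`, `is_allowed`, the context keys) is a hypothetical choice for illustration, not part of Almond.

```python
from dataclasses import dataclass, field
from datetime import time
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class Request:
    requester: str            # who is asking, e.g. "daughter"
    skill: str                # which service, e.g. "netflix"
    action: str               # which operation, e.g. "watch"
    context: Dict[str, Any] = field(default_factory=dict)  # facts known at request time


@dataclass(frozen=True)
class Policy:
    requester: str
    skill: str
    action: str
    condition: Callable[[Request], bool]

    def permits(self, request: Request) -> bool:
        return (
            (self.requester, self.skill, self.action)
            == (request.requester, request.skill, request.action)
            and self.condition(request)
        )


def is_allowed(policies: List[Policy], request: Request) -> bool:
    """Default deny: a request passes only if some policy explicitly permits it."""
    return any(policy.permits(request) for policy in policies)


POLICIES = [
    # "Allow my daughter to watch Netflix only before 8pm."
    Policy("daughter", "netflix", "watch", lambda r: r.context["now"] < time(20, 0)),
    # "Allow my son to purchase any household item under $10 on Amazon."
    Policy("son", "amazon", "purchase",
           lambda r: r.context["category"] == "household" and r.context["price_usd"] < 10),
    # "My dad can access my security camera, only when I am not home."
    Policy("dad", "security_camera", "view", lambda r: not r.context["owner_home"]),
]

evening = Request("daughter", "netflix", "watch", {"now": time(19, 30)})
late = Request("daughter", "netflix", "watch", {"now": time(20, 30)})
print(is_allowed(POLICIES, evening), is_allowed(POLICIES, late))
```

The default-deny check is a design assumption of this sketch: anything not explicitly granted by a policy is refused.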

Virtual Assistant 2.0 Methodology

Diagram of Genie. Genie takes as input a database schema and data, and produces dialogues with ThingTalk annotations. An example of a dialogue between a user and an agent conversing about restaurants is shown. The generated dialogues are used to train the NL-to-ThingTalk semantic parser and the dialogue agent. The dialogue agent connects back to Genie and the schemas for iterative refinement.

Prior State of the Art

  • Today’s assistants rely heavily on annotating real user utterances with a formal representation.
    • Problems: expensive, poor coverage, error prone, privacy invading

Our Approach Reduces Data Acquisition by 2 Orders of Magnitude

  • Train question-answering and dialogue agents with
    • Mostly synthesized data from database schemas and API signatures
      • Teaches neural networks compositionally with high coverage of complex queries
    • A few shots of real data to teach the network natural language
  • Engineers can refine performance by improving
    • Domain-independent questions and dialogue models to support reuse
    • Domain-specific annotations
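
As a rough illustration of synthesizing training data from a database schema, the sketch below expands per-field annotations into paired utterances and formal queries, including combinations of filters. The schema layout, the annotation format, and the query syntax are all assumptions made for this example; they are not Genie's templates or ThingTalk syntax.

```python
from itertools import combinations, product
from typing import Any, Dict, List, Tuple

# Hypothetical schema: each field carries a natural-language phrase template,
# a formal filter template, and a few sample values.
SCHEMA: Dict[str, Any] = {
    "table": "restaurant",
    "fields": {
        "cuisine": {"phrase": "that serve {v} food",
                    "filter": 'cuisine == "{v}"',
                    "values": ["Chinese", "Italian"]},
        "rating": {"phrase": "rated at least {v} stars",
                   "filter": "rating >= {v}",
                   "values": [4, 4.5]},
        "review_count": {"phrase": "with at least {v} reviews",
                         "filter": "review_count >= {v}",
                         "values": [100]},
    },
}


def synthesize(schema: Dict[str, Any], max_filters: int = 2) -> List[Tuple[str, str]]:
    """Enumerate (utterance, query) pairs for every combination of up to max_filters fields."""
    table, fields = schema["table"], schema["fields"]
    examples = []
    for n in range(1, max_filters + 1):
        for names in combinations(fields, n):
            value_lists = [fields[name]["values"] for name in names]
            for values in product(*value_lists):
                phrases = [fields[name]["phrase"].format(v=v) for name, v in zip(names, values)]
                filters = [fields[name]["filter"].format(v=v) for name, v in zip(names, values)]
                utterance = f"show me {table}s " + " ".join(phrases)
                query = f"{table} where " + " and ".join(filters)
                examples.append((utterance, query))
    return examples


examples = synthesize(SCHEMA)
for utterance, query in examples[:3]:
    print(utterance, "=>", query)
```

Because the pairs are enumerated compositionally, multi-filter questions are covered without anyone having to annotate them by hand; in this sketch, a small amount of real data would then be needed only to teach natural phrasing.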


High-Quality & Low-Cost Question Answering Agents


  • 12% better than commercial assistants on crowdsourced long-tail complex restaurant questions


  • 1% of the original manual annotation cost, for validation

Key technology

  • Generic domain-agnostic grammar templates
  • Pre-trained networks (BERT and BART)
  • A novel BERT-LSTM neural semantic parser

Available Software

  • Genie tool set, datasets for schemas in schema.org.


Results of evaluating Schema2QA on accuracy on long tail restaurant questions. Genie is compared to Alexa, Google, and Siri. Genie achieves over 60% accuracy while the other assistants are below 52%.

Examples of Long-Tail Questions Alexa Google Siri Genie
Show me restaurants rated at least 4 stars with at least 100 reviews.
Show restaurants in San Francisco rated higher than 4.5.
What is the highest rated Chinese restaurant near Stanford?
How far is the closest 4 star restaurant?
Find a W3C employee that went to Oxford
Who worked for Google and lives in Palo Alto?
Who graduated from Stanford and won a Nobel prize?
Who worked for at least 3 companies?
Show me hotels with checkout time later than 12PM
Which hotel has a pool in this area?

The First Contextual Neural Dialogue Agent


  • Multi-domain MultiWOZ dataset: first system to demonstrate 70% turn-by-turn accuracy with just 2% of real training data.


  • Needs only 2% of the original annotated training data

Key technology

  • Training data synthesis with an abstract dialogue state machine
  • A unified contextual dialogue-state-tracking neural network is more robust than intent-classification

Available Software

  • Genie parser generator, Almond assistant.


Model                                   Accuracy
Joint Accuracy (MultiWOZ 2.1)
TRADE (Wu et al., 2019)                 45.6
SUMBT (Lee et al., 2019)                46.7
DSTQA (Zhou and Small, 2019)            51.2
DST-Picklist (Zhang et al., 2019)       53.3
SST (Chen et al., 2020)                 55.2
TripPy (Heck et al., 2020)              55.3
SimpleTOD (Hosseini-Asl et al., 2020)   55.7
Turn-By-Turn Accuracy (Cleaned Test Set)
Genie                                   71.1

Localize QA Agents for Other Languages in a Day


  • 75-82% accuracy on long-tail restaurant questions.


  • Requires no manually annotated data, only human translation for test utterances.

Key technology

  • Train with translations of synthesized English data with named entities in target language
  • New alignment-based translation method using pre-trained Marian models

Available Software

  • Genie tool set, restaurant training data set in 10 languages.


Lang Restaurant Queries with Localized Entities
US look for 5 star restaurants that serve burgers
SA ابحث عن مطاعم 5 نجوم التي تقدم الشاورما
DE suchen sie nach 5 sterne restaurants, die maultaschen servieren
ES busque restaurantes de 5 estrellas que sirvan paella valenciana
IR به دنبال رستوران‌های 5 ستاره باشید که جوجه کباب سرو می‌کنند
FI etsi 5 tähden ravintoloita, joissa tarjoillaan karjalanpiirakkaa
IT cerca ristoranti a 5 stelle che servono bruschette
JP 寿司を提供する5つ星レストランを探す
PL poszukaj 5 gwiazdkowych restauracji, które serwują kotlet
TR köfte servis eden 5 yıldızlı restoranları arayın
CN 搜索卖北京烤鸭的5星级餐厅
Results of evaluating Genie compared to the state of the art on long-tail restaurant questions in 10 different languages. Genie improves the state of the art across all languages.

Event-Driven Commands in Natural Language


  • The 1st assistant to support event-driven cross-service commands.


  • 68% of crowdsourced event-driven commands, using no real training data


  • Developer supplies API signatures and annotations on parameters
  • Real annotated data needed only for validation

Key technology

  • ThingTalk: a formal language for trigger-action commands
  • Compositionality: Synthesized data teach our neural network to understand unseen combinations.


  • Almond assistant, Genie, Thingpedia skill repository, with 100+ popular web services & 1000 IoT devices.


Areas Examples of Event-Driven Commands
Weather Remind me to bring an umbrella, when rain is forecast tomorrow
Finance When the Microsoft stock drops to $200, and my checking balance is greater than $2000, buy 5 shares
Home Automation Email me if my car is not plugged in, when parked at home.
Social Media Whenever I post my profile to Twitter, post it to Facebook
Security Send images from my security camera to Dad, if motion is detected when I am not home
Work Forward emails to my secretary if they are marked urgent
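
The trigger-action structure behind commands like these can be sketched as a small rule engine: each rule names an event source, a condition on the event, and an action on another service. This is an illustrative Python sketch under assumed names (`Rule`, `RuleEngine`, the event fields, and the action strings are all hypothetical), not ThingTalk or the Almond runtime.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]


@dataclass
class Rule:
    source: str                          # service whose events trigger the rule
    condition: Callable[[Event], bool]   # filter on the event's fields
    action: Callable[[Event], str]       # command issued to another service


class RuleEngine:
    def __init__(self) -> None:
        self.rules: List[Rule] = []

    def add(self, rule: Rule) -> None:
        self.rules.append(rule)

    def dispatch(self, source: str, event: Event) -> List[str]:
        """Run every rule registered for `source` whose condition holds."""
        return [
            rule.action(event)
            for rule in self.rules
            if rule.source == source and rule.condition(event)
        ]


engine = RuleEngine()
# "Forward emails to my secretary if they are marked urgent"
engine.add(Rule(
    source="email.received",
    condition=lambda e: "urgent" in e["labels"],
    action=lambda e: f"email.forward(to='secretary', subject={e['subject']!r})",
))
# "Send images from my security camera to Dad, if motion is detected when I am not home"
engine.add(Rule(
    source="camera.motion",
    condition=lambda e: not e["owner_home"],
    action=lambda e: f"message.send(to='dad', image={e['image']!r})",
))

fired = engine.dispatch("email.received", {"labels": ["urgent"], "subject": "Budget"})
print(fired)
```

Here the actions only return strings describing the command they would issue; a real assistant would invoke the target service instead.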

Chirpy Cardinal: Emotionally Engaging Mixed-Initiative Conversations

An open-domain socialbot based on neural generative models

  • Responsive, personalized user experience, capable of:
    • Talking knowledgeably about a wide variety of topics
    • Chatting empathetically about ordinary life, by prioritizing user interests, feelings, and autonomy.


  • 2nd prize in Alexa Prize Grand Challenge 3, 2020
  • Average rating by real users: 3.6/5.0; median conversation duration: 2.3 minutes


Multimodal Virtual Assistant Commands on Mobile Devices

Diagram of DoITHere, multi-modal assistant on mobile devices. The user can issue a query command by selecting an entry box and saying "Insert my Duo code here". They can issue a Do command such as selecting the name of a videogame and saying "Show me the review of this game". They can issue a Keep command by selecting a portion of the screen and saying "Keep this".


  • Minimizes context switching on mobile devices with intuitive multimodal interaction.


  • Query: Ask for the information verbally and point to the destination of the answer
  • Do: Point to some data on the screen and issue a command on the data
  • Keep: Point to a portion of the screen and ask to keep it on top of another app like a “post-it note”


  • Built on top of the Almond virtual assistant


  • A user study shows that it reduces cognitive load and task completion time


DIYA: End-User Web Task Automation with Demonstration

Diagram of DIYA: (a) A user sees a cookie recipe on a popular food blog and wants to see how much the ingredients are. (b) He then enters DIYA’s recording mode using his voice and searches for one of the ingredients on Walmart’s website. (c) He clicks on the first search result and highlights the price, telling DIYA via voice that it should be returned. (d) A few weeks later, he is interested in the "Spaghetti Carbonara" recipe on another food blog. He highlights the ingredients and asks DIYA to run the previously defined program with them. (e) DIYA returns the prices of the items, but also knows to notify him that one of the ingredients is not available on Walmart.com.


  • Users automate web tasks with voice commands as they browse.

Key Technology

  • Programming by demonstration: Users describe filters and function applications on selected data items by voice.
  • ThingTalk: a formal language combining web operations with control constructs


  • A user study shows that DIYA is easy to learn and useful for crowdsourced tasks


The First Conversational Agent Able to Learn from Open-Ended Human Feedback


  • If conversational agents want to improve, they need to learn from human interaction
  • We introduce the first technique for learning from open-ended conversations, and an agent that interacted with 236k people online to learn new visual concepts

Key Technology

  • Interaction manifold: identification of a low-dimensional manifold of the larger action space


  • Open source to be released


  • Socially Situated Artificial Intelligence: Learning to Interact and Interacting to Learn (In preparation)
    Krishna et al.

The First Smart Speaker to Know Where You Are

New Capability

  • Detect head position and orientation with the microphone array in regular smart speakers


  • Controlling IoT devices, verbal reference inference in meeting room, turn-by-turn indoor guidance


  • Average error: 0.31 m, 34°

Key technology

  • Scalable data collection with virtual reality software
  • Neural network to predict user’s head position and orientation


Example of using Soundr. A user faces a specific light and says "Turn on this light".


See here for the paper abstracts.








Senior Members

PhD Students

Master & Undergraduate Students

PhD Alumni

Former Students and Collaborators

We thank them for their valuable contribution.


OVAL is supported by the National Science Foundation under Grant No. 1900638, the Alfred P. Sloan Foundation under Grant No. G-2020-13938, Microsoft, and the Verdant Foundation. We also want to thank our partners Alpha Vantage, Baidu, Picovoice, and Yelp for their support.