Computers will transform into effective, personalized, conversational assistants for everybody, including the pre-literate and the non-literate. Commercial chatbots today are notoriously brittle because they are hardcoded to handle only a few possible user inputs. Recently introduced large language models (LLMs), such as GPT-3, are remarkably fluent, but they are often erroneous and prone to hallucinations. Our goal is not just to create chitchat bots, but assistants with a purpose. We focus on the science of conversational agents, and specifically on how to:
tame LLMs into robust, trustworthy, and effective conversational agents;
ensure that LLMs conform to human values;
create socially intelligent agents that can, for example, provide companionship to the elderly and emotional support to improve mental health;
make the technology easily used and deployed by non-AI experts.
We envision an open World Wide Voice, as prevalent as the World Wide Web, where voice agents are created once and are accessible via every language and on every device (smart speakers, smart and feature phones, cars).
Our Approach
This research is inspired by the role of the prefrontal cortex (PFC) in the human brain: it performs high cognitive functions (e.g. planning, inhibitory control, attention, long-term memory) by sending top-down control signals to the perceptual, motor, and language cortices.
Our approach is to give developers high-level programmatic control over the LLMs to create a purposeful assistant, using a novel language called GenieScript. This approach combines the best of deep learning and programming systems to produce assistants that can make an impact on our society.
GenieScript makes it possible for non-AI developers to quickly script the high-level conversational flow, along with the specification of grounding functions such as knowledge-base and API schemas; a minimal sketch of this division of labor follows the list below. The complexities of natural language processing are handled by the underlying Genie system, which incorporates our research results so far:
ThingTalk: a formal, expressive, executable meaning representation for task-oriented dialogues.
Contextual neural semantic parser for translating dialogues into ThingTalk.
Automatic generation of training data for dialogues from high-level schema and API specification with the help of LLMs.
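The sketch below illustrates this division of labor in plain Python; it is not GenieScript syntax, and every name and the toy domains are hypothetical. The grounding functions and the high-level flow are what a developer would write, while the semantic parser here is only a stub standing in for Genie's contextual neural parser.

from dataclasses import dataclass

# --- grounding functions the developer declares (hypothetical examples) -----
def get_weather(city: str) -> str:
    """Stand-in for a real weather API declared in the skill's schema."""
    return f"It is 18 degrees and cloudy in {city}."

def set_timer(minutes: int) -> str:
    return f"Timer set for {minutes} minutes."

# --- stand-in for the system's contextual neural semantic parser ------------
@dataclass
class Command:
    intent: str
    slots: dict

def parse(utterance: str) -> Command:
    # A real system would use a trained parser; this stub only keeps the
    # example runnable.
    text = utterance.lower()
    if "weather" in text:
        return Command("weather", {"city": "Seattle"})
    if "timer" in text:
        return Command("timer", {"minutes": 10})
    return Command("unknown", {})

# --- the developer's high-level conversational flow -------------------------
def dialogue_turn(utterance: str) -> str:
    cmd = parse(utterance)
    if cmd.intent == "weather":
        return get_weather(cmd.slots["city"])
    if cmd.intent == "timer":
        return set_timer(cmd.slots["minutes"])
    return "Sorry, I didn't understand. Could you rephrase?"

print(dialogue_turn("what's the weather like in Seattle"))
print(dialogue_turn("set a timer for ten minutes"))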
Correctness. Neural semantic parsers today are typically built by fine-tuning pretrained language models such as BART. Semantic parsing is useful for grounding LLMs, allowing them to answer factual questions correctly by querying external knowledge bases. How can we improve the sample efficiency of semantic parsing with LLMs? How can we leverage common knowledge and common-sense reasoning to translate a natural language query into one supported by the external knowledge base? How can we use conversations to disambiguate a query? How do we combine task-oriented conversations with chitchat?
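As one illustration of the grounding idea, the following sketch maps a question to a SQL query over a toy knowledge base and answers only from the retrieved rows. The llm_to_sql function is a hypothetical stand-in for a fine-tuned semantic parser or a prompted LLM, and the schema and data are invented for the example.

import sqlite3

SCHEMA = "CREATE TABLE restaurant (name TEXT, cuisine TEXT, city TEXT, rating REAL)"

def llm_to_sql(question: str, schema: str) -> str:
    # Stand-in for a semantic parser or a prompted LLM; a real system would
    # prompt with the schema and few-shot examples and parse the model output.
    return ("SELECT name, rating FROM restaurant "
            "WHERE cuisine = 'japanese' AND city = 'Palo Alto' "
            "ORDER BY rating DESC LIMIT 1")

def answer(question: str) -> str:
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    db.executemany("INSERT INTO restaurant VALUES (?, ?, ?, ?)", [
        ("Sakura", "japanese", "Palo Alto", 4.4),
        ("Napoli", "italian", "Palo Alto", 4.6),
    ])
    sql = llm_to_sql(question, SCHEMA)
    if not sql.lstrip().upper().startswith("SELECT"):
        # Guard against hallucinated or unsafe queries.
        return "Sorry, I can't answer that."
    rows = db.execute(sql).fetchall()
    if not rows:
        return "I couldn't find an answer in the knowledge base."
    name, rating = rows[0]
    # The reply is composed from retrieved facts, not free-form generation.
    return f"{name} ({rating} stars)"

print(answer("What is the best Japanese restaurant in Palo Alto?"))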
Self-learning. Virtual assistants today just follow single commands, and they simply reject any commands they fail to understand. The ability to converse allows assistants to learn from their mistakes and grow with experience. Is it possible to provide error correction generically, once and for all, to support all possible applications without burdening the developer?
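A minimal sketch of what such a generic correction loop could look like, under stated assumptions: the contextual parser is stubbed, the ThingTalk-like string and all names are illustrative, and pairing a failed utterance with the parse of the user's successful rephrasing stands in for learning from experience; this is not the actual Genie mechanism.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Turn:
    utterance: str
    parse: Optional[str] = None

@dataclass
class Dialogue:
    history: list = field(default_factory=list)
    new_examples: list = field(default_factory=list)

def parse_in_context(history, utterance: str) -> Optional[str]:
    # Stand-in for a contextual semantic parser: it only "understands" the
    # request once the user names the city explicitly.
    if "in seattle" in utterance.lower():
        return '@weather.current(location="Seattle")'
    return None

def handle(dialogue: Dialogue, utterance: str) -> str:
    parse = parse_in_context(dialogue.history, utterance)
    dialogue.history.append(Turn(utterance, parse))
    if parse is None:
        return "Sorry, I didn't get that. Could you say it another way?"
    # If the previous turn failed, keep (failed utterance, successful parse)
    # as a new training example so the assistant improves with experience.
    if len(dialogue.history) > 1 and dialogue.history[-2].parse is None:
        dialogue.new_examples.append((dialogue.history[-2].utterance, parse))
    return f"Executing {parse}"

d = Dialogue()
print(handle(d, "how's it looking out there in the emerald city"))
print(handle(d, "I mean the weather in Seattle"))
print(d.new_examples)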
Human values. The traditional approach to teaching LLMs human values (anti-toxicity, elimination of stereotypical biases, promotion of prosocial behavior) is to fine-tune them with hand-annotated data. Our PFC-inspired approach is to control the LLM output externally by adding thoughtfulness to its generation. We ask LLMs (1) to generate multiple answers prompted with a desirable behavior such as being empathetic, (2) to reflect upon each candidate generation by assessing how well it aligns with the desirable behavior, and (3) to choose the highest-rated answer according to their own assessment. We will also experiment with fine-tuning LLMs on these generations, so that producing desirable outputs becomes second nature.
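A minimal sketch of this generate-reflect-select loop: both the candidate generation and the rating step below are stand-ins for LLM calls (real prompts would ask the model to sample several replies and to grade each one against the desired behavior), so the canned replies and scoring heuristic are purely illustrative.

def generate_candidates(context: str, behavior: str, n: int = 3) -> list:
    # Stand-in for n sampled LLM generations prompted with the desired
    # behavior, e.g. "Respond to the user in an empathetic way."
    return [
        "Everyone fails sometimes; just move on.",
        "That sounds really hard. Do you want to talk about what happened?",
        "I'm sorry to hear that. I'm here if you want to vent.",
    ][:n]

def rate_alignment(candidate: str, behavior: str) -> float:
    # Stand-in for asking the LLM "On a scale of 1-5, how {behavior} is this
    # reply?" and parsing the number out of its answer.
    empathy_markers = ("sorry", "sounds really hard", "here if you")
    return 1.0 + sum(marker in candidate.lower() for marker in empathy_markers)

def thoughtful_reply(context: str, behavior: str = "empathetic") -> str:
    candidates = generate_candidates(context, behavior)
    scored = [(rate_alignment(c, behavior), c) for c in candidates]
    return max(scored)[1]  # keep the reply the model itself rates highest

print(thoughtful_reply("I failed my driving test again today."))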
Social intelligence. To be an effective communicator, the assistant must also have social intelligence, which includes good conversational skills, knowledge of social norms, listening skills, attunement to others' feelings, social self-efficacy, and impression management skills. LLMs embed a great deal of knowledge about social intelligence. Can thoughtfulness be applied to hone these skills? How do we use LLMs to gather important user profile information and history in external long-term memory, and how do we make LLMs generate text that reflects the content stored in that memory?
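One way this could be wired up, sketched under strong simplifying assumptions: fact extraction and retrieval below are keyword-based stand-ins for LLM-based extraction and embedding search, and the grounded prompt is returned rather than sent to a model.

import re

MEMORY = []  # external long-term memory, kept outside the LLM

def extract_facts(utterance: str) -> list:
    # Stand-in for prompting an LLM to pull out profile facts worth remembering.
    m = re.search(r"my (\w+)'s name is (\w+)", utterance.lower())
    return [f"The user's {m.group(1)} is named {m.group(2).capitalize()}."] if m else []

def retrieve(query: str, k: int = 3) -> list:
    # Stand-in for embedding-based retrieval; here a simple keyword overlap.
    words = set(query.lower().split())
    scored = sorted(MEMORY, key=lambda fact: -len(words & set(fact.lower().split())))
    return scored[:k]

def reply(utterance: str) -> str:
    MEMORY.extend(extract_facts(utterance))
    facts = retrieve(utterance)
    prompt = "Known facts:\n" + "\n".join(facts) + f"\nUser: {utterance}\nAssistant:"
    # A real system would send `prompt` to the LLM; we only show the grounding.
    return prompt

print(reply("my dog's name is Biscuit and he hates thunderstorms"))
print(reply("any ideas to calm my dog down tonight?"))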
Multimodal assistants. Future assistants will naturally be multimodal, making the most of audio and graphical user interfaces. Our research aims to create a development framework that automatically supports mixed-mode operation with only a minor increase in programming complexity.
Ongoing Collaborations
Multilingual agents, with Tianjin University (China), CEA (France), IIIT (Hyderabad, India), Microsoft (India), and Hanyang University (Korea). We create multilingual agents by automatically translating training data into target languages; manual annotation is needed only for few-shot training, validation, and test sets. We are creating a Chinese, English, French, Hindi, Hindi/English, and Korean dataset based on Wizard-of-Oz conversations, in collaboration with Deyi Xiong, Gaël de Chalendar, Nasredine Semmar, Ponnurangam Kumaraguru, Kalika Bali, Monojit Choudhury, and Jiwon Seo.
Social Skill Training Assistants for Individuals with Autism Spectrum Disorder, with Stanford Medicine. We are creating an empathetic agent that can help individuals on the autism spectrum or with social anxiety improve their social skills. This is a joint research project with Prof. Lynn Koegel.
Conversational Agents for Building Information Management (BIM), with Stanford Civil Engineering. Voice interfaces make the massive amount of information in 3D BIM software easily accessible to blue-collar workers on the job. This is a joint research project with Prof. Martin Fischer.
We invite students, researchers, and corporations to join us in advancing the state of
the art in conversational agents and applying the technology to real-world use cases.
Monica Lam
AI Workshop, Stanford Computer Forum 2022, April 4-6, 2022
Demonstrations
We have now built a GenieScript system that can be used for experimental prototyping. Our experience shows that it enables small teams to create useful prototypes quickly.
We have developed a Genie Assistant that can perform the ten most popular skills. Genie can play songs, podcasts, radio, and the news; help find restaurants; answer questions; forecast the weather; set timers and reminders; and control IoT devices. It can control 8 different kinds of appliances (thermostats, switches, lights, fans, doors, locks, window covers, and vacuum cleaners), and it can monitor 7 different kinds of sensors (temperature, motion, illuminance, humidity, flood, ultra-light, and battery).
Genie is running on the Baidu smart speaker; it runs the full audio stack, which includes acoustic echo cancellation, voice activity detection, wake-word detection (donated by Picovoice), ducking, and speech-to-text and text-to-speech (Microsoft Azure Cognitive Services).
The First Workshop on the World Wide Voice Web, Nov 20, 2021.
Genie also runs on a Raspberry Pi. It has been distributed as the voice
interface to Home Assistant, an open-source home gateway that can connect
to over 1000 different IoT devices.
State of the Open Home Workshop, Home Assistant, Dec 11, 2021.
Genie is used to create a voice interface for 3D Building Information Management (BIM) software, enabling blue-collar workers to use voice to easily access digital information on the job.
AI Workshop, Stanford Computer Forum 2022, April 4-6, 2022
Open-Source Software & Datasets
Software
Genie:
A toolkit to synthesize, train, and build conversational virtual assistants cost-effectively.
ThingTalk:
An extensible, executable representation language for task-oriented dialogues.
Genie NLP:
A library of NLP models for the Genie toolkit.
Genie Cloud:
A multi-user, Kubernetes-enabled Genie runtime with embedded NLP.
Genie Server:
A single-user version of Genie for home servers, suitable for running on low-power devices and smart
speakers.
Genie Devices:
A repository of skills created by Genie developers.
Monica Lam delivered an invited talk
Scaling the World Wide Voice Web with Open Standards and Pretrained Agents
at the Automated Knowledge Base Construction Conference (AKBC 2021).
Monica Lam delivered a keynote on Genie: An Open Privacy-Preserving Virtual Assistant with Deep
Learning at the International Symposium on Computer Architecture (ISCA 2021). The video is
available on
YouTube.
Jackie Yang presented
DoThisHere: Multimodal Interaction to Improve Cross-Application Tasks on Mobile Devices at
the
ACM Symposium on User Interface Software and Technology (UIST).
Congratulations to Dr. Michael Fischer for earning his Ph.D. degree!
Congratulations to the Chirpy Cardinal team, led by Ashwin Paranjape, Abigail See, and Christopher Manning, for winning 2nd place in the Alexa Prize.
Monica Lam gave an invited talk, Let's Build an Open Programmable Virtual Assistant with
Privacy, at the First Workshop on Natural Language Interfaces in the ACL 2020 conference.
Virtual Assistant 2.0 improves the quality & lowers the cost of dialogue agents with automatic
synthesis of high-quality training data
Key Technology and Available Software
Thingpedia: an open, crowdsourced repository of skills, with over 150 skills and 1000+ IoT devices
Genie Semantic Parser Generator: automatic generation of contextual neural semantic parsers from Thingpedia entries, trained with synthesized data plus manual annotation at 1% of the traditional cost
ThingTalk: the first executable virtual assistant programming language, with formal semantics to enable worldwide collaboration, with extensibility, common libraries, neural models, datasets, and tools
Almond Assistant: the first assistant that protects privacy
The First Conversational Agent Able to Learn from Open-Ended Human Feedback
Novelty
For conversational agents to improve, they need to learn from human interaction
We introduce the first technique for learning from open-ended conversations, and an agent that
interacted with 236k people online to learn new visual concepts
Key Technology
Interaction manifold: identification of a low-dimensional manifold of the larger action space
Available
Open source to be released
Publication
Socially Situated Artificial Intelligence: Learning to Interact and Interacting to Learn (In
preparation)
Krishna et al.
The First Smart Speaker to Know Where You Are
New Capability
Detect head position and orientation with the microphone array in regular smart speakers
Grounding Open-Domain Instructions to Automate Web Support Tasks
Nancy Xu, Sam Masling, Michael Du, Giovanni Campagna, Larry Heck, James Landay, Monica S. Lam
Proceedings of the 2021 Annual Conference of the North American Chapter of the Association for
Computational Linguistics (NAACL-HLT 2021), June 2021.
OVAL is supported by the National Science Foundation under Grant No. 1900638, the Alfred P. Sloan Foundation under Grant No. G-2020-13938, Microsoft, and the Verdant Foundation. We also want to thank our partners Alpha Vantage, Baidu,
Picovoice, and Yelp for their support.