Smart Sensing for Humans (SmaSH) Lab - Carnegie Mellon University
Sept 2024 - Present
Objective: To develop AI agents that respond more intelligently by leveraging semantic filtering, natural language processing, and voice activity detection (VAD) to filter out irrelevant or ambient speech. This minimizes false activations and ensures engagement that feels natural and non-intrusive.
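As a rough illustration of the VAD step, an energy threshold can screen out ambient noise before any transcription runs. This is a minimal sketch with an illustrative threshold and frame format, not the trained VAD model a deployed system would use:

```python
import math

def rms(frame):
    """Root-mean-square energy of one audio frame (samples in [-1.0, 1.0])."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=0.02):
    """Crude VAD gate: pass a frame downstream only if its energy clears
    the threshold; anything quieter is treated as ambient noise."""
    return rms(frame) >= threshold
```

A fixed energy threshold confuses loud non-speech with speech, so real deployments typically use a model-based detector (e.g., WebRTC VAD or Silero VAD) instead.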
TOOLS
OpenAI API
Whisper
Figma
SHI Bot: Adaptive Decision-Making for Multimodal Voice Interfaces is designed to support natural, confident interaction, particularly for users unfamiliar with or hesitant toward digital tools. By integrating NLP and semantic analysis, a classification engine evaluates the contextual relevance of input, triggering responses only when appropriate. This reduces false activations and builds user trust. Designed for hands-free, context-aware use, the interface is intuitive, non-intrusive, and empowers users to feel in control, not overwhelmed.
Overview:
Speech-to-text → NLP + Semantic Analysis
Continuously assesses real-time context and user needs to determine when to assist and when to stay passive
Helpful when you need it, invisible when you don’t
Optimized for situational awareness, not constant interaction
► Smart companion, not a source of distraction
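The overview above (speech-to-text feeding NLP + semantic analysis, which decides when to assist and when to stay passive) can be sketched end to end. Everything here is illustrative: `transcribe` stands in for Whisper, and `is_relevant` replaces the LLM-based semantic classifier with a keyword heuristic so the control flow stays visible:

```python
from dataclasses import dataclass

@dataclass
class Decision:
    transcript: str
    action: str  # "assist" or "stay_passive"

def transcribe(audio: str) -> str:
    """Placeholder for the Whisper speech-to-text step; for illustration
    it simply passes the text through."""
    return audio

def is_relevant(transcript: str) -> bool:
    """Stand-in for the NLP + semantic-analysis step. The real system uses
    an LLM to judge contextual relevance; a prefix check illustrates the gate."""
    return transcript.lower().startswith(("hey bot", "assistant"))

def run_pipeline(audio: str) -> Decision:
    """Assist only when the utterance is judged relevant; otherwise stay passive."""
    text = transcribe(audio)
    action = "assist" if is_relevant(text) else "stay_passive"
    return Decision(text, action)
```

Because the relevance gate sits between transcription and response, ambient chatter is transcribed but never acted on.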
Dataset
System Design
Use Cases
Accidental Command Trigger in Passive Conversation
A nearby user casually says “go home” during an unrelated conversation.
▶︎ The voice interface incorrectly interprets the phrase as a navigation command and initiates route guidance.
False Activation from Distant, Irrelevant Speech
A smart speaker overhears "no!" shouted from another room.
▶︎ The smart speaker misinterprets the emotional outburst as a cancellation command and prematurely stops an active task (e.g., timer or music).
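Both failure modes above come from acting on a command phrase alone. A hedged sketch of the fix: require the utterance to be near-field (not overheard from another room) and addressed to the system before any command executes. The `near_field` and `addressed` flags are hypothetical inputs; the real system would derive them from audio features and conversational context:

```python
def gate(transcript: str, near_field: bool, addressed: bool) -> str:
    """Act only when a command is close-range AND directed at the system;
    otherwise stay passive, even if a trigger phrase is present."""
    commands = ("go home", "stop", "cancel", "no")
    is_command = transcript.lower().strip().startswith(commands)
    if not near_field:   # e.g., "no!" shouted from another room
        return "ignore"
    if not addressed:    # e.g., "go home" inside an unrelated conversation
        return "ignore"
    return "act" if is_command else "ignore"
```

Under this gate, both scenarios above resolve to "ignore" while a directed, close-range "go home" still triggers navigation.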
Responds only to support the driver or the driving task.
Retrieves directions or alternate routes so the driver doesn’t have to pause and search
Provides a brief definition or fact if a passenger asks about something they see outside
Reminds the driver when fuel is low
Alerts the driver if traffic conditions ahead have changed
Helpful assistant when you need it, invisible when you don’t
▶︎ Retrieves a specific document so the speaker doesn’t have to pause
▶︎ Records the meeting and documents assignments for next steps
▶︎ Provides a brief definition of a term if asked
▶︎ Reminds participants if the meeting is running long
▶︎ Agent summarizes the meeting at the end
Next Steps
❏ Adaptive UI based on interaction patterns
❏ Real-time prompt editing pipeline
❏ Voice-based prompt revision
❏ Incorporating emotional and tonal cues
❏ Personalized interaction models
Adapt system behavior based on individual user preferences, speech patterns, and interaction history.