# How Voice Commands Help Computers Find Objects in Pictures and Videos

> A method for using voice commands to tell a computer which object in a photo or video you want to search for, allowing it to automatically isolate that object and perform a visual search.

- **Patent:** US 9098533
- **Original title:** Voice directed context sensitive visual search
- **Owner:** Microsoft Technology Licensing LLC
- **Granted:** 2015
- **Status:** Active
- **Times cited:** 4
- **Field:** consumer electronics, software, AI/ML

## What it does

This patent describes a system that links what you say to what you see on a screen. When you point at an object on a display and ask a question about it, the system uses your voice query to identify that object within the image or video frame. It then selects an edge-detection algorithm (a mathematical tool for finding the boundaries of shapes) suited to the object or context. Finally, it crops the object out of the original image, uses the crop to run a 'reverse visual search' for more information, and shows the results on your screen.
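
The steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the patented implementation: the query is assumed to be already transcribed to text, the frame is a small grayscale grid assumed to contain a single object (so the spoken term only labels the crop), and every name here (`KNOWN_OBJECTS`, `object_term`, `edge_mask`, `bounding_box`, `isolate`) is hypothetical.

```python
from typing import List, Optional, Tuple

Image = List[List[int]]  # grayscale pixels (0-255), row-major

# Hypothetical vocabulary; a real system would rely on a speech
# recognizer and a much richer language model.
KNOWN_OBJECTS = {"hat", "car", "lamp"}


def object_term(query: str) -> Optional[str]:
    """Return the first known object word in the transcribed query."""
    for word in query.lower().split():
        word = word.strip("?.,!")
        if word in KNOWN_OBJECTS:
            return word
    return None


def edge_mask(img: Image, threshold: int) -> List[List[bool]]:
    """Flag pixels that differ sharply from any of their 4 neighbours."""
    h, w = len(img), len(img[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < h and 0 <= nx < w
                if inside and abs(img[y][x] - img[ny][nx]) >= threshold:
                    mask[y][x] = True
                    break
    return mask


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Smallest (top, left, bottom, right) box holding every edge pixel."""
    points = [(y, x) for y, row in enumerate(mask)
              for x, on in enumerate(row) if on]
    if not points:
        return None
    ys = [p[0] for p in points]
    xs = [p[1] for p in points]
    return min(ys), min(xs), max(ys), max(xs)


def crop(img: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut the boxed region (inclusive bounds) out of the image."""
    top, left, bottom, right = box
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def isolate(query: str, img: Image,
            threshold: int = 100) -> Optional[Tuple[str, Image]]:
    """Return (spoken object term, cropped region), or None if either is missing."""
    term = object_term(query)
    if term is None:
        return None
    box = bounding_box(edge_mask(img, threshold))
    if box is None:
        return None
    return term, crop(img, box)


# A 5x5 frame with one bright 2x2 object on a dark background.
frame = [
    [0, 0, 0, 0, 0],
    [0, 0, 200, 200, 0],
    [0, 0, 200, 200, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
result = isolate("What brand is that hat?", frame)
```

Here `result` holds the term `"hat"` and a 4x4 crop: the object plus a one-pixel border, because the background pixels touching the object are also flagged as edges. In a fuller system the crop would then be submitted as the query image for the reverse visual search.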

## What it does NOT cover

- Does not cover general voice-to-text transcription that is not linked to visual object extraction.
- Does not cover visual searches that rely solely on manual user selection or cropping without voice input.
- Does not cover object detection methods that use a single, fixed edge-detection algorithm for all image types.

## The clever bit

The system dynamically selects the best edge-detection algorithm based on the voice query and context, rather than using a one-size-fits-all approach to finding the object's boundaries.
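
One way to picture this selection is a lookup from query context to an edge detector. The sketch below is an assumption-laden illustration rather than the patent's method: it uses two standard operators, Sobel and Prewitt, and the mapping in `CONTEXT_TO_DETECTOR` is arbitrary and hypothetical, chosen only to show the dispatch pattern.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # grayscale pixels, row-major

# Horizontal-gradient kernels of two standard edge operators.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
PREWITT_X = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]


def gradient_magnitude(img: Image, kx: List[List[int]]) -> Image:
    """Approximate edge strength as |gx| + |gy| at each interior pixel."""
    ky = [list(col) for col in zip(*kx)]  # transposed kernel: vertical gradient
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]  # border pixels are left at 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(kx[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(ky[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            out[y][x] = abs(gx) + abs(gy)
    return out


DETECTORS: Dict[str, Callable[[Image], Image]] = {
    "sobel": lambda img: gradient_magnitude(img, SOBEL_X),
    "prewitt": lambda img: gradient_magnitude(img, PREWITT_X),
}

# Hypothetical, arbitrary mapping from spoken context to detector name.
CONTEXT_TO_DETECTOR = {"sign": "prewitt", "logo": "prewitt"}
DEFAULT_DETECTOR = "sobel"


def select_detector(query: str) -> str:
    """Pick a detector name from the first context word found in the query."""
    for word in query.lower().split():
        word = word.strip("?.,!")
        if word in CONTEXT_TO_DETECTOR:
            return CONTEXT_TO_DETECTOR[word]
    return DEFAULT_DETECTOR


# A 3x3 patch with a vertical brightness step on its right side.
patch = [[0, 0, 10], [0, 0, 10], [0, 0, 10]]
name = select_detector("What does that sign say?")
edges = DETECTORS[name](patch)
```

Keeping the detectors in a dictionary means new algorithms or context rules can be added without changing the selection code, which is the contrast with a single fixed detector.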

## Real-world examples

1. Smart TV features that identify actors or products in a movie scene via voice command.
2. Augmented reality shopping apps that isolate items in a live video feed.
3. Digital photo management tools that allow users to search for specific objects within a video.

## Why it matters

This technology is a foundational step toward multimodal interfaces where voice and vision work together. It moves beyond simple keyword searching by allowing users to interact with visual media as if it were a searchable database, which is critical for modern augmented reality and smart assistant applications.

## Frequently asked questions

### What does patent US 9098533 cover?

A method for using voice commands to tell a computer which object in a photo or video you want to search for, allowing it to automatically isolate that object and perform a visual search.

### Who owns patent US 9098533?

Microsoft Technology Licensing LLC owns this patent, granted in 2015.

### When does this patent expire?

This patent is expected to expire on August 4, 2035, after which the invention enters the public domain.

### Which patents cite US 9098533?

This patent has been cited by 4 later patents that build on its ideas.

### What problem does this patent solve?

It addresses the difficulty of telling a computer which object in a photo or video frame you want to search for. Instead of manually selecting or cropping the object, the user names it by voice, and the system isolates it and runs the visual search. This moves beyond simple keyword searching toward multimodal interfaces where voice and vision work together.

### What does this patent NOT cover?

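It does not cover general voice-to-text transcription that is not linked to visual object extraction, visual searches that rely solely on manual user selection or cropping without voice input, or object detection methods that use a single, fixed edge-detection algorithm for all image types.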

**Full plain-English explainer:** https://patentbrief.org/patent/us/9098533/amazon-kinesis

**Original patent:** https://patents.google.com/patent/US9098533

---

_Source: PatentBrief — https://patentbrief.org. Patent facts are from public records; the plain-English explanation is PatentBrief's._
