OWLViz: An Open-World Benchmark for Visual Question Answering

Thuy Nguyen1, Dang Nguyen2, Hoang Nguyen3, Thuan Luong3,
Franck Dernoncourt4, Long Hoang Dang3, Viet Dac Lai4

1Reasoning Foundation   2University of Maryland  
3Posts and Telecommunications Institute of Technology   4Adobe Research

OWLViz is a challenging open-world benchmark designed to evaluate Vision-Language Models and Agents in Visual Question Answering tasks that require multi-step reasoning, tool usage, and external knowledge retrieval. Unlike traditional VQA datasets, OWLViz questions are short, clear, and demand complex reasoning over degraded images and external information sources.

OWLViz Dataset

Dataset Size

OWLViz comprises 248 carefully annotated questions and answers, designed to comprehensively evaluate multi-modal reasoning capabilities.

Question Design

Each question is associated with one or more of the following skill categories:

Visual Skills

Recognition, segmentation, attribute identification, spatial relations

Reasoning Skills

Measurement, arithmetic, logic, counting, comparison

Tool Use Skills

API calls, OCR, GUI interaction, search

Answers are provided in standardized formats such as yes/no, multiple choice, numerical, and short text.
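
The exact release format is not documented on this page; purely as an illustration, a single OWLViz item could be represented along these lines (all field names below are hypothetical, not the released schema):

# Hypothetical sketch of one OWLViz record; the field names are illustrative
# and are not taken from the released files.
example = {
    "question": "How many umbrellas have 3 or more colors? Provide a numeric answer.",
    "image": "images/umbrellas.jpg",   # path or URL of the source image (assumed)
    "answers": ["2"],                  # one or more accepted answer strings
    "answer_format": "numeric",        # yes/no, multiple choice, numeric, or short text
    "skills": ["object recognition", "attribute identification",
               "counting", "object detection"],
    "difficulty": 2,                   # 1-3, based on the unique skills required
}

The skills and difficulty fields mirror the skill categories above and the difficulty levels described below.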

Image Sources

Images were gathered from publicly available websites and selected for their visual difficulty. They simulate realistic challenges such as low brightness, blur, and low contrast.

Annotation Process

The dataset was designed and annotated by the authors through a three-phase process to ensure quality, solvability, and objectivity; only questions that are independently answerable and clearly grounded in the visual content are included.

Difficulty Levels

Questions are categorized into three increasing levels of difficulty based on the number of unique skills required:

Level 1

Typically involves no more than 2 unique skills and at most 1 external tool.

Level 2

Generally between 3 and 5 unique skills, often including a combination of two tools.

Level 3

Designed for an ideal general-purpose assistant, these questions may require arbitrarily long action sequences, unrestricted tool use, and general internet access.
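
Read as rules of thumb, the three definitions above roughly amount to the following check; this is a sketch of the stated criteria, not the authors' exact assignment procedure:

def difficulty_level(skills: set[str], tools: set[str], open_ended: bool = False) -> int:
    """Approximate OWLViz difficulty levels from the counts of unique skills and tools."""
    if open_ended:
        # Level 3: arbitrarily long action sequences, unrestricted tool use,
        # general internet access.
        return 3
    if len(skills) <= 2 and len(tools) <= 1:
        return 1        # Level 1: at most 2 unique skills and 1 external tool
    if 3 <= len(skills) <= 5:
        return 2        # Level 2: typically 3-5 unique skills, often two tools
    return 3

For instance, Example 2 below uses four unique skills (object recognition, attribute identification, counting, object detection) and falls under Level 2.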

Example 1

(a) Question: "How many people are visible on the left side of the white line that cuts across the photo? Provide a numeric answer." Answer: 2
Skills: using external API, human recognition. Difficulty level: 1

Example 2

(b) Question: "How many umbrellas have 3 or more colors? Provide a numeric answer." Answer: 2
Skills: object recognition, attribute identification, counting, object detection. Difficulty level: 2

Example 3

(c) Question: "This is in Fairfax, Virginia. What is the name of the road shown in the photo?" Answer: Shadowridge Dr; Shadowridge drive; Shadowridge
Skills: OCR, knowledge search, knowledge retrieval, GUI, comparison, spatial relationships. Difficulty level: 3

Figure: Examples of the three core challenges in our OWLViz dataset. (a) Challenging visual conditions requiring image enhancement or specialized recognition tools to count people on a white line in a low-contrast night scene. (b) Complex reasoning tasks demanding object detection, attribute identification, and precise counting of multi-colored umbrellas in a dynamic street scene. (c) Knowledge-intensive queries requiring internet exploration and external data retrieval to identify specific locations based on minimal visual cues.

Challenges and Significance

OWLViz pushes the boundaries of multi-modal AI by testing three fundamental capabilities that current systems struggle with:

Visual Degradation

Processing low-quality, blurred, or poorly lit images that mirror real-world conditions

"Count people in a night scene with poor visibility"

Complex Reasoning

Multi-step cognitive processes involving counting, measurement, and logical deduction

"Identify and count multi-colored umbrellas by attributes"

Web Exploration

Internet search and external data retrieval based on minimal visual cues

"Find road name in Fairfax using OCR and web search"

Experiments and Results

Three methodological approaches were systematically evaluated: Vanilla VLMs, Tool-Calling Agents, and GUI Agents.

Human Baseline

69.2% Accuracy

Establishing the upper bound for model performance on these intuitive visual reasoning tasks

Vanilla VLMs

Best: Gemini-2.5-Pro

27.09% LM

Struggled with multi-step reasoning and tool use. Most models scored below 20% EM and 30% LM.

Tool-Calling Agents

HF Agent: 18.32% EM
DynaSaur: 16.23% EM, 26.67% LM

Tool Usage Gain

About +2% EM over the GPT-4o backbone used without tools

GUI Agents

UI-TARS & ShowUI

0.00% EM

Poor performance overall, with LM peaking at 12.80%, reflecting limited interaction capability.

Table 1: Performance of Vision-Language Models on OWLViz
Model                           Language Model    Vision Model          EM (%)  LM (%)

Small Open Source
DeepSeek-VL2-small              -                 -                      11.16   12.75
DeepSeek-VL2                    -                 -                      11.16   14.34
Qwen2-VL-7B-Instruct            Qwen2-7B          QwenViT                12.75   17.93
Qwen2.5-VL-7B-Instruct          Qwen2.5-7B        QwenViT                13.94   19.52
InternVL3-8B                    Qwen2.5-7B        InternViT-300M-v2.5    14.34   21.12
LLaVa-v1.6-mistral-7B           Mistral-7B        CLIP ViT-L/14          14.74   15.54
Llama-3.2-11B-Vision-Instruct   Llama-3.1-8B      -                      14.74   25.10
InternVL2.5-8B                  InternLM2.5-7B    InternViT-300M-v2.5    14.74   18.73
LLaVa-v1.5-13B                  Vicuna-v1.5-13B   CLIP ViT-L/14          16.33   16.33
Molmo-7B-D-0924                 Qwen2-7B          CLIP ViT-L/14          17.13   20.32
LLaVa-v1.5-7B                   Vicuna-v1.5-7B    CLIP ViT-L/14          18.33   19.92

Large Open Source
Qwen2.5-VL-32B-Instruct         Qwen2.5-32B       QwenViT                 2.79   25.90
InternVL2.5-38B                 Qwen2.5-32B       InternViT-6B-v2.5      13.94   19.52
InternVL3-78B                   Qwen2.5-72B       InternViT-6B-v2.5      15.54   20.72
Molmo-72B-0924                  Qwen2-72B         CLIP ViT-L/14          15.94   22.71
InternVL2.5-78B                 Qwen2.5-72B       InternViT-6B-v2.5      15.94   21.91
InternVL3-38B                   Qwen2.5-32B       InternViT-6B-v2.5      16.73   23.11
Qwen2-VL-72B-Instruct           Qwen2-72B         QwenViT                19.92   25.90
Qwen2.5-VL-72B-Instruct         Qwen2.5-72B       QwenViT                20.32   26.29
Llama-3.2-90B-Vision-Instruct   Llama-3.1-70B     -                      20.72   24.70

Proprietary
Claude-3-5-sonnet-20241022      -                 -                      11.55   19.92
GPT-4V                          -                 -                      14.34   20.00
Gemini-2.5-Flash                -                 -                      15.54   25.50
GPT-4o                          -                 -                      16.33   19.52
Gemini-1.5-Pro                  -                 -                      19.52   21.91
Gemini-2.0-Flash                -                 -                      21.51   24.30
Gemini-2.5-Pro                  -                 -                      21.51   27.09

*Models are split into three groups (small open source, large open source, proprietary) and ordered by EM performance within each group.

*GPT-4o is used as the judge model to evaluate semantic equivalence between predicted and ground truth answers, producing the LLM-based Match (LM) accuracy metric.
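
Concretely, EM compares a normalized prediction against the accepted answer strings (a question may list several aliases, e.g., "Shadowridge Dr; Shadowridge drive; Shadowridge"), while LM asks the judge model whether the prediction and ground truth are semantically equivalent. The sketch below assumes a simple normalization rule and judge prompt; neither is the authors' exact implementation:

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace before exact comparison (assumed rule)."""
    return " ".join(text.lower().strip().rstrip(".").split())

def exact_match(prediction: str, accepted: list[str]) -> bool:
    """EM: the normalized prediction must equal one of the accepted answer strings."""
    return normalize(prediction) in {normalize(a) for a in accepted}

def llm_match(prediction: str, accepted: list[str], judge) -> bool:
    """LM: `judge` is any callable wrapping the judge model (e.g., GPT-4o) that takes
    a prompt string and returns its text reply; the prompt wording is assumed."""
    prompt = (
        "Do these two answers mean the same thing? Reply Yes or No.\n"
        f"Ground truth: {' / '.join(accepted)}\n"
        f"Prediction: {prediction}"
    )
    return judge(prompt).strip().lower().startswith("yes")

Under this sketch, exact_match("Shadowridge drive", ["Shadowridge Dr", "Shadowridge drive", "Shadowridge"]) is True, while a wordier prediction such as "It is Shadowridge Drive." misses EM but could still be credited by the LM judge.
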
Table 2: Performance of agentic models with tool use
Model MLLM EM (%) LM (%)
LLaVa-Plus gpt-4o-2024-11-20 0.00 2.50
ViperGPT gpt-4o-2024-11-20 7.56 12.35
GPT4Tools vicuna-7b-v1.5 11.15 14.34
HYDRA gpt-4o-2024-11-20 10.75 12.35
HF Agent gpt-4o-2024-11-20 18.32 24.08
DynaSaur gpt-4o-2024-11-20 16.23 26.67


Table 3: Performance of GUI-based models
Model     EM (%)   LM (%)   Click   Hover   Scroll
UI-TARS   0.00     12.31    0.91    0.51    0.68
ShowUI    0.00     12.80    0.97    0.19    0.10

Key Research Insights

Even state-of-the-art VLMs struggle significantly with open-world reasoning tasks

Tool integration provides measurable but modest improvements in performance

GUI-based approaches currently lack the sophistication for complex multi-modal tasks

Large performance gap between human and AI capabilities highlights research opportunities

Qualitative Analysis

Question & Answer

What is the name of the shop that is located across the street from the lot for sale in this photo? Provide an answer in fewer than 3 words

Any of the following answers are acceptable:

Wheat Bay; Uniquely Chengdu; Wheat Bay Uniquely Chengdu

(Image: map showing the shop location)

Gemini

...Identify the shop across Alder Street from the for-sale lot.

Not identifiable

DynaSaur

The name of the shop across the street is visible in the image. It is "Starbucks Coffee".

ShowUI

I need to...

Action: Scroll

No answer

Figure: Qualitative results comparing different model capabilities on OWLViz. Results demonstrate varying capabilities across model types: Gemini (vanilla VLM) fails to identify the target, DynaSaur (tool-calling agent) produces an incorrect answer despite external search capabilities, and ShowUI (GUI agent) provides no answer.

Limitations

The current evaluation method transforms questions into constrained response types (e.g., multiple-choice, yes/no, numerical, short text) to enable exact-match evaluation. While this ensures consistency, it may increase the likelihood of correct responses by narrowing the output space, potentially overestimating model performance compared to free-form answers.

Accessing the Dataset

The OWLViz dataset is currently kept private to minimize data contamination. Additional information on how to access the data is available upon request.

Acknowledgments

We thank Adobe Research for their financial and technical support.

Citation

BibTeX

@article{nguyen2025owlviz,
  title={OWLViz: An Open-World Benchmark for Visual Question Answering},
  author={Nguyen, Thuy and Nguyen, Dang and Nguyen, Hoang and Luong, Thuan and
          Dernoncourt, Franck and Dang, Long Hoang and Lai, Viet Dac},
  journal={arXiv preprint arXiv:2503.07631},
  year={2025}
}