Thuy Nguyen¹, Dang Nguyen², Hoang Nguyen³, Thuan Luong³, Franck Dernoncourt⁴, Long Hoang Dang³, Viet Dac Lai⁴

¹Reasoning Foundation · ²University of Maryland · ³Posts and Telecommunications Institute of Technology · ⁴Adobe Research
OWLViz is a challenging open-world benchmark designed to evaluate Vision-Language Models and Agents in Visual Question Answering tasks that require multi-step reasoning, tool usage, and external knowledge retrieval. Unlike traditional VQA datasets, OWLViz questions are short, clear, and demand complex reasoning over degraded images and external information sources.
OWLViz comprises 248 carefully annotated question-answer pairs that together probe a broad range of multi-modal reasoning capabilities.
Each question is associated with one or more of the following skill categories:
- Recognition, segmentation, attribute identification, spatial relations
- Measurement, arithmetic, logic, counting, comparison
- API calls, OCR, GUI interaction, search
Answers are provided in standardized formats like yes/no, multiple choice, and short text.
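To make this structure concrete, here is a minimal sketch of what a single OWLViz record might look like. The field names, the dataclass, and the file path are illustrative assumptions, not the dataset's actual schema; the values are taken from example (b) below.

```python
# Hypothetical example of a single OWLViz record (field names are
# illustrative assumptions, not the released schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class OWLVizExample:
    question: str                  # short, unambiguous natural-language question
    image_path: str                # path or URL of the (possibly degraded) image
    answers: List[str]             # one or more acceptable answer strings
    answer_format: str             # "yes/no", "multiple choice", "numeric", or "short text"
    skills: List[str] = field(default_factory=list)  # e.g. ["OCR", "counting"]
    difficulty: int = 1            # 1-3, based on the number of unique skills required

example = OWLVizExample(
    question="How many umbrellas have 3 or more colors? Provide a numeric answer.",
    image_path="images/street_scene.jpg",  # hypothetical path
    answers=["2"],
    answer_format="numeric",
    skills=["object recognition", "attribute identification", "counting", "object detection"],
    difficulty=2,
)
```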
Images were gathered from publicly available websites and selected for their visual difficulty; they exhibit realistic challenges such as low brightness, blur, and low contrast.
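The selection procedure is not specified here, so as a purely illustrative assumption, the sketch below shows standard heuristics (mean brightness, contrast as pixel standard deviation, variance of the Laplacian for blur) that could flag such visually difficult images. This is not the authors' curation method, and the thresholds are arbitrary.

```python
# Illustrative heuristics for "visually difficult" images (brightness,
# contrast, blur). NOT the authors' selection procedure; just a sketch of
# common measures one might use.
import cv2

def difficulty_signals(path: str) -> dict:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(path)
    return {
        "mean_brightness": float(gray.mean()),   # low -> dark scene
        "contrast_std": float(gray.std()),       # low -> low contrast
        "blur_laplacian_var": float(cv2.Laplacian(gray, cv2.CV_64F).var()),  # low -> blurry
    }

# Example: flag an image as "challenging" using arbitrary thresholds.
signals = difficulty_signals("images/street_scene.jpg")  # hypothetical path
is_challenging = (
    signals["mean_brightness"] < 60
    or signals["contrast_std"] < 30
    or signals["blur_laplacian_var"] < 100
)
```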
The dataset was designed and annotated by the authors through a rigorous three-phase process for quality, solvability, and objectivity, so that only independently answerable questions, clearly grounded in the visual content, are included.
Questions are categorized into three increasing levels of difficulty based on the number of unique skills required (a rough sketch of this mapping follows the list):
- Level 1: typically involves no more than 2 unique skills and at most 1 external tool.
- Level 2: generally between 3 and 5 unique skills, often including a combination of two tools.
- Level 3: designed for an ideal general-purpose assistant; these questions may require arbitrarily long action sequences, unrestricted tool use, and general internet access.
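As a rough illustration only, the level definitions above could be turned into a labeling rule like the one below. The actual difficulty labels were assigned by the annotators, so the thresholds here are assumptions that mirror the wording of the list, not the benchmark's official criteria.

```python
# Rough sketch of the difficulty assignment described above; thresholds
# follow the level definitions, but the real labels come from the authors'
# annotation process, so this is only an approximation.
def difficulty_level(num_unique_skills: int, num_tools: int) -> int:
    if num_unique_skills <= 2 and num_tools <= 1:
        return 1   # Level 1: few skills, at most one external tool
    if num_unique_skills <= 5:
        return 2   # Level 2: 3-5 skills, often combining two tools
    return 3       # Level 3: long action sequences, unrestricted tool use
```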
(a) Question: "How many people are visible on the left side of the white line that cuts across the photo? Provide a numeric answer."
Answer: 2
Skills: using external API, human recognition
Difficulty level: 1
(b) Question: "How many umbrellas have 3 or more colors? Provide a numeric answer."
Answer: 2
Skills: object recognition, attribute identification, counting, object detection
Difficulty level: 2
(c) Question: "This is in Fairfax, Virginia. What is the name of the road shown in the photo?"
Answer: Shadowridge Dr; Shadowridge drive; Shadowridge
Skills: OCR, knowledge search, knowledge retrieval, GUI, comparison, spatial relationships
Difficulty level: 3
Figure: Examples of the three core challenges in our OWLViz dataset. (a) Challenging visual conditions requiring image enhancement or specialized recognition tools to count people on a white line in a low-contrast night scene. (b) Complex reasoning tasks demanding object detection, attribute identification, and precise counting of multi-colored umbrellas in a dynamic street scene. (c) Knowledge-intensive queries requiring internet exploration and external data retrieval to identify specific locations based on minimal visual cues.
OWLViz pushes the boundaries of multi-modal AI by testing three fundamental capabilities that current systems struggle with:
- Challenging visual conditions: processing low-quality, blurred, or poorly lit images that mirror real-world conditions
- Complex reasoning: multi-step cognitive processes involving counting, measurement, and logical deduction
- Knowledge-intensive retrieval: internet search and external data retrieval based on minimal visual cues
Three methodological approaches were systematically evaluated: Vanilla VLMs, Tool-Calling Agents, and GUI Agents.

- Upper bound: 69.2% accuracy, establishing the upper bound for model performance on these intuitive visual reasoning tasks.
- Vanilla VLMs: the best model (Gemini-2.5-Pro) reaches 27.09% LM, while most models score below 20% EM and 30% LM, struggling with multi-step reasoning and tool use.
- Tool-Calling Agents: tool usage yields only a modest gain of roughly +2% EM over vanilla VLMs.
- GUI Agents (UI-TARS & ShowUI): 0.00% EM, with LM peaking at 12.80%, reflecting limited interaction capability.
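To make the distinction between the first two setups concrete, here is a minimal, hypothetical sketch of a vanilla-VLM call versus a simple tool-calling loop. The `call_vlm` callable, the `tools` dictionary, and the `TOOL:` reply convention are placeholders introduced for illustration; they are not the APIs of the frameworks actually benchmarked (ViperGPT, DynaSaur, etc.).

```python
# Hypothetical contrast between a vanilla VLM and a tool-calling agent.
# call_vlm and the entries of `tools` are placeholder callables.
from typing import Callable, Dict

def vanilla_vlm_answer(question: str, image: bytes, call_vlm: Callable) -> str:
    # Single forward pass: the model must answer from the image alone.
    return call_vlm(prompt=question, image=image)

def tool_agent_answer(question: str, image: bytes, call_vlm: Callable,
                      tools: Dict[str, Callable], max_steps: int = 5) -> str:
    # Iterative loop: the model may request a tool (e.g. OCR, web search)
    # before committing to a final answer.
    context = ""
    for _ in range(max_steps):
        reply = call_vlm(prompt=f"{question}\n\nTool results so far:\n{context}",
                         image=image)
        if reply.startswith("TOOL:"):          # e.g. "TOOL: run_ocr"
            tool_name = reply.split(":", 1)[1].strip()
            tool = tools.get(tool_name)
            if tool is None:
                context += f"\n[unknown tool: {tool_name}]"
                continue
            context += f"\n[{tool_name}] {tool(image=image, query=question)}"
        else:
            return reply                        # final answer
    return "No answer"
```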
| Group | Model | Language Model | Vision Model | EM (%) | LM (%) |
|---|---|---|---|---|---|
| Small Open Source | DeepSeek-VL2-small | - | - | 11.16 | 12.75 |
| | DeepSeek-VL2 | - | - | 11.16 | 14.34 |
| | Qwen2-VL-7B-Instruct | Qwen2-7B | QwenViT | 12.75 | 17.93 |
| | Qwen2.5-VL-7B-Instruct | Qwen2.5-7B | QwenViT | 13.94 | 19.52 |
| | InternVL3-8B | Qwen2.5-7B | InternViT-300M-v2.5 | 14.34 | 21.12 |
| | LLaVa-v1.6-mistral-7B | Mistral-7B | CLIP ViT-L/14 | 14.74 | 15.54 |
| | Llama-3.2-11B-Vision-Instruct | Llama-3.1-8B | - | 14.74 | 25.10 |
| | InternVL2.5-8B | InternLM2.5-7B | InternViT-300M-v2.5 | 14.74 | 18.73 |
| | LLaVa-v1.5-13B | Vicuna-v1.5-13B | CLIP ViT-L/14 | 16.33 | 16.33 |
| | Molmo-7B-D-0924 | Qwen2-7B | CLIP ViT-L/14 | 17.13 | 20.32 |
| | LLaVa-v1.5-7B | Vicuna-v1.5-7B | CLIP ViT-L/14 | 18.33 | 19.92 |
| Large Open Source | Qwen2.5-VL-32B-Instruct | Qwen2.5-32B | QwenViT | 2.79 | 25.90 |
| | InternVL2.5-38B | Qwen2.5-32B | InternViT-6B-v2.5 | 13.94 | 19.52 |
| | InternVL3-78B | Qwen2.5-72B | InternViT-6B-v2.5 | 15.54 | 20.72 |
| | Molmo-72B-0924 | Qwen2-72B | CLIP ViT-L/14 | 15.94 | 22.71 |
| | InternVL2.5-78B | Qwen2.5-72B | InternViT-6B-v2.5 | 15.94 | 21.91 |
| | InternVL3-38B | Qwen2.5-32B | InternViT-6B-v2.5 | 16.73 | 23.11 |
| | Qwen2-VL-72B-Instruct | Qwen2-72B | QwenViT | 19.92 | 25.90 |
| | Qwen2.5-VL-72B-Instruct | Qwen2.5-72B | QwenViT | 20.32 | 26.29 |
| | Llama-3.2-90B-Vision-Instruct | Llama-3.1-70B | - | 20.72 | 24.70 |
| Proprietary | Claude-3-5-sonnet-20241022 | - | - | 11.55 | 19.92 |
| | GPT-4V | - | - | 14.34 | 20.00 |
| | Gemini-2.5-Flash | - | - | 15.54 | 25.50 |
| | GPT-4o | - | - | 16.33 | 19.52 |
| | Gemini-1.5-Pro | - | - | 19.52 | 21.91 |
| | Gemini-2.0-Flash | - | - | 21.51 | 24.30 |
| | Gemini-2.5-Pro | - | - | 21.51 | 27.09 |

*Model performance broken down into three groups and ordered by EM within each group. The best and second-best of each group are bolded and underlined, respectively.
*GPT-4o is used as the judge model to evaluate semantic equivalence between predicted and ground-truth answers, producing the LLM-based Match (LM) accuracy metric.

| Model | MLLM | EM (%) | LM (%) |
|---|---|---|---|
| LLaVa-Plus | gpt-4o-2024-11-20 | 0.00 | 2.50 |
| ViperGPT | gpt-4o-2024-11-20 | 7.56 | 12.35 |
| GPT4Tools | vicuna-7b-v1.5 | 11.15 | 14.34 |
| HYDRA | gpt-4o-2024-11-20 | 10.75 | 12.35 |
| HF Agent | gpt-4o-2024-11-20 | 18.32 | 24.08 |
| DynaSaur | gpt-4o-2024-11-20 | 16.23 | 26.67 |
*Tool-calling agent performance. The best and second-best results are bolded and underlined, respectively.
| Model | EM (%) | LM (%) | Click | Hover | Scroll |
|---|---|---|---|---|---|
| UI-TARS | 0.00 | 12.31 | 0.91 | 0.51 | 0.68 |
| ShowUI | 0.00 | 12.80 | 0.97 | 0.19 | 0.10 |
- Even state-of-the-art VLMs struggle significantly with open-world reasoning tasks
- Tool integration provides measurable but modest improvements in performance
- GUI-based approaches currently lack the sophistication for complex multi-modal tasks
- The large performance gap between human and AI capabilities highlights research opportunities
Question: "What is the name of the shop that is located across the street from the lot for sale in this photo? Provide an answer in fewer than 3 words."
Any of the following answers are acceptable: Wheat Bay; Uniquely Chengdu; Wheat Bay Uniquely Chengdu

- Gemini (vanilla VLM): "...Identify the shop across Alder Street from the for-sale lot." → Not identifiable
- DynaSaur (tool-calling agent): "The name of the shop across the street is visible in the image. It is 'Starbucks Coffee'."
- ShowUI (GUI agent): "I need to..." → Action: Scroll → No answer
Figure: Qualitative results comparing different model capabilities on OWLViz. Results demonstrate varying capabilities across model types: Gemini (vanilla VLM) fails to identify the target, DynaSaur (tool-calling agent) produces an incorrect answer despite external search capabilities, and ShowUI (GUI agent) provides no answer.
The current evaluation method transforms questions into constrained response types (e.g., multiple-choice, yes/no, numerical, short text) to enable exact-match evaluation. While this ensures consistency, it may increase the likelihood of correct responses by narrowing the output space, potentially overestimating model performance compared to free-form answers.
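As an illustration of how such constrained answers might be scored, the sketch below pairs a simple exact-match (EM) check with a prompt for the LLM judge used for the LM metric. The normalization rules and the judge prompt are assumptions for illustration, not the benchmark's official implementation.

```python
# Minimal sketch of the two metrics: exact match (EM) after light
# normalization, and an LLM-judged match (LM) via a judge model such as
# GPT-4o. Normalization rules and judge prompt are assumptions.
import re

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation
    text = re.sub(r"\s+", " ", text)      # collapse whitespace
    return text

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    return normalize(prediction) in {normalize(a) for a in gold_answers}

def judge_prompt(question: str, prediction: str, gold_answers: list[str]) -> str:
    # Prompt sent to the judge model (e.g. GPT-4o); it replies "yes" or "no".
    return (
        "Question: " + question + "\n"
        "Ground-truth answers: " + "; ".join(gold_answers) + "\n"
        "Predicted answer: " + prediction + "\n"
        "Is the predicted answer semantically equivalent to any ground-truth "
        "answer? Reply with 'yes' or 'no'."
    )

# Example with the road-name question above:
assert exact_match("Shadowridge Dr.", ["Shadowridge Dr", "Shadowridge drive", "Shadowridge"])
```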
The OWLViz dataset is currently kept private to minimize data contamination. Additional information on how to access the data is available upon request.
We thank Adobe Research for their financial and technical support.
@article{nguyen2025owlviz,
title={OWLViz: An Open-World Benchmark for Visual Question Answering},
  author={Nguyen, Thuy and Nguyen, Dang and Nguyen, Hoang and Luong, Thuan and
          Dernoncourt, Franck and Dang, Long Hoang and Lai, Viet Dac},
journal={arXiv preprint arXiv:2503.07631},
year={2025}
}