Are VLMs Ready for Autonomous Driving?
An Empirical Study from the Reliability, Data, and Metric Perspectives

Shaoyuan Xie1       Lingdong Kong2,3       Yuhao Dong2,4       Chonghao Sima2,5
Wenwei Zhang2       Qi Alfred Chen1       Ziwei Liu4       Liang Pan2
1University of California, Irvine         2Shanghai AI Laboratory         3National University of Singapore
4S-Lab, Nanyang Technological University         5The University of Hong Kong

Abstract

Recent advancements in Vision-Language Models (VLMs) have sparked interest in their use for autonomous driving, particularly in generating interpretable driving decisions through natural language. However, the assumption that VLMs inherently provide visually grounded, reliable, and interpretable explanations for driving remains largely unexamined. To address this gap, we introduce DriveBench, a benchmark dataset designed to evaluate VLM reliability across 17 settings (clean, corrupted, and text-only inputs), encompassing 19,200 frames, 20,498 question-answer pairs, three question types, four mainstream driving tasks, and a total of 12 popular VLMs. Our findings reveal that VLMs often generate plausible responses derived from general knowledge or textual cues rather than true visual grounding, especially under degraded or missing visual inputs. This behavior, concealed by dataset imbalances and insufficient evaluation metrics, poses significant risks in safety-critical scenarios like autonomous driving. We further observe that VLMs struggle with multi-modal reasoning and display heightened sensitivity to input corruptions, leading to inconsistencies in performance. To address these challenges, we propose refined evaluation metrics that prioritize robust visual grounding and multi-modal understanding. Additionally, we highlight the potential of leveraging VLMs’ awareness of corruptions to enhance their reliability, offering a roadmap for developing more trustworthy and interpretable decision-making systems in real-world autonomous driving contexts. The benchmark toolkit is publicly accessible.


  DriveBench: Driving with VLMs


Overview of key features and configurations in DriveBench. Our benchmark evaluates the reliability and visual grounding of VLMs in autonomous driving across four mainstream driving tasks (perception, prediction, planning, and behavior) under a diverse spectrum of 17 settings (clean, corrupted, and text-only inputs). It includes 19,200 frames and 20,498 QA pairs spanning three question types: multiple-choice, open-ended, and visual grounding. By addressing diverse tasks and conditions, we aim to reveal VLM limitations and promote reliable, interpretable autonomous driving.
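For concreteness, the 17 settings decompose into 1 clean setting, 15 corruption settings, and 1 text-only setting. The Python sketch below shows one way such a configuration could be enumerated; the individual corruption names are illustrative placeholders grouped by the five categories used in the robustness analysis, not the exact identifiers from the DriveBench toolkit.

# A minimal sketch of the 17 evaluation settings: 1 clean + 15 corruptions + 1 text-only.
# Corruption names are illustrative placeholders, not the toolkit's exact identifiers.

TASKS = ["perception", "prediction", "planning", "behavior"]
QUESTION_TYPES = ["mcq", "open_ended", "visual_grounding"]

CORRUPTIONS = {
    "weather":      ["brightness", "dark", "fog"],             # e.g., "Dark" appears in the examples below
    "external":     ["water_splash", "lens_obstacle", "dirt"],
    "sensor":       ["gaussian_noise", "frame_lost", "saturation"],
    "motion":       ["motion_blur", "zoom_blur", "defocus_blur"],
    "transmission": ["jpeg_compression", "pixelation", "bit_error"],
}

SETTINGS = (
    ["clean"]
    + [name for group in CORRUPTIONS.values() for name in group]
    + ["text_only"]
)
assert len(SETTINGS) == 17  # 1 clean + 15 corruptions + 1 text-only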


  Are Existing VLMs Ready for Autonomous Driving?

We investigate this question through the lenses of reliability, data quality, and evaluation metrics. Our findings reveal that current VLMs often fabricate convincing answers to driving-related questions, even in the absence of visual information. These fabricated responses can bypass existing evaluation metrics, including GPT scores, due to issues such as dataset imbalance, insufficient contextual data, and flawed evaluation protocols. These observations challenge the widely held assumption that VLMs inherently provide more reliable, visually grounded, and interpretable responses than task-specific models in driving scenarios.





Benchmark Comparison

Comparisons with existing "Driving with Language" benchmarks. Coverage of perception, prediction, behavior, planning, and robustness evaluation is indicated by icons in the original table; the remaining statistics are summarized below.

Benchmark          Frames (Test)  QA (Test)  Logic  Evaluation Metrics
BDD-X              -              -          None   Language
BDD-OIA            -              -          None   F1 Score
nuScenes-QA        36,114         83,337     None   Acc
Talk2Car           ~1.8k          2,447      None   -
nuPrompt           ~36k           ~6k        None   AMOTA
DRAMA              -              ~14k       Chain  Language
Rank2Tell          -              -          Chain  Acc, Language
DriveMLLM          880            -          None   Acc
DriveVLM           -              -          None   GPTctx
DriveLM            4,794          15,480     Graph  Language, GPT
DriveBench (Ours)  19,200         20,498     Graph  Acc, Language, GPT, GPTctx


Unique Data Collection

We analyze existing “Driving with Language” benchmarks and identify their issues, particularly the dataset imbalance inherited from sources such as nuScenes, BDD, and Waymo Open. Our benchmark addresses these issues by curating a balanced dataset with diverse driving tasks, corruption types, and text-only inputs, enabling systematic evaluation of VLMs under real-world autonomous driving conditions. This ensures a reliable testbed for assessing VLMs in safety-critical scenarios.
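As a sketch of the balancing step, over-represented behavior classes can be down-sampled before building QA pairs. The snippet below is illustrative only; the `behavior` field name and the `balance_by_label` helper are hypothetical and not part of the released toolkit.

import random
from collections import defaultdict

def balance_by_label(samples, label_key="behavior", seed=0):
    """Down-sample over-represented classes (e.g., 'Going Ahead') so that every
    behavior label contributes equally. `samples` is assumed to be a list of QA
    dicts; the `behavior` field name is hypothetical."""
    random.seed(seed)
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample[label_key]].append(sample)
    cap = min(len(items) for items in buckets.values())  # size of the rarest class
    balanced = []
    for items in buckets.values():
        balanced.extend(random.sample(items, cap))
    random.shuffle(balanced)
    return balanced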



Challenging Cases in Existing Dataset

(a): The black sedan is turning left, as indicated by its turn signals. (b): The black sedan is turning right. GPT-4o predicts "Going Ahead" for both cases.

(c) and (d) are both "Turning Right", but GPT-4o fails to locate the objects from their center pixel positions due to overlap and occlusion.



Spatial Distribution

We study the spatial distribution of predictions generated by Qwen2-VL (7B) under text-only prompts. We find that the model can "guess" MCQ answers without any visual information by leveraging plain-text cues, e.g., the camera names and coordinate positions mentioned in the questions, resulting in hallucinated yet plausible answers.
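A simple way to expose this behavior is to pose the same question with and without the image and compare the answers. The sketch below assumes a hypothetical `query_vlm` wrapper around the model under test; it is not part of the benchmark code.

def text_only_probe(question, image_path, query_vlm):
    """Ask the same MCQ with and without the image. `query_vlm(question, image)`
    is a hypothetical wrapper around the model under test (e.g., Qwen2-VL) that
    returns the raw text answer."""
    with_image = query_vlm(question, image=image_path)
    without_image = query_vlm(question, image=None)  # text-only prompt
    # If the two answers agree far above chance, the model is likely exploiting
    # textual cues (camera names, pixel coordinates) rather than the image.
    return with_image, without_image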





  Benchmark Study

                              Perception              Prediction              Planning                Behavior
Model       Size  Type        Clean   Corr.   T.O.    Clean   Corr.   T.O.    Clean   Corr.   T.O.    Clean   Corr.   T.O.
Human       -     -           47.67   38.32   -       -       -       -       -       -       -       69.51   54.09   -
GPT-4o      -     Commercial  35.37   35.25   36.48   51.30   49.94   49.05   75.75   75.36   73.21   45.40   44.33   50.03
LLaVA-1.5   7B    Open        23.22   22.95   22.31   22.02   17.54   14.64   29.15   31.51   32.45   13.60   13.62   14.91
LLaVA-1.5   13B   Open        23.35   23.37   22.37   36.98   37.78   23.98   34.26   34.99   38.85   32.99   32.43   32.79
LLaVA-NeXT  7B    Open        24.15   19.62   13.86   35.07   35.89   28.36   45.27   44.36   27.58   48.16   39.44   11.92
InternVL2   8B    Open        32.36   32.68   33.60   45.52   37.93   48.89   53.27   55.25   34.56   54.58   40.78   20.14
Phi-3       4.2B  Open        22.88   23.93   28.26   40.11   37.27   22.61   60.03   61.31   46.88   45.20   44.57   28.22
Phi-3.5     4.2B  Open        27.52   27.51   28.26   45.13   38.21   4.92    31.91   28.36   46.30   37.89   49.13   39.16
Oryx        7B    Open        17.02   15.97   18.47   48.13   46.63   12.77   53.57   55.76   48.26   33.92   33.81   23.94
Qwen2-VL    7B    Open        28.99   27.85   35.16   37.89   39.55   37.77   57.04   54.78   41.66   49.07   47.68   54.48
Qwen2-VL    72B   Open        30.13   26.92   17.70   49.35   43.49   5.57    61.30   63.07   53.35   51.26   49.78   39.46
DriveLM     7B    Specialist  16.85   16.00   8.75    44.33   39.71   4.70    68.71   67.60   65.24   42.78   40.37   27.83
Dolphins    7B    Specialist  9.59    10.84   11.01   32.66   29.88   39.98   52.91   53.77   60.98   8.81    8.25    11.92
Table. Evaluations of VLMs across different driving tasks (perception, prediction, planning, and behavior). Clean denotes clean image inputs; Corr. denotes corrupted image inputs, averaged across fifteen corruptions; T.O. denotes text-only evaluation. For humans, we only evaluate MCQ questions in the perception and behavior tasks. The evaluations are based on GPT scores, with detailed rubrics tailored for each task and question type.
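As a rough sketch of how such rubric-based GPT scoring can be implemented, the snippet below queries GPT-4o through the OpenAI Python client with a grading prompt. The rubric text and the single-number reply format are assumptions for illustration; the actual DriveBench rubrics are task- and question-type-specific.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = (
    "You are grading an answer from a driving-scene VQA model. "
    "Considering the reference answer, score the prediction from 0 to 100 "
    "for correctness and visual grounding. Reply with a single number."
)  # illustrative rubric; the actual rubrics differ per task and question type

def gpt_score(question: str, reference: str, prediction: str) -> float:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Reference: {reference}\n"
                f"Prediction: {prediction}"
            )},
        ],
    )
    return float(response.choices[0].message.content.strip())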



   Robustness Analysis

                              Weather                 External                Sensor                  Motion                  Transmission
Model       Size  Type        MCQ     VQA     CAP     MCQ     VQA     CAP     MCQ     VQA     CAP     MCQ     VQA     CAP     MCQ     VQA     CAP
GPT-4o      -     Commercial  57.20   57.28   54.90   29.25   56.60   61.98   44.25   54.95   56.53   34.25   59.20   56.25   36.83   53.95   57.57
LLaVA-1.5   7B    Open        69.70   35.49   35.91   26.50   29.17   34.95   18.83   30.64   33.15   71.25   33.43   35.18   10.17   27.28   34.38
LLaVA-1.5   13B   Open        61.60   39.76   37.76   15.50   34.55   37.83   24.08   35.48   36.08   79.75   36.46   36.42   15.50   32.53   34.33
LLaVA-NeXT  7B    Open        69.70   36.96   48.52   48.50   30.32   57.18   21.83   30.40   44.37   66.00   34.20   50.44   11.83   29.43   53.50
InternVL2   8B    Open        59.90   48.72   48.60   50.75   47.74   57.82   29.92   45.06   51.14   68.25   49.51   49.67   30.00   43.42   54.24
Phi-3       4.2B  Open        40.00   40.59   45.61   25.00   31.44   45.99   16.83   35.58   43.71   31.25   42.92   48.43   27.67   33.04   41.35
Phi-3.5     4.2B  Open        60.60   41.82   45.97   21.25   36.89   30.95   25.58   34.66   39.30   33.00   46.03   49.33   39.67   33.47   39.67
Oryx        7B    Open        53.20   40.43   48.95   45.00   40.68   56.06   50.50   36.71   48.55   72.50   40.01   48.33   39.67   36.98   49.87
Qwen2-VL    7B    Open        76.70   49.33   45.12   37.50   47.62   51.24   22.83   39.45   47.23   57.00   47.40   47.74   35.83   42.31   48.60
Qwen2-VL    72B   Open        59.80   51.05   48.55   45.50   50.57   57.25   52.25   45.89   48.59   58.25   50.85   47.88   44.83   46.23   50.50
DriveLM     7B    Specialist  21.20   42.86   20.04   21.25   37.49   21.92   9.00    36.68   15.56   22.25   42.05   17.07   17.50   39.56   10.37
Dolphins    7B    Specialist  54.30   30.21   31.08   3.00    30.42   29.38   9.42    26.83   26.30   9.25    29.82   28.05   21.50   28.86   27.65
Table. Robustness evaluations of VLMs under five corruption categories (Weather, External, Sensor, Motion, and Transmission), reported for three question types: multiple-choice questions (MCQ), visual question answering (VQA), and captioning (CAP). The evaluations are based on GPT scores, with detailed rubrics tailored for each task and question type.
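To reproduce this kind of robustness test in spirit, one can perturb the input frames before querying the model. The snippet below implements two illustrative corruptions (Gaussian noise and a blur stand-in for motion blur); the actual DriveBench corruption suite and its severity scaling may differ.

import numpy as np
from PIL import Image, ImageFilter

def corrupt(image: Image.Image, kind: str = "gaussian_noise", severity: int = 3) -> Image.Image:
    """Two illustrative corruptions only; not the DriveBench implementations."""
    if kind == "motion_blur":  # cheap stand-in using an isotropic Gaussian blur
        return image.filter(ImageFilter.GaussianBlur(radius=severity))
    if kind == "gaussian_noise":
        arr = np.asarray(image).astype(np.float32)
        arr += np.random.normal(0.0, 10.0 * severity, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    raise ValueError(f"unknown corruption: {kind}")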

ROUGE, BLEU, or GPT Score?

Evaluation results when using different metrics. The language metrics, such as ROUGE-L and BLEU-4, exhibit high consistency with one another, while the GPT Score demonstrates a noticeable gap compared to these language metrics. We also observe that fine-tuning benefits DriveLM significantly in regulating its response format, leading to misleadingly high performance under language metrics.
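For reference, the sketch below computes ROUGE-L and BLEU-4 with the commonly used rouge_score and nltk packages; it illustrates why a fine-tuned model that mimics the reference phrasing can score high on n-gram overlap even when its content is wrong. The exact metric configurations used in our evaluation may differ.

from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def language_metrics(reference: str, prediction: str) -> dict:
    """ROUGE-L and BLEU-4 as commonly computed; a templated answer that copies
    the reference phrasing can score high even when its content is wrong."""
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure
    bleu_4 = sentence_bleu(
        [reference.split()], prediction.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    return {"rouge_l": rouge_l, "bleu_4": bleu_4}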



  Behavior Distributions of Steering & Speed in DriveLM

We notice that the majority of vehicle behaviors are “Going Ahead”, while only a small proportion are “Turning Left” or “Turning Right”.

This leads to a data distribution imbalance when evaluating different vision-language models.
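The imbalance can be surfaced with a simple label count over the QA annotations, as sketched below; the `behavior` field name is a hypothetical placeholder for however the ground-truth action is stored.

from collections import Counter

def behavior_distribution(samples, label_key="behavior"):
    """Count ground-truth behavior labels (e.g., 'Going Ahead', 'Turning Left',
    'Turning Right') and return their relative frequencies."""
    counts = Counter(sample[label_key] for sample in samples)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}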



  GPT-4o Examples

Figure. Examples of GPT-4o responses to the four tasks and the corresponding evaluation results under the Dark condition. We observe that GPT-4o is aware of the low-light environment and can still identify the bus and pedestrian in the image, showing a certain degree of resilience.





Figure. Examples of GPT-4o responses to the four tasks and the corresponding evaluation results under the Motion Blur condition. We observe that GPT-4o is influenced by this type of corruption and tends to predict "driving fast" as a result. The example shows the potential of visual corruptions to influence high-level driving decisions.


Evaluation Types (Rubric, Question & Context)


Figure. Comparisons among Different Evaluation Types (rubric, question-aware, and context-aware). The GPT scores vary depending on the rubric, question, and physical driving context. With more information added, the results become more distinguishable.
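A minimal sketch of how the three evaluation types could differ in their grading prompts is given below: the rubric-only prompt contains just the rubric and the answers, the question-aware prompt adds the question, and the context-aware prompt additionally adds the physical driving context. The wording is illustrative, not the exact prompts used in our evaluation.

from typing import Optional

def build_eval_prompt(rubric: str, reference: str, prediction: str,
                      question: Optional[str] = None,
                      context: Optional[str] = None) -> str:
    """Assemble the grading prompt for the three evaluation types: rubric-only,
    question-aware (adds the question), and context-aware (additionally adds
    the physical driving context)."""
    parts = [rubric]
    if question is not None:
        parts.append(f"Question: {question}")
    if context is not None:
        parts.append(f"Driving context: {context}")
    parts.append(f"Reference answer: {reference}")
    parts.append(f"Model answer: {prediction}")
    return "\n\n".join(parts)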



Qualitative Comparisons

Figure. Examples of different VLM responses under the Frame Lost condition. We observe that GPT-4o responds with objects that are actually visible, while LLaVA-NeXT and DriveLM tend to hallucinate objects that cannot be seen in the provided images.





Figure. Examples of different VLM responses under the Water Splash condition. We observe that, under severe visual corruptions, VLMs respond with ambiguous and general answers based on their learned knowledge, without referring to the visual information. Most responses include traffic signals and pedestrians, even though they are not visible in the provided images.


BibTeX

@article{xie2025drivebench,
  author  = {Xie, Shaoyuan and Kong, Lingdong and Dong, Yuhao and Sima, Chonghao and Zhang, Wenwei and Chen, Qi Alfred and Liu, Ziwei and Pan, Liang},
  title   = {Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives},
  journal = {arXiv preprint arXiv:2501.04003},
  year    = {2025},
}