17 cities across North America, Europe, Asia, Africa, South America, and Oceania.
CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing
† Corresponding authors
ICLR 2026
Abstract
Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize 3 evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. Together, these characteristics make CityLens the most extensive socioeconomic benchmark to date in terms of geographic coverage, indicator diversity, and model scale. Our results reveal that while LVLMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LVLMs to understand and predict urban socioeconomic patterns. The code and data are available online.
Benchmark
CityLens pairs each evaluation region with one satellite image, ten street view images, and scalar labels for 11 socioeconomic indicators across 17 globally distributed cities. The benchmark spans economy, education, crime, transport, health, and environment, with geographic units aligned to satellite areas, census tracts, or MSOAs.
6 domains: economy, education, crime, transport, health, and environment.
11 indicators: GDP, population, house price, public transport, drive ratio, mental health, and more.
17 LVLMs: open and proprietary systems are evaluated under the same task definitions.
Indicators
Eleven prediction tasks across six urban domains.
Evaluation Paradigms
CityLens evaluates direct numeric prediction, normalized score estimation, and feature-based regression so that perceptual grounding, relative ranking, and numeric calibration can be compared under a unified benchmark.
Direct Metric Prediction
The model receives urban imagery and directly outputs a specific socioeconomic value, such as the public transit ratio or average house price for a region.
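Because the model answers in free text under this paradigm, its reply has to be turned into a number before it can be compared with the ground-truth label. The following is a minimal sketch of one way to do that, not the benchmark's own parser: the function name `parse_numeric_answer` and the rule "take the first number in the reply" are assumptions made for illustration.

```python
import re
from typing import Optional

# Matches an optionally signed number with thousands separators,
# an optional decimal part, and an optional exponent.
_NUMBER = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?(?:[eE][-+]?\d+)?")


def parse_numeric_answer(reply: str) -> Optional[float]:
    """Return the first number found in a model reply, or None if there is none.

    Hypothetical helper: the extraction rule is an assumption, not the
    procedure used by CityLens.
    """
    match = _NUMBER.search(reply)
    if match is None:
        return None
    # Drop thousands separators such as the comma in "325,000".
    return float(match.group(0).replace(",", ""))


if __name__ == "__main__":
    print(parse_numeric_answer("The average house price is about 325,000."))
    print(parse_numeric_answer("I cannot tell from these images."))
```

Returning `None` rather than a default value keeps unparseable replies visible, so they can be counted separately instead of silently distorting an error metric.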
Normalized Metric Estimation
The target is converted to a 0.0 to 9.9 scale, emphasizing relative ordering and reducing the burden of absolute value calibration.
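The text above does not say how the conversion to the 0.0 to 9.9 scale is done. As an illustration only, the sketch below assumes plain min-max scaling over the regions being evaluated, rounded to one decimal place; the function name `to_normalized_scale` is hypothetical, and the actual benchmark may use a different transform.

```python
from typing import List, Sequence


def to_normalized_scale(
    values: Sequence[float], low: float = 0.0, high: float = 9.9
) -> List[float]:
    """Min-max scale raw indicator values onto [low, high], one decimal place.

    Assumption: min-max scaling is used here for illustration; the source
    does not specify the conversion.
    """
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All regions share the same value, so no ordering can be expressed.
        return [low for _ in values]
    span = v_max - v_min
    return [round(low + (v - v_min) / span * (high - low), 1) for v in values]


if __name__ == "__main__":
    # Synthetic values, not taken from the dataset.
    raw = [0.0, 25.0, 100.0]
    print(to_normalized_scale(raw))
```

Any monotonic transform of this kind preserves the ordering of regions, which is the property this paradigm emphasizes.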
Feature-Based Regression
LVLMs score structured visual features, then a regression model maps those features to socioeconomic indicators, turning the LVLM into a visual feature enhancer.
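The two-stage idea can be sketched as follows. This is a sketch under stated assumptions rather than the authors' pipeline: the source does not name the regression model, so ordinary least squares is swapped in here, the LVLM feature scores are replaced by synthetic numbers, and all function names are hypothetical.

```python
from typing import List, Sequence


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; features may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_regression(
    features: Sequence[Sequence[float]], targets: Sequence[float]
) -> List[float]:
    """Ordinary least squares with an intercept, via the normal equations."""
    x = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Apply the fitted intercept and coefficients to one feature vector."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], row))


if __name__ == "__main__":
    # Synthetic stand-ins for two LVLM-scored visual features per region.
    scores = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
    # Synthetic labels generated as 1 + 2 * f1 + 3 * f2.
    labels = [9.0, 8.0, 22.0, 18.0]
    weights = fit_linear_regression(scores, labels)
    print(predict(weights, [5.0, 5.0]))
```

Keeping the regressor separate from the LVLM means the model is only asked for perceptual judgments, while the mapping to indicator values is learned from labeled regions.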
Results
Poster
Citation
Tianhui Liu, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Jie Feng, Yong Li, and Pan Hui. CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing. International Conference on Learning Representations, 2026.
@inproceedings{liu2026citylens,
title={CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing},
author={Liu, Tianhui and Pang, Hetian and Zhang, Xin and Ouyang, Tianjian and Zhang, Zhiyuan and Feng, Jie and Li, Yong and Hui, Pan},
booktitle={International Conference on Learning Representations},
year={2026}
}