CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing

Tianhui Liu1, Hetian Pang2, Xin Zhang2, Tianjian Ouyang2, Zhiyuan Zhang3, Jie Feng2,4†, Yong Li2†, and Pan Hui1†
1 Information Hub, The Hong Kong University of Science and Technology (Guangzhou)
2 Department of Electronic Engineering, BNRist, Tsinghua University
3 School of Electronic and Information Engineering, Beijing Jiaotong University
4 Zhongguancun Academy

† Corresponding authors

ICLR 2026

Understanding urban socioeconomic conditions through visual data is a challenging yet essential task for sustainable urban development and policy planning. In this work, we introduce CityLens, a comprehensive benchmark designed to evaluate the capabilities of Large Vision-Language Models in predicting socioeconomic indicators from satellite and street view imagery. We construct a multi-modal dataset covering a total of 17 globally distributed cities, spanning 6 key domains: economy, education, crime, transport, health, and environment, reflecting the multifaceted nature of urban life. Based on this dataset, we define 11 prediction tasks and utilize 3 evaluation paradigms: Direct Metric Prediction, Normalized Metric Estimation, and Feature-Based Regression. We benchmark 17 state-of-the-art LVLMs across these tasks. Together, these characteristics make CityLens the most extensive socioeconomic benchmark to date in terms of geographic coverage, indicator diversity, and model scale. Our results reveal that while LVLMs demonstrate promising perceptual and reasoning capabilities, they still exhibit limitations in predicting urban socioeconomic indicators. CityLens provides a unified framework for diagnosing these limitations and guiding future efforts in using LVLMs to understand and predict urban socioeconomic patterns. The code and data are available online.

CityLens pairs each evaluation region with one satellite image, ten street view images, and scalar labels for 11 socioeconomic indicators across 17 globally distributed cities. The benchmark spans economy, education, crime, transport, health, and environment, with geographic units aligned to satellite areas, census tracts, or MSOAs.
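As a rough illustration of the per-region pairing described above, the sketch below models one evaluation region as a small Python record. All class and field names here are hypothetical and are not taken from the released code; only the one-satellite-image, ten-street-view-images, scalar-labels layout comes from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# From the description above: ten street view images per region.
STREET_VIEWS_PER_REGION = 10


@dataclass
class RegionSample:
    """One evaluation region. Field names are illustrative assumptions."""
    region_id: str
    city: str
    satellite_image: str            # path to the single satellite image
    street_view_images: List[str]   # paths to the street view images
    # Indicator name -> scalar label, for whichever indicators the region has.
    labels: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Reject samples that do not carry exactly ten street view images.
        if len(self.street_view_images) != STREET_VIEWS_PER_REGION:
            raise ValueError(
                f"expected {STREET_VIEWS_PER_REGION} street view images, "
                f"got {len(self.street_view_images)}"
            )
```

A record like this keeps the imagery and the labels for a region together, so the same sample can be passed to any of the evaluation methods.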

Coverage: 17 global cities

Cities span North America, Europe, Asia, Africa, South America, and Oceania.

Signals: 6 domains

Economy, education, crime, transport, health, and environment.

Tasks: 11 indicators

GDP, population, house price, public transport, drive ratio, mental health, and more.

Models: 17 LVLMs

Open and proprietary systems are evaluated under the same task definitions.

CityLens framework: region-level satellite and street view imagery, socioeconomic prediction tasks across domains, cities, models, and evaluation methods, and benchmark scale.

Eleven prediction tasks across six urban domains.

GDP, Population, House Price, Public Transport, Drive Ratio, Mental Health, Accessibility to Healthcare, Life Expectancy, Bachelor Ratio, Violent Crime, Building Height

CityLens evaluates direct numeric prediction, normalized score estimation, and feature-based regression so that perceptual grounding, relative ranking, and numeric calibration can be compared under a unified benchmark.

Benchmark construction pipeline: data collection, indicator selection, spatial mapping, and evaluation methods.
01 Direct Metric Prediction

The model receives urban imagery and directly outputs a specific socioeconomic value, such as the public transit ratio or average house price for a region.
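A minimal sketch of how such a direct query might be posed and its reply read back as a number. The prompt wording, the function names, and the first-number parsing heuristic are all assumptions made for illustration; they are not the benchmark's actual prompts or parser.

```python
import re
from typing import Optional


def build_direct_prompt(indicator: str, unit: str) -> str:
    """Compose a hypothetical text prompt to accompany the region's images."""
    return (
        "Based on the satellite and street view images of this region, "
        f"estimate its {indicator} in {unit}. Answer with a single number."
    )


# Matches an optionally signed number with optional thousands separators
# and an optional decimal part, e.g. "452,000" or "0.31".
_NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def parse_numeric_answer(reply: str) -> Optional[float]:
    """Return the first number found in a free-text reply, or None."""
    match = _NUMBER.search(reply)
    if match is None:
        return None
    return float(match.group().replace(",", ""))
```

Taking the first number is a deliberately simple heuristic: a reply that mentions another number before the estimate would be misread, so a stricter output format may be preferable in practice.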

02 Normalized Metric Estimation

The target is converted to a 0.0 to 9.9 scale, emphasizing relative ordering and reducing the burden of absolute value calibration.
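The exact conversion is not spelled out here, so the sketch below assumes a plain min-max rescaling onto [0.0, 9.9] with one decimal place. The function name and the choice of min-max scaling are assumptions; the benchmark may use a different transform (for example, a log or rank-based one).

```python
from typing import List, Sequence


def to_normalized_scale(values: Sequence[float], top: float = 9.9) -> List[float]:
    """Linearly rescale raw indicator values onto [0.0, top] (assumed min-max)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All regions share one value, so there is no ordering to express.
        return [0.0 for _ in values]
    return [round((v - lo) / (hi - lo) * top, 1) for v in values]
```

Under this assumption the smallest raw value maps to 0.0 and the largest to 9.9, and the ordering of regions is preserved up to ties introduced by rounding.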

03 Feature-Based Regression

LVLMs score structured visual features, then a regression model maps those features to socioeconomic indicators, turning the LVLM into a visual feature enhancer.
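The second stage of this pipeline can be sketched as an ordinary regression over LVLM-assigned feature scores. The example below assumes a linear least-squares model solved in pure Python; the regressor used in the benchmark is not specified here, and the function names and the tiny `ridge` stabilizer are illustrative choices.

```python
from typing import List, Sequence


def fit_linear_regression(
    features: Sequence[Sequence[float]],
    targets: Sequence[float],
    ridge: float = 1e-8,
) -> List[float]:
    """Least-squares fit; returns [intercept, w_1, ..., w_k]."""
    rows = [[1.0, *row] for row in features]  # prepend a bias column
    k = len(rows[0])
    # Normal equations: (X^T X + ridge * I) w = X^T y.
    # The small ridge term (also applied to the intercept) only guards
    # against a singular matrix.
    a = [
        [sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
         for j in range(k)]
        for i in range(k)
    ]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(a[i][j] * w[j] for j in range(i + 1, k))
        w[i] = (b[i] - tail) / a[i][i]
    return w


def predict(weights: Sequence[float], row: Sequence[float]) -> float:
    """Apply the fitted intercept and weights to one region's feature scores."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))
```

Here each row of `features` would hold one region's LVLM scores for a fixed set of visual attributes (hypothetically, greenery or building density), and `targets` the matching indicator values.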


Tianhui Liu, Hetian Pang, Xin Zhang, Tianjian Ouyang, Zhiyuan Zhang, Jie Feng, Yong Li, and Pan Hui. CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing. International Conference on Learning Representations, 2026.

@inproceedings{liu2026citylens,
  title={CityLens: Evaluating Large Vision-Language Models for Urban Socioeconomic Sensing},
  author={Liu, Tianhui and Pang, Hetian and Zhang, Xin and Ouyang, Tianjian and Zhang, Zhiyuan and Feng, Jie and Li, Yong and Hui, Pan},
  booktitle={International Conference on Learning Representations},
  year={2026}
}