Abstract
Artificial intelligence (AI) opens new possibilities for processing and analysing large, heterogeneous historical data corpora in a semi-automated way. The Ottoman Nature in Travelogues (ONiT) project applies a fine-tuned Contrastive Language–Image Pre-Training (CLIP) model to retrieve images with nature representations from digitized early book prints based on embeddings of visual features rather than on textual metadata. In this article, we present the results of our work, including a curated and annotated dataset of 8,042 images of nature representations and the CLIP-based text–image exploration tool ONiT Explorer. An evaluation comparing the fine-tuned model with the zero-shot model confirms the potential of vision-language models for retrieving specific content from large image collections in the cultural heritage and digital humanities domains. Although our fine-tuned model generally retrieves more correct examples per class than the zero-shot model, our analysis also reveals some limitations that need to be addressed in future explorations.
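The embedding-based retrieval mentioned in the abstract can be illustrated with a minimal sketch. It is not the ONiT implementation: the file names, the three-dimensional vectors, and the helper functions `cosine` and `retrieve` are hypothetical stand-ins. In a CLIP-style setup, the vectors would come from the model's image and text encoders, and the images would be ranked by their similarity to the text query.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, image_vecs, top_k=2):
    """Return the ids of the top_k images most similar to the query."""
    ranked = sorted(
        image_vecs,
        key=lambda name: cosine(query_vec, image_vecs[name]),
        reverse=True,
    )
    return ranked[:top_k]

# Toy stand-ins for image embeddings (hypothetical values, not real CLIP output).
IMAGE_VECS = {
    "page_012_plant.png": [0.9, 0.1, 0.0],
    "page_047_map.png": [0.1, 0.9, 0.1],
    "page_103_animal.png": [0.7, 0.2, 0.6],
}

# Toy stand-in for the embedding of a text query such as "a plant".
QUERY_VEC = [1.0, 0.0, 0.2]

results = retrieve(QUERY_VEC, IMAGE_VECS)
print(results)
```

Because the ranking depends only on the vectors, no textual metadata about the images is needed at query time.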
| Original language | English |
|---|---|
| Article number | fqaf082 |
| Pages (from-to) | 1-18 |
| Number of pages | 19 |
| Journal | Digital Scholarship in the Humanities |
| DOIs | |
| Publication status | Published - 7 Sept 2025 |
| Event | Digital Humanities Conference 2023, Messecongress Graz convention centre, Graz, Austria (10 Jul 2023 → 14 Jul 2023) |
Research Field
- Multimodal Analytics
Keywords
- Vision-language models
- Computer vision
- Early modern prints
- Book history
- Image retrieval
- Artificial intelligence