🤖📝 SynSlideGen : AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval

1Indian Institute of Technology, Jodhpur
2Sardar Vallabhbhai National Institute of Technology, Surat
3CVIT, IIIT Hyderabad

* Equal Contribution. Work done during internship at IIIT Hyderabad

To be presented at ICDAR 2025 (Oral)

Sept 16 - 21 in Wuhan, China.

Key Contributions
  • SynSlideGen: A modular and lightweight pipeline to generate synthetic slides with automated annotations for Slide Element Detection and Text Query-based Slide Retrieval
  • RealSlide: A benchmark dataset of 1050 real graduate-level lecture slides with annotations for multiple slide-image tasks.
  • Extensive Analysis: Analysing the effectiveness of training with synthetic slides generated using the proposed pipeline for the Slide Element Detection and Query-based Slide Retrieval tasks.

Abstract

Lecture slide element detection and retrieval, key tasks in lecture slide understanding, have gained significant attention in the multi-modal research community. However, annotating large volumes of lecture slides for supervised training is labor-intensive and domain-specific.

To address this, we propose an LLM-guided Synthetic Lecture Slide Generation pipeline, SynLecSlideGen, that produces high-quality, coherent slides closely resembling real lecture slides; we call the resulting collection the SynSlide dataset. We also create an evaluation benchmark, RealSlide, by manually annotating 1050 real slides curated from lecture presentation decks. To evaluate the effectiveness of the SynSlide dataset, we perform few-shot transfer learning on real slides using models pre-trained on our synthetically generated slides.

Experimental results show that few-shot transfer learning outperforms training only on the real dataset, especially in low-resource settings, demonstrating that synthetic slides are a valuable pre-training resource for real-world scenarios where labeled data is scarce.

SynSlideGen pipeline

Pipeline Overview : Image 1
Overview of our Synthetic Lecture Slide Generation SynLecSlideGen pipeline. In Phase I, we generate presentation content using LLMs and Web Image search agents in a multi-step process. In Phase II, we arbitrarily assign layout and style based on each slide's content. Finally, in Phase III, we generate the PPT files, convert them to images, and also generate automatic annotations for multiple downstream tasks from the final JSON file.

Generated Datasets

SynDet

SynDet is a subset of our synthetic slide dataset designed specifically for the Slide Element Detection (SED) task. It includes 2,200 slide images automatically annotated with bounding boxes in COCO format. These annotations span 16 fine-grained element categories, such as:

  • Textual: Title, Description, Enumeration, Heading
  • Structural: Equation, Table, Chart, Code
  • Visual: Diagram, Natural-Image, Logo
  • Meta: Slide Number, Footer Element, URL
  • Captions: Figure Caption, Table Caption

Annotations are extracted directly from slide structure JSONs and layout maps generated in the SynSlideGen pipeline, allowing pixel-accurate labeling without manual annotation. This provides a scalable solution for training and benchmarking object detection models on educational material.
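The JSON-to-COCO conversion described above can be sketched as follows. This is an illustrative sketch only: the layout-JSON field names (`file_name`, `elements`, `bbox`) and the helper `slide_json_to_coco` are assumptions for illustration, not the actual SynSlideGen schema.

```python
# Illustrative sketch: converting per-slide layout JSONs into COCO-style
# detection annotations. Field names and schema are assumptions, not the
# actual SynSlideGen format; the 16 categories mirror those listed above.

CATEGORIES = [
    "Title", "Description", "Enumeration", "Heading", "Equation", "Table",
    "Chart", "Code", "Diagram", "Natural-Image", "Logo", "Slide Number",
    "Footer Element", "URL", "Figure Caption", "Table Caption",
]

def slide_json_to_coco(slides):
    """Build a COCO-style dict from a list of per-slide layout dicts.

    Each slide dict is assumed to look like:
      {"file_name": "slide_001.png", "width": 1280, "height": 720,
       "elements": [{"category": "Title", "bbox": [x, y, w, h]}, ...]}
    """
    cat_ids = {name: i + 1 for i, name in enumerate(CATEGORIES)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for img_id, slide in enumerate(slides, start=1):
        coco["images"].append({
            "id": img_id,
            "file_name": slide["file_name"],
            "width": slide["width"],
            "height": slide["height"],
        })
        for el in slide["elements"]:
            x, y, w, h = el["bbox"]  # COCO boxes are [x, y, width, height]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_ids[el["category"]],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```

Because the bounding boxes come from the layout maps that rendered the slides, no detection or OCR step is needed to produce them.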

Annotation Sample : Image 1

SynRet

SynRet is curated for the Text-based Slide Image Retrieval (TSIR) task. It consists of 2,200 synthetic slides where each slide is paired with two styles of slide-level textual summaries:

  • LecSD-style summaries: OCR-like textual descriptions focused on layout and key phrases.
  • LLM-generated semantic summaries: Capturing holistic slide intent, structure, and metadata (e.g., footer content, instructor name).

This dual summary format supports both robustness and variability in retrieval models. The dataset is optimized for CLIP-style vision-language embedding training and includes rare slide types (e.g., equation-heavy, multi-diagram) to improve generalization.

SynRet also enables training for compositional reasoning tasks, such as counting elements, querying slides without a title, and reasoning over layout features like position and size. With metadata like slide numbers and instructor names, SynRet supports rich, structure-aware retrieval beyond standard OCR-based search.
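As a rough sketch of how such embedding-based retrieval is scored, the function below computes Recall@K over precomputed, row-aligned text and slide embeddings (query i's ground-truth slide is row i). It is an illustrative helper, not the paper's evaluation code.

```python
import numpy as np

def recall_at_k(text_embs, slide_embs, k):
    """Fraction of text queries whose ground-truth slide (same row index)
    appears among the top-k slides ranked by cosine similarity.

    text_embs, slide_embs: float arrays of shape (n, d), row i paired.
    """
    # L2-normalize so the dot product equals cosine similarity
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    s = slide_embs / np.linalg.norm(slide_embs, axis=1, keepdims=True)
    sims = t @ s.T                        # (num_queries, num_slides)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())
```

The same routine covers both R@1 and R@10 reported below by varying `k`.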

Annotation Sample : Image 1

RealSlide Benchmark

RealSlide is a benchmark dataset of 1,050 manually annotated slides sourced from university-level courses in Computer Science (50%), Economics (20%), Physics (15%), and Mathematics (15%). Slides are curated from publicly available university lecture decks under Creative Commons licenses. The full list of utilized presentations can be accessed here

Manual Annotation

Annotation Sample : Image 1
Pixel-perfect annotation performed manually using the open-source CVAT annotation tool

Slide Element Detection

Training with synthetic data improves mAP@0.5 across most element classes, especially for low-resource visual classes such as Code, Natural Image, Diagram, and Chart.

Effect of real slide images (RealSlide) on the performance (mAP@[0.5-0.95]) of three slide element detection models under two training strategies: (i) Single Stage and (ii) Two Stage

Element           Fine-tuning using 50 Real Images          Fine-tuning using 300 Real Images
                  YOLOv9      LayoutLMv3    DETR            YOLOv9      LayoutLMv3    DETR
                  SS    TS    SS    TS      SS    TS        SS    TS    SS    TS      SS    TS
Title             69.1  75.3  66.2  71.9    57.8  67.2      70.9  75.4  71.8  77.0    59.5  69.7
Text               6.5  17.4  11.3  13.2     7.4  13.8      15.6  21.5  13.8  18.1     9.1  15.3
Enumeration       57.5  66.8  60.7  67.8    67.1  72.4      70.9  76.7  72.0  79.9    68.6  74.9
URL                0.0   3.4   0.0   0.8     0.8   1.5       3.4   1.3   2.4   2.1     1.8   2.7
Equation           9.2  28.3   0.9   9.8    16.6  20.2      23.0  27.4  16.5  24.2    18.3  22.1
Table             57.8  60.1  35.7  43.3    48.5  50.6      82.7  56.2  59.0  63.1    56.5  51.3
Diagram           25.8  46.3  26.6  39.9    33.7  41.0      53.8  58.6  46.0  50.4    38.8  44.7
Chart             12.8  33.4   8.5  17.4     8.9  14.4      31.7  32.1  14.9  23.1    11.7  18.9
Heading            6.1   9.2   3.8  16.1    10.6  18.8      18.6  22.8  24.6  35.2    13.1  20.3
Slide Number      25.0  27.3  33.4  29.3    20.8  24.1      27.7  25.9  28.2  26.7    22.5  25.6
Footer Element    48.7  42.2  47.2  48.0    36.6  42.0      51.5  47.7  43.0  48.9    40.0  45.1
Figure caption     2.7   5.7   0.3  10.9     1.8   6.7      15.8  14.2   7.6   9.8     0.4   8.1
Table caption      0.0  11.8   0.0   2.0     1.3   6.9      19.2  21.6   0.0   2.2     2.1   8.7
Logo              48.3  46.1   3.7  28.1    18.8  26.2      67.9  69.4  26.0  42.9    22.6  34.7
Code               2.1  34.6   0.0   5.0     8.0  14.5      23.8  42.5  10.5  18.6    12.1  17.8
Natural Image      0.7  20.9   0.0  12.0     0.4  11.7      11.8  27.9  10.4  18.3     9.3  14.6
Macro avg         23.3  33.0  18.6  26.1    21.2  27.0      36.8  38.8  27.9  33.8    26.8  30.2

Element-wise mAP @ IoU [0.50:0.95] for three slide element detection models under two fine-tuning strategies: (i) Single Stage (SS) and (ii) Two Stage (TS), on the test set (750 images) of the RealSlide dataset.
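For reference, mAP @ IoU [0.50:0.95] averages precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05. A minimal IoU computation for COCO-style [x, y, w, h] boxes might look like the sketch below; it is illustrative, not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in [x, y, w, h] format
    (the COCO convention used for the detection annotations)."""
    ax1, ay1 = box_a[0], box_a[1]
    ax2, ay2 = ax1 + box_a[2], ay1 + box_a[3]
    bx1, by1 = box_b[0], box_b[1]
    bx2, by2 = bx1 + box_b[2], by1 + box_b[3]
    # Overlap extents clamp to zero when the boxes are disjoint
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class meets that threshold, which is why small classes like URL or Table caption score near zero at the stricter cutoffs.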

Slide Image Retrieval

A qualitative example

Query: A slide on Pseudo relevance feedback with a diagram and enumeration

Top-4 retrieved slides (1st, 2nd, 3rd, and 4th results)

Ground Truth slide highlighted in green

SynSlideGen-augmented training improves Recall@1 and Recall@10 for CLIP-based retrieval models. It particularly enhances performance on open-world, out-of-domain lecture slides.

Text-based Lecture Slide Retrieval using CLIP model. We show Recall@1 and Recall@10.
Fine-tuning Dataset (# Samples)   Test Dataset (# Samples)   In-domain   R@1   R@10
None (zero-shot)                  LecSD-Test (10,000)        NA          16    44
LecSD-Train (31,475)              LecSD-Test (10,000)        Yes         45    78
DreamStruct (3,183)               LecSD-Test (10,000)        No          26    59
SynRet (2,200)                    LecSD-Test (10,000)        No          26    60
RealSlide (300)                   LecSD-Test (10,000)        No          20    49
None (zero-shot)                  RealSlide (750)            NA          33    63
LecSD-Train (31,475)              RealSlide (750)            No          31    57
DreamStruct (3,183)               RealSlide (750)            No          42    67
SynRet (2,200)                    RealSlide (750)            No          43    69
RealSlide (300)                   RealSlide (750)            Yes         40    69

Table 3: SIR Retrieval Metrics (Recall@K)

BibTeX


@misc{maniyar2025aigeneratedlectureslidesimproving,
  title={AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval}, 
  author={Suyash Maniyar and Vishvesh Trivedi and Ajoy Mondal and Anand Mishra and C. V. Jawahar},
  year={2025},
  eprint={2506.23605},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.23605}, 
}