🤖📝 SynSlideGen : AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval

1Indian Institute of Technology, Jodhpur
2Sardar Vallabhbhai National Institute of Technology, Surat
3CVIT, IIIT Hyderabad

* Equal Contribution. Work done during internship at IIIT Hyderabad

To be presented at ICDAR 2025 (Oral)

Sept 16 - 21 in Wuhan, China.

Key Contributions
  • SynSlideGen: A modular and lightweight pipeline to generate synthetic slides with automated annotations for Slide Element Detection and Text Query-based Slide Retrieval
  • RealSlide: A benchmark dataset of 1050 real graduate-level lecture slides with annotations for multiple slide-image tasks.
  • Extensive Analysis: Analysing the effectiveness of training with synthetic slides generated using the proposed pipeline for the Slide Element Detection and Query-based Slide Retrieval tasks.

Abstract

Lecture slide element detection and retrieval, key tasks in lecture slide understanding, have gained significant attention in the multi-modal research community. However, annotating large volumes of lecture slides for supervised training is labor-intensive and domain-specific.

To address this, we propose an LLM-guided Synthetic Lecture Slide Generation pipeline, SynLecSlideGen, that produces high-quality, coherent slides closely resembling real lecture slides; we call the resulting collection the SynSlide dataset. We also create an evaluation benchmark, RealSlide, by manually annotating 1050 real slides curated from lecture presentation decks. To evaluate the effectiveness of the SynSlide dataset, we perform few-shot transfer learning on real slides using models pre-trained on our synthetically generated slides.

Experimental results show that few-shot transfer learning outperforms training only on the real dataset, especially in low-resource settings, demonstrating that synthetic slides are a valuable pre-training resource for real-world scenarios where labeled data is scarce.

SynSlideGen pipeline

Pipeline Overview : Image 1
Overview of our Synthetic Lecture Slide Generation SynLecSlideGen pipeline. In Phase I, we generate presentation content using LLMs and Web Image search agents in a multi-step process. In Phase II, we arbitrarily assign layout and style based on each slide's content. Finally, in Phase III, we generate the PPT files, convert them to images, and also generate automatic annotations for multiple downstream tasks from the final JSON file.

Generated Datasets

SynDet

SynDet is a subset of our synthetic slide dataset designed specifically for the Slide Element Detection (SED) task. It includes 2,200 slide images automatically annotated with bounding boxes in COCO format. These annotations span 16 fine-grained element categories, such as:

  • Textual: Title, Description, Enumeration, Heading
  • Structural: Equation, Table, Chart, Code
  • Visual: Diagram, Natural-Image, Logo
  • Meta: Slide Number, Footer Element, URL
  • Captions: Figure Caption, Table Caption

Annotations are extracted directly from slide structure JSONs and layout maps generated in the SynSlideGen pipeline, allowing pixel-accurate labeling without manual annotation. This provides a scalable solution for training and benchmarking object detection models on educational material.
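The JSON-to-COCO conversion described above can be sketched as follows. This is an illustrative sketch only: the layout-JSON field names (`file_name`, `elements`, `bbox`) and the helper `slide_json_to_coco` are assumptions for illustration, not the actual SynSlideGen schema.

```python
# Illustrative sketch: converting per-slide layout JSONs into COCO-style
# detection annotations. Field names and schema are assumptions, not the
# actual SynSlideGen format; the 16 categories mirror those listed above.

CATEGORIES = [
    "Title", "Description", "Enumeration", "Heading", "Equation", "Table",
    "Chart", "Code", "Diagram", "Natural-Image", "Logo", "Slide Number",
    "Footer Element", "URL", "Figure Caption", "Table Caption",
]

def slide_json_to_coco(slides):
    """Build a COCO-style dict from a list of per-slide layout dicts.

    Each slide dict is assumed to look like:
      {"file_name": "slide_001.png", "width": 1280, "height": 720,
       "elements": [{"category": "Title", "bbox": [x, y, w, h]}, ...]}
    """
    cat_ids = {name: i + 1 for i, name in enumerate(CATEGORIES)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for img_id, slide in enumerate(slides, start=1):
        coco["images"].append({
            "id": img_id,
            "file_name": slide["file_name"],
            "width": slide["width"],
            "height": slide["height"],
        })
        for el in slide["elements"]:
            x, y, w, h = el["bbox"]  # COCO boxes are [x, y, width, height]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": img_id,
                "category_id": cat_ids[el["category"]],
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return coco
```

Because the bounding boxes come from the layout maps that rendered the slides, no detection or OCR step is needed to produce them.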

Annotation Sample : Image 1

SynRet

SynRet is curated for the Text-based Slide Image Retrieval (TSIR) task. It consists of 2,200 synthetic slides where each slide is paired with two styles of slide-level textual summaries:

  • LecSD-style summaries: OCR-like textual descriptions focused on layout and key phrases.
  • LLM-generated semantic summaries: Capturing holistic slide intent, structure, and metadata (e.g., footer content, instructor name).

This dual summary format supports both robustness and variability in retrieval models. The dataset is optimized for CLIP-style vision-language embedding training and includes rare slide types (e.g., equation-heavy, multi-diagram) to improve generalization.

SynRet also enables training for compositional reasoning tasks, such as counting elements, querying slides without a title, and reasoning over layout features like position and size. With metadata like slide numbers and instructor names, SynRet supports rich, structure-aware retrieval beyond standard OCR-based search.
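As a rough sketch of how such embedding-based retrieval is scored, the function below computes Recall@K over precomputed, row-aligned text and slide embeddings (query i's ground-truth slide is row i). It is an illustrative helper, not the paper's evaluation code.

```python
import numpy as np

def recall_at_k(text_embs, slide_embs, k):
    """Fraction of text queries whose ground-truth slide (same row index)
    appears among the top-k slides ranked by cosine similarity.

    text_embs, slide_embs: float arrays of shape (n, d), row i paired.
    """
    # L2-normalize so the dot product equals cosine similarity
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    s = slide_embs / np.linalg.norm(slide_embs, axis=1, keepdims=True)
    sims = t @ s.T                        # (num_queries, num_slides)
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())
```

The same routine covers both R@1 and R@10 reported below by varying `k`.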

Annotation Sample : Image 1

RealSlide Benchmark

RealSlide is a benchmark dataset of 1,050 manually annotated slides sourced from university-level courses in Computer Science (50%), Economics (20%), Physics (15%), and Mathematics (15%). Slides are curated from publicly available university lecture decks under Creative Commons licenses. The full list of utilized presentations can be accessed here

Manual Annotation

Annotation Sample : Image 1
Pixel-perfect annotation performed manually using the open-source CVAT annotation tool

Slide Element Detection

Training with synthetic data improves mAP@0.5 across most element classes, especially for low-resource visual classes such as Code, Natural Image, Diagram, and Chart.

Effect of real slide images (RealSlide) on the performance (mAP@[0.5-0.95]) of three slide element detection models under two training strategies: (i) Single Stage and (ii) Two Stage

Element           Fine-tuning using 50 Real Images          Fine-tuning using 300 Real Images
                  YOLOv9      LayoutLMv3    DETR            YOLOv9      LayoutLMv3    DETR
                  SS    TS    SS    TS      SS    TS        SS    TS    SS    TS      SS    TS
Title             69.1  75.3  66.2  71.9    57.8  67.2      70.9  75.4  71.8  77.0    59.5  69.7
Text               6.5  17.4  11.3  13.2     7.4  13.8      15.6  21.5  13.8  18.1     9.1  15.3
Enumeration       57.5  66.8  60.7  67.8    67.1  72.4      70.9  76.7  72.0  79.9    68.6  74.9
URL                0.0   3.4   0.0   0.8     0.8   1.5       3.4   1.3   2.4   2.1     1.8   2.7
Equation           9.2  28.3   0.9   9.8    16.6  20.2      23.0  27.4  16.5  24.2    18.3  22.1
Table             57.8  60.1  35.7  43.3    48.5  50.6      82.7  56.2  59.0  63.1    56.5  51.3
Diagram           25.8  46.3  26.6  39.9    33.7  41.0      53.8  58.6  46.0  50.4    38.8  44.7
Chart             12.8  33.4   8.5  17.4     8.9  14.4      31.7  32.1  14.9  23.1    11.7  18.9
Heading            6.1   9.2   3.8  16.1    10.6  18.8      18.6  22.8  24.6  35.2    13.1  20.3
Slide Number      25.0  27.3  33.4  29.3    20.8  24.1      27.7  25.9  28.2  26.7    22.5  25.6
Footer Element    48.7  42.2  47.2  48.0    36.6  42.0      51.5  47.7  43.0  48.9    40.0  45.1
Figure caption     2.7   5.7   0.3  10.9     1.8   6.7      15.8  14.2   7.6   9.8     0.4   8.1
Table caption      0.0  11.8   0.0   2.0     1.3   6.9      19.2  21.6   0.0   2.2     2.1   8.7
Logo              48.3  46.1   3.7  28.1    18.8  26.2      67.9  69.4  26.0  42.9    22.6  34.7
Code               2.1  34.6   0.0   5.0     8.0  14.5      23.8  42.5  10.5  18.6    12.1  17.8
Natural Image      0.7  20.9   0.0  12.0     0.4  11.7      11.8  27.9  10.4  18.3     9.3  14.6
Macro avg         23.3  33.0  18.6  26.1    21.2  27.0      36.8  38.8  27.9  33.8    26.8  30.2

Element-wise mAP @ IoU [0.50:0.95] for three slide element detection models under two fine-tuning strategies: (i) Single Stage (SS) and (ii) Two Stage (TS), on the test set (750 images) of the RealSlide dataset.
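For reference, mAP @ IoU [0.50:0.95] averages precision over IoU thresholds from 0.50 to 0.95 in steps of 0.05. A minimal IoU computation for COCO-style [x, y, w, h] boxes might look like the sketch below; it is illustrative, not the evaluation code used in the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in [x, y, w, h] format
    (the COCO convention used for the detection annotations)."""
    ax1, ay1 = box_a[0], box_a[1]
    ax2, ay2 = ax1 + box_a[2], ay1 + box_a[3]
    bx1, by1 = box_b[0], box_b[1]
    bx2, by2 = bx1 + box_b[2], by1 + box_b[3]
    # Overlap extents clamp to zero when the boxes are disjoint
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```

A prediction counts as a true positive at a given threshold only if its IoU with a ground-truth box of the same class meets that threshold, which is why small classes like URL or Table caption score near zero at the stricter cutoffs.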

Slide Image Retrieval

A qualitative example

Query: A slide on Pseudo relevance feedback with a diagram and enumeration

Top-4 retrieved slides (1st, 2nd, 3rd, and 4th results)

Ground Truth slide highlighted in green

SynSlideGen-augmented training improves Recall@1 and Recall@10 for CLIP-based retrieval models. It particularly enhances performance on open-world, out-of-domain lecture slides.

Text-based Lecture Slide Retrieval using CLIP model. We show Recall@1 and Recall@10.
Fine-tuning Dataset (# Samples)   Test Dataset (# Samples)   In-domain   R@1   R@10
None (zero-shot)                  LecSD-Test (10,000)        NA          16    44
LecSD-Train (31,475)              LecSD-Test (10,000)        Yes         45    78
DreamStruct (3,183)               LecSD-Test (10,000)        No          26    59
SynRet (2,200)                    LecSD-Test (10,000)        No          26    60
RealSlide (300)                   LecSD-Test (10,000)        No          20    49
None (zero-shot)                  RealSlide (750)            NA          33    63
LecSD-Train (31,475)              RealSlide (750)            No          31    57
DreamStruct (3,183)               RealSlide (750)            No          42    67
SynRet (2,200)                    RealSlide (750)            No          43    69
RealSlide (300)                   RealSlide (750)            Yes         40    69

Table 3: SIR Retrieval Metrics (Recall@K)

BibTeX


@misc{maniyar2025aigeneratedlectureslidesimproving,
  title={AI-Generated Lecture Slides for Improving Slide Element Detection and Retrieval}, 
  author={Suyash Maniyar and Vishvesh Trivedi and Ajoy Mondal and Anand Mishra and C. V. Jawahar},
  year={2025},
  eprint={2506.23605},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.23605}, 
}