My Adobe internship work has been accepted as a conference paper at ICLR 2025: “SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding.” Huge thanks to my mentor, Ruiyi Zhang, for his invaluable support and guidance! An improved implementation is available at Self-Visual-RAG, developed after my internship with support from my labmate at UB.
SV-RAG enhances long-document understanding by adapting MLLMs for self-visual retrieval-augmented generation, optimizing both evidence retrieval and question answering with specialized LoRA adapters.
Specifically, we use the MLLM's hidden states as embedding features and train the model with contrastive learning to compute interaction scores between a query and candidate document pages for evidence retrieval, while the same MLLM, switched to its QA adapter, answers questions over the retrieved pages.
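To make the setup concrete, here is a minimal, self-contained PyTorch sketch of the idea (not the released Self-Visual-RAG code): a frozen linear layer stands in for the MLLM backbone, two named LoRA adapters specialize it for retrieval and QA, hidden states act as token embeddings, and query-page relevance is trained with an in-batch contrastive loss. The MaxSim-style scoring, the adapter names, and all shapes below are my illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen linear layer with named low-rank adapters (toy stand-in for the MLLM backbone)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        for p in self.base.parameters():
            p.requires_grad_(False)                 # backbone stays frozen
        self.adapters = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, rank, bias=False),
                                nn.Linear(rank, dim, bias=False))
            for name in ("retrieval", "qa")         # one specialized adapter per task
        })
        self.active = "retrieval"

    def set_adapter(self, name):
        self.active = name

    def forward(self, x):                           # x: (tokens, dim) hidden states
        return self.base(x) + self.adapters[self.active](x)


def interaction_score(q_hidden, d_hidden):
    """Score a query against a page by summing each query token's max similarity (MaxSim-style)."""
    q = F.normalize(q_hidden, dim=-1)
    d = F.normalize(d_hidden, dim=-1)
    return (q @ d.T).max(dim=-1).values.sum()


def contrastive_loss(encoder, query_states, page_states, temperature=0.05):
    """In-batch InfoNCE: page i is the positive for query i, all other pages are negatives."""
    encoder.set_adapter("retrieval")                # retrieval adapter produces the embeddings
    scores = torch.stack([
        torch.stack([interaction_score(encoder(q), encoder(d)) for d in page_states])
        for q in query_states
    ])                                              # (B, B) query-page score matrix
    labels = torch.arange(scores.size(0))
    return F.cross_entropy(scores / temperature, labels)


# Toy usage: hidden states for 4 queries and their 4 matching document pages.
encoder = LoRALinear(dim=768)
queries = [torch.randn(12, 768) for _ in range(4)]
pages = [torch.randn(200, 768) for _ in range(4)]
loss = contrastive_loss(encoder, queries, pages)    # gradients flow only into the retrieval adapter
loss.backward()

encoder.set_adapter("qa")                           # same frozen backbone, QA adapter for answering
answer_states = encoder(queries[0])
```

The key design point this sketch illustrates is parameter sharing: retrieval and QA reuse one frozen backbone, so only the small task-specific adapters are trained and stored.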