Publications in reverse chronological order. CVPR, ECCV, and ICCV are the top conferences in computer vision. ACL, EMNLP, and NAACL are the top conferences in natural language processing. NeurIPS, ICLR, and ICML are the top general machine learning conferences.
2024
MetaSumPerceiver: Multimodal Multi-Document Evidence Summarization for Fact-Checking
Ting-Chih Chen, Chia-Wei Tang, and Christopher Thomas
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024
Fact-checking real-world claims often requires reviewing multiple multimodal documents to assess a claim’s truthfulness, a highly laborious and time-consuming task. In this paper, we present a summarization model crafted to generate claim-specific summaries useful for fact-checking from multimodal, multi-document datasets. The model takes documents, images, and a claim as input, with the objective of assisting in fact-checking tasks. We introduce a dynamic perceiver-based model that can handle inputs from multiple modalities of arbitrary lengths. To train our model, we leverage a novel reinforcement learning-based entailment objective to generate summaries that provide evidence distinguishing between different truthfulness labels. To assess the efficacy of our approach, we conduct experiments on both an existing benchmark and a new dataset of multi-document claims that we contribute. Our approach outperforms the state-of-the-art approach by 4.6% on the claim verification task on the MOCHEG dataset and demonstrates strong performance on our new Multi-News-Fact-Checking dataset.
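The abstract names two components a short sketch can make concrete: a perceiver-style module that compresses multimodal inputs of arbitrary length into a fixed set of latent vectors, and an entailment-based reward of the kind used as a reinforcement learning signal. The PyTorch sketch below is a minimal illustration under those assumptions, not the authors' released code; the class names, dimensions, and reward definition are all hypothetical.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# a fixed latent array cross-attends over concatenated document and image
# tokens, and an entailment classifier's probability for the gold label
# serves as the RL reward for generated summaries.
import torch
import torch.nn as nn

class PerceiverSummarizer(nn.Module):
    """Fixed-size latent array attends over variable-length multimodal input."""
    def __init__(self, dim=512, num_latents=64, heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                nn.Linear(4 * dim, dim))

    def forward(self, text_tokens, image_tokens):
        # Inputs may have arbitrary sequence lengths; the latent array
        # compresses them to a fixed-size representation for a decoder.
        inputs = torch.cat([text_tokens, image_tokens], dim=1)   # (B, T, D)
        q = self.latents.unsqueeze(0).expand(inputs.size(0), -1, -1)
        out, _ = self.cross_attn(q, inputs, inputs)
        return out + self.ff(out)                                # (B, L, D)

def entailment_reward(entail_logits, truth_label):
    # Hypothetical reward: probability the entailment classifier assigns
    # to the claim's gold label (e.g. supported / refuted / NEI) given the
    # generated summary as evidence; higher is better for the policy.
    return torch.softmax(entail_logits, dim=-1)[:, truth_label]

# Example: 300 text tokens and 196 image-patch tokens for a batch of 2.
model = PerceiverSummarizer()
z = model(torch.randn(2, 300, 512), torch.randn(2, 196, 512))
print(z.shape)  # torch.Size([2, 64, 512])
```

Because the latent array has a fixed size, compute stays constant no matter how many documents or images a claim requires, which is what lets a perceiver-style model scale to multi-document inputs.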
Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment
Alvi Md Ishmam and Christopher Thomas
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2022
Fine-Grained Visual Entailment
Christopher Thomas, Yipeng Zhang, and Shih-Fu Chang
In Proceedings of the European Conference on Computer Vision, 2022
Emphasizing Complementary Samples for Non-Literal Cross-Modal Retrieval
Christopher Thomas and Adriana Kovashka
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022