CS 6804: Multimodal Vision

Spring 2023

Department of Computer Science, Virginia Tech

Location: 318 Randolph Hall, 460 Old Turner St. Blacksburg, VA.
Meeting time: Mondays and Wednesdays, 4:00 PM - 5:15 PM
Instructor: Chris Thomas
E-mail: Please message me through the messaging system on Canvas rather than sending e-mail.
Office: 3120C Torgersen Hall
Office hours: 12:00 PM - 2:00 PM Tuesdays. My Zoom is linked here.
Exam section: 16M. Tentatively May 8, 2023, 7:45AM - 9:45AM. While there are no exams in this course, we may use this time for final presentations if necessary.

Course overview

Course description: Humans are able to reason about how concepts read in text, heard in audio, and seen in visual content (different modalities of data) relate to one another by drawing on a learned multimodal understanding of the world. For example, reading a textual description might allow one to recognize a bird they have never seen before by drawing on a background understanding of what the color blue looks like and what a high-pitched bird call sounds like. Thus, building artificially intelligent systems capable of robust multimodal reasoning is an area of intense research interest. This graduate-level seminar course will introduce students to the latest research in multimodal computer vision, with a significant emphasis on vision and language. The course will feature foundational lectures, in-depth student-led presentations on state-of-the-art research, and classroom discussions. Students will complete an intensive, semester-long research project and report.

Prerequisites: A background in deep learning is strongly recommended. Prior machine learning coursework would also be highly beneficial. Concerned students should speak to the instructor as soon as possible to ensure they have sufficient background to successfully complete this course. You aren’t expected to be familiar with every technique in every paper covered in this course, but you should be able to understand the main ideas of each paper discussed and propose (and possibly implement for your class project) extensions or improvements to these papers. If you are uncomfortable with your ability to do this, speak to the instructor as soon as possible.

Format: This is a seminar-style graduate course covering recent multimodal computer vision research. The majority of class time will be spent listening to paper presentations given by students, followed by group discussions. Before each class, you will read and write a review of the main paper to be presented. You will also work in a group on a final project, which will be presented to the class at the end of the semester.

Course learning objectives:

  • Learning about state-of-the-art methods in multimodal computer vision
  • Learning to think critically about research, which applies beyond this class or subject area. This involves developing the ability to critically assess research papers you encounter and to understand how different works are connected.
  • Learning how to conceive novel ideas and extensions of existing research methods and implement your ideas
  • Learning how to clearly write up and present research
  • Learning how to work and collaborate with others in a research group

Topical outline: Example topics to be covered in this course include representation learning, fusion, pretraining, privileged modalities, prompt learning, cross-modal retrieval, model architectures (e.g. transformers, two-stream, convolutional), attention mechanisms, zero-shot and few-shot recognition, knowledge representation, generative models, and embodied vision. Students will be asked to vote on these topics (and to possibly contribute new ones that interest them) on the first day of the course. The schedule for the remainder of the course will be determined to ensure adequate topic coverage and to reflect student interest.

Requirements

Your final grade in the course will be based on your in-class participation and discussion, paper reviews, paper presentation, and final project, which will be weighted as follows:

Component Weight
Final project 45%
In-class participation and discussion 15%
Paper presentation 20%
Paper reviews 20%

Attendance: Because this is a seminar-style class, attendance is required. Missing more than two classes without a valid excuse will negatively affect your final grade. Attendance may be taken at random.

Participation: You are expected to participate regularly and meaningfully in this seminar. Participation includes asking interesting questions or offering comments on a paper, as well as answering and responding to comments from other class members. Questions asked in class about vision topics are assumed to be addressed to the class, rather than to the instructor. Note that thoroughly reading the papers ahead of class and doing a good job on your paper review will help prepare you to participate. Each presenter has spent considerable time studying the paper and preparing their presentation, so you should do your part to engage.

Paper reviews

The purpose of paper reviews is to prepare students to critically assess and review multimodal computer vision research. Students are required to write one paper review for each class presentation (the first presentation is January 25th). Students who are presenting that class session are not required to submit a paper review for that class. The paper review must cover the primary paper presented that class, though students may refer to other papers in their review to compare and contrast with it. Paper reviews should be 1-2 pages, single-spaced, using 11-point Times New Roman and 1-inch margins. Paper reviews should follow the guidelines below (modified from the CVPR Reviewer Guidelines).

Tips for writing good reviews

  • In your review, look for what is good or stimulating in the paper and what advancement in knowledge it makes. Your review should highlight both the novelty and potential impact of the work. Above all, be specific and detailed in your reviews.
  • Take the time to write good reviews. Ideally, you should read a paper and then think about it over the course of several days before you write your review.
  • You should take enough time to write a thoughtful and detailed review. Bullet lists with one short sentence per bullet are NOT a detailed review.
  • Be specific about novelty. Claims in a review that the submitted work “has been done before” MUST be backed up with specific references and an explanation of how closely they are related. At the same time, for a positive review, be sure to summarize what novel aspects are most interesting in the Strengths section.
  • It is best to avoid the term “the authors” in your review because you are reviewing their work and not the person. Instead, try to refer to “the paper” or “the method”. Referring to the authors can be perceived as being confrontational when you write real reviews, even though you may not mean it this way.
  • Be generous about suggesting new ideas for how the paper could be improved. You might suggest a new technical tool that could help, a dataset that could be tried, an application area that might benefit from the method, or a way to generalize the idea to increase its impact.

Contents of paper reviews

Each paper review should consist of the following parts and should clearly indicate which is being addressed at that point in the review (e.g. Summary, Relation to prior work, Strengths, Weaknesses, Future work).

  1. Summary: The paper review should first summarize the entire paper. This means first explaining what the paper is trying to do and how the paper proposes to do it. Your summary should focus on the primary novelty and contributions of the paper, rather than unimportant details. Your summary will typically include a summary of a new model architecture or loss function, but might also describe key mathematical insights that undergird the paper. If the primary contribution of the paper is a dataset, your summary should describe what makes the dataset significant. Your summary should also cover how the method is experimentally evaluated and any significant findings or results, whether quantitative or qualitative.
  2. Relation to prior work: The review should next summarize the paper’s relation to prior work and why its contributions are (or are not) significant in your opinion. Reading the paper’s “Related Work” section and understanding how the paper differs from prior work will help you write this section.
  3. Strengths: The review should mention at least three strengths of the approach. For example, you might explain how a particular technique or design is expected to solve problems with existing work. Simply rephrasing the strengths of the paper from its contribution section does not adequately address this point. Instead, rely on your own impression of the work and your own judgments about its novelty.
  4. Weaknesses: What do you feel detracts from the paper’s contributions? Your review should mention at least three weaknesses. Example weaknesses include cases where the method is likely to perform poorly because of its design, computational costs, non-standard inference or training requirements, shortcomings in the proposed loss function or formulation, or a weak experimental evaluation.
  5. Future work: Propose at least one possible extension of the paper. This might be a fix to a weakness you identified (e.g. a modified model or loss function) or you might propose how the techniques developed in the paper could be applied in some novel way for a different task. You should not, however, simply rephrase or repeat the future work suggested by the paper itself. Instead, think critically about how you might extend the paper as a researcher.

Grading policy for paper reviews

Paper reviews will be submitted through Canvas in PDF format and are due at 10:00 PM the day before the class in which the paper will be presented. The submission time reported by Canvas will be used to determine the time of submission; a submission that is one minute late is still late. Timely reviews start at 10 points. You will lose points for false statements; vague or irrelevant claims that do not indicate a deeper understanding of the paper; reviews that simply rephrase the paper’s own summary, claimed strengths, or weaknesses without providing your own insights (we want to hear what you think - not what the authors say!); or reviews that otherwise fail to address the above criteria. If your review is submitted after the deadline but by 12:00 PM on the day of the paper presentation, it will start at half credit. After 12:00 PM on the day of the class, you will not receive credit for your review. You will get three late days, meaning you can submit three reviews late (i.e. by 12:00 PM on the day of the class) without penalty. These late days apply to paper reviews only.

Paper presentations

Each student will give 1-2 presentations throughout the course (depending on enrollment). Each class will be focused on a particular topic of interest and will contain one primary paper and possibly several background papers. Students will express their topic preferences after the first class via Canvas, though due to the class size, there is no guarantee students will be matched to their desired topic.

Guidelines for paper presentations

Students should thoroughly read the assigned papers and other relevant background papers. This means fully understanding key equations, model design choices, etc. At a minimum, your presentation should:

  1. Clearly define what problem the paper is addressing.
  2. Provide motivation for why the problem is important, interesting, and/or challenging.
  3. Address prior related work that has attempted to address this problem (or a related problem).
  4. Describe, in detail, the proposed approach for the problem. For example, this may involve describing details of the model design and key loss functions used to train it. You should understand all equations that you present in class.
  5. Explain how the paper is evaluated. You should fully describe the experimental set-up and present any quantitative and qualitative results. If there are any unusual metrics that students may not know, you should explain what those are and how they are computed.
  6. Discuss key strengths and weaknesses of the paper.
  7. Propose ideas for future work and identify any open research questions.

The non-primary paper(s) often provide background context to the primary paper. For example, the primary paper may build upon methods or results developed in these papers. In this case, you should clearly present this background work in such a way that students will be able to understand how the primary paper builds upon that prior work.

Presentations should be 45 minutes long. Good paper presentations are the result of extensive preparation and practice. You should practice your presentation many times before presenting it to the class to ensure you know what to say and to time yourself. Your presentation should run no more than 2-3 minutes short and certainly no longer than 45 minutes. In sum, your presentation is expected to be highly polished: you end on time, your presentation is well organized, and you explain the paper(s) clearly. After your paper presentation, you will moderate a ~20-25 minute discussion session. The presenter is responsible for preparing possible topics for discussion and driving the discussion. The discussion could involve potential weaknesses in the approach, ideas for future work, thoughts on terminology used by the authors, the relationship of the paper to other literature, claims the paper made that weren’t adequately justified, choices you didn’t understand, etc.

You are strongly encouraged to use illustrations and graphics to explain concepts. Using animations, images, and videos (if applicable) is highly encouraged to make your presentation more engaging. Avoid slides that are just walls of text; instead, use short bullets (which you animate) and explain the rest verbally. You are also encouraged to animate equations and explain them piece by piece, which can help students better understand complex concepts. You are highly encouraged to search online for relevant materials to use in your presentation, but make sure to clearly cite all your sources. You are free to use slides made by others, but be aware that your presentation requirements are different from those of the authors or others who may be describing the paper. Unlike authors presenting their work at a conference, your presentation should view the paper critically: identify weaknesses and think about the significance of the work in its broader context. Finally, make sure to use your own words on slides and during your presentation. You should not memorize someone else’s words and present them as your own, nor copy text verbatim from the paper (or elsewhere) onto slides.

Paper presentation slides should be uploaded to Canvas (pptx or pdf) by 10:00 PM on the day before the intended class presentation.

Grading policy for paper presentations

Your grade for your paper presentation will be based on: 1) clarity and presentation quality; 2) whether you covered the key points of the paper; 3) the correctness of statements made during the presentation; 4) whether you addressed all the guidelines above; 5) peer reviews by others in the class; 6) how well you facilitated the discussion of the paper; and 7) how well you delivered your presentation (i.e. was it clearly practiced, did it meet time constraints, etc.).

Final project

This course will conclude with a student-driven group project, with a report due at the end of the course. In order to make meaningful progress on a project, groups must be 3-4 students; note that larger groups will have higher expectations. Given that a significant portion of your final grade depends on the final project, each student is expected to contribute significantly to it. All projects must involve implementation of a multimodal computer vision system or algorithm along with a thorough evaluation. The goal is for your final project report to resemble a conference paper like those you have read throughout the class. Ideally, your group project will become a subsequent conference publication. The topic of your final project is open-ended and groups are free to choose any topic. However, final projects should fall into at least one of the following broad categories:

  • Extend one of the papers we covered in class in a significant way, complete with a thorough experimental evaluation;
  • Propose a novel method or approach for solving a multimodal vision problem we discussed in class or that is already known in the literature and thoroughly evaluate it;
  • Propose a completely new multimodal vision problem and explain why it is significant and needs solving, implement an approach to solve the problem, and evaluate the approach.

In summary, your final project can address any multimodal vision problem, either existing or new, as long as you propose a new method or a significant extension or modification of existing methods. Applying existing methods or techniques to new datasets or problems is not sufficient for the final project. All projects must be thoroughly evaluated experimentally. This may involve benchmarking existing relevant work in the case of a new problem, or applying your method to standard benchmarks and computing standard metrics. Projects that overlap in some way with your existing research are OK, but the design, conception, implementation, evaluation, and delivery of the project should be the work of the students, not other faculty members, and should be specific to this course. However, your project can build upon or extend your prior research efforts without an issue.

Project proposal

The project proposal should be 3-5 pages long (excluding references) and must use the CVPR LaTeX template. Your project proposal should include the following:

  • A clear problem statement which describes the goal of the project.
  • A thorough literature review. Make sure you thoroughly search the literature before you start writing; you might find that your idea has already been explored. The literature review should resemble that of a CVPR conference paper and should cite existing work. It should clearly show how the proposed project does something the prior work you cite does not.
  • A detailed description of the proposed approach. You should describe any new loss functions you plan to use, changes to existing models, etc.
  • The proposed experimental evaluation protocol and expected results. You should describe what experiments you plan to run and how you will run them, which datasets you plan to evaluate on, any existing code bases you will use, and what needs to be implemented by the group. You should also describe what your group is aiming for with the project (i.e. what you would consider a success), what you hope each experiment will show, and any uncertainty you have about the project. If you already have preliminary results, feel free to include them.

Project status report

The purpose of the project status report is to update me on your progress while also moving your project proposal document closer to the final report. The project status report should be about 3-5 pages (excluding references) and should describe the group’s progress on the project and any unforeseen blockers or challenges you are facing. The project status report should also use the CVPR LaTeX template and should include Introduction, Related Work, Approach, and Results sections following the standard CVPR paper layout. You are free to reuse text from the proposal in the project status report (e.g. the literature review). Please include any preliminary results, even if they aren’t good.

Project presentation

Each group will present their project to the class during the final several sessions of the course. The final project presentation should address the points listed in the paper presentation guidelines above, but should be more descriptive of your project (since other students have not read your paper). The same guidelines apply: the presentation should be engaging, clear, and well-rehearsed. You don’t need to be as critical of your project as you were in the paper presentation, but you should point out its strengths and weaknesses and think critically about how it could be improved. The exact length of the presentations will depend on the number of groups and class enrollment and will be determined after the add/drop deadline; tentatively, presentations should be 45 minutes long. Each member of the group should present the part or parts of the presentation for which they are individually responsible. As with your paper presentation, you should carefully rehearse both individually and as a group to ensure the presentation flows well and is on time. Your presentation should cover the same points as the paper presentation, but should also discuss the thinking behind design choices, motivations, etc. You should thoroughly present related work and your method, since class members will not be familiar with the background of your project. The grading criteria for the project presentation are the same as for the paper presentation, though your project presentation should be especially polished. Since you have now received one round of feedback on presentation skills, you should use it to improve your presentation style; mistakes from your prior presentation that are repeated in your project presentation will be graded more harshly. Please carefully review the peer reviews and grading notes from your paper presentation. Following your presentation, we will discuss your group project and provide feedback you can incorporate into your final report.

Final report

The project final report should resemble a CVPR conference paper and should be eight pages (excluding references). This means having a polished concept figure, method figure(s), tables with results, qualitative results, etc. To be clear, your final report should match the presentation quality of the other conference papers you have read in this class (even though your results may not be as compelling given a one-semester class project). Your final report must include an Abstract as well as Introduction, Related Work, Approach, Results, and Conclusion sections. The final report should be self-contained and described in sufficient detail that someone else in this class could implement your approach. At the end of your final report (this does not count towards the eight pages), each student in the group should document everything they contributed to the project and how work was divided among group members.
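For reference, a minimal LaTeX skeleton with the required sections might look like the sketch below. This is only an illustrative sketch: the exact class options, package, and bibliography style names depend on the version of the CVPR author kit you download, so defer to the example file that ships with the template.

    % Illustrative skeleton only; assumes cvpr.sty from the official CVPR
    % author kit is in the same directory. Section names follow the report
    % requirements described above.
    \documentclass[10pt,twocolumn,letterpaper]{article}
    \usepackage{cvpr}
    \usepackage{graphicx}

    \title{Your Project Title}
    \author{Group member names}

    \begin{document}
    \maketitle

    \begin{abstract}
    One-paragraph summary of the problem, approach, and key findings.
    \end{abstract}

    \section{Introduction}
    \section{Related Work}
    \section{Approach}
    \section{Results}
    \section{Conclusion}

    % Per-student contribution statements go here (not counted toward the
    % eight pages).

    {\small
    \bibliographystyle{ieee_fullname} % or whichever style your kit version ships
    \bibliography{references}
    }
    \end{document}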

Final project grading criteria

Your final project grade will primarily be based on the thought process and effort put into your project, as demonstrated through your presentation and final report. While you should work to get the best results possible, it is understood that, given the limited time available, your method might not outperform other state-of-the-art approaches. The best projects are those with new, clever ideas, not necessarily those that perform best on a given benchmark. Your project report (and presentation) will be evaluated on the following factors: 1) how well you related it to prior research; 2) the clarity and format of the presentation and report; and 3) completeness. For completeness, you will be evaluated on whether you complied with all the requirements of the project (e.g. does your report have all the required sections) and the degree to which you put in the thought and effort required to deliver an interesting and compelling class project. The report will also be evaluated on the degree to which all experiments necessary to evaluate the method have been performed (e.g. main results, ablations, qualitative results).

Project deliverables

All project deliverables will be uploaded on Canvas.

  • Project proposal (5% of final grade) - due March 3rd, 10:00 PM
  • Project status report (5% of final grade) - due April 7, 10:00 PM
  • Project presentations (15% of final grade) - April 19th through May 3rd, due 10:00 PM the day before your presentation
  • Project final report (20% of final grade) - due 9:45 AM, May 8

A note on submission: Untimely submissions will be significantly penalized at the discretion of the instructor. You must submit all components of the final project on time. It is your responsibility to make sure all submissions in this class are complete. Once you submit, please download the file again and verify it opens successfully. Corrupted files will receive no credit. In the event of an outage on Canvas that affects submission, you may e-mail the instructor your files as a fallback.

Additional information

Academic accommodations

Virginia Tech welcomes students with disabilities into the University’s educational programs. The University promotes efforts to provide equal access and a culture of inclusion without altering the essential elements of coursework. If you anticipate or experience academic barriers that may be due to disability, including but not limited to ADHD, chronic or temporary medical conditions, deaf or hard of hearing, learning disability, mental health, or vision impairment, please contact the Services for Students with Disabilities (SSD) office (540-231-3788, ssd@vt.edu, or visit https://ssd.vt.edu). If you have an SSD accommodation letter, please meet with me privately during office hours as early in the semester as possible to deliver your letter and discuss your accommodations. You must give me reasonable notice to implement your accommodations, which is generally 5 business days and 10 business days for final exams.

Academic integrity

The tenets of the Virginia Tech Graduate Honor Code will be strictly enforced in this course, and all assignments shall be subject to the stipulations of the Graduate Honor Code. For more information on the Graduate Honor Code, please refer to the GHS Constitution. Specifically, you are encouraged to discuss the content covered in this course with others. However, you are responsible for completing your paper reviews on your own: you should write them yourself and should not share them with others. You are allowed to use code and materials from other papers and sites, but you must cite your sources and clearly describe your contributions. If you have any questions as to whether something runs afoul of this policy, please contact the instructor before using the resource or submitting the assignment.

Emergencies and medical conditions

If you have an emergency or medical condition, you must inform the instructor before the deadline of the assignment. You may be required to submit documentation of the emergency or condition to the Dean of Students Office.

Tentative schedule

Date Topic Papers Presenter(s) Due
01/18 Introduction
[slides]
Chris
Topics due 1/19, 10:00 PM
01/23 Computer vision fundamentals
[slides]
Chris
01/25 Two-stream architectures
Primary: Agrawal, A., Lu, J., Antol, S., Mitchell, M., Zitnick, C. L., Parikh, D., & Batra, D. (2017). VQA: Visual Question Answering. International Journal of Computer Vision, 123(1). [paper] Read pages 1-12 (no need to read the appendix).
Cedric
[slides]
01/30 Two-stream architectures:
Part II

Primary: Lu, J., Yang, J., Batra, D., & Parikh, D. (2018). Neural baby talk. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7219-7228). [paper]

Secondary: Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28. [paper]
Himanshu
[slides]
02/01 Multi-stream architectures
Primary: Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., Weber, I., & Torralba, A. (2017). Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3020-3028). [paper]

Secondary: Castrejon, L., Aytar, Y., Vondrick, C., Pirsiavash, H., & Torralba, A. (2016). Learning aligned cross-modal representations from weakly aligned data. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2940-2949). [paper]
Chris
[slides]
02/06 Cross-modal retrieval
Primary: Thomas, C., & Kovashka, A. (2020, August). Preserving semantic neighborhoods for robust cross-modal retrieval. In European Conference on Computer Vision (pp. 317-335). Springer, Cham. [paper]

Secondary: Song, Y., & Soleymani, M. (2019). Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. [paper]
Chris
[website]
02/08 Attention mechanisms
Primary: Nguyen, D. K., & Okatani, T. (2018). Improved fusion of visual and language representations by dense symmetric co-attention for visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6087-6096). [paper]

Secondary: Yang, Z., He, X., Gao, J., Deng, L., & Smola, A. (2016). Stacked attention networks for image question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 21-29). [paper]
Xiaona
[slides]
02/13 Attention mechanisms:
Part II

Primary: Huang, L., Wang, W., Chen, J., & Wei, X. Y. (2019). Attention on attention for image captioning. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 4634-4643). [paper]

Secondary: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30. [paper] [illustrated explanation]
Xavier
[slides]
02/15 Multimodal transformers
Primary: Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., ... & Gao, J. (2020). Oscar: Object-semantics aligned pre-training for vision-language tasks. In 16th European Conference of Computer Vision (ECCV), Glasgow, UK, August 23–28, 2020, Proceedings (pp. 121-137). Springer International Publishing. [paper]

Secondary: Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., & Dai, J. VL-BERT: Pre-training of Generic Visual-Linguistic Representations. In International Conference on Learning Representations. [paper]

Extra Background: Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171-4186). [paper] [illustrated explanation]
Amun
[slides]
02/20 Multimodal transformers:
Part II

Primary: Zellers, R., Lu, J., Lu, X., Yu, Y., Zhao, Y., Salehi, M., ... & Choi, Y. (2022). Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16375-16387). [paper] [video]

Secondary: Akbari, H., Yuan, L., Qian, R., Chuang, W. H., Chang, S. F., Cui, Y., & Gong, B. (2021). Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34, 24206-24221. [paper] [video]
Chiawei
[slides]
[pptx]
02/22 Representation learning
Primary: Duan, J., Chen, L., Tran, S., Yang, J., Xu, Y., Zeng, B., & Chilimbi, T. (2022). Multi-modal alignment using representation codebook. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 15651-15660). [paper]

Secondary: Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., & Hoi, S. C. H. (2021). Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34, 9694-9705. [paper]

Extra Background: He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9729-9738). [paper] [video]
Muntasir
[slides]
02/27 Pre-training
Primary: Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., & Beyer, L. (2022). Lit: Zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 18123-18133). [paper] [supp]

Secondary: Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). [paper]
Kiet
[slides]
03/01 Prompt learning
Primary: Shu, M., Nie, W., Huang, D. A., Yu, Z., Goldstein, T., Anandkumar, A., & Xiao, C. (2022). Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models. In Advances in Neural Information Processing Systems. [paper] [website]

Secondary: Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16816-16825). [paper]

Extra Background: Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022). Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9), 2337-2348. [paper]
Tianjiao
[slides]
Project proposal due March 3rd, 10:00 PM
03/06 Spring break
03/08 Spring break
03/13 Few shot learning
Primary: Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., ... & Wei, F. (2023). Language Is Not All You Need: Aligning Perception with Language Models. arXiv preprint arXiv:2302.14045. [paper]

Secondary: Alayrac, J. B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., ... & Simonyan, K. (2022). Flamingo: a Visual Language Model for Few-Shot Learning. In Advances in Neural Information Processing Systems. [paper]
Deval
[slides]
03/15 Privileged modalities
Primary: Li, Y., Panda, R., Kim, Y., Chen, C. F. R., Feris, R. S., Cox, D., & Vasconcelos, N. (2022). VALHALLA: Visual Hallucination for Machine Translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5216-5226). [paper]

Secondary: Hoffman, J., Gupta, S., & Darrell, T. (2016). Learning with side information through modality hallucination. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 826-834). [paper]
Alvi
[slides]
03/20 Multimodal learning
Primary: Peng, X., Wei, Y., Deng, A., Wang, D., & Hu, D. (2022). Balanced multimodal learning via on-the-fly gradient modulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8238-8247). [paper]

Secondary: Wang, W., Tran, D., & Feiszli, M. (2020). What makes training multi-modal classification networks hard?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12695-12705). [paper]
Connor
[slides]
03/22 Knowledge representation and reasoning
Primary: Ding, Y., Yu, J., Liu, B., Hu, Y., Cui, M., & Wu, Q. (2022). Mukea: Multimodal knowledge extraction and accumulation for knowledge-based visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5089-5098). [paper]

Secondary: Marino, K., Chen, X., Parikh, D., Gupta, A., & Rohrbach, M. (2021). Krisp: Integrating implicit and symbolic knowledge for open-domain knowledge-based vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14111-14121). [paper]
Ting-Chih
[slides]
03/27 Embodied vision
Primary: Gadre, S. Y., Wortsman, M., Ilharco, G., Schmidt, L., & Song, S. (2023). CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation. To appear, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023. [paper]

Secondary: Khandelwal, A., Weihs, L., Mottaghi, R., & Kembhavi, A. (2022). Simple but effective: Clip embeddings for embodied ai. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 14829-14838). [paper]
Chase
[slides]
03/29 Text-to-image generation
Primary: Tao, M., Tang, H., Wu, F., Jing, X. Y., Bao, B. K., & Xu, C. (2022). Df-gan: A simple and effective baseline for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16515-16525). [paper]

Secondary: Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., & He, X. (2018). Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1316-1324). [paper]
Aditya
[slides]
04/03 Guest lecture: Jiawei Ma
Towards Efficient Adaptation for Multi-Modal Video Understanding

Primary: Yang, Y., Ma, J., Huang, S., Chen, L., Lin, X., Han, G., & Chang, S. F. (2023). TempCLR: Temporal Alignment Representation with Contrastive Learning. In International Conference on Learning Representations (ICLR). [paper]

Secondary: Ma, J., Xie, H., Han, G., Chang, S. F., Galstyan, A., & Abd-Almageed, W. (2021). Partner-assisted learning for few-shot image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10573-10582). [paper]
04/05 Text-to-image generation:
Part II

Primary: Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., ... & Norouzi, M. (2022). Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. In Advances in Neural Information Processing Systems. [paper] [supp] [video]

Secondary: Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10684-10695). [paper] [supp]
Apoorv
[slides]
Project status report due April 7, 10:00 PM
04/10 Bias and fairness
Primary: Hirota, Y., Nakashima, Y., & Garcia, N. (2022). Quantifying societal bias amplification in image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13450-13459). [paper] [supp] [video]

Secondary: Hendricks, L. A., Burns, K., Saenko, K., Darrell, T., & Rohrbach, A. (2018). Women also snowboard: Overcoming bias in captioning models. In Proceedings of the European conference on computer vision (ECCV) (pp. 771-787). [paper]
Hanwen
[slides]
04/12 Guest lecture: Xudong Lin
How does textual knowledge break the limitations of the current paradigm of multimodal video understanding and reasoning?

Primary: Lin, X., Tiwari, S., Huang, S., Li, M., Shou, M. Z., Ji, H., & Chang, S. F. (2023). Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval. To appear, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023. [paper]

Secondary: Lin, X., Petroni, F., Bertasius, G., Rohrbach, M., Chang, S. F., & Torresani, L. (2022). Learning to recognize procedural activities with distant supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13853-13863). [paper]
04/17 Guest lecture: Brian Chen
Learning Video Representations from Self-supervision

Primary: Chen, B., Rouditchenko, A., Duarte, K., Kuehne, H., Thomas, S., Boggust, A., ... & Chang, S. F. (2021). Multimodal clustering networks for self-supervised learning from unlabeled videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 8012-8021). [paper]

Secondary: Chen, B., Selvaraju, R. R., Chang, S. F., Niebles, J. C., & Naik, N. (2023). Previts: contrastive pretraining with video tracking supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 1560-1570). [paper]
04/19 Project presentation: Muntasir, Tianjiao, and Xiaona
Fine-Grained Alignment for Recipe Embeddings

Primary: Shukor, M., Couairon, G., Grechka, A., & Cord, M. (2022). Transformer decoders with multimodal regularization for cross-modal food retrieval. In Multimodal Learning and Applications Workshop (MULA) held in conjunction with the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4567-4578). [paper]

Secondary: Salvador, A., Hynes, N., Aytar, Y., Marin, J., Ofli, F., Weber, I., & Torralba, A. (2017). Learning cross-modal embeddings for cooking recipes and food images. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3020-3028). [paper]
04/24 Project presentation: Alvi, Chiawei, and Connor
Text-Guided Multi-Modal Diffusion Models for Joint Image and Audio Generation

Primary: Peebles, W., & Xie, S. (2022). Scalable Diffusion Models with Transformers. arXiv preprint arXiv:2212.09748. [paper]

Secondary: Ruan, L., Ma, Y., Yang, H., He, H., Liu, B., Fu, J., ... & Guo, B. (2022). MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation. arXiv preprint arXiv:2212.09478. [paper]
04/26 Project presentation: Aditya, Apoorv, Deval, and Kiet
Adapters Are All You Need

Primary: Sung, Y. L., Cho, J., & Bansal, M. (2022). VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5227-5237). [paper]

Secondary: Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., ... & Lee, Y. J. (2023). GLIGEN: Open-Set Grounded Text-to-Image Generation. arXiv preprint arXiv:2301.07093. [paper]

Extra Background: Shu, M., Nie, W., Huang, D. A., Yu, Z., Goldstein, T., Anandkumar, A., & Xiao, C. (2022). Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models. In Advances in Neural Information Processing Systems. [paper]
05/01 Project presentation: Amun, Cedric, Chase, Xavier
Multimodal Political Bias Identification and Neutralization

Primary: Thomas, C., & Kovashka, A. (2019, December). Predicting the politics of an image using webly supervised data. In Proceedings of the 33rd International Conference on Neural Information Processing Systems (pp. 3630-3642). [paper]

Secondary: Pryzant, R., Martinez, R. D., Dass, N., Kurohashi, S., Jurafsky, D., & Yang, D. (2020, April). Automatically neutralizing subjective bias in text. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 01, pp. 480-489). [paper]

Extra Background: Thomas, C., & Kovashka, A. (2020, August). Preserving semantic neighborhoods for robust cross-modal retrieval. In European Conference on Computer Vision (pp. 317-335). Springer, Cham. [paper]
05/03 Project presentation: Ting-Chih and Hanwen
Heterogeneous Graph Network for Multi-page Document Visual Question Answering

Primary: Tito, R., Karatzas, D., & Valveny, E. (2022). Hierarchical multimodal transformers for Multi-Page DocVQA. arXiv e-prints, arXiv-2212. [paper]

Secondary: Huang, Y., Lv, T., Cui, L., Lu, Y., & Wei, F. (2022, October). Layoutlmv3: Pre-training for document ai with unified text and image masking. In Proceedings of the 30th ACM International Conference on Multimedia (pp. 4083-4091). [paper]
Project final report due 9:45 AM, May 8

Acknowledgements

This course was inspired by and/or uses resources from the following courses: