CS 6804: Multimodal Vision
Fall 2024
Department of Computer Science, Virginia Tech
Location: 240 McBryde Hall, 225 Stanger St. Blacksburg, VA.
Meeting time: Tuesdays and Thursdays, 5:00 PM - 6:15 PM
Instructor: Chris Thomas
E-mail: Please message me through the messaging system on Canvas rather than sending e-mail.
Office: 378 Data and Decision Sciences Building
Office hours: 4:00 - 5:00 PM Tuesdays and Thursdays. My Zoom is linked here.
Exam section: 17T. December 14, 2024, 7:00-9:00 PM. While there are no exams in this course, we may use this time for final presentations (virtually) if necessary.
Course overview
Course description: Humans are able to reason about how concepts read in text, heard in audio, and seen in visual content (different modalities of data) relate to one another by drawing on a learned multimodal understanding of the world. For example, reading a textual description might allow one to recognize a bird they have never seen before by drawing on a background understanding of what the color blue looks like and what a high-pitched bird call sounds like. Thus, building artificially intelligent systems capable of robust multimodal reasoning is an area of intense research interest. This graduate-level seminar course will introduce students to the latest research in multimodal computer vision, with a significant emphasis on vision and language. The course will feature foundational lectures, in-depth student-led presentations on state-of-the-art research, and classroom discussions. Students will complete an intensive, semester-long research project and report.
Prerequisites: A background in deep learning is strongly recommended. Prior machine learning coursework would also be highly beneficial. Concerned students should speak to the instructor as soon as possible to ensure they have sufficient background to successfully complete this course. You aren’t expected to be familiar with every technique in every paper covered in this course, but you should be able to understand the main ideas of each paper discussed and propose (and possibly implement for your class project) extensions or improvements to these papers. If you are uncomfortable with your ability to do this, speak to the instructor as soon as possible.
Format: This is a seminar-style graduate course covering recent multimodal computer vision research. The majority of class time will be spent listening to paper presentations by other students, followed by group discussions. Before each class, you will read and write a review of the primary paper presented. You will also work in a group on a final project, which will be presented to the class at the end of the semester.
Canvas: Grades, class schedule, announcements, and other materials for this course will be released and updated on Canvas. You are strongly encouraged to sign up for Canvas e-mail and/or push notifications so that you are notified of new discussions and course updates posted to Canvas.
Course learning objectives:
- Learning about state-of-the-art methods in multimodal computer vision
- Learning to think critically about research, which applies beyond this class or subject area. This involves developing the ability to critically assess research papers you encounter and to understand how different works are connected.
- Learning how to conceive novel ideas and extensions of existing research methods and implement your ideas
- Learning how to clearly write up and present research
- Learning how to work and collaborate with others in a research group
Topical outline: Example topics to be covered in this course include representation learning, fusion, pretraining, privileged modalities, prompt learning, cross-modal retrieval, model architectures (e.g. transformers, two-stream, convolutional), attention mechanisms, zero-shot and few-shot recognition, knowledge representation, generative models, neurosymbolic reasoning, and embodied vision. Students will be asked to vote on these topics (and to possibly contribute new ones that interest them) on the first day of the course. The schedule for the remainder of the course will be determined to ensure adequate topic coverage and to reflect student interest.
Requirements
Your final grade in the course will be based on your in-class participation and discussion, paper reviews, paper presentation, and final project, which will be weighted as follows:
Component | Weight |
---|---|
Final project | 45% |
In-class participation and discussion | 15% |
Paper presentation | 20% |
Paper reviews | 20% |
The grade scale for the term will be:
Percentage | 100 | 90 | 89 | 85 | 80 | 79 | 75 | 70 | 69 | 65 | 60 | <60 |
Letter | A | A- | B+ | B | B- | C+ | C | C- | D+ | D | D- | F |
Virginia Tech does not award A+ grades. Any component of the course may be curved at instructor discretion. No grades will be lowered as a result of a curve.
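As an illustration of how the component weights above combine, here is a minimal Python sketch. The weights come from the table above. The function name, the dictionary keys, and the example scores are hypothetical, and the sketch does not model any curve or the mapping to letter grades.

```python
# Component weights from the grading table (fractions of the final grade).
WEIGHTS = {
    "final_project": 0.45,
    "participation": 0.15,
    "paper_presentation": 0.20,
    "paper_reviews": 0.20,
}


def weighted_percentage(scores):
    """Combine per-component percentages (0-100) into one overall percentage.

    `scores` maps each key in WEIGHTS to that component's percentage.
    Curving and letter-grade conversion are deliberately left out.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "final_project": 90,
    "participation": 100,
    "paper_presentation": 85,
    "paper_reviews": 80,
}
print(round(weighted_percentage(example), 2))
```

The weights sum to 100%, so a student who earns the same percentage on every component receives that percentage overall.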
Attendance: Because this is a seminar-style class, attendance is required. Missing more than two classes without a valid excuse will negatively affect your final grade. Attendance may be taken at random.
Participation: You are expected to regularly and meaningfully participate in this seminar class. Participation could include asking interesting questions or offering comments on a paper, as well as answering and responding to comments from other class members. Questions asked within the class about vision topics are assumed to be addressed to the class to answer, rather than the instructor. Note that thoroughly reading the papers ahead of the class and doing a good job on your paper review will help prepare you to participate in class. Each presenter has spent a considerable amount of time preparing their presentation and studying the paper, so you should do your part to engage.
Paper reviews
The purpose of paper reviews is to prepare students to critically assess and review multimodal computer vision research. Students are required to write one paper review for each class in which a paper is presented. Students who are presenting in a given class session are not required to submit a paper review for that session. The paper review must cover one of the primary papers presented in that class, though students may refer to other papers in their review for comparison. If more than one student presents on the same day, you should still read the other papers in order to be able to participate. If there are multiple presenters, to ensure both papers are roughly equally reviewed, students will be assigned to review the first or second presenter’s primary paper based on the last digit of their student ID (e.g. even reviews first, odd reviews second). Paper reviews should be 1 page, single-spaced, in 11-point Times New Roman with 1-inch margins. Paper reviews should follow the guidelines below (modified from the CVPR Reviewer Guidelines).
Tips for writing good reviews
- In your review, look for what is good or stimulating in the paper, and what knowledge advancement it has made. Your review should highlight both the novelty and potential impact of the work. Above all, you should be specific and detailed in your reviews.
- Take the time to write good reviews. Ideally, you should read a paper and then think about it over the course of several days before you write your review.
- You should take enough time to write a thoughtful and detailed review. Bullet lists with one short sentence per bullet are NOT a detailed review.
- Be specific about novelty. Claims in a review that the submitted work “has been done before” MUST be backed up with specific references and an explanation of how closely they are related. At the same time, for a positive review, be sure to summarize what novel aspects are most interesting in the Strengths section.
- It is best to avoid the term “the authors” in your review because you are reviewing their work and not the person. Instead, try to refer to “the paper” or “the method”. Referring to the authors can be perceived as being confrontational when you write real reviews, even though you may not mean it this way.
- Be generous about suggesting new ideas for how the paper could be improved. You might suggest a new technical tool that could help, a dataset that could be tried, an application area that might benefit from the method, or a way to generalize the idea to increase its impact.
Contents of paper reviews
Each paper review should consist of the following parts and should clearly indicate which is being addressed at that point in the review (e.g. Summary, Relation to prior work, Strengths, Weaknesses, Future work).
- Summary: The paper review should first summarize the entire paper. This means first explaining what the paper is trying to do and how the paper proposes to do it. Your summary should focus on the primary novelty and contributions of the paper, rather than unimportant details. Your summary will typically describe a new model architecture or loss function, but might also describe key mathematical insights that undergird the paper. If the primary contribution of the paper is a dataset, your summary should describe what makes the dataset significant. Your summary should also cover how the method is experimentally evaluated and any significant findings or results, whether quantitative or qualitative.
- Relation to prior work: The review should next summarize the paper’s relation to prior work and why its contributions are (or are not) significant in your opinion. Reading the paper’s “Related Work” section and understanding how the paper differs from prior work will help you write this section.
- Strengths: The review should mention at least three strengths of the approach. For example, you might explain how a particular technique or design is expected to solve problems with existing work. Simply rephrasing the strengths of the paper from its contribution section does not adequately address this point. Instead, rely on your own impression of the work and your own judgments about its novelty.
- Weaknesses: What do you feel detracts from the paper’s contributions? Your review should mention at least three weaknesses. Some example weaknesses include cases where the method is unlikely to perform well because of its design, computational costs, non-standard inference or training requirements, shortcomings in the proposed loss function or formulation, or a weak experimental evaluation.
- Future work: Propose at least one possible extension of the paper. This might be a fix to a weakness you identified (e.g. a modified model or loss function) or you might propose how the techniques developed in the paper could be applied in some novel way for a different task. You should not, however, simply rephrase or repeat the future work suggested by the paper itself. Instead, think critically about how you might extend the paper as a researcher.
Grading policy for paper reviews
Paper reviews will be submitted through Canvas in PDF format and are due at 10:00 PM the day before the class in which the paper will be presented. The submission time reported by Canvas will be used to determine the time of submission. A submission that is one minute late is still late. Timely reviews start at 10 points. You will lose points for making false statements, vague or irrelevant claims which don’t indicate a deeper understanding of the paper, reviews which simply rephrase the paper’s own summary, claimed strengths, or weaknesses without providing your own insights (we want to hear what you think - not what the authors say!), or reviews which otherwise fail to address the above criteria. If your review is submitted by 12:00 PM the day of the paper presentation, your review will start at half credit. After 12:00 PM on the day of the class, you will not receive credit for your review. You will get three late days, meaning you can submit three reviews late (i.e. by 12:00 PM on the day of the class) without a penalty. These late days apply to paper reviews only.
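The timing rules above can be summarized in a short Python sketch. It only computes the score a review starts at, before any deductions for content. The function name and parameters are hypothetical. Treating a submission at exactly 10:00 PM (or exactly 12:00 PM) as meeting that deadline is an assumption, as is tracking the three late days outside the function.

```python
from datetime import date, datetime, time, timedelta

FULL_POINTS = 10  # timely reviews start at 10 points


def starting_points(class_date, submitted_at, use_late_day=False):
    """Return the score a review starts at, based only on submission time.

    class_date:    date of the class in which the paper is presented
    submitted_at:  submission timestamp reported by Canvas
    use_late_day:  True if one of the three late days is spent on this review
                   (the caller tracks how many late days remain)
    """
    # Due at 10:00 PM the day before class.
    due = datetime.combine(class_date - timedelta(days=1), time(22, 0))
    # Final cutoff at 12:00 PM on the day of class.
    cutoff = datetime.combine(class_date, time(12, 0))

    if submitted_at <= due:  # assumption: exactly 10:00 PM counts as on time
        return FULL_POINTS
    if submitted_at <= cutoff:
        # Late but before the cutoff: half credit unless a late day is used.
        return FULL_POINTS if use_late_day else FULL_POINTS / 2
    return 0  # after 12:00 PM on the day of class: no credit


# Hypothetical class date, for illustration only.
class_day = date(2024, 9, 10)
print(starting_points(class_day, datetime(2024, 9, 9, 22, 1)))
```

In this example the review arrives one minute after the deadline without a late day, so it starts at half credit.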
Paper presentations
Each student will give 1-2 presentations throughout the course (depending on enrollment). Each class will be focused on a particular topic of interest and will contain one primary paper and possibly several background papers. Students will express their topic preferences after the first class via Canvas, though due to the class size, there is no guarantee students will be matched to their desired topic.
Guidelines for paper presentations
Students should thoroughly read the assigned papers and other relevant background papers. This means fully understanding key equations, model design choices, etc. Your presentation, at a minimum, should:
- Clearly define what problem the paper is addressing.
- Provide motivation for why the problem is important, interesting, and/or challenging.
- Address prior related work that has attempted to address this problem (or a related problem).
- Describe, in detail, the proposed approach for the problem. For example, this may involve describing details of the model design and key loss functions used to train it. You should understand all equations that you present in class.
- Explain how the paper is evaluated. You should fully describe the experimental set-up and present any quantitative and qualitative results. If there are any unusual metrics that students may not know, you should explain what those are and how they are computed.
- Discuss key strengths and weaknesses of the paper.
- Propose ideas for future work and identify any open research questions.
The non-primary paper(s) often provide background context to the primary paper. For example, the primary paper may build upon methods or results developed in these papers. In this case, you should clearly present this background work in such a way that students will be able to understand how the primary paper builds upon that prior work.
If there is one presenter, the presentation should be 45 minutes long. If there are two presenters, each presentation should be 20 minutes long. Presenters who present by themselves on a day will not be asked to do more than one presentation and will be allowed to skip writing two paper reviews. Good paper presentations are the result of extensive preparation and practice. You should practice your presentation many times before presenting it to the class to ensure you know what to say and to time yourself. Your presentation should be no more than 2-3 minutes shorter than the time specified and certainly no longer than 45 minutes (or 20 minutes if there are two presenters). In sum, your presentation is expected to be highly polished. This means you end on time, your presentation is well organized, and you explain the paper(s) presented clearly. After your paper presentation, you will moderate a discussion session of roughly 20-25 minutes (or 10-15 minutes if there are two presenters). You are responsible for preparing possible topics for discussion and driving the discussion. The discussion could involve potential weaknesses in the approach, ideas for future work, thoughts on terminology used by the authors, the relationship of the paper to other literature, claims that the paper made that weren’t adequately justified, choices that you didn’t understand, etc.
You are strongly encouraged to use illustrations and graphics to explain concepts. Using animations, images, and videos (if applicable) is highly encouraged to make your presentation more engaging. Avoid slides which are just walls of text. Instead, you are encouraged to use short bullets (which you animate) and explain the rest verbally. You are also encouraged to animate equations and explain them piece by piece, which can help students better understand complex concepts. You are highly encouraged to search online for relevant materials which you may use in your presentation. However, make sure to clearly cite all your sources. You are free to use slides made by others, but be aware that your presentation requirements are different from those of the authors or others who may be describing the paper. Unlike authors presenting their work at a conference, your presentation should view the paper critically. This means identifying weaknesses and thinking about the significance of the work in its broader context. Finally, make sure to use your own words on slides and during your presentation. You should not be memorizing someone else’s words and presenting them as your own or copying text verbatim from the paper (or elsewhere) to slides.
Paper presentation slides should be uploaded to Canvas (pptx or pdf) by 10:00 PM on the day before the intended class presentation.
Grading policy for paper presentations
Your grade for your paper presentation will be based on: 1) clarity and presentation quality; 2) whether you covered the key points of the paper; 3) correctness of the statements you made during the presentation; 4) whether you addressed all the guidelines above; 5) peer reviews by others in the class; 6) how well you facilitated the discussion of the paper; and 7) how well you delivered your presentation (i.e. was it clearly practiced, did it meet the time constraints, etc.).
Final project
This course will conclude with a student-driven group project, with a report due at the end of the course. In order to make meaningful progress on a project, groups must be 3-5 students. Note that larger groups will have higher expectations. Given that a significant portion of your final grade depends on the final project, each student is expected to contribute significantly to the final project. All projects must involve implementation of a multimodal computer vision system or algorithm along with a thorough evaluation. The goal is for your final project report to resemble a conference paper like those you have read throughout the class. Ideally, your group project will become a subsequent conference publication. The topic of your final project is open-ended and groups are free to choose their own topic. However, final projects should fall into at least one of the following broad categories:
- Extend one of the papers we covered in class in a significant way, complete with a thorough experimental evaluation;
- Propose a novel method or approach for solving a multimodal vision problem we discussed in class or that is already known in the literature and thoroughly evaluate it;
- Propose a completely new multimodal vision problem and explain why it is significant and needs solving, implement an approach to solve the problem, and evaluate the approach.
In summary, your final project can address any multimodal vision problem, either existing or new, as long as you propose a new method or a significant extension or modification of existing methods. Applying existing methods or techniques to new datasets or problems is not sufficient for the final project. All projects must be thoroughly experimentally evaluated. This may involve benchmarking existing relevant work in the case of a new problem or applying your method on standard benchmarks and computing standard metrics. Projects that overlap in some way with your existing research are OK, but the design, conception, implementation, evaluation, and delivery of the project should be the work of the students, not other faculty members, and should be specific to this course. However, your project can build upon or extend your prior research efforts without an issue.
Each student in the group should document everything they contributed to the project and how work was divided among group members. Students will be asked to provide a review of each other group member’s contributions as a form of peer review at the end of the course. Please take this seriously and recognize that free-riders will be significantly penalized. Groups experiencing free-riding members should speak to the instructor as soon as possible if unable to resolve the issue internally.
Project proposal
The project proposal should be 3-5 pages long (excluding references) and must use the CVPR LaTeX template. Your project proposal should include the following:
- Project title
- Group members
- Group logistics (how will you communicate, how will the group regularly meet to discuss the project, who will handle each part, etc.)
- A clear problem statement which describes the goal of the project.
- A thorough literature review. Make sure you thoroughly search the literature before you start writing. You might find that your idea has already been taken. The literature review should resemble that in a CVPR conference paper and should cite existing work. It should clearly show how the proposed project does something the prior work you cite does not.
- A detailed description of the proposed approach. You should describe new loss functions you plan to use, changes to existing models, etc.
- Identify the computational resources that you plan to use. Do not propose a project that you are not confident you will have sufficient resources to execute.
- The proposed experimental evaluation protocol and expected results. You should describe what experiments you plan to run and how you will run them. You should describe which datasets you plan to evaluate on, any existing code bases you will use, and what needs to be implemented by the group. You should also describe what your group is aiming for with the project (i.e. what do you consider a success). You should explain what you hope each experiment will show and discuss any uncertainty you have about the project. If you already have preliminary results, feel free to include them.
Project status report
The purpose of the project status report is to update me on your progress while also moving your project proposal document closer to the final report. The project status report should be about 3-5 pages (excluding references) and should describe the group’s progress on the project and any unforeseen blockers or challenges you are facing. The project status report should also use the CVPR LaTeX template and should include Introduction, Related Work, Approach, and Results sections following the standard CVPR paper layout. You are free to reuse text from the proposal in the project status report (e.g. literature review). Please include any preliminary results, even if they aren’t good.
Project presentation
Each group will present their project to the class during the final several sessions of the class. The final project presentations should address the points listed in the guidelines for paper presentations above, but should be more descriptive of your project (since other students haven’t read your paper). The same guidelines for paper presentations apply to the project presentation, i.e. the presentation should be engaging, clear, and well-rehearsed. You don’t need to be as critical of your project as you were in the paper presentation, but you should point out any strengths or weaknesses of your project and think critically about how it could be improved. The length of the presentations will depend on the number of groups and class enrollment and will be determined after the add / drop deadline. Presentations should be TBD minutes long. Each member of the group should present a part or parts of the presentation that they are individually responsible for. Like your paper presentation, you should carefully rehearse your presentation both individually and as a group to ensure it flows well and that it is on time. Your presentation should cover the same points as the paper presentation, but should also mention any thinking behind design choices, motivations, etc. You should thoroughly present related work and your method since class members will not be familiar with the background of your project. The grading criteria for the project presentation are the same as for the paper presentation, though your project presentation should be especially polished. Since you have now received one round of feedback on presentation skills, you should use it to improve your presentation style. Mistakes from your prior presentation that are repeated in your project presentation will be graded more harshly, since you have received prior feedback. Please carefully review the peer reviews and grading notes from your paper presentation.
Following your presentation, we will discuss your group project and provide feedback you can incorporate in your final report.
Final report
The project final report should resemble a CVPR conference paper and should be eight pages (excluding references). This means having a polished concept figure, method figure(s), tables with results, qualitative results, etc. To be clear, your final report should be of the same quality of presentation as the other conference papers you have read in this class (even though your results may not be as compelling from a one-semester class project). Your final report must include an Abstract as well as Introduction, Related work, Approach, Results, and Conclusion sections. The final report should be self-contained and thoroughly described in sufficient detail that someone else in this class could implement your approach given the chance. Each student in the group will be required to privately submit a brief writeup on a separate Canvas page documenting what every other group member in the group contributed to the project and how work was divided among group members.
Final project grading criteria
Your final project grade will primarily be based on the thought process and effort put into your project as demonstrated through your presentation and final report. While you should work to get the best results possible, it is understood that given the limited time available your method might not outperform other state-of-the-art approaches. The best projects are those that have new, clever ideas, not necessarily those that perform best on a given benchmark. Your project report (and presentation) will be evaluated on the following factors: 1) how well you related it to prior research; 2) the clarity and format of the presentation and report; and 3) completeness. For completeness, you will be evaluated on whether you complied with all the requirements of the project (e.g. does your report have all the required sections) and the degree to which you put in the thought and effort required to deliver an interesting and compelling class project. The paper will also be evaluated on the degree to which all experimental evaluations necessary for evaluating it have been performed (e.g. main results, ablations, qualitative results).
Project deliverables
All project deliverables will be uploaded on Canvas.
- Project proposal (5% of final grade) - due October 4, 10:00 PM
- Project status report (5% of final grade) - due November 8, 10:00 PM
- Project presentations (15% of final grade) - TBD based on enrollment / number of groups
- Project final report (20% of final grade) - due December 14, 9:00 PM
A note on submission: Untimely submissions will be significantly penalized at the discretion of the instructor. You must submit all components of the final project on time. It is your responsibility to make sure all submissions in this class are complete. Once you submit, please download the file again and verify it opens successfully. Corrupted files will receive no credit. In the event of an outage on Canvas that affects submission, you may e-mail the instructor your files as a fallback.
GPU Access: Your final project will likely require use of a GPU. Some GPU resources you can utilize include Google Colab, the Advanced Research Computing center at Virginia Tech, and the GLogin cluster run by the department (SSH to glogin.cs.vt.edu). Please take note of GPU limitations when designing your final project.
Additional information
Academic accommodations
Virginia Tech welcomes students with disabilities into the University’s educational programs. The University promotes efforts to provide equal access and a culture of inclusion without altering the essential elements of coursework. If you anticipate or experience academic barriers that may be due to disability, including but not limited to ADHD, chronic or temporary medical conditions, deaf or hard of hearing, learning disability, mental health, or vision impairment, please contact the Services for Students with Disabilities (SSD) office (540-231-3788, ssd@vt.edu, or visit https://ssd.vt.edu). If you have an SSD accommodation letter, please meet with me privately during office hours as early in the semester as possible to deliver your letter and discuss your accommodations. You must give me reasonable notice to implement your accommodations, which is generally 5 business days, or 10 business days for final exams.
Academic integrity
The tenets of the Virginia Tech Graduate Honor Code will be strictly enforced in this course, and all assignments shall be subject to the stipulations of the Graduate Honor Code. For more information on the Graduate Honor Code, please refer to the GHS Constitution. Specifically, you are encouraged to discuss the content covered in this course with others. However, you are responsible for doing your paper reviews on your own. This means you should work on your paper reviews yourself and should not share your paper reviews with others. You are allowed to use code and materials from other papers and sites, but you must cite your sources and clearly describe your contributions. You may also use generative AI tools (e.g. ChatGPT) to polish your writing or to help with coding, but you must not use such tools to complete your paper reviews for you. For example, uploading a paper to ChatGPT and requesting it list strengths and weaknesses that you then re-write on your own is an academic integrity violation. Such activity will be reported to the graduate school and may result in you failing the class and receiving a disciplinary penalty. If you have any questions as to whether something runs afoul of this policy, please contact the instructor before using the resource or submitting the assignment.
Emergencies and medical conditions
If you have an emergency or medical condition, you must inform the instructor before the deadline of the assignment. You may be required to submit documentation of the emergency or condition to the Dean of Students Office.
Extension requests: To ensure fairness, extensions will not be considered absent documented extraordinary circumstances. You should also inform the instructor before the deadline of the assignment or exam. You will also likely be required to submit documentation to the Dean of Students Office for verification.
Incomplete requests: Should you be unable to complete the requirements of this course during the semester because of extraordinary circumstances, you may request an incomplete through the last day of class. You will also likely be required to submit documentation to the Dean of Students Office for verification.
Acknowledgements
This course was inspired by and/or uses resources from the following courses:
- Vision and Language AI Seminar by Trevor Darrell, University of California, Berkeley, Fall 2024
- Vision-Language Models for Computer Vision by Adriana Kovashka, University of Pittsburgh, Fall 2023
- Advanced Topics in Computer Vision by Andrew Owens, University of Michigan, Winter 2022
- Computer Vision by Adriana Kovashka, University of Pittsburgh, Spring 2021
- Advanced Computer Vision by Carl Vondrick, Columbia University, Spring 2019
- Advanced Computer Vision by Jia-Bin Huang, Virginia Tech, Spring 2017
- Visual Recognition by Adriana Kovashka, University of Pittsburgh, Spring 2015
- Advanced Topics in Computer Vision by Devi Parikh, Virginia Tech, Spring 2014