Multimodal Classroom Evaluation

AI-Assisted Teaching Observation Framework

An AI-assisted classroom observation system that uses computer vision and language models to help instructors understand student engagement — without storing any identifiable footage.

Funding: 4-VA Collaborative Research Grant — $30,000
Status: Active — Multi-institutional Collaboration
Started: Fall 2025
Institutions: Virginia Tech · University of Virginia

Team

  • Sehrish Basir Nizamani · Principal Investigator (Faculty) · Virginia Tech
  • Nolan Platt · Undergraduate Research Assistant · Virginia Tech
  • Yoonje Lee · Graduate Research Assistant · Virginia Tech
  • Khyati Goyal · Undergraduate Research Assistant · Virginia Tech
  • Zannah Zeiw · Undergraduate Research Assistant · Virginia Tech
  • Saad Nizamani · Collaborator · Virginia Tech
  • Alp Tural · Collaborator · Virginia Tech
  • Elif Tural · Collaborator · Virginia Tech
  • Nada Basit · Collaborator · University of Virginia
  • Andrew Katz · Collaborator · Virginia Tech

Overview

Understanding how students engage during a lecture has always been valuable for improving teaching, but getting that information has traditionally meant hiring trained observers, recording full sessions, or relying on instructor intuition alone. All of those approaches are slow, expensive, or raise real privacy concerns.

This project takes a different approach. We built a pipeline that analyzes classroom video to extract engagement signals including student pose, posture, and visual attention, and then permanently deletes the original footage before any results are saved. What remains is purely geometric data: skeletal coordinates and gaze vectors stored as JSON. No faces, no identifiable imagery. The system is fully FERPA-compliant by design.
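For illustration, a single de-identified per-frame record might look like the sketch below. The field names and structure are hypothetical, not the project's actual schema; the point is that only geometry survives processing.

```python
import json

# Hypothetical per-frame record: only geometric data is retained.
# Field names and layout are illustrative, not the project's real schema.
record = {
    "frame_index": 1342,
    "timestamp_s": 53.68,
    "people": [
        {
            "person_id": 0,                      # anonymous track ID, not a student identity
            "keypoints": [[412.0, 188.5, 0.91],  # 25 (x, y, confidence) triples from pose estimation
                          [415.2, 236.1, 0.88]], # ... truncated for brevity
            "gaze_vector": [0.32, -0.08],        # normalized direction of visual attention
            "gaze_target": [0.71, 0.22],         # estimated attention point in normalized room coords
        }
    ],
}

print(json.dumps(record, indent=2))
```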

That data is then processed by a large language model that generates engagement timelines, attention heatmaps, and behavioral summaries, all delivered to the instructor through a web dashboard after each session.

Why This Matters

Instructors rarely get structured, actionable feedback on what is actually happening in their classrooms during a lecture. Student surveys are retrospective and subjective. Peer observation is infrequent and high-stakes. Most automated tools either require invasive hardware like eye-tracking glasses, store identifiable video that creates privacy risk, or only output narrow classification labels without giving instructors anything they can act on.

This system gives instructors a private, low-effort way to see how their class engaged across a session: where attention dropped, when engagement shifted, and which parts of a lecture held the room. The goal is formative, reflective insight. Not surveillance, not grading, just better teaching.

How It Works

The system processes classroom video through three sequential stages before the instructor ever sees a result.

Stage 1: Privacy-First Vision Processing

As soon as a video is uploaded, faces are blurred using Gaussian filtering before any other processing occurs. OpenPose then extracts 25 skeletal keypoints per person per frame, capturing head position, torso orientation, limb placement, and overall posture. At the same time, Gazelle (built on a frozen DINOv2 encoder) estimates where each student's gaze is directed, producing a normalized attention vector per person. Once both extraction passes are complete, the original video frames are permanently deleted. Only the JSON coordinate files remain.
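A minimal sketch of this stage is shown below, assuming OpenCV for the Gaussian face blur. The `extract_keypoints` and `estimate_gaze` arguments are placeholders for the OpenPose and Gazelle integrations, whose real interfaces are more involved; the flow is an illustration of the ordering, not the production pipeline.

```python
import os
import cv2  # OpenCV: frame I/O and Gaussian blurring

def anonymize_frame(frame, face_detector):
    """Blur every detected face region before any other analysis runs.

    `face_detector` could be, e.g., a cv2.CascadeClassifier; the real system's
    anonymization step may differ.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_detector.detectMultiScale(gray, 1.1, 5):
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(frame[y:y + h, x:x + w], (51, 51), 0)
    return frame

def process_video(path, face_detector, extract_keypoints, estimate_gaze):
    """Extract geometry from an uploaded video, then delete the source file.

    `extract_keypoints` and `estimate_gaze` stand in for the OpenPose and
    Gazelle calls; only the resulting coordinates are kept.
    """
    capture = cv2.VideoCapture(path)
    records = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frame = anonymize_frame(frame, face_detector)   # faces blurred first, every frame
        records.append({
            "keypoints": extract_keypoints(frame),       # 25 skeletal points per person
            "gaze": estimate_gaze(frame),                # normalized attention vectors
        })
    capture.release()
    os.remove(path)  # original footage is permanently deleted; only geometry remains
    return records
```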

Stage 2: LLM Behavioral Analysis

The pose and gaze data is passed to QwQ-32B, a reasoning-focused large language model, which analyzes student behavior entirely from geometric data with no video, no images, and no identifiable information involved. Analysis runs in three layers: 60-second segments that generate per-student behavioral timelines, 5-minute synthesis passes that identify patterns across those segments, and a final summary that characterizes engagement across the full session.
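The hierarchical structure could be organized along the lines of the sketch below. Here `call_llm` is a placeholder for however QwQ-32B is actually served, and the prompts are illustrative rather than the project's real prompt templates.

```python
def analyze_session(frame_records, call_llm, fps=30):
    """Three-layer analysis over de-identified geometry.

    `call_llm` is a placeholder for the interface that serves QwQ-32B;
    the prompts below are illustrative only.
    """
    def chunk(items, size):
        return [items[i:i + size] for i in range(0, len(items), size)]

    # Layer 1: 60-second segments -> per-student behavioral timelines.
    segment_notes = [
        call_llm("Describe student posture and attention in this 60 s of "
                 f"pose/gaze data (JSON):\n{segment}")
        for segment in chunk(frame_records, 60 * fps)
    ]

    # Layer 2: 5-minute synthesis passes -> patterns across adjacent segments.
    synthesis_notes = [
        call_llm("Identify engagement patterns across these 60 s notes:\n"
                 + "\n".join(window))
        for window in chunk(segment_notes, 5)
    ]

    # Layer 3: whole-session summary for the instructor dashboard.
    return call_llm("Summarize engagement across the full session, citing "
                    "time windows:\n" + "\n".join(synthesis_notes))
```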

Stage 3: Instructor Dashboard

Results are delivered through a secure web dashboard. Instructors can view spatial attention heatmaps showing where student gaze concentrated across the room, engagement timelines charting posture and visual focus over the lecture, and LLM-generated behavioral summaries with references back to specific time windows for verification. Processing runs asynchronously: instructors upload a video, and results appear on the dashboard once the pipeline finishes.
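The asynchronous upload flow might look roughly like the sketch below. The project does not specify its web stack, so a FastAPI-style endpoint and a local temporary path are assumed purely for illustration.

```python
from fastapi import BackgroundTasks, FastAPI, UploadFile

app = FastAPI()

def run_pipeline(path: str) -> None:
    # Placeholder for the full vision + LLM pipeline described above;
    # when it finishes, results are stored for the dashboard to display.
    ...

@app.post("/sessions")
async def upload_session(video: UploadFile, background_tasks: BackgroundTasks):
    """Accept an upload and process it asynchronously.

    The instructor gets an immediate acknowledgement; analysis results
    appear on the dashboard once the background job completes.
    """
    path = f"/tmp/{video.filename}"          # illustrative storage location
    with open(path, "wb") as f:
        f.write(await video.read())
    background_tasks.add_task(run_pipeline, path)
    return {"status": "processing"}
```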

What Instructors See

Attention Heatmaps

A spatial overlay of where students were visually attending across the session. This is useful for identifying whether students were tracking the board, a secondary screen, or looking elsewhere during key moments of the lecture.
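One straightforward way to build such an overlay from stored gaze data is a 2D histogram over normalized room coordinates, as in this sketch. It assumes the hypothetical `gaze_target` field from the earlier record example and is not necessarily the project's exact method.

```python
import numpy as np

def attention_heatmap(gaze_targets, bins=(32, 18)):
    """Aggregate normalized gaze-target points into a 2D attention grid.

    `gaze_targets` is an iterable of (x, y) pairs in [0, 1] x [0, 1],
    e.g. the hypothetical `gaze_target` field shown earlier.
    """
    points = np.asarray(list(gaze_targets), dtype=float)
    heat, _, _ = np.histogram2d(
        points[:, 0], points[:, 1],
        bins=bins, range=[[0.0, 1.0], [0.0, 1.0]],
    )
    return heat / heat.max() if heat.max() > 0 else heat  # normalize for display
```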

Engagement Timelines

A chart tracking posture and behavioral states (leaning forward, neutral posture, slouching, heads down) across the full session. These timelines make it easy to spot moments where the class checked out or where a particular activity brought energy back.
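States like these can be derived from the skeletal coordinates with simple geometric heuristics. The sketch below uses the common 25-point body layout (0 = nose, 1 = neck, 8 = mid-hip) and made-up thresholds purely for illustration; it is not the project's actual classification rule.

```python
def posture_state(keypoints, lean_threshold=0.15, head_drop_threshold=0.25):
    """Classify a coarse posture state from one person's 2D keypoints.

    `keypoints` is a list of (x, y, confidence) triples in image coordinates
    (y grows downward). Indices and thresholds are illustrative assumptions.
    """
    nose_x, nose_y = keypoints[0][:2]
    neck_x, neck_y = keypoints[1][:2]
    hip_x, hip_y = keypoints[8][:2]
    torso_len = abs(hip_y - neck_y) or 1.0            # normalize by torso height

    if (nose_y - neck_y) / torso_len > head_drop_threshold:
        return "heads_down"                           # head dropped well below the neck line
    if (nose_x - hip_x) / torso_len > lean_threshold:
        return "leaning_forward"                      # head displaced ahead of the hips
    if (neck_x - hip_x) / torso_len < -lean_threshold:
        return "slouching"                            # torso tipped back relative to the hips
    return "neutral"
```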

Behavioral Summaries

Plain-language summaries describing what happened during each segment of the lecture, what behavioral transitions were observed, and which time windows may be worth revisiting. Each observation is tied back to a specific time range in the session for easy reference.

Privacy and Ethics

Privacy is not an afterthought in this system. It is a hard architectural constraint. The pipeline was designed from the start so that no identifiable imagery can be retained, regardless of how the system is configured.

  • Face anonymization is applied before any other analysis, on every frame
  • Original video frames are permanently deleted immediately after pose and gaze extraction
  • Only geometric coordinates (JSON) are stored, with no faces or identifiable footage retained
  • FERPA compliance is maintained by design throughout the entire pipeline
  • Instructor consent is required; this system is a formative reflection tool, not an evaluation or grading instrument
  • Human oversight is preserved; all LLM outputs are framed as insights for instructor reflection, not automated judgments

Pilot studies are underway at Virginia Tech and the University of Virginia under full IRB and FERPA compliance.

Acknowledgements

This work is supported by a 4-VA Collaborative Research Grant. We gratefully acknowledge the Carnegie Mellon University Perceptual Computing Laboratory (OpenPose), Facebook Research (Gazelle / DINOv2), and the Virginia Tech Department of Computer Science for their support of this work.

Contact

For questions about this project or interest in collaborating, please reach out:

Sehrish Basir Nizamani
Department of Computer Science, Virginia Tech
Email: sehrishbasir@vt.edu