MIT How to AI (Almost) Everything

Preface

This course focuses primarily on multimodal large models, rather than on ways of applying AI.

Course materials link

Course Description

Artificial Intelligence (AI) holds great promise as a technology to enhance digital productivity, physical interactions, overall well-being, and the human experience. To enable the true impact of AI, these systems will need to be grounded in real-world data modalities, from language-only systems to vision, audio, sensors, medical data, music, art, smell, and taste. This course will introduce the basic principles of AI (focusing on modern deep learning and foundation models) and how we can apply AI to novel real-world data modalities. In addition, we will introduce the principles of multimodal AI that can process many modalities at once, such as connecting language and multimedia, music and art, sensing and actuation, and more.

Through lectures, readings, discussions, and a significant research component, this course will develop critical thinking skills and intuitions when applying AI to new data modalities, knowledge of recent technical achievements in AI, and a deeper understanding of the AI research process.

Course Info

Instructor
Departments

Content

Introduction

Multisensory intelligence: Creating human-AI symbiosis across scales and sensory mediums to enhance productivity, creativity, and wellbeing.

Course Overview

  1. AI for new modalities: data, modeling, evaluation, deployment
  2. Multimodal AI: connecting multiple different data sources

Learning Objectives

  1. Study recent technical achievements in AI research
  2. Improve critical and creative thinking skills
  3. Understand future research challenges in AI
  4. Explore and implement new research ideas in AI

Preferred Pre-requisites

  1. Some knowledge of programming (ideally in Python)
  2. Some basic understanding of modern AI capabilities & limitations
  3. Bring external (non-AI) domain knowledge about your problem
  4. Bonus: worked on AI for some modality

Research Projects on New Modalities

Motivation: Many tasks of real-world impact go beyond image and text.

Challenges:

  • AI for modalities where deep learning is not yet effective (e.g., tabular, time-series)
  • Multimodal deep learning + time-series analysis + tabular models
  • AI for physiological sensing, IoT sensing in cities, climate and environment sensing
  • Smell, taste, art, music, tangible and embodied systems

Potential models and datasets to start with

Research Projects on AI Reasoning

Motivation: Robust, reliable, interpretable reasoning in (multimodal) LLMs.

Challenges:

  • Fine-grained and compositional reasoning
  • Neuro-symbolic reasoning
  • Emergent reasoning in foundation models

Potential models and datasets to start with

Research Projects on Interactive Agents

Motivation: Grounding AI models in the web, computer, or other virtual worlds to help humans with digital tasks.

Challenges:

  • Web visual understanding is quite different from natural image understanding
  • Instructions and language grounded in web images, tools, APIs
  • Asking for human clarification, human-in-the-loop
  • Search over environment and planning

Potential models and datasets to start with

Research Projects on Embodied and Tangible AI

Motivation: Building tangible and embodied AI systems that help humans in physical tasks.

Challenges:

  • Perception, reasoning, and interaction
  • Connecting sensing and actuation
  • Efficient models that can run on hardware
  • Understanding influence of actions on the world (world model)

Potential models and datasets to start with

Research Projects on Socially Intelligent AI

Motivation: Building AI that can understand and interact with humans in social situations.

Challenges:

  • Social interaction, reasoning, and commonsense.
  • Building social relationships over months and years.
  • Theory-of-Mind and multi-party social interactions.

Potential models and datasets to start with

Research Projects on Human-AI Interaction

Motivation: What is the right medium for human-AI interaction? How can we really trust AI? How do we enable collaboration and synergy?

Challenges:

  • Modeling and conveying model uncertainty – text input uncertainty, visual uncertainty, multimodal uncertainty? cross-modal interaction uncertainty?
  • Asking for human clarification, human-in-the-loop, types of human feedback and ways to learn from human feedback through all modalities.
  • New mediums to interact with AI. New tasks beyond imitating humans, leading to collaboration.

Potential models and datasets to start with

Research Projects on Ethics and Safety

Motivation: Large AI models can emit unsafe text content and generate or retrieve biased images.

Challenges:

  • Taxonomizing types of biases: text, vision, audio, generation, etc.
  • Tracing biases to pretraining data, seeing how bias can be amplified during training, fine-tuning.
  • New ways of mitigating biases and aligning to human preferences.

Potential models and datasets to start with

  • Many works on fairness in LLMs -> how to extend to multimodal?
  • Mitigating bias in text generation, image-captioning, image generation

Introduction to AI and AI research

  • Introduction to AI and AI research
  • Generating ideas, reading and writing papers, AI experimentation

How Do We Get Research Ideas?

Turn a concrete understanding of existing research’s failings into a higher-level experimental question.

  • Bottom-up discovery of research ideas
  • Great tool for incremental progress, but may preclude larger leaps

Move from a higher-level question to a lower-level concrete test of that question.

  • Top-down design of research ideas
  • Favors bigger ideas, but can be disconnected from reality

Beware “Does X Make Y Better?” “Yes”

The above question/hypothesis is natural, but indirect

  • If the answer is “no” after your experiments, how do you tell what’s going wrong?

Usually you have an intuition about why X will make Y better (it is not just a random guess).

Can you think of other research questions/hypotheses that confirm or falsify these assumptions?

How to do Literature Review and Read a Paper?

  1. Google scholar
  2. Papers with code, Github, Huggingface
  3. Recent conference proceedings
  4. Blog posts
  5. Survey papers, tutorials, courses

Testing Research Ideas

  1. Gather and process the dataset, visualize data, gather labels, do data splits.
  2. Implement the simplest pipeline and get it working (see the sketch after this list). Pipeline = data loading + basic model + eval function + loss/visualization/deployment.
  3. Change one component of the model at a time, repeat x10 (top-down or bottom-up).
  4. Find what works best, and exploit.
  5. Scale up experiments, repeat across multiple datasets.
  6. Careful ablation studies.
  7. Qualitative comparisons and visualizations.
  8. Repeat until successful.
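A minimal sketch of step 2 in PyTorch; the random tensors, two-layer model, and accuracy metric are placeholders standing in for your real dataset, model, and evaluation function:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: replace with your real dataset and splits.
X_train, y_train = torch.randn(800, 32), torch.randint(0, 2, (800,))
X_val, y_val = torch.randn(200, 32), torch.randint(0, 2, (200,))
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val, y_val), batch_size=64)

# Basic model: keep it simple so you can trust the plumbing first.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def evaluate(model, loader):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=-1) == y).sum().item()
            total += y.numel()
    return correct / total

for epoch in range(5):
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: val_acc={evaluate(model, val_loader):.3f}")
```

Once this skeleton runs end to end, steps 3-7 amount to swapping out one component at a time and re-running the same loop.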

How to Write a Paper

  1. Prepare a 15min talk (with figures, examples, tables, etc.)
  2. Convert the talk into a paper.

More resources

Module 1: Foundations of AI

Data, structure, and information

Lecture Outline:

  1. Vision, language, audio, sensing, set, graph modalities
  2. Modality profile
  3. Types of data and labels
  4. Common learning objectives and generalization

Most of AI is about learning abstractions, or representations, from data.

Modality Profile:

  1. Element representations: Discrete, continuous, granularity
  2. Element distributions: Density, frequency
  3. Structure: Temporal, spatial, latent, explicit
  4. Information: Abstraction, entropy
  5. Noise: Uncertainty, noise, missing data
  6. Relevance: Task, context dependence

Summary: How To Data

  1. Decide how much data to collect, and how much to label (costs and time)
  2. Clean data: normalize/standardize, find noisy data, anomaly/outlier detection
  3. Visualize data: plot, dimensionality reduction (PCA, t-SNE), cluster analysis (see the sketch after this list)
  4. Decide on evaluation metric (proxy + real, quantitative and qualitative)
  5. Choose model class and learning algorithm (more next lecture)
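A short sketch of steps 2-3 with scikit-learn; the random feature matrix and the cluster count are placeholders for your own data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

X = np.random.randn(500, 64)                        # stand-in for your feature matrix

X_std = StandardScaler().fit_transform(X)            # normalize/standardize
X_pca = PCA(n_components=50).fit_transform(X_std)    # cheap first reduction
X_2d = TSNE(n_components=2, perplexity=30).fit_transform(X_pca)  # 2-D view for plotting

clusters = KMeans(n_clusters=5, n_init=10).fit_predict(X_pca)    # quick cluster analysis
print(X_2d.shape, np.bincount(clusters))
```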

Huggingface Tutorial

  • “Huggingface” is a set of multiple packages
    • transformers: Provides API to initialize large pretrained models
    • datasets: Provides easy way to download datasets
  • Not from Huggingface but often used together
    • bitsandbytes: Provides functions to quantize large models
    • flash-attn: Allows the model to run faster with less memory
  • Some terms to keep in mind
    • LoRA: Adapter to train large models with less memory
    • Bfloat16: Robust half precision representation often used to save memory
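A minimal sketch of how these packages fit together; the model checkpoint and dataset names below are illustrative choices, not ones prescribed by the course (and `device_map="auto"` additionally assumes the accelerate package is installed):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # illustrative small pretrained model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,   # robust half precision to save memory
    device_map="auto",
)

# datasets: one line to pull a benchmark
dataset = load_dataset("imdb", split="train[:1%]")

inputs = tokenizer(dataset[0]["text"][:200], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantization (via `BitsAndBytesConfig` from transformers, with bitsandbytes installed) and LoRA adapters (via the peft package) plug into the same `from_pretrained` workflow.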

The Recipe

  • Become one with the Data
  • Set up end-to-end skeleton and get dumb baselines
  • Overfit to diagnose errors
  • Regularize for better generalization
    • Add more real data (the best way to reduce overfitting)
    • Use data augmentation and pretraining
    • Reduce input dimensions and model size
    • Techniques: dropout, weight decay, early stopping
  • Tune hyperparameters
    • Prefer random search over grid search
    • Use Bayesian optimization tools when available
    • Don’t overcomplicate; start with simple models
  • Squeeze out final improvements
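One concrete way to act on "overfit to diagnose errors" is to check that the model can drive the loss to near zero on a single fixed batch; if it cannot, suspect a bug in the data, model, or loss before tuning anything else. A sketch with placeholder model and data:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # placeholder model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x, y = torch.randn(16, 32), torch.randint(0, 2, (16,))  # one fixed small batch

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(step, loss.item())
# Expect the loss to approach ~0; if it plateaus, debug before scaling up.
```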

How to Design ML Models for New Data

  • Look at the data first
  • For simple, low dimensional data, start with simple models (SVM, Random Forest, Shallow MLP/CNN)
  • For vision/language data, try a pretrained model
  • Start simple, then add complexity; the simple models double as baselines (see the sketch after this list).
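For the "start with simple models" step, a quick baseline comparison in scikit-learn (synthetic data as a stand-in for yours):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

baselines = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "shallow_mlp": make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
}
for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```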

How to Debug Your Model

  • Look at the data first. Are the input data and labels correct?
    • Ensure there is no data leakage.
  • Look at the outputs. Is the model predicting only one label?
    • Label imbalance: use data augmentation or loss scaling.
  • Look at the training loss.
    • Loss is NaN: inspect weights and inputs for NaN values; make sure weights are properly initialized; for LLMs, use bfloat16 instead of float16.
    • Loss not changing: the model is underfitting. Increase the learning rate, decrease weight decay, add model complexity, or use a better optimizer*.
    • Loss highly varied or increasing: decrease the learning rate, apply gradient clipping, or use a better optimizer*.
  • Look at train vs. validation accuracy (or any other metric).
    • Train » Val: the model is overfitting. Add more weight decay, reduce model complexity, use data augmentation, or get more data.
    • Train ≈ Val ≈ 100%: check for data leakage.

Personal tip: I recommend trying second order optimizers from packages like Heavyball
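Some of these checks are easy to automate. A sketch of a helper that could be called between `loss.backward()` and `optimizer.step()` in a standard PyTorch loop; the function name and thresholds are illustrative, not from the course:

```python
import torch
from collections import Counter

def sanity_checks(model, loss, labels, max_grad_norm=1.0):
    """Cheap checks to run inside the training loop while debugging."""
    # Loss is NaN: look for non-finite parameters before blaming the data.
    if not torch.isfinite(loss):
        bad = [n for n, p in model.named_parameters() if not torch.isfinite(p).all()]
        raise RuntimeError(f"non-finite loss; non-finite params: {bad}")

    # Label imbalance: is the batch dominated by one class?
    counts = Counter(labels.tolist())
    if max(counts.values()) > 0.9 * labels.numel():
        print("warning: batch labels are heavily imbalanced:", dict(counts))

    # Highly varied or increasing loss: gradient clipping keeps updates bounded.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    return grad_norm  # log this; a spiking grad norm often precedes a loss spike
```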

Common model architectures

Lecture Outline

  1. A unifying paradigm of model architectures
  2. Temporal sequence models
  3. Spatial convolution models
  4. Models for sets and graphs

Summary: How To Model

  1. Decide how much data to collect, and how much to label (costs and time)
  2. Clean data: normalize/standardize, find noisy data, anomaly/outlier detection
  3. Visualize data: plot, dimensionality reduction (PCA, t-sne), cluster analysis
  4. Decide on evaluation metric (proxy + real, quantitative and qualitative)
  5. Choose modeling paradigm - domain-specific vs general-purpose
  6. Figure out base elements and their representation
  7. Figure out data invariances & equivariances (+other parts of modality profile)
  8. Iterate between data collection, model design, model training, hyperparameter tuning etc. until satisfied.

Discussion 1: Learning and generalization

Module 2: Foundations of multimodal AI

Multimodal connections and alignment

Modality refers to the way in which something is expressed or perceived.

A research-oriented definition… Multimodal is the science of heterogeneous and interconnected (connected + interacting) data.

Heterogeneous Modalities: Information in different modalities shows diverse qualities, structures, & representations.

Challenge 1: Representation

Definition: Learning representations that reflect cross-modal interactions between individual elements, across different modalities. This is a core building block for most multimodal modeling problems!

Challenge 2: Alignment

Definition: Identifying and modeling cross-modal connections between all elements of multiple modalities, building from the data structure.

Sub-challenges:

  • Discrete connections: Explicit alignment (e.g., grounding)
  • Contextualized representation: Implicit alignment + representation
  • Continuous alignment: Granularity of individual elements

Challenge 2a: Discrete Alignment

Definition: Identify and model connections between elements of multiple modalities

Challenge 2b: Continuous Alignment

Definition: Model alignment between modalities with continuous signals and no explicit elements

Challenge 3: Reasoning

Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.

Challenge 4: Generation

Definition: Learning a generative process to produce raw modalities that reflects cross-modal interactions, structure, and coherence.

Challenge 5: Transference

Definition: Transfer knowledge between modalities, usually to help the target modality, which may be noisy or have limited resources.

Challenge 6: Quantification

Definition: Empirical and theoretical study to better understand heterogeneity, cross-modal interactions, and the multimodal learning process.

Discussion 2: Modern AI architectures

Multimodal interactions and fusion

Vision-and-Language Transformer (ViLT) (≈ BERT + ViT)

ALBEF: Align Before Fuse (≈ BERT + ViT + CLIP-ish)
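ALBEF's "align before fuse" idea applies a CLIP-style image-text contrastive loss before the fusion layers. A simplified sketch of that contrastive objective (real implementations add momentum encoders and queues, omitted here):

```python
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over matched image-text pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # [B, B] similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching text
    loss_t2i = F.cross_entropy(logits.t(), targets)        # text -> matching image
    return (loss_i2t + loss_t2i) / 2
```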

Discussion 3: Multimodal alignment

Cross-modal transfer

Transference

Definition: Transfer knowledge between modalities, usually to help the primary modality, which may be noisy or have limited resources.

Part 1: Transfer via Pretrained Models

Definition: Transferring knowledge from large-scale pretrained models to downstream tasks involving the primary modality.

Part 2: Co-learning

Definition: Transferring information from secondary to primary modality by sharing representation spaces between both modalities.

Co-learning via Alignment

Definition: Transferring information from secondary to primary modality by sharing representation spaces between modalities

Representation alignment: word embedding space for zero-shot visual classification

Co-learning via Translation

Definition: Transferring information from secondary to primary modality by using the secondary modality as a generation target.

Part 3: Model Induction

Definition: Keeping individual unimodal models separate but inducing common behavior across the separate models.

Discussion 4: Multimodal interactions

Module 3: Large models and modern AI

Pre-training, scaling, fine-tuning LLMs

Today’s Agenda

  • History of LLMs, RNNs vs Transformer
  • Pretraining of LLMs
  • Types of Architecture
  • Instruction Finetuning & Preference Tuning
  • Efficient Training
  • Practical Tips

Scaling Laws:

  • Bigger models reach better performance given sufficient compute
  • Over-training models (training on more tokens than is compute-optimal, for cheaper inference) is getting popular nowadays
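For reference, a widely used parametric form of these scaling laws (the Chinchilla formulation of Hoffmann et al., not something given in the lecture) writes the loss in terms of parameter count $N$ and training tokens $D$:

$$
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
$$

Minimizing $L$ under a fixed compute budget $C \approx 6ND$ yields the compute-optimal model size; over-trained models deliberately pick $D$ far beyond that optimum so a smaller, cheaper-to-serve model reaches the same loss.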

Limitations of RL + Reward Modeling

  • Human preferences are unreliable
    • Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth
    • This can result in making up facts + hallucinations
  • Reward Model doesn’t always reflect humans’ preferences & may have unintended behaviors

Efficient Training: LoRA (Low-Rank Adaptation)

  • Training the whole model takes a lot of compute and GPU memory
  • Solution: Freeze the model, train a small adapter that updates with a low-rank decomposition
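A from-scratch sketch of the idea for a single linear layer, to make the low-rank decomposition concrete; in practice one would typically use the peft library rather than this hand-rolled version:

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable} of {sum(p.numel() for p in layer.parameters())}")
```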

Efficient Training: Mixture of Experts

  • Train multiple parallel networks (experts) simultaneously
  • During each forward pass, only activate k experts
  • Saves compute & GPU memory
  • Deepseek R1: 671B, only 37B activated, performance on par with OpenAI o1-mini
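A minimal sketch of top-k expert routing (illustrative only; production MoE layers add load-balancing losses, capacity limits, and fused kernels):

```python
import torch
from torch import nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                          # x: [tokens, dim]
        scores = self.router(x)                     # [tokens, num_experts]
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                  # only k experts run per token
            idx = topk_idx[:, slot]
            for e in idx.unique().tolist():
                mask = idx == e
                out[mask] += weights[mask, slot:slot + 1] * self.experts[e](x[mask])
        return out

print(TopKMoE()(torch.randn(4, 512)).shape)   # torch.Size([4, 512])
```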

Efficient Inference: Quantization

  • Range Clipping
  • Scale & Shift
  • Convert to lower bits
  • Calibration
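A sketch of those four steps for simple symmetric int8 post-training quantization of one weight tensor, where "calibration" just means picking the clipping range from observed values (the percentile used is an arbitrary illustrative choice):

```python
import torch

def quantize_int8(w: torch.Tensor, clip_percentile: float = 0.999):
    # Range clipping + calibration: pick a max magnitude from the observed values.
    max_val = torch.quantile(w.abs().flatten().float(), clip_percentile).item()
    w_clipped = w.clamp(-max_val, max_val)
    # Scale & shift: symmetric quantization uses zero shift and a single scale.
    scale = max_val / 127.0
    # Convert to lower bits.
    w_int8 = torch.round(w_clipped / scale).to(torch.int8)
    return w_int8, scale

w = torch.randn(1024, 1024)
w_int8, scale = quantize_int8(w)
dequant_error = (w_int8.float() * scale - w).abs().mean()
print(w_int8.dtype, f"mean abs error = {dequant_error:.4f}")
```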

Practical Tips: How to Instruction Finetune an LLM

  1. Data Preparation: Convert your data to conversation format
  2. Choosing a good starting point
  3. Secure compute & Finetune the model
  4. Evaluation & Deployment
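A sketch of step 1, converting raw question-answer pairs into the conversation ("messages") format that most chat fine-tuning tooling expects; the field names and the tokenizer checkpoint are illustrative assumptions:

```python
from transformers import AutoTokenizer

raw_examples = [
    {"question": "What is multimodal AI?", "answer": "AI that processes multiple data modalities."},
]

# Conversation format: a list of role/content turns per example.
conversations = [
    [
        {"role": "user", "content": ex["question"]},
        {"role": "assistant", "content": ex["answer"]},
    ]
    for ex in raw_examples
]

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # illustrative base model
text = tokenizer.apply_chat_template(conversations[0], tokenize=False)
print(text)  # the rendered string you would tokenize for supervised fine-tuning
```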

Large multimodal models

Today’s lecture

  1. Multimodal foundation models and pre-training
  2. Adapting LLMs into multimodal LLMs
  3. From text to multimodal generation
  4. Latest directions: natively multimodal, multimodal MoE, real-world modalities

Part 1: Multimodal foundation model representations of text, video, audio

Part 2: Adapting large language models for multimodal text generation

Part 3: Enabling text and image generation

Vision-and-Language Transformer (ViLT)

Pre-training datasets

  • The largest dataset is DataComp, with 12.8 billion image-text pairs.
  • Recent efforts have shifted towards filtering for high-quality multimodal data. Examples include DFN (2B), COYO (600M), and OBELICS (141M).

Native Multimodal Models

  • Background
    • Non-native VLMs: an image encoder paired with a pretrained LLM; the image encoder can be either frozen or trained. Most VLMs now use this structure.
    • Native Multimodal Models: LLMs trained from scratch with multimodal input
      • Late fusion: Image patches -> Image Encoder -> Linear -> LLM.
      • Early fusion: Image patches -> Linear -> LLM (No image encoder!)

Scaling Laws for Native Multimodal Models

  • Early fusion models hold a small advantage at small scales.
  • At larger scales, both architectures perform similarly. (We don’t actually need image encoders!)
  • NMMs scale similarly to unimodal LLMs, with slightly varying scaling exponents depending on the target data type and training mixture.
  • Sparse structures like MoE significantly benefit NMMs at the same inference cost.
  • In an MoE structure, modality-aware design (separate image/text experts) performs worse than modality-agnostic design (unified experts for both image and text tokens).

Discussion 5: Large language models

Modern generative AI

Today’s Lecture

  1. What are Generative Models?
  2. Current State of the Art (Flow Matching)
  3. Conditional Generation
  4. Architectures
  5. Tips to Train these Models
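As a reference for the flow-matching objective, a compact sketch of a rectified-flow style training step on toy 2-D data: sample a time, linearly interpolate between noise and data, and regress the model's predicted velocity onto the straight-line velocity. This is one common formulation, not necessarily the exact variant covered in lecture:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(2 + 1, 128), nn.SiLU(), nn.Linear(128, 2))  # toy velocity field v(x_t, t)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(1000):
    x1 = torch.randn(256, 2) * 0.1 + 2.0          # stand-in for real data samples
    x0 = torch.randn(256, 2)                      # noise samples
    t = torch.rand(256, 1)                        # random times in [0, 1]
    xt = (1 - t) * x0 + t * x1                    # linear interpolation path
    target_v = x1 - x0                            # velocity of that path
    pred_v = model(torch.cat([xt, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
# Sampling then integrates dx/dt = v(x, t) from t=0 (noise) to t=1 (data), e.g. with Euler steps.
```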

Discussion 6: Large multimodal models

Module 4: Interactive AI

Discussion 7: Generative AI

Multi-step reasoning

Today’s lecture

  1. Multimodal reasoning: Solving hard problems by breaking them down into step-by-step reasoning in multiple modalities
  2. AI agents
  3. Human-AI interaction
  4. Ethics and safety

  • Models: Multimodal fusion and generation
  • Data: Hard challenges + human reasoning steps
  • Training: Reinforcement learning for emergent reasoning
  • Human: Trustworthy, safe, controllable

Multimodal Reasoning

Part 1: Multimodal foundation model representations of text, video, audio

Part 2: Adapting large language models for multimodal text generation

Part 3: Enabling text and image generation

Part 4: Human-AI interaction

Interactive and embodied AI

Today’s lecture

  1. Basics of reinforcement learning
  2. Modern RL for LLM alignment and reasoning
  3. Interactive LLM agents

Tips and Training for Reinforcement Learning

  1. Sanity Check with Fixed Policy
  2. Monitor KL Divergence (in PPO-like algorithms)
  3. Plot Entropy Over Time
  4. Use Greedy Rollouts for Evaluation
  5. Debug Value Function Separately: Visualize predicted vs. actual return
  6. Gradient Norm Clipping is Crucial
  7. Check Advantage Distribution
  8. Train on a Frozen Replay Buffer
  9. Use Curriculum Learning: Gradually increase task difficulty or reward sparsity
  10. Watch for Mode Collapse in MoE or Multi-Head Policies
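A sketch of tips 2 and 3, computing an approximate KL to a reference policy and the policy entropy from token log-probabilities, as one might log in a PPO-style fine-tuning loop (tensor shapes and names are illustrative):

```python
import torch

def kl_and_entropy(logprobs: torch.Tensor, ref_logprobs: torch.Tensor, full_logits: torch.Tensor):
    """logprobs/ref_logprobs: [batch, seq] log-probs of the sampled tokens under the
    current and reference policies; full_logits: [batch, seq, vocab]."""
    # Approximate per-token KL(current || reference) from the sampled tokens.
    approx_kl = (logprobs - ref_logprobs).mean()
    # Exact entropy of the current policy's token distributions.
    dist = torch.distributions.Categorical(logits=full_logits)
    entropy = dist.entropy().mean()
    return approx_kl, entropy

# Log both every update: a steadily rising KL or a collapsing entropy is an early
# warning that the policy is drifting too far or becoming deterministic.
```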

Human-AI interaction and safety

Interactive Agents

Multisensory agents for the web and digital automation

Example task: Purchase a set of earphones with a rating of at least 4.5 stars and ship them to me.

Embodied Agents

Generate precise robotics control directly via trained vision language models.

Human-AI interaction

  1. What medium(s) is most intuitive for human-AI interaction?
    • especially beyond language prompting
  2. What new technical challenges in AI have to be solved for human-AI interaction?
    • quantification
  3. What new opportunities arise when integrating AI with the human experience?
    • productivity, creativity, wellbeing

Quantification

Definition: Empirical and theoretical studies to better understand model shortcomings and predict and control model behavior.

Quantification - Safety

Easy to generate biased and dangerous content with language models!

But there exist ways to ‘jailbreak’ the safety measures in aligned LLMs

Readings