MIT How to AI (Almost) Everything
Preface
This course is mainly about multimodal large models, not about ways of applying AI.
Course Description
Artificial Intelligence (AI) holds great promise as a technology to enhance digital productivity, physical interactions, overall well-being, and the human experience. To enable the true impact of AI, these systems will need to be grounded in real-world data modalities, from language-only systems to vision, audio, sensors, medical data, music, art, smell, and taste. This course will introduce the basic principles of AI (focusing on modern deep learning and foundation models) and how we can apply AI to novel real-world data modalities. In addition, we will introduce the principles of multimodal AI that can process many modalities at once, such as connecting language and multimedia, music and art, sensing and actuation, and more.
Through lectures, readings, discussions, and a significant research component, this course will develop critical thinking skills and intuitions when applying AI to new data modalities, knowledge of recent technical achievements in AI, and a deeper understanding of the AI research process.
Course Info
Instructor
Departments
Content
Introduction
Multisensory intelligence: Creating human-AI symbiosis across scales and sensory mediums to enhance productivity, creativity, and wellbeing.

Course Overview
- AI for new modalities: data, modeling, evaluation, deployment
- Multimodal AI: connecting multiple different data sources
Learning Objectives
- Study recent technical achievements in AI research
- Improve critical and creative thinking skills
- Understand future research challenges in AI
- Explore and implement new research ideas in AI
Preferred Pre-requisites
- Some knowledge of programming (ideally in Python)
- Some basic understanding of modern AI capabilities & limitations
- Bring external (non-AI) domain knowledge about your problem
- Bonus: worked on AI for some modality
Research Projects on New Modalities
Motivation: Many tasks of real-world impact go beyond image and text.
Challenges:
- AI for modalities where non-deep-learning methods are still the most effective (e.g., tabular, time-series)
- Multimodal deep learning + time-series analysis + tabular models
- AI for physiological sensing, IoT sensing in cities, climate and environment sensing
- Smell, taste, art, music, tangible and embodied systems
Potential models and datasets to start with
- Brain EEG Signal: https://arxiv.org/abs/2306.16934
- Speech: https://arxiv.org/pdf/2310.02050.pdf
- Facial Motion: https://arxiv.org/abs/2308.10897
- Tactile: https://arxiv.org/pdf/2204.00117.pdf
Research Projects on AI Reasoning
Motivation: Robust, reliable, interpretable reasoning in (multimodal) LLMs.
Challenges:
- Fine-grained and compositional reasoning
- Neuro-symbolic reasoning
- Emergent reasoning in foundation models
Potential models and datasets to start with
- Can LLMs actually reason and plan?
- Code for VQA: CodeVQA: https://arxiv.org/pdf/2306.05392.pdf, VisProg: https://prior.allenai.org/projects/visprog, Viper: https://viper.cs.columbia.edu/
- Cola: https://openreview.net/pdf?id=kdHpWogtX6Y
- NLVR2: https://arxiv.org/abs/1811.00491
- Reference games: https://mcgill-nlp.github.io/imagecode/, https://github.com/AlabNII/onecommon, https://dmg-photobook.github.io/
Research Projects on Interactive Agents
Motivation: Grounding AI models in the web, computer, or other virtual worlds to help humans with digital tasks.
Challenges:
- Web visual understanding is quite different from natural image understanding
- Instructions and language grounded in web images, tools, APIs
- Asking for human clarification, human-in-the-loop
- Search over environment and planning
Potential models and datasets to start with
- WebArena: https://arxiv.org/pdf/2307.13854.pdf
- AgentBench: https://arxiv.org/pdf/2308.03688.pdf
- ToolFormer: https://arxiv.org/abs/2302.04761
- SeeAct: https://osu-nlp-group.github.io/SeeAct/
Research Projects on Embodied and Tangible AI
Motivation: Building tangible and embodied AI systems that help humans in physical tasks.
Challenges:
- Perception, reasoning, and interaction
- Connecting sensing and actuation
- Efficient models that can run on hardware
- Understanding influence of actions on the world (world model)
Potential models and datasets to start with
- Virtual Home: http://virtual-home.org/paper/virtualhome.pdf
- Habitat 3.0 https://ai.meta.com/static-resource/habitat3
- RoboThor: https://ai2thor.allenai.org/robothor
- LangSuite-E: https://github.com/bigai-nlco/langsuite
- Language models and world models: https://arxiv.org/pdf/2305.10626.pdf
Research Projects on Socially Intelligent AI
Motivation: Building AI that can understand and interact with humans in social situations.
Challenges:
- Social interaction, reasoning, and commonsense.
- Building social relationships over months and years.
- Theory-of-Mind and multi-party social interactions.
Potential models and datasets to start with
- Multimodal WereWolf: https://persuasion-deductiongame.socialai-data.org/
- Ego4D: https://arxiv.org/abs/2110.07058
- MMToM-QA: https://openreview.net/pdf?id=jbLM1yvxaL
- 11-866 Artificial Social Intelligence: https://cmu-multicomp-lab.github.io/asi-course/spring2023/
Research Projects on Human-AI Interaction
Motivation: What is the right medium for human-AI interaction? How can we really trust AI? How do we enable collaboration and synergy?
Challenges:
- Modeling and conveying model uncertainty: text input uncertainty? visual uncertainty? multimodal uncertainty? cross-modal interaction uncertainty?
- Asking for human clarification, human-in-the-loop, types of human feedback and ways to learn from human feedback through all modalities.
- New mediums to interact with AI. New tasks beyond imitating humans, leading to collaboration.
Potential models and datasets to start with
- MMHal-Bench: https://arxiv.org/pdf/2309.14525.pdf aligning multimodal LLMs
- HACL: https://arxiv.org/pdf/2312.06968.pdf hallucination + LLM
Research Projects on Ethics and Safety
Motivation: Large AI models can emit unsafe text content and generate or retrieve biased images.
Challenges:
- Taxonomizing types of biases: text, vision, audio, generation, etc.
- Tracing biases to pretraining data, seeing how bias can be amplified during training, fine-tuning.
- New ways of mitigating biases and aligning to human preferences.
Potential models and datasets to start with
- Many works on fairness in LLMs -> how to extend to multimodal?
- Mitigating bias in text generation, image-captioning, image generation
Introduction to AI and AI research
- Introduction to AI and AI research
- Generating ideas, reading and writing papers, AI experimentation

How Do We Get Research Ideas?
Turn a concrete understanding of existing research’s failings into a higher-level experimental question.
- Bottom-up discovery of research ideas
- A great tool for incremental progress, but may preclude larger leaps
Move from a higher-level question to a lower-level, concrete test of that question.
- Top-down design of research ideas
- Favors bigger ideas, but can be disconnected from reality
Beware the question “Does X make Y better?” with the expected answer “Yes”
- This question/hypothesis is natural, but indirect
- If the answer is “no” after your experiments, how do you tell what’s going wrong?
- Usually you have an intuition about why X will make Y better (it is not just random)
- Can you think of other research questions/hypotheses that confirm or falsify these assumptions?
How to do Literature Review and Read a Paper?
- Google scholar
- Papers with code, Github, Huggingface
- Recent conference proceedings
- Blog posts
- Survey papers, tutorials, courses
Testing Research Ideas
- Gather and process dataset, visualize data, gather labels, do data splits.
- Implement the simplest pipeline and get it working (see the sketch after this list). -> Pipeline = data loading + basic model + eval function + loss/visualization/deployment
- Change one component of the model at a time, repeat x10 (top-down or bottom-up).
- Find what works best, and exploit.
- Scale up experiments, repeat across multiple datasets.
- Careful ablation studies.
- Qualitative comparisons and visualizations.
- Repeat until successful.
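A minimal sketch of such a skeleton in PyTorch (the data here is synthetic and the model is a toy classifier; all names are illustrative, not a prescribed setup):

```python
# Minimal end-to-end pipeline sketch: data loading + basic model + loss + eval.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Synthetic data standing in for a real dataset (replace with your own loading code).
X = torch.randn(1000, 32)
y = (X.sum(dim=1) > 0).long()
train_ds = TensorDataset(X[:800], y[:800])
val_ds = TensorDataset(X[800:], y[800:])

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))  # basic model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def evaluate(ds):
    """Eval function: plain accuracy over a dataset."""
    model.eval()
    correct = 0
    with torch.no_grad():
        for xb, yb in DataLoader(ds, batch_size=64):
            correct += (model(xb).argmax(dim=1) == yb).sum().item()
    return correct / len(ds)

for epoch in range(5):
    model.train()
    for xb, yb in DataLoader(train_ds, batch_size=64, shuffle=True):
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    print(f"epoch {epoch}: val acc = {evaluate(val_ds):.3f}")
```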
How to Write a Paper
- Prepare a 15min talk (with figures, examples, tables, etc.)
- Convert the talk into a paper.
More resources
- https://github.com/pliang279/awesome-phd-advice
- https://github.com/jbhuang0604/awesome-tips
- https://www.cs197.seas.harvard.edu/
- https://medium.com/spotprobe/the-hexagon-of-ideas-02e5b770d75e
Module 1: Foundations of AI
Data, structure, and information
Lecture Outline:
- Vision, language, audio, sensing, set, graph modalities
- Modality profile
- Types of data and labels
- Common learning objectives and generalization
Most of AI is about learning abstractions, or representations, from data.
Modality Profile:
- Element representations: Discrete, continuous, granularity
- Element distributions: Density, frequency
- Structure: Temporal, spatial, latent, explicit
- Information: Abstraction, entropy
- Noise: Uncertainty, noise, missing data
- Relevance: Task, context dependence

Summary: How To Data
- Decide how much data to collect, and how much to label (costs and time)
- Clean data: normalize/standardize, find noisy data, anomaly/outlier detection
- Visualize data: plot, dimensionality reduction (PCA, t-SNE), cluster analysis (see the sketch after this list)
- Decide on evaluation metric (proxy + real, quantitative and qualitative)
- Choose model class and learning algorithm (more next lecture)
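A sketch of the visualization step, assuming your data is already an (n_samples, n_features) array; the random arrays below are stand-ins for real features and labels:

```python
# Sketch of "visualize data": PCA and t-SNE projections colored by label/cluster id.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.randn(500, 64)            # stand-in for your real feature matrix
labels = np.random.randint(0, 3, 500)   # stand-in for labels or cluster assignments

pca_2d = PCA(n_components=2).fit_transform(X)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(pca_2d[:, 0], pca_2d[:, 1], c=labels, s=5)
axes[0].set_title("PCA")
axes[1].scatter(tsne_2d[:, 0], tsne_2d[:, 1], c=labels, s=5)
axes[1].set_title("t-SNE")
plt.show()
```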
Huggingface Tutorial
- “Huggingface” is a set of multiple packages
- transformers: Provides API to initialize large pretrained models
- datasets: Provides easy way to download datasets
- Not from Huggingface but often used together
- bitsandbytes: Provides functions to quantize large models
- flash-attn: Allows the model to run faster with less memory
- Some terms to keep in mind
- LoRA: Adapter to train large models with less memory
- Bfloat16: Robust half precision representation often used to save memory
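A sketch of how these pieces typically fit together: a pretrained model loaded in bfloat16, a dataset from the Hub, and a LoRA adapter from peft. The model and dataset choices here (gpt2, imdb) are small examples only, and exact APIs may differ across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    torch_dtype=torch.bfloat16,   # bfloat16: robust half precision that saves memory
)

dataset = load_dataset("imdb", split="train[:1%]")  # datasets: one-line download

# LoRA: freeze the base model and train small low-rank adapters instead.
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"])
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```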
The Recipe
- Become one with the Data
- Set up end-to-end skeleton and get dumb baselines
- Overfit to diagnose errors
- Regularize for better generalization
  - Add more real data – the best way to reduce overfitting
  - Use data augmentation and pretraining
  - Reduce input dimensions and model size
  - Techniques: dropout, weight decay, early stopping
- Tune hyperparameters
  - Prefer random search over grid search
  - Use Bayesian optimization tools when available
  - Don’t overcomplicate – start with simple models
- Squeeze out final improvements
How to Design ML Models for New Data
- Look at the data first
- For simple, low dimensional data, start with simple models (SVM, Random Forest, Shallow MLP/CNN)
- For vision/language data, try pretrained model
- Start simple, then add complexity. Simple ones can be used as baselines.
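As a concrete illustration of the “start simple” advice, a sketch of scikit-learn baselines on a toy dataset (swap in your own features and labels):

```python
# Simple baselines for low-dimensional data before reaching for deep models.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # toy dataset; replace with your own

baselines = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "random_forest": RandomForestClassifier(n_estimators=200),
}
for name, clf in baselines.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```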
How to Debug Your Model
- Look at the data first. Is the input data & label correct?
- Ensure no data leakage;
- Look at the outputs. Is model only predicting one label?
- Label imbalance: Data Augmentation; loss scaling
- Look at the training loss
- Loss is NaN: Inspect weights and inputs for NaN values. Make sure weights are properly initialized. For LLMs: use bfloat16 instead of float16.
- Loss not changing: Model underfitting. Increase learning rate; decrease weight decay; Add more complexity; Use better optimizer*.
- Loss highly varied/increasing: Decrease learning rate; Gradient Clipping; Use better Optimizers
- Look at train vs val accuracy (or any other metrics)
- Train » Val: Model overfitting. More weight decay, reduce model complexity, data augmentation, get more data
- Train ≈ Val ≈ 100%: Check for data leakage
Personal tip: I recommend trying second-order optimizers from packages like Heavyball
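A sketch of a few of these checks in PyTorch (overfit a single batch, detect NaN losses, clip gradients); the model, loss, and batch below are toy placeholders:

```python
import torch

def debug_single_batch(model, loss_fn, batch, steps=500, lr=1e-2):
    """A healthy pipeline should drive the loss on a single batch close to 0."""
    xb, yb = batch
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)
        if torch.isnan(loss):
            # NaN loss: inspect inputs/weights, lower the LR, or switch to bfloat16.
            raise RuntimeError(f"NaN loss at step {step}")
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip grads
        opt.step()
    return loss.item()

# Example usage with a toy model and batch:
model = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
batch = (torch.randn(32, 16), torch.randint(0, 2, (32,)))
print("final single-batch loss:", debug_single_batch(model, torch.nn.CrossEntropyLoss(), batch))
```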
Common model architectures
Lecture Outline
- A unifying paradigm of model architectures
- Temporal sequence models
- Spatial convolution models
- Models for sets and graphs
Summary: How To Model
- Decide how much data to collect, and how much to label (costs and time)
- Clean data: normalize/standardize, find noisy data, anomaly/outlier detection
- Visualize data: plot, dimensionality reduction (PCA, t-SNE), cluster analysis
- Decide on evaluation metric (proxy + real, quantitative and qualitative)
- Choose modeling paradigm - domain-specific vs general-purpose
- Figure out base elements and their representation
- Figure out data invariances & equivariances (+other parts of modality profile)
- Iterate between data collection, model design, model training, hyperparameter tuning etc. until satisfied.
Discussion 1: Learning and generalization
- Learning the Bitter Lesson
- Unifying Grokking and Double Descent
- Generalization in Neural Networks
- Textbooks are all you Need
- A Conceptual Pipeline for Machine Learning
Module 2: Foundations of multimodal AI
Multimodal connections and alignment


Modality refers to the way in which something is expressed or perceived.
A research-oriented definition: multimodal is the science of heterogeneous and interconnected (connected + interacting) data.
Heterogeneous Modalities: Information in different modalities shows diverse qualities, structures, & representations.

Challenge 1: Representation
Definition: Learning representations that reflect cross-modal interactions between individual elements across different modalities. This is a core building block for most multimodal modeling problems!
Challenge 2: Alignment
Definition: Identifying and modeling cross-modal connections between all elements of multiple modalities, building from the data structure.
Sub-challenges:
- Discrete connections: Explicit alignment (e.g., grounding)
- Contextualized representation: Implicit alignment + representation
- Continuous alignment: Granularity of individual elements
Challenge 2a: Discrete Alignment
Definition: Identify and model connections between elements of multiple modalities

Challenge 2b: Continuous Alignment
Definition: Model alignment between modalities with continuous signals and no explicit elements
Challenge 3: Reasoning
Definition: Combining knowledge, usually through multiple inferential steps, exploiting multimodal alignment and problem structure.
Challenge 4: Generation
Definition: Learning a generative process to produce raw modalities that reflects cross-modal interactions, structure, and coherence.
Challenge 5: Transference
Definition: Transfer knowledge between modalities, usually to help the target modality which may be noisy or with limited resources.
Challenge 6: Quantification
Definition: Empirical and theoretical study to better understand heterogeneity, cross-modal interactions, and the multimodal learning process.
Discussion 2: Modern AI architectures
- Scaling Laws for Generative Mixed-Modal Models
- Not All Tokens Are What You Need for Pretraining
- PaLI: A Jointly-Scaled Multilingual Language-Image Model
- The Evolution of Multimodal Model Architectures
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- A ConvNet for the 2020s
- Inductive Representation Learning on Large Graphs
- Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs
Multimodal interactions and fusion
Visual-and-Language Transformer (ViLT) (≈ BERT + ViT)
ALBEF: Align Before Fuse (≈ BERT + ViT + CLIP-ish)

Discussion 3: Multimodal alignment
- The Platonic Representation Hypothesis
- What Makes for Good Views for Contrastive Learning?
- Understanding the Emergence of Multimodal Representation Alignment
- Does equivariance matter at scale?
- Learning Transferable Visual Models From Natural Language Supervision
- Emerging Properties in Self-Supervised Vision Transformers
- Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Cross-modal transfer
Transference
Definition: Transfer knowledge between modalities, usually to help the primary modality which may be noisy or with limited resources
Part 1: Transfer via Pretrained Models
Definition: Transferring knowledge from large-scale pretrained models to downstream tasks involving the primary modality.
Part 2: Co-learning
Definition: Transferring information from secondary to primary modality by sharing representation spaces between both modalities.
Co-learning via Alignment
Definition: Transferring information from secondary to primary modality by sharing representation spaces between modalities
Representation alignment: word embedding space for zero-shot visual classification
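A sketch of this idea: zero-shot classification by comparing an image embedding against class-name embeddings in a shared space (CLIP-style). The encoders below are random stand-ins for pretrained, aligned encoders, not any specific library's API:

```python
import torch
import torch.nn.functional as F

class_names = ["cat", "dog", "car"]
image_encoder = torch.nn.Linear(2048, 512)             # placeholder image encoder
text_embeddings = torch.randn(len(class_names), 512)   # placeholder text-encoder outputs

def zero_shot_classify(image_features):
    """Pick the class whose text embedding is closest to the image embedding."""
    img_emb = F.normalize(image_encoder(image_features), dim=-1)  # (1, d)
    txt_emb = F.normalize(text_embeddings, dim=-1)                # (num_classes, d)
    sims = img_emb @ txt_emb.T                                    # cosine similarities
    return class_names[sims.argmax(dim=-1).item()]

print(zero_shot_classify(torch.randn(1, 2048)))
```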
Co-learning via Translation
Definition: Transferring information from secondary to primary modality by using the secondary modality as a generation target.
Part 3: Model Induction
Definition: Keeping individual unimodal models separate but inducing common behavior across the separate models.

Discussion 4: Multimodal interactions
- Multimodal interaction: A review
- Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
- Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
- Kosmos-2: Grounding Multimodal Large Language Models to the World
- Chameleon: Mixed-modal early-fusion foundation models
- MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training
- MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Module 3: Large models and modern AI
Pre-training, scaling, fine-tuning LLMs
Today’s Agenda
- History of LLMs, RNNs vs Transformer
- Pretraining of LLMs
- Types of Architecture
- Instruction Finetuning & Preference Tuning
- Efficient Training
- Practical Tips

Scaling Laws:
- Bigger models reach better performance given sufficient compute
- Over-training (training beyond the compute-optimal point) is becoming popular nowadays
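For reference, a common parametric form of such scaling laws is the Chinchilla-style fit from the broader literature (not necessarily the exact form used in lecture); the constants are fit per setup:

```latex
% N = model parameters, D = training tokens, E = irreducible loss;
% A, B, \alpha, \beta are fit to experimental runs.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```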
Limitations of RL + Reward Modeling
- Human preferences are unreliable
- Chatbots are rewarded for producing responses that seem authoritative and helpful, regardless of truth
- This can result in making up facts + hallucinations
- Reward Model doesn’t always reflect humans’ preferences & may have unintended behaviors
Efficient Training: LoRA (Low-Rank Adaptation)
- Training the whole model takes a lot of compute and GPU memory
- Solution: Freeze the model, train a small adapter that updates with a low-rank decomposition
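A minimal LoRA sketch, illustrating the idea rather than the peft library's actual implementation: the pretrained weight stays frozen and only the low-rank factors A and B are trained:

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path + low-rank update (B @ A), scaled by alpha / r.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
out = layer(torch.randn(4, 512))
```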
Efficient Training: Mixture of Experts
- Train multiple parallel networks (experts) simultaneously
- During each forward pass, only activate k experts
- Saves compute & GPU memory
- Deepseek R1: 671B, only 37B activated, performance on par with OpenAI o1-mini
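A sketch of top-k expert routing to illustrate why only a fraction of parameters is activated per token; real implementations add load-balancing losses and fused kernels:

```python
import torch
from torch import nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.k = k

    def forward(self, x):                       # x: (tokens, dim)
        scores = self.router(x)                 # (tokens, num_experts)
        topk_vals, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)  # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):              # only k experts run per token
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    w = weights[mask, slot].unsqueeze(-1)
                    out[mask] += w * expert(x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(10, 256))
```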
Efficient Inference: Quantization
- Range Clipping
- Scale & Shift
- Convert to lower bits
- Calibration
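A sketch of these four steps for simple asymmetric int8 quantization of a single tensor; libraries such as bitsandbytes do this per block and with proper calibration data:

```python
import torch

def quantize_int8(x: torch.Tensor, clip_pct: float = 0.999):
    # 1) Range clipping: drop extreme outliers before picking the range.
    lo = torch.quantile(x, 1 - clip_pct)
    hi = torch.quantile(x, clip_pct)
    x = x.clamp(lo, hi)
    # 2) Scale & shift: map [lo, hi] onto the int8 range [-128, 127].
    scale = (hi - lo) / 255.0
    zero_point = (-128 - lo / scale).round()
    # 3) Convert to lower bits.
    q = (x / scale + zero_point).round().clamp(-128, 127).to(torch.int8)
    # 4) Calibration: in practice scale/zero_point are estimated on calibration data.
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.float() - zero_point) * scale

w = torch.randn(1024)
q, s, z = quantize_int8(w)
print("max abs error (dominated by clipped outliers):",
      (dequantize(q, s, z) - w).abs().max().item())
```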
Practical Tips: How to Instruction Finetune an LLM
- Data Preparation: Convert your data to conversation format (see the sketch after this list)
- Choosing a good starting point
- Secure compute & Finetune the model
- Evaluation & Deployment
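A sketch of the data-preparation step above: converting (instruction, response) pairs into the messages/conversation format that most instruction-tuning pipelines expect (field names are illustrative):

```python
import json

raw_examples = [
    {"instruction": "Summarize: The cat sat on the mat.", "response": "A cat sat on a mat."},
]

def to_conversation(example):
    # Each example becomes a list of role-tagged messages.
    return {
        "messages": [
            {"role": "user", "content": example["instruction"]},
            {"role": "assistant", "content": example["response"]},
        ]
    }

conversations = [to_conversation(ex) for ex in raw_examples]
print(json.dumps(conversations[0], indent=2))

# From here, a tokenizer's chat template (e.g., tokenizer.apply_chat_template(...)
# in transformers) renders each conversation into the model's expected prompt format.
```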
Large multimodal models
Today’s lecture
- Multimodal foundation models and pre-training
- Adapting LLMs into multimodal LLMs
- From text to multimodal generation
- Latest directions: natively multimodal, multimodal MoE, real-world modalities
Part 1: Multimodal foundation model representations of text, video, audio
Part 2: Adapting large language models for multimodal text generation
Part 3: Enabling text and image generation
Visual-and-Language Transformer (ViLT)
Pre-training datasets
- Largest dataset is DataComp. It has 12.8 billion image-text pairs.
- Recent efforts shifted more towards filtering for high quality multimodal data.
Examples include DFN (2B), COYO (600M), and Obelics (141M)
Native Multimodal Models
- Background
- Non-native VLMs: An image encoder paired with a pretrained LLM. The image encoder can either be frozen or trained. Most VLMs currently use this structure.
- Native Multimodal Models (NMMs): LLMs trained from scratch with multimodal input
- Late fusion: Image patches -> Image Encoder -> Linear -> LLM.
- Early fusion: Image patches -> Linear -> LLM (No image encoder!)
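A sketch contrasting the two fusion styles; the dimensions, the toy encoder, and the projections are illustrative placeholders, not a specific model's architecture:

```python
import torch
from torch import nn

d_model = 768
patches = torch.randn(1, 196, 588)  # 196 image patches, each a flattened 14x14x3 block

# Late fusion: a dedicated image encoder, then a projection into the LLM width.
encoder_layer = nn.TransformerEncoderLayer(d_model=588, nhead=4, batch_first=True)
image_encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
late_tokens = nn.Linear(588, d_model)(image_encoder(patches))

# Early fusion: patches projected directly into the LLM token space, no image encoder.
early_tokens = nn.Linear(588, d_model)(patches)

# Either token stream is concatenated with embedded text tokens and fed to the LLM.
text_tokens = torch.randn(1, 12, d_model)  # placeholder for embedded text tokens
llm_input = torch.cat([early_tokens, text_tokens], dim=1)
```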
Scaling Laws for Native Multimodal Models
- Early fusion models hold a small advantage at small scales.
- On larger scales, both architectures perform similarly. (We don’t actually need image encoders!)
- NMMs scale similarly to unimodal LLMs, with slightly varying scaling exponents depending on the target data type and training mixture
- Sparse structures like MoE significantly benefit NMMs at the same inference cost
- In an MoE structure, modality-aware design (separate image/text experts) performs worse than modality-agnostic design (unified experts for both image and text tokens)
Discussion 5: Large language models
- LoRA: Low-Rank Adaptation of Large Language Models
- Gated Linear Attention Transformers with Hardware-Efficient Training
- Unintended Impacts of LLM Alignment on Global Representation
- A Visual Guide to Quantization
- Scaling Instruction-Finetuned Language Models
Modern generative AI
Today’s Lecture
- What are Generative Models?
- Current State of the Art (Flow Matching) – see the sketch after this list
- Conditional Generation
- Architectures
- Tips to Train these Models
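A minimal conditional flow matching training-loss sketch using the linear interpolation path x_t = (1 - t) x0 + t x1 and target velocity x1 - x0; the tiny velocity network and 2-D toy data are illustrative:

```python
import torch
from torch import nn

# Placeholder velocity network taking (x_t, t) as input.
velocity_net = nn.Sequential(nn.Linear(2 + 1, 128), nn.SiLU(), nn.Linear(128, 2))

def flow_matching_loss(x1):
    """x1: batch of data samples; x0: noise; regress the straight-line velocity."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)
    xt = (1 - t) * x0 + t * x1          # point on the interpolation path
    target_v = x1 - x0                  # constant velocity along that path
    pred_v = velocity_net(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

data = torch.randn(64, 2) * 0.5 + 2.0   # toy 2-D "data" distribution
loss = flow_matching_loss(data)
loss.backward()
```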
Discussion 6: Large multimodal models
- LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
- Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
- ModaVerse: Efficiently Transforming Modalities with LLMs
- Spider: Any-to-Many Multimodal LLM
- SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
- NExT-GPT: Any-to-Any Multimodal LLM
- Learning to rebalance multi-modal optimization by adaptively masking subnetworks
Module 4: Interactive AI
Discussion 7: Generative AI
- Large Language Diffusion Models
- Compositional Generative Modeling: A Single Model is Not All You Need
- Flow Matching for Generative Modeling
- Flow Matching Guide and Code
- FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-to-Motion Generation
- MusFlow: Multimodal Music Generation via Conditional Flow Matching
- Unraveling the Connections Between Flow Matching and Diffusion Probabilistic Models
- Exploring Diffusion and Flow Matching Under Generator Matching
Multi-step reasoning
Today’s lecture
- Multimodal reasoning: solving hard problems by breaking them down into intermediate reasoning steps across multiple modalities
- AI agents
- Human-AI interaction
- Ethics and safety
Models: Multimodal fusion and generation
Data: Hard challenges + human reasoning steps
Training: Reinforcement learning for emergent reasoning
Human: Trustworthy, safe, controllable
Multimodal Reasoning
Part 1: Multimodal foundation model representations of text, video, audio
Part 2: Adapting large language models for multimodal text generation
Part 3: Enabling text and image generation
Part 4: Human-AI interaction
Interactive and embodied AI
Today’s lecture
- Basics of reinforcement learning
- Modern RL for LLM alignment and reasoning
- Interactive LLM agents

Tips and Training for Reinforcement Learning
- Sanity Check with Fixed Policy
- Monitor KL Divergence (in PPO-like algorithms)
- Plot Entropy Over Time
- Use Greedy Rollouts for Evaluation
- Debug Value Function Separately: Visualize predicted vs. actual return
- Gradient Norm Clipping is Crucial
- Check Advantage Distribution
- Train on a Frozen Replay Buffer
- Use Curriculum Learning: Gradually increase task difficulty or reward sparsity
- Watch for Mode Collapse in MoE or Multi-Head Policies
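A sketch of two of these monitoring tips (KL to a frozen reference policy and policy entropy), computed directly from logits; the tensors below are random placeholders:

```python
import torch
import torch.nn.functional as F

def kl_to_reference(policy_logits, ref_logits):
    """Mean KL(policy || reference); spikes suggest the policy is drifting too fast."""
    logp = F.log_softmax(policy_logits, dim=-1)
    logq = F.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(-1).mean()

def policy_entropy(policy_logits):
    """Mean entropy; a collapse toward 0 often signals premature mode collapse."""
    logp = F.log_softmax(policy_logits, dim=-1)
    return -(logp.exp() * logp).sum(-1).mean()

policy_logits = torch.randn(2, 16, 32000)   # (batch, seq_len, vocab) placeholders
ref_logits = torch.randn(2, 16, 32000)
print(kl_to_reference(policy_logits, ref_logits).item(),
      policy_entropy(policy_logits).item())
```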
Human-AI interaction and safety
Interactive Agents
Multisensory agents for the web and digital automation
Example task: Purchase a set of earphones with at least a 4.5-star rating and ship them to me.

Embodied Agents
Generate precise robotics control directly via trained vision-language models.
Human-AI interaction
- What medium(s) is most intuitive for human-AI interaction?
- especially beyond language prompting
- What new technical challenges in AI have to be solved for human-AI interaction?
- quantification
- What new opportunities arise when integrating AI with the human experience?
- productivity, creativity, wellbeing
Quantification
Definition: Empirical and theoretical studies to better understand model shortcomings and predict and control model behavior.
Quantification - Safety
It is easy to generate biased and dangerous content with language models!
Even aligned LLMs can be ‘jailbroken’ past their safety measures.

Readings
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Neural Machine Translation by Jointly Learning to Align and Translate
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Characterization and classification of semantic image-text relations
When and why vision-language models behave like bags-of-words, and what to do about it?
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Janossy Pooling: Learning Deep Permutation-Invariant Functions for Variable-Size Inputs
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
Understanding the Emergence of Multimodal Representation Alignment
Learning Transferable Visual Models From Natural Language Supervision
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
Kosmos-2: Grounding Multimodal Large Language Models to the World
MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training
MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks
Quantifying & Modeling Multimodal Interactions: An Information Decomposition Framework
Multimodal Transformer for Unaligned Multimodal Language Sequences
Scaling Laws for Native Multimodal Models
Gated Linear Attention Transformers with Hardware-Efficient Training
Unintended Impacts of LLM Alignment on Global Representation
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient Inference
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Learning to rebalance multi-modal optimization by adaptively masking subnetworks
Compositional Generative Modeling: A Single Model is Not All You Need
FlowMotion: Target-Predictive Conditional Flow Matching for Jitter-Reduced Text-to-Motion Generation
MusFlow: Multimodal Music Generation via Conditional Flow Matching
Unraveling the Connections Between Flow Matching and Diffusion Probabilistic Models
Exploring Diffusion and Flow Matching Under Generator Matching
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Direct Preference Optimization: Your Language Model Is Secretly a Reward Model
Interactive Sketchpad: A Multimodal Tutoring System for Collaborative, Visual Problem-Solving
VideoWebArena: Evaluating Multimodal Agents on Video Understanding Web Tasks
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs