Stop Looking at Text Alone! Hands-On Multimodal Emotion Recognition in Python: From the IEMOCAP Dataset to MLLM Fine-Tuning

张开发
2026/4/13 4:00:20 · 15 min read


Emotion recognition has long been an important research direction in human-computer interaction. Text-based sentiment analysis is mature, but in real conversational scenarios, text alone often fails to capture the full emotional state: a smiling face or a trembling voice can convey more than the words themselves. This is why multimodal emotion recognition (MERC) is becoming a research hotspot. By integrating text, audio, and visual signals, it brings machines closer to human-level emotional understanding.

This article walks you through building a multimodal emotion recognition system from scratch. Unlike a theoretical survey, we focus on working code: using Python and PyTorch, and building on the authoritative IEMOCAP dataset, we implement an end-to-end model that processes text, audio, and visual information simultaneously. In particular, we explore how to use parameter-efficient fine-tuning techniques such as LoRA to adapt a pretrained multimodal large language model (MLLM) to the emotion recognition task, addressing the practical problems of data scarcity and limited compute.

## 1. Environment Setup and Data Loading

### 1.1 Installing Dependencies

Before starting, make sure Python 3.8 is installed. We recommend creating a virtual environment with conda:

```bash
conda create -n merc python=3.8
conda activate merc
```

Install the core dependencies (mediapipe and peft are also needed for the visual-feature and LoRA sections later in this article):

```bash
pip install torch==2.0.1 torchaudio==2.0.2 torchvision==0.15.2
pip install transformers==4.30.2 librosa==0.10.0 opencv-python==4.7.0.72
pip install pandas==2.0.2 numpy==1.24.3 scikit-learn==1.2.2
pip install mediapipe peft
```

### 1.2 Downloading and Parsing the IEMOCAP Dataset

IEMOCAP is a benchmark dataset for emotion recognition. It contains roughly 12 hours of video dialogue annotated with six emotion categories (happy, sad, angry, neutral, excited, frustrated). The dataset is organized as follows:

```
IEMOCAP/
├── Session{1-5}/
│   ├── sentences/wav/              # audio files
│   ├── dialog/EmoEvaluation/       # emotion labels
│   └── sentences/MOCAP_rotated/    # visual features
```

We first define a dataset class:

```python
import os
from torch.utils.data import Dataset

class IEMOCAPDataset(Dataset):
    def __init__(self, root_dir, sessions=(1, 2, 3, 4, 5)):
        self.data = []
        for session in sessions:
            label_dir = f"{root_dir}/Session{session}/dialog/EmoEvaluation"
            for file in os.listdir(label_dir):
                if file.endswith(".txt"):
                    with open(f"{label_dir}/{file}") as f:
                        for line in f:
                            # label lines start with "[": [start - end]\tutterance_id\temotion\t[V, A, D]
                            if line.startswith("["):
                                parts = line.strip().split("\t")
                                utterance_id = parts[1]
                                emotion = parts[2]
                                self.data.append({
                                    "session": session,
                                    "utterance_id": utterance_id,
                                    "emotion": emotion,
                                    "audio_path": f"{root_dir}/Session{session}/sentences/wav/{utterance_id}.wav",
                                    "video_path": f"{root_dir}/Session{session}/sentences/MOCAP_rotated/{utterance_id}.csv",
                                })
        # filter out the small number of invalid samples labeled "xxx"
        self.data = [x for x in self.data if x["emotion"] != "xxx"]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]
```

## 2. Multimodal Feature Extraction

### 2.1 Text Features

For the text modality, we use a pretrained BERT model to extract semantic features. The audio transcripts (provided with the dataset) are converted into vectors:

```python
from transformers import BertTokenizer, BertModel
import torch

text_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_model = BertModel.from_pretrained("bert-base-uncased")

def get_text_features(transcript):
    inputs = text_tokenizer(transcript, return_tensors="pt",
                            padding=True, truncation=True, max_length=128)
    with torch.no_grad():
        outputs = text_model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)  # mean-pool as the sentence representation
```

### 2.2 Audio Features

Audio features are extracted with Librosa: MFCCs, their deltas, and prosodic features such as pitch and energy:

```python
import librosa
import numpy as np

def extract_audio_features(audio_path, sr=16000):
    # load the audio file
    y, sr = librosa.load(audio_path, sr=sr)

    # MFCCs plus first- and second-order deltas
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_delta = librosa.feature.delta(mfcc)
    mfcc_delta2 = librosa.feature.delta(mfcc, order=2)

    # fundamental frequency and energy
    f0 = librosa.yin(y, fmin=50, fmax=500)
    rms = librosa.feature.rms(y=y)

    # stack all features: 13 + 13 + 13 + 1 + 1 = 41 dimensions per frame
    features = np.vstack([mfcc, mfcc_delta, mfcc_delta2, f0, rms])
    return features.T  # transpose to (time, feature)
```

### 2.3 Visual Features

For the visual modality, we use OpenCV and MediaPipe to extract facial landmarks:

```python
import cv2
import mediapipe as mp
import numpy as np

mp_face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=True,
    max_num_faces=1,
    refine_landmarks=True,
    min_detection_confidence=0.5)

def extract_visual_features(video_path):
    # read all frames from the video
    cap = cv2.VideoCapture(video_path)
    frames = []
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
    cap.release()

    # extract facial landmarks from every 5th frame
    landmarks = []
    for frame in frames[::5]:
        results = mp_face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_face_landmarks:
            face_landmarks = results.multi_face_landmarks[0]
            landmarks.append(np.array([(lm.x, lm.y, lm.z)
                                       for lm in face_landmarks.landmark]))

    if len(landmarks) > 0:
        return np.stack(landmarks).mean(axis=0)  # average over frames -> (478, 3)
    return np.zeros((478, 3))  # MediaPipe's face mesh has 478 landmarks
```
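With the three extractors in place, it helps to see them working together on a single utterance. The sketch below is a minimal example under two assumptions: the sample dict comes from `IEMOCAPDataset` above, and a `transcript` field has been added to it in a hypothetical preprocessing step (the dataset class itself does not load transcripts). Note also that the stored `video_path` points at a MOCAP `.csv` file, while `extract_visual_features` expects a video, so in practice you would adapt the path to the session's video recordings.

```python
# Minimal sketch: extract all three modalities for one sample.
# Assumes `sample` comes from IEMOCAPDataset and has been augmented
# with a "transcript" field (a hypothetical preprocessing step).
dataset = IEMOCAPDataset("path/to/IEMOCAP", sessions=(1,))
sample = dataset[0]
sample["transcript"] = "I can't believe this is happening."  # placeholder transcript

text_feats = get_text_features(sample["transcript"])          # shape (1, 768)
audio_feats = extract_audio_features(sample["audio_path"])    # shape (time, 41)
visual_feats = extract_visual_features(sample["video_path"])  # shape (478, 3)

print(text_feats.shape, audio_feats.shape, visual_feats.shape)
```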
## 3. Building the Multimodal Fusion Model

### 3.1 Architecture

We adopt a Transformer-based cross-modal fusion architecture with three key components: modality-specific encoders that process each input modality, a cross-modal attention layer that learns interactions between modalities, and a classification head that outputs the final emotion prediction.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MultimodalTransformer(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        # text encoder (BERT, frozen)
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        for param in self.text_encoder.parameters():
            param.requires_grad = False

        # audio encoder
        self.audio_proj = nn.Sequential(
            nn.Linear(41, 256),  # MFCC(13) + delta(13) + delta2(13) + f0(1) + rms(1) = 41
            nn.GELU(),
            nn.LayerNorm(256),
            nn.Linear(256, 768)  # align with BERT's hidden size
        )

        # visual encoder
        self.visual_proj = nn.Sequential(
            nn.Linear(478 * 3, 512),
            nn.GELU(),
            nn.LayerNorm(512),
            nn.Linear(512, 768)
        )

        # cross-modal Transformer (batch_first so inputs are (batch, seq, dim))
        self.cross_modal_transformer = nn.TransformerEncoderLayer(
            d_model=768, nhead=8, dim_feedforward=3072, batch_first=True)

        # classification head
        self.classifier = nn.Sequential(
            nn.Linear(768 * 3, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes)
        )

    def forward(self, text, audio, visual):
        # text features: [CLS] token
        text_feats = self.text_encoder(**text).last_hidden_state[:, 0, :]
        # audio features: temporal average
        audio_feats = self.audio_proj(audio.mean(dim=1))
        # visual features
        visual_feats = self.visual_proj(visual.flatten(1))

        # cross-modal interaction over the three modality tokens
        combined = torch.stack([text_feats, audio_feats, visual_feats], dim=1)
        combined = self.cross_modal_transformer(combined)

        # classification
        logits = self.classifier(combined.flatten(1))
        return logits
```

### 3.2 Efficient Fine-Tuning with LoRA

For large pretrained models, LoRA (Low-Rank Adaptation) enables efficient fine-tuning and drastically reduces the number of trainable parameters:

```python
from peft import LoraConfig, get_peft_model

def setup_lora(model):
    # configure LoRA
    config = LoraConfig(
        r=8,  # rank
        lora_alpha=16,
        target_modules=["query", "key", "value"],
        lora_dropout=0.1,
        bias="none"
    )
    # apply LoRA to the text encoder
    model.text_encoder = get_peft_model(model.text_encoder, config)
    return model
```

## 4. Training and Evaluation

### 4.1 Data Preprocessing Pipeline

Build the full data processing pipeline:

```python
import torch
from torch.utils.data import DataLoader
from sklearn.preprocessing import LabelEncoder

# initialize the dataset
dataset = IEMOCAPDataset("path/to/IEMOCAP")

# encode the emotion labels
label_encoder = LabelEncoder()
labels = [x["emotion"] for x in dataset]
label_encoder.fit(labels)

# custom collate_fn to handle variable-length sequences
def collate_fn(batch):
    texts = [x["transcript"] for x in batch]  # assumes transcripts have been added
    audio = [extract_audio_features(x["audio_path"]) for x in batch]
    visual = [extract_visual_features(x["video_path"]) for x in batch]

    # tokenize the text
    text_inputs = text_tokenizer(
        texts, return_tensors="pt",
        padding=True, truncation=True, max_length=128)

    # pad/truncate the audio
    audio_lengths = [len(x) for x in audio]
    max_audio_len = max(audio_lengths)
    audio_padded = torch.zeros(len(batch), max_audio_len, 41)
    for i, (a, l) in enumerate(zip(audio, audio_lengths)):
        audio_padded[i, :l] = torch.FloatTensor(a[:l])

    # visual features
    visual = torch.stack([torch.FloatTensor(x) for x in visual])

    # labels
    labels = torch.LongTensor(label_encoder.transform(
        [x["emotion"] for x in batch]))

    return text_inputs, audio_padded, visual, labels

# create the DataLoader
train_loader = DataLoader(
    dataset, batch_size=16, shuffle=True,
    collate_fn=collate_fn, num_workers=4)
```
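The loader above iterates over all five sessions, but the training code in the next section also expects a `val_loader`. A common IEMOCAP protocol is a speaker-independent split, for example holding out one session for validation. A minimal sketch, reusing `IEMOCAPDataset` and `collate_fn` from above (holding out Session 5 is one choice; cross-validation over all five sessions is another):

```python
# Sketch: speaker-independent split, Sessions 1-4 for training, Session 5 for validation.
train_set = IEMOCAPDataset("path/to/IEMOCAP", sessions=(1, 2, 3, 4))
val_set = IEMOCAPDataset("path/to/IEMOCAP", sessions=(5,))

train_loader = DataLoader(train_set, batch_size=16, shuffle=True,
                          collate_fn=collate_fn, num_workers=4)
val_loader = DataLoader(val_set, batch_size=16, shuffle=False,
                        collate_fn=collate_fn, num_workers=4)
```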
### 4.2 Training Loop

Set up the training procedure, including the loss function and evaluation metric:

```python
import torch.optim as optim
from tqdm import tqdm

def train_model(model, train_loader, val_loader, epochs=10):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model = model.to(device)

    # optimizer and loss function
    optimizer = optim.AdamW(model.parameters(), lr=5e-5)
    criterion = nn.CrossEntropyLoss()

    best_val_acc = 0
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        correct = 0
        total = 0

        for text, audio, visual, labels in tqdm(train_loader):
            text = {k: v.to(device) for k, v in text.items()}
            audio, visual, labels = audio.to(device), visual.to(device), labels.to(device)

            optimizer.zero_grad()
            outputs = model(text, audio, visual)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()

            train_loss += loss.item()
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()

        train_acc = 100 * correct / total
        val_acc = evaluate(model, val_loader, device)
        print(f"Epoch {epoch + 1}: Train Loss: {train_loss / len(train_loader):.4f}, "
              f"Train Acc: {train_acc:.2f}%, Val Acc: {val_acc:.2f}%")

        if val_acc > best_val_acc:
            best_val_acc = val_acc
            torch.save(model.state_dict(), "best_model.pth")

    return model

def evaluate(model, loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for text, audio, visual, labels in loader:
            text = {k: v.to(device) for k, v in text.items()}
            audio, visual, labels = audio.to(device), visual.to(device), labels.to(device)
            outputs = model(text, audio, visual)
            _, predicted = outputs.max(1)
            total += labels.size(0)
            correct += predicted.eq(labels).sum().item()
    return 100 * correct / total
```

## 5. Model Optimization and Deployment

### 5.1 Robustness to Missing Modalities

In real applications, one or more modalities are often missing. We make the model more robust in two ways: modality dropout during training (randomly masking out modalities) and missing-modality detection that adjusts the fusion strategy automatically. Modify the model's forward pass:

```python
def forward(self, text, audio, visual,
            text_mask=None, audio_mask=None, visual_mask=None):
    # by default, no modality is missing
    if text_mask is None:
        text_mask = torch.ones(text["input_ids"].size(0), dtype=torch.bool)
    if audio_mask is None:
        audio_mask = torch.ones(audio.size(0), dtype=torch.bool)
    if visual_mask is None:
        visual_mask = torch.ones(visual.size(0), dtype=torch.bool)

    # text features (zero vectors for samples with missing text)
    text_feats = torch.zeros(text["input_ids"].size(0), 768).to(text["input_ids"].device)
    if text_mask.any():
        text_outputs = self.text_encoder(
            **{k: v[text_mask] for k, v in text.items()})
        text_feats[text_mask] = text_outputs.last_hidden_state[:, 0, :]

    # audio features
    audio_feats = torch.zeros(audio.size(0), 768).to(audio.device)
    if audio_mask.any():
        audio_feats[audio_mask] = self.audio_proj(audio[audio_mask].mean(dim=1))

    # visual features
    visual_feats = torch.zeros(visual.size(0), 768).to(visual.device)
    if visual_mask.any():
        visual_feats[visual_mask] = self.visual_proj(visual[visual_mask].flatten(1))

    # the rest stays the same...
```

### 5.2 Making the Model Smaller and Faster

For real deployment, the model can be optimized with the following techniques: knowledge distillation (training a small student model to imitate a large teacher), quantization (converting parameters from FP32 to INT8), and ONNX export for cross-platform deployment. Example quantization code:

```python
def quantize_model(model):
    quantized_model = torch.quantization.quantize_dynamic(
        model,
        {nn.Linear},  # quantize the linear layers
        dtype=torch.qint8
    )
    return quantized_model
```
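The ONNX export mentioned above is left without code in the text. Here is a minimal sketch under stated assumptions: since `MultimodalTransformer.forward` takes the text input as a dict, we wrap the model in a thin module with flat tensor inputs, which is friendlier to tracing. The `ExportWrapper` class and the `dummy_*` shapes are illustrative placeholders, and `model` is assumed to be a trained `MultimodalTransformer` instance:

```python
# Sketch: ONNX export via a thin wrapper with flat tensor inputs.
# The wrapper and the dummy shapes are assumptions for illustration.
import torch
import torch.nn as nn

class ExportWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids, attention_mask, audio, visual):
        text = {"input_ids": input_ids, "attention_mask": attention_mask}
        return self.model(text, audio, visual)

model.eval()
wrapper = ExportWrapper(model)

dummy_ids = torch.ones(1, 128, dtype=torch.long)   # tokenized text
dummy_mask = torch.ones(1, 128, dtype=torch.long)
dummy_audio = torch.randn(1, 300, 41)              # (batch, time, 41 audio dims)
dummy_visual = torch.randn(1, 478, 3)              # 478 face landmarks

torch.onnx.export(
    wrapper,
    (dummy_ids, dummy_mask, dummy_audio, dummy_visual),
    "merc_model.onnx",
    input_names=["input_ids", "attention_mask", "audio", "visual"],
    output_names=["logits"],
    opset_version=17,
)
```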
## 6. Practical Applications

### 6.1 A Real-Time Emotion Recognition System

Build a simple real-time emotion recognition demo:

```python
import cv2
import numpy as np
import sounddevice as sd
import torch
from threading import Thread

class RealTimeMERC:
    def __init__(self, model_path):
        self.model = load_model(model_path)  # load_model: your own checkpoint loader
        self.audio_buffer = []
        self.video_buffer = []

    def audio_callback(self, indata, frames, time, status):
        self.audio_buffer.append(indata.copy())

    def video_capture(self):
        cap = cv2.VideoCapture(0)
        while True:
            ret, frame = cap.read()
            if ret:
                self.video_buffer.append(frame)

    def predict_emotion(self, text):
        # take the most recent ~3 seconds of audio and video
        audio = np.concatenate(self.audio_buffer[-3000:])
        visual = self.video_buffer[-30:]  # assuming 30 fps

        # extract features (the extractors from Section 2 need variants that
        # accept raw arrays/frames instead of file paths)
        audio_feats = extract_audio_features(audio)
        visual_feats = extract_visual_features(visual)

        # process the text
        text_input = text_tokenizer(text, return_tensors="pt")

        # predict
        with torch.no_grad():
            logits = self.model(text_input, audio_feats, visual_feats)
        emotion = label_encoder.inverse_transform([logits.argmax().item()])
        return emotion[0]

    def run(self):
        # start audio recording
        audio_stream = sd.InputStream(
            callback=self.audio_callback, channels=1, samplerate=16000)
        audio_stream.start()

        # start video capture in a background thread
        video_thread = Thread(target=self.video_capture, daemon=True)
        video_thread.start()

        # main thread handles user input and displays results
        while True:
            text = input("Enter the current dialogue text: ")
            emotion = self.predict_emotion(text)
            print(f"Detected emotion: {emotion}")
```

### 6.2 Tips for Improving Model Performance

Based on practical experience, the key levers for improving results are:

- Data augmentation: add noise and shift pitch/speed for audio; random crops and color jitter for visual; synonym replacement and random insertion/deletion for text.
- Handling class imbalance: use a class-weighted loss function, and oversample minority classes or undersample majority classes.
- Hyperparameter tuning: use Bayesian optimization tools such as Optuna, focusing on the learning rate, batch size, and dropout rate.

```python
# example: class-weighted loss
class_counts = np.bincount(label_encoder.transform(labels))
class_weights = 1. / class_counts
class_weights = torch.FloatTensor(class_weights).to(device)
criterion = nn.CrossEntropyLoss(weight=class_weights)
```

## 7. Looking Ahead

Multimodal emotion recognition is evolving rapidly. Several directions are worth watching:

- Self-supervised learning: pretraining general-purpose representations on large amounts of unlabeled multimodal data.
- Multimodal large language models: adapting models such as GPT-4V and Flamingo to this task.
- Cross-cultural emotion modeling: handling differences in how emotions are expressed across cultures.
- Lightweight deployment: real-time inference on edge devices.

One interesting direction is combining multimodal emotion recognition with a dialogue system:

```python
class EmotionalChatbot:
    def __init__(self, merc_model, llm_model):
        self.merc = merc_model
        self.llm = llm_model
        self.emotion_history = []

    def respond(self, text, audio, visual):
        # recognize the user's emotional state
        emotion = self.merc.predict(text, audio, visual)
        self.emotion_history.append(emotion)

        # generate a reply conditioned on the emotion
        prompt = f"User's emotional state: {emotion}\n"
        prompt += f"Recent emotion history: {self.emotion_history[-3:]}\n"
        prompt += f"Generate an emotionally appropriate reply to the user's last utterance: {text}"
        response = self.llm.generate(prompt)
        return response
```
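As a closing illustration, here is a hypothetical usage sketch of `EmotionalChatbot`. The `MERCPredictor` and `SimpleLLM` wrappers are placeholders invented for this example, not part of any library; in practice you would wrap the trained `MultimodalTransformer` and whatever LLM backend you use:

```python
# Hypothetical usage sketch: both wrapper classes are placeholders.
class MERCPredictor:
    """Wraps the trained model; .predict returns an emotion label string."""
    def __init__(self, model):
        self.model = model

    def predict(self, text, audio, visual):
        with torch.no_grad():
            logits = self.model(text, audio, visual)
        return label_encoder.inverse_transform([logits.argmax().item()])[0]

class SimpleLLM:
    """Stub LLM backend; .generate would call your actual LLM."""
    def generate(self, prompt):
        return f"(LLM reply conditioned on: {prompt[:60]}...)"

bot = EmotionalChatbot(MERCPredictor(model), SimpleLLM())
# text_input / audio_batch / visual_batch: one preprocessed utterance,
# shaped as in collate_fn above
reply = bot.respond(text_input, audio_batch, visual_batch)
print(reply)
```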
