MediaPipe in Practice: Building Human Pose and Gesture Recognition Applications from Scratch

张开发
2026/4/13 20:58:55 · 15 min read


1. Environment Setup and Installation

The most painful part of getting started with MediaPipe is environment configuration. When I gave an internal talk to my team last year, a colleague spent half a day fighting with Windows before we discovered the culprit was an incompatible Python version. Below are a few installation recipes I have verified myself, so you can avoid these pitfalls.

On Windows, installation is as simple as stacking blocks:

```
pip install mediapipe
```

This command installs the CPU version, which is fine for most entry-level scenarios. If you have an NVIDIA GPU, you can try this accelerated variant:

```
pip install mediapipe-gpu
```

In my tests on an RTX 3060, gesture-recognition latency dropped from 15 ms to around 6 ms. Watch out for CUDA version mismatches, though; I recommend the CUDA 11.2 + cuDNN 8.1 combination.

Installation on Jetson devices is slightly more involved. I deployed on a Jetson Xavier NX just last week; you need to install a specific wheel first:

```
wget https://nvidia.box.com/shared/static/6t9q1k8ddqk0agkzm6g0.zip
unzip 6t9q1k8ddqk0agkzm6g0.zip
pip install mediapipe-0.8.6-cp36-cp36m-linux_aarch64.whl
```

A small tip: running `sudo apt install libopencv-dev python3-opencv` before the install avoids 90% of the dependency errors.

Common troubleshooting:

- `ImportError: libcblas.so.3` — try `sudo apt install libatlas-base-dev`
- Camera not recognized — check your group membership: `sudo usermod -a -G video $USER`
- Out of memory — on Jetson, add swap space

2. Human Pose Detection in Practice

2.1 Basic API Calls

Let's start with the simplest static-image detection example:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
pose = mp_pose.Pose(min_detection_confidence=0.7)

img = cv2.imread("yoga.jpg")
# The three key steps: convert to RGB, run inference, draw the landmarks
results = pose.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
mp.solutions.drawing_utils.draw_landmarks(
    img, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
cv2.imwrite("output.jpg", img)
```

I call this the "three-step pattern": initialize the model → process the image → draw the results. I suggest setting `min_detection_confidence` between 0.6 and 0.8: too low and you get false detections, too high and you miss real ones.

2.2 Parsing the Keypoint Data

The 33 keypoints MediaPipe returns are like GPS coordinates for the human skeleton:

```python
landmark = results.pose_landmarks.landmark
right_shoulder = landmark[mp_pose.PoseLandmark.RIGHT_SHOULDER]
print(f"Right shoulder: (x:{right_shoulder.x:.3f}, y:{right_shoulder.y:.3f})")
```

The coordinates are normalized; to convert to actual pixels:

```python
height, width = img.shape[:2]
pixel_x = int(right_shoulder.x * width)
pixel_y = int(right_shoulder.y * height)
```

A mnemonic for the keypoint layout:

- 0-10: face region (nose, eyes, ears, mouth)
- 11-16: shoulders, elbows, and wrists
- 17-22: hand points (pinky, index, thumb)
- 23-28: hips, knees, and ankles
- 29-32: heels and feet

Within each group, left and right alternate: odd indices are the left side, even indices the right.

2.3 Real-Time Video Processing

Combine with OpenCV for real-time detection:

```python
cap = cv2.VideoCapture(0)
while cap.isOpened():
    success, frame = cap.read()
    if not success:
        continue
    frame = cv2.flip(frame, 1)  # mirror the image
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        # Draw a semi-transparent skeleton overlay
        annotated_image = frame.copy()
        mp.solutions.drawing_utils.draw_landmarks(
            annotated_image, results.pose_landmarks, mp_pose.POSE_CONNECTIONS)
        cv2.addWeighted(annotated_image, 0.7, frame, 0.3, 0, frame)
    cv2.imshow("Pose Detection", frame)
    if cv2.waitKey(5) & 0xFF == 27:  # Esc to quit
        break
```

I added two refinements to this implementation:

- `cv2.flip` mirrors the frame so the picture matches your mirror intuition
- `cv2.addWeighted` makes the skeleton semi-transparent, so it occludes less of the frame

3. Gesture Recognition Development in Detail

3.1 Hand Keypoint Detection

The basic code structure for gesture recognition mirrors pose detection:

```python
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(max_num_hands=2)

results = hands.process(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
if results.multi_hand_landmarks:
    for hand_landmarks in results.multi_hand_landmarks:
        mp.solutions.drawing_utils.draw_landmarks(
            img, hand_landmarks, mp_hands.HAND_CONNECTIONS)
```

The `max_num_hands` parameter caps the number of hands detected; in my tests a single 1080 Ti handled four hands simultaneously without trouble.

3.2 Business Logic on Top of Keypoints

The typical check for whether the thumb is extended:

```python
thumb_tip = hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_TIP]
thumb_ip = hand_landmarks.landmark[mp_hands.HandLandmark.THUMB_IP]
if thumb_tip.y < thumb_ip.y:  # smaller y means higher in the image
    print("Thumb extended")
```

A more involved rock-paper-scissors classifier:

```python
def detect_gesture(landmarks):
    tips = [8, 12, 16, 20]  # fingertip landmark indices
    # A finger counts as extended when its tip sits above its PIP joint
    extended = [landmarks[i].y < landmarks[i - 2].y for i in tips]
    count = sum(extended)
    if count == 0:
        return "rock"
    elif count == 2 and extended[0] and extended[1]:
        return "scissors"  # exactly index and middle raised
    elif count == 4:
        return "paper"
    return "unknown"
```

3.3 Performance Optimization Tips

On resource-constrained devices like the Raspberry Pi, you can switch to a lighter model:

```python
hands = mp_hands.Hands(
    static_image_mode=False,
    model_complexity=0,  # 0-2; smaller values are lighter
    min_detection_confidence=0.5)
```

On a Raspberry Pi 4 I measured roughly 8 FPS with `model_complexity=1`; dropping it to 0 raised that to about 15 FPS.

Another trick is lowering the input resolution:

```python
small = cv2.resize(frame, (320, 240))  # run inference on a smaller image
results = hands.process(cv2.cvtColor(small, cv2.COLOR_BGR2RGB))
frame = cv2.resize(frame, (640, 480))  # keep the display at full size
```
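The finger-counting rule from section 3.2 can be sanity-checked without a camera by replaying it over synthetic landmarks. In the sketch below, `Landmark` and `make_hand` are test scaffolding I invented for the harness, not MediaPipe APIs; only the normalized-coordinate convention (y grows downward) comes from the library:

```python
from collections import namedtuple

# Minimal stand-in for a MediaPipe landmark: x and y are normalized to
# [0, 1], with y growing downward in image coordinates.
Landmark = namedtuple("Landmark", ["x", "y"])

FINGER_TIPS = [8, 12, 16, 20]  # index, middle, ring, pinky fingertip indices

def detect_gesture(landmarks):
    """Classify rock/paper/scissors from 21 hand landmarks.

    A finger counts as extended when its tip sits above (smaller y than)
    its PIP joint, which lives two indices before the tip.
    """
    extended = [landmarks[i].y < landmarks[i - 2].y for i in FINGER_TIPS]
    count = sum(extended)
    if count == 0:
        return "rock"
    if count == 2 and extended[0] and extended[1]:
        return "scissors"  # exactly index and middle raised
    if count == 4:
        return "paper"
    return "unknown"

def make_hand(extended_tips):
    """Build 21 synthetic landmarks; tips listed in extended_tips are raised."""
    pts = [Landmark(0.5, 0.5)] * 21
    for tip in FINGER_TIPS:
        pts[tip - 2] = Landmark(0.5, 0.4)             # PIP joint
        tip_y = 0.2 if tip in extended_tips else 0.6  # raised vs. curled
        pts[tip] = Landmark(0.5, tip_y)
    return pts

result = detect_gesture(make_hand([8, 12]))  # "scissors": index + middle raised
```

An offline harness like this is also handy for tuning gesture thresholds before wiring in the live detector.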
4. Engineering Application Cases

4.1 Motion-Sensing Game Controller

Use pose detection to steer a space-shooter game:

```python
def get_control_signal(landmarks):
    left_shoulder = landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER]
    right_shoulder = landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER]
    tilt = (left_shoulder.y - right_shoulder.y) * 2  # shoulder-line tilt
    if abs(tilt) > 0.1:
        return "left" if tilt > 0 else "right"
    return "neutral"
```

4.2 Smart Fitness Coach

A squat-detection algorithm:

```python
knees = [landmarks[mp_pose.PoseLandmark.LEFT_KNEE],
         landmarks[mp_pose.PoseLandmark.RIGHT_KNEE]]
hips = [landmarks[mp_pose.PoseLandmark.LEFT_HIP],
        landmarks[mp_pose.PoseLandmark.RIGHT_HIP]]
squat_depth = sum(knee.y - hip.y for knee, hip in zip(knees, hips)) / 2
if squat_depth < 0.15:  # hips approaching knee height
    print("Squat depth reached")
```

4.3 Sign Language Translation System

Combine with an LSTM for continuous sign recognition:

```python
# Collect hand features over a sliding window of consecutive frames
sequence = []
for frame in video_stream:
    hands = process_frame(frame)
    if hands:
        keypoints = extract_normalized_points(hands)
        sequence.append(keypoints)
        if len(sequence) == 30:  # 30-frame window
            prediction = model.predict(np.array(sequence))
            sequence.pop(0)
```

A few final engineering reminders:

- Use `results.pose_world_landmarks` when you need metric, world-coordinate data rather than normalized image coordinates
- For jitter, add a Kalman filter to smooth the landmark trajectories
- Consider separate threads for image capture and model inference
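On the jitter point: even a one-dimensional Kalman filter per landmark coordinate goes a long way. Below is a minimal sketch using a random-walk motion model; the noise parameters `q` and `r` are illustrative guesses you would tune per application, not values from MediaPipe:

```python
class ScalarKalman:
    """1-D Kalman filter with a random-walk motion model, one per coordinate.

    q: process noise -- how far the joint may drift per frame
    r: measurement noise -- how jittery the raw detector output is
    """

    def __init__(self, q=1e-4, r=1e-2):
        self.q, self.r = q, r
        self.x = None  # state estimate (the smoothed coordinate)
        self.p = 1.0   # variance of the estimate

    def update(self, z):
        if self.x is None:  # first measurement initializes the state
            self.x = z
            return self.x
        self.p += self.q                  # predict: uncertainty grows
        k = self.p / (self.p + self.r)    # Kalman gain
        self.x += k * (z - self.x)        # correct toward the measurement
        self.p *= 1.0 - k
        return self.x

# Smoothing a jittery, roughly stationary coordinate stream:
f = ScalarKalman()
noisy = [0.50, 0.53, 0.47, 0.52, 0.48, 0.51]
smooth = [f.update(z) for z in noisy]
# The smoothed series varies noticeably less than the raw one.
```

In a real pipeline you would keep one filter per landmark per axis (33 × 2 for pose) and call `update` once per frame; a constant-velocity model tracks fast motion better at the cost of extra state.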
