3 posts tagged "VLM"

Falcon-Perception

May 6, 2026 · 1 min read

GoCoding

Falcon Perception: a natively multimodal, dense, autoregressive Transformer model that performs object detection, instance segmentation, or OCR from natural-language queries.

https://github.com/tiiuae/Falcon-Perception

Sapiens2

May 5, 2026 · 1 min read

GoCoding

Sapiens2: 1K-resolution vision transformers pretrained on 1B human images, for human-centric tasks: pose estimation, body-part segmentation, surface normals, and pointmaps.

Sapiens2 is a human-centric vision foundation model from Meta AI.

Pretrained on 1 billion human images, with 0.1B to 5B parameters and native 1K resolution (a 4K variant supports ultra-high resolution).

https://github.com/facebookresearch/sapiens2

Rex-Omni is a 3B-parameter multimodal model that unifies visual perception tasks, including object detection, OCR, pointing, keypointing, and visual prompting, into a single next-point prediction framework.

Homepage: https://rex-omni.github.io/
Code: https://github.com/IDEA-Research/Rex-Omni
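
To make the "next-point prediction" idea more concrete, here is a minimal sketch of how different perception outputs could share one point-based token format. It is not Rex-Omni's code: the bin count `NUM_BINS` and the helper names `quantize`, `encode_points`, `box_as_points`, and `decode_points` are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float]

# Assumed number of coordinate bins per axis (hypothetical, not taken from the post).
NUM_BINS = 1000


def quantize(value: float, size: float, bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins - 1]."""
    idx = int(value / size * bins)
    return min(max(idx, 0), bins - 1)


def encode_points(points: List[Point], width: int, height: int) -> List[int]:
    """Flatten (x, y) points into one sequence of coordinate tokens."""
    tokens: List[int] = []
    for x, y in points:
        tokens.append(quantize(x, width))
        tokens.append(quantize(y, height))
    return tokens


def box_as_points(x0: float, y0: float, x1: float, y1: float) -> List[Point]:
    """Express a bounding box as its top-left and bottom-right corners."""
    return [(x0, y0), (x1, y1)]


def decode_points(tokens: List[int], width: int, height: int,
                  bins: int = NUM_BINS) -> List[Point]:
    """Turn coordinate tokens back into pixel coordinates (bin centers)."""
    points: List[Point] = []
    for i in range(0, len(tokens) - 1, 2):
        x = (tokens[i] + 0.5) / bins * width
        y = (tokens[i + 1] + 0.5) / bins * height
        points.append((x, y))
    return points


if __name__ == "__main__":
    # A detection box and a single pointing target use the same token format.
    print(encode_points(box_as_points(0, 0, 640, 480), 640, 480))
    print(encode_points([(320, 240)], 640, 480))
```

Under this sketch a box is two points, a pointing result is one point, and a keypoint set is a longer list of points, so a single autoregressive decoder could in principle emit all of them as the same kind of token sequence.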