Vision

19 posts in Vision

3.5x Faster Image Generation: DDiT Dynamically Resizes Patches in Diffusion Transformers
Vision

80.3 on ScreenSpotPro: GUI-Owl-1.5 Sets New Bar for Open-Source GUI Agents
AI Agents

Unified Latents Hits 1.4 FID by Replacing Stable Diffusion's Ad Hoc VAE with a Diffusion Prior
Vision

Baidu Introduces ERNIE 5.0: Trillion-Parameter Unified Multimodal MoE Rivals GPT-5
Vision

Prompt Fatigue Solved: Vibe AIGC Turns Users Into 'Commanders' of Multi-Agent Creative Workflows
AI Agents

Google Introduces Agentic Vision: Gemini 3 Flash Now Zooms, Annotates, and Investigates Images
AI Agents

260% Better at Catching Moving Objects: DynamicVLA Solves Robot Latency Problem
AI Agents

First Holistic OCR Model: OCRVerse Unifies Document Parsing and Code Generation
Vision

6x Fewer Tokens, Better OCR: DeepSeek's Visual Causal Flow Beats GPT-4o and Gemini
Vision

UPLiFT vs Cross-Attention Upsamplers: Linear Scaling Meets SOTA Quality
Vision

2x Faster VLA Inference with 70% Fewer Layers: Shallow-π Distillation for Edge Robotics
Infrastructure

90% Attention Sparsity with Zero Quality Loss: SALAD Speeds Up Video Diffusion 1.7x
Infrastructure

18x Faster Audiovisual Generation: Lightricks' Open-Source LTX-2 Rivals Veo 3
Vision

2.7x Better 3D Reconstruction from Messy Videos: Meta's ShapeR Tackles Real-World Capture
Vision

First Cross-Universe Character Mixing: MiMiX Puts Mr. Bean in Tom and Jerry
Vision

40% Faster Video from Single Images: Pixel-to-4D Predicts Dynamic 3D Gaussians in One Pass
Vision

VLM Hallucinations Exposed: VIB-Probe Pinpoints and Suppresses Faulty Attention Heads
Safety

16x Faster On-Device Video Generation: Qualcomm's ReHyAt Distills Attention in 160 GPU Hours
Vision

Training-Free Fix Boosts Vision-Language Models 3 Points by Correcting Attention Errors
Vision