Tinker: Diffusion's Gift to 3D--Multi-View Consistent Editing From Sparse Inputs without Per-Scene Optimization

We introduce Tinker, a versatile framework for high-fidelity 3D editing that operates in both one-shot and few-shot regimes without any per-scene finetuning. Unlike prior techniques that demand extensive per-scene optimization to ensure multi-view consistency or to produce dozens of consistently edited input views, Tinker delivers robust, multi-view consistent edits from as few as one or two images. This capability stems from repurposing pretrained diffusion models, which unlocks their latent 3D awareness. To drive research in this space, we curate the first large-scale multi-view editing dataset and data pipeline, spanning diverse scenes and styles. Building on this dataset, we develop a framework that generates multi-view consistent edited views without per-scene training and consists of two novel components: (1) Referring multi-view editor: enables precise, reference-driven edits that remain coherent across all viewpoints. (2) Any-view-to-video synthesizer: leverages spatial-temporal priors from video diffusion to perform high-quality scene completion and novel-view generation even from sparse inputs. Extensive experiments show that Tinker significantly lowers the barrier to generalizable 3D content creation, achieving state-of-the-art performance on editing, novel-view synthesis, and rendering enhancement tasks. We believe Tinker represents a key step towards truly scalable, zero-shot 3D editing.
Project webpage: https://aim-uofa.github.io/Tinker
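To make the two-stage design concrete, below is a minimal, hypothetical sketch of how the components described in the abstract might compose at inference time. All class and method names (View, ReferringMultiViewEditor, AnyViewToVideoSynthesizer, edit_views, complete_scene, tinker_pipeline) are illustrative assumptions based only on the abstract, not the paper's actual API.

```python
# Hypothetical composition of Tinker's two stages, inferred from the abstract.
# Neither stage performs per-scene optimization; both are meant to run from
# sparse (one- or few-shot) inputs.

from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class View:
    image: np.ndarray        # H x W x 3 RGB frame
    camera_pose: np.ndarray  # 4 x 4 camera-to-world matrix


class ReferringMultiViewEditor:
    """Stage 1 (assumed interface): apply a reference-driven edit to the
    sparse input views so the edit stays coherent across viewpoints."""

    def edit_views(self, views: List[View], reference: np.ndarray,
                   instruction: str) -> List[View]:
        raise NotImplementedError  # stand-in for the diffusion-based editor


class AnyViewToVideoSynthesizer:
    """Stage 2 (assumed interface): use spatial-temporal priors from video
    diffusion to complete the scene and render novel views along a camera path."""

    def complete_scene(self, edited_views: List[View],
                       target_poses: List[np.ndarray]) -> List[View]:
        raise NotImplementedError  # stand-in for the video-diffusion synthesizer


def tinker_pipeline(views: List[View], reference: np.ndarray,
                    instruction: str, target_poses: List[np.ndarray]) -> List[View]:
    # Edit the sparse inputs first, then densify them into a consistent
    # multi-view sequence along the requested camera trajectory.
    edited = ReferringMultiViewEditor().edit_views(views, reference, instruction)
    return AnyViewToVideoSynthesizer().complete_scene(edited, target_poses)
```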