Objective: Clothed human generation, which aims to recover the 3D geometry and texture of the human body from input data to generate accurate 3D human models, is a challenging problem in computer vision and computer graphics. The need for high-quality generation has become increasingly critical with the growing demand for realistic 3D human models in applications such as virtual reality and augmented reality. Traditional multiview generation methods typically require specialized equipment to capture images from multiple viewpoints, which makes them expensive and impractical for everyday use. By contrast, single-view images are much easier to obtain from the web than multiview images. Single-view generation methods are therefore more cost-effective than multiview methods and simplify the model creation process. Given these advantages, we use a single view as input to recover the 3D model of a clothed human. However, single-view images lack comprehensive spatial information and structural details of occluded regions, which makes recovering a complete 3D shape difficult. As a result, existing methods based on implicit functions struggle to learn rear-view information effectively, leading to overly smooth and unrealistic back regions in the generated 3D human model. Methods that incorporate diffusion models show some potential for enhancing texture detail. However, most of these methods lack view consistency constraints, which makes it difficult to fully recover the local texture details of the human body. Additionally, the absence of precise geometric constraints during the diffusion process causes discrepancies between the generated models and the true geometry, particularly when handling complex 3D structures.
Existing methods typically assume a uniform point distribution across spatial regions, ignoring variations in the distribution of query points caused by differences in distance from the human body surface. This assumption makes it difficult for these methods to adapt to differences in geometric complexity across regions of the body. As a result, they face limitations when generating the surfaces of loose clothing, which have complex and variable geometries. This study addresses these challenges by combining three mechanisms: pose diffusion prior generation, multiview consistency constraints, and adaptive geometry generation. This approach preserves the generative capabilities of the diffusion model while introducing geometric constraints to ensure the accuracy of the generation. Furthermore, by incorporating the probability distribution of the human body structure, the method can generate high-quality 3D human models. This study proposes a generation method that integrates pose diffusion priors with multiview consistency.
Method: This study constructs a method for single-view clothed human generation. First, a human pose estimation algorithm is used to extract 25 key points, which are encoded into Gaussian heatmaps to model spatial continuity. This encoding enables the model to understand the spatial relationships around the key points. The Gaussian heatmaps, combined with the human mask and UV mapping, are used to construct a pose feature vector. This feature vector guides the denoising process of the latent diffusion model, which generates 2D diffusion images for unseen viewpoints through an adaptive cross-attention mechanism.
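The heatmap encoding of key points described above can be illustrated with a minimal sketch. It is not the authors' implementation: the grid size, the standard deviation `sigma`, the helper names `gaussian_heatmap` and `encode_keypoints`, and the toy key point coordinates are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import math


def gaussian_heatmap(cx, cy, width, height, sigma=2.0):
    """Return a height x width grid whose value peaks at 1.0 on (cx, cy).

    sigma (an assumed value) controls how far the key point's influence
    spreads, which is what gives the encoding its spatial continuity.
    """
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / two_sigma_sq)
         for x in range(width)]
        for y in range(height)
    ]


def encode_keypoints(keypoints, width, height, sigma=2.0):
    """Encode a list of (x, y) key points as one heatmap channel per point."""
    return [gaussian_heatmap(x, y, width, height, sigma) for x, y in keypoints]


# Toy input: 25 hypothetical key points on a 32 x 32 grid.
keypoints = [(i + 3, (2 * i) % 32) for i in range(25)]
heatmaps = encode_keypoints(keypoints, width=32, height=32)
```

Each channel is indexed as `heatmaps[k][y][x]`. Compared with a one-hot key point map, the smooth falloff gives pixels near a key point a nonzero response, so a downstream network receives a graded signal about the neighborhood of each joint.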
Second, the normal information of the SMPLX (skinned multi-person linear model expressive) human template estimated from the input image is fused with the 2D diffusion image and fed into the cross-view normal consistency network, where the multiview consistency mechanism extracts the corresponding 3D spatial features for each viewpoint. Finally, the voxelized features of the SMPLX human template and the 3D spatial features are fused and fed into the distribution prediction network for spatial occupancy probability estimation. By learning the distribution parameters of each point, the model can express geometric uncertainty at different spatial locations and sample from the learned probability distribution. The 3D features, voxelized features, and sampling results are then fed into the occupancy prediction network to generate the 3D clothed human. The entire model is trained on the THuman2.0 (Tsinghua human 2.0) dataset, with 490 images used for training and 21 images used for testing. To further evaluate generalization ability, we also tested the model on the CAPE (clothed auto-person encoding) dataset. This dataset is divided into two subsets: CAPE fitted poses (CAPE-FP), which contains 75 images used to assess geometric generation accuracy under simple poses, and CAPE nonfitted poses (CAPE-NFP), which contains 75 images and focuses on adaptability to complex poses. The experiments are conducted on an NVIDIA GeForce RTX 3090 GPU, with a learning rate of 1 × 10⁻⁴ and a batch size of 2.
Result: We conducted experiments on the THuman2.0 and CAPE datasets and compared our single-view clothed human generation results with those of six other methods. Chamfer distance (CD) is used to evaluate the overall geometric similarity of the 3D human body, and point-to-surface distance (P2S) is used to assess the geometric accuracy of the reconstructed surface. For both metrics, smaller values indicate better results. On the THuman2.0 dataset, the CD and P2S of the proposed method were 6.27% and 5.74% lower, respectively, than those of the best-performing comparison method. On the CAPE-FP and CAPE-NFP subsets, the CD and P2S of the proposed method were better than those of the other comparison methods. On the entire CAPE dataset, the CD of the proposed method decreased by an average of 8.67%, and the P2S decreased by an average of 2.38%. These quantitative experiments show that our method generalizes well to unseen data and can effectively handle human generation in complex poses. Inference efficiency comparisons show that the computational complexity of our method is lower than that of similar diffusion-based methods. The experimental results indicate that combining pose diffusion priors and multiview consistency helps recover the texture details of the 3D human body, and that adaptive geometry generation enables accurate recovery of complex clothing topologies.
Conclusion: The single-view 3D clothed human generation method proposed in this paper, which combines pose diffusion priors and multiview consistency, effectively recovers the local details of the clothed human and accurately generates 3D human models with complex topological structures, such as rich wrinkle details and loose clothing.… …
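The Chamfer distance used in the evaluation can be sketched as follows. The abstract does not state which variant is used, so this sketch assumes the unsquared form that averages the two directed mean nearest-neighbor distances. The one-directional distance is shown only as a point-sampled stand-in for P2S, since a true point-to-surface distance measures against mesh triangles rather than sampled points. The function names and the toy point sets are hypothetical.

```python
import math


def directed_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: average of the two directed distances."""
    return 0.5 * (directed_distance(a, b) + directed_distance(b, a))


# Toy point sets (hypothetical): a reconstruction and a denser reference.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0),
                (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]

cd = chamfer_distance(reconstructed, ground_truth)

# One-directional distance from the reconstruction to the reference samples,
# used here as a rough approximation of a point-to-surface measure.
p2s_like = directed_distance(reconstructed, ground_truth)
```

In this toy case the symmetric measure is larger than the one-directional one because the reference contains a point, (1, 1, 0), that the reconstruction does not cover. This is why the two metrics are reported together: one reflects overall similarity in both directions, the other only how close the reconstructed points lie to the reference.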