Extracting urban digital surface models (DSMs) accurately and at a reasonable cost is of considerable practical value. In this study, the across-track and along-track modes of the ZiYuan-3 (ZY-3) satellite were used to build stereo pairs and elevation models from different views. Image matching was used to generate optical point clouds for multiple stereo pairs; these point clouds were used to analyze the effects of point cloud fusion under different scenarios and to optimize the integration of ZY-3 multiview point clouds. Then, with the assistance of a high-precision airborne light detection and ranging (LiDAR) point cloud, 12 indices (the maximum, minimum, mean, and B10 to B90) were derived from the optical point cloud of building roofs to construct an urban building elevation model. The results were as follows: (1) The across-track optical point cloud formed by stereo pairs of ZY-3 nadir images was more suitable for urban DSM extraction owing to the high spatial resolution and small base/height ratio of these images. (2) The B70 index performed best in the estimated urban building elevation model [R² = 0.91, root mean square error (RMSE) = 5.59 m]; this model accurately identified 95.39% and 95.25% of the whole-building elevations in a narrow-scale test of 3170 buildings and a broad-scale test of 1,510,606 buildings, respectively. Moreover, the model R² and the number of accurately identified building elevations gradually increased as more point clouds were fused, while the RMSE decreased significantly. (3) Combining ZY-3 multiview point cloud fusion with multispectral data analysis delivered a more refined urban DSM and a better characterization of building elevation under complex underlying surface scenarios than traditional stereo pair technology. This research presents methods and theory for applying ∼2- to 6-m resolution satellite images to complex urban underlying surface scenarios.
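The index construction and model evaluation summarized above can be illustrated with a minimal sketch. It rests on several assumptions that the text does not confirm: B10 to B90 are taken to be the 10th to 90th height percentiles of the roof points of a building, the elevation model is taken to be a simple linear regression of the LiDAR reference height on a single index, and the function names `roof_indices` and `fit_linear` are hypothetical. It should be read as an illustration of the general idea rather than as the procedure used in the study.

```python
from math import sqrt
from statistics import mean, quantiles


def roof_indices(heights):
    """Return 12 candidate indices for the roof points of one building.

    Assumption: B10..B90 denote the 10th..90th height percentiles.
    Requires at least two roof points.
    """
    h = [float(v) for v in heights]
    indices = {"max": max(h), "min": min(h), "mean": mean(h)}
    # Deciles with linear interpolation between the sorted data points.
    deciles = quantiles(h, n=10, method="inclusive")
    for p, value in zip(range(10, 100, 10), deciles):
        indices[f"B{p}"] = value
    return indices


def fit_linear(x, y):
    """Ordinary least-squares fit y ~ a*x + b.

    Assumption: a single-index linear model; x is one index per building
    (for example B70) and y is the LiDAR reference height.
    Returns (a, b, r2, rmse).
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = my - a * mx
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    rmse = sqrt(ss_res / len(y))
    return a, b, r2, rmse
```

Under these assumptions, one would call `roof_indices` once per building, collect a chosen index (such as `"B70"`) across buildings, and pass it to `fit_linear` together with the LiDAR heights; repeating this for each of the 12 indices would allow their R² and RMSE values to be compared.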