Traditionally, the direct method is prone to falling into local optima because it estimates the camera pose entirely through gradient-based search. To address this problem, IMU (inertial measurement unit) data are tightly coupled with camera tracking to provide accurate short-term motion constraints and an initial value for the gradient direction, and the visual pose-tracking result is corrected accordingly to improve the tracking accuracy of the monocular visual odometry. A sensor data fusion model is then established from the camera and IMU measurements and solved by sliding-window optimization. During marginalization, the state variables to be marginalized or added to the sliding window are selected according to the camera motion between the current frame and the previous keyframes; this ensures sufficiently accurate prior states for optimization and thus better optimization and fusion performance. Experimental results show that, compared with existing visual odometry algorithms, the proposed algorithm achieves a total orientation error of about 3° and a total translation error of less than 0.4 m.
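The motion-based marginalization rule described above can be sketched as follows. This is a minimal illustration of one common strategy in sliding-window visual-inertial odometry: if the current frame has moved enough relative to the latest keyframe, it is kept as a new keyframe and the oldest state is marginalized; otherwise the current frame itself is marginalized. The function name, pose representation, and threshold values are illustrative assumptions, not taken from the paper.

```python
import math

def select_marginalization(curr_pose, keyframe_poses,
                           trans_thresh=0.1,
                           rot_thresh=math.radians(5.0)):
    """Decide which state to drop from the sliding window.

    Poses are (x, y, z, yaw) tuples; the thresholds are hypothetical
    illustrative values, not values reported in the paper.
    Returns 'oldest' if the current frame should become a keyframe and
    the oldest state should be marginalized, or 'current' if the
    current frame should be marginalized and the keyframes kept.
    """
    last_kf = keyframe_poses[-1]
    # Translation and rotation of the current frame relative to the
    # most recent keyframe: large motion -> promote to keyframe.
    dt = math.dist(curr_pose[:3], last_kf[:3])
    dr = abs(curr_pose[3] - last_kf[3])
    if dt > trans_thresh or dr > rot_thresh:
        return 'oldest'
    return 'current'
```

For example, with a latest keyframe at the origin, a frame 0.5 m away would be promoted (marginalizing the oldest state), while a frame only a few centimeters away would itself be marginalized, keeping the prior states in the window accurate.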