In visual object tracking, scenarios such as deformation and scale variation remain challenging. In this work, we propose a new tracking architecture with a transformer and multiple masks as its key components. The transformer models the spatial and temporal connections among frames: the encoder learns the target via the attention mechanism, while the decoder uses information from previous frames to better track the target in the current frame. To ensure that the transformer attends to the exact target area, we propose multiple masks, which suppress the background while leaving the target area unchanged. They consist of spatial masks, which focus on the information in the current frame, and temporal masks, which make use of historical information. The masks further enhance the transformer, making it more focused on the target and more robust in extreme scenarios. With the transformer and multiple masks, our proposed tracker achieves state-of-the-art performance.
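As an illustration of the masking idea only, the following Python sketch shows one way a mask can keep the target area unchanged while suppressing the background. The abstract does not specify how the masks are constructed, so the box-shaped spatial mask, the union-based temporal mask, and the names `box_mask`, `temporal_mask`, and `apply_mask` are all assumptions made for this sketch, not the proposed method.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format: (top, left, bottom, right), bottom/right exclusive.
Box = Tuple[int, int, int, int]
Grid = List[List[float]]


def box_mask(height: int, width: int, box: Box, background: float = 0.0) -> Grid:
    """Spatial mask (assumed form): 1.0 inside the target box, `background` outside."""
    top, left, bottom, right = box
    return [
        [1.0 if top <= r < bottom and left <= c < right else background
         for c in range(width)]
        for r in range(height)
    ]


def temporal_mask(past_boxes: Sequence[Box], height: int, width: int,
                  background: float = 0.0) -> Grid:
    """Temporal mask (assumed form): element-wise maximum of past box masks.

    With no history, nothing is suppressed and an all-ones mask is returned.
    """
    if not past_boxes:
        return [[1.0] * width for _ in range(height)]
    masks = [box_mask(height, width, b, background) for b in past_boxes]
    return [
        [max(m[r][c] for m in masks) for c in range(width)]
        for r in range(height)
    ]


def apply_mask(feature: Grid, mask: Grid) -> Grid:
    """Element-wise product: values under a 1.0 entry pass through unchanged."""
    return [
        [f * m for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(feature, mask)
    ]
```

Here a mask value of 1.0 leaves a feature untouched and a smaller value attenuates it, which matches the stated goal of suppressing the background while leaving the target area unchanged. In an attention-based model the same mask could instead be applied to attention weights; that choice is likewise not determined by the abstract.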