Previous methods address weakly supervised grounding by training a grounding model to reconstruct the language information of the input query from predicted proposals. Instead, we exploit the consistency within both the visual and language modalities, and leverage complementary external knowledge to facilitate weakly supervised grounding.