Less is More: Generating Grounded Navigation Instructions from Landmarks
Abstract
We study the automatic generation of navigation instructions from 360° images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our Marky-mt5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator, a multimodal, multilingual, multi-task encoder-decoder. To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 971k English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 71% following Marky-mt5's instructions, just shy of their 75% SR following human instructions, and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain 61-64% SRs on three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.