Abstract: When we look around and perform complex tasks, how we see and selectively process what we see is crucial. How-ever, the lack of this visual search mechanism in current multimodal LLMs (MLLMs ...