Search for a command to run...
Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?