In this experiment, a researcher needed to add up some numeric values scattered across twelve different emails. He made a screen recording of himself scrolling through the emails. He then got Google Gemini to extract the numbers from his screen recording into a CSV file for use in a spreadsheet.
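A workflow like the one described can be sketched with Google's `google-generativeai` Python SDK. The model name, prompt wording, and CSV column names below are illustrative assumptions, not the researcher's exact setup; a minimal sketch, assuming the SDK is installed and an API key is configured:

```python
# Sketch: ask Gemini to extract numbers from a screen recording as CSV,
# then total them locally. Prompt and column names are assumptions.
import csv
import io
import time

PROMPT = (
    "Watch this screen recording of emails being scrolled. "
    "Extract every numeric amount you see into CSV with the "
    "columns: email_subject,amount. Output only the CSV."
)

def extract_csv_from_video(path: str) -> str:
    """Upload a video to Gemini and return the model's CSV response."""
    # Imported here so the rest of the module works without the SDK.
    import google.generativeai as genai

    video = genai.upload_file(path)
    # Uploaded video is processed server-side before it can be used.
    while video.state.name == "PROCESSING":
        time.sleep(5)
        video = genai.get_file(video.name)
    model = genai.GenerativeModel("gemini-1.5-flash")
    return model.generate_content([video, PROMPT]).text

def sum_amounts(csv_text: str) -> float:
    """Total the 'amount' column of the returned CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(float(row["amount"]) for row in reader)
```

The summing step runs locally rather than asking the model to do arithmetic, since extraction is what the video input is good at and a spreadsheet (or three lines of code) handles the math reliably.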
While this is a simple example, the implications of the ability to video-scrape screencasts are significant. It means anything you can display on your screen (websites, apps, e-learning, etc.), and anything that can be captured as video from a phone or camera (books on a bookshelf, panoramic displays), has the potential to become usable input for AI.
Although several major models, including those from OpenAI and Anthropic, have research previews that demonstrate the ability to accept video as input, only Google Gemini has released this feature publicly. The likely reason is the high computational cost of processing video. Those costs will inevitably fall, however, so expect video as input to become widely available in the near future.