Extract text from images using various OCR modes
Video captioning and tracking
4M: Massively Multimodal Masked Modeling