Upload an image to generate a caption, extract any text it contains, determine its context, and create audio from that context, using GPT-2 and Florence-2-base.
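The flow described above can be sketched as a small pipeline. This is only an illustrative outline, not the app's actual code. The names `analyze_image`, `ImageReport`, and the four injected callables are hypothetical. The step order, and the idea that the context is derived from the caption plus the extracted text, are assumptions. In a real app the callables would presumably wrap Florence-2-base (caption and text extraction), GPT-2 (context), and some text-to-speech backend; here they are stubs so the sketch runs without downloading any model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImageReport:
    """Everything the pipeline produces for one uploaded image."""
    caption: str
    extracted_text: str
    context: str
    audio: bytes


def analyze_image(
    image: bytes,
    caption_fn: Callable[[bytes], str],   # hypothetical: e.g. a Florence-2-base wrapper
    ocr_fn: Callable[[bytes], str],       # hypothetical: e.g. a Florence-2-base wrapper
    context_fn: Callable[[str], str],     # hypothetical: e.g. a GPT-2 wrapper
    tts_fn: Callable[[str], bytes],       # hypothetical: any text-to-speech backend
) -> ImageReport:
    # Vision steps: both read the same uploaded image.
    caption = caption_fn(image).strip()
    extracted_text = ocr_fn(image).strip()

    # Assumed design: the language model sees the caption and, if any
    # text was found, the extracted text as well.
    prompt_parts = [f"Caption: {caption}"]
    if extracted_text:
        prompt_parts.append(f"Text in image: {extracted_text}")
    context = context_fn("\n".join(prompt_parts)).strip()

    # Audio is generated from the context, so it has to come last.
    audio = tts_fn(context)
    return ImageReport(caption, extracted_text, context, audio)


if __name__ == "__main__":
    # Stub backends standing in for the real models.
    report = analyze_image(
        b"fake-image-bytes",
        caption_fn=lambda img: "A stop sign at a street corner",
        ocr_fn=lambda img: "STOP",
        context_fn=lambda prompt: "A traffic scene. " + prompt.replace("\n", " | "),
        tts_fn=lambda text: text.encode("utf-8"),
    )
    print(report.context)
```

Passing the model wrappers in as arguments keeps the ordering logic testable on its own and makes it easy to swap either model for another.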