Returns every text fragment and every BLIP-generated image caption as JSON. Nothing is summarised, so downstream quiz agents receive the complete source content.
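The exact JSON schema is not specified here, so the following is only a minimal sketch of what such a response could look like. The function name `build_payload` and the keys `text_fragments` and `image_captions` are hypothetical, and the captions are assumed to have been produced by BLIP in an earlier step.

```python
import json


def build_payload(text_fragments, image_captions):
    """Bundle fragments and captions verbatim; nothing is summarised or dropped.

    Both key names below are assumptions for illustration, not a documented schema.
    """
    return json.dumps(
        {
            "text_fragments": list(text_fragments),   # hypothetical key
            "image_captions": list(image_captions),   # hypothetical key
        },
        ensure_ascii=False,  # keep non-ASCII text readable in the output
        indent=2,
    )


# Sample inputs are made up for the example.
payload = build_payload(
    ["Photosynthesis converts light into chemical energy.",
     "It occurs in chloroplasts."],
    ["a diagram of a plant cell"],
)
print(payload)
```

A quiz agent could then call `json.loads(payload)` and iterate over both lists, since every fragment and caption is passed through unchanged.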