Hi Ed,
I don't understand what you mean by "pre-trained". The agi-bio guys have dumped assorted genetic and protein databases into the atomspace, enabling search, crawling, and graphical exploration. These are "world models" where the world is biochemical interactions.
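To make that concrete, here is a toy sketch of the idea in Python. This is not the actual AtomSpace API, and the gene/protein names are made up, but it shows how storing interactions as typed relations immediately buys you search and one-hop crawling:

    # Toy stand-in for an atomspace-like store: n-ary typed relations
    # plus a simple wildcard query. Names like "GRN1" are invented.
    from collections import defaultdict

    class ToyAtomspace:
        def __init__(self):
            self.index = defaultdict(list)   # predicate -> list of argument tuples

        def add(self, predicate, *args):
            self.index[predicate].append(args)

        def query(self, predicate, pattern):
            # pattern is a tuple; None acts as a wildcard variable.
            for args in self.index[predicate]:
                if all(p is None or p == a for p, a in zip(pattern, args)):
                    yield args

    space = ToyAtomspace()
    space.add("expresses", "GRN1", "PROT-A")    # gene -> protein
    space.add("binds", "PROT-A", "PROT-B")      # protein-protein interaction
    space.add("binds", "PROT-A", "PROT-C")

    # "Crawl" one hop outward from PROT-A: what does it bind?
    for args in space.query("binds", ("PROT-A", None)):
        print("PROT-A binds", args[1])

The real atomspace generalizes this to an arbitrary hypergraph with typed nodes and links, but the search-and-crawl flavor is the same.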
I have dozens of datasets of natural language, sliced and diced every which way. They are "world models" where the "world" consists of analyzed, digested English text. I'm trying to ratchet up the sophistication of these models.
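To give a flavor of what "sliced and diced" can mean (an illustrative sketch, not my exact pipeline): one of the simplest digestions is to count word pairs in a corpus and score them with pointwise mutual information, so that high-scoring pairs become edges in a graph-shaped model of the text:

    # Illustrative only: one way to "digest" raw text into a relational dataset.
    # Adjacent word pairs are counted and scored with pointwise mutual
    # information (PMI); high-PMI pairs act as graph edges over the corpus.
    import math
    from collections import Counter

    text = "the cat sat on the mat and the cat saw the dog"
    words = text.split()

    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(w1, w2):
        p_pair = bigrams[(w1, w2)] / n_bi
        p1, p2 = unigrams[w1] / n_uni, unigrams[w2] / n_uni
        return math.log2(p_pair / (p1 * p2))

    for (w1, w2), count in bigrams.most_common(3):
        print(f"({w1}, {w2}): count={count}, pmi={pmi(w1, w2):+.2f}")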
The model I use for natural language can, I think, be applied to sound and vision, or to "perception" in general. It is described in the attached paper. As far as I know, no one has yet created atomspaces containing data extracted from image, video, or sound sources.
--linas