"Embodimentalist AI was wrong and has misdirected half a generation. Its most advanced incarnation is arguably the Roomba." [Similar to what I think. Read the rest of that thread.]

This is somewhat related to a previous comment of mine about how "all you need is text". Several months ago I wrote a long defense of that argument, and one component of the defense was this:



I had posted this paper before, but somehow forgot what the authors actually showed.

The abstract:

We introduce language-driven image generation, the task of generating an image visualizing the semantic contents of a word embedding, e.g., given the word embedding of grasshopper, we generate a natural image of a grasshopper. We implement a simple method based on two mapping functions. The first takes as input a word embedding (as produced, e.g., by the word2vec toolkit) and maps it onto a high-level visual space (e.g., the space defined by one of the top layers of a Convolutional Neural Network). The second function maps this abstract visual representation to pixel space, in order to generate the target image. Several user studies suggest that the current system produces images that capture general visual properties of the concepts encoded in the word embedding, such as color or typical environment, and are sufficient to discriminate between general categories of objects.

In other words, starting from word embeddings built out of word co-occurrence statistics, they can produce an image of the object!

How good are the images? Well, they're not great, but you can at least make out some visual aspects of the object; see the table on page 2 of the paper. I'm sure that with much better image-generation methods they could be sharpened up considerably.

And I should point out that this requires a little bit of cheating: you need a small number of examples of mappings from word vector to image vector, in order to set parameters in the mapping function. But that "small number" is really, really small compared to the number of objects you can map, once the function is set.
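
To make that "cheating" step concrete, here is a minimal sketch of fitting a mapping function from a small set of paired examples and then applying it to words that were never paired with an image. Everything in it is an assumption for illustration: the data is synthetic, the map is a plain ridge-regularized linear regression (the abstract doesn't say what form the paper's mapping function takes), and the names `fit_linear_map`, `make_toy_data`, and `mean_cosine` are made up for this example.

```python
import numpy as np


def fit_linear_map(word_vecs, visual_vecs, ridge=1e-3):
    """Fit W minimizing ||word_vecs @ W - visual_vecs||^2 + ridge * ||W||^2."""
    dim = word_vecs.shape[1]
    gram = word_vecs.T @ word_vecs + ridge * np.eye(dim)
    return np.linalg.solve(gram, word_vecs.T @ visual_vecs)


def make_toy_data(n_words=2000, word_dim=50, visual_dim=20, noise=0.05, seed=0):
    """Synthetic stand-in: visual vectors are a noisy linear function of word vectors."""
    rng = np.random.default_rng(seed)
    words = rng.normal(size=(n_words, word_dim))
    true_map = rng.normal(size=(word_dim, visual_dim))
    visual = words @ true_map + noise * rng.normal(size=(n_words, visual_dim))
    return words, visual


def mean_cosine(a, b):
    """Average row-wise cosine similarity between two matrices of equal shape."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


words, visual = make_toy_data()

# Only the first 100 words come with a paired visual vector for fitting.
n_train = 100
W = fit_linear_map(words[:n_train], visual[:n_train])

# The remaining 1900 words are mapped using nothing but their word vectors.
predicted = words[n_train:] @ W
held_out_similarity = mean_cosine(predicted, visual[n_train:])
print(f"mean cosine similarity on {len(predicted)} unseen words: {held_out_similarity:.3f}")
```

The toy data is linear by construction, so the fit here is far cleaner than anything you'd get with real embeddings. The point is only the ratio: 100 paired examples set the parameters, and the same function then covers every other word in the vocabulary.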

Furthermore, these examples are only necessary because you force the computer to cough up visual output, which means it must have some way to map its text-derived representations to pixels. As long as communication stays in the text modality, neither the mapping nor the examples used to fit it are necessary.
