These weird, unsettling photos show that AI is getting smarter


Of all the AI models in the world, OpenAI's GPT-3 has most captured the public's imagination. It can spit out poems, short stories, and songs with little prompting, and it has been shown to fool people into believing its output was written by a human. But its eloquence is more of a parlor trick, not to be confused with real intelligence.

Still, researchers believe the techniques used to create GPT-3 may hold the secret to more advanced AI. GPT-3 was trained on an enormous amount of text data. What if the same methods were applied to both text and images?

New research from the Allen Institute for Artificial Intelligence (AI2) has taken this idea to the next level. The researchers have developed a new text-and-image model, also known as a visual language model, that can generate an image from a caption. The images look unsettling and freaky – nothing like the hyper-realistic deepfakes generated by GANs – but they could point to a promising new direction for achieving more general intelligence, and possibly smarter robots as well.

Fill in the blank

GPT-3 is one of a group of models known as "Transformers" that first became popular with the success of Google's BERT. Before BERT, language models were pretty bad. They had enough predictive power to be useful for applications like autocomplete, but not enough to generate a long sentence that followed grammar rules and common sense.

BERT changed this by introducing a new technique called "masking". Various words in a sentence are hidden, and the model is asked to fill in the blank. For example:

  • The woman went to the ___ to exercise.
  • They bought a ___ of bread to make sandwiches.

The idea is that if the model is forced to do these exercises, often millions of times, it discovers patterns in how words are assembled into sentences and sentences into paragraphs. As a result, it can better generate and interpret text, coming closer to understanding the meaning of language. (Google now uses BERT to serve more relevant results on its search engine.) After masking proved extremely effective, researchers tried applying it to visual language models by hiding words in captions, like so:
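The masking drill described above can be sketched in a few lines of Python. This is a toy illustration of the data-preparation step only, not the model itself: real BERT masks token IDs and predicts them with a learned transformer (the 15% default rate and the `[MASK]` token match BERT's setup; everything else here is illustrative).

```python
import random

def mask_sentence(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a random subset of tokens, BERT-style.

    Returns the masked sentence plus (position, original word) pairs,
    which are exactly what the model is trained to predict back.
    """
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)   # hide this word from the model
            targets.append((i, tok))    # ...but remember it as the label
        else:
            masked.append(tok)
    return masked, targets

random.seed(7)
sentence = "the woman went to the gym to exercise".split()
masked, targets = mask_sentence(sentence, mask_rate=0.3)
print(" ".join(masked))
print(targets)
```

Repeating this over millions of sentences, with a fresh random mask each time, is what exposes the statistical patterns the article describes.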

There is a ____ on the ground near a tree.


This time, the model could look at both the surrounding words and the content of the image to fill in the blank. Through millions of repetitions, it then discovered not only the patterns among the words, but also the relationships between the words and the elements in each image.
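The image-conditioned fill-in-the-blank step can be sketched as follows. This is purely illustrative: the word priors, the detector output, and the scoring rule below are all invented for the sketch, whereas the real model learns these signals end to end with a transformer over image-region features.

```python
def fill_blank(caption_tokens, mask_index, image_objects, vocabulary):
    """Choose a word for the masked slot by combining a (made-up)
    language-model prior with visual evidence from the image."""
    assert caption_tokens[mask_index] == "[MASK]"
    text_prior = {"dog": 0.40, "cat": 0.35, "ball": 0.25}  # illustrative only

    def score(word):
        grounded = 1.0 if word in image_objects else 0.0  # seen in the image?
        return text_prior.get(word, 0.0) + grounded

    return max(vocabulary, key=score)

caption = "there is a [MASK] on the ground near a tree".split()
detected = {"cat", "tree", "grass"}   # hypothetical object-detector output
vocab = ["dog", "cat", "ball"]

print(fill_blank(caption, 3, detected, vocab))  # prints "cat": text alone
# slightly favors "dog", but the visual evidence tips the score to "cat"
```

The point of the sketch is the interaction: with no image evidence, the text prior would win; with the image, the grounded word does.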

The result is models that can link text descriptions to visual references – just as babies make connections between the words they learn and the things they see. The models can, for example, look at the photo below and write a meaningful caption, such as "Women playing field hockey." Or they can answer a question like "What is the color of the ball?" by connecting the word "ball" to the round object in the picture.

A visual language model could meaningfully caption this photo: "Women playing field hockey."


A picture is worth a thousand words

However, the AI2 researchers wanted to know whether these models had actually developed a conceptual understanding of the visual world. A child who has learned the word for an object can not only conjure up the word to identify the object, but can also draw the object when prompted with the word, even when the object itself is absent. So the researchers asked the models to do the same: generate images from captions. All of them spit out nonsensical pixel patterns instead.

It's a bird! It's a plane! No, it's just AI-generated gobbledygook: a confusing web of pixels.


It makes sense: converting text into images is far harder than the other way around. A caption doesn't specify everything in an image, says Ani Kembhavi, who leads the computer vision team at AI2. So a model has to draw on a lot of common sense about the world to fill in the details.

If asked to draw "a giraffe walking on a road," for example, it has to infer that the road is more likely gray than pink, and more likely next to a field of grass than to the ocean – even though none of this information is stated explicitly.

So Kembhavi and colleagues Jaemin Cho, Jiasen Lu, and Hannaneh Hajishirzi decided to see whether they could teach a model all of this implicit visual knowledge by tweaking their approach to masking. Rather than training the model only to predict masked words in captions from the corresponding photos, they also trained it to predict masked pixels in the photos on the basis of their corresponding captions.
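The two-way masking objective described above can be sketched as a data-preparation step. The sketch uses toy strings as stand-ins for image regions (in the real model these are feature vectors for pixel patches, and the function names and rates here are illustrative, not AI2's actual code):

```python
import random

def make_dual_masked_example(caption_tokens, image_patches,
                             word_rate=0.15, patch_rate=0.3):
    """Build one training example for the two-way masking objective:
    hide some caption words AND some image patches, so the model must
    reconstruct each modality from what remains of both."""
    masked_words, word_targets = [], {}
    for i, word in enumerate(caption_tokens):
        if random.random() < word_rate:
            masked_words.append("[MASK]")  # word to predict from text + image
            word_targets[i] = word
        else:
            masked_words.append(word)

    masked_patches, patch_targets = [], {}
    for j, patch in enumerate(image_patches):
        if random.random() < patch_rate:
            masked_patches.append(None)    # region to predict from image + text
            patch_targets[j] = patch
        else:
            masked_patches.append(patch)

    return masked_words, masked_patches, word_targets, patch_targets

random.seed(0)
words = "a giraffe walking on a road".split()
patches = [f"patch_{k}" for k in range(9)]   # stand-in for a 3x3 region grid
mw, mp, wt, pt = make_dual_masked_example(words, patches)
print(mw, wt)
print(mp, pt)
```

Training on both target sets at once is what pushes the model to carry visual knowledge in both directions, caption-to-image as well as image-to-caption.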

The final images produced by the model are not exactly realistic. But that isn't the point. They contain the right high-level visual concepts – the AI equivalent of a child drawing a stick figure to represent a human. (You can try the model yourself here.)

Examples of images generated by the AI2 model from the captions below. The outputs look wobbly and freaky, but they still convey the high-level visual concepts of their respective captions.


The ability of visual language models to perform this kind of image generation is an important advance in AI research. It suggests that the model is indeed capable of some degree of abstraction, a fundamental ability for understanding the world.

In the long term, this could have implications for robotics. The better a robot understands its visual environment and uses language to communicate about it, the more complex the tasks it can perform. In the short term, this type of visualization could also help researchers better understand what “black box” AI models are learning, says Hajishirzi.

In the future, the team plans further experiments to improve the quality of the generated images and to expand the model's visual and linguistic vocabulary to more topics, objects, and adjectives.

"The imaging was really a missing piece of the puzzle," says Lu. "By making this possible, we can make the model learn better representations to represent the world."


Steven Gregory