Neural Networks’ Unique Perceptions: Decoding Machine vs. Human Sensory Recognition




A recent study of deep neural networks finds that although these models can recognize objects much as human sensory systems do, their recognition works differently from human perception. When asked to generate stimuli that they treat the same as a particular input, the networks typically produced distorted or unintelligible images and sounds.

This suggests that neural networks develop their own unique "invariances," distinct from human perception. The study offers a new way to assess models that aim to replicate human sensory experience.

Main Details:

When asked to generate stimuli that they respond to in the same way as a given target, deep neural networks frequently produce sounds or images that bear little resemblance to that target.

The models perceive stimuli differently from humans: they appear to acquire invariances that are distinct from those of human perceptual systems.

Adversarial training can make the models' generated stimuli more recognizable to humans, although they still do not exactly match the original inputs.

Source: MIT

Our sensory systems can still identify an object quite well when it is upside down, or a word when it is spoken by a voice we have never heard before.

Deep neural networks are computational models that can be trained to do similar tasks, such as recognizing a spoken word regardless of the speaker's voice, or a picture of a dog regardless of its fur color. But according to recent research by neuroscientists at MIT, these models often respond in the same way to words or images that have nothing in common with the target.

When asked to generate an image or a word that they responded to in the same way as a specific natural input, such as a picture of a bear, most of these neural networks produced images or sounds that were unrecognizable to human observers. This implies that these models develop their own peculiar "invariances": they respond identically to stimuli with very different properties.

Josh McDermott, an associate professor of brain and cognitive sciences and a member of MIT's McGovern Institute for Brain Research and Center for Brains, Minds, and Machines, says the findings offer a new way to assess how well these models replicate the structure of human sensory perception.

According to McDermott, the study's senior author, "This paper shows that you can use these models to derive unnatural signals that end up being very diagnostic of the representations in the model. This test ought to be included in the battery of tests that our field uses to assess models."

The lead author of the open-access paper, published today in Nature Neuroscience, is Jenelle Feather, PhD '22, a research fellow at the Flatiron Institute Center for Computational Neuroscience. The paper's other authors are MIT graduate student Guillaume Leclerc and computer scientist Aleksander Mądry, the Cadence Design Systems Professor of Computing at MIT.

Various perspectives

In recent years, researchers have trained deep neural networks that can analyze millions of inputs (sounds or images) and learn common features that let them classify a target word or object with roughly the same accuracy as humans. These models are currently considered the leading models of biological sensory systems.

The human sensory system is thought to perform this type of categorization by learning to disregard features that aren't relevant to an object's core identity, such as how much light falls on it or the angle from which it is viewed. This is referred to as invariance: objects are perceived as the same despite differences in those less significant features.

"Traditionally, the way we have conceptualized sensory systems has been that they accumulate invariances to all those sources of variation that various instances of the same object may possess," adds Feather. "An organism needs to understand that, despite appearing as very different sensory signals, they are the same thing."
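As a toy illustration of this idea (an assumption for this article, not code from the study), a representation can be made invariant to overall brightness by dividing out the stimulus's intensity:

```python
import numpy as np

def representation(image):
    # Toy representation that divides out overall intensity,
    # making it invariant to uniform changes in illumination.
    return image / np.linalg.norm(image)

image = np.array([0.2, 0.5, 0.9])   # a tiny three-pixel "image"
brighter = 3.0 * image              # the same scene under stronger light

# Very different sensory signals, identical representation.
print(np.allclose(representation(image), representation(brighter)))  # True
```

Every scaled copy of the image maps to the same representation, so brightness is one dimension of the stimulus space that this toy system is invariant to.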

The researchers hypothesized that similar invariances might emerge in deep neural networks trained to perform classification tasks. To test this, they used these models to generate stimuli that elicit the same response within the model as an example stimulus the researchers had given the model.

They refer to these stimuli as "model metamers," reviving an idea from classical perception research: a system's invariances can be diagnosed with stimuli the system cannot distinguish. The term metamer was first introduced in human perception research to describe colors that look identical even though they are composed of different wavelengths of light.
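The generation procedure can be sketched in miniature. This is a hedged illustration, not the study's implementation: it uses a hypothetical one-stage linear model in place of a deep network, and gradient descent drives a random starting stimulus until the model stage's activations match those of a reference stimulus.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))  # toy linear "model stage": 8-d stimulus -> 4 activations

def stage(x):
    return W @ x

def make_metamer(x_ref, steps=20000):
    """Gradient-descend a random stimulus until its stage
    activations match those of the reference stimulus."""
    target = stage(x_ref)
    x = rng.standard_normal(8)                 # random initial stimulus
    lr = 1.0 / np.linalg.norm(W, 2) ** 2       # step size from the spectral norm
    for _ in range(steps):
        grad = W.T @ (stage(x) - target)       # gradient of 0.5 * ||Wx - target||^2
        x -= lr * grad
    return x

x_ref = rng.standard_normal(8)
x_met = make_metamer(x_ref)

# Activations are matched, but the stimuli themselves differ:
print(np.abs(stage(x_met) - stage(x_ref)).max())  # ~0: a metamer for this model
print(np.linalg.norm(x_met - x_ref))              # clearly nonzero
```

Because the stage maps an 8-dimensional stimulus to only 4 activations, many different stimuli produce identical activations, and the optimization generally lands on one far from the reference: that gap is the model's invariance made visible.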

To their surprise, the researchers found that most of the images and sounds generated in this way had nothing in common with the examples given to the models. Most of the images were a jumble of random-looking pixels, and the sounds resembled unintelligible noise. When researchers showed the images to human observers, in most cases the observers did not classify the model-generated images in the same category as the original target example.

"Humans can hardly recognize them at all. They don't sound or appear natural, and they lack distinguishable characteristics that a person could use to classify an object or word," says Feather.

The results imply that the models have, in some way, developed their own invariances, distinct from those of the human perceptual system. As a result, the models treat pairs of stimuli as the same even when they differ greatly from one another.

The researchers found the same effect across many other visual and auditory models. However, each of these models appeared to develop its own distinct invariances: when metamers generated from one model were shown to a second model, they were as unrecognizable to the second model as they were to human observers.

According to McDermott, "The important conclusion is that these models appear to have what we call idiosyncratic invariances. They have learned to be invariant to particular dimensions of the stimulus space, and those invariances are specific to each model, so other models do not share them."
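The cross-model comparison can be illustrated with the same kind of toy linear stage used above (a hypothetical setup for illustration, not the study's models): a direction that one model is invariant to, i.e. a direction in its null space, changes the stimulus without changing that model's activations, yet an independently built second model still registers the change.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 8))  # toy stage of "model 1"
W2 = rng.standard_normal((4, 8))  # toy stage of an independent "model 2"

# Any null-space direction of W1 is an invariance of model 1:
# moving along it leaves model 1's activations untouched.
_, _, Vt = np.linalg.svd(W1)
null_dir = Vt[-1]                 # W1 @ null_dir is (numerically) zero

x_ref = rng.standard_normal(8)
x_met = x_ref + 3.0 * null_dir    # a "metamer" for model 1

print(np.abs(W1 @ x_met - W1 @ x_ref).max())  # ~0: identical to model 1
print(np.abs(W2 @ x_met - W2 @ x_ref).max())  # large: model 2 sees a difference
```

Because the two random stages have different null spaces, an invariance of one is idiosyncratic: the other model, like a human observer, does not share it.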

The researchers also found that they could make a model's metamers more recognizable to humans with a technique known as adversarial training. This method was originally developed to address another weakness of object recognition models: that tiny, nearly imperceptible changes to an image can cause the model to misidentify it.

The researchers found that adversarial training, which involves including some of these slightly altered images in the training data, produced models whose metamers were more recognizable to humans, though still not as recognizable as the original stimuli. They note that this improvement appears to be unrelated to the training's effect on the models' robustness to adversarial attacks.
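A minimal sketch of the adversarial-training loop, under stated assumptions: a linear logistic classifier stands in for the recognition model, and fast-gradient-sign perturbations stand in for the "slightly altered" training images. All names and parameters here are illustrative, not from the study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy separable data: two classes of 2-d points.
X = np.vstack([rng.normal([+2.0, 0.0], 0.3, (20, 2)),
               rng.normal([-2.0, 0.0], 0.3, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

def fgsm(x, w, label, eps):
    # Fast-gradient-sign perturbation: nudge each coordinate by eps
    # in the direction that increases the classifier's loss.
    return x + eps * np.sign(-label * w)

def train(X, y, eps=0.0, epochs=500, lr=0.1):
    """Logistic-regression SGD; with eps > 0, each example is replaced
    by its adversarially perturbed version (adversarial training)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            x_t = fgsm(x_i, w, y_i, eps) if eps > 0 else x_i
            margin = y_i * (w @ x_t)
            # Gradient step on the logistic loss log(1 + exp(-margin)).
            w += lr * y_i * x_t / (1.0 + np.exp(np.clip(margin, -30, 30)))
    return w

w_standard = train(X, y)          # ordinary training
w_robust = train(X, y, eps=0.5)   # adversarial training
```

Both classifiers separate the clean data; the adversarially trained one is additionally fit to a perturbed copy of every example, which is the ingredient the study links to more human-recognizable metamers.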

Feather states, "This specific type of training has a big effect, but we don't really know why it has that effect." "There's room for more research in that area."

The researchers suggest that one helpful way to assess how well a computational model replicates the fundamental structure of human sensory perception systems is to examine the metamers generated by the models.

According to Feather, "you can use this behavioral test on a particular model to see if the invariances are shared between the model and human observers." "It may also be utilized to assess the degree of idiosyncrasy among the invariances in a particular model, potentially revealing avenues for future model improvement."

Model metamers reveal divergent invariances between biological and artificial neural networks.

Deep neural network models of sensory systems are often proposed to learn representational transformations with invariances like those in the brain. To reveal these invariances, we generated "model metamers": stimuli whose activations within a model stage are matched to those of a natural stimulus.

Metamers for state-of-the-art supervised and unsupervised neural network models of vision and audition were often completely unrecognizable to humans when generated from late model stages, suggesting differences between model and human invariances. Targeted model changes improved the human recognizability of model metamers, but the overall human–model discrepancy remained.

The human recognizability of a model's metamers was well predicted by their recognizability to other models, suggesting that models contain idiosyncratic invariances in addition to those required by the task.

Metamer recognizability was dissociated from both adversarial vulnerability and conventional brain-based benchmarks, revealing a distinct failure mode of current sensory models and providing a complementary benchmark for model evaluation.