Explaining the Science Behind Augmented Reality

Part 3: Neuroscience & AR Design – Vision First: From Features to Action


BY Stefano Baldassi


Natural machines are computers that get intimate with our bodies and our brains. This is the take-home message from the first post of this series. Wearable augmented reality should aim to be a natural match for the brain, so it is important for AR designers to understand how the brain works. We laid this out in five points in the second post of this series.


It’s now time to discover how the brain perceives, interprets space, and interacts with the environment. This will provide developers and designers with solid, common ground to understand the value of the NeuroInterface Principles and translate this science into actionable design.


Perception is the first stage of our active knowledge of the world. If we are planning on delivering natural interfaces, we need to know how people perceive.


In this context, we bring together two concepts that overlap only partially: sensation and perception. By sensation, we mean the way our senses transduce different forms of energy into neural signals. For example, the retina is the sensory organ for vision, and the cochlea is the sensory organ for hearing. By perception, we mean the complex set of operations the brain performs to transform raw sensory input into meaningful information that we interpret and use in our active life. Recognizing a face or an object and planning an action on that object is an act of perception. So is interacting with a digital user interface, or attributing affordance to a digital tool. Affordance means that the appearance of an object is clearly associated with the right action: a bottle opener, for example, should unequivocally afford the action of opening a bottle.


We sense and perceive the world through different channels, but we are, by far, visual animals. In fact, about one third of the human cerebral cortex is involved in vision. In other words, the neurons in these areas will activate if stimulated with visual information.


Functional regions of the cerebral cortex (Pearson Education Inc.)


Understanding how the brain is engineered to transform visual input into recognition and motor interaction may inspire developers and designers to embrace our principles of NeuroInterface Design more deeply and build AR applications that the brain is perfectly equipped for. Therefore, in this and the next blog post I will highlight some general properties of visual perception that should be clear to AR developers expanding digital interfaces into the space around a user. Today I will focus on three low-level properties, and in the next post of this series I will move to higher levels of visual recognition, affordance, and visual action. The visual system is more complex than what I highlight below. For a good primer on the neuroscientific research into human sensation and perception, I recommend this book written by top scientists from Harvard University, the University of California at Berkeley, and Purdue University.



The Visual System Performs Feature Segmentation

Like the deep neural networks that power much of the AI around us, such as the systems that let Google instantly search for places and people in our photos, the visual system classifies complex patterns in a feed-forward manner. It first decomposes them into simpler, more decipherable patterns that are recombined in deeper layers. The retina, like a camera sensor, samples the world with discrete (but specialized) 'pixels' called photoreceptors, the cones and rods, and transmits this information to an incredibly deep and specialized set of neural layers that assemble these building blocks into meaningful objects and, eventually, the visual scene where we live and act.
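As a rough illustration (not the brain's actual wiring), the first feed-forward filtering stage can be sketched as a convolution of the "retinal" image with a small, orientation-selective filter, analogous to an early-vision cell tuned to vertical edges. The image and filter values below are made-up toy numbers:

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 2D cross-correlation (valid mode), one feed-forward
    filtering stage over the input image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A toy "retinal" image: dark left half, bright right half.
image = np.zeros((5, 5))
image[:, 3:] = 1.0

# An orientation-selective filter, loosely analogous to a cell
# tuned to vertical borders (values are illustrative).
vertical_edge = np.array([[-1.0, 0.0, 1.0]] * 3)

response = convolve2d(image, vertical_edge)
print(response)  # strongest responses where the vertical border sits
```

Deeper layers would combine many such local responses into contours, surfaces, and eventually recognizable objects.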


Light moves through the eye and is absorbed by rods and cones (Stagnor).


Spatially confined channels in the visual cortex selectively encode the size of lines and borders, local and global motion patterns, color, stereo disparity, contrast, contour continuity, textural regularities, and so forth. Different parts of the visual field have different sensitivities to certain features (e.g., color, contrast, motion). When this rich set of features is integrated and fits specific 'cognitive models', the visual content achieves recognition and affordance. AR interfaces are distributed in space like any other element of the physical world, so an accurate understanding of how features are segmented across the visual field is not merely suggested; it is needed.


Illustration of the visual field from our visual cortex (Kean)



3D is in the Brain

If we consider the rays of light that hit our retinae at any given moment, they carry no direct information about the three-dimensional structure of space. We sense the visual world with a pair of 2D sensors: the left and right retinae at the back of our eyes. This flat information is mapped in the early modules of the visual cortex in a way that reflects the slightly different horizontal viewpoints the two eyes have on the same object.


This lateral shift of the image, called "retinal disparity", is interpreted by "binocular" neurons in visual areas of the brain that are tuned to different disparities and form the basis of our stereo vision. Objects closer to us produce larger disparities than farther objects, which project in very similar ways onto the two eyes. However, stereo vision provides only partial information about the organization of the space around us and informs us about depth only within a short range. To build a full representation of each object in the environment, the brain must integrate disparity information with a rich set of image cues called pictorial cues to depth. These cues are called pictorial because they were systematically introduced by Renaissance artists to render a sense of space and depth, and they are all monocular (i.e., they work with one eye alone). Occlusion is the most powerful of them: closer objects hide farther objects from view.
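For readers who like numbers, the geometry behind stereo vision can be sketched with the standard triangulation formula Z = f · B / d: depth falls off as disparity grows. The focal length and disparity values below are illustrative assumptions, not physiological measurements; only the ~6.3 cm interocular baseline is a typical human figure:

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Triangulated depth: Z = f * B / d.
    Larger disparity -> closer object; disparity shrinks fast with
    distance, which is why stereo is informative only at short range."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Illustrative numbers (assumptions): a nominal focal length in pixels
# and a ~6.3 cm baseline, roughly the human interocular distance.
f_px = 800.0
baseline = 0.063

near = depth_from_disparity(f_px, baseline, disparity_px=50.0)  # ~1 m
far = depth_from_disparity(f_px, baseline, disparity_px=2.0)    # ~25 m
```

Note how a 25x change in disparity separates one meter from twenty-five: beyond a few meters, disparities become tiny and the brain must lean on pictorial cues instead.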


Italian Renaissance painter Sandro Botticelli's use of pictorial cues in his Cestello Annunciation produced in 1489 (Wolfe, Kluender, and Levi)


Size matters, too: the relative size of two identical objects tells us their relative distance, and we also infer the distance of familiar objects from their apparent size. Parallax and perspective are two additional powerful cues. While they seem more abstract and high-level than stereo, the brain weights them heavily when it comes to building the space around us. Prioritizing stereo alone and excluding these cues can hurt the design of spatialized UIs!
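The familiar-size cue amounts to inverting a pinhole projection: if the brain knows an object's true size, its apparent size reveals its distance. A minimal sketch, where the focal length in pixels is an arbitrary assumption:

```python
def apparent_size_px(real_size_m, distance_m, focal_length_px=800.0):
    """Pinhole projection: apparent size shrinks inversely with distance."""
    return focal_length_px * real_size_m / distance_m

def distance_from_familiar_size(real_size_m, image_size_px,
                                focal_length_px=800.0):
    """Invert the projection: a known ('familiar') size plus an
    apparent size yields an estimate of distance."""
    return focal_length_px * real_size_m / image_size_px

# A 1.7 m tall person seen from 10 m away subtends ~136 px here;
# knowing the person's height lets us recover the 10 m distance.
size_px = apparent_size_px(1.7, 10.0)
est_distance = distance_from_familiar_size(1.7, size_px)
```

This is also why rendering a virtual object at the wrong apparent size for its intended depth makes it feel displaced, even with correct stereo.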



The Visual System Pays Attention!

Have you ever done the Gorilla test? If not, you may want to do it by watching the video below:


There are studies suggesting that at any given moment, our retinae are hit by roughly 10^8 bits of information. By analogy with artificial sensors, we can say that the visual system samples at a rate that varies from about 10 Hz to just under 100 Hz (across the system and the range of stimuli). The output of all this can shrink to a single bit, i.e. a binary decision that must sometimes be made in a fraction of a second (e.g., based on the available visual information, we may turn right or left, or decide to touch button A or button B).
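A back-of-the-envelope calculation shows the scale of this bottleneck. The sampling rate and decision window below are assumptions picked from the ranges in the text:

```python
# All figures are rough assumptions taken from the ranges in the text.
input_bits_per_sample = 10**8   # bits hitting the retinae at any moment
sample_rate_hz = 10             # conservative end of the 10-100 Hz range
decision_window_s = 0.5         # "a fraction of a second"
decision_bits = 1               # one binary choice, e.g. left vs. right

bits_in = input_bits_per_sample * sample_rate_hz * decision_window_s
compression_ratio = bits_in / decision_bits
print(f"roughly {compression_ratio:.0e} : 1 compression")
```

Even with the most conservative numbers, hundreds of millions of input bits must be funneled into a single decision.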


This poses an extraordinary compression problem for our visual and cognitive system, one that evolution solved by equipping us with mechanisms to select relevant information, which neuroscientists broadly call "attentional" mechanisms. Visual, and more generally sensory, attention makes it possible to select only what is relevant to the task at hand and discard what is irrelevant. That is why many of you, like me the first time I did it, may have missed remarkable parts of the scene in the Gorilla test.


Attention is a complex neural and cognitive operation, but in the visual system selection is driven in two main ways, both of which are very instructive for AR developers. The first type of visual attention is called top-down attention. When we are involved in productive, creative, or even entertaining activities, we select from the environment everything that will help us in our task, a targeted subset of the available visual information. In this case, the brain "listens" to our intentions and motivations and enhances everything it knows to be helpful.


Top-down attention at work: The volleyball player's eyes are focused on the ball coming her way, and she is about to leap into action.


Top-down attention is governed by the mechanisms of visual search, that is, by the ability to keep track of and find individual elements in cluttered scenes. The developer's and AR designer's skill lies in understanding how to drive user tasks efficiently by providing visually efficient object segmentation for the relevant elements of the UI, which is best done by knowing how the visual system performs feature segmentation. Another element that will guide (and be guided by) good visual design is object affordance, which I deal with in detail in the next blog post.
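One common way to model top-down visual search, in the spirit of guided-search models, is a priority map: feature maps for the scene are weighted by task relevance, and attention is deployed to the peak. The maps, weights, and the 3x3 "scene" below are toy values of my own invention:

```python
import numpy as np

# Toy feature maps over a 3x3 scene: higher = more of that feature
# at that location. The lone "red" item sits at row 0, column 2.
redness = np.array([[0.1, 0.1, 0.9],
                    [0.1, 0.1, 0.1],
                    [0.1, 0.1, 0.1]])
verticalness = np.array([[0.8, 0.1, 0.1],
                         [0.1, 0.1, 0.1],
                         [0.1, 0.8, 0.1]])

# Top-down attention for the task "find the red item": upweight the
# color channel, downweight orientation (weights are illustrative).
task_weights = {"redness": 1.0, "verticalness": 0.1}

priority = (task_weights["redness"] * redness
            + task_weights["verticalness"] * verticalness)

# Attention goes to the location with the highest priority.
attended = np.unravel_index(np.argmax(priority), priority.shape)
print(attended)  # (0, 2): the red item wins
```

Changing the task (the weights) changes which location wins, which is exactly the sense in which the brain "listens" to our intentions.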


The second type of visual attention is called bottom-up attention. When you are involved in a conversation, or performing a task at your desk, and a camera flash fires, the whole flow breaks and you unavoidably, automatically orient to the source of the flash. Indeed, nature equipped us with mechanisms to quickly break the flow and react to sudden stimuli that may signal danger. The visual and acoustic notifications in our devices are pervasive examples of this type of involuntary attention in user interfaces.


The brain circuits that implement these complementary types of information selection are integrated so that the modules supporting bottom-up attention act as a circuit breaker for the mechanisms of top-down attention. Within a UI, the balance between these two types of attention is largely under the developer's control. As in many of the cases Meta identified in the principles of NeuroInterface Design, knowing the science, especially the neuroscience, helps design tremendously!


Want to learn more about The Science Behind AR? Read the first and second parts in the series:

Part 1: The Natural Machine

In the first installment of the series, Meta Chief Neuroscientist Stefano Baldassi seeks to explain the science behind augmented reality (AR).



Part 2: How Neuroscience Informs AR Design

Continuing the discussion, Baldassi sets the stage for the relationship between neuroscience and AR design. He dives into the fundamentals of neuroscience in this post.

