What do a Tesla on the streets of San Francisco, a Roomba in your living room, Stryker's Mako performing surgery, and Boston Dynamics' Spot climbing the stairs have in common?
They are all robots that, like humans and animals, sense, interact with, and affect the world around them. The robotic stack behaves very much like any living organism: 1) it senses the world through cameras or microphones, 2) it thinks, processing the sensory input to understand the world, and 3) it acts to interact with and affect its surroundings. A robot typically loops over 1-3 until it finishes what it has set out to do, whether that is driving to Starbucks, cleaning the living room, or performing orthopedic surgery.
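To make the loop concrete, here is a minimal sketch of the sense-think-act cycle in Python; the sensor reading, the decision rule, and the `goal_reached` check are all hypothetical stand-ins rather than any real robot's stack.

```python
# Illustrative sense-think-act loop (hypothetical components, not a real robot stack).
import random

def sense():
    # 1) Sense: sample the world, e.g. a camera frame or a microphone buffer.
    return {"obstacle_distance_m": random.uniform(0.0, 5.0)}

def think(observation):
    # 2) Think: turn raw sensory input into a decision-ready estimate.
    return "stop" if observation["obstacle_distance_m"] < 1.0 else "go"

def act(decision):
    # 3) Act: affect the world (here we just print the chosen command).
    print(f"executing: {decision}")

def run(goal_reached, max_steps=10):
    # Loop over 1-3 until the task is done (drive to Starbucks, clean the room, ...).
    for step in range(max_steps):
        decision = think(sense())
        act(decision)
        if goal_reached(step):
            break

run(goal_reached=lambda step: step >= 3)
```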
Perception deals with sensing and analyzing the parts of the environment that are relevant for downstream tasks. It is responsible for producing an accurate, robust, and efficient representation of the world that planning and control can use in applications such as surgery or driving.
The contemporary view across industry is that modern perception is propelled by large datasets, since its algorithms are learning-based. This is an incomplete view, and one that holds back the innovation of the next generation of perceptual systems. While large datasets are one key ingredient of innovation and of the future ubiquity of robots, an over-reliance on them leads to diminishing returns. The next generation of robots will have to perceive around obstacles, see deeper inside our brains and bodies, and make the right decisions in ever more safety-critical and uncertain environments.
While the contemporary view across industry is that modern perception is propelled by large datasets due to its reliance on learning-based methods, this is an incomplete view.
We define two types of functions: the Compressor and the De-Compressor. A Compressor reduces the dimensionality from one space to another, and a De-Compressor increases it. For example, a sensor, which we can abstract away as the Sensing Function, is a Compressor because it samples the world through a particular lens (excuse the pun). Say you are at a loud Rolling Stones concert with a film camera: the Sensing Function samples the concert onto 35mm film, with no audio or video. The Sensing Reality is a compressed version of the space of all Reality.
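As a toy illustration of the two roles (the dimensions and the "concert" vector below are made up, and the simulator is just a zero-padding stand-in), a Compressor maps a large space to a smaller one and a De-Compressor does the opposite:

```python
# Illustrative Compressor / De-Compressor (toy dimensions, not a real sensor model).
import numpy as np

rng = np.random.default_rng(0)

# "All Reality": a high-dimensional state (light, sound, motion, ... at the concert).
reality = rng.normal(size=1000)

def sensing_function(world):
    # Compressor: a film camera keeps a small visual slice and discards audio entirely.
    return world[:100]          # 1000-dim reality -> 100-dim sensing reality

def simulator(observation):
    # De-Compressor: expands a compressed observation back into a (plausible, not
    # necessarily accurate) higher-dimensional world, e.g. hallucinating the audio.
    return np.concatenate([observation, np.zeros(900)])

sensed = sensing_function(reality)        # what the film camera captured
reconstructed = simulator(sensed)         # a guess at the full scene

print(reality.shape, sensed.shape, reconstructed.shape)   # (1000,) (100,) (1000,)
```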
Current Perception Paradigm
The current Perception Paradigm works by sampling the Sensing Reality to create large datasets, which are then used to design learned and rule-based algorithms. The first reason this paradigm is limiting is that algorithms are fundamentally bounded by what exists in the sensing reality. While sampling efficiency can be addressed with a data-engine-esque setup, no amount of sampling can recover what the Sensing Function has already discarded. An extreme example would be designing an algorithm that predicts the song that was playing at the Rolling Stones concert from the photographs alone. You could do it using cues like who is singing or what the stage looks like, but wouldn't it be easier to just use a microphone in the first place?
The second reason this paradigm is limiting is more subtle: simulation is used only for data augmentation, not as a playground for exploration. It is impossible to capture every scenario that could occur in the real world, let alone capture each one enough times. To account for this, the current perception paradigm focuses on augmenting large datasets by simulating rare and under-sampled scenarios. The limitation is that simulation is treated purely as a tool to augment large datasets rather than as a playground for exploration. Going back to the concert example, I can augment my film-camera dataset by simulating more examples of people singing on stage, but what if I could simulate the same scenario and capture it differently, this time with a sequence of photographs plus a microphone? Lastly, as imaging systems move from rule-based toward learning-based, the paradigm will run into problems of interpretability and validation.
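Here is a sketch of the distinction, with an entirely hypothetical simulated scene and sensors: augmentation draws more samples through the sensing function you already have, while the playground view re-captures the same scene through a different one.

```python
# Illustrative: one simulated scene, two different capture strategies (toy model).
import numpy as np

rng = np.random.default_rng(1)

def simulate_concert_scene():
    # A simulated scene carries every channel we modelled, not just what one sensor sees.
    return {"video": rng.normal(size=(4, 32, 32, 3)),   # a few frames
            "audio": rng.normal(size=16000)}            # one second of sound

def film_camera(scene):
    # Data-augmentation use: more samples through the sensing function we already have.
    return {"frames": scene["video"]}

def camera_plus_microphone(scene):
    # Playground use: try a *different* sensing function on the same simulated scene.
    return {"frames": scene["video"], "waveform": scene["audio"]}

scene = simulate_concert_scene()
print(film_camera(scene).keys())            # dict_keys(['frames'])
print(camera_plus_microphone(scene).keys()) # dict_keys(['frames', 'waveform'])
```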
Current perceptual systems are limited by what exists in the sensing reality, by the lack of simulation as a playground, and by a lack of attention to interpretability.
Ingredients to Modern Perception
There are four major ingredients to modern perception: Sensing, Simulation, Algorithms, and Interpretability. Each ingredient acts as a Compressor or a De-Compressor from one reality to the next, with the goal of representing the world in an efficient, accurate, and robust way for downstream applications. While every ingredient is necessary for the next generation of perceptual systems, the interplay between them matters just as much, because each ingredient complements and contrasts with the others.
Sensing deals with what can be captured and is the first compression of reality. The Sensing Function is a Compressor and dictates what is physically captured and what is ignored. What the robot senses greatly limits its capabilities, because it can only process information that was captured. For example, if your sensor is an RGB camera, then the sensing reality is composed of all possible images that sensor can generate. This reality, or manifold, is then sampled in the real world to create large datasets. Compare trying to see inside the brain with a regular RGB camera versus an MRI machine: the latter sees through the skull while the former does not.
Simulation serves as a De-Compressor of reality and, here, aims to simulate and capture reality in novel ways that are useful to our system, in addition to its familiar role of augmenting real-world datasets. We can use simulation to understand how the capture of reality affects the algorithm, for example how sensor placement affects learning. The simulated reality may not be accurate, but it is vaster than the sensing reality: a floating car does not exist in the real world but can be simulated. Modern perceptual systems must employ simulation to understand what to capture, and how, in order to solve the task.
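As a rough example of the playground idea, a simulation loop could sweep candidate sensor placements and measure how each affects a downstream metric; the placements, the rendering stand-in, and the scoring function below are all invented for illustration.

```python
# Illustrative sensor-placement sweep in simulation (toy scoring, hypothetical placements).
import numpy as np

rng = np.random.default_rng(2)

def simulate_capture(placement_height_m):
    # Pretend the simulator renders what a camera mounted at this height would see.
    return rng.normal(loc=placement_height_m, size=64)

def downstream_task_score(capture):
    # Stand-in for "train the perception model and evaluate it"; here just a toy score
    # that happens to prefer captures centered near 1.2 m.
    return -abs(float(np.mean(capture)) - 1.2)

placements = [0.5, 1.0, 1.2, 1.8, 2.5]     # candidate mounting heights in meters
scores = {h: downstream_task_score(simulate_capture(h)) for h in placements}
best = max(scores, key=scores.get)
print(f"best simulated placement: {best} m")
```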
Algorithms are another way to compress reality. They are the primary means of distilling the largely irrelevant information in large datasets into a small, useful, and actionable space. These are the algorithms researchers have developed over the years in optics, imaging, vision, machine learning, and so on. Academia often focuses on this ingredient: the first neural-network-driven self-driving car was proposed with ALVINN in 1988, using 30x32 images. Today, we can do much more with the same images.
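As a toy, ALVINN-flavoured sketch of compression by an algorithm (this is not the original architecture, weights, or training setup), a low-resolution image is squeezed down to a handful of features and finally to a single actionable steering command:

```python
# Toy ALVINN-flavoured compressor: a low-resolution image in, a steering command out.
# (Illustrative only; untrained random weights, not the original network or procedure.)
import numpy as np

rng = np.random.default_rng(3)

image = rng.random((30, 32))                      # a low-resolution road image
W1 = rng.normal(scale=0.01, size=(30 * 32, 16))   # randomly initialised weights
W2 = rng.normal(scale=0.01, size=(16, 3))         # three outputs: left, straight, right

hidden = np.tanh(image.reshape(-1) @ W1)          # compress 960 pixels into 16 features
steering_scores = hidden @ W2                     # compress further into 3 actionable choices
command = ["left", "straight", "right"][int(np.argmax(steering_scores))]
print(command)
```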
Interpretability is an important ingredient and is the only way to understand the reasoning behind an algorithm's outputs. The Interpreter Function de-compresses the algorithm's weights, actions, and outputs into a reality that humans can verify and interpret. The reason this differs from verification is that formal verification of learned algorithms is currently not possible, so we must rely on empirical evidence and validation of the algorithm. Since we cannot validate the algorithm on every possible scenario, our insight into the algorithm is limited by the space we do validate.
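One simple empirical tool, sketched here with a fake one-line "model", is occlusion probing: perturb the input and watch how the output moves, which de-compresses the decision into evidence a human can inspect. This is only one of many possible interpretability techniques, not the specific Interpreter Function described above.

```python
# Illustrative occlusion probe: which image regions does the (fake) model rely on?
import numpy as np

rng = np.random.default_rng(4)
image = rng.random((8, 8))

def model(img):
    # Stand-in for a learned algorithm: scores how "bright" the top-left corner is.
    return float(img[:4, :4].mean())

baseline = model(image)
importance = np.zeros_like(image)
for i in range(8):
    for j in range(8):
        occluded = image.copy()
        occluded[i, j] = 0.0                            # hide one pixel
        importance[i, j] = baseline - model(occluded)   # how much the score dropped

# A human can now verify that the model indeed attends to the top-left region.
print(np.round(importance, 3))
```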
There are four major ingredients to modern perception: Sensing, Simulation, Algorithms, and Interpretability. Each acts as a Compressor or a De-Compressor from one reality to the next, with the goal of creating an efficient, accurate, and robust representation of the world for the desired task.
To build and deploy the next generation of perceptual systems, which can see around obstacles, image the brain, and still fit in your pocket, each of these ingredients must be considered. Companies operating in this domain must revisit the role each ingredient plays in their stack, as the ingredients are co-dependent and even complementary. For example, a better sensor reduces the burden on the algorithms because the sensing reality is richer. Likewise, a focus on the simulation ingredient enables a diverse set of solutions that can adapt to changing budget and resource constraints.
Companies operating in this domain must revisit the role each ingredient plays in their stack, as these ingredients are co-dependent and even complementary. For example, a better sensor means less overhead on the algorithms and large datasets, because the sensing reality is richer.
— —
Questions, Comments, Feedback? Drop an email at ktiwary@mit.edu