My goal in this blog post is to provide the beginner in computer vision with a framework for thinking about a future robot perception system. I intend to identify the features and functions such a system must have. Furthermore, I will list some questions we have to answer on our way to actualizing a robust perception system.
We must first define perception, or seeing. In the context of visual perception, seeing is the process of using sensors and computation to construct a useful model of a scene by exploiting the pattern of interactions between light and the scene. This model has been referred to by various names, such as Spatial memory or Spatial AI. This survey paper considers the development of such a representation or model an open problem. The word "useful" here means that the model must help the robot "survive" by providing a few essential services. In turn, "survive" means that a robot can satisfactorily engage with a dynamic environment by avoiding obstacles, interacting with humans and accomplishing whatever task or mission it chose or was assigned. Just as a cryptographic cipher should provide confidentiality in order to be valuable to its users, the Spatial model constructed by the perception process must provide the following navigational services:
Information Collection: Have I been here before? What are the objects in the environment? Some tasks under Information Collection are:
- Object Recognition
- Object Tracking
- Place Recognition
- Data Association
- Loop Closure
Attention: What should I pay attention to? There is possibly too much information being gathered by the sensors. The robot model must perform some form of compression. The key task here is:
- Feature Extraction
Reasoning: Can I take that muddy road and get to my destination? My wheels may get stuck. The provision of a model that can serve as a foundation for reasoning may be the toughest service of all to provide. To fully provide this service, the robot will need to interact with the world and learn about other non-visual aspects of a scene. This is to aid future inference upon being presented with only visual information of an object. Some tasks under this service include:
- Physical Symbol Grounding
Planning: How do I get to my destination?
- Obstacle avoidance
A well-built model that provides the services listed above can be said to answer the core questions of Navigation and SLAM, which are:
- How do I get to where I am going?
- Where am I in the world?
- Where have I been and what are the things around me?
Next, I'll briefly comment on questions to ask if we are to implement each service.
- What visual quantities other than illuminance should we seek to capture from a scene?
- How are these visual quantities related to each other?
- Can we use these visual quantities and their inter-relationships to infer other characteristics of a scene, such as depth, motion and the texture of objects?
- What characteristics of a scene can we acquire using visual sensors? In addition, can we measure these characteristics directly? The developments of Lidar, light field cameras and event-based cameras are positive steps toward answering this question.
We must take inspiration from Biology in developing perception systems. Biological visual systems will hint at strategies and effects we may wish to incorporate into robot perception systems. These effects include lightness constancy, size constancy, shape constancy, parallax and many more.
A lot of work has been done on feature extraction from camera images, leading to techniques such as SIFT, SURF and ORB. Deep Learning has also emerged over the last decade as a possible solution to feature extraction from a scene captured by cameras. Remember, our goal is the construction of a Spatial model or representation of the scene. Hence the important question of how to represent these features remains. What data structures should we use to represent these features?
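One possible answer to the data structure question, as a minimal sketch: assuming binary descriptors in the style of ORB, each feature can be stored as a pixel position plus a bit string, and two features can be compared by Hamming distance. The `Feature` class, the toy 16-bit descriptors and the `max_distance` threshold below are illustrative assumptions, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """One extracted keypoint: pixel position plus a binary descriptor."""
    x: float
    y: float
    descriptor: int  # toy 16-bit descriptor packed into an int (assumption)


def hamming(a: int, b: int) -> int:
    """Number of bit positions where two descriptors differ."""
    return bin(a ^ b).count("1")


def match_features(query, stored, max_distance=4):
    """Brute-force nearest-neighbour matching by Hamming distance.

    Returns (query_index, stored_index, distance) for each query feature
    whose closest stored feature is within max_distance bits.
    """
    matches = []
    for qi, q in enumerate(query):
        best = min(
            ((hamming(q.descriptor, s.descriptor), si) for si, s in enumerate(stored)),
            default=None,
        )
        if best is not None and best[0] <= max_distance:
            matches.append((qi, best[1], best[0]))
    return matches


# Features remembered from an earlier view of the scene.
stored = [
    Feature(10, 20, 0b1111_0000_1111_0000),
    Feature(50, 60, 0b0000_1111_0000_1111),
]
# Features from the current view: the first is a near-duplicate, the second is new.
query = [
    Feature(12, 21, 0b1111_0000_1111_0001),
    Feature(90, 90, 0b1010_1010_1010_1010),
]
matches = match_features(query, stored)
print(matches)
```

A flat list with brute-force matching only scales to small maps; a real system would likely need an index over descriptors, which is part of the open question.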
The reasoning service could quite possibly be the hardest service to correctly implement in a robot. This is because to reason about the world, the robot must solve the Physical Symbol Grounding problem. Symbol grounding is about assigning symbols to information abstracted from sensor data. The symbols represent persistent objects and concepts about the world. So then, what concepts should our model represent?
Philosophy and the Cognitive sciences may offer hints as to what concepts to represent. The notion of Image Schemas as described by Mark Johnson can be a starting point. Mark Johnson describes Image Schemas as structures for organizing recurrent patterns in our experience. Hence, a robot agent has to act in the world to acquire these schemas. For example, the Containment schema is one instance of an image schema. A robot would have to understand that when you put a cookie into a jar and close it, if you move the jar, the cookie inside moves with it.
What mathematical framework or data structure should be used to model these image schemas?
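As one very rough possibility, the Containment schema from the cookie jar example could be written as a tree in which moving a container also moves whatever it holds. This is a hand-coded sketch of what the acquired knowledge might look like, not a way of learning it; the `Thing` class and its methods are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Thing:
    """A physical object with a 2D position and, optionally, contents."""
    name: str
    position: tuple
    contents: list = field(default_factory=list)

    def put_inside(self, other):
        """Place `other` in this container; it takes the container's position."""
        other.position = self.position
        self.contents.append(other)

    def move_to(self, new_position):
        """Move this object and, recursively, everything it contains."""
        self.position = new_position
        for item in self.contents:
            item.move_to(new_position)


jar = Thing("jar", (0.0, 0.0))
cookie = Thing("cookie", (1.0, 1.0))

jar.put_inside(cookie)    # the cookie goes into the jar
jar.move_to((3.0, 4.0))   # moving the jar carries the cookie along
print(cookie.position)
```

The interesting part is what the sketch leaves out: a robot would have to discover the "contents move with the container" rule through its own interaction with the world rather than have it written in by hand.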
Planning addresses the question of how best to reach my destination given a goal and a map. Path Planning may be the service that is closest to being fully solved, at least for autonomous mobile robots, while good progress is being made on Motion Planning in the context of robot manipulators. Path planning is distinct from motion planning, and this StackOverflow answer elucidates the differences.
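To make the path planning side concrete, here is a minimal sketch that assumes the map is a small occupancy grid and the robot moves one cell at a time in four directions. Breadth-first search is used for simplicity; it is one of many possible planners, and the function name and grid are made up for illustration.

```python
from collections import deque


def shortest_path(grid, start, goal):
    """Breadth-first search on a 4-connected occupancy grid.

    grid[r][c] == 1 marks an obstacle, 0 marks free space.
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],  # a wall with a gap on the right
    [0, 0, 0],
]
path = shortest_path(grid, (0, 0), (2, 0))
print(path)
```

The sketch takes the map as given. Building that map from sensor data is exactly the job of the other services.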
I like to look for inspiration in other impressive systems which people have built. Two systems I think a robot perception system can learn from are Database Management systems (DBMS) and Cryptographic systems.
The architecture of a database management system typically comprises a storage engine, a Query Processor and a Transport system. The implementation of a robot perception system may borrow ideas from DBMS by having a storage engine (to store abstracted sensor data) and a reasoning module for understanding and thinking about objects and concepts derived from the model created by the storage engine.
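A toy sketch of that split might look like the following, assuming that "abstracted sensor data" means labelled objects with 2D positions. The class names `SpatialStore` and `QueryProcessor` and the `near` query are hypothetical, and the query layer here only filters by distance; it stands in for a far richer reasoning module.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """One abstracted piece of sensor data: a labelled object at a position."""
    label: str
    x: float
    y: float


class SpatialStore:
    """Storage-engine role: keeps observations, knows nothing about their meaning."""

    def __init__(self):
        self._records = []

    def insert(self, observation):
        self._records.append(observation)

    def scan(self):
        return iter(self._records)


class QueryProcessor:
    """Query-processor role: answers questions by reading from the store."""

    def __init__(self, store):
        self._store = store

    def near(self, x, y, radius):
        """Labels of all stored objects within `radius` of (x, y)."""
        return [
            o.label
            for o in self._store.scan()
            if math.hypot(o.x - x, o.y - y) <= radius
        ]


store = SpatialStore()
store.insert(Observation("door", 1.0, 0.0))
store.insert(Observation("chair", 5.0, 5.0))

nearby = QueryProcessor(store).near(0.0, 0.0, radius=2.0)
print(nearby)
```

Keeping storage and querying separate means either side could be replaced, for example by a spatial index or a richer reasoning layer, without touching the other.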
Likewise, Cryptographic systems can provide a rough mental model for thinking about the navigational services. In cryptography, we have cryptographic primitives which provide cryptographic services. For example, a Hash function is a primitive that helps fulfill Data Integrity. Would it help to construct a perception system in terms of perception primitives that help meet one or more navigational services?
A lot of progress has been made in answering some questions. However, most questions remain either partially solved or completely unsolved. I hope this helps to clarify to the beginner how hard the vision problem is. The novice shouldn't be deterred! The field of robotic vision and perception is wide open. Fresh ideas, techniques and integrations are needed!
Murphy, R. R. (2019). Introduction to AI Robotics. The MIT Press.
Johnson, M. (1987). The Body in the Mind: The Bodily Basis of Meaning, Imagination, and Reason. University of Chicago Press.
Petrov, A. (2019). Database Internals: A Deep Dive into How Distributed Data Systems Work. O'Reilly Media.