Author: Bryan Walter ([email protected]). Last modified: 2023-05-21 22:24
American engineers have created headphones that recognize facial expressions. Each earpiece carries a camera that captures the side of the face. By combining these frames, a machine learning algorithm reconstructs a model of the face with high accuracy and can even recognize some words spoken without sound. A paper describing the development will be presented at the UIST 2020 conference.
Facial recognition is used not only in research but also in everyday user tasks. For example, iOS offers Animoji avatars that look like cartoon characters and accurately mimic the user's facial expressions. And NVIDIA recently proposed transmitting, during video calls, not a video stream but only a map of facial key points, which is then used to animate a photo of the other party.
Modern algorithms can build a map of facial key points very accurately in real time, even on smartphones. But they need a video camera, which means that with a smartphone the device must be held in front of the face the whole time, which is not always convenient. Engineers led by Cheng Zhang of Cornell University have come up with an unusual and convenient way to build real-time facial key-point maps: headphones with cameras.
The engineers built two prototypes: on-ear headphones and separate in-ear earbuds. They differ mainly in the camera modules and the distance from the skin (1.5 centimeters for the earbuds and 2.5 centimeters for the on-ear version). The cameras are positioned to capture the side of the face from the mouth to the eyes. In its current form, the prototype sends data over a wire, first to a Raspberry Pi and then to a powerful computer for processing.
Frames from both cameras first undergo preprocessing: everything outside the face is cropped out of the image, which is then binarized and filtered to obtain the facial contour. The frames from the two sides of the face are fed to a ResNet-18 convolutional neural network, and the resulting feature vector is passed to a fully connected regression network that produces two sets of facial key points (one per half of the face). In the final step, the two point maps are joined into a map of the whole face with 42 points.
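The preprocessing and merge stages described above can be sketched in a few lines of Python. This is an illustrative outline only: the crop coordinates, binarization threshold, and per-side point counts are assumptions for the example, not the authors' actual parameters.

```python
# Sketch of the described pipeline stages: crop the face region,
# binarize the frame, and merge the two per-side key-point sets.

def crop(gray, top, left, height, width):
    """Cut out the face region; everything outside it is discarded."""
    return [row[left:left + width] for row in gray[top:top + height]]

def binarize(gray, threshold=128):
    """Turn a grayscale image (rows of 0-255 ints) into a 0/1 mask."""
    return [[1 if px >= threshold else 0 for px in row] for row in gray]

def merge_halves(left_points, right_points):
    """Join the per-side key-point lists into one 42-point face map."""
    assert len(left_points) + len(right_points) == 42
    return left_points + right_points

# Toy 4x4 "frame": crop a 2x2 region, then binarize it.
frame = [
    [10, 200, 30, 40],
    [50, 220, 250, 60],
    [70, 80, 90, 100],
    [110, 120, 130, 140],
]
mask = binarize(crop(frame, 0, 1, 2, 2))
print(mask)  # [[1, 0], [1, 1]]

# Two hypothetical 21-point halves merge into the full 42-point map.
full_map = merge_halves([(0.0, 0.0)] * 21, [(1.0, 1.0)] * 21)
print(len(full_map))  # 42
```

In the actual system, the merged map is produced by the regression network rather than by simple concatenation; the sketch only mirrors the data flow.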
Figure: scheme of the neural network.
The developers trained the algorithm on raw footage from the two headphone cameras paired with annotated footage from a camera placed in front of the participants' faces. As a result, the algorithm learned to produce fairly accurate face maps from two side views. The root mean square positioning error across all points is 0.77 and 0.74 millimeters for the earbuds and the on-ear headphones, respectively, and 1.43 and 1.39 millimeters for the 20 main points. The authors also built a separate model that reconstructs point maps for faces covered by masks with comparable accuracy.
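The reported positioning error is a root-mean-square distance between predicted and ground-truth key points. A minimal sketch of that metric, with made-up coordinates purely for illustration:

```python
# RMS Euclidean distance over corresponding key-point pairs.
# The result is in the same units as the input coordinates.
import math

def rmse(predicted, ground_truth):
    squared = [
        (px - gx) ** 2 + (py - gy) ** 2
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return math.sqrt(sum(squared) / len(squared))

pred = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
true = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
print(round(rmse(pred, true), 3))  # 0.816
```

Evaluated over all 42 points (or the 20 main ones) in millimeters, this is how figures like 0.77 mm could be obtained, though the paper's exact evaluation protocol may differ.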
As a demonstration, the authors taught a smartphone app to send stickers matching the emotions read by the headphones, as well as to switch songs via silent voice commands.
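Conceptually, such a demo boils down to dispatching a recognized expression or silent-command label to an action. The sketch below is not the authors' app; the labels and actions are assumptions invented for the example.

```python
# Hypothetical dispatch table: recognized label -> app action.
ACTIONS = {
    "smile": lambda: "send sticker: \U0001F600",
    "next": lambda: "player: next track",
    "previous": lambda: "player: previous track",
}

def handle(label):
    """Run the action for a recognized label; ignore unknown labels."""
    action = ACTIONS.get(label)
    return action() if action else "ignored"

print(handle("next"))   # player: next track
print(handle("frown"))  # ignored
```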
Recently, another group of engineers taught ordinary wireless headphones to recognize finger gestures on the skin around the ear. Their method uses a microphone, so it could potentially work with many headphone models without modification.