Recently, the "Large-Scale Complex Scene Human Video Analysis" challenge, hosted as part of the ACM MM'20 Grand Challenge, came to a close. Hundreds of teams, including Amazon, Tencent, Dahua Technology, YITU Technology, and Sun Yat-sen University, competed on the analysis of more than 56,000 complex human-behavior events (including queuing, fighting, bending over, walking together, running, and lingering). YITU Technology of China took first place in "Track-4: Behavior Recognition."

ACM is the world's largest professional society in computer science, and its A.M. Turing Award is widely regarded as the Nobel Prize of the field. ACM MM is a top international conference in multimedia and is rated a Class A international conference by the China Computer Federation (CCF).

If facial recognition is the "general outpatient clinic" of a hospital, then behavior recognition, and human behavior recognition in particular, is at least as complex and demanding as "cardiology plus neurology." Scenes are complex and changeable, actions vary widely between individuals, and systems must capture both continuous actions and long-duration actions. These factors pose major challenges for behavior recognition and analysis, demanding algorithms that can analyze and reason accurately about the behavior itself, and even generalize to scenes never seen before.

YITU reported that its algorithm scored wf-mAP@avg 0.26 in the competition, nearly triple the academic benchmark. Unlike international competitions that have run for many editions, this challenge was being held for the first time: before it began, participating teams did not know the recognition categories, the size of the dataset, or the specific recognition requirements, and they had to design the best possible algorithm in just over a month.
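The wf-mAP@avg metric cited here belongs to the mean-average-precision family. The challenge's exact weighting scheme is not described in the article, but the underlying idea can be sketched with plain, unweighted average precision per class, averaged over classes (all function names below are illustrative):

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision values observed at each
    rank where a true positive is retrieved, scanning by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            precisions.append(true_positives / rank)
    return sum(precisions) / max(true_positives, 1)

def mean_average_precision(per_class):
    """mAP: the unweighted mean of per-class average precisions.
    `per_class` is a list of (scores, binary_labels) pairs, one per class."""
    aps = [average_precision(scores, labels) for scores, labels in per_class]
    return sum(aps) / len(aps)
```

For example, a class whose ranked predictions hit at ranks 1 and 3 yields AP = (1/1 + 2/3) / 2 ≈ 0.83; the challenge's "wf" prefix presumably adds a frame- or category-weighting on top of this basic scheme.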

First, behavior recognition in video is more complex than in still images: how to model the temporal correlations between video frames remains an open research problem. Once the application scenario is fixed, so that the object of analysis is known to be the human body and the target categories are clear, the algorithm can be optimized specifically, improving performance and solving problems that previously resisted good solutions.

At the same time, YITU innovatively combined the algorithm with the scene. On one hand, it automatically extracted accurate, rich scene information from the video and, together with advanced pedestrian detection and pedestrian re-identification algorithms, fully modeled the relations between people, between people and the scene, and between people and objects. On the other hand, drawing on years of accumulated algorithm work and an understanding of industry scenarios, YITU deeply optimized for the 14 specific tasks required by the competition.
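The article does not detail how these person/scene relations were built, but one simple way to attach scene context to detections is to tag each person box with the scene region it overlaps most. Everything below (region names, the IoU-based assignment) is a hypothetical toy illustration, not YITU's method:

```python
def person_scene_context(person_boxes, scene_regions):
    """Assign each detected person the scene region whose bounding box
    overlaps them most, by intersection-over-union. Boxes are
    (x1, y1, x2, y2) tuples; both arguments are dicts keyed by id/name."""
    def iou(a, b):
        ax1, ay1, ax2, ay2 = a
        bx1, by1, bx2, by2 = b
        inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = ((ax2 - ax1) * (ay2 - ay1)
                 + (bx2 - bx1) * (by2 - by1) - inter)
        return inter / union if union else 0.0
    return {pid: max(scene_regions, key=lambda name: iou(box, scene_regions[name]))
            for pid, box in person_boxes.items()}
```

Knowing that a person stands in a "queue area" rather than a "doorway", for instance, is the kind of scene cue that can disambiguate behaviors like queuing versus lingering.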

Notably, unlike other participating teams, YITU did not use a complex multi-model fusion strategy; it used only a single model, relying on background extraction and segmentation algorithms to combine behavior analysis with the scene, greatly reducing the difficulty of the problem. This also means there is still headroom to improve performance by fusing multiple models.
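The multi-model fusion left on the table here most commonly means late fusion: averaging (or weighted-averaging) the per-class score vectors of several independently trained models. A minimal sketch under that assumption:

```python
def fuse_scores(model_scores, weights=None):
    """Late fusion: weighted average of per-class score vectors produced
    by several models. With no weights given, every model contributes
    equally, i.e. plain score averaging."""
    n = len(model_scores)
    weights = weights if weights is not None else [1.0 / n] * n
    num_classes = len(model_scores[0])
    return [sum(w * scores[c] for w, scores in zip(weights, model_scores))
            for c in range(num_classes)]
```

Ensembles like this typically buy a modest accuracy gain at the cost of running every model on every clip, which is why a single-model result is a notable engineering point.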
