Over the past decade, artificial intelligence (AI) has been the core driving force behind the intelligent transformation of the security industry, injecting new vitality and opportunities. As the wave of deep learning swept through various sectors, security became a pioneer in AI application, and with the advent of the large model era, the security industry once again finds itself at the forefront of technological application. Today, numerous security companies are focusing on key technologies such as multimodal large models, working to drive their industrial applications and help various industries achieve leaps in digitalization and intelligence.

The China Security magazine, under the China Security & Protection Industry Association, conducted an in-depth interview with Alex Duan, President of YITU TECH., exploring the application practices, current status, challenges, and future development trends of large models in the intelligent security industry.


Large Vision Language Models (LVLMs) Empowering a New Era in Intelligent Security

The security industry has been a trailblazer in the use of AI and LVLMs. The industry first evolved from high-definition surveillance to intelligent security, or AI Security 1.0. In this stage, technologies such as facial recognition, person re-identification (ReID), video structuring, and the structuring of motorized and non-motorized vehicles became key innovations. As security products spread across more fields, demand for long-tail algorithms grew. Traditional deep learning models, which rely on supervised learning, face many constraints in complex scenarios, so despite recent progress in AI security, practical applications have not fully met expectations. Now, with the arrival of the large model era, we have entered AI Security 2.0. Multimodal large models based on the Transformer architecture are overcoming the fragmentation of the traditional security industry, and they show three main features:

  1. "Thoughtful": LVLMs no longer function merely as algorithms or tools; they take on the characteristics of assistants and intelligent agents. By watching a video, these models can accurately recognize its content, turning machine vision into intuitive algorithms and bringing revolutionary change to the industry.

  2. "Conversational": Interacting with an LVLM feels more like communicating with another person. Users can search for videos through semantic queries or voice commands. For example, simply saying "Please pull up videos of locations with water accumulation" prompts the system to respond quickly and display all relevant video clips. This greatly improves command and dispatch efficiency, saving valuable time in decision-making and coordination.

  3. "Evolvable": An intelligent system that cannot evolve with user needs and environmental changes is just a tool, not true intelligence. A truly intelligent system can evolve on its own in response to change. For example, YITU’s QuestMind™ supports on-site algorithm training, allowing rapid iteration and optimization based on real-world needs. A new algorithm can achieve zero-shot cold start in one minute, complete online annotation training in one hour, and be deployed in one day, demonstrating unprecedented intelligence and flexibility.
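
The semantic search described under "Conversational" can be illustrated with a minimal sketch. It is not YITU's implementation: the clip IDs, the descriptions, and the embed, cosine, and search functions are all hypothetical, and a bag-of-words vector stands in for the embeddings that a real vision-language model would compute from video frames and query text. The sketch only shows the general idea of ranking clips by similarity to a natural-language query.

```python
import math
from collections import Counter

# Hypothetical clip index. In a real system each clip would carry an embedding
# produced by a vision-language model; a short text description stands in here.
CLIPS = {
    "cam_03_0815": "street underpass with water accumulation after rain",
    "cam_11_0902": "dry intersection and light traffic",
    "cam_27_0840": "parking lot entrance with water accumulation near gate",
}

def embed(text):
    """Toy stand-in for a multimodal encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def search(query, clips, min_score=0.2):
    """Return clip IDs ranked by similarity to the query, best match first."""
    q = embed(query)
    scored = [(cosine(q, embed(desc)), clip_id) for clip_id, desc in clips.items()]
    return [clip_id for score, clip_id in sorted(scored, reverse=True) if score >= min_score]

results = search("locations with water accumulation", CLIPS)
print(results)
```

With these toy descriptions, the two clips that mention water accumulation are returned and the dry intersection is filtered out by the assumed similarity threshold.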

In summary, LVLMs unify visual and linguistic models, integrating the underlying frameworks of the physical world and the cognitive world. This allows multimodal information to be converted seamlessly and represented without bias, opening up more possibilities for human-machine interaction, product iteration, and service operation models. The shift of AI 2.0 towards production safety and smart operations, driven by data and computing power, is the future direction of the security industry.


Challenges and Solutions in Data, Algorithms, and Computing Power

"Data, algorithms, and computing power" are the three key elements of AI, and each poses challenges when implementing large models in the security industry. First, data is the foundation of AI, but much of it currently sits dormant on hard drives without being effectively used. Existing video structuring technologies have limited data mining capabilities and fail to meet the needs of refined management and recognition. Multimodal large models are crucial to closing this gap. They can identify a wide range of objects in videos, whether cats, dogs, plastic bags, park bridges, or even gas tanks on electric bikes, awakening dormant data and providing richer information for the security field.

Second, algorithms are the brain of AI, and the need for them stems from business requirements, not from ideas in vendor laboratories. Algorithm development and application must therefore be closely integrated with real business environments. QuestMind™ provides on-site algorithm training and responds quickly to the needs of refined management. A new algorithm can be deployed quickly while keeping data protected in line with legal requirements and meeting the timeliness the business demands, so that algorithms adapt rapidly to change.

Third, the cost of computing power is a key factor in the scalability of large AI models. High computing costs currently limit the widespread application of large models. YITU TECH. has addressed this by optimizing performance. Its industry-first xPU fusion architecture servers virtualize low-cost memory into a unified address space with graphics memory, resulting in a ten-thousand-fold performance improvement and a hundred-fold reduction in costs.

In summary, addressing the challenges of "data, algorithms, and computing power" in large model implementation requires solutions such as multimodal large models, on-site algorithm training, and hyper-converged architecture for software-hardware optimization. These approaches help drive the practical application of large models in security and improve the efficiency and usability of AI in the field.


Progress of Large Model Implementation

Thanks to the deep content understanding, broad adaptability, and natural human-machine interaction capabilities of large models, their application in intelligent security is advancing rapidly. As a veteran in the AI field, YITU TECH. launched the first QuestMind™ in July 2023. This model is now deployed in dozens of projects nationwide, showing great potential in areas such as video semantic search, object recognition, AI agent orchestration, and zero-shot cold start algorithms. Significant progress has been made in public safety, smart city construction, intelligent traffic, content moderation, smart parks, and emergency response, where demand for video analysis, behavior recognition, and real-time response is growing.

For example, in the second half of 2023, after a dangerous dog attack in a western province, the city operation center faced the challenge of quickly developing a detection algorithm for dangerous dogs and deploying it in urban public areas. With conventional deep learning methods, data collection, annotation, and training would have taken at least two weeks, severely delaying the response. LVLM-based algorithm training simplified this process. The pre-trained large model base achieved nearly 70% accuracy from the start, and within just five days user corrections raised accuracy above 90%. This ability to produce new algorithms on-site with speed and flexibility helped ensure public safety.

This strategy, led by user demand and technological innovation, is the key force behind the development of AI 2.0. With continuous technological advancement, large models in intelligent security are expected to play a critical role in more specialized markets and complex scenarios, especially in areas requiring high personalization and dynamic adaptation.


Promising Future for Large Models in Intelligent Security

The intelligent security industry is on the brink of a breakthrough in the development of large models. With further advances in multimodal large models, security systems are moving from traditional visual surveillance to deeper content understanding, scene adaptability, and human-machine interaction. The future of intelligent security will place greater emphasis on the integration of data and computing power, driving the transformation from traditional security to production safety management and smart operations. Guided by these trends, YITU TECH. will focus on the deep integration of technological innovation and product implementation in this new wave of AI. It will strengthen the combination of multimodal large model technologies with domain expertise to create products that are more industry-aware, customer-oriented, and user-friendly. The aim is to help "AI+" land faster across industries, open up new possibilities for artificial intelligence, and usher in a new era of video situational understanding.
