OpenVINO 2025.4 Release

Time flies, and it is already December. This was a year of explosive growth for AI, and next year will probably see even more.

Now then, OpenVINO 2025.4 has been released. It is most likely the last release of the year.

What’s new

  • More Gen AI coverage and frameworks integrations to minimize code changes
    • New models supported:
      • On CPUs & GPUs: Qwen3-Embedding-0.6B, Qwen3-Reranker-0.6B, Mistral-Small-24B-Instruct-2501.
      • On NPUs: Gemma-3-4b-it and Qwen2.5-VL-3B-Instruct.
    • Preview: Mixture of Experts (MoE) models optimized for CPUs and GPUs, validated for Qwen3-30B-A3B.
    • GenAI pipeline integrations: Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B for enhanced retrieval/ranking, and Qwen2.5-VL-7B for video pipeline.
  • Broader LLM model support and more model compression techniques
    • Gold support for Windows ML* enables developers to deploy AI models and applications effortlessly across CPUs, GPUs, and NPUs on Intel® Core™ Ultra processor-powered AI PCs.
    • The Neural Network Compression Framework (NNCF) ONNX backend now supports INT8 static post-training quantization (PTQ) and INT8/INT4 weight-only compression to ensure accuracy parity with OpenVINO IR format models. SmoothQuant algorithm support added for INT8 quantization.
    • Accelerated multi-token generation for GenAI, leveraging optimized GPU kernels to deliver faster inference, smarter KV-cache reuse, and scalable LLM performance.
    • GPU plugin updates include improved performance with prefix caching for chat history scenarios and enhanced LLM accuracy with dynamic quantization support for INT8.
  • More portability and performance to run AI at the edge, in the cloud or locally
    • Announcing support for Intel® Core™ Ultra Processor Series 3.
    • Encrypted blob format support added for secure model deployment with OpenVINO™ GenAI. Model weights and artifacts are stored and transmitted in an encrypted format, reducing risks of IP theft during deployment. Developers can deploy with minimal code changes using OpenVINO GenAI pipelines.
    • OpenVINO™ Model Server and OpenVINO™ GenAI now extend support for Agentic AI scenarios with new features such as output parsing and improved chat templates for reliable multi-turn interactions, and preview functionality for the Qwen3-30B-A3B model. OVMS also introduces a preview for audio endpoints.
    • NPU deployment is simplified with batch support, enabling seamless model execution across Intel® Core™ Ultra processors while eliminating driver dependencies. Models are reshaped to batch_size=1 before compilation.
    • The improved NVIDIA Triton Server* integration with OpenVINO backend now enables developers to utilize Intel GPUs or NPUs for deployment.
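
Of the items above, prefix caching for chat history (from the GPU plugin updates) is probably the easiest to picture with a little code. The sketch below is not OpenVINO's implementation and does not use any OpenVINO API. It is a plain-Python illustration of the general idea, assuming the cache is keyed on token ids: when a new chat turn starts with the same tokens as the previous one, only the new suffix needs to be computed. The function names and token ids are made up for illustration.

```python
from typing import List, Tuple


def shared_prefix_len(cached: List[int], new: List[int]) -> int:
    """Number of leading token ids that two sequences have in common."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


def plan_prefill(cached: List[int], new: List[int]) -> Tuple[int, List[int]]:
    """Return (tokens reusable from the cache, tokens that still need computing)."""
    reused = shared_prefix_len(cached, new)
    return reused, new[reused:]


# Turn 1: system prompt + first user message (hypothetical token ids).
turn1 = [1, 15, 27, 42, 8, 99]
# Turn 2: the same history, followed by the assistant reply and a new user message.
turn2 = turn1 + [5, 6, 7, 120, 121]

reused, to_compute = plan_prefill(turn1, turn2)
print(f"reused {reused} cached tokens, computing {len(to_compute)} new ones")
```

In a long chat the shared history grows with every turn while the new suffix stays short, which is why this kind of reuse helps in chat scenarios in particular.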