Time flies; it is already December. This has been a year of explosive growth for AI, and next year will likely see even more.
In the meantime, OpenVINO 2025.4 has been released, most likely the last release of the year.
What’s new
- More GenAI coverage and framework integrations to minimize code changes
  - New models supported:
    - On CPUs & GPUs: Qwen3-Embedding-0.6B, Qwen3-Reranker-0.6B, Mistral-Small-24B-Instruct-2501.
    - On NPUs: Gemma-3-4b-it and Qwen2.5-VL-3B-Instruct.
  - Preview: Mixture of Experts (MoE) models optimized for CPUs and GPUs, validated for Qwen3-30B-A3B (a minimal run of this preview is sketched after the list below).
  - GenAI pipeline integrations: Qwen3-Embedding-0.6B and Qwen3-Reranker-0.6B for enhanced retrieval/ranking, and Qwen2.5-VL-7B for video pipelines (see the embedding sketch after the list).
- Broader LLM model support and more model compression techniques
  - Gold support for Windows ML* enables developers to deploy AI models and applications effortlessly across CPUs, GPUs, and NPUs on Intel® Core™ Ultra processor-powered AI PCs.
  - The Neural Network Compression Framework (NNCF) ONNX backend now supports INT8 static post-training quantization (PTQ) and INT8/INT4 weight-only compression to ensure accuracy parity with OpenVINO IR format models. SmoothQuant algorithm support has been added for INT8 quantization (a short NNCF sketch follows the list).
  - Accelerated multi-token generation for GenAI, leveraging optimized GPU kernels to deliver faster inference, smarter KV-cache reuse, and scalable LLM performance.
  - GPU plugin updates include improved performance with prefix caching for chat-history scenarios and enhanced LLM accuracy with dynamic quantization support for INT8 (see the prefix-caching sketch after the list).
- More portability and performance to run AI at the edge, in the cloud or locally
  - Announcing support for Intel® Core™ Ultra Processor Series 3.
  - Encrypted blob format support has been added for secure model deployment with OpenVINO™ GenAI. Model weights and artifacts are stored and transmitted in an encrypted format, reducing the risk of IP theft during deployment. Developers can deploy with minimal code changes using OpenVINO GenAI pipelines.
  - OpenVINO™ Model Server and OpenVINO™ GenAI now extend support for agentic AI scenarios with new features such as output parsing and improved chat templates for reliable multi-turn interactions, plus preview functionality for the Qwen3-30B-A3B model. OVMS also introduces a preview for audio endpoints. (A tool-calling client sketch follows the list.)
  - NPU deployment is simplified with batch support, enabling seamless model execution across Intel® Core™ Ultra processors while eliminating driver dependencies. Models are reshaped to batch_size=1 before compilation.
  - The improved NVIDIA Triton Server* integration with the OpenVINO backend now enables developers to utilize Intel GPUs or NPUs for deployment.
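
To try the Qwen3-30B-A3B MoE preview, the usual OpenVINO GenAI flow applies: export the model once, then load it with LLMPipeline. This is a minimal sketch under my own assumptions (output directory name, weight format, and device choice), and MoE support is still a preview in this release.

```python
import openvino_genai

# Assumed export step, run once beforehand:
#   optimum-cli export openvino --model Qwen/Qwen3-30B-A3B --weight-format int4 qwen3-30b-a3b-ov
pipe = openvino_genai.LLMPipeline("qwen3-30b-a3b-ov", "GPU")  # "CPU" also works per the notes

config = openvino_genai.GenerationConfig()
config.max_new_tokens = 128

print(pipe.generate("Explain mixture-of-experts models in one paragraph.", config))
```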
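For the retrieval/ranking integration, OpenVINO GenAI provides a text-embedding pipeline. The sketch below assumes a Qwen3-Embedding-0.6B model already exported to OpenVINO format; the TextEmbeddingPipeline class and embed_documents/embed_query methods follow recent GenAI documentation, so check the 2025.4 samples for the exact API.

```python
import openvino_genai

# Assumed export step:
#   optimum-cli export openvino --model Qwen/Qwen3-Embedding-0.6B qwen3-embedding-0.6b-ov
embedder = openvino_genai.TextEmbeddingPipeline("qwen3-embedding-0.6b-ov", "CPU")

docs = ["OpenVINO 2025.4 adds new GenAI pipelines.", "An unrelated sentence about cooking."]
doc_vectors = embedder.embed_documents(docs)                      # one vector per document
query_vector = embedder.embed_query("What is new in OpenVINO 2025.4?")
```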
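The NNCF ONNX backend item means weight-only compression can now be applied directly to an ONNX file instead of converting to OpenVINO IR first. A minimal sketch, assuming a local model.onnx and INT4 symmetric weight compression:

```python
import onnx
import nncf

# Load the ONNX model exported from your framework of choice (the path is an assumption).
model = onnx.load("model.onnx")

# Weight-only compression applied to the ONNX graph;
# INT8 weight compression would use CompressWeightsMode.INT8_ASYM instead.
compressed = nncf.compress_weights(model, mode=nncf.CompressWeightsMode.INT4_SYM)

onnx.save(compressed, "model_int4.onnx")
```

INT8 static PTQ for the ONNX backend follows the usual nncf.quantize(model, nncf.Dataset(...)) flow with a small calibration dataset.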
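The prefix-caching improvement for chat-history scenarios is exercised through the GenAI scheduler configuration. Treat this as a sketch only: the SchedulerConfig / enable_prefix_caching usage is an assumption based on the continuous-batching samples, and the model path is hypothetical.

```python
import openvino_genai

scheduler_config = openvino_genai.SchedulerConfig()
scheduler_config.enable_prefix_caching = True  # reuse KV cache for a shared chat-history prefix

pipe = openvino_genai.LLMPipeline("llama-3-8b-instruct-ov", "GPU", scheduler_config=scheduler_config)

history = "System: You are a helpful assistant.\nUser: Hi!\nAssistant: Hello!\n"
# Repeated calls that share the same history prefix can skip recomputing its KV cache.
print(pipe.generate(history + "User: What changed in 2025.4?", max_new_tokens=64))
```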
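On the Model Server side, the agentic-AI features surface through the OpenAI-compatible API, so a standard client can send tool definitions and receive parsed tool calls back. The endpoint URL, port, and served model name below are assumptions for illustration.

```python
from openai import OpenAI

# OVMS exposes an OpenAI-compatible API; the base_url and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="Qwen3-30B-A3B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo right now?"}],
    tools=tools,
)

# With output parsing enabled on the server, the tool call comes back in structured form.
print(response.choices[0].message.tool_calls)
```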