OpenVINO 2024.0 Release

OpenVINO 2024.0 has been released.
https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/whats-new.html

The OpenVINO™ toolkit version 2024.0 release enhances generative AI accessibility with improved large language model (LLM) performance and expanded model coverage. It also boosts portability and performance for deployment anywhere: at the edge, in the cloud, or locally.

So, here are the official release notes.
It turns out there is a Japanese page as well.

Access to the OpenVINO API from JavaScript is said to be seamless now, so this may open up some interesting possibilities when accessing it directly from a browser.

The following is copied from the release notes.

What’s new

More Generative AI coverage and framework integrations to minimize code changes.

Improved out-of-the-box experience for TensorFlow* sentence encoding models through the installation of OpenVINO™ toolkit Tokenizers.
OpenVINO™ toolkit now supports Mixture of Experts (MoE), a new architecture that helps process more efficient generative models through the pipeline.
JavaScript developers now have seamless access to OpenVINO API. This new binding enables a smooth integration with JavaScript API.
New and noteworthy models validated: Mistral, StableLM-tuned-alpha-3b, and StableLM-Epoch-3B.

Broader Large Language Model (LLM) support and more model compression techniques.

Improved quality on INT4 weight compression for LLMs by adding the popular technique, Activation-aware Weight Quantization, to the Neural Network Compression Framework (NNCF). This addition reduces memory requirements and helps speed up token generation.
Experience enhanced LLM performance on Intel® CPUs, with internal memory state enhancement and INT8 precision for KV-cache, specifically tailored for multi-query LLMs like ChatGLM.
The OpenVINO™ 2024.0 release makes it easier for developers, by integrating more OpenVINO™ features with the Hugging Face* ecosystem. Store quantization configurations for popular models directly in Hugging Face to compress models into INT4 format while preserving accuracy and performance.

More portability and performance to run AI at the edge, in the cloud, or locally.

A preview plugin architecture of the integrated Neural Processing Unit (NPU) as part of Intel® Core™ Ultra processor is now included in the main OpenVINO™ package on PyPI.
Improved performance on ARM* by enabling the ARM threading library. In addition, we now support multi-core ARM platforms and have enabled FP16 precision by default on macOS*.
New and improved LLM serving samples from OpenVINO™ Model Server for multi-batch inputs and Retrieval Augmented Generation (RAG).

OpenVINO™ Runtime

Common

The legacy API for CPP and Python bindings has been removed.
StringTensor support has been extended by operators such as Gather, Reshape, and Concat, as a foundation to improve support for tokenizer operators and compliance with the TensorFlow Hub.
oneDNN has been updated to v3.3 for CPU device and to v3.4 for GPU device targets. (oneDNN release notes: https://github.com/oneapi-src/oneDNN/releases).

CPU Device Plugin

LLM performance on Intel® CPU platforms has been improved for systems based on AVX2 and AVX512, using dynamic quantization and internal memory state optimization, such as INT8 precision for KV-cache. 13th and 14th generations of Intel® Core™ processors and Intel® Core™ Ultra processors use AVX2 for CPU execution, and these platforms will benefit from speedup.
Enable these features by setting "DYNAMIC_QUANTIZATION_GROUP_SIZE":"32" and "KV_CACHE_PRECISION":"u8" in the configuration file.
The “ov::affinity” API configuration is now deprecated and will be removed in release 2025.0.
The following have been improved and optimized:
- Multi-query structure LLMs (such as ChatGLM 2/3) for BF16 on the 4th and 5th generation Intel® Xeon® Scalable processors.
- Mixtral model performance.
- 8-bit compressed LLM compilation time and memory usage, valuable for models with large embeddings like Qwen.
- Convolutional networks in FP16 precision on ARM platforms.
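
The two CPU settings quoted above can also be passed as a plain property dictionary when a model is compiled. This is a minimal sketch, assuming the OpenVINO Python package is installed; the helper names and the model path are hypothetical and are not part of OpenVINO itself.

```python
def llm_cpu_config(group_size: int = 32, kv_cache_precision: str = "u8") -> dict:
    """Build the CPU plugin properties quoted in the release notes."""
    return {
        "DYNAMIC_QUANTIZATION_GROUP_SIZE": str(group_size),
        "KV_CACHE_PRECISION": kv_cache_precision,
    }


def compile_llm_on_cpu(model_path: str, config: dict):
    """Compile a model for the CPU device with the given properties.

    model_path is a placeholder for an OpenVINO IR file such as "model.xml".
    """
    import openvino as ov  # imported lazily so llm_cpu_config() has no dependencies

    core = ov.Core()
    return core.compile_model(model_path, "CPU", config)


if __name__ == "__main__":
    # Show the properties that would be handed to compile_model().
    print(llm_cpu_config())
```

Keeping the dictionary builder separate from the compile call makes it easy to log or adjust the values before they reach the runtime.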

GPU Device Plugin

The following have been improved and optimized:
- Average token latency for LLMs on integrated GPU (iGPU) platforms, using INT4-compressed models with large context size on Intel® Core™ Ultra processors.
- LLM beam search performance on iGPU. Both average and first-token latency decrease may be expected for larger context sizes.
- Multi-batch performance of YOLOv5 on iGPU platforms.
Memory usage for LLMs has been optimized, enabling 7B models with larger context on 16 GB platforms.

NPU Device Plugin (preview feature)

The NPU plugin for OpenVINO™ is now available through PyPI (run “pip install openvino”).
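
After installing the package, one way to check whether the NPU is actually visible is to list the devices that the runtime reports. The sketch below assumes the OpenVINO Python package is installed; `pick_device` and its preference order are hypothetical helpers for illustration, not an OpenVINO API.

```python
from __future__ import annotations


def pick_device(available: list[str],
                preferred: tuple[str, ...] = ("NPU", "GPU", "CPU")) -> str:
    """Return the first preferred device family present in the available list.

    Device names may carry an index suffix such as "GPU.0", so a prefix
    match on "<name>." is accepted as well as an exact match.
    """
    for name in preferred:
        if any(dev == name or dev.startswith(name + ".") for dev in available):
            return name
    raise RuntimeError(f"none of {preferred} found in {available}")


def available_devices() -> list[str]:
    """Ask the OpenVINO runtime which devices it can see."""
    import openvino as ov  # lazy import: only needed when querying real hardware

    return ov.Core().available_devices


if __name__ == "__main__":
    print(pick_device(available_devices()))
```

On a machine without an NPU driver the helper simply falls back to the next entry in the preference order.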

OpenVINO Python API

.add_extension method signatures have been aligned, improving API behavior for better user experience.

OpenVINO C API

ov_property_key_cache_mode (C++ ov::cache_mode) now enables the optimize_size and optimize_speed modes to set/get model cache.
The VA surface exception on Windows* has been fixed.

OpenVINO Node.js API

OpenVINO JS bindings are consistent with the OpenVINO C++ API.
A new distribution channel is now available: Node Package Manager (npm) software registry (check the installation guide).
JavaScript API is now available for Windows* users, as some limitations for platforms other than Linux* have been removed.

TensorFlow Framework Support

String tensors are now natively supported, handled on input, output, and intermediate layers #22024
- TensorFlow Hub universal-sentence-encoder-multilingual inferred out of the box.
- String tensors supported for Gather, Concat, and Reshape operations.
- Integration with openvino-tokenizers module – importing openvino-tokenizers automatically patches TensorFlow Frontend with the required translators for models with tokenization.
Fallback for Model Optimizer by operation to the legacy frontend is no longer available. Fallback by .json config will remain until Model Optimizer is discontinued #21523
Support for the following has been added:
- Mutable variables and resources such as HashTable*, Variable, VariableV2 #22270
- New tensor types: tf.u16, tf.u32, and tf.u64 #21864
- 14 NEW Ops*. Check the list here (marked as NEW).
- TensorFlow 2.15 #22180
The following issues have been fixed:
- UpSampling2D conversion crashed when the input type was int16 #20838
- IndexError list index for Squeeze #22326
- Correct FloorDiv computation for signed integers #22684
- Fixed bad cast error for tf.TensorShape to ov.PartialShape #22813
- Fixed reading tf.string attributes for models in memory #22752

ONNX Framework Support

ONNX* Frontend now uses the OpenVINO API 2.0.

PyTorch Framework Support

Names for outputs unpacked from dict or tuple are now clearer. #22821
FX Graph (torch.compile) now supports kwarg inputs, improving data type coverage. #22397

OpenVINO Model Server

OpenVINO™ Runtime backend used is now 2024.0.
Text generation demo now supports multi batch size, with streaming and unary clients.
The REST client now supports servables based on mediapipe graphs, including python pipeline nodes.
Included dependencies have received security-related updates.
Reshaping a model in runtime based on the incoming requests (auto shape and auto batch size) is deprecated and will be removed in the future. Using OpenVINO’s dynamic shape models is recommended instead.
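
As an illustration of the recommended alternative, a model can be given a dynamic batch dimension before it is served. This is a sketch under stated assumptions: the model has a single input, the batch is the first dimension, and the function names are hypothetical.

```python
def dynamic_batch_shape(static_shape):
    """Return a copy of the shape with the first (batch) dimension set to -1.

    In OpenVINO shapes, -1 marks a dimension as dynamic.
    """
    return [-1, *static_shape[1:]]


def make_batch_dynamic(model_path: str, static_shape):
    """Read a model and reshape its only input to accept any batch size."""
    import openvino as ov  # lazy import: not needed for the shape helper

    core = ov.Core()
    model = core.read_model(model_path)
    # Assumption: the model has exactly one input, so a single shape is enough.
    model.reshape(ov.PartialShape(dynamic_batch_shape(static_shape)))
    return model


if __name__ == "__main__":
    print(dynamic_batch_shape([1, 3, 224, 224]))
```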

Neural Network Compression Framework (NNCF)

The Activation-aware Weight Quantization (AWQ) algorithm for data-aware 4-bit weights compression is now available. It facilitates better accuracy for compressed LLMs with a high ratio of 4-bit weights. To enable it, use the dedicated ‘awq’ optional parameter of the nncf.compress_weights() API.
ONNX models are now supported in Post-training Quantization with Accuracy Control, through the nncf.quantize_with_accuracy_control() method. It may be used for models in the OpenVINO IR and ONNX formats.
A weight compression example tutorial is now available, demonstrating how to find the appropriate hyperparameters for the TinyLlama model from the Hugging Face Transformers, as well as other LLMs, with some modifications.
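
To make the AWQ option mentioned above concrete, the following sketch shows how it might be switched on through nncf.compress_weights(). It assumes NNCF is installed and that the caller supplies calibration samples, since AWQ is data-aware. The ratio and group size values are illustrative only, not recommendations from the release notes, and the helper names are hypothetical.

```python
def awq_options(ratio: float = 0.8, group_size: int = 128) -> dict:
    """Collect keyword arguments for a 4-bit weight compression run with AWQ.

    ratio and group_size are placeholder values chosen for illustration.
    """
    return {"ratio": ratio, "group_size": group_size, "awq": True}


def compress_with_awq(model, calibration_items, **options):
    """Compress model weights to INT4 with AWQ enabled.

    calibration_items is an iterable of model inputs used as the dataset.
    """
    import nncf  # lazy import: only required when compression actually runs

    dataset = nncf.Dataset(calibration_items)
    return nncf.compress_weights(
        model,
        mode=nncf.CompressWeightsMode.INT4_SYM,
        dataset=dataset,
        **options,
    )


if __name__ == "__main__":
    print(awq_options())
```

A call would then look like compress_with_awq(model, samples, **awq_options()).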

OpenVINO Tokenizer

Regex support has been improved.
Model coverage has been improved.
Tokenizer metadata has been added to rt_info.
Limited support for TensorFlow Text models has been added: convert MUSE for TF Hub with string inputs.
OpenVINO Tokenizers have their own repository now: https://github.com/openvinotoolkit/openvino_tokenizers

Other Changes and Known Issues

Jupyter Notebooks

The following notebooks have been updated or newly added:

Mobile language assistant with MobileVLM
Depth estimation with DepthAnything
Kosmos-2
Zero-shot Image Classification with SigLIP
Personalized image generation with PhotoMaker
Voice tone cloning with OpenVoice
Line-level text detection with Surya
InstantID: Zero-shot Identity-Preserving Generation using OpenVINO
Tutorial for Big Image Transfer (BIT) model quantization using NNCF
Tutorial for OpenVINO Tokenizers integration into inference pipelines
LLM chatbot and LLM RAG pipeline have received integration with new models: minicpm-2b-dpo, gemma-7b-it, qwen1.5-7b-chat, baichuan2-7b-chat

Known Issues

Component: PyTorch FE.
ID: N/A
Description: Starting with release 2024.0, model inputs and outputs will no longer have tensor names, unless explicitly set to align with the PyTorch framework behavior.

Component: GPU runtime.
ID: 132376
Description: First-inference latency slowdown for LLMs on Intel® Core™ Ultra processors. Up to 10-20% drop may occur due to radical memory optimization for processing long sequences (about 1.5-2 GB reduced memory usage).

Component: CPU runtime.
ID: N/A
Description: Performance results (first token latency) may vary from those offered by the previous OpenVINO version, for “latency” hint inference of LLMs with long prompts on Intel® Xeon® platforms with 2 or more sockets. The reason is that all CPU cores of just the single socket running the application are employed, lowering the memory overhead for LLMs when NUMA control is not used.
Workaround: the behavior is expected but stream and thread configuration may be used to include cores from all sockets.
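
One possible form of the stream and thread configuration mentioned in the workaround is sketched below. It assumes the standard NUM_STREAMS and INFERENCE_NUM_THREADS properties and uses os.cpu_count() as a stand-in for the total core count, which reports logical CPUs and may therefore need adjusting. The helper name is hypothetical, and whether this setting is appropriate depends on the platform.

```python
import os


def all_socket_config(total_threads=None, streams: int = 1) -> dict:
    """Build CPU properties that request threads across the whole machine.

    total_threads defaults to os.cpu_count(), which counts logical CPUs
    (an assumption made for this sketch; physical core counts may differ).
    """
    if total_threads is None:
        total_threads = os.cpu_count() or 1
    return {
        "PERFORMANCE_HINT": "LATENCY",
        "NUM_STREAMS": str(streams),
        "INFERENCE_NUM_THREADS": str(total_threads),
    }


if __name__ == "__main__":
    # The resulting dictionary can be passed to Core.compile_model(..., "CPU", config).
    print(all_socket_config())
```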

Deprecation and Support

Using deprecated features and components is not advised. They are available to enable a smooth transition to new solutions and will be discontinued in the future. To keep using Discontinued features, you will have to revert to the last LTS OpenVINO version supporting them.

For more details, refer to the OpenVINO Legacy Features and Components page.

Discontinued in 2024.0:

Runtime components:
- Intel® Gaussian & Neural Accelerator (Intel® GNA). Consider using the Neural Processing Unit (NPU) for low-powered systems like Intel® Core™ Ultra or 14th generation and beyond. 
- OpenVINO C++/C/Python 1.0 APIs (see 2023.3 API transition guide for reference).
- All ONNX Frontend legacy API (known as ONNX_IMPORTER_API)
- ‘PerformanceMode.UNDEFINED’ property as part of the OpenVINO Python API
Tools:
- Deployment Manager. See installation and deployment guides for current distribution options.
- Accuracy Checker.
- Post-Training Optimization Tool (POT). Neural Network Compression Framework (NNCF) should be used instead.
- A Git patch for NNCF integration with huggingface/transformers. The recommended approach is to use huggingface/optimum-intel for applying NNCF optimization on top of models from Hugging Face.
- Support for Apache MXNet, Caffe, and Kaldi model formats. Conversion to ONNX may be used as a solution.

Deprecated and to be removed in the future:

The OpenVINO™ Development Tools package (pip install openvino-dev) will be removed from installation options and distribution channels beginning with OpenVINO 2025.0.
Model Optimizer will be discontinued with OpenVINO 2025.0. Consider using OpenVINO Model Converter (API call: OVC) instead. Follow the model conversion transition guide for more details.
OpenVINO property Affinity API will be discontinued with OpenVINO 2025.0. It will be replaced with CPU binding configurations (ov::hint::enable_cpu_pinning).
OpenVINO Model Server components:
- Reshaping a model in runtime based on the incoming requests (auto shape and auto batch size) is deprecated and will be removed in the future. Using OpenVINO’s dynamic shape models is recommended instead.
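
For the Model Optimizer deprecation listed above, conversion through the newer Python entry points might look like the sketch below. It assumes the OpenVINO Python package is installed and that the source file is in a format the converter accepts (for example ONNX). The helper names and paths are hypothetical.

```python
from pathlib import Path


def ir_path_for(source: str) -> Path:
    """Derive the OpenVINO IR (.xml) path that sits next to the source model."""
    return Path(source).with_suffix(".xml")


def convert_to_ir(source: str) -> Path:
    """Convert a model file and save it as OpenVINO IR."""
    import openvino as ov  # lazy import: only needed for the real conversion

    ov_model = ov.convert_model(source)
    target = ir_path_for(source)
    ov.save_model(ov_model, str(target))
    return target


if __name__ == "__main__":
    print(ir_path_for("model.onnx"))
```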

Legal Information

You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein.

You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at http://www.intel.com/ or from the OEM or retailer.

No computer system can be absolutely secure.

Intel, Atom, Arria, Core, Movidius, Xeon, OpenVINO, and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

Other names and brands may be claimed as the property of others.

For more complete information about compiler optimizations, see our Optimization Notice.

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.

Satoshi Masuda

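After working in industrial image-processing equipment development,
game console development, and semiconductor engineering,
I now work as a web engineer and marketer.
My favorite area is roughly the boundary between hardware and software.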