Huawei’s Computing Systems Lab in Zurich has announced a new open-source quantization technique for large language models (LLMs) called Sinkhorn-Normalized Quantization (SINQ). The method aims to cut memory requirements while maintaining output quality. The research team has released the code on GitHub and Hugging Face under an Apache 2.0 license, allowing organizations to use and modify it freely.
SINQ is reported to reduce memory usage by 60-70% across various model architectures, enabling LLMs that previously required over 60 GB of memory to run on setups with roughly 20 GB. This opens the door to deployment on consumer-grade hardware, such as the Nvidia GeForce RTX 4090, which is significantly less expensive than high-end enterprise GPUs like the A100 and H100.
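The memory figures above follow from simple weight-size arithmetic. The sketch below is a back-of-envelope estimate covering weights only (it ignores activations and KV cache); the parameter count and the effective bits per weight, including scale-factor overhead, are illustrative assumptions, not figures from the article.

```python
# Back-of-envelope weight-memory math (weights only).
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone: params * bits / 8 bytes."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

fp16_gb = weight_memory_gb(30, 16)  # a ~30B-parameter model at FP16: 60.0 GB
sinq_gb = weight_memory_gb(30, 5)   # assumed ~4-bit weights plus scale overhead
reduction = 1 - sinq_gb / fp16_gb   # ~0.69, in line with the reported 60-70%
```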
The cost benefits are also notable for cloud users. Instances based on A100 GPUs typically cost $3–4.50 per hour, while 24 GB GPUs such as the RTX 4090 are available for $1–1.50 per hour. For sustained workloads, that gap compounds into substantial savings and makes LLM deployment feasible on smaller or less powerful infrastructure.
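To put those hourly rates in perspective, here is a quick illustrative calculation using the midpoints of the ranges above; actual cloud pricing varies by provider, region, and commitment terms.

```python
# Illustrative monthly cost comparison; rates are midpoints of the
# article's ranges, not quotes from any specific provider.
a100_rate = 3.75      # $/hr, midpoint of the $3-4.50 range
rtx4090_rate = 1.25   # $/hr, midpoint of the $1-1.50 range
hours = 24 * 30       # one month of continuous serving

monthly_savings = (a100_rate - rtx4090_rate) * hours  # per GPU
```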
Quantization, which reduces the numeric precision of model weights, usually comes at a cost in model quality, particularly at lower bit widths. SINQ aims to address this without requiring calibration data or inter-layer dependencies. It employs dual-axis scaling, meaning separate scale factors along the rows and columns of each weight matrix, together with a Sinkhorn-Knopp-style normalization that alternately rescales rows and columns, improving quantization quality compared to existing methods.
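To make the idea concrete, here is a minimal sketch of dual-axis scaling with a Sinkhorn-Knopp-style iteration: alternately normalize the row and column spread of a weight matrix, quantize the normalized matrix to 4 bits, and keep the row/column scales for dequantization. This is not Huawei's released implementation; the function names, iteration count, and the use of standard deviation as the balancing statistic are assumptions for illustration.

```python
import numpy as np

def sinkhorn_normalize(W, iters=10, eps=1e-8):
    """Alternately divide out row and column standard deviations
    (a Sinkhorn-Knopp-style iteration), accumulating dual-axis scales."""
    M = W.astype(np.float64).copy()
    row_s = np.ones((M.shape[0], 1))
    col_s = np.ones((1, M.shape[1]))
    for _ in range(iters):
        r = M.std(axis=1, keepdims=True) + eps
        M /= r
        row_s *= r
        c = M.std(axis=0, keepdims=True) + eps
        M /= c
        col_s *= c
    return M, row_s, col_s

def quantize_int4(M):
    """Symmetric round-to-nearest 4-bit quantization with one scale."""
    scale = np.abs(M).max() / 7.0
    Q = np.clip(np.round(M / scale), -8, 7).astype(np.int8)
    return Q, scale

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))                    # stand-in weight matrix
M, row_s, col_s = sinkhorn_normalize(W)
Q, s = quantize_int4(M)
W_hat = (Q.astype(np.float64) * s) * row_s * col_s  # dequantize
mean_err = np.abs(W - W_hat).mean()
```

Because the normalization evens out the per-row and per-column spread before rounding, no single outlier row or column forces a coarse quantization grid on the rest of the matrix, which is the intuition behind dual-axis scaling.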
The effectiveness of SINQ has been validated across various models and benchmarks, demonstrating consistent performance enhancements. Its compatibility with non-uniform quantization schemes and integration with calibration methods further broadens its applicability. Huawei has positioned SINQ as a valuable tool for developers and researchers aiming to deploy large models efficiently on less powerful systems, with future enhancements anticipated, including pre-quantized models.
Source: https://venturebeat.com/ai/huaweis-new-open-source-technique-shrinks-llms-to-make-them-run-on-less

