
Technical Revolution – On-Device AI
By Jakob Laschober on 21.01.2026
Introduction
On-device artificial intelligence (AI) runs directly on hardware such as smartphones, laptops, and wearables rather than in the cloud. This makes it more independent: it can be used on the go without an internet connection. That is just one of its advantages. Another is the privacy benefit of local processing: no data leaves the device, and there is no need to store user or other sensitive data on remote servers. Local AI also offers lower latency than cloud models. Because inference happens on the device itself, responses can be very fast (depending, of course, on the hardware), and the per-request costs that a cloud model would inevitably incur disappear.
Another use case for local models is assistance through “ambient computing”, where the device anticipates user needs based on on-screen context. An example is Magic Cue on Android 16, which runs locally and offers relevant actions based on the user’s current activity.
Technical Background:
Deploying Large Language Models (LLMs) locally requires overcoming significant hardware constraints: limited RAM, battery life, and thermal limits. To run models larger than the device’s physical RAM, frameworks keep the weights in flash storage and dynamically load only the “active” weights needed for the current token generation into RAM. This allows devices with limited RAM (e.g., 8 GB) to run larger models. In addition, modern chips ship with dedicated Neural Processing Units (NPUs), such as the one in the Snapdragon 8 Gen 3, which squeeze out extra performance so that even bigger, more complex models can run, and which deliver higher token-per-second rates. Google famously designed its Tensor chips around a Tensor Processing Unit (TPU), which yields even better performance and efficiency for its own models across the board.
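To make the flash-offloading idea concrete, here is a minimal Kotlin sketch of the underlying mechanism: memory-mapping a weights file so the operating system pages in only the regions that are actually read. The file name, offset, and slice size are hypothetical, and real frameworks add caching, alignment, and prefetching on top of this.

```kotlin
import java.io.RandomAccessFile
import java.nio.channels.FileChannel

fun main() {
    // Hypothetical weights file, far larger than available RAM.
    RandomAccessFile("model.weights", "r").use { file ->
        // Map the file read-only: this reserves address space
        // but commits no physical RAM yet.
        val weights = file.channel.map(
            FileChannel.MapMode.READ_ONLY, 0L, file.channel.size()
        )

        // Read only the slice holding the "active" weights for the
        // current step; the OS faults in just these pages (~4 MB),
        // not the whole multi-gigabyte file.
        val layerOffset = 0            // hypothetical layer offset
        val layerBytes = 4 * 1024 * 1024
        val active = ByteArray(layerBytes)
        weights.position(layerOffset)
        weights.get(active)
        // ... run the layer computation on `active`, then move on ...
    }
}
```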
Quantization
But how do you fit large models onto such a small device within these hardware constraints? The answer is quantization, a technique that compresses a model to a much smaller size, much like compressing a folder or file into a zip file on your PC. AI models are normally trained in a format called FP32, where each parameter requires 4 bytes. Do the maths for a 7-billion-parameter model (such as Llama 3 or Qwen): 7 billion x 4 bytes = 28 GB of RAM. Most smartphones only have 8 or 12 GB of RAM in total, so even with some swapping to flash storage, the model simply will not fit. Quantization converts these models from FP32 down to INT8 or INT4. At INT4, each parameter takes only half a byte, so the same 7-billion-parameter model shrinks to about 3.5 GB, which fits comfortably alongside the rest of the system. Going from 32-bit to 4-bit usually costs very little intelligence, often less than a 1% drop in accuracy. You can also think of the process as compressing an image: most of the detail remains, and only the finest details are lost.
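As an illustration, here is a minimal Kotlin sketch of symmetric INT8 quantization, the basic scale-and-round step behind these savings. Production toolchains add per-block scales, calibration, and outlier handling; this only shows the core idea.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

// Quantize FP32 weights to INT8 with a single symmetric scale:
// the largest magnitude maps to 127, everything else scales linearly.
fun quantizeInt8(weights: FloatArray): Pair<ByteArray, Float> {
    val maxAbs = weights.maxOf { abs(it) }.coerceAtLeast(1e-8f)
    val scale = maxAbs / 127f
    val quantized = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return quantized to scale
}

// Recover an approximation of the original weights. Each value now
// costs 1 byte instead of 4, a 4x saving before any INT4 tricks.
fun dequantize(quantized: ByteArray, scale: Float): FloatArray =
    FloatArray(quantized.size) { i -> quantized[i] * scale }

fun main() {
    val weights = floatArrayOf(0.12f, -0.5f, 0.9f, -0.03f)
    val (q, scale) = quantizeInt8(weights)
    println(dequantize(q, scale).joinToString()) // close to the originals
}
```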
Software and Ecosystem:
Local AI is supported by a software stack designed to abstract hardware complexities.
Google’s Gemini Nano is a prime example of a multimodal model developed specifically for on-device efficiency. It comes in different sizes to match different devices and their performance capabilities. Frameworks such as LiteRT-LM let developers build custom LLM pipelines that run on Android, iOS and web platforms. LiteRT-LM is relatively low-level, however, and not commonly used directly. The ML Kit GenAI API is the more common entry point for this purpose, leveraging the same local models. The stack offers native APIs for Kotlin/Java, Swift and JavaScript, so the same model can be deployed across platforms. There are also dedicated options for each ecosystem.
Android
The ML Kit GenAI API is probably the most straightforward and user-friendly toolset for Android developers in particular. It completely abstracts away model management by interfacing with Android AICore, and because it uses the built-in Gemini Nano model, there is no need to bundle a large model with your app. It offers pre-built APIs for common tasks, so there is no need to design prompts: summarising text passages, proofreading, rewriting, and generating smart replies. If you need more flexibility than these specific tasks offer, the GenAI Prompt API lets you send natural-language requests directly to Gemini Nano.
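As a sketch of what this looks like in practice, the snippet below wires up the ML Kit GenAI Summarization API in Kotlin. It follows the builder pattern from Google’s public documentation, but treat the exact class and option names as assumptions to verify against the current SDK; a real app would also handle the feature-download states reported by the client before running inference.

```kotlin
// Sketch only: names follow the ML Kit GenAI docs, but verify them
// against the SDK version you actually use.
import com.google.mlkit.genai.summarization.Summarization
import com.google.mlkit.genai.summarization.SummarizationRequest
import com.google.mlkit.genai.summarization.SummarizerOptions
import kotlinx.coroutines.guava.await

suspend fun summarize(context: android.content.Context, article: String): String {
    val options = SummarizerOptions.builder(context)
        .setInputType(SummarizerOptions.InputType.ARTICLE)
        .setOutputType(SummarizerOptions.OutputType.ONE_BULLET)
        .setLanguage(SummarizerOptions.Language.ENGLISH)
        .build()

    // The client talks to AICore / Gemini Nano; no model ships with the app.
    val summarizer = Summarization.getClient(options)

    val request = SummarizationRequest.builder(article).build()
    // runInference returns a ListenableFuture per the docs; await it.
    return summarizer.runInference(request).await().summary
}
```

Note that the Prompt API mentioned above follows the same pattern, only with a free-form natural-language request instead of a task-specific options builder.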
Web
For web applications, Google is integrating high-level AI capabilities directly into the Chrome browser. Similar to the Android approach, this allows web developers to access on-device Gemini Nano models through JavaScript APIs without requiring the user to download heavy libraries or models for each site.
iOS
With the Foundation Models framework (iOS 26+), developers can use system-level APIs to access Apple’s on-device foundation models for text generation and embeddings, removing the need for third-party libraries.
While Google promotes the ML Kit GenAI API for Gemini, the stack explicitly supports “other open models” and “open-weight models” such as Mistral, Llama or Gemma. ML Kit simply abstracts the complexity of the low-level LiteRT-LM engine, making it accessible to more developers.
Conclusion:
On-device AI represents a critical step forward for user sovereignty. By processing data locally on dedicated NPUs, we eliminate the need to send sensitive information to remote servers, solving the twin challenges of latency and privacy in one go. While tasks requiring immense computational power will remain in the cloud, the “thinking” that happens on our smartphones is getting smarter, faster, and more personal. For the end user, this means a device that is not just a portal to the internet, but a truly intelligent assistant that works anywhere, anytime.
References:
- https://www.applivery.com/blog/news/google-pixel-10/
- https://www.researchgate.net/profile/Dwith-Chenna/publication/373329073_EDGE_AI_QUANTIZATION_AS_THE_KEY_TO_ON-DEVICE_SMARTNESS/links/64e627b60453074fbda950d6/EDGE-AI-QUANTIZATION-AS-THE-KEY-TO-ON-DEVICE-SMARTNESS.pdf
- https://arxiv.org/html/2512.06490v1
- https://developers.googleblog.com/on-device-genai-in-chrome-chromebook-plus-and-pixel-watch-with-litert-lm/
- https://nearform.com/digital-community/ai-beyond-the-cloud-the-current-and-future-state-of-on-device-generative-ai/