Optimizing AI with Small Language Models (SLMs) for On-Device Applications
In the fast-moving world of AI and language models, it’s easy to think that bigger is always better. However, smaller language models (SLMs) can often offer a more strategic and effective entry point for businesses and enterprises that are just starting to explore this groundbreaking technology. From cost-efficiency to enhanced customization and easy deployment, SLMs are a powerful tool for businesses looking to harness the potential of AI without breaking the budget or sacrificing control.
So, what is an SLM?
A small language model (SLM) is a type of artificial intelligence model designed for natural language processing (NLP) tasks. Unlike their larger counterparts, such as GPT, Llama, and Gemini, SLMs are optimized for efficiency and reduced computational requirements. They achieve this by using fewer parameters, streamlined architectures, and techniques like knowledge distillation or quantization. This makes them ideal for deployment on resource-constrained devices or in situations where low latency is critical.
Where can SLMs be used?
The compact nature and efficiency of SLMs open up a wide array of use cases:
- On-Device NLP: SLMs can be directly embedded in smartphones, smart speakers, or even wearables, enabling features like voice assistants, text prediction, and on-the-fly translation without relying on cloud connectivity.
- Edge Computing: In scenarios where data privacy or network limitations are a concern, SLMs can process information locally on edge devices, reducing the need to send sensitive data to the cloud.
- Real-time Applications: Tasks demanding swift responses, such as chatbots, live captioning, or gaming interactions, can benefit from the low-latency processing of SLMs.
- Custom Applications: SLMs can be fine-tuned on specific datasets for niche applications like medical text analysis, legal document summarization, or financial sentiment analysis.
Here are some examples and use cases of SLMs that already power our everyday applications:
- Mobile Keyboard Prediction: SwiftKey and Gboard leverage SLMs to provide contextually accurate text suggestions, enhancing typing speed and accuracy.
- Voice Assistants: Devices like Amazon Echo and Google Home utilize SLMs for on-device natural language understanding, enabling voice-controlled interactions.
- Offline Translation Apps: Applications like Google Translate offer offline translation capabilities powered by SLMs, facilitating communication in areas with limited internet connectivity.
- Smart Reply: Email platforms like Gmail employ SLMs to suggest concise, contextually relevant responses, streamlining email communication.
What are the popular SLMs available so far?
1. Google AI Edge SDK for Gemini Nano — A small but efficient AI model that lets you run advanced AI features directly on your Android device. This means no internet connection is needed, and your data stays private. It's a good fit when you need quick responses, low costs, and strong privacy protection.
Although smaller than cloud-based models, Gemini Nano can be customized for specific tasks and works seamlessly with Android's AICore system. It uses your device's hardware for fast performance and stays up to date automatically.
Currently, Gemini Nano is available on these devices:
- Google Pixel 9 Series
- Google Pixel 8 Pro
- Google Pixel 8
- Google Pixel 8a
- Samsung Galaxy S24 Series
More devices will be supported in the future. You can access Gemini Nano through the Google AI Edge SDK, a complete set of tools for on-device machine learning.
Advantage — Readily accessible on the device and usable almost instantly, with very minimal setup in the application.
Disadvantage — Currently, direct on-device fine-tuning of Gemini Nano for specific use cases is not yet available through the publicly released Google AI Edge SDK. The model is designed to be versatile and handle a range of tasks out-of-the-box, but more granular customization options are not yet exposed to developers.
2. GPT-4o mini — OpenAI's GPT-4o mini is a smaller, more affordable AI model. It's well suited to tasks that involve lots of data or need quick responses, like customer-service chatbots or code-analysis tools. It currently understands text and images, with support for video and audio coming soon. It's also safer and better at handling different languages than older models.
Advantage — Cost-effectiveness and versatility make it suitable for various industry applications.
Disadvantage — While GPT-4o mini is smaller and more efficient than its predecessors, it’s still a relatively large and complex model. Running it directly on a mobile device would likely be slow and drain the battery quickly.
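Because GPT-4o mini runs in the cloud rather than on the device, a mobile app typically reaches it through OpenAI's Chat Completions REST API. Here is a minimal Java sketch of such a call; the endpoint and model name come from OpenAI's public API, while the prompt and surrounding code are purely illustrative (call this from a method that declares throws Exception):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
// Minimal request to OpenAI's Chat Completions endpoint with the gpt-4o-mini model
String body = "{\"model\": \"gpt-4o-mini\", "
        + "\"messages\": [{\"role\": \"user\", \"content\": \"Summarize SLMs in one sentence.\"}]}";
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://api.openai.com/v1/chat/completions"))
        .header("Authorization", "Bearer " + System.getenv("OPENAI_API_KEY"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();
HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body()); // JSON containing the model's reply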
3. MobileBERT — Various versions of BERT have appeared since Google introduced it in a 2018 research paper. These versions cater to different needs, balancing model size, computational requirements, and task-specific performance. A notable example is MobileBERT, designed explicitly for on-device deployment.
Advantage — MobileBERT can be pre-trained and fine-tuned in the cloud for specific use cases, allowing greater customization and potential performance gains compared to Gemini Nano, which does not yet expose fine-tuning to developers. This cloud-based training lets developers tailor MobileBERT to their exact requirements before deploying it to mobile devices, optimizing its efficiency and accuracy for specific tasks.
Disadvantage — Pre-training on large datasets in the cloud is efficient and makes good use of resources, but it can cost the model personalization if it is fine-tuned only on general data rather than individual usage patterns.
The architecture of implementing an SLM in an offline mobile application
The implementation involves several key components:
- Model Selection: Choose an appropriate pre-trained SLM, considering the specific application requirements and device constraints.
- Fine-tuning (Optional): If the task demands domain-specific knowledge, fine-tune the SLM on a relevant dataset.
- Optimization: Apply techniques like quantization or pruning to further reduce the model’s size and improve inference speed.
- Integration: Embed the optimized model into the mobile app using frameworks like TensorFlow Lite or PyTorch Mobile.
- User Interface: Develop a user-friendly interface for interacting with the SLM’s functionality.
Let us look at a high-level implementation of a mobile app on Android and iOS using MobileBERT.
1) Android (TensorFlow Lite)
A) Project Setup
- Open your Android project in Android Studio.
- Add the TensorFlow Lite dependencies to your build.gradle file:
dependencies {
    implementation 'org.tensorflow:tensorflow-lite:+'
    // Optional: helper utilities for input/output processing
    implementation 'org.tensorflow:tensorflow-lite-support:+'
    // Optional: GPU delegate for hardware acceleration
    implementation 'org.tensorflow:tensorflow-lite-gpu:+'
}
B) Obtain and Prepare the Model
- Download the pre-trained MobileBERT .tflite model from Hugging Face or another reliable source.
- Place the model file in your project's assets folder.
C) Load and Run Inference
import android.content.res.AssetFileDescriptor;
import android.content.res.AssetManager;
import org.tensorflow.lite.Interpreter;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
// ... other imports

// Load the model and run inference
try (Interpreter interpreter = new Interpreter(loadModelFile(assetManager, "mobilebert.tflite"))) {
    // Prepare input data (tokenize the query and convert tokens to numerical IDs)
    // ... (use a tokenizer library or implement your own)
    float[][] input = prepareInputData(query);
    // Create the output buffer; size the second dimension to your task's label count
    float[][] output = new float[1][/*num_labels*/];
    // Run inference
    interpreter.run(input, output);
    // Process the output and display results
    // ... (interpret the output according to your task)
    String answer = interpretOutput(output);
    displayAnswer(answer);
} catch (IOException e) {
    // Handle model loading errors
}

// Memory-maps the .tflite model file from the assets folder (standard TFLite pattern)
private MappedByteBuffer loadModelFile(AssetManager assetManager, String fileName) throws IOException {
    AssetFileDescriptor fd = assetManager.openFd(fileName);
    try (FileInputStream inputStream = new FileInputStream(fd.getFileDescriptor())) {
        return inputStream.getChannel().map(
                FileChannel.MapMode.READ_ONLY, fd.getStartOffset(), fd.getDeclaredLength());
    }
}
// ... Helper methods to prepare input, interpret output
2) iOS (Core ML)
A) Project Setup
- Open your Xcode project.
- Drag and drop the MobileBERT .mlmodel file into your project. Ensure "Copy items if needed" and "Create groups" are checked.
B) Load and Run Inference
Sample code to load and use the model:
import CoreML

// Load the model (the class name is auto-generated by Xcode from the .mlmodel file)
guard let model = try? MyMobileBERTModel(configuration: MLModelConfiguration()) else {
    // Handle model loading error
    return
}

// Prepare input data (tokenize the query and convert tokens to numerical IDs)
// ... (use a tokenizer library or implement your own)
guard let input = try? prepareInputData(query),
      let inputFeatures = try? MLDictionaryFeatureProvider(dictionary: input) else {
    // Handle input preparation error
    return
}

// Run inference through the underlying MLModel
guard let output = try? model.model.prediction(from: inputFeatures) else {
    // Handle prediction error
    return
}

// Process the output and display results
// ... (interpret the output according to your task)
let answer = interpretOutput(output)
displayAnswer(answer)

// ... Helper methods to prepare input, interpret output
Accessing the Model with Queries
In both Android and iOS:
1) Tokenize the User’s Query: Break down the text into smaller units (words or subwords) that the model understands.
2) Convert Tokens to Numerical Representations: Use the model’s vocabulary to map each token to its corresponding numerical ID.
3) Prepare the Input in the Correct Format: This might involve padding or truncating sequences to match the model’s expected input shape.
4) Feed the Input to the Model and Obtain the Output.
5) Interpret the Output: The interpretation depends on your task (e.g., classification, text generation).
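To make steps 1–3 concrete, here is a minimal Java sketch that maps whole words to vocabulary IDs and pads to a fixed length. The vocab map and the [CLS]/[SEP]/[UNK] handling follow BERT conventions, but real BERT tokenizers use WordPiece subword splitting, so treat this as a simplification rather than a production tokenizer:
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified whole-word tokenization; real BERT models use WordPiece subwords
float[][] prepareInputData(String query, Map<String, Integer> vocab, int maxSeqLen) {
    List<Integer> ids = new ArrayList<>();
    ids.add(vocab.get("[CLS]")); // BERT inputs start with the [CLS] token
    for (String token : query.toLowerCase().split("\\s+")) {
        // Map each token to its vocabulary ID, falling back to [UNK] for unknown words
        ids.add(vocab.getOrDefault(token, vocab.get("[UNK]")));
    }
    ids.add(vocab.get("[SEP]")); // ...and end with the [SEP] token
    // Pad or truncate to the model's expected input length
    float[][] input = new float[1][maxSeqLen];
    for (int i = 0; i < maxSeqLen; i++) {
        input[0][i] = i < ids.size() ? ids.get(i) : 0f; // 0 is the [PAD] ID
    }
    return input;
}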
Key points to consider:
- Hardware Acceleration: Explore hardware acceleration (the GPU delegate on Android, Core ML's compute units on iOS) for faster inference, especially for larger models; a sketch follows this list.
- User Interface: Build a user-friendly interface to allow users to input queries and view the model's responses.
- Error Handling: Implement robust error handling for cases like model loading failures, invalid inputs, or prediction errors.
- Offline Usage: MobileBERT is designed for on-device use, so your app should function even without an internet connection.
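As an example of hardware acceleration on Android, the TensorFlow Lite GPU delegate (from the tensorflow-lite-gpu dependency added earlier) can be attached when the interpreter is created. A minimal sketch, reusing the loadModelFile helper and the input/output buffers from the earlier snippet:
import java.io.IOException;
import org.tensorflow.lite.Interpreter;
import org.tensorflow.lite.gpu.GpuDelegate;

// Attach the GPU delegate so supported ops run on the device GPU
try (GpuDelegate gpuDelegate = new GpuDelegate()) {
    Interpreter.Options options = new Interpreter.Options()
            .addDelegate(gpuDelegate) // unsupported ops fall back to the CPU
            .setNumThreads(4);        // CPU threads for the non-delegated ops
    try (Interpreter interpreter = new Interpreter(loadModelFile(assetManager, "mobilebert.tflite"), options)) {
        // Run inference as before; delegated ops now execute on the GPU
        interpreter.run(input, output);
    }
} catch (IOException e) {
    // Handle model loading errors
}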
Training an SLM post-implementation
1) Using On-device ML:
- Leverage frameworks like TensorFlow Lite or Core ML to enable on-device training with user data (see the sketch after this list).
- Personalize the model to individual user preferences and usage patterns.
- Challenges include limited device resources.
2) Using Cloud-based ML:
- Periodically collect anonymized user data and send it to the cloud for training.
- Utilize cloud-based resources for more computationally intensive training.
- Updates can be pushed back to the device for improved performance.
- Requires careful data management and consideration of user privacy.
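To illustrate the on-device option, TensorFlow Lite can run training steps locally if the model was exported with dedicated training signatures. A minimal Java sketch, assuming a hypothetical model with a "train" signature that takes a feature tensor "x" and a label tensor "y" and reports a "loss" output (the signature and tensor names follow TensorFlow's on-device training example; collectUserFeatures and collectUserLabels are hypothetical helpers):
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.tensorflow.lite.Interpreter;

// One local personalization step via the model's "train" signature
try (Interpreter interpreter = new Interpreter(loadModelFile(assetManager, "personalizable_model.tflite"))) {
    float[][] userFeatures = collectUserFeatures(); // locally gathered training data
    float[][] userLabels = collectUserLabels();     // corresponding labels
    Map<String, Object> inputs = new HashMap<>();
    inputs.put("x", userFeatures);
    inputs.put("y", userLabels);
    Map<String, Object> outputs = new HashMap<>();
    float[] loss = new float[1];
    outputs.put("loss", loss);
    // Runs the training signature; the interpreter updates its weights in place
    interpreter.runSignature(inputs, outputs, "train");
    // Persist the updated weights afterwards if the model exposes a "save" signature
} catch (IOException e) {
    // Handle model loading errors
}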
In conclusion,
Small language models represent a significant step forward in making AI accessible and practical for a wider range of applications. Their efficiency, adaptability, and potential for on-device deployment open up exciting possibilities across various domains. As research and development in this area continue to advance, we can anticipate even more innovative and impactful uses of SLMs in the future.
Share your thoughts on utilizing SLM in your applications.
Happy Learning!
Originally published at http://shankarkumarasamy.blog on August 31, 2024.