Running Large Language Models on your Mac: A practical guide to MLX, oMLX and MTPLX

Introduction

If you own a MacBook Pro with Apple Silicon (M1 through M5 generation), you already have hardware specifically designed for machine learning. The challenge is finding the right software to unlock its potential.

This guide covers three tools that form a natural progression:

Tool	Purpose	Best For
MLX	Core framework by Apple Research	Experimentation & scripting
oMLX	Fast caching inference server	Coding agents & chat apps
MTPLX	Multi-tenant server with API compatibility	Team deployments

Why Run LLMs Locally?

Privacy & Data Security

When you use cloud AI services, your data travels to external servers. Local AI keeps proprietary code, client data and personal information on your machine. This simplifies compliance with HIPAA, GDPR and internal security policies.

Cost Savings

Heavy Codex users often see monthly bills jump from $20 to $100-200. Analysis shows 80% of tokens are consumed by repetitive tasks: commit messages, batch translation, organizing notes and editing email drafts. Local models handle these at zero marginal cost.

Speed & Control

No network round-trip means responses start instantly. No rate limits. No API downtime. Works offline. You choose any model without vendor lock-in.

Understanding Apple Silicon

The Unified Memory Advantage

Traditional computers separate CPU and GPU memory, causing slow data transfers. Apple Silicon uses Unified Memory Architecture (UMA): the CPU, GPU and Neural Engine share one memory pool with instant access.

Why Memory Bandwidth Matters

LLM inference is memory-bandwidth bound, not compute-bound. The GPU spends most of its time waiting for model weights to load from memory. Higher bandwidth means faster token generation. This is why Max and Ultra chips with 400+ GB/s significantly outperform base chips.

Chip Comparison for LLM Workloads

Chip	Max RAM	GPU Cores	Memory Bandwidth	Best Model Size
M1	16GB	7–8	68 GB/s	7B Q4
M1 Pro/Max	32–64GB	14–32	200–400 GB/s	14–30B Q4
M2	24GB	8–10	100 GB/s	7–14B Q4
M2 Pro/Max	32–96GB	16–38	200–400 GB/s	30–70B Q4
M3	24GB	10	100 GB/s	7–14B Q4
M3 Pro/Max	36–128GB	14–40	150–400 GB/s	35–100B Q4
M4	32GB	10	120 GB/s	14–30B Q4
M4 Pro/Max	64–128GB	16–40	273–546 GB/s	35–120B Q4
M5 generation	24–48GB	16+	200+ GB/s	35B Q4

MLX: The Foundation

What Is MLX?

MLX is an open-source array framework developed by Apple Machine Learning Research. Unlike other ML libraries, it was built specifically for Apple Silicon’s Unified Memory Architecture. It is the only major framework that treats shared memory as a native feature rather than a compatibility layer.

Key Features

Feature	Benefit
Unified Memory	No data copying between CPU and GPU
Lazy Computation	Optimizes operations before execution
NumPy-like API	Familiar syntax for Python developers
Quantization Support	Run larger models with less memory

Getting Started

Install MLX LM:

pip install mlx-lm

Generate text with a few lines:


from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Write a story about Keralam",
    verbose=True
)

The model downloads automatically and runs on your GPU. The load function handles quantization and caching. The generate function handles tokenization, inference and decoding.

Command-Line Usage

# Generate text
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Explain quantum computing" \
  --max-tokens 200

# Interactive chat
mlx_lm.chat --model mlx-community/Qwen2.5-7B-Instruct-4bit

Understanding Quantization

What Is Quantization?

Quantization reduces model precision to save memory. Think of it like image compression: you lose some quality but gain huge storage savings.

Precision	Bits per Weight	Memory for 7B Model	Quality
FP32	32 bits	~28GB	Perfect
FP16	16 bits	~14GB	Excellent
Q8_0	8 bits	~7GB	Very Good
Q6_K	6 bits	~5.5GB	Good
Q5_K_M	5 bits	~4.5GB	Good
Q4_K_M	4 bits	~4GB	Acceptable
Q3_K	3 bits	~3GB	Degraded

Decoding Quantization Names

Recommended by Use Case

Use Case	Recommended	Why
Coding assistance	Q4_K_M	Good balance, faster responses
Creative writing	Q5_K_M or Q6_K	Better nuance and coherence
Translation	Q4_K_M	Sufficient quality
Complex reasoning	Q6_K or Q8_0	Preserves model intelligence
Memory-constrained	Q4_K_S	Smallest viable option

The Problem With Local LLMs

1. Memory Constraints

Your Mac shares memory between the OS, apps and AI models. A 30B parameter model at 4-bit uses ~17GB, leaving little room on a 24GB machine for context cache and applications.

2. Context Window Limits

Every token in a conversation needs memory. Long conversations exhaust available RAM. A 16K context window on a 24GB machine running a 35B model is tight. Push to 32K and the model crashes.

3. Speed vs. Quality Trade-off

Model Size	Speed	Capability
7B	~100 tokens/sec	Basic tasks
35B	~15–25 tokens/sec	Complex reasoning

oMLX: The Solution

What oMLX Actually Does

oMLX is a local LLM inference server built natively on MLX. It adds continuous batching, tiered caching and a native macOS interface to the core MLX engine. The project is open source under the Apache 2.0 license and targets Apple Silicon specifically.

Key Features

Feature	What It Does
Two-Tier Cache	Hot cache in RAM, cold cache on SSD
Persistent Context	Survives restarts, no lost conversations
Continuous Batching	Multiple requests share model weights
2× Faster	Compared to Ollama
50% Less Memory	More headroom for context

How oMLX Addresses Each Problem

Memory Efficiency: Uses MLX’s native quantization. A 24GB MacBook Pro can run a 35B model with headroom for context, where Ollama would struggle or crash.
Context Management: The persistent cache stores conversation context to disk. Restart the server or switch conversations and the cache restores instantly. The system promotes frequently used blocks to hot RAM and demotes rarely used blocks to cold SSD storage.
Speed Optimization: Continuous batching allows multiple requests to share model weights. Instead of loading separately for each request, oMLX processes them together, improving throughput by over 2x.
Thermal Awareness: The native macOS menu bar app monitors temperature and throttling in real time. Adjust concurrency or model size to stay within sustainable temperatures.
Tooling Simplicity: Install via Homebrew, start with one command, interact through a web dashboard or API. No configuration files, no format conversions, no manual cache management.

Installation

brew tap jundot/omlx https://github.com/jundot/omlx
brew trust jundot/omlx
brew install omlx

Running the Server

omlx serve \
  --model-dir ~/models \
  --hot-cache-max-size 20% \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --max-concurrent-requests 16

The --model-dir parameter tells oMLX where to find your downloaded models. The --hot-cache-max-size reserves 20 percent of your system memory for the fast RAM cache. The --paged-ssd-cache-dir sets the location for the persistent SSD cache. The --max-concurrent-requests parameter allows up to 16 simultaneous requests, suitable for a single developer running multiple tools.

Note: oMLX includes a native macOS menu bar app for monitoring temperature, model status and downloads.

MTPLX: Multi-Tenant Serving

The Purpose

MTPLX extends oMLX for scenarios where multiple users or services share one inference server. It adds resource isolation and standard API compatibility.

Key Features

OpenAI API compatibility
Anthropic API compatibility
Resource isolation between users
Production-ready deployment

Installation & Usage

brew install youssofal/mtplx/mtplx
mtplx start --profile sustained --port 8000

The –profile sustained option configures the server for continuous operation rather than burst workloads. The –port 8000 exposes the server on port 8000, the standard port for many local development tools.

When to Use Each Tool

Scenario	Recommended Tool
Python scripting & experiments	MLX LM
IDE integrations & coding agents	oMLX
Team deployments & API compatibility	MTPLX

Quick Start with Codex

Step 1: Install Codex

Download from the official Codex site and log in with your ChatGPT account. The Codex desktop app can execute local shell commands, modify files and interact with the operating system. Confirm it works with a quick test: ask it to list your downloads folder.

Step 2: Install oMLX

Send this message to Codex:

“Install omlx for me using Homebrew. The commands are: brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx. After that, launch the omlx menu bar app and verify that it is running at localhost:8000/admin.”

Codex executes the steps sequentially: check Homebrew, add the tap, install oMLX, launch the menu bar app and verify with a curl test. You glance at each step and press “y” to approve. The oMLX icon appears in your menu bar. Clicking it opens the admin console.

Step 3: Download Models

Send this message, adjusting for your RAM:

“My Mac has 24GB of RAM. Download the appropriate models for omlx: for 24GB use Qwen3.6-35B-A3B with Q4 quantization. After downloading, pin them to omlx memory.”

Codex downloads the models via the oMLX admin API and pins them to memory. The process takes 5 to 30 minutes depending on download speed.

Step 4: Configure Local Inference

“Configure yourself to use the local omlx API at localhost:8000 for all future requests. Default to local inference instead of cloud APIs.”

Codex updates its configuration to point at your local server. Future interactions use the local model. Zero token costs, zero data sent externally.

Smart Routing: Local + Cloud

Not every task needs the cloud. Not every task runs well locally. Smart routing gives you the best of both.

The Strategy

Task Type	Route To	Why
Commit messages	Local	Repetitive, privacy-sensitive
Translations	Local	High-token, low-complexity
Summarization	Local	Privacy-sensitive
Email drafting	Local	Predictable patterns
Code comments	Local	Repetitive, safe
Complex reasoning	Cloud	Requires larger models
Multi-agent workflows	Cloud	Better performance

Configuring Smart Routing in Codex

I sent the following prompt to Codex:

“Configure the Codex desktop app to use the local oMLX API at http://127.0.0.1:8000/v1 with OpenAI compatibility. Set up routing rules so that commit messages, batch translations, document summarization, email drafting and simple code commenting are handled by the local Qwen3.6-35B-A3B model. Route multi-agent workflows, long-form projects and complex reasoning tasks to cloud-based GPT-5.”

Codex generated a configuration similar to this:

[providers.local]
base_url = "http://127.0.0.1:8000/v1"
api_key = "local"
model = "qwen3.6-35b-a3b"

[routing]
commit_message = "local"
translate = "local"
summarize = "local"
email_draft = "local"
code_comment = "local"
default = "cloud"

What Each Routing Rule Does

Commit messages: When you ask Codex to write a commit message, it runs locally. The task is repetitive, low complexity and often contains sensitive code references you may not want in a cloud log.

Batch translation: Translating multiple documents or code comments is a high-token, low-complexity task. Running locally eliminates per-token costs entirely.

Document summarization: Summarizing meeting notes, research papers or legal documents is privacy-sensitive and context-heavy. Local execution keeps the document on your machine.

Email drafting: Writing professional emails follows predictable patterns. A 35B model handles this at 85-90 percent of cloud quality with zero cost.

Simple code commenting: Adding docstrings or inline comments to existing code is repetitive and safe to run locally.

Complex reasoning: Architectural decisions, debugging across multiple files or creative writing tasks route to the cloud where larger models perform better.

Cost Savings

Heavy Codex users often see monthly bills jump from $20 to $100-200. Analysis shows 80 percent of tokens are consumed by the five repetitive tasks listed above. Once smart routing is configured, those tokens cost nothing. For power users, saving $20-200 per month is a realistic outcome.

Best for Casual Use vs Power Users

For casual chats, avoiding manual Codex configuration or demos for friends and family, the simple local setup is sufficient. Point Codex at your oMLX server and use it for everything. The quality is good enough for most conversations.

For power users who rely on Codex for professional development, the routing configuration is essential. It automates the decision of where to send each request, so you never have to think about whether a task should run locally or in the cloud. The system decides based on the task type and you get the best of both worlds without manual intervention.

Model Recommendations by RAM

RAM	Recommended Model	Memory Used	Notes
8GB	Qwen3.5:4b Q4_K_M	~3.4GB	Great for beginners
16GB	Qwen3.5:9b Q4_K_M	~6.6GB	Best balance for 16GB
24GB	Qwen3.6-35B-A3B Q4_K_M	~21–22GB	Sweet spot for 35B models
32GB	Qwen3.6-35B-A3B Q4_K_M	~21–22GB	Comfortable with 32K context
48GB+	Qwen3.6-35B-A3B Q5_K_M	~25–26GB	Higher quality, 128K context
64GB+	Q6_K or Q8_0	~29–36GB	Near-lossless quality

Why Qwen Models?

The Qwen family performs exceptionally well on Apple Silicon. Qwen3.6-35B-A3B uses a Mixture of Experts architecture with 35 billion total parameters and 3 billion active. Its SWE-bench Verified score of 73.4 places it in the same league as Claude Sonnet 4.5. The mlx-community organization on Hugging Face maintains pre-converted versions specifically for MLX, available within days of release.

Important MoE Clarification

Quantization size is calculated based on total parameters (35B), not active parameters (3B). All 64 expert weights must reside in memory because the system cannot predict which expert the next token will route to. So the file size and memory usage are roughly equivalent to a dense 35B model. The speed advantage comes from computing only 3B parameters per token, but the memory footprint remains that of a 35B model.

The Ecosystem Context

Industry Validation

Early 2026: Ollama switched to MLX as its inference engine on Apple Silicon
WWDC 2025: Apple dedicated three sessions to MLX, establishing it as the preferred framework

Tool Comparison

Tool	Strength
MLX LM	Simplest entry point
oMLX	Persistent caching for agents
MTPLX	Multi-tenant + API compatibility
Ollama	Popular but slower
LM Studio	GUI-focused

Conclusion

Your MacBook Pro with Apple Silicon is powerful AI hardware. The right software unlocks it:

Start simple — Use MLX LM to generate text with Python
Level up — Install oMLX for persistent caching and faster responses
Optimize costs — Configure smart routing to use local models for repetitive tasks
Scale if needed — Use MTPLX for team deployments

Quick Wins

Action	Benefit
Install oMLX	2× faster than Ollama
Configure smart routing	Save $20–200/month
Use persistent cache	No context loss on restart

Local inference isn’t a compromise. With these tools, it’s faster, more private and more controllable than cloud APIs.

Your MacBook Pro was designed for this. Now you have the tools to unlock it.

Resources & Links

Official Documentation

Resource	Link	Description
MLX GitHub	github.com/ml-explore/mlx	Core framework
MLX LM	github.com/ml-explore/mlx-lm	Language model tools
oMLX	github.com/jundot/omlx	Inference server
MTPLX	github.com/youssofal/mtplx	Multi-tenant server

Recommended Models

Model	Size	Good For
Qwen2.5-7B-Instruct-4bit	~4GB	General use, fast
Qwen2.5-14B-Instruct-4bit	~8GB	Balanced quality
Qwen3.6-35B-A3B-4bit	~21GB	Complex reasoning
CodeLlama-13B-4bit	~7GB	Code-focused
Mistral-7B-Instruct-4bit	~4GB	General, efficient

Quick Reference Card

Please follow and like us:

Running Large Language Models on your Mac: A practical guide to MLX, oMLX and MTPLX

Introduction

Why Run LLMs Locally?

Privacy & Data Security

Cost Savings

Speed & Control

Understanding Apple Silicon

The Unified Memory Advantage

Why Memory Bandwidth Matters

Chip Comparison for LLM Workloads

MLX: The Foundation

What Is MLX?

Key Features

Getting Started

Command-Line Usage

Understanding Quantization

What Is Quantization?

Decoding Quantization Names

Recommended by Use Case

The Problem With Local LLMs

1. Memory Constraints

2. Context Window Limits

3. Speed vs. Quality Trade-off

oMLX: The Solution

What oMLX Actually Does

Key Features

How oMLX Addresses Each Problem

Installation

Running the Server

MTPLX: Multi-Tenant Serving

The Purpose

Key Features

Installation & Usage

When to Use Each Tool

Quick Start with Codex

Step 1: Install Codex

Step 2: Install oMLX

Step 3: Download Models

Step 4: Configure Local Inference

Smart Routing: Local + Cloud

The Strategy

Configuring Smart Routing in Codex

What Each Routing Rule Does

Cost Savings

Best for Casual Use vs Power Users

Model Recommendations by RAM

Why Qwen Models?

Important MoE Clarification

The Ecosystem Context

Industry Validation

Tool Comparison

Conclusion

Quick Wins

Resources & Links

Official Documentation

Recommended Models

Quick Reference Card

Leave a Reply Cancel reply