Running Large Language Models on your Mac: A practical guide to MLX, oMLX and MTPLX

,

Introduction

If you own a MacBook Pro with Apple Silicon (M1 through M5 generation), you already have hardware specifically designed for machine learning. The challenge is finding the right software to unlock its potential.

This guide covers three tools that form a natural progression:

Tool Purpose Best For
MLX Core framework by Apple Research Experimentation & scripting
oMLX Fast caching inference server Coding agents & chat apps
MTPLX Multi-tenant server with API compatibility Team deployments

Why Run LLMs Locally?

Privacy & Data Security

When you use cloud AI services, your data travels to external servers. Local AI keeps proprietary code, client data and personal information on your machine. This simplifies compliance with HIPAA, GDPR and internal security policies.

Cost Savings

Heavy Codex users often see monthly bills jump from $20 to $100-200. Analysis shows 80% of tokens are consumed by repetitive tasks: commit messages, batch translation, organizing notes and editing email drafts. Local models handle these at zero marginal cost.

Speed & Control

No network round-trip means responses start instantly. No rate limits. No API downtime. Works offline. You choose any model without vendor lock-in.

Understanding Apple Silicon

The Unified Memory Advantage

Traditional computers separate CPU and GPU memory, causing slow data transfers. Apple Silicon uses Unified Memory Architecture (UMA): the CPU, GPU and Neural Engine share one memory pool with instant access.

Why Memory Bandwidth Matters

LLM inference is memory-bandwidth bound, not compute-bound. The GPU spends most of its time waiting for model weights to load from memory. Higher bandwidth means faster token generation. This is why Max and Ultra chips with 400+ GB/s significantly outperform base chips.

Chip Comparison for LLM Workloads

Chip Max RAM GPU Cores Memory Bandwidth Best Model Size
M1 16GB 7–8 68 GB/s 7B Q4
M1 Pro/Max 32–64GB 14–32 200–400 GB/s 14–30B Q4
M2 24GB 8–10 100 GB/s 7–14B Q4
M2 Pro/Max 32–96GB 16–38 200–400 GB/s 30–70B Q4
M3 24GB 10 100 GB/s 7–14B Q4
M3 Pro/Max 36–128GB 14–40 150–400 GB/s 35–100B Q4
M4 32GB 10 120 GB/s 14–30B Q4
M4 Pro/Max 64–128GB 16–40 273–546 GB/s 35–120B Q4
M5 generation 24–48GB 16+ 200+ GB/s 35B Q4

MLX: The Foundation

What Is MLX?

MLX is an open-source array framework developed by Apple Machine Learning Research. Unlike other ML libraries, it was built specifically for Apple Silicon’s Unified Memory Architecture. It is the only major framework that treats shared memory as a native feature rather than a compatibility layer.

Key Features

Feature Benefit
Unified Memory No data copying between CPU and GPU
Lazy Computation Optimizes operations before execution
NumPy-like API Familiar syntax for Python developers
Quantization Support Run larger models with less memory

Getting Started

Install MLX LM:

pip install mlx-lm

Generate text with a few lines:


from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")
text = generate(
    model,
    tokenizer,
    prompt="Write a story about Keralam",
    verbose=True
)

The model downloads automatically and runs on your GPU. The load function handles quantization and caching. The generate function handles tokenization, inference and decoding.

Command-Line Usage

# Generate text
mlx_lm.generate --model mlx-community/Qwen2.5-7B-Instruct-4bit \
  --prompt "Explain quantum computing" \
  --max-tokens 200

# Interactive chat
mlx_lm.chat --model mlx-community/Qwen2.5-7B-Instruct-4bit

Understanding Quantization

What Is Quantization?

Quantization reduces model precision to save memory. Think of it like image compression: you lose some quality but gain huge storage savings.
Precision Bits per Weight Memory for 7B Model Quality
FP32 32 bits ~28GB Perfect
FP16 16 bits ~14GB Excellent
Q8_0 8 bits ~7GB Very Good
Q6_K 6 bits ~5.5GB Good
Q5_K_M 5 bits ~4.5GB Good
Q4_K_M 4 bits ~4GB Acceptable
Q3_K 3 bits ~3GB Degraded

Decoding Quantization Names

Recommended by Use Case

Use Case Recommended Why
Coding assistance Q4_K_M Good balance, faster responses
Creative writing Q5_K_M or Q6_K Better nuance and coherence
Translation Q4_K_M Sufficient quality
Complex reasoning Q6_K or Q8_0 Preserves model intelligence
Memory-constrained Q4_K_S Smallest viable option

The Problem With Local LLMs

1. Memory Constraints

Your Mac shares memory between the OS, apps and AI models. A 30B parameter model at 4-bit uses ~17GB, leaving little room on a 24GB machine for context cache and applications.

2. Context Window Limits

Every token in a conversation needs memory. Long conversations exhaust available RAM. A 16K context window on a 24GB machine running a 35B model is tight. Push to 32K and the model crashes.

3. Speed vs. Quality Trade-off

Model Size Speed Capability
7B ~100 tokens/sec Basic tasks
35B ~15–25 tokens/sec Complex reasoning

oMLX: The Solution

What oMLX Actually Does

oMLX is a local LLM inference server built natively on MLX. It adds continuous batching, tiered caching and a native macOS interface to the core MLX engine. The project is open source under the Apache 2.0 license and targets Apple Silicon specifically.

Key Features

Feature What It Does
Two-Tier Cache Hot cache in RAM, cold cache on SSD
Persistent Context Survives restarts, no lost conversations
Continuous Batching Multiple requests share model weights
2× Faster Compared to Ollama
50% Less Memory More headroom for context

How oMLX Addresses Each Problem

  • Memory Efficiency: Uses MLX’s native quantization. A 24GB MacBook Pro can run a 35B model with headroom for context, where Ollama would struggle or crash.
  • Context Management: The persistent cache stores conversation context to disk. Restart the server or switch conversations and the cache restores instantly. The system promotes frequently used blocks to hot RAM and demotes rarely used blocks to cold SSD storage.
  • Speed Optimization: Continuous batching allows multiple requests to share model weights. Instead of loading separately for each request, oMLX processes them together, improving throughput by over 2x.
  • Thermal Awareness: The native macOS menu bar app monitors temperature and throttling in real time. Adjust concurrency or model size to stay within sustainable temperatures.
  • Tooling Simplicity: Install via Homebrew, start with one command, interact through a web dashboard or API. No configuration files, no format conversions, no manual cache management.

Installation

brew tap jundot/omlx https://github.com/jundot/omlx
brew trust jundot/omlx
brew install omlx

Running the Server

omlx serve \
  --model-dir ~/models \
  --hot-cache-max-size 20% \
  --paged-ssd-cache-dir ~/.omlx/cache \
  --max-concurrent-requests 16
The --model-dir parameter tells oMLX where to find your downloaded models. The --hot-cache-max-size reserves 20 percent of your system memory for the fast RAM cache. The --paged-ssd-cache-dir sets the location for the persistent SSD cache. The --max-concurrent-requests parameter allows up to 16 simultaneous requests, suitable for a single developer running multiple tools.
Click on image to zoom
Note: oMLX includes a native macOS menu bar app for monitoring temperature, model status and downloads.

MTPLX: Multi-Tenant Serving

The Purpose

MTPLX extends oMLX for scenarios where multiple users or services share one inference server. It adds resource isolation and standard API compatibility.

Key Features

  • OpenAI API compatibility
  • Anthropic API compatibility
  • Resource isolation between users
  • Production-ready deployment

Installation & Usage

brew install youssofal/mtplx/mtplx
mtplx start --profile sustained --port 8000

The –profile sustained option configures the server for continuous operation rather than burst workloads. The –port 8000 exposes the server on port 8000, the standard port for many local development tools.

When to Use Each Tool

Scenario Recommended Tool
Python scripting & experiments MLX LM
IDE integrations & coding agents oMLX
Team deployments & API compatibility MTPLX

Quick Start with Codex

Step 1: Install Codex

Download from the official Codex site and log in with your ChatGPT account. The Codex desktop app can execute local shell commands, modify files and interact with the operating system. Confirm it works with a quick test: ask it to list your downloads folder.

Step 2: Install oMLX

Send this message to Codex:
“Install omlx for me using Homebrew. The commands are: brew tap jundot/omlx https://github.com/jundot/omlx && brew install omlx. After that, launch the omlx menu bar app and verify that it is running at localhost:8000/admin.”
Codex executes the steps sequentially: check Homebrew, add the tap, install oMLX, launch the menu bar app and verify with a curl test. You glance at each step and press “y” to approve. The oMLX icon appears in your menu bar. Clicking it opens the admin console.

Step 3: Download Models

Send this message, adjusting for your RAM:
“My Mac has 24GB of RAM. Download the appropriate models for omlx: for 24GB use Qwen3.6-35B-A3B with Q4 quantization. After downloading, pin them to omlx memory.”
Codex downloads the models via the oMLX admin API and pins them to memory. The process takes 5 to 30 minutes depending on download speed.

Step 4: Configure Local Inference

“Configure yourself to use the local omlx API at localhost:8000 for all future requests. Default to local inference instead of cloud APIs.”
Codex updates its configuration to point at your local server. Future interactions use the local model. Zero token costs, zero data sent externally.

Smart Routing: Local + Cloud

Not every task needs the cloud. Not every task runs well locally. Smart routing gives you the best of both.

The Strategy

Task Type Route To Why
Commit messages Local Repetitive, privacy-sensitive
Translations Local High-token, low-complexity
Summarization Local Privacy-sensitive
Email drafting Local Predictable patterns
Code comments Local Repetitive, safe
Complex reasoning Cloud Requires larger models
Multi-agent workflows Cloud Better performance

Configuring Smart Routing in Codex

I sent the following prompt to Codex:

“Configure the Codex desktop app to use the local oMLX API at http://127.0.0.1:8000/v1 with OpenAI compatibility. Set up routing rules so that commit messages, batch translations, document summarization, email drafting and simple code commenting are handled by the local Qwen3.6-35B-A3B model. Route multi-agent workflows, long-form projects and complex reasoning tasks to cloud-based GPT-5.”

Codex generated a configuration similar to this:

[providers.local]
base_url = "http://127.0.0.1:8000/v1"
api_key = "local"
model = "qwen3.6-35b-a3b"

[routing]
commit_message = "local"
translate = "local"
summarize = "local"
email_draft = "local"
code_comment = "local"
default = "cloud"

What Each Routing Rule Does

Commit messages: When you ask Codex to write a commit message, it runs locally. The task is repetitive, low complexity and often contains sensitive code references you may not want in a cloud log.
Batch translation: Translating multiple documents or code comments is a high-token, low-complexity task. Running locally eliminates per-token costs entirely.
Document summarization: Summarizing meeting notes, research papers or legal documents is privacy-sensitive and context-heavy. Local execution keeps the document on your machine.
Email drafting: Writing professional emails follows predictable patterns. A 35B model handles this at 85-90 percent of cloud quality with zero cost.
Simple code commenting: Adding docstrings or inline comments to existing code is repetitive and safe to run locally.
Complex reasoning: Architectural decisions, debugging across multiple files or creative writing tasks route to the cloud where larger models perform better.

Cost Savings

Heavy Codex users often see monthly bills jump from $20 to $100-200. Analysis shows 80 percent of tokens are consumed by the five repetitive tasks listed above. Once smart routing is configured, those tokens cost nothing. For power users, saving $20-200 per month is a realistic outcome.

Best for Casual Use vs Power Users

For casual chats, avoiding manual Codex configuration or demos for friends and family, the simple local setup is sufficient. Point Codex at your oMLX server and use it for everything. The quality is good enough for most conversations.
For power users who rely on Codex for professional development, the routing configuration is essential. It automates the decision of where to send each request, so you never have to think about whether a task should run locally or in the cloud. The system decides based on the task type and you get the best of both worlds without manual intervention.

Model Recommendations by RAM

RAM Recommended Model Memory Used Notes
8GB Qwen3.5:4b Q4_K_M ~3.4GB Great for beginners
16GB Qwen3.5:9b Q4_K_M ~6.6GB Best balance for 16GB
24GB Qwen3.6-35B-A3B Q4_K_M ~21–22GB Sweet spot for 35B models
32GB Qwen3.6-35B-A3B Q4_K_M ~21–22GB Comfortable with 32K context
48GB+ Qwen3.6-35B-A3B Q5_K_M ~25–26GB Higher quality, 128K context
64GB+ Q6_K or Q8_0 ~29–36GB Near-lossless quality

Why Qwen Models?

The Qwen family performs exceptionally well on Apple Silicon. Qwen3.6-35B-A3B uses a Mixture of Experts architecture with 35 billion total parameters and 3 billion active. Its SWE-bench Verified score of 73.4 places it in the same league as Claude Sonnet 4.5. The mlx-community organization on Hugging Face maintains pre-converted versions specifically for MLX, available within days of release.

Important MoE Clarification

Quantization size is calculated based on total parameters (35B), not active parameters (3B). All 64 expert weights must reside in memory because the system cannot predict which expert the next token will route to. So the file size and memory usage are roughly equivalent to a dense 35B model. The speed advantage comes from computing only 3B parameters per token, but the memory footprint remains that of a 35B model.

The Ecosystem Context

Industry Validation

  • Early 2026: Ollama switched to MLX as its inference engine on Apple Silicon
  • WWDC 2025: Apple dedicated three sessions to MLX, establishing it as the preferred framework

Tool Comparison

Tool Strength
MLX LM Simplest entry point
oMLX Persistent caching for agents
MTPLX Multi-tenant + API compatibility
Ollama Popular but slower
LM Studio GUI-focused

Conclusion

Your MacBook Pro with Apple Silicon is powerful AI hardware. The right software unlocks it:

  1. Start simple — Use MLX LM to generate text with Python
  2. Level up — Install oMLX for persistent caching and faster responses
  3. Optimize costs — Configure smart routing to use local models for repetitive tasks
  4. Scale if needed — Use MTPLX for team deployments

Quick Wins

Action Benefit
Install oMLX 2× faster than Ollama
Configure smart routing Save $20–200/month
Use persistent cache No context loss on restart

Local inference isn’t a compromise. With these tools, it’s faster, more private and more controllable than cloud APIs.

Your MacBook Pro was designed for this. Now you have the tools to unlock it.

Resources & Links

Official Documentation

Resource Link Description
MLX GitHub github.com/ml-explore/mlx Core framework
MLX LM github.com/ml-explore/mlx-lm Language model tools
oMLX github.com/jundot/omlx Inference server
MTPLX github.com/youssofal/mtplx Multi-tenant server

Recommended Models

Model Size Good For
Qwen2.5-7B-Instruct-4bit ~4GB General use, fast
Qwen2.5-14B-Instruct-4bit ~8GB Balanced quality
Qwen3.6-35B-A3B-4bit ~21GB Complex reasoning
CodeLlama-13B-4bit ~7GB Code-focused
Mistral-7B-Instruct-4bit ~4GB General, efficient

Quick Reference Card

Please follow and like us:
0

Leave a Reply

Your email address will not be published. Required fields are marked *