Apple Silicon LLM Fine-Tuning

PMetal

High-performance LLM fine-tuning built for Apple Silicon. 18-crate Rust framework with native Metal GPU and Apple Neural Engine support. LoRA, QLoRA, DoRA, GRPO, knowledge distillation, and 20+ architectures — all on your hardware.

20+
LLM Architectures
18
Specialized Crates
Metal
GPU + ANE Native
v0.3.13
Current Version

Own Your AI

No cloud dependency. No per-token fees. No data leaving your hardware. Fine-tune production LLMs at the cost of electricity.

Data Never Leaves

Your training data, your model weights, your hardware. On-premises by design with zero telemetry.

Cost = Electricity

No per-token API charges, no subscription fees, no cloud egress costs. Run on Mac hardware you already own.

Apple Silicon Native

Optimized for M1 through M5. Metal GPU kernels and Apple Neural Engine acceleration auto-detected at runtime.

Production-Ready

Enterprise security, distributed training, quantization, and model merging — not a research prototype.

Fine-Tuning Methods

State-of-the-art parameter-efficient fine-tuning with sequence packing and reasoning training. Every method runs natively on Metal.

LoRA / QLoRA / DoRA

Full suite of parameter-efficient fine-tuning methods with sequence packing for maximum GPU utilization on Apple Silicon.

Low-Rank Adaptation (LoRA)
Quantized LoRA (QLoRA)
Weight-Decomposed LoRA (DoRA)
Sequence packing
Rank-adaptive training
Metal-accelerated adapters
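
For intuition, the LoRA update reduces to a few lines of math: the frozen base weight W is augmented by a low-rank product B·A scaled by α/r, so only the small A and B matrices are trained. The plain-Python sketch below illustrates that math only; it is not PMetal's adapter implementation.

```python
# Illustrative LoRA forward pass: y = (W + (alpha / r) * B @ A) @ x.
# W is frozen; only the low-rank factors A (r x in) and B (out x r) train.

def matmul(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(mi * vi for mi, vi in zip(row, v)) for row in m]

def lora_forward(x, W, A, B, alpha, r):
    base = matmul(W, x)               # frozen base projection
    delta = matmul(B, matmul(A, x))   # rank-r update path
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# 2x2 identity base weight with a rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]                      # r x in = 1 x 2
B = [[0.5], [0.5]]                    # out x r = 2 x 1
y = lora_forward([2.0, 0.0], W, A, B, alpha=2.0, r=1)
print(y)  # [4.0, 2.0]
```

With rank r much smaller than the weight dimensions, the trainable parameter count drops by orders of magnitude, which is what makes fine-tuning feasible in unified memory.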

GRPO / DAPO Reasoning

Group Relative Policy Optimization (GRPO) and Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO) with custom reward function support for reasoning model training.

GRPO implementation
DAPO alignment
Custom reward functions
Verifiable reward signals
Reasoning chain training
RLHF pipelines
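
The core of GRPO is a critic-free advantage estimate: each sampled completion's reward is normalized against its own sampling group rather than a learned value function. A minimal sketch of that step (illustrative only, not PMetal's trainer internals):

```python
# Group-relative advantages as used in GRPO: sample G completions per
# prompt, score each with a reward function, then normalize within the group.
import math

def group_advantages(rewards):
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0   # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Four sampled completions for one prompt, scored 0/1 by a verifiable reward
advs = group_advantages([1.0, 0.0, 1.0, 0.0])
print(advs)  # [1.0, -1.0, 1.0, -1.0]
```

Completions that beat their group mean get positive advantage and are reinforced; the rest are suppressed, which is why verifiable reward signals pair so naturally with this method.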

Knowledge Distillation

Transfer knowledge from large teacher models to efficient student models with multiple distillation strategies and RLKD support.

Online distillation
Offline distillation
Progressive distillation
Reinforcement Learning KD (RLKD)
Intermediate layer matching
Response-based distillation
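
Response-based distillation typically minimizes the KL divergence between temperature-softened teacher and student output distributions. The self-contained sketch below shows that loss (the T² scaling follows Hinton et al.); it illustrates the objective, not PMetal's training loop.

```python
# Soft-target distillation loss: KL(teacher || student) on
# temperature-softened logits, scaled by T^2.
import math

def softmax(logits, T):
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_kl(teacher_logits, student_logits, T=2.0):
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

loss = distill_kl([4.0, 1.0, 0.0], [4.0, 1.0, 0.0])
print(round(loss, 6))  # 0.0 when the student matches the teacher exactly
```

Raising the temperature spreads probability mass over the teacher's "dark knowledge" about near-miss tokens, which is the signal the student learns from.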

20+ Model Architectures

First-class support for all major LLM families with architecture-specific optimizations baked into Metal kernels.

Llama 3.x series
Qwen 2.5 / QwQ
DeepSeek R1 / V3
Mistral / Mixtral
Gemma 2 / 3
15+ additional architectures

Quantization & GGUF

Export fine-tuned models to GGUF with 13 quantization format options. Directly compatible with llama.cpp and Ollama.

13 GGUF format variants
Q4_K_M, Q8_0, F16, BF16
Dynamic quantization
llama.cpp compatibility
Ollama-ready exports
Size vs quality tradeoffs
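
For a sense of how block quantization works, here is a simplified Q8_0-style round trip: weights are split into blocks, and each block is stored as int8 values plus one scale. Real GGUF packs blocks of 32 values with an f16 scale; this plain-Python version only illustrates the principle.

```python
# Simplified Q8_0-style block quantization: per-block absmax scaling to int8.

def quantize_q8_block(block):
    amax = max(abs(v) for v in block) or 1.0
    scale = amax / 127.0                 # map absmax to the int8 range
    qs = [round(v / scale) for v in block]
    return scale, qs

def dequantize_q8_block(scale, qs):
    return [q * scale for q in qs]

weights = [0.5, -1.0, 0.25, 0.0]
scale, qs = quantize_q8_block(weights)
restored = dequantize_q8_block(scale, qs)
print(all(abs(a - b) < 0.01 for a, b in zip(weights, restored)))  # True
```

The K-quant variants (Q4_K_M and friends) refine this idea with nested super-block scales, trading a little extra metadata for better accuracy at low bit widths.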

Model Merging

Combine multiple fine-tuned adapters or base models using 12 merge strategies including TIES, DARE, and linear interpolation.

12 merge strategies
TIES merging
DARE pruning
Linear interpolation
Task arithmetic
Model soup ensembles
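
Two of the listed strategies reduce to simple per-parameter arithmetic. The sketch below shows linear interpolation and task arithmetic on flat weight vectors; PMetal's merge crate operates on full checkpoints, so treat this as illustration of the math only.

```python
# Per-parameter merge math on flat weight vectors.

def lerp_merge(a, b, t=0.5):
    """Linear interpolation: (1 - t) * a + t * b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def task_arithmetic(base, finetuned_list, scale=1.0):
    """Add scaled task vectors (finetuned - base) back onto the base model."""
    merged = list(base)
    for ft in finetuned_list:
        for i, (f, b) in enumerate(zip(ft, base)):
            merged[i] += scale * (f - b)
    return merged

print(lerp_merge([0.0, 2.0], [2.0, 0.0]))                     # [1.0, 1.0]
print(task_arithmetic([1.0, 1.0], [[2.0, 1.0], [1.0, 3.0]]))  # [2.0, 3.0]
```

TIES and DARE build on the task-arithmetic form by pruning and sign-resolving the task vectors before summation, which reduces interference when merging many adapters.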

Every Interface, Every Workflow

From polished desktop GUI to scriptable Python SDK — use PMetal the way your workflow demands.

Tauri + Svelte

Desktop GUI

Native macOS desktop application built with Tauri and Svelte. Visual training dashboards, real-time loss curves, hyperparameter controls, and model management — no terminal required.

Real-time training visualization
Hyperparameter configuration UI
Dataset management
Checkpoint browser
Export wizard
Native macOS integration

9 tabs

Terminal TUI

Full-featured terminal user interface with 9 dedicated tabs covering training, evaluation, hardware monitoring, logs, and more. Keyboard-driven with vi-style navigation.

9 specialized tabs
Training progress & metrics
Hardware utilization monitor
Log viewer with filtering
Dataset inspector
vi-style keyboard navigation

20+ commands

CLI

Comprehensive command-line interface with 20+ commands for scripting, CI/CD pipelines, and headless server workflows. Shell completion included.

20+ subcommands
Shell completion (zsh/bash/fish)
JSON output mode
Config file support
Environment variable overrides
Scriptable & composable

pip install pmetal

Python SDK

Pythonic interface for Jupyter notebooks, research scripts, and ML pipelines. Full feature parity with the Rust core through PyO3 bindings.

PyO3 native bindings
Jupyter notebook support
Async/await API
HuggingFace datasets integration
Weights & Biases logging
Type hints throughout

Built for Apple Silicon

PMetal doesn't use Metal as an afterthought — the entire training pipeline is designed around the unified memory architecture of M-series chips.

Metal GPU Kernels

Custom Metal Shading Language kernels for matrix multiplication, attention, and gradient computation. Auto-tuned per chip generation.

M1 through M5 · MSL kernels · Auto-tuned

Apple Neural Engine

Offload inference and certain training operations to the ANE. Runtime detection routes operations to the fastest compute unit.

ANE offloading · Runtime routing · Zero config

Unified Memory

Full utilization of Apple Silicon unified memory — no host↔device copies. 16–192 GB addressable depending on your Mac.

Zero copy · Up to 192 GB · Shared bandwidth

Chip Generation Support

Runtime auto-detection selects the optimal execution strategy per chip.

M1 / M1 Pro / Max / Ultra
M2 / M2 Pro / Max / Ultra
M3 / M3 Pro / Max
M4 / M4 Pro / Max
M5 (preview)

18-Crate Modular Architecture

Every concern is a focused crate. Compose exactly what your project needs.

pmetal

Main facade crate

Prelude re-exports
Feature flags
Ergonomic APIs

pmetal-metal

Metal GPU backend

MSL compute kernels
Command queue mgmt
Chip auto-detection

pmetal-ane

Apple Neural Engine

ANE operation routing
CoreML bridge
Runtime fallback

pmetal-lora

LoRA / QLoRA / DoRA

Adapter injection
Rank scheduling
Sequence packing

pmetal-grpo

GRPO / DAPO training

Policy gradient
Custom rewards
Group sampling

pmetal-distill

Knowledge distillation

Online / offline
Progressive stages
RLKD support

pmetal-quant

Quantization & GGUF

13 GGUF formats
Dynamic quant
Export pipelines

pmetal-merge

Model merging

12 strategies
TIES / DARE
Task arithmetic

pmetal-dist

Distributed training

mDNS discovery
Ring All-Reduce
Fault tolerance

pmetal-models

Architecture definitions

20+ architectures
Weight loaders
Attention variants

pmetal-data

Dataset pipeline

Streaming loaders
Tokenizer support
Augmentation

pmetal-gui

Tauri desktop app

Svelte frontend
Training dashboards
Checkpoint browser

pmetal-tui

Terminal interface

9 tab layout
Ratatui backend
vi navigation

pmetal-cli

Command-line tools

20+ subcommands
Shell completions
JSON output

pmetal-py

Python bindings

PyO3 bridge
Async support
HuggingFace compat

pmetal-eval

Evaluation suite

Benchmark runners
Perplexity
Task metrics

pmetal-checkpoint

Checkpoint management

SafeTensors format
Resume training
Version tracking

pmetal-telemetry

Metrics & logging

Prometheus metrics
W&B integration
TensorBoard

Distributed Training

Scale Across Multiple Macs

Connect multiple Apple Silicon machines over your local network with zero configuration. mDNS discovery finds peers automatically; Ring All-Reduce synchronizes gradients efficiently.

Zero-Config Discovery: mDNS finds peers on your LAN automatically — no static IPs, no manual configuration.
Ring All-Reduce: Bandwidth-efficient gradient synchronization scales linearly with the number of nodes.
Fault Tolerance: Automatic checkpoint recovery and peer re-connection if a node drops during training.
Mixed Hardware: Combine M1, M2, M3, M4, and M5 machines in the same training cluster.

Node Discovery: mDNS Automatic
Gradient Sync: Ring All-Reduce
Fault Recovery: Checkpoint Resume
Topology: Peer-to-Peer Mesh
On-Premises: Your data never leaves your network. Cost = electricity.
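
The bandwidth-efficient synchronization above can be simulated in a few lines: each of the N peers exchanges one chunk per step with its ring neighbor, so a full reduce takes 2(N-1) steps regardless of gradient size. The single-process simulation below illustrates the algorithm; it is not PMetal's networking code.

```python
# Single-process simulation of Ring All-Reduce gradient averaging.
# One chunk per node for simplicity: N-1 reduce-scatter steps,
# then N-1 all-gather steps.

def ring_all_reduce(grads):
    n = len(grads)
    chunks = [list(g) for g in grads]   # chunks[node][chunk_index]
    # Reduce-scatter: after n-1 steps, node i holds the full sum of chunk (i+1) % n
    for step in range(n - 1):
        for i in range(n):
            c = (i - step) % n               # chunk node i forwards this step
            chunks[(i + 1) % n][c] += chunks[i][c]
    # All-gather: circulate the completed chunks for another n-1 steps
    for step in range(n - 1):
        for i in range(n):
            c = (i + 1 - step) % n
            chunks[(i + 1) % n][c] = chunks[i][c]
    # Average so every node ends with the mean gradient
    return [[x / n for x in node] for node in chunks]

out = ring_all_reduce([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]])
print(out[0])  # [4.0, 5.0, 6.0] -- every node holds the mean gradient
```

Because each node only ever talks to its ring neighbor, per-node traffic stays constant as peers are added, which is the property behind the linear scaling claim.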

Quickstart

From zero to fine-tuned model in minutes. Metal acceleration is automatic.

Install PMetal and verify Apple Silicon detection
# Install via cargo
cargo add pmetal

# Or pin the version in Cargo.toml:
#   [dependencies]
#   pmetal = "0.3"

# Install the CLI tool
cargo install pmetal-cli

# Verify installation + hardware detection
pmetal info

# Output:
# PMetal v0.3.13
# Hardware: Apple M3 Max
# Metal GPU: 40-core GPU (detected)
# Neural Engine: 16-core ANE (detected)
# Unified Memory: 128 GB
# Compute Strategy: Metal + ANE hybrid

Enterprise Security by Default

Every PMetal deployment is an air-gapped deployment. There is no opt-out because there is no cloud component to opt out of.

Zero Telemetry

PMetal makes zero outbound network requests. No usage metrics, no crash reports, no model weight syncing. Offline-first.

Air-Gapped Ready

Works entirely offline after initial model download. No CDN dependencies, no license checks, no cloud validation.

Data Stays Local

Training data, fine-tuned weights, evaluation results — everything stays on disk under your control.

MIT / Apache-2.0

Dual-licensed. Use commercially, modify freely, audit the full source. No proprietary blobs or binary-only components.

Auditable Codebase

With 18 focused crates, every piece of the stack is inspectable, replaceable, and independently verifiable.

No Subscription

Buy or build your hardware once. Fine-tune as many times as you need. Your cost model is electricity, nothing else.

Ready to own your AI?

PMetal is open-source and ready today. Talk to us about custom deployment, enterprise support, or fine-tuning your specific domain.