
mac-code — Free Local AI Agent on Apple Silicon

name: mac-code-local-ai-agent

by adisinghstudent · published 2026-04-01

Developer Tools · Data Processing
Last updated
2026-04
// Install command
$ claw add gh:adisinghstudent/adisinghstudent-mac-code-local-ai-agent
// Full documentation

---
name: mac-code-local-ai-agent
description: Run a free 35B AI coding agent on Apple Silicon Macs using local LLMs via llama.cpp or MLX with web search, shell, and file tools.
triggers:
  - "set up mac code local AI agent"
  - "run Claude Code alternative on Mac for free"
  - "local LLM agent on Apple Silicon"
  - "35B model on 16GB Mac"
  - "llama.cpp agent with tools on Mac"
  - "MLX local coding agent"
  - "out of RAM model inference Mac"
  - "mac-code setup and usage"
---

# mac-code — Free Local AI Agent on Apple Silicon

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

Run a 35B reasoning model locally on your Mac for $0/month. mac-code is a CLI AI coding agent (a Claude Code alternative) that routes each task (web search, shell command, file edit, or chat) through a local LLM. It supports two backends on Apple Silicon: llama.cpp (30 tok/s) and MLX (64K context, persistent KV cache).

---

## What It Does

- **LLM-as-router**: The model classifies every prompt as `search`, `shell`, or `chat` and routes accordingly
- **35B MoE at 30 tok/s** via llama.cpp + IQ2_M quantization (fits in 16 GB RAM)
- **35B full Q4 on 16 GB** via custom MoE Expert Sniper (1.54 tok/s, only 1.42 GB RAM used)
- **9B at 64K context** via quantized KV cache (`q4_0` keys/values)
- **MLX backend** adds persistent KV cache save/load, context compression, R2 sync
- **Tools**: DuckDuckGo search, shell execution, file read/write

---

## Installation

### Prerequisites

    brew install llama.cpp
    pip3 install rich ddgs huggingface-hub mlx-lm --break-system-packages

### Clone the repo

    git clone https://github.com/walter-grace/mac-code
    cd mac-code

### Download models

**35B MoE, fast daily driver (10.6 GB, fits in 16 GB RAM):**

    mkdir -p ~/models
    python3 -c "
    from huggingface_hub import hf_hub_download
    hf_hub_download(
        'unsloth/Qwen3.5-35B-A3B-GGUF',
        'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
        local_dir='$HOME/models/'
    )
    "

**9B, 64K context for long documents (5.3 GB):**

    python3 -c "
    from huggingface_hub import hf_hub_download
    hf_hub_download(
        'unsloth/Qwen3.5-9B-GGUF',
        'Qwen3.5-9B-Q4_K_M.gguf',
        local_dir='$HOME/models/'
    )
    "

---

## Starting the Backend

### Option A: llama.cpp + 35B MoE (recommended, 30 tok/s)

    llama-server \
        --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
        --port 8000 --host 127.0.0.1 \
        --flash-attn on --ctx-size 12288 \
        --cache-type-k q4_0 --cache-type-v q4_0 \
        --n-gpu-layers 99 --reasoning off -np 1 -t 4

### Option B: llama.cpp + 9B (64K context)

    llama-server \
        --model ~/models/Qwen3.5-9B-Q4_K_M.gguf \
        --port 8000 --host 127.0.0.1 \
        --flash-attn on --ctx-size 65536 \
        --cache-type-k q4_0 --cache-type-v q4_0 \
        --n-gpu-layers 99 --reasoning off -t 4

### Option C: MLX backend (persistent context, 9B)

    # Starts server on port 8000, downloads model on first run
    python3 mlx/mlx_engine.py

### Start the agent (all options)

    python3 agent.py
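
Because the agent talks to the backend on port 8000, it can help to wait until the server has finished loading the model before launching `agent.py`. The following is a minimal sketch, assuming the backend answers `GET /health` with HTTP 200 once it is ready (the Troubleshooting section uses the same endpoint for llama-server; whether the MLX engine exposes it is an assumption). `wait_for_backend` is a hypothetical helper and not part of mac-code.

```python
import time
import urllib.error
import urllib.request


def wait_for_backend(url="http://127.0.0.1:8000/health", timeout_s=120.0, poll_s=1.0):
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses.

    Hypothetical helper: returns True when the backend is ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused, timeout, or an HTTP error while loading
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Usage: start llama-server (or the MLX engine) in another terminal, then
#   if wait_for_backend(): launch `python3 agent.py`
```

Polling with a deadline avoids a fixed `sleep`, which is either too short for a 10 GB model or needlessly long for a warm start.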

---

## Agent CLI Commands

Inside the agent REPL, type `/` to list all commands:

| Command | Action |
|---|---|
| `/agent` | Agent mode with tools (default) |
| `/raw` | Direct streaming, no tools |
| `/model 9b` | Switch to 9B model (64K context) |
| `/model 35b` | Switch to 35B MoE |
| `/search <query>` | Quick DuckDuckGo search |
| `/bench` | Run speed benchmark |
| `/stats` | Session statistics |
| `/cost` | Show cost savings vs cloud |
| `/good` / `/bad` | Grade the last response |
| `/improve` | View response grading stats |
| `/clear` | Reset conversation |
| `/quit` | Exit |

### Example prompts

    > find all Python files modified in the last 7 days
    → routes to "shell", generates: find . -name "*.py" -mtime -7
    
    > who won the NBA finals
    → routes to "search", queries DuckDuckGo, summarizes
    
    > explain how attention works
    → routes to "chat", streams directly
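
The routing in these examples can be pictured as a classify-then-dispatch step. The sketch below is an illustration only: `parse_route`, `route`, and the keyword-based `fake_classify` are hypothetical names, and the stand-in classifier replaces the real LLM call so the code runs without a model server. How `agent.py` actually prompts the model and parses its reply is not shown in this document.

```python
from typing import Callable, Dict

ROUTES = ("search", "shell", "chat")


def parse_route(model_reply: str, default: str = "chat") -> str:
    """Extract the first known route label from a model reply, else fall back."""
    for word in model_reply.lower().split():
        label = word.strip(" \t\n\"'.,:;!`")
        if label in ROUTES:
            return label
    return default


def route(prompt: str,
          classify: Callable[[str], str],
          handlers: Dict[str, Callable[[str], str]]) -> str:
    """Ask `classify` (the LLM in practice) for a label, then call its handler."""
    label = parse_route(classify(prompt))
    return handlers[label](prompt)


# Stand-in classifier so the sketch runs without a model server.
def fake_classify(prompt: str) -> str:
    if "who won" in prompt:
        return "search"
    if "find all" in prompt:
        return 'Route: "shell"'
    return "I think this is chat."


handlers = {
    "search": lambda p: f"[search] {p}",
    "shell": lambda p: f"[shell] {p}",
    "chat": lambda p: f"[chat] {p}",
}

print(route("who won the NBA finals", fake_classify, handlers))
```

Falling back to `chat` when no label is recognized keeps a malformed model reply from triggering a shell command or a web request.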

---

## MLX Backend: Persistent KV Cache API

The MLX engine exposes a REST API on `localhost:8000`.

### Save context after processing a large codebase

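    # Build the JSON body with python3 so README.md is read and escaped correctly;
    # inside single quotes the shell would send "$(cat README.md)" literally.
    python3 -c 'import json; print(json.dumps({"name": "my-project", "prompt": open("README.md").read()}))' \
        | curl -X POST localhost:8000/v1/context/save \
            -H "Content-Type: application/json" \
            --data-binary @-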

### Load saved context instantly (0.0003s)

    curl -X POST localhost:8000/v1/context/load \
        -H "Content-Type: application/json" \
        -d '{"name": "my-project"}'

### Download context from Cloudflare R2 (cross-Mac sync)

    # Requires R2 credentials in environment
    export R2_ACCOUNT_ID=your_account_id
    export R2_ACCESS_KEY_ID=your_key_id
    export R2_SECRET_ACCESS_KEY=your_secret
    export R2_BUCKET=your_bucket_name
    
    curl -X POST localhost:8000/v1/context/download \
        -H "Content-Type: application/json" \
        -d '{"name": "my-project"}'
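
The same three endpoints can be called from Python. This is a minimal sketch using only the standard library, assuming the endpoints accept the JSON bodies shown in the curl commands above. `build_context_request` and `send` are hypothetical helper names, and the response format is not assumed, so the raw body is returned as text.

```python
import json
import urllib.request
from typing import Optional

BASE = "http://localhost:8000/v1/context"  # endpoints as documented above


def build_context_request(action: str, name: str,
                          prompt: Optional[str] = None) -> urllib.request.Request:
    """Build a POST request for /v1/context/{save,load,download}."""
    if action not in ("save", "load", "download"):
        raise ValueError(f"unknown action: {action}")
    body = {"name": name}
    if prompt is not None:
        body["prompt"] = prompt  # json.dumps escapes quotes and newlines
    return urllib.request.Request(
        f"{BASE}/{action}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 600.0) -> str:
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (requires the MLX engine to be running):
#   send(build_context_request("save", "my-project", open("README.md").read()))
#   send(build_context_request("load", "my-project"))
```

Building the body with `json.dumps` sidesteps shell quoting problems when the prompt contains quotes or newlines.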

### Standard OpenAI-compatible chat

    import requests
    
    response = requests.post("http://localhost:8000/v1/chat/completions", json={
        "model": "local",
        "messages": [{"role": "user", "content": "Write a Python quicksort"}],
        "stream": False
    })
    print(response.json()["choices"][0]["message"]["content"])

### Streaming chat

    import json
    import requests

    with requests.post("http://localhost:8000/v1/chat/completions", json={
        "model": "local",
        "messages": [{"role": "user", "content": "Explain transformers"}],
        "stream": True
    }, stream=True) as r:
        for line in r.iter_lines():
            if not line.startswith(b"data: "):
                continue  # skip blank keep-alive lines
            payload = line[6:]
            if payload == b"[DONE]":
                break  # end-of-stream sentinel is not JSON
            chunk = json.loads(payload)
            delta = chunk["choices"][0]["delta"].get("content") or ""
            print(delta, end="", flush=True)

---

## KV Cache Compression (MLX)

Compress context 4x with 99.3% similarity:

    from mlx.turboquant import compress_kv_cache
    from mlx.kv_cache import save_kv_cache, load_kv_cache
    
    # After building a KV cache from a long document
    compressed = compress_kv_cache(kv_cache, bits=4)  # 26.6 MB → 6.7 MB
    save_kv_cache(compressed, "my-project-compressed")
    
    # Load later
    kv = load_kv_cache("my-project-compressed")
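
The internals of `compress_kv_cache` are not shown in this document. As rough intuition for why 4-bit storage shrinks 16-bit values by close to 4x while staying similar to the original, here is a generic group-wise 4-bit quantizer in pure Python. It is a standard technique shown for illustration, not the repo's turboquant code. The group size of 32, the per-group min/scale scheme, and the random stand-in data are all assumptions, and the per-group parameters add some overhead on top of the packed codes.

```python
import math
import random

GROUP = 32  # values per quantization group (assumed, not taken from mac-code)


def quantize_4bit(values):
    """Quantize floats to 4-bit codes with a per-group (min, scale) pair.

    Returns (packed_bytes, params); two 4-bit codes share one byte.
    """
    packed, params = bytearray(), []
    for start in range(0, len(values), GROUP):
        group = values[start:start + GROUP]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 16 levels; avoid zero scale for flat groups
        params.append((lo, scale))
        codes = [min(15, max(0, round((v - lo) / scale))) for v in group]
        if len(codes) % 2:
            codes.append(0)  # pad to an even count for packing
        for a, b in zip(codes[0::2], codes[1::2]):
            packed.append(a | (b << 4))
    return bytes(packed), params


def dequantize_4bit(packed, params, n):
    """Invert quantize_4bit, returning n approximate floats."""
    out, group_bytes = [], GROUP // 2
    for g, (lo, scale) in enumerate(params):
        for byte in packed[g * group_bytes:(g + 1) * group_bytes]:
            out.append(lo + (byte & 0x0F) * scale)
            out.append(lo + (byte >> 4) * scale)
    return out[:n]


def cosine(a, b):
    """Cosine similarity between two equal-length float sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


rng = random.Random(0)
cache = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in for KV values
packed, params = quantize_4bit(cache)
restored = dequantize_4bit(packed, params, len(cache))
print(len(packed), "bytes of codes for", len(cache), "values")
print("cosine similarity:", round(cosine(cache, restored), 4))
```

Each value's reconstruction error is bounded by half of its group's scale, which is why small groups with their own range preserve similarity well.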

---

## Flash Streaming: Out-of-Core Inference

For models larger than your RAM (research mode):

    cd research/flash-streaming
    
    # Run 35B MoE Expert Sniper (22 GB model, 1.42 GB RAM)
    python3 moe_expert_sniper.py
    
    # Run 32B dense flash stream (18.4 GB model, 4.5 GB RAM)
    python3 flash_stream_v2.py

### How F_NOCACHE direct I/O works

    import os, fcntl
    
    # Open model file bypassing macOS Unified Buffer Cache
    fd = os.open("model.bin", os.O_RDONLY)
    fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)  # bypass page cache
    
    # Aligned read (16KB boundary for DART IOMMU)
    ALIGN = 16384
    offset = (layer_offset // ALIGN) * ALIGN
    data = os.pread(fd, layer_size + ALIGN, offset)
    weights = data[layer_offset - offset : layer_offset - offset + layer_size]
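
The alignment arithmetic above can be wrapped in a small helper and checked against a temporary file. This is a sketch under stated assumptions: `open_uncached` and `aligned_pread` are hypothetical names, the helper rounds the end of the read up to the next boundary (a variant of the `layer_size + ALIGN` over-read above), and `F_NOCACHE` is applied only where Python's `fcntl` module defines it, so the code also runs on non-macOS systems, where it performs ordinary cached reads.

```python
import os
import tempfile

ALIGN = 16384  # 16 KB, the alignment used in the snippet above

try:
    import fcntl
    F_NOCACHE = getattr(fcntl, "F_NOCACHE", None)  # defined on macOS only
except ImportError:  # fcntl is unavailable on Windows
    fcntl, F_NOCACHE = None, None


def open_uncached(path):
    """Open `path` read-only and request uncached I/O where the platform allows."""
    fd = os.open(path, os.O_RDONLY)
    if F_NOCACHE is not None:
        fcntl.fcntl(fd, F_NOCACHE, 1)
    return fd


def aligned_pread(fd, offset, size, align=ALIGN):
    """Read `size` bytes at `offset` using an aligned start and aligned length."""
    start = (offset // align) * align             # round start down
    end = -(-(offset + size) // align) * align    # round end up
    data = os.pread(fd, end - start, start)       # may be short at end of file
    return data[offset - start: offset - start + size]


# Demo on a throwaway file with a known byte pattern.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(bytes(i % 251 for i in range(100_000)))
    path = f.name

fd = open_uncached(path)
chunk = aligned_pread(fd, offset=20_000, size=1_000)
os.close(fd)
os.remove(path)
print(len(chunk), "bytes read")
```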

### MoE Expert Sniper pattern

    # Router predicts which 8 of 256 experts activate per token
    active_experts = router_forward(hidden_state)  # returns [8] indices
    
    # Load only those experts from SSD (8 threads, parallel pread)
    from concurrent.futures import ThreadPoolExecutor
    
    def load_expert(expert_idx):
        offset = expert_offsets[expert_idx]
        return os.pread(fd, expert_size, offset)
    
    with ThreadPoolExecutor(max_workers=8) as pool:
        expert_weights = list(pool.map(load_expert, active_experts))
    
    # ~14 MB loaded per layer instead of 221 MB (dense)
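
To make the pattern concrete, the following toy version runs end to end on a small temporary file. Everything about the layout is an assumption made for illustration: experts are stored back to back with a fixed size, the router scores are faked, and the sizes (16 experts, top 4, 1 KB each) are far smaller than the real model's. `top_k_experts` and `load_experts` are hypothetical names.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

NUM_EXPERTS, TOP_K, EXPERT_SIZE = 16, 4, 1024  # toy sizes, not the real model's


def top_k_experts(scores, k=TOP_K):
    """Return indices of the k highest router scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


def load_experts(fd, indices, expert_size=EXPERT_SIZE, workers=TOP_K):
    """Read only the selected experts from disk, one pread per expert, in parallel."""
    def load(idx):
        return os.pread(fd, expert_size, idx * expert_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(indices, pool.map(load, indices)))


# Build a toy "model file": expert i is EXPERT_SIZE bytes of value i.
with tempfile.NamedTemporaryFile(delete=False) as f:
    for i in range(NUM_EXPERTS):
        f.write(bytes([i]) * EXPERT_SIZE)
    path = f.name

# Fake router output: a fixed permutation of scores, one per expert.
scores = [((i * 7) % NUM_EXPERTS) / NUM_EXPERTS for i in range(NUM_EXPERTS)]
active = top_k_experts(scores)

fd = os.open(path, os.O_RDONLY)
weights = load_experts(fd, active)
os.close(fd)
os.remove(path)

bytes_read = sum(len(w) for w in weights.values())
print(f"experts {active}: read {bytes_read} of {NUM_EXPERTS * EXPERT_SIZE} bytes")
```

`os.pread` takes an explicit offset and does not move a shared file position, which is what makes concurrent reads on one file descriptor safe.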

---

## Common Patterns

### Use as a Python library (direct API calls)

    import requests
    
    BASE = "http://localhost:8000/v1"
    
    def ask(prompt: str, system: str = "You are a helpful coding assistant.") -> str:
        r = requests.post(f"{BASE}/chat/completions", json={
            "model": "local",
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": prompt}
            ]
        }, timeout=300)
        r.raise_for_status()  # surface HTTP errors instead of a confusing KeyError
        return r.json()["choices"][0]["message"]["content"]
    
    # Examples
    print(ask("Write a Python function to parse JSON safely"))
    print(ask("Explain this error: AttributeError: NoneType has no attribute split"))

### Process a large file with paged inference

    from mlx.paged_inference import PagedInference
    
    engine = PagedInference(model="mlx-community/Qwen3.5-9B-4bit")
    
    with open("large_codebase.txt") as f:
        content = f.read()  # beyond single context window
    
    # Automatically pages through content
    result = engine.summarize(content, question="What does this codebase do?")
    print(result)
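
How `PagedInference` splits and merges content is not described here. One common way to handle input beyond a single context window is to cut it into overlapping pages, ask the question per page, then combine the partial answers. The sketch below shows that idea under stated assumptions: `paginate` and `summarize_paged` are hypothetical names, pages are measured in characters rather than tokens, and `ask` is any function that sends a prompt to the model (for example the `ask` helper from the previous pattern).

```python
from typing import Callable, List


def paginate(text: str, page_chars: int = 8000, overlap: int = 200) -> List[str]:
    """Split text into pages of at most page_chars, each overlapping the previous."""
    if not 0 <= overlap < page_chars:
        raise ValueError("overlap must be >= 0 and smaller than page_chars")
    pages, step = [], page_chars - overlap
    for start in range(0, max(len(text), 1), step):
        pages.append(text[start:start + page_chars])
        if start + page_chars >= len(text):
            break  # this page reached the end of the text
    return pages


def summarize_paged(text: str, question: str, ask: Callable[[str], str], **kw) -> str:
    """Map: answer per page. Reduce: ask once more over the per-page notes."""
    notes = [ask(f"{question}\n\n---\n{page}") for page in paginate(text, **kw)]
    if len(notes) == 1:
        return notes[0]
    return ask(f"{question}\n\nCombine these partial answers:\n" + "\n".join(notes))


# Tiny demonstration with a 25-character "document".
print(paginate("abcdefghijklmnopqrstuvwxy", page_chars=10, overlap=2))
```

The overlap keeps a sentence or function that straddles a page boundary visible in full on at least one page.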

### Monitor server performance

    python3 dashboard.py

---

## Model Selection Guide

| Your Mac RAM | Best Option | Command |
|---|---|---|
| 8 GB | 9B Q4_K_M | `--model ~/models/Qwen3.5-9B-Q4_K_M.gguf --ctx-size 4096` |
| 16 GB | 35B IQ2_M (30 tok/s) | Default Option A above |
| 16 GB (quality) | 35B Q4 Expert Sniper | `python3 research/flash-streaming/moe_expert_sniper.py` |
| 48 GB | 35B Q4_K_M native | Download full Q4, `--n-gpu-layers 99` |
| 192 GB | 397B frontier | Any large GGUF, full offload |
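
The table can be read as a set of minimum-RAM tiers. A small sketch of that lookup follows; `recommend` is a hypothetical helper, and treating in-between sizes (for example 24 or 32 GB) as the next lower tier is an assumption, since the table lists only specific sizes.

```python
def recommend(ram_gb: float) -> str:
    """Map installed RAM to the highest row of the table above that it satisfies."""
    tiers = [
        (192, "397B frontier (any large GGUF, full offload)"),
        (48, "35B Q4_K_M native"),
        (16, "35B IQ2_M via llama.cpp (Option A)"),
        (8, "9B Q4_K_M with --ctx-size 4096"),
    ]
    for min_gb, choice in tiers:
        if ram_gb >= min_gb:
            return choice
    return "below 8 GB: no configuration listed"


for gb in (8, 16, 32, 48, 192):
    print(f"{gb:>3} GB -> {recommend(gb)}")
```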

---

## Troubleshooting

### Server not responding on port 8000

    # Check if server is running
    curl http://localhost:8000/health
    
    # Check what's on port 8000
    lsof -i :8000
    
    # Restart llama-server with verbose logging
    llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
        --port 8000 --verbose

### Model download fails / incomplete

    # Re-run the same download; huggingface_hub resumes partial files when it can.
    # The old resume_download argument is deprecated and no longer needed.
    python3 -c "
    from huggingface_hub import hf_hub_download
    hf_hub_download(
        'unsloth/Qwen3.5-35B-A3B-GGUF',
        'Qwen3.5-35B-A3B-UD-IQ2_M.gguf',
        local_dir='$HOME/models/'
    )
    "

### Slow inference / RAM pressure on 16 GB Mac

    # Reduce context size to free RAM (4096, down from 12288)
    llama-server --model ~/models/Qwen3.5-35B-A3B-UD-IQ2_M.gguf \
        --port 8000 --ctx-size 4096 \
        --cache-type-k q4_0 --cache-type-v q4_0 \
        --n-gpu-layers 99 -t 4
    
    # Or switch to 9B for lower RAM usage
    python3 agent.py
    # Then: /model 9b

### MLX engine crashes with memory error

    # MLX uses unified memory — check pressure
    vm_stat | grep "Pages free"
    
    # Reduce batch size in mlx_engine.py
    # Edit: max_batch_size = 512  →  max_batch_size = 128

### F_NOCACHE not bypassing page cache (macOS Sonoma+)

    # Verify F_NOCACHE is accepted; fcntl.fcntl raises OSError if the call fails
    import fcntl, os
    fd = os.open(model_path, os.O_RDONLY)  # model_path: path to your model file
    try:
        fcntl.fcntl(fd, fcntl.F_NOCACHE, 1)
    except OSError as e:
        raise SystemExit(f"F_NOCACHE failed ({e}); check macOS version and SIP status")

### `ddgs` search fails

    pip3 install --upgrade ddgs --break-system-packages
    # ddgs uses DuckDuckGo — no API key required, but may rate-limit
    # Retry after 60 seconds if you get a 202 response
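
The retry advice can be automated with a small backoff wrapper. This is a generic sketch, not code from mac-code: `with_retries` is a hypothetical helper, the doubling delay is an assumption layered on the 60-second suggestion above, and the `sleep` parameter exists only so the wait can be replaced in tests.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn(); on any exception wait base_delay * 2**n, then try again."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** n)
    raise ValueError("attempts must be at least 1")


# Usage with ddgs (needs network access):
#   from ddgs import DDGS
#   results = with_retries(lambda: DDGS().text("llama.cpp flash attention", max_results=5))
```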

### Wrong reshape on GGUF dequantization

    # GGUF lists dimensions fastest-varying first: ne[0] is the contiguous row length,
    # so the row-major NumPy shape is the reverse of ne.
    weights = dequantized_flat.reshape(ne[1], ne[0])   # CORRECT
    # NOT: dequantized_flat.reshape(ne[0], ne[1]).T     # WRONG
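
A six-element example shows why the two reshapes differ even though both yield a 2x3 result. The sketch uses plain Python lists so it runs without NumPy; `gguf_to_rows` is a hypothetical helper that mirrors `reshape(ne[1], ne[0])`, and `wrong` mirrors `reshape(ne[0], ne[1]).T`.

```python
def gguf_to_rows(flat, ne):
    """Rebuild a 2-D tensor from a flat GGUF buffer as a list of rows.

    ne = [ne0, ne1]: ne0 contiguous elements per row, ne1 rows.
    Element (row r, col c) lives at flat[r * ne0 + c].
    """
    ne0, ne1 = ne
    if len(flat) != ne0 * ne1:
        raise ValueError("buffer length does not match ne")
    return [flat[r * ne0:(r + 1) * ne0] for r in range(ne1)]


flat = list(range(6))   # values 0..5 in storage order
ne = [3, 2]             # 3 elements per row, 2 rows

rows = gguf_to_rows(flat, ne)  # same layout as reshape(ne[1], ne[0])

# Equivalent of reshape(ne[0], ne[1]).T: same shape, scrambled contents.
as_3x2 = [flat[i * ne[1]:(i + 1) * ne[1]] for i in range(ne[0])]
wrong = [list(col) for col in zip(*as_3x2)]

print("correct:", rows)
print("wrong:  ", wrong)
```

Both results have two rows of three, so a shape check alone will not catch the mistake; only the element order reveals it.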

---

## Architecture Summary

    agent.py
      ├── Intent classification → "search" | "shell" | "chat"
      ├── search → ddgs.DDGS().text() → summarize
      ├── shell  → generate command → subprocess.run()
      └── chat   → stream directly
    
    Backends (both expose OpenAI-compatible API on :8000)
      ├── llama.cpp  → fast, standard, no persistence
      └── mlx/       → KV cache save/load/compress/sync
    
    Flash Streaming (research/)
      ├── moe_expert_sniper.py  → 35B Q4, 1.42 GB RAM
      └── flash_stream_v2.py    → 32B dense, 4.5 GB RAM
          └── F_NOCACHE + pread + 16KB alignment