
Building a Private Cloud for Local AI: The Small Business Hardware Guide (2026)

Build your own private AI infrastructure with the right hardware. Compare workstations, NAS storage, and 10GbE networking for running LLMs locally—from $1,700 starter labs to $15K enterprise setups.

Nandor Katai
Founder & IT Consultant
14 min read

Affiliate Disclosure: This article contains affiliate links. If you make a purchase through these links, we may earn a small commission at no extra cost to you.

The Cloud Hangover

With Copilot at $30/user/month and ChatGPT Enterprise at roughly $60/user/month, a 20-person team can easily spend over $15,000 annually on AI subscriptions alone. Add the privacy concern—every document you analyze flows through third-party servers—and you understand why companies are building private AI infrastructure.

Running models like Llama 3 or DeepSeek on your own hardware means your data stays on your network and your costs become predictable capital expenses.

But you can't do this on a laptop. You need infrastructure.

🚀 Updated for CES 2026

The hardware landscape shifted this week. UGREEN launched the NASync iDX series—NAS devices with built-in Intel Core Ultra processors and NPUs for local AI inference. Plugable revealed the TBT5-AI, a Thunderbolt 5 enclosure that brings desktop-class GPU power to laptops. We've updated our recommendations below to reflect these "AI-Native" devices alongside proven workstation options.


The Private AI Architecture: Anatomy of a Build

A local AI cloud isn't just a powerful computer. It's a triad of interconnected systems. If one bottlenecks, the whole stack suffers.

The Three Pillars

The Brain (Compute): Your workstation or server with GPU and CPU
The Memory (Storage): Your NAS for vector database and document storage
The Nervous System (Network): 10GbE connections tying everything together

Consider a typical AI workflow: Your workstation runs inference on a 70B model. That model queries a vector database containing your company documents. The database lives on your NAS. Every query travels over your network.

A fast GPU with slow storage? Bottleneck. Fast storage with a 1 Gigabit connection? Bottleneck. The system is only as fast as its weakest link.

For each component category, we'll present two options:

  • The Best Performance: Maximum power, often custom or "whitebox," suited for technical teams
  • The Smartest Buy: Enterprise reliability, better support, easier maintenance—suited for business environments

"The Brain": Workstation & GPU Selection

The workstation handles the computational heavy lifting—running AI model inference. The critical metric here isn't CPU cores or RAM (though both matter). It's VRAM.

Understanding VRAM Requirements

VRAM (Video RAM) is the memory on your graphics card where AI models actually live during inference. Unlike system RAM, VRAM directly determines which models you can run and how fast.

Running a 70B parameter model—comparable to GPT-4 for many business tasks—requires substantial VRAM:

| Quantization | VRAM Required | Hardware Needed |
|---|---|---|
| FP16 (full precision) | ~140 GB | Multi-GPU server |
| INT8 (8-bit) | ~70 GB | Dual professional GPUs |
| INT4 (4-bit) | ~35-40 GB | Dual RTX 5090 or RTX A6000 |
| Smaller models (7-30B) | 4-18 GB | Single consumer GPU |

For serious business applications, Q4 (4-bit quantization) on a 70B model hits the sweet spot: near-full-precision quality with reasonable hardware requirements. But "reasonable" still means 35-40GB of VRAM—more than any single consumer graphics card provides.
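The table's numbers follow from simple arithmetic: parameter count times bytes per weight. A quick sketch of that weights-only estimate (real deployments need extra VRAM for the KV cache and activations, which is why the table's INT4 row reads 35-40 GB rather than a flat 35):

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: int) -> float:
    """Weights-only VRAM footprint for a model at a given quantization.

    params_billion: model size in billions of parameters
    bits_per_weight: 16 for FP16, 8 for INT8, 4 for INT4
    """
    return params_billion * 1e9 * (bits_per_weight / 8) / 1e9

# A 70B model, matching the table above:
print(estimate_vram_gb(70, 16))  # 140.0 GB -- FP16
print(estimate_vram_gb(70, 8))   # 70.0 GB  -- INT8
print(estimate_vram_gb(70, 4))   # 35.0 GB  -- INT4
```

The same arithmetic explains why a single 24GB or 32GB consumer card handles 30B-class models but not 70B ones.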

Future-Proofing

"Agentic AI"—models that take autonomous actions rather than just answering questions—will require even more VRAM and compute power. Plan for expandability.


Option 1: The Best Performance (Custom Build)

For maximum power and upgradeability, a custom workstation built around AMD Threadripper and NVIDIA RTX GPUs offers the best performance per dollar.

AMD Threadripper 9000 Series (Zen 5)

The Threadripper 9000 series, released July 2025, brings Zen 5 architecture to high-end workstations. For AI builds, the key specification is PCIe lanes.

| Spec | Threadripper 9000 (HEDT) | Threadripper PRO 9000 WX |
|---|---|---|
| Max Cores | 64 (9980X) | 96 (9995WX) |
| PCIe 5.0 Lanes | 80 | 128 |
| Memory | Quad-channel DDR5, up to 1TB | Octa-channel DDR5 ECC, up to 2TB |
| TDP | 350W | 350W |
| Starting Price | $1,499 (24-core) | $11,699 (96-core) |

Why do PCIe lanes matter? Each RTX 5090 GPU requires 16 PCIe lanes for full bandwidth. Add NVMe storage and a 10GbE network card, and mainstream platforms run out of lanes quickly. Threadripper's 80-128 lanes support multi-GPU configurations without bottlenecks.
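Tallying a dual-GPU build's lane demand makes the point concrete. The per-component lane counts below are typical values for illustration, not a specific motherboard's allocation:

```python
# Typical PCIe lane demand for a dual-GPU AI workstation (illustrative values)
lane_budget = {
    "GPU 1 (x16)": 16,
    "GPU 2 (x16)": 16,
    "NVMe SSD 1 (x4)": 4,
    "NVMe SSD 2 (x4)": 4,
    "10GbE NIC (x8)": 8,
}

total = sum(lane_budget.values())
print(f"Lanes required: {total}")  # 48 -- well beyond the ~24 usable lanes on mainstream desktops
print(f"Spare lanes on Threadripper 9000 HEDT: {80 - total}")
```

On a mainstream platform, the second GPU would be forced down to x8 or x4, cutting bandwidth for model loading and multi-GPU inference.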

NVIDIA RTX 5090 (Blackwell Architecture)

The RTX 5090, launched January 2025, represents the current pinnacle of consumer AI hardware:

| Spec | RTX 5090 |
|---|---|
| VRAM | 32 GB GDDR7 |
| Memory Bandwidth | 1,792 GB/s |
| CUDA Cores | 21,760 |
| Tensor Cores | 680 (5th generation) |
| TGP | 575W |
| MSRP | $1,999 |

GPU Pricing Reality

Due to AI demand and memory shortages, RTX 5090 street prices currently range from $3,000-$4,000—well above the $1,999 MSRP. Plan budgets accordingly.

A single RTX 5090 handles 30B models comfortably. For 70B models, dual RTX 5090s provide 64GB of combined VRAM—enough headroom for the model plus working memory.

Custom Build Advantages:

  • Fastest inference speeds available
  • Fully upgradable as needs grow
  • Lower per-component cost
  • Full CUDA ecosystem compatibility

Custom Build Considerations:

  • Requires technical expertise for assembly and maintenance
  • Power requirements: Dual RTX 5090s + Threadripper demand 1600W+ PSU (ATX 3.1 standard) and a dedicated 20A circuit. Standard 15A office outlets will trip breakers.
  • Noise: This build sounds like a jet engine under load. Install in a server closet or dedicated room—not under a desk.

Option 2: The Smartest Buy (Turnkey Workstation)

For businesses prioritizing reliability, support, and ease of deployment, enterprise workstations from Dell or HP offer tested configurations with professional support.

Dell Precision 5860

The Precision 5860 tower workstation balances professional-grade components with upgradeability:

| Spec | Dell Precision 5860 |
|---|---|
| Processor | Intel Xeon W-2400 series (up to 24 cores) |
| RAM | Up to 2TB DDR5 ECC (8 DIMM slots) |
| GPU Support | Up to 2x double-wide professional GPUs |
| GPU Options | RTX A4500 (20GB), RTX A5000 (24GB), RTX A6000 (48GB) |
| Storage | Up to 72TB (NVMe + SATA) |
| Network | Dual Ethernet (1G + 10G) |
| Starting Price | $2,049 |

A mid-range configuration with Xeon processor, 64GB RAM, and an RTX A6000 typically runs $4,500-$8,900 depending on specifications.

View Dell Precision 5860

HP Z8 Fury G5

For maximum GPU capacity, the HP Z8 Fury G5 supports up to four professional graphics cards:

| Spec | HP Z8 Fury G5 |
|---|---|
| Processor | Intel Xeon W-series 4th/5th Gen (up to 60 cores) |
| RAM | Up to 2TB DDR5 ECC (16 DIMM slots) |
| GPU Support | Up to 4x double-wide GPUs |
| GPU Options | 4x NVIDIA RTX 6000 Ada (48GB each) |
| Storage | Up to 120TB |
| Power | 2250W (dual 1,125W supplies) |
| Starting Price | $2,945 |

For the most demanding AI workloads—multi-model inference, model fine-tuning, or future expansion—the Z8 Fury's four-GPU capacity provides headroom no consumer platform matches.

View HP Z8 Fury G5

Turnkey Workstation Advantages:

  • Enterprise support with next-business-day onsite service
  • ECC memory prevents silent data corruption
  • Tested thermal designs and power delivery
  • Validated driver and firmware combinations

Turnkey Workstation Considerations:

  • Higher cost than equivalent custom builds
  • Limited to manufacturer-supported configurations
  • Some upgrades may void warranty

For additional guidance on workstation specifications, see our Business Computer Specs Guide.


Option 3: Mobile AI (Laptop + Thunderbolt 5)

For laptop-first offices, the newly announced Plugable TBT5-AI bridges the gap between portability and local AI power.

Plugable TBT5-AI Enclosure (CES 2026)

| Spec | Plugable TBT5-AI |
|---|---|
| Interface | Thunderbolt 5 (80Gbps, up to 120Gbps) |
| GPU Support | Customer-selectable NVIDIA/AMD/Intel |
| Max VRAM | Up to 96GB (depending on installed GPU) |
| Power Supply | 850W internal (600W to GPU) |
| USB Power Delivery | 96W to host laptop |
| Network | 2.5GbE Ethernet |
| AI Stack | Microsoft Foundry Local, Google MCP |

When to Choose

The TBT5-AI turns a MacBook Pro, Dell XPS, or any Thunderbolt 5 laptop into an AI workstation with desktop-class GPU power. Your data stays strictly within the office firewall—no cloud required. Ideal for consultants who need to demonstrate AI solutions on-site, or creative studios wanting portable + powerful.

Compatibility Note: Requires Windows 11 + Thunderbolt 5 host system for full performance. Thunderbolt 4 works with reduced bandwidth.


The Software Stack: What Runs on This Hardware

Hardware is only half the equation. Once your infrastructure is ready, you need software to run AI models and make them useful.

| Layer | Tool | Purpose |
|---|---|---|
| Model Runtime | Ollama | Download and run open-source LLMs locally. Simple CLI, no cloud account required. |
| Beginner GUI | LM Studio | Drag-and-drop model management with a polished interface. Ideal for the Starter/Lab tier. |
| Production Inference | vLLM | High-throughput inference server for multi-user deployments. |
| RAG Pipeline | AnythingLLM | Connect your documents to local models. Handles embeddings and vector search. |
| Web Interface | Open WebUI | ChatGPT-style browser interface for non-technical users. |

For a single-user lab, LM Studio or Ollama + Open WebUI gets you running in under an hour. For production RAG deployments, AnythingLLM or similar tools index your documents and enable natural language queries against your file archive.
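As a sketch of that single-user path (assumes Ollama and Docker are already installed; the model name and image tag are current as of writing and may change):

```shell
# Pull and chat with an open model locally -- no cloud account needed
ollama pull llama3
ollama run llama3 "Summarize the key terms of a net-60 payment clause."

# Add a ChatGPT-style web UI at http://localhost:3000
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```

Open WebUI auto-detects the local Ollama instance, giving non-technical staff a familiar chat interface against models that never leave your network.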


"The Memory": High-Speed Data Storage

Your AI workstation runs the model, but where does it get the information to answer questions about your business? That's where storage comes in.

Understanding RAG (Retrieval-Augmented Generation)

RAG allows your AI to search your company documents before answering. Ask "What were the key terms in the Anderson contract?" and the system:

  1. Searches a vector database of your indexed documents
  2. Retrieves relevant sections
  3. Provides them to the AI model as context
  4. Generates an informed answer
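Step 1 is just nearest-neighbor search over embedding vectors. A toy sketch with hand-made 3-dimensional "embeddings" shows the mechanic (a real deployment would use an embedding model and a vector database such as Chroma or Weaviate; the chunks and vectors here are hypothetical):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical indexed document chunks with toy embedding vectors
chunks = {
    "Anderson contract: net-60 payment, 12-month term": [0.9, 0.1, 0.0],
    "Office lunch menu for Friday":                     [0.0, 0.2, 0.9],
    "Anderson renewal: pricing locked through 2027":    [0.8, 0.3, 0.1],
}

query_embedding = [0.95, 0.2, 0.05]  # stand-in embedding for "Anderson contract terms?"

# Step 1: retrieve the top-2 most similar chunks; steps 2-4 would pass
# these to the model as context and generate the answer.
top = sorted(chunks, key=lambda c: cosine(chunks[c], query_embedding), reverse=True)[:2]
print(top)  # both Anderson chunks rank above the lunch menu
```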

This requires fast storage. Not just for model files (a 70B model is roughly 40GB), but for vector databases that can grow significantly larger than the source documents—vector embeddings amplify storage needs by approximately 10x.

Storage Reality

100TB of company documents can become 1PB of vector embeddings. Start with fast flash storage; plan for expansion.

Mechanical hard drives are fine for archiving source documents, but the vector database must live on NVMe flash. The random I/O patterns of vector search demand it.
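The amplification is easy to sanity-check: embeddings are fixed-size float vectors per chunk, regardless of how compact the source text is. A rough sizing sketch (chunk size, dimensionality, and index overhead are illustrative assumptions; overlapping chunks and metadata push real-world ratios higher, toward the ~10x figure above):

```python
def embedding_storage_gb(corpus_gb: float, chunk_bytes: int = 1024,
                         dims: int = 1024, bytes_per_float: int = 4,
                         index_overhead: float = 1.5) -> float:
    """Estimate vector-store size for a document corpus.

    corpus_gb: size of source documents
    chunk_bytes: average chunk the documents are split into
    dims: embedding dimensionality (float32 assumed)
    index_overhead: multiplier for HNSW/index structures (assumption)
    """
    n_chunks = corpus_gb * 1e9 / chunk_bytes
    raw_gb = n_chunks * dims * bytes_per_float / 1e9
    return raw_gb * index_overhead

# 1 GB of documents, 1 KB chunks, 1024-dim float32 embeddings:
print(embedding_storage_gb(1))  # 6.0 -- ~6 GB of vectors per 1 GB of text
```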


Option 1: The AI-Native Choice (CES 2026)

UGREEN's CES 2026 announcement changes the game for local AI storage. The NASync iDX series integrates compute power directly into the NAS.

UGREEN NASync iDX6011 Pro

| Spec | UGREEN NASync iDX6011 Pro |
|---|---|
| Processor | Intel Core Ultra 7 255H with NPU |
| RAM | 64GB LPDDR5x |
| Drive Bays | 6x SATA + 2x NVMe |
| Max Storage | 196TB |
| Network | Dual 10GbE (20Gbps aggregated) |
| Expansion | Thunderbolt 4, OCuLink for eGPU |
| AI Capability | Built-in NPU, runs vector databases locally |
| Price | $1,599 (early-bird) / $2,599 (MSRP) |

Game Changer

Unlike traditional NAS devices, the iDX6011 Pro can run your vector database (Weaviate, Chroma) and RAG pipeline directly on the NAS—no workstation required for basic AI workloads. The Intel Core Ultra NPU handles embeddings and inference locally. For teams wanting a single "AI Lab in a Box," this is the new gold standard.

Best For: Technical teams who want to run AI workloads directly on storage, startups consolidating infrastructure, "Lab in a Box" deployments.

View UGREEN NASync iDX6011 Pro



Option 2: Pure Speed (All-Flash NAS)

For maximum throughput with a proven platform, an all-NVMe NAS saturates a 10GbE connection easily.

TerraMaster F8 SSD Plus

The F8 SSD Plus packs 8 NVMe slots into a palm-sized enclosure:

| Spec | TerraMaster F8 SSD Plus |
|---|---|
| Drive Bays | 8x M.2 2280 NVMe |
| Processor | Intel Core i3 N305 (8-core, up to 3.8GHz) |
| RAM | 16GB DDR5 (expandable to 32GB) |
| Max Storage | 64TB raw (8x 8TB NVMe) |
| Network | 10GbE RJ45 |
| Performance | 1,020 MB/s read/write |
| Price | $799-$849 (diskless) |

The F8 SSD Plus delivers consistent, low-latency performance ideal for vector databases. Its compact size (177 x 60 x 140 mm) fits easily on a desk or shelf.

Best For: Dedicated vector database storage, RAG workloads requiring sub-100ms latency.

View TerraMaster F8 SSD Plus

Option 3: The Reliable Ecosystem (File Storage)

For document archives, backups, and source file storage, Synology's software ecosystem is unmatched—but it's designed for file storage, not AI compute.

Synology DS925+

Released April 2025, the DS925+ updates Synology's popular 4-bay lineup:

| Spec | Synology DS925+ |
|---|---|
| Drive Bays | 4x SATA (3.5" or 2.5") |
| M.2 Slots | 2x NVMe (cache or storage pool) |
| Processor | AMD Ryzen V1500B quad-core 2.2GHz |
| RAM | 4GB DDR4 ECC (expandable to 32GB) |
| Network | 2x 2.5GbE (link aggregation) |
| Expansion | Up to 9 drives with DX525 |
| Performance | 522 MB/s read, 565 MB/s write |
| Price | $620 |

Important: Vector Database Location

The V1500B processor dates from 2018 (Zen 1 architecture). While excellent for file serving, backup, and document archives, it is too weak for AI-specific workloads like running vector databases (Weaviate, Chroma, Milvus) or containerized RAG pipelines.

Recommendation: Use the Synology for source document storage and backups. Run your vector database on NVMe storage inside your workstation, where the CPU/GPU can handle the indexing and retrieval workloads.

Synology's value isn't just hardware—it's software:

  • Synology Drive: Client file sync across Windows, Mac, and mobile
  • Active Backup: Centralized backup for endpoints and servers
  • Synology Hybrid RAID (SHR): Mix drive sizes with automatic optimization
  • QuickConnect: Secure remote access without port forwarding
  • Surveillance Station: If you add cameras later

Add NVMe cache drives to the M.2 slots for accelerated read performance on frequently-accessed files.

View Synology DS925+

For a detailed comparison of NAS options, see our UGREEN vs Synology NAS Comparison and Synology NAS Business Guide.


"The Nervous System": 10GbE Networking

Fast compute and fast storage mean nothing if the connection between them bottlenecks. This is where many AI deployments fail.

The Speed Gap

Consider real transfer times for a 100GB dataset (common for large document archives or AI model files):

| Network Speed | Transfer Time (100GB) |
|---|---|
| 1 Gigabit Ethernet | ~13-15 minutes |
| 2.5 Gigabit Ethernet | ~5-6 minutes |
| 10 Gigabit Ethernet | ~80-90 seconds |

For iterative AI development—testing queries, refining prompts, updating vector databases—that difference compounds into hours saved per week.
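The table's figures come straight from dividing payload by line rate (the real-world ranges account for protocol overhead, which typically costs another 5-10%):

```python
def transfer_seconds(gigabytes: float, link_gbps: float) -> float:
    """Idealized transfer time: payload converted to gigabits, divided by line rate."""
    return gigabytes * 8 / link_gbps

for speed in (1, 2.5, 10):
    s = transfer_seconds(100, speed)
    print(f"{speed} Gbps: {s:.0f} s (~{s / 60:.1f} min)")
# 1 Gbps -> 800 s; 2.5 Gbps -> 320 s; 10 Gbps -> 80 s
```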

For comprehensive networking planning, our 10 Gigabit Ethernet Guide covers the full landscape.


The Smartest Buy: UniFi Pro Max

For small and medium businesses, Ubiquiti's UniFi platform delivers enterprise-grade features at a fraction of enterprise cost. (Enterprise options like Cisco Catalyst exist, but their pricing puts them outside typical SMB budgets.)

UniFi Switch Pro Max 24 PoE

| Spec | USW-Pro-Max-24-PoE |
|---|---|
| Ports | 8x 2.5GbE PoE++, 16x 1GbE PoE+/++, 2x 10G SFP+ |
| PoE Budget | 400W total |
| Switching Capacity | 112 Gbps |
| Layer 3 | DHCP, inter-VLAN routing, static routes |
| Feature | Etherlighting™ per-port LED status |
| Management | UniFi Network application |
| Price | $799 |

Etherlighting illuminates each port with customizable colors indicating VLAN assignment, link speed, or device type. For troubleshooting—finding which port connects to which device—it's remarkably useful.

View UniFi Pro Max 24

Deployment Recommendation:

  • Use SFP+ fiber for the workstation-to-switch backbone (lower latency, cleaner runs)
  • Use Cat6A copper for client connections and cameras
  • Consider PoE for powering access points and cameras without separate power runs

For cabling guidance, see our Cat6 vs Fiber Business Guide.

For more on UniFi's Etherlighting feature, see our UniFi Pro Max Etherlighting deep dive.


The Build Tiers: Bill of Materials

Based on research and real-world deployments—now updated with CES 2026 announcements—here are three tiers for private AI infrastructure:

| Component | Starter (The Lab) | Pro (The Agency) | Enterprise (HQ) |
|---|---|---|---|
| Compute + Storage | UGREEN NASync iDX6011 Pro | Dell Precision 5860 | Custom TR 9000 + 2x RTX 5090 |
| Storage | (built-in) | Synology DS925+ | TerraMaster F8 SSD Plus |
| Network | UniFi Switch Lite 8 PoE | UniFi Pro Max 24 PoE | UniFi Enterprise XG 24 |
| Est. Cost | ~$1,700 | ~$6,500 | ~$15,000+ |
| Best For | Solo labs, startups | Small agencies (5-15) | Growing companies (15+) |

Starter Tier (~$1,700): The Lab in a Box (CES 2026)

Use Case: Individual experimentation, proof-of-concept testing, startups on a budget

| Component | Product | Est. Price |
|---|---|---|
| Compute + Storage | UGREEN NASync iDX6011 Pro | ~$1,599 (early-bird) |
| Network | UniFi Switch Lite 8 PoE | ~$110 |

The iDX6011 Pro is the star of CES 2026's AI hardware: an all-in-one NAS + AI compute device. The Intel Core Ultra 7 processor with built-in NPU runs vector databases and small LLMs (up to ~13B parameters) directly on the device—no separate workstation required.

One Box, Zero Compromise

For solo labs and small teams, the iDX6011 Pro eliminates the "compute vs. storage" trade-off. Install Ollama, index your documents, and start querying. Total cost under $1,800 including a network switch.

Limitations: The NPU handles embeddings and small models well, but for 70B models, you'll still need a dedicated GPU workstation (see Pro tier).


Pro Tier (~$6,500): The Agency

Use Case: Small teams, production RAG deployments, client-facing AI applications

| Component | Product | Est. Price |
|---|---|---|
| Compute | Dell Precision 5860 (Xeon, 64GB, RTX A6000) | ~$4,500 |
| Storage | Synology DS925+ with 4x 8TB drives | ~$1,200 |
| Network | UniFi Pro Max 24 PoE | ~$800 |

This configuration runs 70B models on the RTX A6000's 48GB VRAM. The Synology DS925+ handles file storage, backups, and Synology Drive sync—while your vector database runs on the Dell's internal NVMe for maximum performance. The UniFi Pro Max handles PoE for access points and cameras while providing 10G uplinks.

Dell's enterprise support means next-business-day service if hardware fails—critical for production systems.


Enterprise Tier (~$15,000+): The HQ

Use Case: Larger organizations, multi-user inference, model fine-tuning, redundancy requirements

| Component | Product | Est. Price |
|---|---|---|
| Compute | Custom Threadripper 9980X + 2x RTX 5090 | ~$12,000+ |
| Storage | TerraMaster F8 SSD Plus + 8x 2TB NVMe (TLC, high endurance) | ~$2,500 |
| Network | UniFi Enterprise XG 24 | ~$1,370 |

Dual RTX 5090s provide 64GB VRAM for 70B models with headroom. Use TLC (triple-level cell) NVMe drives with high endurance ratings—cheap QLC drives will burn out under vector database write intensity. The Enterprise XG 24 offers 24 ports of 10GbE—every connection runs at full speed.
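Endurance is worth checking before buying: SSDs are rated in TBW (total terabytes written over the warranty period). A quick comparison of a typical high-endurance TLC drive against a budget QLC one under a write-heavy vector workload (the TBW ratings and daily write volume are illustrative assumptions, not specific product specs):

```python
def years_to_wearout(tbw_rating: float, daily_writes_tb: float) -> float:
    """Years until a drive's rated endurance (TBW) is exhausted at a steady write rate."""
    return tbw_rating / (daily_writes_tb * 365)

daily_tb = 2.0  # assumed daily ingest + re-indexing writes per drive

print(f"High-endurance TLC (1200 TBW): {years_to_wearout(1200, daily_tb):.1f} years")
print(f"Budget QLC (400 TBW):          {years_to_wearout(400, daily_tb):.1f} years")
```

Under the same load, the QLC drive hits its rated wear roughly three times sooner, which is why the parts list above specifies TLC.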

This tier supports multiple concurrent users and provides a foundation for model fine-tuning on company data.

Noise level: Significant. Install in a server closet or dedicated room.


Conclusion: Owning vs. Renting

Building private AI infrastructure represents a fundamental shift: from renting intelligence to owning it.

The SaaS model—$30/user/month, data flowing through third-party servers—works for many use cases. But for organizations prioritizing:

  • Privacy (keeping sensitive data on-premises)
  • Predictability (capital expense vs. subscription sprawl)
  • Control (no vendor lock-in, no usage limits)

...private AI infrastructure delivers compounding value over time.

Start with whatever tier fits your current needs. The key is choosing expandable platforms—workstations that support more GPUs, NAS systems that add drives, networks that scale to 10GbE everywhere.

And remember: this hardware is complex. If you're in the Miami area and prefer professional design and installation, we can help architect a solution matching your specific requirements.


Frequently Asked Questions

How much VRAM does a 70B model need?

With Q4 quantization, a 70B parameter model requires approximately 35-40GB of VRAM. This typically means dual RTX 5090s (64GB total) or a professional card like the RTX A6000 (48GB). Single consumer GPUs with 24GB are insufficient for 70B models.

Is local AI cheaper than cloud subscriptions?

For teams of 10+ users, local infrastructure often saves money within 18-24 months. Cloud subscriptions like Microsoft Copilot cost $30/user/month ($3,600/year for 10 users). A capable local AI server costs $5,000-$15,000 upfront with minimal ongoing costs.

Do I really need 10GbE networking?

For production AI workloads, yes. Moving a 100GB dataset over 1GbE takes about 15 minutes; over 10GbE it takes about 90 seconds. Vector databases and RAG systems benefit significantly from low-latency, high-bandwidth connections to storage.

What's the difference between "Best Performance" and "Smartest Buy"?

Best Performance (custom builds) offers maximum power and upgradeability at lower per-component cost but requires technical expertise. Smartest Buy (turnkey solutions) provides enterprise support, tested designs, and easier maintenance at a premium price.

Why does local AI need a NAS?

For Retrieval-Augmented Generation (RAG), the AI searches your company documents to answer questions. This requires fast storage for vector databases. A NAS provides centralized, high-speed storage that the AI workstation can access over 10GbE.

Can I start small and expand later?

Yes. Start with the Starter tier (~$1,700) for testing and expand to Pro ($6,500) or Enterprise ($15K+) as needs grow. Choose expandable platforms like Dell Precision workstations or modular NAS systems that support future upgrades.

Why do PCIe lanes matter?

PCIe lanes connect GPUs, NVMe storage, and network cards to the CPU. AMD Threadripper 9000 offers 80-128 PCIe 5.0 lanes, allowing multiple high-end GPUs without bandwidth bottlenecks—critical for multi-GPU inference setups.

Should I run fiber or copper for 10GbE?

Use SFP+ fiber for the server-to-switch backbone (lower latency, longer runs) and Cat6A copper for client connections. This hybrid approach balances performance and cost. See our Cat6 vs Fiber guide for detailed comparisons.

Topics

Local AI · Private Cloud · AI Hardware · Threadripper 9000 · NAS Storage · 10GbE Networking · UniFi · Small Business


Nandor Katai

Founder & IT Consultant | iFeeltech · 20+ years in IT and cybersecurity


Nandor founded iFeeltech in 2003 and has spent over two decades implementing network infrastructure, cybersecurity, and managed IT solutions for Miami businesses. He writes from direct field experience — every recommendation on this site reflects configurations and tools he has tested in real client environments. He is also the creator of Valydex, a free NIST CSF 2.0 cybersecurity assessment platform.