Module 1: Intelligent Computing Center Architecture Basics
1.1 Overview of Intelligent Computing Centers
• What Is an Intelligent Computing Center?
• Computing Architectures (CPU, GPU, FPGA, TPU, ASIC)
• Computing Network Architecture (InfiniBand vs. Ethernet, RDMA)
• Storage Architecture (HPC File Systems, NVMe, Distributed Storage)
1.2 Computing Resources and Computing Power Pooling
• Physical Computing Resources (GPU Servers, FPGA Acceleration Cards)
• Virtualization and Containerization (Docker, Kubernetes, Singularity)
• GPU Resource Pooling (MIG, Multi-Tenant GPU Resource Management)
• Elastic Computing (Auto Scaling, Serverless Computing)
Module 2: Computing Power Platform Development and Management
2.1 Computing Power Scheduling System
• Task Scheduling Principles (FIFO, Fair Sharing, Gang Scheduling)
• HPC/AI Task Management Tools (SLURM, Kubernetes, Ray)
• GPU/TPU Scheduling Optimization (NVIDIA DCGM, K8s GPU Operator)
2.2 Computing Power Virtualization and Resource Isolation
• GPU Sharing Technologies (vGPU, MIG, Passthrough)
• Automatic Resource Allocation (Helm, Kubernetes Scheduler)
• QoS (Quality of Service) and SLA Management
2.3 Multi-Cluster Computing Power Scheduling
• Cross-Data-Center Resource Integration (Federated Learning, Distributed Computing)
• Cloud + On-Premise Hybrid Scheduling (Hybrid Cloud Computing)
• Comparison of Major Scheduling Systems (Kubernetes vs. SLURM vs. Mesos)
Module 3: AI Computing Optimization
3.1 AI Training Task Scheduling Optimization
• PyTorch/TensorFlow Distributed Training Optimization
• Data Preprocessing Pipeline Optimization (tf.data, NVIDIA DALI, PyTorch DataLoader)
• Mixed Precision Training (FP16, BF16, INT8 Computing)
3.2 High-Performance Computing (HPC) Optimization
• Computational Task Optimization (MPI Parallel Computing, OpenMP)
• Parallel Storage Optimization (Lustre, CephFS, BeeGFS)
• High-Speed Interconnect Optimization (InfiniBand, RDMA, NVLink)
Module 4: Computing Power Platform Operations and Monitoring
4.1 Computing Resource Monitoring
• GPU/CPU Monitoring (nvidia-smi, Prometheus + Grafana)
• Node Health Check (DCGM, K8s Node Problem Detector)
• Task Performance Analysis (NVIDIA Nsight, TensorBoard Profiler)
4.2 Power Efficiency Management
• GPU/CPU Load Balancing
• Low Power Computing Mode (NVIDIA PowerMizer, Dynamic Clocking)
• Green Computing (Energy-Efficient Scheduling, Carbon Emission Optimization)
Module 5: Computing Power Platform Security and Permissions Management
5.1 Data and Computing Security
• AI Task Isolation (Kubernetes RBAC, Multi-Tenancy)
• Data Encryption (TPM, Confidential Computing)
• GPU Access Control (GPU Sandbox, vGPU RBAC)
5.2 User Permissions and Billing Management
• Computing Resource Quotas (Quota & Limit Management)
• Task Priority Management (Preemption & Fair Scheduling)
• Computing Billing Statistics (GPU Usage Billing, Cost Optimization)
Module 6: Cloud Intelligent Computing Platform Architecture and Practice
6.1 Public Cloud Computing Power Platform Setup
• AWS SageMaker, GCP AI Platform, Azure ML
• Kubernetes + Kubeflow for AI Task Management
6.2 Self-Built Intelligent Computing Center Case Studies
• Enterprise AI Computing Power Center Construction (Nova Tech, Alibaba Cloud PAIS, Baidu AI Cloud)
• University AI Computing Platforms (Tsinghua Intelligent Computing Center, Berkeley HPC Center)
• Government and Research Intelligent Computing Platforms (China National Supercomputing Center, NASA Ames HPC)