Module 1: Basics of Accelerated Computing
1.1 Overview of Computing Architectures
• CPU vs. GPU vs. FPGA vs. TPU
• Parallel Computing vs. Serial Computing
• Vectorized Computing and SIMD Instruction Sets
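The serial-vs-parallel distinction above can be sketched in plain Python: the same workload is computed once by a single worker and once by splitting the range into independent chunks. All names here (partial_sum, etc.) are illustrative, not from the course; threads show the decomposition, though CPython's GIL means real numeric speedups come from processes or accelerators.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    """Sum of squares over [lo, hi): a stand-in for a compute-heavy kernel."""
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def serial_sum(n):
    # Serial: one worker walks the whole range.
    return partial_sum((0, n))

def parallel_sum(n, workers=4):
    # Parallel: split the range into independent chunks, combine the results.
    step = n // workers
    chunks = [(k * step, n if k == workers - 1 else (k + 1) * step)
              for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))

print(serial_sum(10_000) == parallel_sum(10_000))  # True
```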
1.2 Applications of Accelerated Computing
• Deep Learning Training and Inference Optimization
• Scientific Computing (High Energy Physics, Genome Sequencing)
• Financial Modeling (High-Frequency Trading, Quantitative Analysis)
• Industrial Simulation (Computational Fluid Dynamics, Finite Element Analysis)
Module 2: GPU Accelerated Computing
2.1 Basics of GPU Computing
• GPU Computing Architecture (CUDA / ROCm / OpenCL)
• CUDA Cores and CUDA C/C++ Programming
• Tensor Cores and RT Cores for Accelerating AI Computation
2.2 Deep Learning GPU Acceleration
• PyTorch / TensorFlow GPU Computing Optimization
• cuDNN, cuBLAS, TensorRT Introduction
• Multi-GPU Training (Data Parallel / Model Parallel)
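Data-parallel training, listed above, replicates the model on each GPU, feeds each replica a different slice of the batch, and averages the per-replica gradients. A framework-free sketch of that averaging step (all names illustrative; real frameworks do this with an all-reduce collective):

```python
def average_gradients(per_device_grads):
    """Average per-parameter gradients across device replicas --
    the combining step of data-parallel training."""
    n_devices = len(per_device_grads)
    n_params = len(per_device_grads[0])
    return [sum(dev[p] for dev in per_device_grads) / n_devices
            for p in range(n_params)]

# Two "devices", each holding gradients for the same two parameters:
grads = [[1.0, 2.0], [3.0, 4.0]]
print(average_gradients(grads))  # [2.0, 3.0]
```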
2.3 GPU Programming and Optimization
• CUDA Thread Model (Blocks, Threads, Warps)
• Memory Optimization (Global Memory vs. Shared Memory)
• GPU Profilers for Locating Bottlenecks (Nsight, nvprof)
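In the CUDA thread model above, each thread computes a global index from its block and thread coordinates (blockIdx.x * blockDim.x + threadIdx.x) and guards against running past the array. A pure-Python emulation of a 1-D launch makes the indexing concrete (launch_kernel and vector_add are illustrative names, not CUDA API):

```python
def launch_kernel(kernel, grid_dim, block_dim, *args):
    """Emulate a 1-D CUDA launch: run `kernel` once per (block, thread)."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)

def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # bounds check, as in real kernels
        out[i] = a[i] + b[i]

a, b = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
out = [0] * 5
# 2 blocks of 3 threads = 6 threads covering 5 elements
launch_kernel(vector_add, 2, 3, a, b, out)
print(out)  # [11, 22, 33, 44, 55]
```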
Module 3: FPGA Accelerated Computing
3.1 Principles of FPGA Computing
• FPGA vs. CPU vs. GPU
• Reconfigurable Computing and Hardware Acceleration
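Reconfigurable computing rests on the fact that FPGA fabric is built from lookup tables (LUTs): "reprogramming" the chip means loading new truth tables. A minimal sketch of a 2-input LUT in Python (purely conceptual; real LUTs are 4- to 6-input hardware primitives):

```python
def make_lut(truth_table):
    """A 2-input LUT: 'reconfiguring' the device means swapping this table."""
    def lut(a, b):
        return truth_table[(a, b)]
    return lut

# The same fabric cell becomes AND or XOR depending on its configuration:
AND = make_lut({(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1})
XOR = make_lut({(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0})

print(AND(1, 1), XOR(1, 1))  # 1 0
```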
3.2 FPGA Programming Basics
• Introduction to Verilog / VHDL
• Accelerator Development with HLS (High-Level Synthesis)
• FPGA Applications in AI and High-Performance Computing
3.3 FPGA Optimization in Specific Industries
• High-Speed Signal Processing (5G, Millimeter-Wave Radar)
• Low-Latency Financial Trading (High-Frequency Trading, HFT)
• Edge Computing in IoT (Smart Cameras, Autonomous Driving)
Module 4: TPU / AI-Specific Accelerators
4.1 TPU Computing Architecture
• TPU vs. GPU for Deep Learning Acceleration
• Google TPU Development Framework (JAX, TensorFlow)
• TPU Training vs. TPU Inference
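The TPU's core is a systolic array of multiply-accumulate (MAC) units specialized for matrix multiplication. A naive Python matmul makes the MAC chains visible; this is a conceptual sketch of the arithmetic, not of the systolic dataflow itself:

```python
def matmul(A, B):
    """Each output cell is a chain of multiply-accumulates --
    the operation a TPU systolic array performs in hardware."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):
                acc += A[i][t] * B[t][j]  # one MAC step
            C[i][j] = acc
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```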
4.2 AI-Specific Accelerators
• Huawei Ascend
• Cerebras Wafer-Scale Engine
• Graphcore IPU
Module 5: Software Layer Optimization and Acceleration Libraries
5.1 Deep Learning Acceleration Libraries
• cuDNN / ROCm MIOpen: Deep Learning Acceleration
• TensorRT: AI Inference Optimization
• DeepSpeed / Horovod: Distributed AI Training
5.2 Parallel Computing Optimization
• OpenMP / OpenACC Parallel Optimization
• MPI (Message Passing Interface) for Accelerating Large-Scale Computing
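MPI scales large computations with collectives such as MPI_Reduce, which combine values from all ranks in logarithmically many pairwise steps rather than a serial chain. A framework-free sketch of that reduction tree (tree_reduce is an illustrative name, not an MPI call):

```python
def tree_reduce(values, op):
    """Pairwise (tree) reduction, as MPI collectives do across ranks:
    about log2(n) combining rounds instead of n-1 serial steps."""
    while len(values) > 1:
        paired = []
        for i in range(0, len(values) - 1, 2):
            paired.append(op(values[i], values[i + 1]))
        if len(values) % 2:            # odd element carries to the next round
            paired.append(values[-1])
        values = paired
    return values[0]

# Eight "ranks", each contributing a partial result:
print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8], lambda x, y: x + y))  # 36
```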
5.3 Numerical Computing Acceleration
• NumPy / SciPy Optimization (Intel MKL, OpenBLAS)
• JIT Compilation (Numba, TensorFlow XLA)
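Numba's JIT compilation works by decorating a numeric Python function so it is compiled to machine code on first call. A minimal sketch with a graceful fallback when Numba is not installed (the fallback is my addition for portability; the decorator usage follows Numba's documented pattern):

```python
try:
    from numba import njit   # JIT-compile to native code when available
except ImportError:
    def njit(func):          # fallback: run as plain Python
        return func

@njit
def sum_of_squares(n):
    # A tight numeric loop -- the kind of code JIT compilation speeds up most.
    acc = 0.0
    for i in range(n):
        acc += i * i
    return acc

print(sum_of_squares(10))  # 285.0
```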
Module 6: High-Performance Computing (HPC) Cluster Optimization
6.1 HPC Hardware Architecture
• InfiniBand High-Speed Interconnect
• RDMA (Remote Direct Memory Access) and NVLink
• High-Speed Storage (NVMe, Lustre File System)
6.2 Large-Scale Computational Task Management
• SLURM / Kubernetes Task Scheduling
• Cloud GPU Computing (AWS, Azure, GCP)
• Spot Instance Cost Optimization
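On a SLURM-managed cluster, GPU jobs are submitted as batch scripts. A minimal example requesting one GPU; the partition name and the train.py script are illustrative placeholders and vary per cluster:

```shell
#!/bin/bash
#SBATCH --job-name=train-gpu      # job name shown in the queue
#SBATCH --partition=gpu           # partition name is cluster-specific
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00           # wall-clock limit

srun python train.py              # train.py is a placeholder script
```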