Posts
Sharding Large Models with Tensor Parallelism
State-of-the-art language models are too large to fit on a single GPU, even if you use data parallelism. This post explains tensor parallelism, a technique that splits large models across multiple GPUs.
Training Deep Networks with Data Parallelism in JAX
Train deep networks efficiently by parallelizing batches of data in JAX.