Conclusions

Learned lessons
- CUDA Out of Memory (OOM) errors. These errors occur when the model's memory requirements exceed the GPU's VRAM. For this project, an NVIDIA RTX 3090 with 24 GB of VRAM was used, providing generous headroom for both static components (model weights and optimizer states) and dynamic components (activations). Memory consumption scales linearly with batch size and quadratically with sequence length due to the self-attention mechanism, which can still lead to OOM scenarios. A standard mitigation strategy is gradient accumulation, which simulates a larger effective batch size by splitting it into smaller micro-batches; it was not needed in this implementation, since reducing the batch size to 64 and using the small variant of the model proved sufficient. Additionally, lower-precision formats like BF16 or FP16 roughly halve the memory required for activations and weights.
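Although gradient accumulation was not needed here, the technique is simple enough to sketch. The snippet below is a minimal illustration, not the project's training loop: the linear model, batch shapes, and hyperparameters are placeholders.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                    # stand-in for the Transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

accum_steps = 4                             # micro-batches per optimizer step
micro_batches = [(torch.randn(8, 16), torch.randint(0, 4, (8,)))
                 for _ in range(8)]         # effective batch size: 8 * 4 = 32

optimizer_steps = 0
optimizer.zero_grad()
for i, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / accum_steps   # average across micro-batches
    loss.backward()                             # gradients accumulate in .grad
    if (i + 1) % accum_steps == 0:              # step only every accum_steps
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```

Dividing the loss by `accum_steps` keeps the accumulated gradient equal in magnitude to one computed on the full batch.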
- Maximizing CUDA utilization. To ensure the GPU reaches its full computational capacity, it is essential to maximize the throughput of the data pipeline. Leveraging customized kernels and `torch.compile` can significantly speed up execution by fusing operations and reducing overhead. Data loading bottlenecks are mitigated by setting the dataloader to use 4 worker processes with a prefetch factor of 2, ensuring that batches are ready for the GPU as soon as it completes a computation. Additionally, tokenized sequences (and their masks) are pre-computed and cached during dataset initialization. While this increases system RAM consumption, it removes tokenization overhead from the data retrieval process, further accelerating batch preparation. This caching approach is feasible here due to the relatively small dataset size, but it would be unfeasible for significantly larger datasets that exceed system RAM. Finally, padding all sequences in the dataset to a uniform length helps achieve higher and more consistent GPU usage. To fully benefit from this when using `torch.compile`, CUDA graph capture for dynamically shaped graphs should be skipped (e.g., via `torch._inductor.config.triton.cudagraph_skip_dynamic_graphs = True`) to prevent unnecessary recompilations for varying input shapes, thereby stabilizing performance.
- Limited data and budget. Training a transformer model from scratch traditionally requires massive datasets and extreme computing power. This project was developed on a workstation featuring a single NVIDIA RTX 3090 (24 GB VRAM), 128 GB of system RAM, and an AMD EPYC 7642 (48 cores, 96 threads). While this setup is remarkably powerful for single-GPU development, it remains modest compared to industrial-scale clusters. Training on datasets like Europarl (~80M tokens) provides a practical middle ground: large enough to demonstrate the model's learning capacity, yet small enough to reach convergence within 2 hours on this hardware. This allowed for extensive experimentation and rapid architectural iteration without the prohibitive costs or time requirements of scaling to billions of tokens.
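The data-pipeline settings described in the CUDA utilization lesson above (pre-tokenized cache, 4 workers, prefetch factor of 2, uniform padding) can be sketched as follows. The dataset class and its placeholder "tokenization" are illustrative assumptions, not the project's actual implementation:

```python
import torch
from torch.utils.data import DataLoader, Dataset

class CachedTranslationDataset(Dataset):
    """Tokenizes (here: fakes tokenizing) every pair once, up front,
    trading system RAM for zero tokenization cost at retrieval time."""
    def __init__(self, pairs, max_len=16):
        self.cache = []
        for src, tgt in pairs:
            # Placeholder tokenization; every sequence is padded to max_len
            # so the GPU always sees uniform shapes.
            toks = src.split()[:max_len]
            ids = torch.zeros(max_len, dtype=torch.long)
            mask = torch.zeros(max_len, dtype=torch.bool)
            ids[:len(toks)] = torch.arange(1, len(toks) + 1)
            mask[:len(toks)] = True
            self.cache.append((ids, mask))

    def __len__(self):
        return len(self.cache)

    def __getitem__(self, i):
        return self.cache[i]            # no tokenization work at retrieval

pairs = [("a b c", "x y z")] * 32
loader = DataLoader(CachedTranslationDataset(pairs), batch_size=8,
                    num_workers=4,     # background workers prepare batches
                    prefetch_factor=2) # batches queued ahead per worker
```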
- Mixed precision training. Using lower-precision formats like BFloat16 (BF16) or Float16 (FP16) halves the memory required for activations during backpropagation. While BF16 maintains training stability naturally thanks to its wide dynamic range, FP16 has a narrower range and requires gradient scaling: the loss is scaled before backpropagation so that gradients with small magnitudes do not underflow to zero. Before the weights are updated by the optimizer, these gradients are explicitly unscaled. Unscaling is crucial because it ensures the gradient norm is computed at the correct magnitude, which prevents excessively aggressive gradient clipping and preserves training stability. Finally, to maintain numerical precision during the updates, critical variables like the master weights and optimizer states are kept in higher precision.
Modern training and inference advances

The training and inference procedures described in this work reflect the foundational methods introduced alongside the original Transformer architecture. The field has evolved significantly since then, and several paradigms have become standard in the development of modern large language models. This section briefly outlines the most relevant advances:
- Supervised finetuning. Pre-training on large corpora provides a model with broad linguistic knowledge, but the resulting behavior is not always aligned with specific downstream tasks or user expectations, especially when the model is used as a language model. Supervised finetuning (SFT) addresses this by continuing training on a curated dataset of high-quality input–output pairs, such as instruction–response examples. This step narrows the model's behavior towards the desired task format and improves the quality of generated text for practical applications. SFT has become a standard intermediate step between pre-training and deployment in modern language model pipelines1.
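A common SFT detail is that the loss is computed only on the response tokens, with prompt positions masked via the `ignore_index` of cross-entropy. The sketch below uses made-up token IDs and random logits purely to show the masking mechanics (a real pipeline would also shift labels by one for next-token prediction):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
IGNORE = -100                                   # cross_entropy's default ignore_index
prompt_ids = torch.tensor([5, 9, 2, 7])         # hypothetical instruction tokens
response_ids = torch.tensor([3, 8, 1])          # hypothetical response tokens

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = IGNORE              # no loss on the prompt positions

vocab_size = 16
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for model outputs
loss = F.cross_entropy(logits, labels, ignore_index=IGNORE)
```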
- Reinforcement learning. While supervised finetuning aligns models with specific formats, Reinforcement Learning (RL) enables models to optimize for complex, non-differentiable objectives and long-term goals. Reinforcement Learning from Human Feedback (RLHF)2 incorporates human preferences to improve alignment, safety, and helpfulness by training a reward model on human judgments. More recently, Reinforcement Learning from Verifiable Rewards (RLVR)3 has emerged as a powerful paradigm for technical domains like mathematics or programming, where model outputs can be checked against ground-truth verifiers (e.g., compilers, unit tests, or mathematical results). This enables models to learn complex reasoning through trial and error, facilitating the emergence of self-correction and chain-of-thought behaviors without requiring dense human-labeled reasoning paths.
- LoRA finetuning. Standard finetuning requires updating all parameters of a model, which is computationally prohibitive for large architectures. Low-Rank Adaptation (LoRA)4 addresses this by freezing the pretrained weights and injecting small, trainable low-rank decomposition matrices into each layer of the Transformer. This significantly reduces the number of trainable parameters and GPU memory requirements during training, while maintaining performance comparable to full finetuning and introducing no additional inference latency.
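The LoRA update can be sketched as a wrapper around a frozen linear layer. This is an illustrative minimal version (the class name, rank, and scaling values are arbitrary choices, not from any library):

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A (r x in) and B (out x r)."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        # B is zero-initialized, so the adapted layer starts out identical
        # to the pretrained one.
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64), r=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

For this 64x64 layer the trainable count drops from 4160 to 512, and after training the update `B A` can be folded back into `W`, which is why LoRA adds no inference latency.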
- Quantization aware training. To reduce memory footprint and accelerate inference, models are often quantized to lower precision (e.g., 4-bit or 8-bit integers). Quantization Aware Training (QAT)5 simulates the effects of low-precision arithmetic during the training process. By accounting for rounding errors and quantization noise during backpropagation, QAT allows the model to adjust its parameters to preserve accuracy even when deployed in heavily compressed formats, typically outperforming simple post-training quantization.
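The core QAT building block is a "fake quantization" op: the forward pass rounds values to a low-precision grid, while the backward pass lets gradients flow through unchanged (the straight-through estimator). A minimal symmetric-quantization sketch, with an assumed per-tensor scale:

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Forward: round x to a signed num_bits integer grid and dequantize.
    Backward: identity (straight-through estimator)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = (x / scale).round().clamp(-qmax - 1, qmax) * scale
    # x + (q - x).detach() evaluates to q, but its gradient w.r.t. x is 1,
    # so rounding does not block backpropagation.
    return x + (q - x).detach()

x = torch.tensor([0.11, -0.52, 0.97], requires_grad=True)
y = fake_quantize(x, num_bits=4)
y.sum().backward()
```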
- Stochastic sampling methods. The decoding strategies discussed in this work, greedy decoding and beam search, are deterministic and tend to produce repetitive or generic outputs. Modern language models commonly employ stochastic sampling methods that introduce controlled randomness into the generation process. Temperature scaling divides the logits by a temperature parameter \(\tau\) before applying softmax: lower temperatures (\(\tau \approx 0\)) sharpen the distribution, making the model more confident and deterministic, while higher temperatures flatten it, increasing diversity. Top-\(k\) sampling restricts sampling at each step to the \(k\) most probable tokens, preventing the model from selecting tokens from the low-probability tail of the distribution. Top-\(p\) sampling instead defines the candidate set as the smallest set of tokens whose cumulative probability exceeds a threshold \(p\), adaptively adjusting the size of the candidate pool depending on how peaked or flat the distribution is at each step. These methods can be combined and are particularly important for open-ended generation tasks such as dialogue and creative writing, where deterministic decoding produces unnaturally rigid text.
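These three filters compose naturally into a single logits-processing step. The sketch below is one common way to implement them (not tied to any particular library), returning the renormalized distribution from which the next token is sampled:

```python
import torch

def filter_logits(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Temperature scaling, then top-k, then top-p (nucleus) filtering.
    Returns a renormalized probability distribution over the vocabulary."""
    logits = logits / temperature
    if top_k > 0:
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    if top_p < 1.0:
        sorted_probs, idx = torch.sort(probs, descending=True)
        cum = torch.cumsum(sorted_probs, dim=-1)
        # Drop every token whose preceding cumulative mass already
        # exceeds top_p: this keeps the smallest nucleus above the threshold.
        sorted_probs[(cum - sorted_probs) >= top_p] = 0.0
        probs = torch.zeros_like(probs).scatter(0, idx, sorted_probs)
    return probs / probs.sum()

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
dist = filter_logits(logits, temperature=0.7, top_k=3, top_p=0.9)
token = torch.multinomial(dist, 1)   # sample one token id from the nucleus
```

With these example values, top-k removes the last token and top-p then trims the candidate pool to the two most probable ones before renormalization.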
- Speculative decoding. Speculative decoding6 is an inference-time optimization technique that accelerates autoregressive generation without changing the output distribution. The core idea is to use a smaller, faster draft model to generate several candidate tokens ahead, and then verify them in parallel using a larger target model. Since the target model can evaluate multiple tokens simultaneously in a single forward pass, accepted tokens effectively skip the sequential bottleneck of autoregressive decoding. Rejected tokens are discarded and generation resumes from the last accepted position. This technique can yield significant speedups, typically 2 to 3 times in real-world applications, while the distribution of output tokens remains mathematically identical to that of the target model alone.
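The draft-then-verify loop can be illustrated with toy stand-in "models". This sketch deliberately simplifies the algorithm: it uses greedy acceptance instead of the rejection-sampling step that makes the real method distribution-preserving, and the stand-in models and target sentence are invented for the example:

```python
# Toy stand-ins: each "model" maps a token prefix to the next token.
TARGET = "the cat sat on the mat".split()

def target_model(prefix):          # large model: correct by construction
    return TARGET[len(prefix)]

def draft_model(prefix):           # small model: wrong at one position
    return "dog" if len(prefix) == 1 else TARGET[len(prefix)]

def speculative_generate(n_tokens, gamma=3):
    out = []
    while len(out) < n_tokens:
        # 1. Draft up to gamma candidate tokens with the cheap model.
        draft = []
        for _ in range(gamma):
            if len(out) + len(draft) >= n_tokens:
                break
            draft.append(draft_model(out + draft))
        # 2. Verify the candidates with the target model (a single parallel
        #    forward pass in a real implementation); keep the longest
        #    agreeing prefix, then correct the first mismatch and resume.
        for tok in draft:
            if tok == target_model(out):
                out.append(tok)
            else:
                out.append(target_model(out))
                break
    return out

result = speculative_generate(6)
```

Here the draft model's single mistake ("dog") is caught at verification, yet the runs of correct drafts are accepted in bulk, which is where the speedup comes from.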
To wrap up
This documentation presented the complete pipeline for training, evaluating, and deploying a Transformer model for machine translation. The training phase centered on the next-token prediction task using the cross-entropy learning objective under teacher forcing. Key components for stable training included learning rate scheduling strategies, spanning the original inverse square root schedule and the modern Warmup-Stable-Decay approach, alongside regularization techniques such as dropout, label smoothing, and gradient norm clipping.
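For reference, the original inverse square root schedule can be written as a small function. The `d_model` and `warmup_steps` defaults below are the values from the original Transformer paper, not necessarily this project's settings:

```python
def inverse_sqrt_lr(step: int, d_model: int = 512,
                    warmup_steps: int = 4000) -> float:
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
    linear warmup to a peak at step == warmup_steps, then inverse
    square root decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```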
In the inference phase, the model's autoregressive generation process was detailed. To select the best tokens from the output distribution, two decoding strategies were discussed: greedy decoding, which is straightforward but locally optimal, and beam search, which maintains multiple hypotheses to find globally superior sequences while applying length normalization.
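One common form of the length normalization mentioned above is the GNMT-style length penalty; this is an assumption for illustration, as the exact normalization used in the implementation may be a simpler division by sequence length:

```python
def length_normalized_score(log_prob_sum: float, length: int,
                            alpha: float = 0.6) -> float:
    """Beam-search ranking score with the GNMT length penalty:
    score = sum(log P) / lp(length), lp = ((5 + length) / 6) ** alpha.
    alpha = 0 disables normalization; larger alpha favors longer outputs."""
    return log_prob_sum / (((5 + length) / 6) ** alpha)
```

Without this correction, summing log-probabilities systematically penalizes longer hypotheses, since every extra token adds a negative term.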
Finally, the experimental results confirmed the effectiveness of the implementation. Over two epochs of training on the Europarl corpus, the training loss converged appropriately. Evaluation metrics computed on the test set revealed that the model achieves strong in-domain translation performance, as measured by steady improvements in BLEU and ROUGE scores alongside decreasing perplexity. Qualitative assessments further highlighted the model's fluency in institutional domains, while also acknowledging limitations on out-of-domain text containing unseen vocabulary.
1. Wei, Jason, et al. 2022. Finetuned Language Models Are Zero-Shot Learners. https://arxiv.org/abs/2109.01652.
2. Ouyang, Long, et al. 2022. Training language models to follow instructions with human feedback. https://arxiv.org/abs/2203.02155.
3. DeepSeek-AI. 2025. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://arxiv.org/abs/2501.12948.
4. Hu, Edward J., et al. 2022. LoRA: Low-Rank Adaptation of Large Language Models. https://arxiv.org/abs/2106.09685.
5. Jacob, Benoit, et al. 2018. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. https://arxiv.org/abs/1712.05877.
6. Leviathan, Yaniv, Matan Kalman, and Yossi Matias. 2023. Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192.