PACI removes the bubbles that cripple asynchronous pipeline parallelism and reaches target accuracy up to 1.69× faster than the fastest synchronous flush baseline. The paper demonstrates this gain on GPT‑2 Medium pre‑training while preserving the same peak memory usage.
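To put the headline number in wall-clock terms, here is a minimal sketch that converts a speedup factor into the fraction of training time saved. The helper name `time_saved_fraction` and the 100-hour baseline are illustrative assumptions, not figures from the paper; only the 1.69× factor comes from the text above.

```python
def time_saved_fraction(speedup: float) -> float:
    """Fraction of baseline wall-clock time saved at a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


# 1.69x is the upper-bound figure quoted above.
# The baseline duration is a hypothetical value chosen for illustration.
baseline_hours = 100.0
saved = time_saved_fraction(1.69)

print(f"time saved: {saved:.1%}")                          # time saved: 40.8%
print(f"hours needed: {baseline_hours * (1 - saved):.1f}")  # hours needed: 59.2
```

In other words, a 1.69× speedup means a run that would take 100 hours on the baseline would finish in roughly 59 hours, assuming the best-case figure holds for that workload.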
Source: [Dev.to](https://dev.to/olaughter/local-gradient-accumulation-speeds-training-17x-2mdk)