By doing the planning / thinking step in abstract latent space and only autoregressively decoding the results of those thought-processes, we can optimize such a planning by optimizing the representation of the answer via gradient descent, following a global energy function, performing a search across the space of answers.
The key is to do the optimization iteratively in representation space … not just producing tons of outputs and selecting the best ones.

world model