The transformer uses an encoder-decoder architecture. The original transformer, as published in the paper, is a neural machine translation model; for example, we can train it to translate English sentences into French. So, this article starts with a bird's-eye view of the architecture, introducing the essential components and giving an overview of the entire model. At a high level, the transformer is an encoder-decoder network, which makes it easy to understand. In other words, it builds on familiar concepts like the encoder-decoder architecture, word embeddings, attention mechanisms, and softmax, without the complications introduced by recurrent neural networks or convolutional neural networks. The paper's authors describe the architecture as simple precisely because it has no recurrence and no convolutions.
The encoder extracts features from an input sentence, and the decoder uses those features to produce an output sentence (the translation). The encoder in the transformer consists of multiple encoder blocks. An input sentence passes through the encoder blocks, and the output of the last encoder block becomes the input features for the decoder.
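As a minimal sketch of this data flow, the snippet below uses PyTorch's built-in `nn.Transformer` module; the hyperparameters and random tensors here are illustrative assumptions, not the paper's original implementation. The encoder turns the source sentence into features (often called the "memory"), and the decoder attends to that memory:

```python
import torch
import torch.nn as nn

# A minimal sketch of the encoder-decoder data flow.
# Hyperparameters follow the common defaults and are illustrative only.
model = nn.Transformer(
    d_model=512,           # feature (embedding) size
    nhead=8,               # number of attention heads
    num_encoder_layers=6,  # stack of encoder blocks
    num_decoder_layers=6,  # stack of decoder blocks
)

# Toy inputs already embedded to d_model, shaped (seq_len, batch, d_model).
src = torch.rand(10, 1, 512)  # embedded source (e.g., English) sentence
tgt = torch.rand(12, 1, 512)  # embedded target (e.g., French) sentence

# The encoder extracts features from the source sentence...
memory = model.encoder(src)

# ...and the decoder uses those features to produce the output sequence.
out = model.decoder(tgt, memory)
print(out.shape)  # torch.Size([12, 1, 512])
```

In a real translation model, `src` and `tgt` would come from token embeddings plus positional encodings, and the decoder would use a causal mask; those details are omitted here for brevity.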