Jack W. Rae
Anna Potapenko
Siddhant M. Jayakumar
Chloe Hillier
Timothy P. Lillicrap
arXiv:1911.05507v1 [cs.LG] 13 Nov 2019
Abstract
We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.
1 Introduction
Humans have a remarkable ability to remember information over long time horizons. When reading a book, we build up a compressed representation of the past narrative, such as the characters and events that have built up the story so far. We can do this even if they are separated by thousands of words from the current text, or long stretches of time between readings. During daily life, we make use of memories at varying time-scales: from locating the car keys, placed in the morning, to recalling the name of an old friend from decades ago. These feats of memorisation are not achieved by storing every sensory glimpse throughout one’s lifetime, but via lossy compression. We aggressively select, filter, or integrate input stimuli based on factors of surprise, perceived danger, or repetition — amongst other signals
Memory systems in artificial neural networks began with very compact representations of the past. Recurrent neural networks
However since the LSTM, there has been great benefit discovered in not bottlenecking all historical information in the state, but instead in keeping past activations around in an external memory and attending to them. The Transformer
One drawback in storing everything is the computational cost of attending to every time-step and the storage cost of preserving this large memory. Several works have focused on reducing the computational cost of attention with sparse access mechanisms
We propose the Compressive Transformer, a simple extension to the Transformer which maps past hidden activations (memories) to a smaller set of compressed representations (compressed memories). The Compressive Transformer uses the same attention mechanism over its set of memories and compressed memories, learning to query both its short-term granular memory and longer-term coarse memory. We observe this improves the modelling of text, achieving state-of-the-art results in character-based language modelling —
We show the Compressive Transformer works not only for language, but can also model the waveform of high-frequency speech with a trend of lower likelihood than the TransformerXL and Wavenet
Furthermore we present a new book-level language-modelling benchmark PG-19, extracted from texts in Project Gutenberg1, to further promote the direction of long-context sequence modelling. This is over double the size of existing LM benchmarks and contains text with much longer contexts.
2 RELATED WORK
There have been a variety of recent attempts to extend the range of attention, particularly in the Transformer, or to replace the attention operation with something less expensive.
The Sparse Transformer
The use of dynamic attention spans is explored in
3 MODEL
We present the Compressive Transformer, a long-range sequence model which compacts past activations into a compressed memory. The Compressive Transformer is a variant of the Transformer

3.1 DESCRIPTION
We define and
| Algorithm 1 Compressive Transformer |
|---|
| At time zero |
| 1: // Initialize memory to zeros |
| 2: // Initialize compressed memory to zeros |
| At time t |
| 3: // Embed input sequence |
| 4: for layer do |
| 5: // |
| 6: // MHA over both mem types |
| 7: // Regular skip + layernorm |
| 8: // Oldest memories to be forgotten |
| 9: // Compress oldest memories by factor |
| 10: // Update memory |
| 11: // Update compressed memory |
| 12: // Mixing MLP |
Algorithm 2 Attention-Reconstruction Loss
| 1: | |
| 2: | for layer do |
| 3: | // Stop compression grads from passing… |
| 4: | // …into transformer network. |
| 5: | // Re-use attention weight matrices. |
| 6: | // Use content-based attention (no relative). |
| 7: | // Compression network (to be optimized). |
| 8: |
3.2 COMPRESSION FUNCTIONS AND LOSSES
For choices of compression functions
One can train the compression network using gradients from the loss; however for very old memories this requires backpropagating-through-time (BPTT) over long unrolls. As such we also consider some local auxiliary compression losses. We consider an auto-encoding loss where we reconstruct the original memories from the compressed memories
3.3 TEMPORAL RANGE
The TransformerXL with a memory of size
4 PG-19 BENCHMARK
As models begin to incorporate longer-range memories, it is important to train and benchmark them on data containing larger contexts. Natural language in the form of text provides us with a vast repository of data containing long-range dependencies, that is easily accessible. We propose a new language modelling benchmark, PG-19, using text from books extracted from Project Gutenberg2. We select Project Gutenberg books which were published over 100 years old, i.e. before 1919 (hence the name PG-19) to avoid complications with international copyright, and remove short texts. The dataset contains books, or of text — which makes it over double the size of BookCorpus and Billion Word Benchmark.
Table 1: Comparison to existing popular language modelling benchmarks.
| Avg. length (words) | Train Size | Vocab | Type | |
|---|---|---|---|---|
| 1B Word | 27 | 4.15GB | 793K | News (sentences) |
| Penn Treebank | 355 | 5.1MB | 10K | News (articles) |
| WikiText-103 | 3.6K | 515MB | 267K | Wikipedia (articles) |
| PG-19 | 69K | 10.9GB | (open) | Books |
4.1 RELATED DATASETS
The two most benchmarked word-level language modelling datasets either stress the modelling of stand-alone sentences
Books are a natural choice of long-form text, and provide us with stylistically rich and varied natural language. Texts extracted from books have been used for prior NLP benchmarks; such as the Children’s Book Test
CBT and LAMBADA are useful for probing the linguistic intelligence of models, but are not ideal for training long-range language models from scratch as they truncate text extracts to at most a couple of paragraphs, and discard a lot of the books’ text. There has been prior work on training models on book data using BookCorpus directly
The NarrativeQA Book Comprehension Task
4.2 STATISTICS
A brief comparison of PG-19 to other LM datasets can be found in Table 1. We intentionally do not limit the vocabulary by unk-ing rare words, and release the dataset as an open-vocabulary benchmark. To compare models we propose to continue measuring the word-level perplexity. This can still be computed for any chosen character-based, byte-based or subword-based scheme. To do this, one calculates the total cross-entropy loss
Alongside quantitative analyses, we build an LDA topic model
Table 2: PG-19 statistics split by subsets.
| Train | Valid. | Test | |
|---|---|---|---|
| # books | 28,602 | 50 | 100 |
| # words | 1,973,136,207 | 3,007,061 | 6,966,499 |
Table 3: Eval. perplexities on PG-19.
| Valid. | Test | |
|---|---|---|
| 36L TransformerXL | 45.5 | 36.3 |
| 36L Compressive Transf. | 43.4 | 33.6 |
Table 4: State-of-the-art results on Enwik8.
| Model | BPC |
|---|---|
| 7L LSTM | 1.67 |
| LN HyperNetworks | 1.34 |
| LN HM-LSTM | 1.32 |
| ByteNet | 1.31 |
| RHN | 1.27 |
| mLSTM | 1.24 |
| 64L Transf. | 1.06 |
| 24L TXL | 0.99 |
| Sparse Transf. | 0.991 |
| Adaptive Transf. | 0.98 |
| --- | --- |
| 24L TXL (ours) | 0.98 |
| 24L Compressive Transformer | 0.97 |
Table 5: Compression approaches on Enwik8.
| Compression fn | Compression loss | BPC |
|---|---|---|
| Conv | BPTT | 0.996 |
| Max Pooling | N/A | 0.986 |
| Conv | Auto-encoding | 0.984 |
| Mean Pooling | N/A | 0.982 |
| Most-used | N/A | 0.980 |
| Dilated conv | Attention | 0.977 |
| Conv | Attention | 0.973 |
5 EXPERIMENTS
We optimised all models with Adam
5.1 PG-19
We benchmark the Compressive Transformer against the TransformerXL on the newly proposed PG-19 books dataset. Because it is open-vocabulary, we train a subword vocabulary of size 32000 with SubwordTextEncoder from the tfds package in TensorFlow and use the dataset statistics to compute word-level perplexity, as described in Section 4.2. We train a 36 layer Compressive Transformer with a window size of 512, both memory and compressed memory size of 512, and compression rate
5.2 ENWIK8
We compare the TransformerXL and the Compressive Transformer on the standard character-level language modelling benchmark Enwiki8 taken from the Hutter Prize
with compression rate
We compare compression functions and the use of auxiliary losses in Table 5. We sweep over compression rates of 2, 3, and 4 and report results with the best performing value for each row. BPTT signifies that no auxiliary compression loss was used to train the network other than the overall training loss. To feed gradients into the compression function we unrolled the model over double the sequence length and halved the batch size to fit the larger unroll into memory.
5.3 WIKITEXT-103
We train an eighteen-layered Compressive Transformer on the closed-vocabulary word-level language modelling benchmark WikiText-103, which contains articles from Wikipedia. We train the model with a compressed memory size, memory size, and a sequence window size all equal to 512. We trained the model over 64 Tensor Processing Units
It is worth noting that in Table 6 we do not list methods that use additional training data, or that make use of test-time labels to continue training the model on the test set
We break perplexity down by word frequency in Table 7 and see the Compressive Transformer makes only a small modelling improvement for frequent words (2.6% over the TransformerXL baseline) but obtains a much larger improvement of
5.4 COMPRESSIBILITY OF LAYERS
We can use compression to better understand the model’s mode of operation. We inspect how compressible Transformer’s activations are as they progress through higher layers in the network. One may expect representations to become more difficult to compress at higher layers, if more semantic information is represented there. We monitor the compression loss at each layer of our best-performing Compressive Transformer models trained on Enwik8 and WikiText-103 and display these in Supplementary Section A Figure 6. We note that the compression loss is about one order of magnitude higher for word-level language modelling (WikiText-103) over character-level language modelling (Enwik8). Furthermore the first layer of the Transformer is highly compressible. However there is not a clear trend of compression cost increasing with layer depth.
Table 6: Validation and test perplexities on WikiText-103.
| Valid. | Test | |
|---|---|---|
| LSTM | - | 48.7 |
| Temporal CNN | - | 45.2 |
| GCNN-14 | - | 37.2 |
| Quasi-RNN | 32 | 33 |
| RMC | 30.8 | 31.9 |
| LSTM+Hebb. | 29.0 | 29.2 |
| Transformer | - | 18.7 |
| 18L TransformerXL, M=384 | - | 18.3 |
| 18L TransformerXL, M=1024 (ours) | - | 18.1 |
| 18L Compressive Transformer, M=1024 | 16.0 | 17.1 |
Table 7: WikiText-103 test perplexity broken down by word frequency buckets. The most frequent bucket is words which appear in the training set more than times, displayed on the left. For reference, a uniform model would have perplexity
| All | |||||
|---|---|---|---|---|---|
| LSTM* | 12.1 | 219 | 1,197 | 9,725 | 36.4 |
| TransformerXL (ours) | 7.8 | 61.2 | 188 | 1,123 | 18.1 |
| Compressive Transformer | 7.6 | 55.9 | 158 | 937 | 17.1 |
| Relative gain over TXL | 2.6% | 9.5% | 21% | 19.9% | 5.8% |
5.5 ATTENTION
We inspect where the network is attending to on average, to determine whether it is using its compressed memory. We average the attention weight over a sample of sequences from a trained model on Enwik8. We aggregate the attention into eighteen buckets, six for each of the compressed memory, memory, and sequence respectively. We set the size of the sequence, memory and compressed memory all to be 768. We plot this average attention weight per bucket in Figure 2 with a
5.5.1 OPTIMISATION SCHEDULE
We make an observation about an interesting but undesirable meta-learning phenomenon during long-context training. When the learning rate is tuned to be much smaller (or set to zero) during training, performance degrades drastically both for the TransformerXL and the Compressive Transformer. This is displayed in Figure 3.
Usually we consider distributional shift from the training data to the test data, but we can also observe a shift in the model when transferring from a training to evaluation mode (even when the model is evaluated on the training data). In this case, this is due to the online updating of parameters whilst processing long contiguous articles. We would like the model to generalise well to scenarios where it is not continuously optimised. Updating the parameters only at article boundaries (and then resetting the state) could be one solution for long-range memory models, but this would slow down learning significantly.
Instead, we propose reducing the frequency of optimisation updates during training. We find this allows for the best of both worlds — fast initial learning with frequent updates, and better generalisation near the end of training with less frequent updates (e.g. every 4 steps). Reducing the optimisation frequency increases the effective batch size, which has also been shown to be prefer


able to learning rate decay in image modelling
5.6 SPEECH
We train the Compressive Transformer on the waveform of speech to assess its performance on different modalities. Speech is interesting because it is sampled at an incredibly high frequency, but we know it contains a lot of information on the level of phonemes and entire phrases.
To encourage long-term reasoning, we refrain from conditioning the model on speaker identity or text features, but focus on unconditional speech modelling. We train the model on 24.6 hours of 24kHz North American speech data. We chunk the sequences into windows of size 3840, roughly 80ms of audio, and compare a 20-layer Compressive Transformer to a 20-layer TransformerXL and a 30-layer WaveNet model
WaveNet processes an entire chunk in parallel, however the TransformerXL and Compressive Transformer are trained with a window size of 768 and a total memory size of 1,568 (for the Compressive Transformer we use 768 memory + 768 compressed). We thus unroll the model over the sequence. Despite this sequential unroll, the attention-based models train at only half the speed of WaveNet. We see the test-set negative-log-likelihood in Figure 4, and observe that a Compressive Transformer with a compression rate of 4 is able to outperform the TransformerXL and maintain a slim advantage over WaveNet. However we only trained models for at most one week (with 32GPUs) and it would be advantageous to continue training until full convergence — before definitive conclusions are made.
5.7 REINFORCEMENT LEARNING
Compression is a good fit for video input sequences because subsequent frames have high mutual information. Here we do not test out the Compressive Transformer on video, but progress straight to a reinforcement learning agent task that receives a video stream of visual observations — but must ultimately learn to use its memory to reason over a policy.


We test the Compressive Transformer as a drop-in replacement to an LSTM in the IMPALA setup
We fix both the memory and compressed memory sizes to 64. In Figure 5, we present results for a range of compression rates, averaged over 3 seeds. We see that the best performing agents endowed with the Compressive Transformer are able to solve the task to human-level. We note that the model with compression rate 1 is unable to learn the task to the same proficiency. The speed of learning and stability seem to increase proportionally with higher rates of compression (up to a limit) – i.e. the effective memory window of the agent – and we find compression rate 4 to once again be the best performing. We see this as a promising sign that the architecture is able to efficiently learn, and suitably use, compressed representations of its visual input and hope to test this more widely in future work.
6 CONCLUSION
In this paper we explore the notion of compression as a means of extending the temporal receptive field of Transformer-based sequence models. We see a benefit to this approach in the domain of text, with the Compressive Transformer outperforming existing architectures at long-range language modelling. To continue innovation in this area, we also propose a new book-level LM benchmark, PG-19. This may be used to compare long-range language models, or to pre-train on other long-range reasoning language tasks, such as NarrativeQA
We see the idea of compressive memories is applicable not only to the modality of text, but also audio, in the form of modelling the waveform of speech, and vision, within a reinforcement-learning agent trained on a maze-like memory task. In both cases, we compare to very strong baselines (Wavenet
The main limitation of this work is additional complexity, if the task one wishes to solve does not contain long-range reasoning then the Compressive Transformer is unlikely to provide additional benefit. However as a means of scaling memory and attention, we do think compression is a simpler approach to dynamic or sparse attention — which often requires custom kernels to make efficient. One can build effective compression modules from simple neural network components, such as convolutions. The compression components are immediately efficient to run on GPUs and TPUs.
Memory systems for neural networks began as compressed state representations within RNNs. The recent wave of progress using attention-based models with deep and granular memories shows us
that it is beneficial to refrain from immediately compressing the past. However we hypothesise that more powerful models will contain a mixture of granular recent memories and coarser compressed memories. Future directions could include the investigation of adaptive compression rates by layer, the use of long-range shallow memory layers together with deep short-range memory, and even the use of RNNs as compressors. Compressive memories should not be forgotten about just yet.
Acknowledgements
Author Contributions
Funding
Competing Interests
References
A. Baevski and M. Auli. Adaptive input representations for neural language modeling. arXiv, 2019.
D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv, 2014.
S. Bai, J. Z. Kolter, and V. Koltun. Convolutional sequence modeling revisited, 2018a. URL OpenReview.
S. Bai, J. Z. Kolter, and V. Koltun. Trellis networks for sequence modeling. arXiv, 2018b.
C. Beattie, J. Z. Leibo, D. Teplyashin, T. Ward, M. Wainwright, H. Küttler, A. Lefrancq, S. Green, V. Valdés, A. Sadik, J. Schrittwieser, K. Anderson, S. York, M. Cant, A. Cain, A. Bolton, S. Gaffney, H. King, D. Hassabis, S. Legg, and S. Petersen. Deepmind lab. CoRR, abs/1612.03801, 2016. URL arXiv.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, Mar. 2003. ISSN 1532-4435.
J. Bradbury, S. Merity, C. Xiong, and R. Socher. Quasi-recurrent neural networks. arXiv, 2016.
C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv, 2013.
R. Child, S. Gray, A. Radford, and I. Sutskever. Generating long sequences with sparse transformers. arXiv, 2019.
J. Chung, S. Ahn, and Y. Bengio. Hierarchical multiscale recurrent neural networks. arXiv, 2016.
Z. Dai, Z. Yang, Y. Yang, W. W. Cohen, J. Carbonell, Q. V. Le, and R. Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv, 2019.
Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier. Language modeling with gated convolutional networks. arXiv, 2016.
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018.
L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1406–1415, 2018.
E. Grave, A. Joulin, and N. Usunier. Improving neural language models with a continuous cache. arXiv, 2016.
A. Graves. Generating sequences with recurrent neural networks. arXiv, 2013.
A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. arXiv, 2014.
A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwińska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471, 2016.
D. Ha, A. Dai, and Q. V. Le. Hypernetworks. arXiv, 2016.
F. Hill, A. Bordes, S. Chopra, and J. Weston. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv, 2015.
A. Holtzman, J. Buys, M. Forbes, and Y. Choi. The curious case of neural text degeneration. arXiv preprint arXiv:1904.09751, 2019.
M. Hutter. The human knowledge compression contest. URL http://prize.hutter1.net, 6, 2012.
N. Kalchbrenner, L. Espeholt, K. Simonyan, A. v. d. Oord, A. Graves, and K. Kavukcuoglu. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099, 2016.
D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018.
B. Krause, L. Lu, I. Murray, and S. Renals. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959, 2016.
B. Krause, E. Kahembwe, I. Murray, and S. Renals. Dynamic evaluation of transformer language models. CoRR, abs/1904.08378, 2019. URL http://arxiv.org/abs/1904.08378.
G. Lample, A. Sablayrolles, M. Ranzato, L. Denoyer, and H. Jégou. Large memory layers with product keys. arXiv preprint arXiv:1907.05242, 2019.
S. Merity, C. Xiong, J. Bradbury, and R. Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
A. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. Driessche, E. Lockhart, L. Cobo, F. Stimberg, et al. Parallel wavenet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pages 3915–3923, 2018.
A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
D. Paperno, G. Kruszewski, A. Lazaridou, Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, R. Fernández, K. Erk, et al. The lambada dataset: Word prediction requiring a broad discourse context. Association for Computational Linguistics, 2016.
J. Rae, J. J. Hunt, I. Danihelka, T. Harley, A. W. Senior, G. Wayne, A. Graves, and T. Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pages 3621–3629, 2016.
J. W. Rae, C. Dyer, P. Dayan, and T. P. Lillicrap. Fast parametric learning with activation memorization. arXiv preprint arXiv:1803.10049, 2018.
B. A. Richards and P. W. Frankland. The persistence and transience of memory. Neuron, 94(6): 1071–1084, 2017.
D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
A. Santoro, R. Faulkner, D. Raposo, J. Rae, M. Chrzanowski, T. Weber, D. Wierstra, O. Vinyals, R. Pascanu, and T. Lillicrap. Relational recurrent neural networks. In Advances in Neural Information Processing Systems, pages 7299–7310, 2018.
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019.
S. Smith, P. jan Kindermans, C. Ying, and Q. V. Le. Don’t decay the learning rate, increase the batch size. 2018. URL https://openreview.net/pdf?id=B1Yy1BxCZ.
S. Sukhbaatar, E. Grave, P. Bojanowski, and A. Joulin. Adaptive attention span in transformers. arXiv, 2019.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
F. Wu, A. Fan, A. Baevski, Y. N. Dauphin, and M. Auli. Pay less attention with lightweight and dynamic convolutions. arXiv, 2019.
Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, and Q. V. Le. Xlnet: Generalized autoregressive pretraining for language understanding. arXiv, 2019.
L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8739–8748, 2018.
Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pages 19–27, 2015.
J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 4189–4198. JMLR. org, 2017.
SUPPLEMENTARY MATERIALS
A COMPRESSION ACROSS LAYERS
We inspect the compression loss broken down by the layer index, to investigate whether there is a trend in network depth with how compressible the representations are. The compression loss here refers to the attention-reconstruction attention loss. We plot this for a 24 layer trained model on Enwik8, and an 18 layer model trained on WikiText-103. The compression loss for character-based language modelling is about one order of magnitude lower than that of word-level language modelling. The first layer’s representations are highly compressible, however from then on there is no fixed trend. Some non-contiguous layers have a very similar compression loss (e.g. 4 & 6, 5 & 7) which suggests information is being routed from these layer pairs via the skip connection.

B COMPARISON OF COMPRESSED MEMORY SIZES
We compare the best test perplexity obtained for the Compressive Transformer trained on WikiText-103 and Enwik8 across a range of compressed memory sizes. For both models, the best model used a 1D convolution compression network with a compression rate of 3. The Enwik8 model was trained with an embedding size of 1024, 8 attention heads, 24 layers, an mlp hidden size of 3072, a sequence window size of 768, and a memory size of 768. We see the best compressed memory size is 3,072 in this sweep, facilitating a total attention window of 3840. The WikiText-103 model was trained with an embedding size of 1024, adaptive inputs using the same parameters as
| Compressed Memory Size | 512 | 1024 | 2048 | 3072 | 4096 |
|---|---|---|---|---|---|
| Enwik8 BPC | 1.01 | 0.99 | 0.98 | 0.97 | 1.00 |
Table 8: Compressed memory size vs test performance for Enwik8
| Compressed Memory Size | 256 | 512 | 1024 | 1536 | 2048 |
|---|---|---|---|---|---|
| WikiText-103 Perplexity | 18.2 | 17.9 | 17.6 | 17.1 | 17.7 |
Table 9: Compressed memory size vs test performance for WikiText-103
C PG-19 PREPROCESSING
The raw texts from the Gutenberg project were minimally pre-processed by removing boilerplate license text. We then also replaced discriminatory words with a unique
D PG-19 TOPICS
We present top-words for some of the topics on the PG-19 corpus. These were generated with LDA topic model
| Geography | War | Civilisations | Human Condition | Naval | Education | Art |
|---|---|---|---|---|---|---|
| water | people | roman | love | island | work | poet |
| river | emperor | rome | religion | ship | school | music |
| feet | war | greek | religious | sea | life | one |
| miles | army | city | life | men | children | poetry |
| north | death | gods | moral | captain | may | work |
| south | battle | king | human | coast | social | literature |
| mountains | city | first | society | land | child | art |
| sea | soldiers | caesar | man | great | education | great |
| lake | power | great | virtue | found | conditions | poem |
| rock | thousand | romans | nature | islands | well | written |
| mountain | arms | athens | marriage | shore | study | english |
| country | empire | greece | women | voyage | best | author |
| valley | upon | temple | christian | vessels | years | play |
| ice | country | son | age | time | possible | genius |
| west | time | egypt | law | english | class | style |
E PG-19 SAMPLES
We show a few different samples from the Compressive Transformer trained on PG-19. We use Nucleus Sampling with
As the Compressive Transformer is trained without state resetting, it is actually slightly out of sample when provided with the (relatively) short contexts. This is because its memory and compressed memory may be still empty (whereas they are always full during training). However we see a trend of the samples usually improving towards the end.
GRAY
Context from Baby Mine by Maragaret Mayo:
CHAPTER I
Even in college Alfred Hardy was a young man of fixed ideas and high ideals and proud of it.
His friend, Jimmy Jinks, had few ideas and no ideals, and was glad of it, and before half of their first college term had passed, Jimmy had ridded himself of all such worries as making up his own mind or directing his own morals. Alfred did all these things so much better, argued Jimmy, furthermore, Alfred LIKED to do them–Jimmy owed it to his friend to give him that pleasure.
Compressive Transformer:
He did not like doing them in the most elegant of all possible marble walks–where freshmen and sophomores stand and talk and laugh and chatter till evening falls and look cool and still and cool and beautifully white in spite of the hot sun, but had to go there for a month before going home. It happened one day that he and Jimmy were strolling together in the green shade of St. Martin’s, and Jimmy naturally asked why he was so silent. He was very talkative as most boys are who have once got the habit of it on shore, so the need for answering was soon supplied.
”I am wondering,” he said, ”why my old friend and chum, Jimmy Jinks, is living a single life and so has no boys. Have you not noticed that many of his old friends had their own boys to follow in their footsteps and help, and often helped successfully too?”
”Oh, yes,” said Alfred, ”it was plain enough then that he liked to live by himself and got none of his present pleasures or entertainments because of the other fellow’s giving it to him.”
”Then it IS sad,” said Jimmy, ”that an old friend like the friend of Big Lights, Jimmy Reynolds, should be living the single life and having no boys.”
”And it is pretty true, too,” said Alfred, ”yet I believe it is the best thing that could happen even now.”
”And so do I,” said Jimmy, ”and I also believe that it is time someone did something to make the things worse for the other’s partner.”
”And so do I,” said Alfred, ”and so do I, and so do I.”
”What can we do?” asked Jimmy.
”I don’t know,” said Alfred, ”but if someone will come down to the valley and go through a regular mill with the old forces at work, doing the work of the two, and doing it better than any other Miller he ever saw or ever hoped to see, I think there will be a good deal of change in the things and things to be cared for for and for driven and counter-driven.”
”I will come and do that mill,” said Jimmy.
GRAY
Context from The Diary of Samuel Pepys:
June 1st. Having taken our leaves of Sir W. Batten and my Lady, who are gone this morning to keep their Whitsuntide, Sir W. Pen and I and Mr. Gauden by water to Woolwich, and there went from ship to ship to give order for and take notice of their forwardness to go forth, and then to Deptford and did the like, having dined at Woolwich with Captain Poole at the tavern there. From Deptford we walked to Redriffe, calling at the half-way house, and there come into a room where there was infinite of new cakes placed that are made against Whitsuntide, and there we were very merry. By water home, and there did businesses of the office. Among others got my Lord’s imprest of L1000 and Mr. Creed’s of L10,000 against this voyage their bills signed. Having wrote letters into the country and read some things I went to bed.
Compressive Transformer:
So I rose and did some business at the office, and then to dinner, where great abundance of good things of the kitchen I only being partaker in the feast. After dinner I went to look at the yard and found it in a very bad condition, so I went home and there finding no delight in seeing things I should make shift to get in my way and so I will leave off.
So I took leave of my great friend, and the next day being Sunday, my wife and I and a great number of friends went by water to Deptford, and there took the oaths of allegiance and supremacy and promised such obedience to the Lord as should hereafter turn to the advantage of our State. Being come back we found the shops and warehouses in a pitiable condition, so my wife and I and all the rest of the friends went home and there did the like and by Saturday day I being alone in my office I found myself in a very bad situation, for my Lord, and Mr. Creed, and Pett were all standing ready to receive and receive the oaths of allegiance and right to make use of the ten days’ licence I had allowed for this purpose by virtue of the law that was such a help to the servant. Having looked at the ships, I told my wife what I had seen and that I was greatly troubled at my situation and, I hope that will mend it.
So that is the end of a long and patient seeking after those things which in my opinion I deserve and may get if I can only by fair means get them.
Context from The Patrol of the Sun Dance Trail by Ralph Connor:
High up on the hillside in the midst of a rugged group of jack pines the Union Jack shook out its folds gallantly in the breeze that swept down the Kicking Horse Pass. That gallant flag marked the headquarters of Superintendent Strong, of the North West Mounted Police, whose special duty it was to preserve law and order along the construction line of the Canadian Pacific Railway Company, now pushed west some scores of miles.
Along the tote-road, which ran parallel to the steel, a man, dark of skin, slight but wiry, came running, his hard panting, his streaming face, his open mouth proclaiming his exhaustion. At a little trail that led to the left he paused, noted its course toward the flaunting flag, turned into it, then struggled up the rocky hillside till he came to the wooden shack, with a deep porch running round it, and surrounded by a rustic fence which enclosed a garden whose neatness illustrated a characteristic of the British soldier. The runner passed in through the gate and up the little gravel walk and began to ascend the steps.
Compressive Transformer:
As though by mutual consent, all the three members of the Superintendent Force were standing on the veranda watching his progress, a man came strolling slowly down the path, stopping often to gaze curiously about him, then stopping once more when he saw the four standing on the veranda. His face was dark, his eyes prominent, and the lines that furrowed his forehead showed marks of many woes, yet his step was alert and he began to mount the steps with steady strides.
The man’s attention was diverted by the unexpected appearance of the two policemen; a third man came slipping up, grumbling under his breath, and stopped long enough to look doubtfully around him, then slipped away again. In the meantime, the four on the veranda continued to watch the approach of the new arrival. In spite of himself, the Superintendent Force became conscious of a curious feeling—uneasiness that was almost fear—in the presence of that strange man. He would not allow himself to admit the fact, yet he felt it in his bones. But to the watchers, the plain, everyday features of that stranger and his coming, seemed only just what the Seven White Shee owed him—their weight, their hurry, their blast.
Had a chariot been made for the good of the three horses, and had the driver been given orders that he should speed them that he might win, they would have been heartening things in the sight of the veteran and the victor. To you they would have been unintelligible to the root of your understanding. When you gaze up in the faces of those four gray horses, you can see clearly through the clouds of dust that rise from their hoofs, and discern plainly where the banker is and where the hobo. Then you will understand why you shall not press the bitter grapes and why you shall not spurn the generous doctrines. You will understand why you shall not praise the lash or the spur, for you will know where the true would be and where the false would be. Then you will understand why you, a man with reason and heart, need not tear your hair over-bitter and why you need not laugh over the blunders of an ignorant man.
About nine o’clock that morning, two buggies, drawn by powerful horses, crossed the Rubicon and turned the railroad from Sandhurst into the Hollow of the Mountains. And though the charioteers stood at their horses’ heads, and their drivers cried at their loudest, there was not a man in the four teams who did not feel that his day was worth all the toil and all the peril that he had undergone. And if there were a man in them who did not know that—who did not feel that the road through the Hollow of the Mountains is made easy by the arrival of travelers and by the coming of government, there was one who did not at that moment care whether his day’s work were worth all the toil and all the danger that he had had to endure or whether it were not worth more than all.
Footnotes
-
The authors intend to release the PG-19 dataset along with the split into train, validation and test subsets. ↩