Under review as a conference paper at ICLR 2017

GENERATIVE TRANSFER LEARNING BETWEEN RECURRENT NEURAL NETWORKS

arXiv:1608.04077v2 [cs.LG] 13 Oct 2016

Sungho Shin, Kyuyeon Hwang & Wonyong Sung
Department of Electrical and Computer Engineering
Seoul National University
Seoul, 08826 Korea
[email protected], [email protected], [email protected]

ABSTRACT

Training a neural network demands a large amount of labeled data. Keeping the data after training may not be possible because of hardware or power restrictions in on-device learning. In this study, we train a new RNN, called a student network, using a previously developed RNN, the teacher network, without using the original data. The teacher network is used for generating data to train the student network, and the softmax output of the teacher RNN is used as the soft target when training the student network. The performance evaluation is conducted using a character-level language model. The experimental results show that the proposed method yields good performance, approaching that of training with the original data. This work not only provides insight into the connection between learning and generation but is also useful when the original training data is not available.

1 INTRODUCTION

Training a recurrent neural network (RNN) demands a vast amount of sequence data to obtain high performance (Werbos, 1990). Keeping the training data for future use demands large storage, which is one of the reasons that training is performed not on devices but on server systems. Also, the complexity of the RNNs deployed for a given application depends on the capability of the hardware platform. Many embedded systems can only accommodate fairly small RNNs because of hardware or power restrictions. Thus, we need to train new RNNs for different hardware platforms using a previously trained network, without using the original data.

In this work, a new RNN, the student network, is trained using only a previously developed RNN, the teacher network. The data for student network training is obtained by operating the teacher RNN as a training data generator. The student RNNs are trained using only the generated data, without consulting the original data. Thus, the original data need not be stored after developing the teacher network, which contains most of the knowledge in the training set. Here, we show how student networks can be trained well using only the teacher network.

The most significant contribution of our paper is connecting generation and training. Our model can mimic the original training data, which has the following advantages. First, it increases the security of the data. Second, it helps to utilize on-line training results when the data is hard to store because of hardware restrictions. Third, it can be used for compressing information, because the number of weights of an RNN is much smaller than the size of the original data.

RNNs have been used as generative models in several lines of research (Graves, 2013; Oord et al., 2016; Gregor et al., 2015; Radford et al., 2016). Texts resembling Wikipedia are generated and handwriting is successfully simulated (Graves, 2013). Raw speech signals are generated well, and music is composed by using audio waveforms as the training data (Oord et al., 2016). Images are generated mimicking training samples (Gregor et al., 2015; Radford et al., 2016).

Several studies have been conducted to train a network using a previously developed network, which is often called knowledge transfer (Ba & Caruana, 2014; Hinton et al., 2014; Romero et al., 2015; Tang et al., 2016; Chan et al., 2015). In these works, the previously trained network is referred to as the teacher network, while the network that learns from the teacher network is called the student network. Training a student network only with the data generated by the teacher network usually does not show good results. To improve the results, the student networks in these works are trained utilizing the softmax output of the teacher network as the soft target. However, the training procedure for knowledge transfer uses not only the trained network but also the original data. Thus, knowledge transfer is a means of improved training of student RNNs when the original data is also available. The proposed work uses only the previously trained network for training the student networks, under the assumption that the original data is not available.

This paper is organized as follows. Section 2 explains the related work, and Section 3 describes the proposed training method. The experimental results are presented in Section 4, and a discussion including future work is given in Section 5. Section 6 concludes this paper.

2 RELATED WORK

Utilizing the knowledge contained in previously trained networks has been of much interest for applications such as network compression and pre-training. In an early work on network compression, a previously trained model is used to label a large unlabeled dataset to produce a much larger training set (Bucilu et al., 2006). Another related work is knowledge transfer through a hidden Markov model (HMM) (Pasa et al., 2014). An HMM is trained using the original data, and the sequence generated from the HMM is used for pre-training an RNN, which is then fine-tuned using the original data.

Our work is inspired by Hinton's knowledge distillation (Hinton et al., 2014). In that work, the output probabilities of a well-trained network are used as the soft target for training a small network. In FitNet, a thick, shallow model is transformed into a thin, deep model (Romero et al., 2015). The authors employ a guided layer in the student network that learns from the teacher's hidden layer; the guided layer can be pre-trained well and is then fine-tuned using knowledge distillation. Also, model conversion from a fully connected deep neural network (FCDNN) to an RNN, or in the opposite direction, has been tried (Tang et al., 2016; Chan et al., 2015).

The main difference between the previous works and ours is the use of the original training data. The previous works try to improve training by generating more data with the developed model, but the original data is also used. Our study conducts training only with the trained network. Thus, our approach can be considered a pure knowledge transfer from a trained large network to a small one. To the best of our knowledge, ours is the first work on knowledge transfer between RNNs.

3 TRAINING DATA TRANSFER FROM TEACHER TO STUDENT NETWORKS

The proposed method consists of three steps: (1) teacher network training using the original data, (2) generating a sequence using the teacher RNN as a generative model, and (3) training the student RNN by using the generated sequence from the teacher network. These three steps and the experimental environments are described as follows.

3.1 TEACHER NETWORK TRAINING

The teacher network training is not much different from that of ordinary RNNs. A long short-term memory (LSTM) RNN (Hochreiter & Schmidhuber, 1997) for character-level language modeling is used for this experiment. The character-level language model (LM) predicts the probabilities of the next output character based on the current and past input characters (Graves, 2013). The input of this RNN is a 30-dimensional one-hot encoded vector representing the alphabet A to Z and four special symbols. The output vector represents the probabilities of the characters and symbols and is also 30-dimensional. Since the input and output use the same labels, the RNN can easily be used as a generative model (Sutskever et al., 2011). The original data is the Wall Street Journal (WSJ) LM training text with non-verbalized punctuation, which contains about 215 million characters (Paul & Baker, 1992). For the development and test sets, about one percent of the data is randomly chosen for each purpose. The same development and test sets are also used for training the student network.


1st seq: MR. STEIN SAID THE COMPANY HAS BEEN A STRONG CONTENDER FOR THE COMPANY AND IS CONSIDERING A PLAN TO REDUCE ITS STAKE
2nd seq: THE COMPANY SAID IT WILL REPORT A LOSS FOR THE FISCAL FOURTH QUARTER ENDED JUNE THIRTIETH
3rd seq: THE COMPANY SAID IT WILL RECEIVE ABOUT HUNDRED FIFTY MILLION DOLLARS IN CASH AND THE REST WILL BE USED FOR GENERAL PURPOSES
4th seq: THE COMPANY SAID IT WILL RECEIVE ABOUT HUNDRED FIFTY MILLION DOLLARS IN CASH AND THE REST WILL BE USED FOR GENERAL PURPOSES
5th seq: THE COMPANY SAID IT WILL RECEIVE ABOUT HUNDRED FIFTY MILLION DOLLARS IN CASH AND THE REST WILL BE USED FOR GENERAL PURPOSES

Figure 1: Example of text generated by choosing the maximum-probability output.

The teacher network needs to be large enough to capture most of the information contained in the training set. As the teacher network, we train a deep LSTM RNN that consists of four layers, each containing 1,024 memory cells, which will be denoted as the 1024x4 RNN. The total number of parameters of this network is about 29.54 M. The teacher network is trained with the Adadelta-based stochastic gradient descent (SGD) algorithm (Zeiler, 2012). The initial learning rate is 1e-5 and is halved when the average cross entropy (ACE) on the development set does not improve for ten consecutive evaluations. Training continues until the learning rate reaches 1e-7. The teacher network predicts the probabilities of the next output as shown in Eq. (1):

P_{\theta_{teacher}}(x_t \mid x_{t-n:t-1}) = \mathrm{Softmax}_{\theta_{teacher}}(y_{t-1})    (1)

where \theta_{teacher} is the trained teacher model, x = (x_1, x_2, ...) is the sequence of original training data vectors, y = (y_1, y_2, ...) is the sequence of model output vectors (before applying the softmax layer), t is the time step, and n is the starting time step of a window. Thus, a sequence x' = (x'_1, x'_2, ...) can be generated by feeding some initial data sequence into the teacher model.
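To make the setup concrete, the following is a minimal sketch, not the authors' implementation, of a character-level LSTM teacher of the kind described above; the class name CharLSTM, the defaults, and the use of PyTorch are our own assumptions.

```python
# A minimal sketch (NOT the authors' code) of the character-level LSTM teacher:
# 30-dimensional one-hot input (A-Z plus four special symbols), four LSTM layers
# of 1024 cells each, and a 30-way output as in Eq. (1).
import torch
import torch.nn as nn

VOCAB_SIZE = 30  # 26 letters + 4 special symbols

class CharLSTM(nn.Module):
    def __init__(self, hidden_size=1024, num_layers=4, vocab_size=VOCAB_SIZE):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_size, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, x_onehot, state=None):
        # x_onehot: (batch, time, vocab_size) one-hot encoded characters
        h, state = self.lstm(x_onehot, state)
        return self.proj(h), state  # logits correspond to y_t before the softmax

# Next-character distribution, cf. Eq. (1)
teacher = CharLSTM()
x = torch.zeros(1, 1, VOCAB_SIZE)
x[0, 0, 0] = 1.0                               # one-hot encoding of the first character
logits, state = teacher(x)
probs = torch.softmax(logits[:, -1], dim=-1)   # predicted next-character probabilities
```

In this sketch, logits plays the role of y_{t-1} in Eq. (1) and probs is the predicted next-character distribution.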

3.2 SEQUENCE GENERATION USING THE TEACHER NETWORK

A text sequence is generated by feeding back the output of the character RNN to its input. Note that the input is one-hot encoded and the output represents the probabilities of the labels. Thus, we need to select one character or symbol among the output labels and then apply the selected one to the input of the RNN using one-hot encoding. The simplest way of mimicking the original training sequence is to choose the output label with the highest probability (Eq. 2). This may work for generating short sequences, but it results in repeating sequences when producing an extended length of text, as depicted in Figure 1.

x'_t = \arg\max\big(\mathrm{Softmax}_{\theta_{teacher}}(y_{t-1})\big)    (2)

As Figure 1 shows, the sequences repeat from the third one. The original data is the cause of this repetition problem. The five most frequent words in the original dataset are "THE (2,016,721), OF (954,112), TO (921,680), A (893,331), AND (809,062)"; the word "THE" appears most frequently in the original data. As a result, if we choose the next character using the maximum softmax score, the generated sequence always starts with the word "THE", whatever the start character was. To solve this problem, we select the output label in an indeterministic way by considering the probability values: a label is randomly sampled according to its probability. Since the selection is randomized, the sequence-repeating problem does not appear. This randomized selection technique is critical for our work because the training needs a very long stretch of data that should not repeat (Bucilu et al., 2006).

The text generated using the randomized method is very different from the original, as compared in Figure 2. We can hardly find any textual meaning in the generated sequence; however, out-of-vocabulary (OOV) words are not frequently found. It would be simplest to train the student network using only the generated text data, but this does not show good results. Thus, the probabilities of the labels generated from the teacher network are also used for training the student network. It has been shown that utilizing the softmax information improves the results of knowledge distillation considerably (Hinton et al., 2014). Thus, we store not only the character sequences but also all the output probabilities (softmax outputs) of the teacher RNN. The resulting approximation for the student network can be formulated as Eq. (3):

\theta_{student}(x, \text{hard target}) \approx \theta_{student}(x', \text{soft target})    (3)
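Below is a minimal sketch of the randomized generation described in this subsection, assuming the hypothetical CharLSTM teacher sketched earlier: the next character is sampled according to the teacher's softmax probabilities (temperature 1) rather than taken by argmax, and each softmax vector is stored as the soft target for later student training.

```python
# Sketch only: sample-by-probability generation with soft-target recording.
import torch

def generate(teacher, seed_onehot, length):
    # seed_onehot: (1, 1, vocab) one-hot encoding of the seed character
    chars, soft_targets = [], []
    x, state = seed_onehot, None
    for _ in range(length):
        logits, state = teacher(x, state)
        probs = torch.softmax(logits[:, -1], dim=-1)            # (1, vocab) label probabilities
        idx = torch.multinomial(probs, num_samples=1).item()    # random selection by probability
        chars.append(idx)
        soft_targets.append(probs.squeeze(0))                   # stored softmax output (soft target)
        x = torch.zeros_like(seed_onehot)                       # feed the selection back as one-hot input
        x[0, 0, idx] = 1.0
    return chars, torch.stack(soft_targets)
```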


Original data:
FREEMANS HASN’T DECIDED WHETHER TO APPEAL THE RULING FREEMAN WIGTON AND TABOR FREEMAN WIGTON TABOR AND MILKEN AND FOR DRESEL COULDN’T BE REACHED

Generated data:
FREEMAN HAS HAD A HORRIBLE IMPACT ON PLAYBOY MAGAZINE’S ALLEGED HIS OWN EARLY FREEMAN ALSO ACTED AS ADVERSARY AT ONE MAJOR STUDY THAT CITED THE SIZE OF HIS WORK IN THE SOUTH FREEMAN WIGTON TURNED HIS HANDS WITH TWENTY PEOPLE TO RESPOND TO HIS OPINIONS

Figure 2: Examples of text from the original training data and of data generated by the teacher network. The first word is the name "FREEMAN", which is included in the original training data.

Note that the hard target is from the original training data, and the soft target is generated from the teacher network.

3.3 GENERATIVE TRANSFER LEARNING FROM THE TEACHER NETWORK

The student network is now trained using the generated text sequence and the corresponding softmax output of the teacher network. The student network is assumed to have a limited complexity when compared to the teacher network. Each character or symbol of the generated text is one-hot encoded and applied to the input of the student RNN, one by one sequentially. At the same time, the softmax output, which is obtained while generating the text, is applied to the output of the student RNN as the soft target for training.

[Figure 3 diagram: the trained teacher RNN generates the data sequence for the student RNN through a selection and one-hot encoding step, while its softmax output is applied to the student RNN as the soft target.]

Figure 3: Overview of the generative transfer learning.

In the text generation process using the teacher network, the character or symbol with the largest probability (softmax output value) is not always selected; sometimes a label with quite a low probability can be chosen. However, the probability values are applied directly to the output of the student RNN as the soft target, without any modification. Thus, the temperature T used in Hinton's work is 1 in our experiments (Hinton et al., 2014). Figure 3 shows the overall process of student network training. Note that the original data is not used for student network training. The student networks are also trained with Adadelta-based stochastic gradient descent (SGD). The initial learning rate is 1e-5 and is halved when the average cross entropy (ACE) on the development set does not improve for ten consecutive evaluations. Training continues until the learning rate reaches 1e-7.
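A minimal sketch of one student update in this scheme, under the same assumptions as the earlier snippets: the generated one-hot characters are the input and the teacher's stored softmax outputs are the soft targets, with temperature T = 1. The function name and interfaces are illustrative only.

```python
# Sketch only: soft-target cross-entropy update for the student RNN.
import torch
import torch.nn.functional as F

def student_step(student, optimizer, x_onehot, soft_targets):
    # x_onehot: (batch, time, vocab) generated characters; soft_targets: (batch, time, vocab)
    logits, _ = student(x_onehot)
    log_probs = F.log_softmax(logits, dim=-1)
    loss = -(soft_targets * log_probs).sum(dim=-1).mean()  # cross entropy vs. teacher softmax (T = 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g., optimizer = torch.optim.Adadelta(student.parameters()), with the learning rate
# halved on development-set plateaus as described above (schedule not shown here).
```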

4 EXPERIMENTAL RESULTS

The teacher network has the configuration of 1024x4 LSTM cells, which can be considered a quite large model (Hwang & Sung, 2016). This network is trained using the WSJ LM training text with non-verbalized punctuation (Paul & Baker, 1992). The training applies the current character and the next one as the input and the hard target, respectively. In order to evaluate the trained student networks, small RNNs are also trained in the same way.


[Figure 4 appears here with three panels:
(a) Co-occurrence network of the original data
(b) Co-occurrence network of the generated data
(c) Occupation percentage (%) of words that appear less than 21 times, plotted against word frequency (1-20) in the data set, for the original data (215M) and the generated data (215M)]

Figure 4: Comparison of original and generated data sequences.

Table 1 shows the training results of a few networks, where the original training data is used and the loss function utilizes the hard target. The table indicates the number of parameters of each network and its performance, which we measure in bits per character (BPC). From Table 1, we can see that the 256x2 LSTM network shows limited performance compared to the other ones. The large 1024x4 network and the middle-sized 512x4 and 512x2 networks do not show much performance difference. Thus, we consider that the 256x2 network has too limited a capacity, while the 512x2 RNN has a relatively good capability. Based on this experiment, the 256x2 and the 512x2 configurations are chosen as the student networks.
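As an aside that is not stated in the paper, if the average cross entropy (ACE) reported on the convergence curves is measured in nats per character, BPC corresponds to dividing it by ln 2; the small sketch below assumes that convention.

```python
# Assumption: ACE is in nats per character; BPC is the same quantity in bits.
import math

def bpc_from_ace(ace_nats_per_char):
    return ace_nats_per_char / math.log(2)

print(round(bpc_from_ace(0.909), 3))  # ~1.311 bits per character
```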

4.1 COMPARISON OF ORIGINAL AND GENERATED DATA

Table 1: Training results using the original data sequences and their hard targets

                            256x2 LSTM   512x2 LSTM   512x4 LSTM   1024x4 LSTM
# of parameters             828,446      3,229,726    7,431,198    29,542,430
BPC (bits per character)    1.275        1.148        1.132        1.101

The co-occurrence networks and word-occurrence statistics of the original and generated data are compared to assess the quality of the data mimicking (Higuchi, 2001). The original data, the WSJ LM training text, contains about 36.5 million words, and the data generated using the teacher network consists of about 36.8 million words. When considering the number of words that appear only once, the original data has about 55,000 such words, which is about 0.15% of the total word count.


The generated data possesses about 159,000 single-count words, about 0.43% of the total. When considering the frequency of words that appear more than ten times, the original text shows about 99.15%, while the generated data shows about 98.77%.

Figure 4a and Figure 4b show the co-occurrence networks of words that co-appear more than 3,000 times. Nodes with the same color belong to the same community, meaning they are more closely associated with each other. Edges indicate a strong co-occurrence, and the two edge styles, solid and dotted, denote co-occurrences within the same community and between different communities, respectively. Note that the articles "the", "a", and "an" are excluded from the analysis since they do not carry important content. The words in the generated data show co-occurrence patterns quite similar to those in the original data. For instance, the blue nodes consisting of "month", "last", and "week" co-appear in both the original and the generated data. However, the words "percent, fifty, dollar, year" and "rise, point, quarter, first" are separated into different groups (violet and purple nodes) in the original data but fall into a single group (purple nodes) in the generated data. Although they are in different groups in the original data, they still have a strong co-occurrence (dotted lines). The other groups show similar distributions.

Figure 4c compares the occupation percentage of words that appear fewer than 21 times. The figure clearly shows that the generated data contains some noisy words, with an increased number of single- and double-count words; the estimated ratio of noise words is about 0.49%. From these two analyses, the generated sequence data looks very similar to the original sequence. If the student network is trained using this generated data without the soft-target values from the teacher network, however, the converged BPC is unsatisfactory, as depicted in Figure 5a (Teacher GS + HT). Therefore, knowledge distillation from the teacher network is crucial for the student network to obtain satisfactory performance. The specific results are reported in Section 4.2.
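The word-occurrence statistics quoted above (the co-occurrence analysis itself uses KH Coder) can be approximated with a simple token count; the sketch below is our own illustration, not the tool used by the authors.

```python
# Illustrative only: share of word tokens that occur once, and share covered by
# words occurring more than ten times, mirroring the statistics quoted above.
from collections import Counter

def word_stats(text):
    tokens = text.split()
    counts = Counter(tokens)
    single_tokens = sum(c for c in counts.values() if c == 1)
    frequent_tokens = sum(c for c in counts.values() if c > 10)
    return {
        "single_count_share": single_tokens / len(tokens),     # ~0.15% for the original text
        "frequent_word_share": frequent_tokens / len(tokens),  # ~99.15% for the original text
    }
```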

4.2 TRADE-OFF BETWEEN THE AMOUNT OF DATA AND CONVERGENCE SPEED

We conduct experiments to examine the performance of the student network when the amount of training data generated by the teacher network is limited to 10M, 50M, 100M, 150M, 215M, or 250M characters. Note that the original data has 215M characters. Although it is possible to generate an infinite length of data using the teacher network, we examine the efficiency of teacher-student knowledge transfer by limiting the data size. The training is conducted over many epochs until the performance saturates.

Figure 5 shows the training curves of the 256x2 and 512x2 LSTM RNNs using several different approaches. As a reference, the training result using the original data, the WSJ LM training text, is given, where the hard target is used for training. The training result that uses the generated sequence but not the soft target is also shown. Although 900 million characters of generated data are used for this hard-target-based teacher-student transfer learning, the result in Figure 5a is not satisfactory; the average cross entropy (ACE) only reaches 0.90889 on the validation set. The remaining five curves show the training results using the generated sequence and the soft target of the teacher network, where the data size is intentionally limited to 10M, 50M, 100M, 215M, and 250M characters.¹ The associated hyperparameters are all the same. We find that almost comparable performance can be achieved with the proposed method when the generated sequence size is 100M characters, which is smaller than the original data size. The learning speed is even faster than that with the original data; this speed-up is due to the knowledge transfer effect (Hinton et al., 2014). Finally, Figure 5a also shows the training result that utilizes both the original data and the soft target, which clearly gives the best results.

In Figure 5b, the data size is deliberately limited to 10M, 50M, 100M, 150M, 215M, and 250M characters for the 512x2 LSTM network, which has more capacity than the 256x2 LSTM network. Comparable performance can be achieved when the generated sequence size is 215M characters, but it takes more time to reach the baseline performance. With 250M generated characters, the convergence speed is even faster than that with the original data, and the result almost reaches that of training with the original data and the soft targets.

¹ Since the 100M and 150M training curves are almost identical, we omit the 150M curve in Figure 5a to improve legibility.


[Figure 5 appears here: convergence curves of average cross entropy versus the number of frames trained (in units of 5·10^8), for (a) the 256x2 network, with curves OS (215M) + HT (baseline), OS (50M) + HT, OS (215M) + ST, Teacher GS + HT, and Teacher GS + ST with 10M, 50M, 100M, 215M, and 250M characters, and (b) the 512x2 network, with curves OS (215M) + HT (baseline), OS (50M) + HT, OS (215M) + ST, and Teacher GS + ST with 10M, 50M, 100M, 150M, 215M, and 250M characters.]

Figure 5: Convergence curves in terms of average cross entropy (ACE) on the validation set for various sizes of the generated sequences, with the learning rate fixed at 1e-5. OS is the original training sequence. HT and ST are the hard and soft targets; hard targets follow the network input characters, while soft targets are generated by the teacher network. Teacher GS is the sequence generated by the teacher network, and the numbers in parentheses are the sizes of the sequences used. Note that the OS has a total of 215M characters.

Table 2: Comparison of generative transfer learning according to the amount of generated data. The BPCs are measured on the test set.

         Generative transfer learning (GS+ST)            OS+HT (baseline)    OS+ST
Size     10M     50M     100M    150M    215M    250M    50M     215M        215M
256x2    1.371   1.288   1.283   1.283   1.275   1.272    1.329   1.275       1.272
512x2    1.314   1.196   1.177   1.168   1.165   1.151    1.274   1.148       1.146

The final training results are reported in Table 2. For the 256x2 network, the baseline trained with the original sequence and the hard target reaches 1.329 and 1.275 BPC with 50M and 215M characters, respectively. The proposed method shows better BPC with 50M characters, achieves the same BPC with 215M characters, and gives an even slightly better BPC with 250M characters, matching the result of training with the original sequence and the soft target.


[Figure 6 diagram: inside the teacher network, a trained RNN for generation produces generated data and a trained RNN for classification produces the generated target; both are fed to the student RNN for training.]

Figure 6: Expanded structure of the generative transfer learning algorithm for general RNN applications.

Although the 512x2 network is also trained fairly well with the generative transfer learning method, the generative transfer for the 512x2 network appears less satisfactory than for the 256x2 network. The 512x2 network also shows better BPC with 50M characters, but a slightly worse result than the baseline even when 250M characters are used for training.

5 DISCUSSION

In the experiments, the student network is trained only with the generated sequence and the soft target, which leads to performance close to that of conventional training with the original data (hard target), even when the generated data is no larger than the original. The performance is slightly lower than that of dark knowledge distillation, which uses both the original data and the soft target obtained from the teacher network; note, however, that our approach works without the original data.

The developed method needs to be extended to other RNNs that do not use the same labels for the input and the output. One simple approach is shown in Figure 6, where the original function of the RNN is classification. The teacher network is built from two RNNs, one for data generation and the other for classification. The RNN for generation needs to have the same data format for the input and the output.
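The following is a speculative sketch of the extension outlined in Figure 6, with all interfaces assumed: a generator RNN produces input sequences, a separately trained classification RNN labels them with its softmax output, and the resulting pairs are used to train the student.

```python
# Speculative sketch only; generator_step and classifier are hypothetical callables.
import torch

def build_transfer_batch(generator_step, classifier, seed, length):
    # generator_step(x) -> next generated input frame, same shape as x: (batch, feat)
    # classifier(inputs) -> (logits, state) for inputs of shape (batch, time, feat)
    frames, x = [], seed
    for _ in range(length):
        x = generator_step(x)                         # generated data
        frames.append(x)
    inputs = torch.stack(frames, dim=1)               # (batch, time, feat)
    with torch.no_grad():
        logits, _ = classifier(inputs)
        soft_targets = torch.softmax(logits, dim=-1)  # generated targets
    return inputs, soft_targets                       # training pairs for the student RNN
```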

6 CONCLUDING REMARKS

Generative transfer learning between RNNs is studied in this work, where a previously trained network is used for generating training data. An indeterministic data generation method is developed to prevent repeating sequences, and the student network training procedure uses the softmax output of the teacher network as the soft target. In our experiments with an RNN for character-level language modeling, the generated data shows quite good word statistics, containing only about 0.5% noisy words. The experiments using the generated training set show quite good results even when the data size is no larger than that of the original. This work can be extended to combining knowledge among several RNNs that are trained separately, and an extension to other RNN structures whose input and output labels are not the same is also possible.

ACKNOWLEDGMENTS

This work was supported in part by the Brain Korea 21 Plus Project and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2015R1A2A1A10056051).


REFERENCES

Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, pp. 2654–2662, 2014.

Cristian Bucilu, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 535–541. ACM, 2006.

William Chan, Nan Rosemary Ke, and Ian Lane. Transferring knowledge from a RNN to a DNN. arXiv preprint arXiv:1504.01483, 2015.

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), 2015.

Koike Higuchi. KH Coder: a free software for quantitative content analysis or text mining. Available at http://khc.sourceforge.net/en, 2001.

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop, NIPS, 2014.

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Kyuyeon Hwang and Wonyong Sung. Character-level incremental speech recognition with recurrent neural networks. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5335–5339. IEEE, 2016.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Luca Pasa, Alberto Testolin, and Alessandro Sperduti. A HMM-based pre-training approach for sequential data. In ESANN, 2014.

Douglas B. Paul and Janet M. Baker. The design for the Wall Street Journal-based CSR corpus. In Proceedings of the Workshop on Speech and Natural Language, pp. 357–362. Association for Computational Linguistics, 1992.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In Proceedings of the 4th International Conference on Learning Representations (ICLR), 2016.

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), 2015.

Ilya Sutskever, James Martens, and Geoffrey E. Hinton. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1017–1024, 2011.

Zhiyuan Tang, Dong Wang, and Zhiyong Zhang. Recurrent neural network training with dark knowledge transfer. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5900–5904. IEEE, 2016.

Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.

Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.