Unconstrained On-line Handwriting Recognition with Recurrent Neural Networks

Alex Graves, IDSIA, Switzerland and TU Munich, Germany, alex@idsia.ch
Santiago Fernández, IDSIA, Switzerland, santiago@idsia.ch
Marcus Liwicki, University of Bern, Switzerland, liwicki@iam.unibe.ch
Horst Bunke, University of Bern, Switzerland, bunke@iam.unibe.ch
Jürgen Schmidhuber, IDSIA, Switzerland and TU Munich, Germany, juergen@idsia.ch

Abstract

On-line handwriting recognition is unusual among sequence labelling tasks in that the underlying generator of the observed data, i.e. the movement of the pen, is recorded directly. However, the raw data can be difficult to interpret because each letter is spread over many pen locations. As a consequence, sophisticated pre-processing is required to obtain inputs suitable for conventional sequence labelling algorithms, such as HMMs. In this paper we describe a system capable of directly transcribing raw on-line handwriting data. The system consists of a recurrent neural network trained for sequence labelling, combined with a probabilistic language model. In experiments on an unconstrained on-line database, we record excellent results using either raw or pre-processed data, well outperforming a state-of-the-art HMM in both cases.

1 Introduction

Handwriting recognition is traditionally divided into off-line and on-line recognition [12]. Off-line recognition is performed on images of handwritten text. In on-line handwriting the position of the pen-tip on a surface is recorded at regular time intervals, and the task is to map from the sequence of pen positions to the sequence of words.

At first sight, it would seem straightforward to label the raw on-line data directly. Unlike off-line handwriting, a compact and complete feature set is already explicit in the sensor readings, and does not have to be extracted from an image. However, the fact that each letter or word is distributed over many pen positions poses a problem for conventional sequence labelling algorithms, which have difficulty processing data with long-range interdependencies. The problem is especially acute for unconstrained handwriting, where the writing style may be either cursive or printed. The standard solution is to pre-process the data into a set of localised features. These features typically include geometric properties of the trajectory in the vicinity of every data point, off-line information from a generated image, and character level shape characteristics [7, 8]. Delayed strokes (such as the crossing of a "t" or the dot of an "i") require special treatment because they split up the characters and therefore interfere with localisation. HMMs [7] and hybrid systems incorporating time-delay neural networks and HMMs [8] are commonly trained with such features.

The issue of classifying pre-processed versus raw data has broad relevance to machine learning, and merits further discussion. Using hand-crafted features often yields superior results, and in some cases can render classification essentially trivial. However, there are three points to consider in favour of raw data. Firstly, designing an effective pre-processor requires considerable time and expertise. Secondly, hand-coded features tend to be more task-specific. For example, features designed for English handwriting could not be applied to languages with substantially different alphabets, such as Arabic or Chinese. In contrast, a system trained directly on pen movements could be re-trained on any language.
Thirdly, using raw data allows feature extraction to be built into the classifier, and the whole system to be trained together. For example, convolutional neural networks, in which a globally trained hierarchy of network layers is used to extract progressively higher level features, have proved effective at classifying raw images [14, 10].

In this paper, we apply a recurrent neural network (RNN) to the task of on-line handwriting recognition. Experiments are carried out on the IAM database (IAM-OnDB) [11], which contains forms of unconstrained English text acquired from a whiteboard. The RNN architecture is bidirectional Long Short-Term Memory (BLSTM) [4], chosen for its ability to process data with long time dependencies. The RNN is trained with connectionist temporal classification (CTC) [3], an objective function specifically designed for labelling unsegmented sequence data. An algorithm is introduced for applying grammatical constraints to the CTC outputs, thereby providing word level transcriptions. The performance of the RNN system using both raw and pre-processed input data is compared to that of an HMM using pre-processed data only [1]. To the best of our knowledge, this is the first time whole sentences of unconstrained handwriting have been directly transcribed from raw on-line data.

Section 2 describes BLSTM and CTC and introduces the algorithm for integrating CTC with grammatical constraints. Section 3 provides experimental results, and conclusions are given in Section 4.

2 Method

2.1 Bidirectional Long Short-Term Memory

One of the key benefits of recurrent neural networks (RNNs) is their ability to make use of previous context. However, for standard RNN architectures, the range of context that can in practice be accessed is limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the recurrent connections. This is usually referred to as the vanishing gradient problem [5].

Long Short-Term Memory (LSTM) [6] is an RNN architecture specifically designed to address the vanishing gradient problem. An LSTM hidden layer consists of multiple recurrently connected subnets, known as memory blocks. Each block contains a set of internal units, known as cells, whose activation is controlled by three multiplicative 'gate' units. The effect of the gates is to allow the cells to store and access information over long periods of time, thereby avoiding the vanishing gradient problem.

For many tasks it is useful to have access to future as well as past context. Bidirectional RNNs [13] achieve this by presenting the input data forwards and backwards to two separate hidden layers, both of which are connected to the same output layer. Bidirectional LSTM (BLSTM) [4] combines the above architectures to provide maximal access to long-range, bidirectional context.
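As a rough illustration of this layout (not the authors' implementation), the following PyTorch sketch builds a bidirectional LSTM whose forward and backward hidden layers feed a single output layer. PyTorch's standard LSTM cell stands in for the memory-block architecture described above, and all sizes are placeholders borrowed from Section 3.2.

import torch
import torch.nn as nn

class BLSTM(nn.Module):
    def __init__(self, n_inputs, n_hidden, n_outputs):
        super().__init__()
        # one forward and one backward hidden layer over the same input sequence
        self.blstm = nn.LSTM(n_inputs, n_hidden, bidirectional=True)
        # both directions are connected to the same output layer
        self.output = nn.Linear(2 * n_hidden, n_outputs)

    def forward(self, x):          # x: (timesteps, batch, n_inputs)
        h, _ = self.blstm(x)       # h: (timesteps, batch, 2 * n_hidden)
        return self.output(h)      # one output vector per timestep

net = BLSTM(n_inputs=4, n_hidden=100, n_outputs=81)   # sizes as in Section 3.2
y = net(torch.randn(50, 1, 4))                        # y has shape (50, 1, 81)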
2.2 Connectionist Temporal Classification

Connectionist temporal classification (CTC) [3] is an objective function designed for sequence labelling with RNNs. Unlike previous objective functions it does not require pre-segmented training data, or post-processing to transform the network outputs into transcriptions. Instead, it trains the network to map directly from input sequences to an estimate of the conditional probabilities of the possible labellings.

A CTC output layer contains one more unit than there are elements in the alphabet L of labels for the task. The output activations are normalised with the softmax activation function [2]. The first |L| outputs are used to estimate the probabilities of observing the corresponding labels at particular times. The extra output estimates the probability of observing a 'blank', or no label. The combined output sequence estimates the joint probability of all possible alignments of the input sequence with labels and blanks. The probability of a particular labelling can then be estimated by summing over the probabilities of the alignments that correspond to it.

More precisely, for an input sequence x of length T, choosing a label (or blank) at every timestep according to the probabilities implied by the network outputs defines a probability distribution over the set of length T sequences of labels and blanks. We denote this set L'^T, where L' = L ∪ {blank} is the extended alphabet with the blank included. To distinguish them from labellings, we refer to the elements of L'^T as paths, denoted π. Since the probabilities of the labels at each timestep are conditionally independent given x, the conditional probability of a path π ∈ L'^T is given by:

    p(π|x) = \prod_{t=1}^{T} y^t_{π_t},                                        (1)

where y^t_k is the activation of output unit k at time t.

Paths are mapped onto label sequences l ∈ L^{≤T} by an operator B that removes first the repeated labels, then the blanks. For example, both B(a, -, a, b, -) and B(-, a, a, -, -, a, b, b) yield the labelling (a, a, b). Since the paths are mutually exclusive, the conditional probability of a given labelling l ∈ L^{≤T} is the sum of the probabilities of all paths corresponding to it:

    p(l|x) = \sum_{π ∈ B^{-1}(l)} p(π|x).                                      (2)

Although a naive calculation of the above sum would be unfeasible, it can be efficiently evaluated with a graph-based algorithm [3], similar to the forward-backward algorithm for HMMs.

To allow for blanks in the output paths, for each label sequence l ∈ L^{≤T} consider a modified label sequence l', with blanks added to the beginning and the end and inserted between every pair of labels. The length of l' is therefore |l'| = 2|l| + 1. For a labelling l, define the forward variable α_t(s) as the summed probability of all path beginnings reaching index s of l' at time t, i.e.

    α_t(s) = P(π_{1:t} : B(π_{1:t}) = l_{1:s/2}, π_t = l'_s | x)
           = \sum_{π : B(π_{1:t}) = l_{1:s/2}} \prod_{t'=1}^{t} y^{t'}_{π_{t'}},       (3)

where, for a sequence s, the notation s_{a:b} denotes the subsequence (s_a, s_{a+1}, ..., s_{b-1}, s_b). Note that index s of l' corresponds to index s/2 (rounded down) of the original label sequence l. The backward variables β_t(s) are defined as the summed probability of all path endings that would complete the labelling l if the path beginning had reached index s of l' at time t:

    β_t(s) = P(π_{t+1:T} : B(π_{t:T}) = l_{s/2:|l|}, π_t = l'_s | x)
           = \sum_{π : B(π_{t:T}) = l_{s/2:|l|}} \prod_{t'=t+1}^{T} y^{t'}_{π_{t'}}.   (4)

Both the forward and backward variables are calculated recursively [3]. The label sequence probability is given by the sum of the products of the forward and backward variables at any timestep:

    p(l|x) = \sum_{s=1}^{|l'|} α_t(s) β_t(s).                                  (5)

The objective function for CTC is the negative log probability of the network correctly labelling the entire training set. Let S be a training set, consisting of pairs of input and target sequences (x, z), where the length of sequence z is less than or equal to the length of the input sequence x. Then the objective function is:

    O^{CTC} = - \sum_{(x,z) ∈ S} ln p(z|x).                                    (6)

The network can be trained with gradient descent by first differentiating O^{CTC} with respect to the outputs, then using backpropagation through time to find the derivatives with respect to the network weights.
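For concreteness, here is a minimal sketch (plain NumPy, probabilities kept in linear space without rescaling, so only numerically safe for short sequences) of the forward recursion that evaluates p(l|x); y is assumed to be the T × |L'| matrix of softmaxed network outputs and blank the index of the blank unit.

import numpy as np

def ctc_forward_prob(y, labels, blank):
    """Sum the probabilities of all paths mapped onto `labels` by B (Eq. 2)."""
    T = y.shape[0]
    ext = [blank]                        # l': blanks before, between and after labels
    for k in labels:
        ext += [k, blank]
    S = len(ext)                         # |l'| = 2|l| + 1
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]            # a path may start with a blank...
    alpha[0, 1] = y[0, labels[0]]        # ...or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s]                       # stay on the same symbol
            if s > 0:
                a += alpha[t - 1, s - 1]              # advance by one symbol
            # a blank may be skipped, but only between two different labels
            if ext[s] != blank and s > 1 and ext[s - 2] != ext[s]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    # valid paths finish on the last label or on the trailing blank
    return alpha[T - 1, S - 1] + alpha[T - 1, S - 2]

The backward recursion for β_t(s) follows the same pattern, run from t = T down to 1.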
Noting that the same label (or blank) may be repeated several times for a single labelling l, we define the set of positions where label k occurs as lab(l, k) = {s : l'_s = k}, which may be empty. We can then set l = z and differentiate (5) with respect to the network outputs to obtain:

    -∂ ln p(z|x) / ∂u^t_k = y^t_k - (1/p(z|x)) \sum_{s ∈ lab(z,k)} α_t(s) β_t(s),      (7)

where u^t_k and y^t_k are respectively the unnormalised and normalised outputs of the softmax layer.

Once the network is trained, we would like to use it to label some unknown input sequence x by choosing the labelling l* with the highest conditional probability:

    l* = arg max_l p(l|x).                                                     (8)

Using the terminology of HMMs, we refer to the task of finding this labelling as decoding. Unfortunately, we do not know of a tractable decoding algorithm that is guaranteed to give optimal results. However a simple and effective approximation is given by assuming that the most probable path corresponds to the most probable labelling:

    l* ≈ B(arg max_π p(π|x)).                                                  (9)
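A sketch of this best-path approximation (again assuming y is the matrix of softmaxed outputs): take the most active output at every timestep, then apply B, removing repeats before blanks.

import numpy as np

def best_path_decode(y, blank):
    """Return the labelling given by Eq. (9) as a list of label indices."""
    path = np.argmax(y, axis=1)          # most probable label (or blank) per timestep
    labelling, prev = [], None
    for k in path:
        if k != prev and k != blank:     # collapse repeated labels, then drop blanks
            labelling.append(int(k))
        prev = k
    return labelling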
2.3 Integration with an External Grammar

For some tasks we want to constrain the output labellings according to a grammar. For example, in continuous speech and handwriting recognition, the final transcriptions are usually required to form sequences of dictionary words. In addition it is common practice to use a language model to weight the probabilities of particular sequences of words. We can express these constraints by altering the label sequence probabilities in (8) to be conditioned on some probabilistic grammar G, as well as the input sequence x:

    l* = arg max_l p(l|x, G).                                                  (10)

Note that absolute requirements, for example that l contains only dictionary words, can be incorporated by setting the probability of all sequences that fail to meet them to 0. Applying the basic rules of probability, we obtain:

    p(l|x, G) = p(l|x) p(l|G) p(x) / (p(x|G) p(l)),                            (11)

where we have used the fact that x is conditionally independent of G given l. If we assume that x is independent of G, (11) reduces to:

    p(l|x, G) = p(l|x) p(l|G) / p(l).                                          (12)

Note that this assumption is in general false, since both the input sequences and the grammar depend on the underlying generator of the data, for example the language being spoken. However it is a reasonable first approximation, and is particularly justifiable in cases where the grammar is created using data other than that from which x was drawn (as is common practice in speech and handwriting recognition, where separate textual corpora are used to generate language models). If we further assume that, prior to any knowledge about the input or the grammar, all label sequences are equally probable, (10) reduces to:

    l* = arg max_l p(l|x) p(l|G).                                              (13)

Note that, since the number of possible label sequences is finite (because both L and |l| are finite), assigning equal prior probabilities does not lead to an improper prior.

We now describe an algorithm, based on the token passing algorithm for HMMs [15], that allows us to find an approximate solution to (13) for a simple grammar. Let G consist of a dictionary D containing W words, and a set of W^2 bi-grams p(ŵ|w) that define the probability of making a transition from word w to word ŵ. The probability of any label sequence that does not form a sequence of dictionary words is 0.

For each word w, define the modified word w' as w with blanks added at the beginning and end and between each pair of labels. Therefore |w'| = 2|w| + 1. Define a token tok = (score, history) as a pair consisting of a real valued score and a history of previously visited words. Each token corresponds to a particular path through the network outputs, and the token score is the log probability of that path. Define a state s of w' as an index into the sequence of labels and blanks (1 ≤ s ≤ |w'|).

The basic idea of the token passing algorithm is to pass along the highest scoring tokens at every word state, then maximise over these to find the highest scoring tokens at the next state. The transition probabilities are used when a token is passed from the last state in one word to the first state in another. The output word sequence is then given by the history of the highest scoring end-of-word token at the final timestep.

At every timestep t of the length T output sequence, each state s of each modified word w' holds a single token tok(w, s, t). This is the highest scoring token reaching that state at that time. In addition we define the input token tok(w, 0, t) to be the highest scoring token arriving at word w at time t, and the output token tok(w, -1, t) to be the highest scoring token leaving word w at time t. The complete procedure is given in Algorithm 1.

Algorithm 1: CTC Token Passing Algorithm
 1: Initialisation:
 2: for all words w ∈ D do
 3:   tok(w, 1, 1) ← (ln(y^1_b), (w))
 4:   tok(w, 2, 1) ← (ln(y^1_{w'_2}), (w))
 5:   if |w| = 1 then
 6:     tok(w, -1, 1) ← tok(w, 2, 1)
 7:   else
 8:     tok(w, -1, 1) ← (-∞, ())
 9:   end if
10:   tok(w, s, 1) ← (-∞, ()) for all other s
11: Algorithm:
12: for time t = 2 to T do
13:   sort output tokens tok(w, -1, t-1) by ascending score
14:   for all words w ∈ D do
15:     tok(w, 0, t) ← arg max_{tok(ŵ, -1, t-1)} {tok(ŵ, -1, t-1).score + ln(p(w|ŵ))}
16:     tok(w, 0, t).score += ln(p(w|ŵ))
17:     tok(w, 0, t).history += w
18:     for segment s = 1 to |w'| do
19:       P ← {tok(w, s, t-1), tok(w, s-1, t-1)}
20:       if w'_s ≠ blank and s > 2 and w'_{s-2} ≠ w'_s then
21:         P += tok(w, s-2, t-1)
22:       end if
23:       tok(w, s, t) ← highest scoring token in P
24:       tok(w, s, t).score += ln(y^t_{w'_s})
25:     tok(w, -1, t) ← highest scoring of {tok(w, |w'|, t), tok(w, |w'|-1, t)}
26: Termination:
27: find the output token tok(w, -1, T) with the highest score at time T
28: output tok(w, -1, T).history

The CTC token passing algorithm has a worst case complexity of O(TW^2), since line 15 requires a potential search through all W words. However, because the output tokens tok(w, -1, T) are sorted in order of score, the search can be terminated when a token is reached whose score is less than the current best score with the transition included. The typical complexity is therefore considerably lower, with a lower bound of O(TW log W) to account for the sort. If no bi-grams are used, lines 15-17 can be replaced by a simple search for the highest scoring output token, and the complexity reduces to O(TW).
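An illustrative and deliberately simplified Python rendering of the same idea is given below: there is no early termination over the sorted output tokens, word entry always passes through the word's leading blank, and the bi-grams are supplied as a dictionary with log_bigram[(v, w)] = ln p(w|v). It is a sketch of the technique rather than the authors' implementation.

import math

def ctc_token_passing(log_y, lexicon, blank, log_bigram):
    """log_y[t][k]: log-softmaxed outputs; lexicon maps words to label lists."""
    NEG = -math.inf
    T = len(log_y)
    ext = {w: [blank] + [x for k in ks for x in (k, blank)]    # extended words w'
           for w, ks in lexicon.items()}
    # tok[w][s] = (score, history): best token at state s of w' at the current time
    tok = {w: [(NEG, ())] * len(ext[w]) for w in lexicon}
    for w, ks in lexicon.items():                              # initialisation, t = 1
        tok[w][0] = (log_y[0][blank], (w,))
        tok[w][1] = (log_y[0][ks[0]], (w,))

    def out(w):                      # output token: last label or trailing blank state
        return max(tok[w][-1], tok[w][-2], key=lambda p: p[0])

    for t in range(1, T):
        prev_out = {v: out(v) for v in lexicon}
        # best way of entering each word w from some predecessor word v
        entry = {w: max(((prev_out[v][0] + log_bigram.get((v, w), NEG),
                          prev_out[v][1] + (w,)) for v in lexicon),
                        key=lambda p: p[0])
                 for w in lexicon}
        new_tok = {}
        for w, wext in ext.items():
            states = []
            for s, k in enumerate(wext):
                cands = [tok[w][s]]                                 # stay in the same state
                cands.append(tok[w][s - 1] if s > 0 else entry[w])  # advance, or enter the word
                if k != blank and s > 1 and wext[s - 2] != k:       # skip a blank between labels
                    cands.append(tok[w][s - 2])
                best = max(cands, key=lambda p: p[0])
                states.append((best[0] + log_y[t][k], best[1]))
            new_tok[w] = states
        tok = new_tok
    return max((out(w) for w in lexicon), key=lambda p: p[0])[1]   # best word history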
3 Experiments

The experimental task was on-line handwriting recognition, using the IAM-OnDB handwriting database [11], which can be downloaded from http://www.iam.unibe.ch/~fki/iamondb/. For the CTC experiments, we record the character error rate using standard best-path decoding, and the word error rate using Algorithm 1 with a language model and a dictionary. For the HMM system, the word error rate is quoted from the literature [1]. Both the character and word error rate are defined as the combined number of insertions, deletions and substitutions in the algorithm's transcriptions of the input sequences in the test set, divided by the combined length of the target sequences in the test set. We compare results using two different input representations, one hand-crafted for HMMs, the other consisting of raw data direct from the pen sensor.

3.1 Data and Preprocessing

IAM-OnDB consists of pen trajectories collected from 221 different writers using a 'smart whiteboard' [11]. The writers were asked to write forms from the LOB text corpus [9], and the position of their pen was tracked using an infra-red device in the corner of the board. The original input data consists of the x and y pen co-ordinates, the points in the sequence when individual strokes (i.e. periods when the pen is pressed against the board) end, and the times when successive position measurements were made. Recording errors in the x, y data were corrected by interpolating to fill in missing readings, and by removing steps whose length exceeded a certain threshold.

IAM-OnDB is divided into a training set, two validation sets, and a test set, containing respectively 5364, 1438, 1518 and 3859 written lines taken from 775, 192, 216 and 544 forms. For our experiments, each line was used as a separate sequence (meaning that possible dependencies between successive lines were ignored). The character level transcriptions contain 80 distinct target labels (capital letters, lower case letters, numbers, and punctuation). A dictionary consisting of the 20,000 most frequently occurring words in the LOB corpus was used for decoding, along with a bi-gram language model optimised on the training and validation sets [1]. 5.6% of the words in the test set were not in the dictionary.

Two input representations were used for this task. The first consisted simply of the offset of the x, y co-ordinates from the top left of the line, along with the time from the beginning of the line, and an extra input to mark the points when the pen was lifted off the whiteboard. We refer to this as the raw input representation. The second representation required a large amount of sophisticated pre-processing and feature extraction [1]. We refer to this as the pre-processed input representation. Briefly, in order to account for the variance in writing styles, the pen trajectories were normalised with respect to such properties as the slant, skew and width of the letters, and the slope of the line as a whole. After that, two sets of input features were extracted, the first consisting of 'online' features, such as pen position, pen speed, line curvature etc., and the second consisting of 'offline' features created from a two dimensional window of the image created by the pen.
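The raw input representation can be assembled with essentially no processing. A minimal sketch, assuming a hypothetical list of strokes, each a list of (x, y, time) samples; the field layout, the pen-lift marker convention and the y-axis direction are illustrative assumptions rather than details taken from the database.

import numpy as np

def raw_inputs(strokes):
    """One row per pen sample: x and y offsets from the top left of the line,
    time from the beginning of the line, and a marker for pen lifts."""
    rows = []
    for stroke in strokes:
        for i, (x, y, t) in enumerate(stroke):
            pen_lift = 1.0 if i == len(stroke) - 1 else 0.0   # last sample of a stroke
            rows.append((x, y, t, pen_lift))
    data = np.array(rows, dtype=np.float32)
    data[:, 0] -= data[:, 0].min()      # x offset from the left edge of the line
    data[:, 1] -= data[:, 1].min()      # y offset from the top (y assumed to grow downwards)
    data[:, 2] -= data[0, 2]            # time from the beginning of the line
    return data                         # shape: (number of samples, 4), as in Section 3.2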
3.2 Experimental Setup

The CTC network used the BLSTM architecture [4]. The forward and backward hidden layers each contained 100 single cell memory blocks. The input layer was fully connected to the hidden layers, which were fully connected to the output layer. The output layer contained 81 units (80 characters plus the blank label). For the raw input representation, there were 4 input units, giving 100,881 weights in total. For the pre-processed representation, there were 25 input units, giving 117,681 weights in total. Hyperbolic tangent was used for the cell activation functions, and a logistic sigmoid in the range [0, 1] was used for the gates.

For both input representations, the data was normalised so that each input had mean 0 and standard deviation 1 on the training set. The network was trained with stochastic steepest descent, using a learning rate of 10^-4 and a momentum of 0.9. Training was stopped after no improvement was recorded on the validation set for 50 training epochs.

The HMM setup [1] contained a separate, left-to-right HMM with 8 states for each character (8 × 81 = 648 states in total). Diagonal mixtures of 32 Gaussians were used to estimate the observation probabilities. All parameters, including the word insertion penalty and the grammar scale factor, were optimised on the validation set.

3.3 Results

Table 1: Word Error Rate (WER) on IAM-OnDB. LM = language model. CTC results are a mean over 4 runs, ± standard error. All differences were significant (p < 0.01).

    System   Input           LM    WER
    HMM      pre-processed   yes   35.5% [1]
    CTC      raw             no    30.1 ± 0.5%
    CTC      pre-processed   no    26.0 ± 0.3%
    CTC      raw             yes   22.8 ± 0.2%
    CTC      pre-processed   yes   20.4 ± 0.3%

The character error rate for the CTC network with the pre-processed inputs was 11.5 ± 0.05%. From Table 1 we can see that with a dictionary and a language model this translated into a mean word error rate of 20.4%, which is a relative error reduction of 42.5% compared to the HMM. Without the language model, the error reduction was 26.8%.

With the raw input data CTC achieved a character error rate of 13.9 ± 0.1%, and word error rates that were comparable to those recorded with the pre-processed data. This result corroborates the ability of the LSTM architecture to access long-range contextual information. When we attempted to train a CTC network with a standard bidirectional RNN on the same task, the character error rate never dropped below 90%.

A useful indicator of contextual sensitivity is provided by the derivatives of the output y^t_k at a particular point t in the data sequence with respect to the inputs x^{t'} at all points 1 ≤ t' ≤ T in the sequence. We refer to the sequence of these derivatives as the sequential Jacobian. Looking at the relative magnitude of the sequential Jacobian at different timesteps gives an idea of the range of context that is being used.

Figure 1 shows a sequential Jacobian for CTC networks using both the raw and pre-processed inputs for a phrase from IAM-OnDB. For both input representations the region of sensitivity extends forwards and backwards from the given timestep, which supports the use of bidirectional networks. For the pre-processed representation, the Jacobian is sharply peaked around the time when the output is emitted. For the raw inputs, the Jacobian is more spread out, which suggests that the network does indeed make more use of long-range contextual information. Note the high sensitivity to the very end of the input sequence: this corresponds to the delayed dot of the letter "i".

Figure 1: Sequential Jacobian for a sequence from the IAM-OnDB, with raw inputs (left) and pre-processed inputs (right). For ease of visualisation, only the derivative with the highest absolute value is plotted at each timestep. The reconstructed image was created by plotting the pen co-ordinates recorded by the sensor. The individual strokes are alternately coloured red and black. For both representations, the Jacobian is plotted for the output corresponding to the label 'i' at the point when 'i' is emitted (indicated by the vertical dashed lines).

4 Conclusion

We have combined a BLSTM CTC network with a probabilistic language model. We have applied this system to an on-line handwriting database and obtained results that substantially improve on a state-of-the-art HMM. We have also shown that the network's performance with raw sensor data is comparable to that with specially pre-processed data. As far as we are aware, our system is the first to successfully recognise unconstrained on-line handwriting using raw data only.

References

[1] A. Anonymous. A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks. In Proc. 9th Int. Conf. on Document Analysis and Recognition, Curitiba, Brazil, 2007. To appear.
[2] J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In F. Fogelman Soulié and J. Hérault, editors, Neurocomputing: Algorithms, Architectures and Applications, pages 227-236. Springer-Verlag, 1990.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, ICML 2006, Pittsburgh, USA, 2006.
[4] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602-610, 2005.
[5] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, editors, A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
[6] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[7] Jianying Hu, Sok Gek Lim, and Michael K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33:133-147, 2000.
[8] S. Jaeger, S. Manke, J. Reichert, and A. Waibel. Online handwriting recognition: the NPen++ recognizer. International Journal on Document Analysis and Recognition, 3:169-180, 2001.
[9] S. Johansson, R. Atwell, R. Garside, and G. Leech. The tagged LOB corpus user's manual. Norwegian Computing Centre for the Humanities, 1986.
[10] Yann LeCun, Fu-Jie Huang, and Léon Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of CVPR'04. IEEE Press, 2004.
[11] M. Liwicki and H. Bunke. IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Proc. 8th Int. Conf. on Document Analysis and Recognition, volume 2, pages 956-961, 2005.
[12] Réjean Plamondon and Sargur N. Srihari. On-line and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.
[13] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673-2681, November 1997.
[14] Patrice Y. Simard, Dave Steinkraus, and John C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In Proceedings of the International Conference on Document Analysis and Recognition, 2003.
[15] S. J. Young, N. H. Russell, and J. H. S. Thornton. Token passing: A simple conceptual model for connected speech recognition systems. Technical report, Cambridge University Engineering Dept.