
Have the rules of the NLP game been rewritten? From word2vec and ELMo to BERT

2018.10.24 10:20 am | Author: Xi Xiaoyao

Summary: These are all clichés by now, but I never tire of writing them out again. When Google's word2vec appeared in 2013, NLP flourished in every area; for a while it felt embarrassing to submit a paper without using pre-trained word vectors. And what is word2vec? Obviously, it is a "linear" language model. As I ...


I still remember when, not so long ago in machine reading comprehension, Microsoft and Alibaba surpassed human performance on SQuAD with R-Net+ and SLQA, and Baidu topped MS MARCO with V-Net under the BLEU metric. These networks were each more complex than the last, and "designing a more task-specific network" seemed to have become the politically correct research direction in NLP. In that climate, word2vec, GloVe, or fastText could only serve as icing on the cake. And what about transfer learning and pre-training? They never seemed to be the protagonists in NLP.

Xiao Xi felt a little ashamed writing this article. Having worked on representation learning and transfer for a long time without getting satisfying experimental results, I still believed intuitively that this should be a core theme of NLP. When BERT came out a few days ago, I first felt that poverty was limiting (crossed out) my imagination, and then I realized that my vision had simply been too narrow.

Everyone understands BERT differently. This article tries to approach BERT from the perspective of word2vec and ELMo, so let us first briefly revisit the essentials of those two. Anyone who has already digested them thoroughly can skip ahead to the BERT section.


word2vec

Speaking of which, these are all clichés, but I never tire of writing them out again. As soon as Google's word2vec came out in 2013, NLP blossomed in every area. For a while it seemed embarrassing to write a paper without using pre-trained word vectors. And what is word2vec?


Obviously, it is a "linear" language model. Since our goal is to learn word vectors, and the word vectors should semantically support some "linear semantic operations" such as "Emperor − Empress = Male − Female" (ignoring Wu Zetian), a linear model is of course sufficient: it is fast and does the job elegantly.
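As a toy illustration of such a linear operation (with hand-built vectors on made-up "semantic axes", not real trained word2vec embeddings):

```python
import numpy as np

# Toy 4-dimensional vectors built from hypothetical semantic axes
royal  = np.array([1.0, 0.0, 0.0, 0.0])
male   = np.array([0.0, 1.0, 0.0, 0.0])
female = np.array([0.0, -1.0, 0.0, 0.0])

emperor = royal + male      # "emperor" = royalty + maleness
empress = royal + female    # "empress" = royalty + femaleness
man, woman = male, female

# The "linear semantic operation" from the text holds by construction:
assert np.allclose(emperor - empress, man - woman)
```

Real word2vec vectors only satisfy such analogies approximately (via nearest-neighbor search), but the linear structure is the same idea.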

Furthermore, one of the essences of word2vec is its set of softmax acceleration tricks for the language model: it replaces the traditional hierarchical softmax and NCE methods with a seemingly whimsical "negative sampling" method. And what exactly is this so-called "negative sampling"?

Negative sampling

We know that the softmax layer is very expensive when training a language model: after all, to predict which word occupies the current position, the number of classes equals the vocabulary size, so computing a softmax over tens or hundreds of thousands of classes is obviously costly. However, if our goal is not to train an exact language model but only its by-product, the word vectors, then we only need a "subtask" that is much cheaper to compute.

Think about it: if you are handed 10,000 numbered cards and asked to find the largest, isn't that tedious? But if the largest value is extracted in advance and shuffled together with five randomly drawn cards, and you only have to pick the largest out of those six, isn't that much easier?

That is the idea of negative sampling: instead of making the model find the most likely word from the entire vocabulary directly, it is given that word (the positive example) together with several randomly sampled noise words (the negative samples). As long as the model can pick out the right word from this small set, the objective is considered met. The objective function corresponding to this idea is:
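(The equation image from the original post did not survive; what should stand here is the standard skip-gram negative-sampling objective from Mikolov et al. [10]:)

```latex
\log \sigma\!\left( {v'_{w_O}}^{\top} v_{w_I} \right)
+ \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
  \left[ \log \sigma\!\left( -{v'_{w_i}}^{\top} v_{w_I} \right) \right]
```

Here $w_I$ is the input (context) word, $w_O$ the true target word, and the $k$ noise words $w_i$ are drawn from a noise distribution $P_n(w)$.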

This negative-sampling idea was later successfully applied in the BERT model, only with the granularity changed from words to sentences. Don't worry, we will come back to that slowly ~

Character level and context

Although from 2015 to 2017 a lot of work tried to start from the char level and find a new path around the established game of pre-trained word vectors, in actual measurements the gains proved short-lived and were soon called into question [8][9]. Still, people realized that character-level text contains patterns that are hard to capture at the word level. So on the one hand there appeared fastText [5], a word-vector model that can learn character-level features, and on the other hand, supervised tasks began introducing character-level text representations through shallow CNNs, HighwayNets, RNNs, and other networks.

So far, however, word vectors are context-independent. That is, the same word always has the same vector in different contexts, which obviously leaves the word-vector model without word sense disambiguation (WSD) capability. Therefore, to make word vectors context-sensitive, people started encoding the word-vector sequence within each downstream task.

The most common encoding method is of course an RNN-based network. In addition, deep CNNs have also been used successfully for encoding (e.g. text classification [6], machine translation [7], machine reading comprehension [4]). Naturally! And! Google said CNNs are too vulgar, we want to use a fully connected network! (crossed out) Self-attention! Hence the Transformer model, tailor-made for NLP [11]. The Transformer was proposed for machine translation, but has also played a major role in other areas such as retrieval-based dialogue [3].

However, since it turns out that essentially every NLP task needs this encoding step, why shouldn't the word vectors carry context-dependent capability from the very beginning? And so there was ELMo [2].


ELMo

Of course, ELMo is not the first model that tries to generate context-sensitive word vectors, but it is indeed one that gives you a good reason to abandon word2vec (manual smile). After all, sacrificing that bit of inference speed in exchange for so much performance improvement is worth it in most cases ~ At the model level, ELMo is a stacked biLM (strictly speaking, it trains two stacked unidirectional LSTMs), so of course it has good encoding capability. At the same time, the source code also supports introducing additional character-level encoding via a Highway Net or CNN. Training it naturally uses the standard maximum-likelihood objective of a language model, i.e.:
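(The formula image is missing here; what should stand here is the joint forward/backward log-likelihood from the ELMo paper [2]:)

```latex
\sum_{t=1}^{N} \Big(
  \log p\big(w_t \mid w_1, \ldots, w_{t-1};\; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s\big)
  + \log p\big(w_t \mid w_{t+1}, \ldots, w_N;\; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s\big)
\Big)
```

The token embeddings $\Theta_x$ and the softmax parameters $\Theta_s$ are shared between the forward and backward LSTMs; only the LSTM parameters are direction-specific.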

The real highlight of ELMo is of course not the model itself, but that it showed, indirectly through experiments, that the features learned in different layers of a multi-layer RNN are actually different. Therefore, ELMo proposes that after pre-training, when migrating to a downstream NLP task, a trainable parameter be created for the original word-vector layer and for each RNN hidden layer. These parameters are normalized by a softmax, each layer is multiplied by its parameter, and the results are summed, playing a weighting role. The word vector obtained by this "weighted sum" is then further scaled by an additional parameter to better suit the downstream task.

ps: this last scale parameter is actually quite important. For example, in word2vec the variance of the word vectors learned by CBOW and by skip-gram generally differs quite a bit, and word vectors whose variance matches that of the downstream task's subsequent layers converge faster and more easily reach better performance.

The math expression is as follows:
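(The formula image is missing; per the ELMo paper [2], the weighted sum for token $k$ on a given task is:)

```latex
\mathrm{ELMo}_k^{task}
  = \gamma^{task} \sum_{j=0}^{L} s_j^{task} \, h_{k,j}^{LM}
```

Here $h_{k,0}^{LM}$ is the token's original (context-independent) embedding, $h_{k,j}^{LM}$ for $j \ge 1$ are the biLM hidden layers, $s_j^{task}$ are the softmax-normalized layer weights, and $\gamma^{task}$ is the scale parameter mentioned above.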

Such a migration strategy means that tasks requiring word-sense disambiguation tend to learn, through training, a large weight for the second hidden layer, while tasks with an obvious need for part-of-speech and syntax information may learn relatively large values for the first hidden layer's parameter (an experimental conclusion of the paper). In short, this yields a feature-rich word vector that downstream tasks can customize; no surprise that the effect is much better than word2vec's.

After all, ELMo's goal is only to learn context-sensitive, more powerful word vectors. Its aim is still to lay a solid foundation for downstream tasks; it does not yet intend to be the king.

And we know that merely encoding the text completely and efficiently (i.e. obtaining very precise and rich features for each token) is nowhere near enough to cover all NLP tasks. In QA, machine reading comprehension (MRC), natural language inference (NLI), dialogue, and other tasks, many more complex patterns need to be captured, such as relationships between sentences. Hence the various fancy attention mechanisms that networks add in the downstream tasks (see the SOTAs of NLI, MRC, chatbots).

With the need to capture ever more magical patterns, researchers customized a wide variety of network structures for each downstream task. As a result, the same model collapses on the same task after a small change in the task setup, and a significant performance drop occurs when the dataset is swapped for one with a different distribution, which obviously does not match human language behavior ~ You know, human generalization ability is very strong. This suggests that the entire direction of current NLP development may be wrong, especially under the leadership of SQuAD, where every trick and fancy structure is exhausted just to climb the leaderboard. What is the real significance of that for NLP?

This road seemed to stretch ever further in the wrong direction, but luckily it was finally blocked by a model, namely BERT (Bidirectional Encoder Representations from Transformers) [1], published by Google a few days ago.


BERT

The main significance of this paper is not what model it uses or how it is trained, but that it proposes a whole new set of rules of the game.


As mentioned earlier, thoroughly customizing a complex model structure with poor generalization ability for every NLP task is very unwise, and in fact unnecessary. Since ELMo brings such a great improvement over word2vec, this shows that the potential of pre-trained models goes far beyond providing an exact word vector for downstream tasks. So can we directly pre-train a model at the keel level? If character-level, word-level, sentence-level, and even inter-sentence relationship features have already been fully captured, then in each NLP task we only need to customize a very simple task-specific output layer (e.g. a single-layer MLP); after all, the model skeleton is already built.

And BERT did just that, or rather, it really achieved it. As a general keel-level model, it easily challenged the deeply customized models on 11 tasks. . .

How does it work?

Deep bidirectional encoding

First, BERT points out that previous pre-trained models were insufficient for learning context-sensitive word vectors! Although in downstream supervised tasks the encoding methods are already very extravagant, and deep bidirectional encoding has basically become standard for many complex downstream tasks (such as MRC, dialogue), in pre-training the previous state-of-the-art models were still based on the traditional language model, and the traditional language model is unidirectional (by mathematical definition), i.e.:
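(The formula image is missing; the standard left-to-right factorization it refers to is:)

```latex
p(s) = \prod_{t=1}^{n} p\big(w_t \mid w_1, \ldots, w_{t-1}\big)
```

Each word is conditioned only on the words before it, never on the words after it.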

And it is often very shallow (imagine stacking three layers of LSTM: training stalls and requires all kinds of tricks), such as ELMo.

Although ELMo uses bidirectional RNNs for encoding, the RNNs in the two directions are actually trained separately, with only a simple addition at the loss layer at the end. As a result, when a word is encoded in either direction, it can never see the words on its other side. But obviously, the semantics of a word in a sentence depend on words both to its left and to its right; encoding from only one direction cannot describe them properly.

So why not do real bidirectional encoding, as in downstream supervised tasks?

The reason is clear at a glance: after all, a traditional language model is trained to predict the next word. If bidirectional encoding were used, that would mean the word to be predicted has already been seen, and such prediction is of course meaningless. Therefore, BERT proposes a new task for training a model that can encode bidirectionally, called the Masked Language Model (Masked LM).

Masked LM

As the name suggests, Masked LM means that instead of giving the words seen so far and predicting the next word like a traditional LM, we directly mask out part of the words of the whole sentence (chosen at random). The model can then safely do bidirectional encoding, and then confidently predict what the masked words are. This task was in fact called the cloze test long ago (roughly, a "fill-in-the-gap test").

Obviously, this causes some minor problems. Although bidirectional encoding is ensured this way, the mask tokens themselves also get encoded ( ̄ ▽  ̄""), and these mask tokens do not exist in downstream tasks. . . What to do? The author's approach is to tell the model, as much as possible, to ignore the influence of these tokens: "These are noise! Unreliable! Ignore them!" For a masked word:

with 80% probability, replace it with "[MASK]"
with 10% probability, replace it with a randomly sampled word
with 10% probability, do not replace it at all (even though it is not replaced, it still has to be predicted)
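The 80/10/10 rule above can be sketched as follows (a minimal illustration of the described scheme, not BERT's actual implementation; `vocab` and all names are hypothetical):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, vocab=None, rng=None):
    """Pick ~15% of positions as prediction targets, then apply the
    80/10/10 replacement rule from the text."""
    rng = rng or random.Random(0)
    vocab = vocab or ["cat", "dog", "runs", "sleeps"]  # toy vocabulary
    masked = list(tokens)
    targets = {}  # position -> original token the model must predict
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = tok
            r = rng.random()
            if r < 0.8:                  # 80%: replace with [MASK]
                masked[i] = "[MASK]"
            elif r < 0.9:                # 10%: replace with a random word
                masked[i] = rng.choice(vocab)
            # else: 10%: keep the original token unchanged
    return masked, targets
```

The loss is then computed only at the positions recorded in `targets`, regardless of whether the token was actually replaced.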


When choosing the encoder, the author did not use the run-of-the-mill bi-LSTM, but the deeper and more parallelizable Transformer encoder. This way, every word in the sequence can directly encode every other word in the sentence, regardless of direction and distance. On the other hand, I subjectively feel that the Transformer also finds it easier than an LSTM to avoid being disturbed by the mask tokens: after all, the self-attention process can deliberately weaken the mask tokens' weight down to nothing, whereas how the input gate of an LSTM would handle mask tokens is anyone's guess.

Wait, didn't Xiao Xi say in a previous article that the Transformer encoder obviously loses position information? Does the author use the scary sin/cos positional encoding from the original Transformer paper here? No: the author simply and crudely trained a position embedding directly ╮( ̄ ▽  ̄"")╭ That is, if a sentence is truncated to length 50, we have 50 positions, from position 0 to position 49, each position gets a randomly initialized vector, and these are trained along with everything else (I want to say: even this can work?! Too simple and crude ...). Also, for combining the position embedding with the word embedding, BERT chose direct addition.

In terms of depth, the full version of the BERT encoder stacks a whopping 24 layers of multi-head-attention blocks (you should know that the SOTA model DAM in dialogue uses only 5 layers ...). . . And each block contains 16 attention heads and 1024 hidden units ╮( ̄ ▽  ̄"")╭ Here is the slogan: money is all you need (crossed out)

Learning sentence and sentence-pair representations

As mentioned earlier, for many tasks encoding alone is not enough to complete the task (it only learns a bunch of token-level features); some sentence-level patterns must also be captured to complete NLI, QA, dialogue, and other tasks that need sentence representations, inter-sentence interaction, and matching. For this, BERT introduced another task that is extremely important yet extremely lightweight to learn.

Sentence-level negative sampling

Remember, Xiao Xi said in the word2vec chapter that one of the essences of word2vec is the introduction of an elegant negative-sampling task to learn word-level representations. What if we generalize this negative-sampling process to the sentence level? That is the key to how BERT learns sentence-level representations.

Here BERT constructs a sentence-level classification task, similar to word2vec: first a sentence is given (corresponding to the given context in word2vec), its actual next sentence is the positive example (corresponding to the correct word in word2vec), a randomly sampled sentence serves as the negative example (corresponding to the randomly sampled noise words), and then a sentence-level binary classification is performed (i.e. decide whether the candidate is the current sentence's next sentence or just noise). Through this simple sentence-level negative sampling, BERT can learn sentence representations as easily as word2vec learns word representations.
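The pair construction described above can be sketched like this (an illustrative data-preparation sketch of the described idea; the function and variable names are hypothetical, and real BERT draws negatives from a full corpus):

```python
import random

def make_nsp_examples(documents, rng=None):
    """Sentence-level negative sampling (next-sentence prediction).
    `documents` is a list of documents, each a list of sentence strings.
    Returns (sentence_a, sentence_b, label) triples."""
    rng = rng or random.Random(0)
    all_sentences = [s for doc in documents for s in doc]
    examples = []
    for doc in documents:
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # positive example: the real next sentence, label 1
                examples.append((doc[i], doc[i + 1], 1))
            else:
                # negative example: a randomly sampled sentence, label 0
                examples.append((doc[i], rng.choice(all_sentences), 0))
    return examples
```

The model then only has to solve a binary classification over each (sentence_a, sentence_b) pair.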

Sentence-level representation

Wait a minute, I haven't actually said yet how the sentence is represented. . .

Here BERT does not follow the common practice of downstream supervised tasks and do global pooling on top of the encoding. Instead, it prepends a special token, [CLS], to each sequence (for sentence-pair tasks, the sequence is the two spliced sentences; for other tasks, a single sentence), as shown in the figure.

ps: [SEP] is the separator between sentences. Since BERT also supports learning sentence-pair representations, [SEP] marks the cut point of the sentence pair.

Then let the encoder encode [CLS] deeply; the top hidden layer of that deep encoding is the representation of the whole sentence or sentence pair. This approach is a bit puzzling at first glance, but don't forget that the Transformer can encode global information into any position regardless of space and distance, and [CLS], as the sentence/sentence-pair representation, is directly connected to the classifier's output layer. As a "gateway" on the gradient backpropagation path, it will of course learn to pick up the upper-level features relevant to classification.

To let the model distinguish whether each word belongs to the "left sentence" or the "right sentence", the author also introduced the concept of "segment embedding" to separate the sentences. For sentence pairs, embedding A and embedding B mark the left and right sentence, respectively; for single sentences, only embedding A is used. Embeddings A and B are also trained along with the model.

ps: this method is as simple and crude as the position embedding. It is very hard to understand why BERT still works on tasks where the network theoretically needs to maintain symmetry. Complicated feelings.

In the end, BERT's representation of each token is the sum of the token's original word vector (token embedding), the position embedding mentioned above, and the segment embedding introduced here, as shown in the figure:

An interface to downstream tasks that is simple to the point of excess

What truly shows that BERT is a keel-level model, and no longer a word-vector model, is the design of its interface to downstream tasks, or, to use a fancier word, its migration strategy.

First, since the top-level representations of sentences and sentence pairs are obtained naturally, for text classification and text matching tasks (text matching is in fact also a text classification task, just with a text pair as input), you only need to take the obtained representation (i.e. the encoder's top-layer output at the [CLS] position) and add one layer of MLP ~
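The "one layer of MLP on [CLS]" step looks roughly like this (a toy numpy sketch of the described idea; random weights and sizes stand in for the trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, num_classes, seq_len = 8, 3, 7  # toy sizes

# Pretend this is the encoder's top-layer output: [seq_len, hidden]
encoder_out = rng.normal(size=(seq_len, hidden))

# Single-layer MLP (a linear classifier) applied only to position 0 = [CLS]
W = rng.normal(size=(hidden, num_classes))
b = np.zeros(num_classes)

logits = encoder_out[0] @ W + b                # use only [CLS]
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over classes
assert probs.shape == (num_classes,)
```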

Since the text is already deeply bidirectionally encoded, for sequence labeling tasks all that remains is to add a softmax output layer, not even a CRF.

What Xiao Xi finds even more interesting is that for span extraction tasks like SQuAD, BERT not only drops the two big packages of deep encoding and deep attention, it even dares to throw away the pointer net at the output layer? It simply uses two linear classifiers, like DrQA, to output the start and end of the span? Nothing more to say, already on my knees.
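Those two linear classifiers amount to little more than this (a toy numpy sketch of the described output layer; random weights and sizes stand in for the trained model, and real systems restrict the span search to valid ranges):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden = 12, 8  # toy sizes

# Pretend this is the top-layer encoder output for the passage tokens
encoder_out = rng.normal(size=(seq_len, hidden))

# Two linear classifiers: one scores each token as span start, one as span end
w_start = rng.normal(size=hidden)
w_end   = rng.normal(size=hidden)

start_logits = encoder_out @ w_start  # one score per token
end_logits   = encoder_out @ w_end

# Predicted span: best start, then best end at or after that start
start = int(np.argmax(start_logits))
end = start + int(np.argmax(end_logits[start:]))
assert 0 <= start <= end < seq_len
```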

Finally, take a look at the experimental results

Well, that's Google.

When this paper came out, Xiao Xi was actually quite pleased: many earlier ideas no longer need experimental verification, since BERT has strangled them in the cradle (。́︿ ̀。). Work on classification, labeling, and transfer tasks can start over from scratch, and the SQuAD leaderboard race can also stop. Fortunately, BERT does not handle generation tasks, which still leaves people a bit of room for imagination. Well, manual smile, manual tears.


[1] 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2] 2018 NAACL | Deep contextualized word representations
[3] 2018 ACL | Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network
[4] 2018 ICLR | Fast and Accurate Reading Comprehension by Combining Self-Attention and Convolution
[5] 2017 TACL | Enriching Word Vectors with Subword Information
[6] 2017 ACL | Deep Pyramid Convolutional Neural Networks for Text Categorization
[7] 2017 | Convolutional Sequence to Sequence Learning
[8] 2017 | Do Convolutional Networks need to be Deep for Text Classification?
[9] 2016 | Convolutional Neural Networks for Text Categorization: Shallow Word-level vs. Deep Character-level
[10] 2013 NIPS | Distributed representations of words and phrases and their compositionality
[11] 2017 NIPS | Attention Is All You Need
