NMT does not learn languages the way humans do, despite some breathless reporting to that effect. Instead, it relies on statistical correlations, much as PbSMT does. The difference is that NMT can draw much more complex inferences from its training data and is very good at detecting correlations of correlations. Where a PbSMT system might observe that English frog tends to translate into German as Frosch, a neural system could note that if the text mentions words to do with railways, Herzstück is a much more likely translation, even if this translation occurs only a few times in the training corpus.
Older systems look at n-grams, sequences of a given number of consecutive words. For example, a system that works with 6-grams considers chunks of up to six words. This approach works well for compact linguistic structures, but it has trouble with "long-distance dependencies," such as German verb phrases, whose parts may be separated by entire clauses. By contrast, NMT systems look at whole sentences, and researchers are now pushing them to work on entire paragraphs or even longer stretches of text. This shift makes them more sensitive to context and better able to handle complex grammatical structures.
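The window limitation described above can be made concrete with a small Python sketch. It is only an illustration: the example sentence and the helper names ngrams and some_window_contains are made up here and do not come from the article or from any particular MT system.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def some_window_contains(tokens, n, first, second):
    """True if at least one n-gram holds both words at once."""
    return any(first in gram and second in gram for gram in ngrams(tokens, n))


# Illustrative sentence (not from the article): the auxiliary "hat" and the
# participle "gespeichert" are separated by a relative clause.
sentence = "Er hat die Datei , die ich gestern geschickt habe , gespeichert".split()

print(some_window_contains(sentence, 6, "hat", "gespeichert"))
print(some_window_contains(sentence, 11, "hat", "gespeichert"))
```

In this sentence no 6-word window ever contains both halves of the verb phrase; the window has to grow to 11 tokens before a single n-gram sees them together.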
NMT can work at the level of individual characters or word fragments, while phrase-based approaches work with whole words. This difference makes neural systems particularly good at handling morphologically rich languages, such as German or Hungarian. For example, a PbSMT system would not, without additional language technology, recognize that speichern and gespeichert are forms of the same verb. By contrast, NMT can work with patterns of characters to predict word forms it may not have seen before.
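A rough way to see why character patterns help is to compare the two verb forms at both levels. The Python sketch below is an assumption-laden illustration, not a description of how any NMT engine actually segments text (real systems typically learn their own subword units); the helper char_ngrams and the choice of 4-character chunks are hypothetical.

```python
def char_ngrams(word, n=4):
    """Return the set of all n-character substrings of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


# As whole words, the two forms are simply different symbols.
print("speichern" == "gespeichert")

# At character level they share most of their stem.
shared = char_ngrams("speichern") & char_ngrams("gespeichert")
print(sorted(shared))
```

A word-based system sees two unrelated tokens, whereas the character view exposes five shared 4-character chunks (cher, eich, iche, peic, spei) that a model can associate with the same meaning.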
Neural systems can extrapolate across multiple languages to fill in gaps in training data. This capability, called "zero-shot translation," allows NMT engines to translate language pairs for which they have no direct training data by drawing on what they have learned from other language pairs. For example, if an NMT engine has English<>Greek and English<>Finnish training data, but no Greek<>Finnish, it can use the information from its existing language pairs to translate that pair. Although the results will not be as good as for pairs where it has data, this can make the difference between having some translation and no translation at all.
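The article does not say how such an engine is built. One published approach to multilingual NMT trains a single model on all available pairs and marks each input with a tag naming the desired target language. The Python sketch below only shows that bookkeeping, under that assumption; the tag format, the training_pairs set, and the helper names are hypothetical, and no actual translation model is involved.

```python
# Language pairs the hypothetical engine was trained on (both directions).
training_pairs = {("en", "el"), ("el", "en"), ("en", "fi"), ("fi", "en")}


def tag_source(text, target_lang):
    """Prefix the source text with a tag naming the target language."""
    return f"<2{target_lang}> {text}"


def is_zero_shot(source_lang, target_lang):
    """True if the engine never saw this direction during training."""
    return (source_lang, target_lang) not in training_pairs


# A Greek-to-Finnish request is formed like any other request, even though
# the engine has no Greek<>Finnish training data.
print(tag_source("Καλημέρα", "fi"))
print(is_zero_shot("el", "fi"))
```

Because every request has the same shape, the model can be asked for a direction it was never trained on; how good the answer is depends on what it has learned from the other pairs.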
Source: http://www.tcworld.info/e-magazine/translation-and-localization/article/neural-machine-translation-offers-significant-advances-with-remaining-challenges/
[Edited at 2018-03-12 07:45 GMT]