Lemmatization

Lemmatization is the task of determining a canonical form of a token. When these tokens are words, we often say that a lemmatizer determines that word’s root form. For instance, the words am, are, and is all have the same root form (or lemma) be, and the words dinner and dinners both have the lemma dinner. Formally, we might model lemmatization as a function from strings (\(\Sigma^*\)) to some subset of those strings that we will call the roots (\(R\)).

\[\text{lemmatize}: \Sigma^* \rightarrow R \subset \Sigma^*\]
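As a toy illustration of this signature, a lookup-based lemmatizer might look like the sketch below. The lookup table and the fall-back behavior are purely illustrative assumptions; real lemmatizers such as Stanza's are learned models, not hand-written tables.

# A toy lookup-based lemmatizer, purely for illustration
TOY_LEMMA_TABLE = {"am": "be", "are": "be", "is": "be", "dinners": "dinner"}

def lemmatize(token: str) -> str:
    # Fall back to the lower-cased token itself when it has no table entry
    return TOY_LEMMA_TABLE.get(token.lower(), token.lower())

print(lemmatize("is"))       # be
print(lemmatize("dinners"))  # dinner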

In general, lemmatization is assumed to strip only inflectional morphology from a word. For instance, derivations would be lemmatized to derivation rather than derive, since -ation is a derivational morpheme, which we can see from the fact that it changes the word’s grammatical category from verb to noun.

Lemmatization is definitionally a destructive procedure. If we want to retain the morphological structure of a word, we need to do morphological analysis instead. Morphological analysis is usually assumed to map a string to a sequence of strings that we will call the morphemes (\(M\)), and usually we assume that these morphemes include both inflectional and derivational morphemes.

\[\text{analyze-morphology}: \Sigma^* \rightarrow M^* \subset \Sigma^{**}\]
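As a toy illustration of this second signature, the sketch below maps a single string to a sequence of morpheme strings. The hand-written segmentations are assumptions covering two words only, just to show the output type; they are not how a real analyzer works.

# A toy morphological analysis via a hand-written table, for illustration only
TOY_MORPH_TABLE = {
    "derivations": ["derive", "ation", "s"],
    "dinners": ["dinner", "s"],
}

def analyze_morphology(token: str) -> list:
    # Unknown words are left unsegmented
    return TOY_MORPH_TABLE.get(token.lower(), [token.lower()])

print(analyze_morphology("derivations"))  # ['derive', 'ation', 's']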

Constructing a morphological analyzer (especially one that handles derivational morphology correctly) is a much harder problem than constructing a lemmatizer, and there are far fewer off-the-shelf systems for doing it. In LING281/481 and LING282/482, I cover how to build such a system.

To get the lemmatized forms of words, we need to include the lemma processor in the processors list passed to stanza.Pipeline.

import stanza

# Build an English pipeline with only the tokenize and lemma processors, on the CPU
stanza_en_nlp = stanza.Pipeline(lang='en', processors='tokenize,lemma', use_gpu=False)

# Build a Chinese pipeline with the same processor list, also on the CPU
stanza_zh_nlp = stanza.Pipeline(lang='zh', processors='tokenize,lemma', use_gpu=False)
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
INFO:stanza:Loading these models for language: en (English):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
| lemma     | combined |
========================

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
INFO:stanza:"zh" is an alias for "zh-hans"
INFO:stanza:Loading these models for language: zh-hans (Simplified_Chinese):
=======================
| Processor | Package |
-----------------------
| tokenize  | gsdsimp |
| lemma     | gsdsimp |
=======================

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Loading: lemma
INFO:stanza:Done loading processors!
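The pipelines are then applied to the text strings from an earlier section. For reference, en_str and zh_str are not redefined here; reconstructed from the outputs below, they were approximately the following (an assumption, not the exact originals):

# Approximate reconstructions of the example strings defined in an earlier section
en_str = ("Homelessness is to be found in the large numbers of mentally ill and "
          "substance-abusing people in the homeless population, don't you think?")
zh_str = "林行止表示貿易戰促使在中國的工廠搬遷到越南、寮國、印度、台灣甚至是馬來西亞"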
# Processing English text
en_doc = stanza_en_nlp(en_str)
print(type(en_doc))

# Processing Chinese text
zh_doc = stanza_zh_nlp(zh_str)
print(type(zh_doc))
<class 'stanza.models.common.doc.Document'>
<class 'stanza.models.common.doc.Document'>
for sent in en_doc.sentences:
    stanza_en_sent = [(word.text, word.lemma) for word in sent.words]
    print(stanza_en_sent)
[('Homelessness', 'homelessness'), ('is', 'be'), ('to', 'to'), ('be', 'be'), ('found', 'find'), ('in', 'in'), ('the', 'the'), ('large', 'large'), ('numbers', 'number'), ('of', 'of'), ('mentally', 'mentally'), ('ill', 'ill'), ('and', 'and'), ('substance', 'substance'), ('-', '-'), ('abusing', 'abuse'), ('people', 'people'), ('in', 'in'), ('the', 'the'), ('homeless', 'homeless'), ('population', 'population'), (',', ','), ('do', 'do'), ("n't", 'not'), ('you', 'you'), ('think', 'think'), ('?', '?')]
for sent in zh_doc.sentences:
    stanza_zh_sent = [(word.text, word.lemma) for word in sent.words]
    print(stanza_zh_sent)
[('林', '林'), ('行止', '行止'), ('表示', '表示'), ('貿易', '貿易'), ('戰', '戰'), ('促使', '促使'), ('在', '在'), ('中國', '中國'), ('的', '的'), ('工廠', '工廠'), ('搬遷', '搬遷'), ('到', '到'), ('越南', '越南'), ('、', '、'), ('寮國', '寮國'), ('、', '、'), ('印度', '印度'), ('、', '、'), ('台灣', '台灣'), ('甚至是', '甚至是'), ('馬', '馬'), ('來西', '來西'), ('亞', '亞')]
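Notice that for this Chinese sentence every lemma is identical to its surface form, since Chinese has essentially no inflectional morphology to strip. A quick sanity check over the document above confirms this:

# For this document, every word's lemma equals its surface form
print(all(word.text == word.lemma
          for sent in zh_doc.sentences
          for word in sent.words))  # True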

spaCy
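The lemma of each spaCy token is available as its lemma_ attribute. The cells below assume the spaCy English pipeline spacy_en_nlp has already been loaded; if it has not, a minimal way to build one (assuming the en_core_web_sm model has been downloaded) is:

import spacy

# Assumes `python -m spacy download en_core_web_sm` has been run beforehand
spacy_en_nlp = spacy.load("en_core_web_sm")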

en_str = "I don't like him."
en_doc = spacy_en_nlp(en_str)
spacy_en_sent = [(w.text, w.lemma_) for w in en_doc]
print(spacy_en_sent)
[('I', 'I'), ('do', 'do'), ("n't", 'not'), ('like', 'like'), ('him', 'he'), ('.', '.')]
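A note on the attribute name: spaCy exposes most string-valued annotations twice, once as an integer hash (lemma) and once as the readable string with a trailing underscore (lemma_). For example:

tok = en_doc[4]     # the token "him"
print(tok.lemma)    # an integer hash into spaCy's string store
print(tok.lemma_)   # 'he'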

Named Entity Recognition

Vanilla tokenization algorithms don’t always handle multi-word expressions (MWEs) well, for example idioms like “kick the bucket” or multi-word names. Names are somewhat easier to identify than idioms, since they are typically contiguous and, in English, usually capitalized.

We can use the spacy package to extract entities from a text string as follows:

en_str = '''
This is an example string. Harry Potter is a name of an individual. Ron Weasley and Draco Malfoy are his school-mates.
Voldemort is another individual who has an army of Death Eaters.
He belongs to the Hogwart school. Nobody actually know if the school actually exists. But, nevertheless, people love the concept.
Do you think the NER-System, Spacy, will capture all the named entities in this string correctly? The New York Times certainly does.
'''

en_doc = spacy_en_nlp(en_str)
en_doc.ents
(Harry Potter,
 Ron Weasley,
 Draco Malfoy,
 Voldemort,
 Death Eaters,
 NER-System,
 The New York Times)

As we can see above, the NER system finds most of the named entities in the text, though not all of them: it misses Hogwart and Spacy, and it picks out NER-System, which is arguably not a named entity at all.
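Each entry in en_doc.ents is a Span object, and its predicted entity type is available as the label_ attribute (e.g. PERSON or ORG). A quick way to inspect both the surface string and the type:

for ent in en_doc.ents:
    # ent.text is the surface string, ent.label_ is the predicted entity type
    print(ent.text, ent.label_)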