Tokenization

Tokenization is the task of segmenting a string into tokens. A token often corresponds to a word (in the context of computational linguistics), though it may correspond to other kinds of strings, such as punctuation.

Formally, we say that a tokenizer maps a string (in \(\Sigma^*\)) into a string of strings (in \(\Sigma^{**}\)).

\[\text{tokenize}: \Sigma^* \rightarrow \Sigma^{**}\]

We can generally do tokenization reasonably well in English by finding substrings that have white space on either side. We could do this in Python using str.split(), though a regular expression will turn out to be more useful.

import re

news = ("Lin Xingzhi said that the trade war prompted the relocation of "
        "factories in China to Vietnam, Laos, India, Taiwan and even Malaysia.")

re.findall("(?:(.*?)(?:\s|$))+?", news)
['Lin',
 'Xingzhi',
 'said',
 'that',
 'the',
 'trade',
 'war',
 'prompted',
 'the',
 'relocation',
 'of',
 'factories',
 'in',
 'China',
 'to',
 'Vietnam,',
 'Laos,',
 'India,',
 'Taiwan',
 'and',
 'even',
 'Malaysia.',
 '']

Note, furthermore, that this implementation of \(\text{tokenize}\) is invertible:

\[\text{tokenize}^{-1}: \Sigma^{**} \rightarrow \Sigma^*\]

That is, we can recover the original string from its tokenization, in this case simply by joining the tokens with whitespace. Not all implementations of \(\text{tokenize}\), viewed as functions from strings to strings of strings, are invertible.
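
A minimal sketch of this round trip, using str.split() (which yields the same whitespace-delimited tokens as the regex above, minus the trailing empty string) and assuming the original text contains only single spaces with no leading or trailing whitespace:

tokens = news.split()
print(" ".join(tokens) == news)  # True: joining the tokens with spaces recovers the original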

Note that, even for a short piece of text like this, we’re missing certain things we might want to capture:

  1. In the case of the name Lin Xingzhi, we are probably oversegmenting. Really, we probably want it to be a single token.
  2. In the case of the list of country names, some country names have punctuation on the end (Vietnam,, Laos,, India,, Malaysia.), while others don’t (Taiwan).

To handle these, we might try to introduce additional tokenization rules by augmenting our regular expression.

# this regular expression won't work for general comma-separated lists
re.findall("(?:([^A-Z].*?|(?:[A-Z].*?(?:\s?))+)(?:,?\s|\.$))+?", news)
['Lin Xingzhi',
 'said',
 'that',
 'the',
 'trade',
 'war',
 'prompted',
 'the',
 'relocation',
 'of',
 'factories',
 'in',
 'China',
 'to',
 'Vietnam',
 'Laos',
 'India',
 'Taiwan',
 'and',
 'even',
 'Malaysia']

This works reasonably well in this case, but we are actually throwing out the punctuation, whereas a tokenizer will generally retain it as its own token. That means we cannot reconstruct the original string from its tokenization. And even if we were to augment our tokenizer to retain the punctuation as separate tokens…

# this regular expression won't work for general comma-separated lists
re.findall("(?:((?<=\s)[^A-Z,\.][^,\.]*?|(?:[A-Z][^,\.]*?(?:\s?))+|[,\.])(?=,?\s|\.$))+?", news)
['Lin Xingzhi',
 'said',
 'that',
 'the',
 'trade',
 'war',
 'prompted',
 'the',
 'relocation',
 'of',
 'factories',
 'in',
 'China',
 'to',
 'Vietnam',
 ',',
 'Laos',
 ',',
 'India',
 ',',
 'Taiwan',
 'and',
 'even',
 'Malaysia']

…we wouldn’t be able to unambiguously reconstruct the string, because we couldn’t be sure whether whitespace should come before or after a punctuation mark.1 (We could devise heuristics based on our knowledge of English orthographic conventions to get close, but we couldn’t know for sure.)

Thus, if we want to be able to reconstruct the original string (and we don’t always need to), \(\text{tokenize}\) needs to produce additional information: usually some form of character offset, which tells us the beginning and ending points of each token relative to the beginning of the original string.

\[\text{tokenization-offset}: \Sigma^* \rightarrow (\mathbb{N} \times \mathbb{N})^{*}\]

We could then use \(\text{tokenization-offset}\) to produce a tokenization by indexing into the original string. (These offsets either need to cover the entire input string, or we need to know what characters to default to for any spans of the original string we didn’t include in the tokenization, such as whitespace characters.)
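
As a minimal sketch of this idea (not from the original notes), we can use re.finditer to record the (start, end) offsets of the non-whitespace spans and then index back into the original string:

# record the (start, end) character offsets of each non-whitespace span
offsets = [match.span() for match in re.finditer(r"\S+", news)]

# recover the tokens by indexing into the original string
tokens = [news[start:end] for start, end in offsets]

# here we simply assume that every span not covered by an offset is whitespace
print(offsets[:3], tokens[:3])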

Ignoring the reconstruction issue, though, we will see that the regular expression above won’t work in general. For instance, our current regular expression doesn’t handle contractions ('ll, 'd, n't, etc.), which we generally want to treat as separate tokens (since they are semantically equivalent to full words: will, would, not, etc.), and it won’t handle other kinds of punctuation (e.g. -), which could be word-internal or its own token. It also doesn’t handle names (and more generally, multiword expressions) that don’t fit a very specific format involving title-casing.
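
Just to illustrate how quickly this gets unwieldy, here is a hypothetical patch (not part of the regexes above) that adds a single hand-written rule for the n't contraction; every further case (other contractions, word-internal hyphens, multiword names) would need yet another rule:

# split off "n't" as its own token; everything else is a word or a single punctuation mark
print(re.findall(r"\w+(?=n't)|n't|\w+|[^\w\s]", "don't you think?"))
# expected: ['do', "n't", 'you', 'think', '?']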

These issues get much more difficult when we move outside of English. For instance, consider the following Chinese text from this article.


news = "林行止表示貿易戰促使在中國的工廠搬遷到越南、寮國、印度、台灣甚至是馬來西亞"

Google Translate translates the article as follows:

Lin Xingzhi said that the trade war prompted the relocation of factories in China to Vietnam, Laos, India, Taiwan and even Malaysia.

If you didn’t know anything about Chinese, the lack of white space here might make you think that it is highly synthetic; but it is the opposite: Chinese is highly analytic. Either way, even a really fancy regular expression won’t work here.

We could of course try splitting every character apart.

list(news)
['林',
 '行',
 '止',
 '表',
 '示',
 '貿',
 '易',
 '戰',
 '促',
 '使',
 '在',
 '中',
 '國',
 '的',
 '工',
 '廠',
 '搬',
 '遷',
 '到',
 '越',
 '南',
 '、',
 '寮',
 '國',
 '、',
 '印',
 '度',
 '、',
 '台',
 '灣',
 '甚',
 '至',
 '是',
 '馬',
 '來',
 '西',
 '亞']

This doesn’t work, for reasons similar to the English case. Just like in English, words in Chinese can be multiple characters long. For instance, 林行止 (Lin Xingzhi) and 馬來西亞 (Malaysia) are names. But even worse, most characters can stand on their own as words as well. For instance, 林行止 separated by character would be the nonsense phrase “forest walk halt”, and 馬來西亞 separated by character would be “horse come to Western Asia”.

The more general point is that good tokenization requires knowing the context of a token. This is true even in English: consider the string New York. In the context of the sentence I live in Rochester, New York, we probably want New York to be a token. But in the context of I read the New York Times this morning, we probably don’t: we want New York Times instead.

Again, we might be able to devise heuristics for handling this case (like taking the maximal string), but they still won’t handle multiword expressions like run the gamut, where we generally don’t have orthographic conventions like title-casing to help us out: should we tokenize this as a single token run the gamut, or as run, the, gamut?

So what should we do? For the purposes of this course, our answer is going to be “use existing packages”, largely because actually solving the tokenization problem, even in English, turns out to be really hard, and our best systems require fairly heavy machine learning-based approaches (covered in LING281/481 and LING282/482).

I’ll discuss two of the most popular state-of-the-art packages.

You may have also run across nltk. I strongly advise against using nltk for any of the text normalization we’re going to cover today. The packages I’ll discuss are so much better that it’s not even worth considering. (nltk can still be useful for loading particular kinds of resources like wordnet, and we will use it in Assignment 6.)

Stanza

To use stanza for a particular language, we first need to download the relevant language-specific models.

import stanza

# Download an English model
stanza.download('en')
  
# Similarly, download a Chinese model (note in the log below that 'zh' is an alias for simplified Chinese)
stanza.download('zh')
INFO:stanza:Downloading default packages for language: en (English) ...
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.
INFO:stanza:"zh" is an alias for "zh-hans"
INFO:stanza:Downloading default packages for language: zh-hans (Simplified_Chinese) ...
INFO:stanza:Finished downloading models and saved to /root/stanza_resources.

(Source: https://bit.ly/2kb8eJU)

We process text using stanza.Pipelines, which contain different Processor units. The pipeline is language-specific, so again you’ll need to first specify the language. By default, Pipelines run a bunch of processors, including ones we’re not going to be looking at today. For now, we’ll specify that we just want the tokenize processor.2

# Build an English pipeline with just the tokenize processor, forcing it to use the CPU
stanza_en_nlp = stanza.Pipeline(lang='en', processors = 'tokenize', use_gpu=False)

# Build a Chinese pipeline, likewise with just the tokenize processor, forcing it to use the CPU
stanza_zh_nlp = stanza.Pipeline(lang='zh', processors='tokenize', use_gpu=False)
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
INFO:stanza:Loading these models for language: en (English):
========================
| Processor | Package  |
------------------------
| tokenize  | combined |
========================

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Done loading processors!
INFO:stanza:Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
INFO:stanza:"zh" is an alias for "zh-hans"
INFO:stanza:Loading these models for language: zh-hans (Simplified_Chinese):
=======================
| Processor | Package |
-----------------------
| tokenize  | gsdsimp |
=======================

INFO:stanza:Use device: cpu
INFO:stanza:Loading: tokenize
INFO:stanza:Done loading processors!

Creating document objects

en_str = "Homelessness is to be found in the large numbers of mentally ill and substance-abusing people in the homeless population, don't you think?"
zh_str = "林行止表示貿易戰促使在中國的工廠搬遷到越南、寮國、印度、台灣甚至是馬來西亞"
## Our earlier regex approach
import re
print(re.findall('[^ ]+', en_str))  ## tokenize by splitting on whitespace
['Homelessness', 'is', 'to', 'be', 'found', 'in', 'the', 'large', 'numbers', 'of', 'mentally', 'ill', 'and', 'substance-abusing', 'people', 'in', 'the', 'homeless', 'population,', "don't", 'you', 'think?']
# Processing English text
en_doc = stanza_en_nlp(en_str)
print(type(en_doc))

# Processing Chinese text
zh_doc = stanza_zh_nlp(zh_str)
print(type(zh_doc))
<class 'stanza.models.common.doc.Document'>
<class 'stanza.models.common.doc.Document'>
for sent in en_doc.sentences:
    stanza_en_sent = [word.text for word in sent.words]
    print(stanza_en_sent)
['Homelessness', 'is', 'to', 'be', 'found', 'in', 'the', 'large', 'numbers', 'of', 'mentally', 'ill', 'and', 'substance', '-', 'abusing', 'people', 'in', 'the', 'homeless', 'population', ',', 'do', "n't", 'you', 'think', '?']
for sent in zh_doc.sentences:
    stanza_zh_sent = [word.text for word in sent.words]
    print(stanza_zh_sent)
['林', '行止', '表示', '貿易', '戰', '促使', '在', '中國', '的', '工廠', '搬遷', '到', '越南', '、', '寮國', '、', '印度', '、', '台灣', '甚至是', '馬', '來西', '亞']
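
As an aside, stanza also exposes character offsets for each token, which connects back to the \(\text{tokenization-offset}\) idea above. A small sketch, assuming a recent version of stanza in which Token objects carry start_char and end_char attributes:

# print each English token along with its character offsets into en_str
for sent in en_doc.sentences:
    for token in sent.tokens:
        print(token.text, token.start_char, token.end_char)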

Spacy

As with stanza, when using spacy, we need to download language-specific models.

import spacy

!python -m spacy download en_core_web_sm
!python -m spacy download zh_core_web_sm
!python -m spacy download it_core_news_sm
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-sm==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Collecting zh-core-web-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/zh_core_web_sm-3.4.0/zh_core_web_sm-3.4.0-py3-none-any.whl (48.4 MB)
Collecting spacy-pkuseg<0.1.0,>=0.0.27
  Downloading spacy_pkuseg-0.0.32-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB)
Installing collected packages: spacy-pkuseg, zh-core-web-sm
Successfully installed spacy-pkuseg-0.0.32 zh-core-web-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('zh_core_web_sm')
Collecting it-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/it_core_news_sm-3.4.0/it_core_news_sm-3.4.0-py3-none-any.whl (13.0 MB)
Installing collected packages: it-core-news-sm
Successfully installed it-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('it_core_news_sm')

We then need to load the models.

spacy_en_nlp = spacy.load("en_core_web_sm") #disable=["tagger", "parser", "ner"]
spacy_zh_nlp = spacy.load("zh_core_web_sm") #disable=["tagger", "parser", "ner"]

Creating document objects

en_doc = spacy_en_nlp(en_str)
spacy_en_sent = [w.text for w in en_doc]
print(spacy_en_sent)
['Homelessness', 'is', 'to', 'be', 'found', 'in', 'the', 'large', 'numbers', 'of', 'mentally', 'ill', 'and', 'substance', '-', 'abusing', 'people', 'in', 'the', 'homeless', 'population', ',', 'do', "n't", 'you', 'think', '?']
zh_doc = spacy_en_nlp(zh_str)
spacy_zh_sent = [w.text for w in zh_doc]
print(spacy_zh_sent)
['林行止表示貿易戰促使在中國的工廠搬遷到越南、寮國、印度、台灣甚至是馬來西亞']
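
Note that the cell above ran the English pipeline (spacy_en_nlp) on the Chinese string; since the English tokenizer relies on whitespace and punctuation, the entire string comes back as a single token. To actually segment the Chinese text, we should run the Chinese pipeline we loaded earlier (output not shown):

# use the Chinese pipeline to word-segment the Chinese string
zh_doc = spacy_zh_nlp(zh_str)
spacy_zh_sent = [w.text for w in zh_doc]
print(spacy_zh_sent)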

Footnotes

  1. The regular expression I use here has some symbols we didn’t cover in the Regular Expressions submodule: ?= and ?<=. These are the lookahead and lookbehind operators, which have negative variants ?! and ?<!, respectively. A pattern wrapped in (?=...) or (?<=...) must match, but matching it does not consume any characters. One way to think about this is in terms of a pointer into the string: the matching process moves a pointer through the string, with everything behind the pointer already matched and everything ahead of it still to be examined. The lookahead operator says “match my pattern against whatever comes right after the pointer, but don’t move the pointer forward to do so”, while the lookbehind operator says “match my pattern against whatever comes right before the pointer, but don’t move the pointer backward to do so”. The negative versions say the opposite: “you should not be able to match my pattern here, and checking that doesn’t move the pointer either.”↩︎
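
     For a small illustration (not from the original notes), here is a lookahead that matches a word only when a comma immediately follows, without consuming the comma:

     re.findall(r"\w+(?=,)", "Vietnam, Laos, India")  # returns ['Vietnam', 'Laos']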

  2. We also need to specify whether we want to use a GPU, since the models in stanza are deep learning models and can benefit from one. We disable GPU use here by setting use_gpu=False.↩︎