Placeholders (PII masking) and tokenization

Kendra Chalkley

Aug 22, 2020, 8:25:59 PM
to nltk-users
Hi! Puzzle for you: 

One of my preprocessing steps is to replace PII with <placeholders>.
"My name is Kendra and I live in Portland, OR" --> "My name is <person> and I live in <location>"
The next step is tokenization, which (as coded by a coworker who has since left the company) uses NLTKWordTokenizer.tokenize().
--> [my, name, is, <, person, >, and, I, live, in, <, location, >]
That tokenizer, of course, separates my very useful '<>' delimiters from the entity names they surround, which will affect later steps that use word embeddings.

Are there special characters I can (tell the NLTK tokenizer to) use to delimit my placeholders, ones that won't be split off and will keep these tokens OOV to my word embeddings?

Thanks for your help!
-Kendra

P.S. Hi all! I just found this group this morning. If you know of any large Slack channels or other forums where these sorts of NLP-oriented questions come up, I would also be interested in joining those.

sujitpal

Aug 23, 2020, 10:45:14 AM
to nltk-users
I think if you replace the "<" and ">" with "_" it might work for you. I tried this snippet below and it looks like it does what you need.

>>> import nltk
>>> nltk.word_tokenize("My name is _person_ and I live in _location_, _location_ .")
['My', 'name', 'is', '_person_', 'and', 'I', 'live', 'in', '_location_', ',', '_location_', '.']
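
If the masking step can be changed as well, it could emit the underscore-delimited placeholders directly. A minimal sketch, assuming the entity spans come from whatever upstream NER step you already have (the (start, end, label) span format here is hypothetical):

>>> def underscore_mask(text, spans):
...     # Replace each (start, end, label) span with _label_, working
...     # right to left so earlier offsets stay valid.
...     for start, end, label in sorted(spans, reverse=True):
...         text = text[:start] + "_" + label + "_" + text[end:]
...     return text
...
>>> underscore_mask("My name is Kendra and I live in Portland, OR",
...                 [(11, 17, "person"), (32, 44, "location")])
'My name is _person_ and I live in _location_'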

-sujit



Selenia Anastasi

Aug 28, 2020, 12:33:53 PM
to nltk-...@googlegroups.com
I think you should use regular expressions to build a pattern that recognizes '<>' at the beginning and end of each word. Alternatively, you can try the NLP library spaCy and see if it already has a built-in function for this.
Anyway, regular expressions are the best way to tokenize your text, though they can sometimes be a little tricky. You can find a lot of solutions online already set up for specific purposes.
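
A bare-regex sketch of that first idea, using only the standard library (the pattern below treats a whole '<tag>' as a single token; it is one plausible pattern, not tuned to any particular corpus):

>>> import re
>>> placeholder_pattern = re.compile(r"<\w+>|\w+|[^\w\s]")
>>> placeholder_pattern.findall("My name is <person> and I live in <location>.")
['My', 'name', 'is', '<person>', 'and', 'I', 'live', 'in', '<location>', '.']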
 


Alexis

Oct 5, 2020, 4:17:29 PM
to nltk-users
You can use a regex to define what counts as a "word", as Selenia suggested. But you don't have to roll your own tokenizer: try the following, and adjust to taste:

    >>> import nltk
    >>> masked_tokenizer = nltk.tokenize.RegexpTokenizer(r"[\w<>]+|[^\w\s]+")
    >>> sample = "My name is <person>, and I live in <location>. (Etc.)"
    >>> masked_tokenizer.tokenize(sample)
    ['My', 'name', 'is', '<person>', ',', 'and', 'I', 'live', 'in', '<location>', '.', '(', 'Etc', '.)']
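
One caveat with `[\w<>]+`: it also glues stray '<' or '>' characters from ordinary text onto adjacent word characters. If masked placeholders are the only angle brackets you expect, that's fine; otherwise, a stricter pattern that only keeps whole <tag> sequences together may be safer. A variant sketch of the same idea:

    >>> strict_tokenizer = nltk.tokenize.RegexpTokenizer(r"<\w+>|\w+|[^\w\s]+")
    >>> strict_tokenizer.tokenize("3<5, says <person>")
    ['3', '<', '5', ',', 'says', '<person>']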

You can even get cute and use your new tokenizer like this:

    >>> anon_tokenize = masked_tokenizer.tokenize
    >>> anon_tokenize(sample)
    ['My', 'name', 'is', '<person>', ',', 'and', 'I', 'live', 'in', '<location>', '.', '(', 'Etc', '.)']

Alexis