One of the most common tasks in Natural Language Processing (NLP) is to clean text data. In order to maximize your results, it’s important to distill your text to the most important root words in the corpus and clean out unwanted noise. This post will show how I typically accomplish this. The following are the general steps in text preprocessing:

- Removing HTML or other markup
- Making all text lowercase
- Removing punctuation, line breaks, and numbers
- Removing stop words
- Stemming or lemmatizing the remaining words
The following is a script that I’ve been using to clean a majority of my text data.
import pandas as pd
import re
import string
from bs4 import BeautifulSoup
import nltk
from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import spacy

# One-time setup: download the NLTK data used below
# nltk.download('stopwords')
# nltk.download('wordnet')
# and install the spaCy model: python -m spacy download en_core_web_sm
Removing HTML is optional, depending on what your data source is. I’ve found Beautiful Soup is the best way to clean this versus RegEx.
def clean_html(html):
    # Parse the HTML content
    soup = BeautifulSoup(html, "html.parser")
    # Remove the unwanted tags and their contents
    for data in soup(['style', 'script', 'code', 'a']):
        data.decompose()
    # Return the remaining text content as a single string
    return ' '.join(soup.stripped_strings)
Note: in the for loop you can specify different HTML tags you wish to clean. For example, the step above includes the style, script, code, and a tags. Play around and augment this list until you get your desired results.
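To see it in action, here is a quick check on a small snippet (the sample HTML below is my own, not from the post’s data):

# Quick check on a made-up HTML snippet
sample_html = '<p>Hello <b>world</b></p><script>var x = 1;</script>'
print(clean_html(sample_html))  # -> 'Hello world'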
Now the workhorse. The clean_string function below handles the remaining steps:

- Make everything lowercase: A != a to a computer.
- Remove line breaks and punctuation.
- Remove stop words using the NLTK library. There is a list in the function to add additional stop words as needed; these might be noisy domain words or anything else that clarifies the context.
- Remove numbers.
- Stem or lemmatize the remaining words by passing Stem, Lem, or Spacy as the stem argument. The default is to use none.

# Load spaCy
nlp = spacy.load('en_core_web_sm')
def clean_string(text, stem="None"):
    final_string = ""

    # Make lowercase
    text = text.lower()

    # Remove line breaks
    # Note: this line can be augmented and reused to replace
    # any character pattern with nothing or a space
    text = re.sub(r'\n', '', text)

    # Remove punctuation
    translator = str.maketrans('', '', string.punctuation)
    text = text.translate(translator)

    # Remove stop words
    text = text.split()
    useless_words = nltk.corpus.stopwords.words("english")
    useless_words = useless_words + ['hi', 'im']
    text_filtered = [word for word in text if word not in useless_words]

    # Remove numbers
    text_filtered = [re.sub(r'\w*\d\w*', '', w) for w in text_filtered]
    # Drop the empty strings left behind by the number removal
    text_filtered = [w for w in text_filtered if w]

    # Stem or lemmatize
    if stem == 'Stem':
        stemmer = PorterStemmer()
        text_stemmed = [stemmer.stem(y) for y in text_filtered]
    elif stem == 'Lem':
        lem = WordNetLemmatizer()
        text_stemmed = [lem.lemmatize(y) for y in text_filtered]
    elif stem == 'Spacy':
        text_filtered = nlp(' '.join(text_filtered))
        text_stemmed = [y.lemma_ for y in text_filtered]
    else:
        text_stemmed = text_filtered

    final_string = ' '.join(text_stemmed)

    return final_string
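Before wiring this into a data frame, a quick sanity check on a made-up sentence shows the default and stemmed modes:

# Quick sanity check with a made-up sample sentence
sample = "Hi, I'm downloading 2 files from the tutorial!"
print(clean_string(sample))               # -> 'downloading files tutorial'
print(clean_string(sample, stem='Stem'))  # -> 'download file tutori'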
To apply this to a standard data frame, use the apply function from Pandas, as shown below. Let's take a look at the starting text:
<p>
<a
href="https://forge.autodesk.com/#step-6-download-the-item"
rel="nofollow noreferrer"
>https://forge.autodesk.com/en//#step-6-download-the-item</a
>
</p>
\n\n
<p>
I have followed the tutorial and have successfully obtained the contents of
the file, but where is the file being downloaded. In addition, how do I
specify the location of where I want to download the file?
</p>
\n\n
<p>
Result on Postman\n<a
href="https://i.stack.imgur.com/VrdqP.png"
rel="nofollow noreferrer"
><img
src="https://i.stack.imgur.com/VrdqP.png"
alt="enter image description here"
/></a>
</p>
Let’s start by cleaning the HTML.
# Remove the HTML first by applying the function directly to the source text column
df['body'] = df['body'].apply(lambda x: clean_html(x))
After applying the function to clean the HTML, this is the result. Pretty impressive:
I have followed the tutorial and have successfully obtained the contents
of the file, but where is the file being downloaded. In addition, how
do I specify the location of where I want to download the file? Result
on Postman
Next, let’s apply the clean_string function.
# Next apply the clean_string function to the text
df['body_clean'] = df['body'].apply(lambda x: clean_string(x, stem='Stem'))
And the final resulting text:
follow tutori success obtain content file file download addit
specifi locat want download file result postman
Fully clean and ready to use in your NLP project. Notice that the length is greatly reduced from stop word removal and that the words are stemmed to their root forms.
Note: I often create a new column like body_clean above, so I preserve the original in case punctuation is needed later.
And that’s about it. The order of the steps in the function does matter: certain steps should be completed before others, such as making everything lowercase first. The function contains one RegEx example, for removing numbers, and it’s a solid utility pattern that you can adjust to remove other items from the text, as shown below.
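For instance, the same substitution style can strip URLs as well; here is a hypothetical tweak (the pattern is my own addition, not part of the function above):

# Hypothetical addition: remove URLs with the same re.sub approach
text = re.sub(r'http\S+', '', text)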
The above function contains two different ways to lemmatize your text. The NLTK WordNetLemmatizer takes a Part of Speech (POS) argument (noun, verb, and so on) that defaults to noun, so it either requires multiple passes to cover each POS or will only capture one. The alternative is spaCy, which automatically determines each word's POS and lemmatizes it accordingly. The trade-off is that spaCy's performance will be significantly slower than NLTK's.
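To make the difference concrete, here is a minimal sketch reusing the imports and the nlp model loaded above (the sample words are my own):

# NLTK: the POS argument defaults to noun, so verbs need it set explicitly
lem = WordNetLemmatizer()
print(lem.lemmatize('running'))       # -> 'running' (treated as a noun)
print(lem.lemmatize('running', 'v'))  # -> 'run' (treated as a verb)

# spaCy: infers the POS of each token automatically
doc = nlp('I was running')
print([token.lemma_ for token in doc])  # e.g. ['I', 'be', 'run']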