Posted By Jakub Nowacki, 18 September 2017
Pandas is a great tool for the analysis of tabular data via its DataFrame interface. Slightly less known are its capabilities for working with text data. In this post I'll present them on some simple examples. As a comparison I'll use my previous post about TF-IDF in Spark.
First we load the data, i.e. books previously downloaded from the Project Gutenberg site. Here we use the handy `glob` module to walk over text files in a directory and read the files line by line into a DataFrame.
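A minimal sketch of this loading step, assuming each book is stored as a `<name>.txt` file in a single directory. The helper name `load_books` and the `data/*.txt` pattern are hypothetical and not taken from the original notebook.

```python
import glob
import os

import pandas as pd


def load_books(pattern):
    """Read every text file matching `pattern` into one DataFrame of lines."""
    rows = []
    for path in sorted(glob.glob(pattern)):
        # Use the file name without its extension as the book name.
        book = os.path.splitext(os.path.basename(path))[0]
        with open(path, encoding="utf-8") as f:
            for line in f:
                rows.append((book, line))
    return pd.DataFrame(rows, columns=["book", "lines"])


# Hypothetical location of the downloaded files; an empty DataFrame
# is returned when nothing matches.
books = load_books("data/*.txt")
```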
| | book | lines |
|---|---|---|
| 0 | dracula | The Project Gutenberg EBook of Dracula, by Br... |
| 1 | dracula | \n |
| 2 | dracula | This eBook is for the use of anyone anywhere a... |
| 3 | dracula | almost no restrictions whatsoever. You may co... |
| 4 | dracula | re-use it under the terms of the Project Guten... |
Since we're now in Pandas, we can easily use its plotting capability to look at the number of lines in the books. Note that `value_counts` of books creates a `Series` indexed by the book names; thus, we don't need to set an index for plotting. Note that we can now choose the plot styling; see the demo page for available styles.
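The counting step can be sketched as follows on a toy stand-in for the `books` DataFrame. The helper `plot_line_counts` and the `ggplot` style name are assumptions made for illustration; plotting needs matplotlib, which is why it is imported only inside the helper.

```python
import pandas as pd

# Toy stand-in for the `books` DataFrame (one row per line of text).
books = pd.DataFrame({
    "book": ["dracula", "dracula", "dracula", "frankenstein"],
    "lines": ["a\n", "b\n", "c\n", "d\n"],
})

# value_counts returns a Series indexed by book name, so no index
# has to be set before plotting.
line_counts = books["book"].value_counts()


def plot_line_counts(counts, style="ggplot"):
    """Draw a bar chart of lines per book (requires matplotlib)."""
    import matplotlib.pyplot as plt

    plt.style.use(style)  # the style name is an assumption
    return counts.plot.bar()
```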
Now, we need to tokenize the sentences into words, aka terms. While we could do it in a loop, we can take advantage of the split function in the text toolkit for Pandas' Series; see this manual for all the functions. To get it we just invoke the `split` function, which is a part of `str`, i.e. the `StringMethods` object. The argument is a regular expression pattern on which the string, in our case the sentence, will be split. In our case it is the set of everything that is not a letter (capital or not) or a digit, plus the underscore. This is because `\W` is the reverse of `\w`, which is the set `[A-Za-z0-9_]`, i.e. it contains an underscore. Since we also want to split on underscores, we had to add the underscore to the splitting set. Note that this decision, and others you'll make while choosing the splitting set, will influence the tokenization and thus the result. Hence, select the splitting pattern carefully. If you're not sure about your decision, use a practice page like regex101.com to check your patterns.
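A minimal sketch of the tokenization on toy data, assuming a pattern such as `[\W_]+` for the splitting set described above. Stripping the line first is also an assumption about the preprocessing. Note that in Python 3 `\w` matches non-ASCII letters as well for `str` patterns, so accented words stay in one piece.

```python
import pandas as pd

books = pd.DataFrame({
    "book": ["dracula", "dracula"],
    "lines": ["re-use it under the terms\n", "snake_case, too!\n"],
})

# [\W_]+ matches one or more characters that are either non-word
# characters or an underscore; the line is split on those runs.
# strip() removes the surrounding whitespace, e.g. the trailing newline.
books["words"] = books["lines"].str.strip().str.split(r"[\W_]+")
```

Punctuation at the very start or end of a line leaves an empty string in the resulting list, which is why a later cleaning step is needed.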
| | book | lines | words |
|---|---|---|---|
| 0 | dracula | The Project Gutenberg EBook of Dracula, by Br... | [, The, Project, Gutenberg, EBook, of, Dracula... |
| 1 | dracula | \n | [] |
| 2 | dracula | This eBook is for the use of anyone anywhere a... | [This, eBook, is, for, the, use, of, anyone, a... |
| 3 | dracula | almost no restrictions whatsoever. You may co... | [almost, no, restrictions, whatsoever, You, ma... |
| 4 | dracula | re-use it under the terms of the Project Guten... | [re, use, it, under, the, terms, of, the, Proj... |
As a result we get a new column with an array of words. Now we need to flatten the resulting DataFrame somehow. The best approach is to use iterators and create a new DataFrame with words; see this StackOverflow post for the performance details of this solution. Note that this may take a few seconds depending on the machine; on mine it takes about 25s.
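One way to write the iterator-based flattening is sketched below on toy data. It is an illustration of the idea, not necessarily the exact code from the notebook.

```python
import pandas as pd

books = pd.DataFrame({
    "book": ["dracula", "frankenstein"],
    "words": [["", "The", "Project"], ["You", "will"]],
})

# Emit one (book, word) pair per token, then build the new DataFrame
# from the collected rows in a single step.
rows = [
    (row.book, word)
    for row in books[["book", "words"]].itertuples(index=False)
    for word in row.words
]
words = pd.DataFrame(rows, columns=["book", "word"])
```

On pandas 0.25 or newer, `books.explode("words")` gives a similar one-token-per-row result without the explicit loop.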
| | book | word |
|---|---|---|
| 0 | dracula | |
| 1 | dracula | The |
| 2 | dracula | Project |
| 3 | dracula | Gutenberg |
| 4 | dracula | EBook |
Now we have a DataFrame with words, but occasionally we get an empty string as a byproduct of the splitting. We should remove it, as it would otherwise be counted as a term. To do that we can use the function `len` as follows:
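A small sketch of this filter on toy data, using the `len` function of the string accessor:

```python
import pandas as pd

words = pd.DataFrame({
    "book": ["dracula"] * 4,
    "word": ["", "The", "Project", ""],
})

# Keep only the rows whose token has at least one character.
words = words[words["word"].str.len() > 0]
```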
| | book | word |
|---|---|---|
| 1 | dracula | The |
| 2 | dracula | Project |
| 3 | dracula | Gutenberg |
| 4 | dracula | EBook |
| 5 | dracula | of |
The new `words` DataFrame is now free of the empty strings.
If we want to calculate the TF-IDF statistic, we need to normalize the words by bringing them all to the same case. Again, we can use a `Series` string function for that.
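For instance, lower-casing with the string accessor could look like this on toy data:

```python
import pandas as pd

words = pd.DataFrame({
    "book": ["dracula", "dracula", "dracula"],
    "word": ["The", "Project", "EBook"],
})

# Bring every token to lower case so that "The" and "the" count as one term.
words["word"] = words["word"].str.lower()
```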
| | book | word |
|---|---|---|
| 1 | dracula | the |
| 2 | dracula | project |
| 3 | dracula | gutenberg |
| 4 | dracula | ebook |
| 5 | dracula | of |
First, let's calculate the counts of the terms per book. We can do it as follows:
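A sketch of the per-book counting on toy data. Naming the result `n_w` via `Series.rename` is a choice made here so that the snippet behaves the same across pandas versions; the notebook may rename the column differently.

```python
import pandas as pd

words = pd.DataFrame({
    "book": ["dracula", "dracula", "dracula", "frankenstein"],
    "word": ["the", "the", "count", "the"],
})

# Count each term within each book; the result is indexed by (book, word).
counts = (
    words.groupby("book")["word"]
    .value_counts()
    .rename("n_w")
    .to_frame()
)
```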
| book | word | n_w |
|---|---|---|
| dracula | the | 8093 |
| | and | 5976 |
| | i | 4847 |
| | to | 4745 |
| | of | 3748 |
Note that as a result we are getting a hierarchical index, aka MultiIndex; see the Pandas manual for more details.
With this index we can now plot the results in a much nicer way. E.g. we can get the 5 words with the largest counts per book and plot them as shown below. Note that we need to reset, i.e. remove, one level of the index, as it doubles when we use the `nlargest` function. I've made the process into a function, as I will reuse it further.
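A sketch of such a helper, here called `top_n_per_group` (a hypothetical name). It only selects the rows; the bar chart itself is left as a comment because it requires matplotlib.

```python
import pandas as pd


def top_n_per_group(series, n=5, level=0):
    """Return the n largest values within each group of an index level."""
    return (
        series.groupby(level=level)
        .nlargest(n)
        # nlargest prepends the group key, so that level appears twice;
        # drop the extra copy.
        .reset_index(level=level, drop=True)
    )


index = pd.MultiIndex.from_tuples(
    [("dracula", "the"), ("dracula", "and"), ("dracula", "i"),
     ("frankenstein", "the"), ("frankenstein", "and")],
    names=["book", "word"],
)
n_w = pd.Series([8093, 5976, 4847, 4371, 3046], index=index, name="n_w")

top = top_n_per_group(n_w, n=2)
# top.plot.bar() would draw the chart.
```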
| book | word | n_w |
|---|---|---|
| dracula | the | 8093 |
| | and | 5976 |
| | i | 4847 |
| | to | 4745 |
| | of | 3748 |
| frankenstein | the | 4371 |
| | and | 3046 |
| | i | 2850 |
| | of | 2760 |
| | to | 2174 |
| grimms_fairy_tales | the | 7224 |
| | and | 5551 |
| | to | 2749 |
| | he | 2096 |
| | a | 1978 |
| moby_dick | the | 14718 |
| | of | 6743 |
| | and | 6518 |
| | a | 4807 |
| | to | 4707 |
| tom_sawyer | the | 3973 |
| | and | 3193 |
| | a | 1955 |
| | to | 1807 |
| | of | 1585 |
| war_and_peace | the | 34725 |
| | and | 22307 |
| | to | 16755 |
| | of | 15008 |
| | a | 10584 |
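To finish the TF as shown in the previous post, we need the total number of words in each book. As a reminder, below is the equation for TF:

$$\mathrm{tf}(w, d) = \frac{n_w}{n_d}$$

where $n_w$ is the number of occurrences of the term $w$ in the book $d$ and $n_d$ is the total number of words in that book.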
Note that I’ve renamed the column as it inherits the name by default.
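A sketch of the per-book totals on toy counts: summing `n_w` over the first index level gives one row per book, and the column is renamed to `n_d` because it would otherwise keep the name `n_w`.

```python
import pandas as pd

counts = pd.DataFrame(
    {"n_w": [3, 1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("dracula", "the"), ("dracula", "and"), ("frankenstein", "the")],
        names=["book", "word"],
    ),
)

# Total number of words per book.
word_sum = counts.groupby(level=0).sum().rename(columns={"n_w": "n_d"})
```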
| book | n_d |
|---|---|
| dracula | 166916 |
| frankenstein | 78475 |
| grimms_fairy_tales | 105336 |
| moby_dick | 222630 |
| tom_sawyer | 77612 |
| war_and_peace | 576627 |
Now we need to join both DataFrames on book to get the total number of words per book into the final DataFrame. Having both `n_w` and `n_d`, we can easily calculate TF as follows:
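A sketch of the join and the TF column on toy data. The join relies on the index name `book` of the totals matching a level name of the MultiIndex.

```python
import pandas as pd

counts = pd.DataFrame(
    {"n_w": [3, 1, 2]},
    index=pd.MultiIndex.from_tuples(
        [("dracula", "the"), ("dracula", "and"), ("frankenstein", "the")],
        names=["book", "word"],
    ),
)
word_sum = pd.DataFrame(
    {"n_d": [4, 2]},
    index=pd.Index(["dracula", "frankenstein"], name="book"),
)

# Attach the per-book total to every (book, word) row, then divide.
tf = counts.join(word_sum)
tf["tf"] = tf["n_w"] / tf["n_d"]
```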
| book | word | n_w | n_d | tf |
|---|---|---|---|---|
| dracula | the | 8093 | 166916 | 0.048485 |
| | and | 5976 | 166916 | 0.035802 |
| | i | 4847 | 166916 | 0.029039 |
| | to | 4745 | 166916 | 0.028427 |
| | of | 3748 | 166916 | 0.022454 |
Again, let's look at the top 5 words with respect to TF.
| book | word | tf |
|---|---|---|
| dracula | the | 0.048485 |
| | and | 0.035802 |
| | i | 0.029039 |
| | to | 0.028427 |
| | of | 0.022454 |
| frankenstein | the | 0.055699 |
| | and | 0.038815 |
| | i | 0.036317 |
| | of | 0.035170 |
| | to | 0.027703 |
| grimms_fairy_tales | the | 0.068581 |
| | and | 0.052698 |
| | to | 0.026097 |
| | he | 0.019898 |
| | a | 0.018778 |
| moby_dick | the | 0.066110 |
| | of | 0.030288 |
| | and | 0.029277 |
| | a | 0.021592 |
| | to | 0.021143 |
| tom_sawyer | the | 0.051191 |
| | and | 0.041141 |
| | a | 0.025189 |
| | to | 0.023282 |
| | of | 0.020422 |
| war_and_peace | the | 0.060221 |
| | and | 0.038685 |
| | to | 0.029057 |
| | of | 0.026027 |
| | a | 0.018355 |
As before (and as expected), we’ve got the English stop-words.
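Now we can do IDF; a reminder of the formula is given below:

$$\mathrm{idf}(w) = \ln\frac{c_d}{i_d}$$

where $c_d$ is the total number of documents (books) and $i_d$ is the number of documents in which the term $w$ appears.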
First we need to get the document count. The simplest solution is to use the `Series` method `nunique`, i.e. get the size of the set of unique elements in a series.
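On toy data this is a one-liner; the variable name `c_d` is an assumption used here for the document count.

```python
import pandas as pd

words = pd.DataFrame({
    "book": ["dracula", "dracula", "moby_dick"],
    "word": ["the", "helsing", "the"],
})

# Number of distinct books, i.e. documents in the corpus.
c_d = words["book"].nunique()
```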
Similarly, to get the number of unique books in which every term appears, we can use the same method, but on grouped data. Again we need to rename the column. Note that sorting the values is only for presentation and is not needed for further computations.
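A sketch of the grouped variant on toy data, producing one row per term with the number of books it occurs in:

```python
import pandas as pd

words = pd.DataFrame({
    "book": ["dracula", "dracula", "dracula", "moby_dick"],
    "word": ["the", "the", "helsing", "the"],
})

idf = (
    words.groupby("word")["book"]
    .nunique()                        # distinct books per term
    .to_frame()
    .rename(columns={"book": "i_d"})  # the column would otherwise be "book"
    .sort_values("i_d")               # for presentation only
)
```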
| word | i_d |
|---|---|
| laplandish | 1 |
| moluccas | 1 |
| molten | 1 |
| molliten | 1 |
| molière | 1 |
Having all the components of the IDF formula, we can now calculate it as a new column.
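A sketch of the new column, assuming the natural logarithm, which is consistent with the value 1.791759 (the natural logarithm of 6) shown for terms that appear in one of six books. The `i_d` values below are toy numbers.

```python
import numpy as np
import pandas as pd

c_d = 6  # number of books, as in this post
idf = pd.DataFrame(
    {"i_d": [1, 3, 6]},  # toy values
    index=pd.Index(["helsing", "whale", "the"], name="word"),
)

# Natural logarithm of (all documents / documents containing the term).
idf["idf"] = np.log(c_d / idf["i_d"])
```

A term that occurs in every book gets an IDF of zero, which is what later pushes the stop-words out of the ranking.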
| word | i_d | idf |
|---|---|---|
| laplandish | 1 | 1.791759 |
| moluccas | 1 | 1.791759 |
| molten | 1 | 1.791759 |
| molliten | 1 | 1.791759 |
| molière | 1 | 1.791759 |
To get the final DataFrame we need to join both DataFrames as follows:
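A sketch of this join on toy values: the IDF frame is indexed by `word`, which matches the second level of the TF frame's MultiIndex, so a plain `join` lines the rows up.

```python
import pandas as pd

tf = pd.DataFrame(
    {"tf": [0.5, 0.25, 0.6]},  # toy values
    index=pd.MultiIndex.from_tuples(
        [("dracula", "the"), ("dracula", "helsing"), ("moby_dick", "the")],
        names=["book", "word"],
    ),
)
idf = pd.DataFrame(
    {"idf": [0.0, 0.693147]},  # toy values
    index=pd.Index(["the", "helsing"], name="word"),
)

# Each (book, word) row receives the IDF of its word.
tf_idf = tf.join(idf)
```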
| book | word | n_w | n_d | tf | i_d | idf |
|---|---|---|---|---|---|---|
| dracula | the | 8093 | 166916 | 0.048485 | 6 | 0.0 |
| | and | 5976 | 166916 | 0.035802 | 6 | 0.0 |
| | i | 4847 | 166916 | 0.029039 | 6 | 0.0 |
| | to | 4745 | 166916 | 0.028427 | 6 | 0.0 |
| | of | 3748 | 166916 | 0.022454 | 6 | 0.0 |
Having now a DataFrame with both TF and IDF values, we can calculate the TF-IDF statistic.
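The statistic is simply the product of the two columns, sketched here on toy values:

```python
import pandas as pd

tf_idf = pd.DataFrame(
    {"tf": [0.5, 0.25], "idf": [0.0, 2.0]},  # toy values
    index=pd.MultiIndex.from_tuples(
        [("dracula", "the"), ("dracula", "helsing")],
        names=["book", "word"],
    ),
)

# Element-wise product of term frequency and inverse document frequency.
tf_idf["tf_idf"] = tf_idf["tf"] * tf_idf["idf"]
```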
| book | word | n_w | n_d | tf | i_d | idf | tf_idf |
|---|---|---|---|---|---|---|---|
| dracula | the | 8093 | 166916 | 0.048485 | 6 | 0.0 | 0.0 |
| | and | 5976 | 166916 | 0.035802 | 6 | 0.0 | 0.0 |
| | i | 4847 | 166916 | 0.029039 | 6 | 0.0 | 0.0 |
| | to | 4745 | 166916 | 0.028427 | 6 | 0.0 | 0.0 |
| | of | 3748 | 166916 | 0.022454 | 6 | 0.0 | 0.0 |
Again, let's see the top TF-IDF terms:
| book | word | tf_idf |
|---|---|---|
| dracula | helsing | 0.003467 |
| | lucy | 0.003231 |
| | mina | 0.002619 |
| | van | 0.002126 |
| | harker | 0.001879 |
| frankenstein | clerval | 0.001347 |
| | justine | 0.001256 |
| | felix | 0.001142 |
| | elizabeth | 0.000813 |
| | frankenstein | 0.000731 |
| grimms_fairy_tales | gretel | 0.001667 |
| | hans | 0.001304 |
| | tailor | 0.001174 |
| | dwarf | 0.001055 |
| | hansel | 0.000799 |
| moby_dick | ahab | 0.004161 |
| | whale | 0.003873 |
| | whales | 0.002181 |
| | stubb | 0.002101 |
| | queequeg | 0.002036 |
| tom_sawyer | huck | 0.005956 |
| | tom | 0.004305 |
| | becky | 0.002655 |
| | joe | 0.002406 |
| | sid | 0.001847 |
| war_and_peace | pierre | 0.006100 |
| | natásha | 0.003769 |
| | rostóv | 0.002411 |
| | prince | 0.002318 |
| | moscow | 0.002243 |
As before, we've got a set of important words for the given document.
Note that this is more of a demo of Pandas text processing. If you're considering using TF-IDF in a more production-like setting, see some existing solutions like scikit-learn's TfidfVectorizer.
Note that I've just scratched the surface of Pandas' text processing capabilities. There are many other useful functions, like the `match` function shown below:
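A sketch of how `match` could be used to count the words starting with "s", on toy data. The pattern `^s` and the limit of 10 rows are assumptions inferred from the output table.

```python
import pandas as pd

words = pd.DataFrame({
    "book": ["dracula", "dracula", "moby_dick", "moby_dick", "dracula"],
    "word": ["she", "so", "she", "the", "said"],
})

s_words = (
    words[words["word"].str.match("^s")]  # keep words starting with "s"
    .groupby("word")
    .count()                              # occurrences of each word
    .rename(columns={"book": "n"})
    .nlargest(10, "n")
)
```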
| word | n |
|---|---|
| s | 8341 |
| she | 6317 |
| so | 5380 |
| said | 5337 |
| some | 2317 |
| see | 1760 |
| such | 1456 |
| still | 1368 |
| should | 1264 |
| seemed | 1233 |
Again, I encourage you to check Pandas manual for more details.
The notebook with the above computations is available for download here.