Tokenize Pandas Column



Posted By Jakub Nowacki, 18 September 2017

Pandas is a great tool for the analysis of tabular data via its DataFrame interface. Slightly less known are its capabilities for working with text data. In this post I'll present them on some simple examples. As a comparison I'll use my previous post about TF-IDF in Spark.

First we load the data, i.e. books previously downloaded from the Project Gutenberg site. Here we use the handy glob module to walk over the text files in a directory and read the files line by line into a DataFrame.

      book                                              lines
0  dracula  The Project Gutenberg EBook of Dracula, by Br...
1  dracula                                                \n
2  dracula  This eBook is for the use of anyone anywhere a...
3  dracula  almost no restrictions whatsoever.  You may co...
4  dracula  re-use it under the terms of the Project Guten...

Since we’re now in Pandas, we can easily use its plotting capability to look at the number of lines in the books. Note that value_counts of books creates a Series indexed by the book names, thus we don’t need to set an index for plotting. Note also that we can choose the plot styling; see the demo page for the available styles.

Now we need to tokenize the sentences into words, aka terms. While we can do it in a loop, we can take advantage of the split function in the text toolkit for Pandas’ Series; see this manual for all the functions. To get it we just invoke the split function, which is a part of str, i.e. the StringMethods object. The argument is a regular expression pattern on which the string, in our case the sentence, will be split. In our case it is the regular expression set of everything that is not a letter (capital or not), digit or underscore. This is because \W is the reverse of \w, which is the set [A-Za-z0-9_], i.e. it contains an underscore. Since in this case we also want to split on underscores, we had to add the underscore to the splitting set. Note that this decision, and others you’ll make while choosing the splitting set, will influence the tokenization and thus the result. Hence, select the splitting pattern carefully. If you’re not sure about your decision, use practice pages like regex101.com to check your patterns.
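A minimal sketch of this step, using a tiny hand-made DataFrame in place of the books loaded above (the column names book and lines match the output shown below):

```python
import pandas as pd

# Tiny stand-in for the books DataFrame built from the Gutenberg files.
df = pd.DataFrame({
    'book': ['dracula', 'dracula'],
    'lines': ['The Project Gutenberg EBook of Dracula',
              're-use it under the terms'],
})

# Split each line on runs of characters that are not letters or digits;
# '_' is added explicitly because \W (the complement of \w) excludes it.
df['words'] = df['lines'].str.split(r'[\W_]+')
```

Because the pattern is longer than one character, Pandas treats it as a regular expression.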

      book                                              lines                                              words
0  dracula  The Project Gutenberg EBook of Dracula, by Br...  [, The, Project, Gutenberg, EBook, of, Dracula...
1  dracula                                                \n                                                 []
2  dracula  This eBook is for the use of anyone anywhere a...  [This, eBook, is, for, the, use, of, anyone, a...
3  dracula  almost no restrictions whatsoever.  You may co...  [almost, no, restrictions, whatsoever, You, ma...
4  dracula  re-use it under the terms of the Project Guten...  [re, use, it, under, the, terms, of, the, Proj...

As a result we get a new column with an array of words. Now we need to flatten the resulting DataFrame somehow. The best approach is to use iterators and create a new DataFrame with words; see this Stack Overflow post for the performance details of this solution. Note that this may take a few seconds depending on the machine; on mine it takes about 25s.
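The flattening can be sketched as follows, again on a small hypothetical frame with a list-valued words column:

```python
import pandas as pd

# Hypothetical tokenized frame: one row per line, 'words' holds a list.
df = pd.DataFrame({
    'book': ['dracula', 'dracula'],
    'words': [['The', 'Project'], ['re', 'use', 'it']],
})

# Emit one (book, word) pair per token via a generator; building the flat
# DataFrame directly from an iterator avoids slow row-by-row appends.
rows = ((row.book, w) for row in df.itertuples() for w in row.words)
words = pd.DataFrame(rows, columns=['book', 'word'])
```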

      book       word
0  dracula
1  dracula        The
2  dracula    Project
3  dracula  Gutenberg
4  dracula      EBook

Now we have a DataFrame with words, but occasionally we get an empty string as a byproduct of the splitting. We should remove it, as it would otherwise be counted as a term. To do that we can use the string function len as follows:
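A sketch of the filtering step, assuming a small words frame in which the empty string is a splitting artefact:

```python
import pandas as pd

# Hypothetical flattened frame; the '' row is a byproduct of splitting.
words = pd.DataFrame({'book': ['dracula'] * 3,
                      'word': ['', 'The', 'Project']})

# str.len gives the length of each string; keep only non-empty tokens.
words = words[words['word'].str.len() > 0]
```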

      book       word
1  dracula        The
2  dracula    Project
3  dracula  Gutenberg
4  dracula      EBook
5  dracula         of

The new words DataFrame is now free of the empty strings.

If we want to calculate the TF-IDF statistic we need to normalize the words by bringing them all to the same case. Again, we can use a Series string function for that.
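For instance, lower-casing the word column is a one-liner with the str accessor:

```python
import pandas as pd

# Hypothetical words frame with mixed-case tokens.
words = pd.DataFrame({'book': ['dracula'] * 2,
                      'word': ['The', 'Project']})

# Normalize case so that 'The' and 'the' are counted as one term.
words['word'] = words['word'].str.lower()
```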

      book       word
1  dracula        the
2  dracula    project
3  dracula  gutenberg
4  dracula      ebook
5  dracula         of


First, let’s calculate the counts of the terms per book. We can do it as follows:
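A sketch of the counting step on a tiny sample frame; size counts the rows in each (book, word) group, and to_frame names the resulting column n_w:

```python
import pandas as pd

# Hypothetical normalized words frame.
words = pd.DataFrame({
    'book': ['dracula', 'dracula', 'dracula'],
    'word': ['the', 'the', 'project'],
})

# Count occurrences of every (book, word) pair; the result is indexed
# by a (book, word) MultiIndex with a single column n_w.
word_counts = words.groupby(['book', 'word']).size().to_frame('n_w')
```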

                n_w
book     word
dracula  the   8093
         and   5976
         i     4847
         to    4745
         of    3748

Note that as a result we get a hierarchical index, aka MultiIndex; see the Pandas manual for more details.

With this index we can now plot the results in a much nicer way. E.g. we can get the 5 largest word counts per book and plot them as shown below. Note that we need to reset, i.e. remove, one level of the index, as it doubles when we use the nlargest function. I’ve made the process into a function, as I will reuse it further.
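One way the helper could look (the name top_n is my own; the frame below is a small hypothetical stand-in for the full counts):

```python
import pandas as pd

# Hypothetical (book, word)-indexed counts.
word_counts = pd.DataFrame({
    'book': ['a', 'a', 'a', 'b', 'b'],
    'word': ['x', 'y', 'z', 'x', 'q'],
    'n_w':  [5, 3, 1, 7, 2],
}).set_index(['book', 'word'])

def top_n(df, column, n=2):
    # nlargest per book prefixes the group key onto the index, which
    # duplicates the book level, so drop the outer copy afterwards.
    return (df.groupby('book')[column]
              .nlargest(n)
              .reset_index(level=0, drop=True))

top = top_n(word_counts, 'n_w')
```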

                             n_w
book               word
dracula            the      8093
                   and      5976
                   i        4847
                   to       4745
                   of       3748
frankenstein       the      4371
                   and      3046
                   i        2850
                   of       2760
                   to       2174
grimms_fairy_tales the      7224
                   and      5551
                   to       2749
                   he       2096
                   a        1978
moby_dick          the     14718
                   of       6743
                   and      6518
                   a        4807
                   to       4707
tom_sawyer         the      3973
                   and      3193
                   a        1955
                   to       1807
                   of       1585
war_and_peace      the     34725
                   and     22307
                   to      16755
                   of      15008
                   a       10584

To finish the TF as shown in the previous post, we need the word counts of the books to get the measure. As a reminder, below is the equation for TF:

    tf(t, d) = n_w(t, d) / n_d(d)

where n_w(t, d) is the number of occurrences of term t in book d and n_d(d) is the total number of terms in book d.

Note that I’ve renamed the column as it inherits the name by default.
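The per-book totals can be sketched like this; summing the per-word counts within each book gives n_d, and the inherited column name is replaced:

```python
import pandas as pd

# Hypothetical (book, word)-indexed counts.
word_counts = pd.DataFrame({
    'book': ['a', 'a', 'b'],
    'word': ['x', 'y', 'x'],
    'n_w':  [5, 3, 7],
}).set_index(['book', 'word'])

# Total word occurrences per book; rename the inherited n_w column
# to n_d to reflect its new meaning.
book_counts = (word_counts.groupby('book').sum()
                          .rename(columns={'n_w': 'n_d'}))
```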

                       n_d
book
dracula             166916
frankenstein         78475
grimms_fairy_tales  105336
moby_dick           222630
tom_sawyer           77612
war_and_peace       576627

Now we need to join both DataFrames on book to get the total word count per book into the final DataFrame. Having both n_w and n_d, we can easily calculate TF as follows:
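A sketch of the join and the TF calculation, on hypothetical miniature frames; joining on the shared book index level broadcasts n_d to every word row:

```python
import pandas as pd

# Hypothetical per-word and per-book counts.
word_counts = pd.DataFrame({
    'book': ['a', 'a'], 'word': ['x', 'y'], 'n_w': [6, 2],
}).set_index(['book', 'word'])
book_counts = pd.DataFrame({'book': ['a'], 'n_d': [8]}).set_index('book')

# Align on the 'book' index level, then compute tf = n_w / n_d.
tf = word_counts.join(book_counts)
tf['tf'] = tf['n_w'] / tf['n_d']
```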

                n_w     n_d        tf
book     word
dracula  the   8093  166916  0.048485
         and   5976  166916  0.035802
         i     4847  166916  0.029039
         to    4745  166916  0.028427
         of    3748  166916  0.022454

Again, let’s look at the top 5 words with respect to TF.

                                 tf
book               word
dracula            the     0.048485
                   and     0.035802
                   i       0.029039
                   to      0.028427
                   of      0.022454
frankenstein       the     0.055699
                   and     0.038815
                   i       0.036317
                   of      0.035170
                   to      0.027703
grimms_fairy_tales the     0.068581
                   and     0.052698
                   to      0.026097
                   he      0.019898
                   a       0.018778
moby_dick          the     0.066110
                   of      0.030288
                   and     0.029277
                   a       0.021592
                   to      0.021143
tom_sawyer         the     0.051191
                   and     0.041141
                   a       0.025189
                   to      0.023282
                   of      0.020422
war_and_peace      the     0.060221
                   and     0.038685
                   to      0.029057
                   of      0.026027
                   a       0.018355

As before (and as expected), we’ve got the English stop-words.


Now we can do IDF; the remainder of the formula is given below:

    idf(t) = ln(N / i_d(t))

where N is the total number of books and i_d(t) is the number of books in which term t appears.

First we need to get the document count. The simplest solution is to use the Series method nunique, i.e. get the size of the set of unique elements in the series.

Similarly, to get the number of unique books that every term appeared in, we can use the same method but on grouped data. Again we need to rename the column. Note that sorting the values is only for presentation and is not needed for the further computations.
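Both counts can be sketched together on a small hypothetical words frame:

```python
import pandas as pd

# Hypothetical flattened words frame.
words = pd.DataFrame({'book': ['a', 'a', 'b'],
                      'word': ['x', 'y', 'x']})

# N: the number of distinct books (documents).
n_books = words['book'].nunique()

# i_d: for each word, the number of distinct books it appears in;
# rename the inherited column, and sort only for presentation.
doc_counts = (words.groupby('word')[['book']].nunique()
                   .rename(columns={'book': 'i_d'})
                   .sort_values('i_d'))
```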

            i_d
word
laplandish    1
moluccas      1
molten        1
molliten      1
molière       1

Having all the components of the IDF formula, we can now calculate it as a new column.
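A sketch of the IDF column, using NumPy’s log on a tiny hypothetical doc_counts frame; with 6 books, a term appearing in a single one gets idf = ln(6/1) ≈ 1.791759, while a term present everywhere gets 0:

```python
import numpy as np
import pandas as pd

n_books = 6  # document count from nunique above

# Hypothetical per-word document counts.
doc_counts = pd.DataFrame({'word': ['laplandish', 'the'],
                           'i_d': [1, 6]}).set_index('word')

# idf(t) = ln(N / i_d(t)); words in every book drop to idf 0.
doc_counts['idf'] = np.log(n_books / doc_counts['i_d'])
```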

            i_d       idf
word
laplandish    1  1.791759
moluccas      1  1.791759
molten        1  1.791759
molliten      1  1.791759
molière       1  1.791759

To get the final DataFrame we need to join both DataFrames as follows:
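The join can be sketched like this; a singly-indexed frame can be joined against the matching level of a MultiIndex, so the idf values are broadcast to every (book, word) row:

```python
import pandas as pd

# Hypothetical TF frame indexed by (book, word) and IDF frame by word.
tf = pd.DataFrame({
    'book': ['a'], 'word': ['x'], 'n_w': [6], 'n_d': [8], 'tf': [0.75],
}).set_index(['book', 'word'])
idf = pd.DataFrame({'word': ['x'], 'i_d': [1], 'idf': [0.5]}
                   ).set_index('word')

# Align on the shared 'word' index level.
result = tf.join(idf)
```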


                n_w     n_d        tf  i_d  idf
book     word
dracula  the   8093  166916  0.048485    6  0.0
         and   5976  166916  0.035802    6  0.0
         i     4847  166916  0.029039    6  0.0
         to    4745  166916  0.028427    6  0.0
         of    3748  166916  0.022454    6  0.0

Having now a DataFrame with both TF and IDF values, we can calculate the TF-IDF statistic.
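The final step is a plain column product, sketched here on a small hypothetical frame; note how stop-words with idf 0 vanish automatically:

```python
import pandas as pd

# Hypothetical joined frame with tf and idf columns.
result = pd.DataFrame({
    'book': ['dracula', 'dracula'],
    'word': ['the', 'helsing'],
    'tf':  [0.048485, 0.001100],
    'idf': [0.000000, 1.791759],
}).set_index(['book', 'word'])

# TF-IDF is the product of term frequency and inverse document frequency.
result['tf_idf'] = result['tf'] * result['idf']
```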

                n_w     n_d        tf  i_d  idf  tf_idf
book     word
dracula  the   8093  166916  0.048485    6  0.0     0.0
         and   5976  166916  0.035802    6  0.0     0.0
         i     4847  166916  0.029039    6  0.0     0.0
         to    4745  166916  0.028427    6  0.0     0.0
         of    3748  166916  0.022454    6  0.0     0.0

Again, let’s see the top TF-IDF terms:

                                   tf_idf
book               word
dracula            helsing       0.003467
                   lucy          0.003231
                   mina          0.002619
                   van           0.002126
                   harker        0.001879
frankenstein       clerval       0.001347
                   justine       0.001256
                   felix         0.001142
                   elizabeth     0.000813
                   frankenstein  0.000731
grimms_fairy_tales gretel        0.001667
                   hans          0.001304
                   tailor        0.001174
                   dwarf         0.001055
                   hansel        0.000799
moby_dick          ahab          0.004161
                   whale         0.003873
                   whales        0.002181
                   stubb         0.002101
                   queequeg      0.002036
tom_sawyer         huck          0.005956
                   tom           0.004305
                   becky         0.002655
                   joe           0.002406
                   sid           0.001847
war_and_peace      pierre        0.006100
                   natásha       0.003769
                   rostóv        0.002411
                   prince        0.002318
                   moscow        0.002243

As before we’ve got a set of important words for the given document.

Note that this is more of a demo of Pandas text processing. If you’re considering using TF-IDF in a more production-oriented setting, see some existing solutions like scikit-learn’s TfidfVectorizer.

Note that I’ve just scratched the surface of Pandas’ text processing capabilities. There are many other useful functions, like the match function shown below:
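For example, str.match anchors a regular expression at the start of each string, so it can select terms by prefix; a sketch on a small hypothetical counts frame:

```python
import pandas as pd

# Hypothetical word counts indexed by word.
word_counts = pd.DataFrame({
    'word': ['she', 'the', 'so', 'said'],
    'n':    [6317, 8093, 5380, 5337],
}).set_index('word')

# Keep only the terms beginning with 's'; str.match tests the pattern
# against the start of each index value.
s_words = word_counts[word_counts.index.str.match('s')]
```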

           n
word
s       8341
she     6317
so      5380
said    5337
some    2317
see     1760
such    1456
still   1368
should  1264
seemed  1233


Again, I encourage you to check Pandas manual for more details.

The notebook with the above computations is available for download here.