Tokenize Pandas Column



Posted By Jakub Nowacki, 18 September 2017

Pandas is a great tool for the analysis of tabular data via its DataFrame interface. Slightly less known are its capabilities for working with text data. In this post I'll present them on some simple examples. As a comparison I'll use my previous post about TF-IDF in Spark.

First we load the data, i.e. books previously downloaded from the Project Gutenberg site. Here we use the handy glob module to walk over the text files in a directory and read the files line by line into a DataFrame.

      book                                              lines
0  dracula  The Project Gutenberg EBook of Dracula, by Br...
1  dracula                                                \n
2  dracula  This eBook is for the use of anyone anywhere a...
3  dracula  almost no restrictions whatsoever.  You may co...
4  dracula  re-use it under the terms of the Project Guten...

Since we’re now in Pandas, we can easily use its plotting capability to look at the number of lines in the books. Note that value_counts of books creates a Series indexed by the book names, thus we don’t need to set an index for plotting. Note also that we can choose the plot styling; see the demo page for the available styles.

Now we need to tokenize the sentences into words, aka terms. While we can do it in a loop, we can take advantage of the split function in the text toolkit for Pandas’ Series; see this manual for all the functions. To get it we just invoke the split function, which is a part of str, i.e. the StringMethods object. The argument is a regular expression pattern on which the string, in our case the sentence, will be split. In our case it is the regular expression set of everything that is not a letter (capital or not), digit or underscore. This is because \W is the reverse of \w, which is the set [A-Za-z0-9_], i.e. it contains an underscore. Since in this case we also want to split on underscores, we had to add the underscore to the splitting set. Note that this decision, and others you’ll make while choosing the splitting set, will influence the tokenization and thus the result. Hence, select the splitting pattern carefully. If you’re not sure about your decision, use practice pages like regex101.com to check your patterns.
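A minimal sketch of this step, using a tiny hand-made DataFrame in place of the books loaded above (the column names book and lines match the output shown below):

```python
import pandas as pd

# Tiny stand-in for the books DataFrame built from the Gutenberg files.
df = pd.DataFrame({
    'book': ['dracula', 'dracula'],
    'lines': ['The Project Gutenberg EBook of Dracula',
              're-use it under the terms'],
})

# Split each line on runs of characters that are not letters or digits;
# '_' is added explicitly because \W (the complement of \w) excludes it.
df['words'] = df['lines'].str.split(r'[\W_]+')
```

Because the pattern is longer than one character, Pandas treats it as a regular expression.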

      book                                              lines                                              words
0  dracula  The Project Gutenberg EBook of Dracula, by Br...  [, The, Project, Gutenberg, EBook, of, Dracula...
1  dracula                                                \n                                                 []
2  dracula  This eBook is for the use of anyone anywhere a...  [This, eBook, is, for, the, use, of, anyone, a...
3  dracula  almost no restrictions whatsoever.  You may co...  [almost, no, restrictions, whatsoever, You, ma...
4  dracula  re-use it under the terms of the Project Guten...  [re, use, it, under, the, terms, of, the, Proj...

As a result we get a new column with an array of words. Now we need to flatten the resulting DataFrame somehow. The best approach is to use iterators and create a new DataFrame with words; see this Stack Overflow post for the performance details of this solution. Note that this may take a few seconds depending on the machine; on mine it takes about 25s.
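The flattening can be sketched as follows, again on a small hypothetical frame with a list-valued words column:

```python
import pandas as pd

# Hypothetical tokenized frame: one row per line, 'words' holds a list.
df = pd.DataFrame({
    'book': ['dracula', 'dracula'],
    'words': [['The', 'Project'], ['re', 'use', 'it']],
})

# Emit one (book, word) pair per token via a generator; building the flat
# DataFrame directly from an iterator avoids slow row-by-row appends.
rows = ((row.book, w) for row in df.itertuples() for w in row.words)
words = pd.DataFrame(rows, columns=['book', 'word'])
```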

      book       word
0  dracula
1  dracula        The
2  dracula    Project
3  dracula  Gutenberg
4  dracula      EBook

Now we have a DataFrame with words, but occasionally we get an empty string as a byproduct of the splitting. We should remove it, as it would otherwise be counted as a term. To do that we can use the string function len as follows:
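A sketch of the filtering step, assuming a small words frame in which the empty string is a splitting artefact:

```python
import pandas as pd

# Hypothetical flattened frame; the '' row is a byproduct of splitting.
words = pd.DataFrame({'book': ['dracula'] * 3,
                      'word': ['', 'The', 'Project']})

# str.len gives the length of each string; keep only non-empty tokens.
words = words[words['word'].str.len() > 0]
```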

      book       word
1  dracula        The
2  dracula    Project
3  dracula  Gutenberg
4  dracula      EBook
5  dracula         of

The new words DataFrame is now free of the empty strings.

If we want to calculate the TF-IDF statistic we need to normalize the words by bringing them all to the same case. Again, we can use a Series string function for that.
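For instance, lower-casing the word column is a one-liner with the str accessor:

```python
import pandas as pd

# Hypothetical words frame with mixed-case tokens.
words = pd.DataFrame({'book': ['dracula'] * 2,
                      'word': ['The', 'Project']})

# Normalize case so that 'The' and 'the' are counted as one term.
words['word'] = words['word'].str.lower()
```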

      book       word
1  dracula        the
2  dracula    project
3  dracula  gutenberg
4  dracula      ebook
5  dracula         of


First, let’s calculate the counts of the terms per book. We can do it as follows:
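A sketch of the counting step on a tiny sample frame; size counts the rows in each (book, word) group, and to_frame names the resulting column n_w:

```python
import pandas as pd

# Hypothetical normalized words frame.
words = pd.DataFrame({
    'book': ['dracula', 'dracula', 'dracula'],
    'word': ['the', 'the', 'project'],
})

# Count occurrences of every (book, word) pair; the result is indexed
# by a (book, word) MultiIndex with a single column n_w.
word_counts = words.groupby(['book', 'word']).size().to_frame('n_w')
```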

                n_w
book     word
dracula  the   8093
         and   5976
         i     4847
         to    4745
         of    3748

Note that as a result we get a hierarchical index, aka MultiIndex; see the Pandas manual for more details.

With this index we can now plot the results in a much nicer way. E.g. we can get the 5 largest word counts per book and plot them as shown below. Note that we need to reset, i.e. remove, one level of the index, as it doubles when we use the nlargest function. I’ve made the process into a function, as I will reuse it further.
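One way the helper could look (the name top_n is my own; the frame below is a small hypothetical stand-in for the full counts):

```python
import pandas as pd

# Hypothetical (book, word)-indexed counts.
word_counts = pd.DataFrame({
    'book': ['a', 'a', 'a', 'b', 'b'],
    'word': ['x', 'y', 'z', 'x', 'q'],
    'n_w':  [5, 3, 1, 7, 2],
}).set_index(['book', 'word'])

def top_n(df, column, n=2):
    # nlargest per book prefixes the group key onto the index, which
    # duplicates the book level, so drop the outer copy afterwards.
    return (df.groupby('book')[column]
              .nlargest(n)
              .reset_index(level=0, drop=True))

top = top_n(word_counts, 'n_w')
```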

                             n_w
book               word
dracula            the      8093
                   and      5976
                   i        4847
                   to       4745
                   of       3748
frankenstein       the      4371
                   and      3046
                   i        2850
                   of       2760
                   to       2174
grimms_fairy_tales the      7224
                   and      5551
                   to       2749
                   he       2096
                   a        1978
moby_dick          the     14718
                   of       6743
                   and      6518
                   a        4807
                   to       4707
tom_sawyer         the      3973
                   and      3193
                   a        1955
                   to       1807
                   of       1585
war_and_peace      the     34725
                   and     22307
                   to      16755
                   of      15008
                   a       10584

To finish the TF as shown in the previous post, we need the word counts of the books to get the measure. As a reminder, below is the equation for TF:

    tf(t, d) = n_w(t, d) / n_d(d)

where n_w(t, d) is the number of occurrences of term t in book d and n_d(d) is the total number of terms in book d.

Note that I’ve renamed the column as it inherits the name by default.
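The per-book totals can be sketched like this; summing the per-word counts within each book gives n_d, and the inherited column name is replaced:

```python
import pandas as pd

# Hypothetical (book, word)-indexed counts.
word_counts = pd.DataFrame({
    'book': ['a', 'a', 'b'],
    'word': ['x', 'y', 'x'],
    'n_w':  [5, 3, 7],
}).set_index(['book', 'word'])

# Total word occurrences per book; rename the inherited n_w column
# to n_d to reflect its new meaning.
book_counts = (word_counts.groupby('book').sum()
                          .rename(columns={'n_w': 'n_d'}))
```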

                       n_d
book
dracula             166916
frankenstein         78475
grimms_fairy_tales  105336
moby_dick           222630
tom_sawyer           77612
war_and_peace       576627

Now we need to join both DataFrames on book to get the total word count per book into the final DataFrame. Having both n_w and n_d, we can easily calculate TF as follows:
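A sketch of the join and the TF calculation, on hypothetical miniature frames; joining on the shared book index level broadcasts n_d to every word row:

```python
import pandas as pd

# Hypothetical per-word and per-book counts.
word_counts = pd.DataFrame({
    'book': ['a', 'a'], 'word': ['x', 'y'], 'n_w': [6, 2],
}).set_index(['book', 'word'])
book_counts = pd.DataFrame({'book': ['a'], 'n_d': [8]}).set_index('book')

# Align on the 'book' index level, then compute tf = n_w / n_d.
tf = word_counts.join(book_counts)
tf['tf'] = tf['n_w'] / tf['n_d']
```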

                n_w     n_d        tf
book     word
dracula  the   8093  166916  0.048485
         and   5976  166916  0.035802
         i     4847  166916  0.029039
         to    4745  166916  0.028427
         of    3748  166916  0.022454

Again, let’s look at the top 5 words with respect to TF.

                                 tf
book               word
dracula            the     0.048485
                   and     0.035802
                   i       0.029039
                   to      0.028427
                   of      0.022454
frankenstein       the     0.055699
                   and     0.038815
                   i       0.036317
                   of      0.035170
                   to      0.027703
grimms_fairy_tales the     0.068581
                   and     0.052698
                   to      0.026097
                   he      0.019898
                   a       0.018778
moby_dick          the     0.066110
                   of      0.030288
                   and     0.029277
                   a       0.021592
                   to      0.021143
tom_sawyer         the     0.051191
                   and     0.041141
                   a       0.025189
                   to      0.023282
                   of      0.020422
war_and_peace      the     0.060221
                   and     0.038685
                   to      0.029057
                   of      0.026027
                   a       0.018355

As before (and as expected), we’ve got the English stop-words.


Now we can do IDF; the remainder of the formula is given below:

    idf(t) = ln(N / i_d(t))

where N is the total number of books and i_d(t) is the number of books in which term t appears.

First we need to get the document count. The simplest solution is to use the Series method nunique, i.e. get the size of the set of unique elements in the series.

Similarly, to get the number of unique books that every term appeared in, we can use the same method but on grouped data. Again we need to rename the column. Note that sorting the values is only for presentation and is not needed for the further computations.
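Both counts can be sketched together on a small hypothetical words frame:

```python
import pandas as pd

# Hypothetical flattened words frame.
words = pd.DataFrame({'book': ['a', 'a', 'b'],
                      'word': ['x', 'y', 'x']})

# N: the number of distinct books (documents).
n_books = words['book'].nunique()

# i_d: for each word, the number of distinct books it appears in;
# rename the inherited column, and sort only for presentation.
doc_counts = (words.groupby('word')[['book']].nunique()
                   .rename(columns={'book': 'i_d'})
                   .sort_values('i_d'))
```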

            i_d
word
laplandish    1
moluccas      1
molten        1
molliten      1
molière       1

Having all the components of the IDF formula, we can now calculate it as a new column.
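A sketch of the IDF column, using NumPy’s log on a tiny hypothetical doc_counts frame; with 6 books, a term appearing in a single one gets idf = ln(6/1) ≈ 1.791759, while a term present everywhere gets 0:

```python
import numpy as np
import pandas as pd

n_books = 6  # document count from nunique above

# Hypothetical per-word document counts.
doc_counts = pd.DataFrame({'word': ['laplandish', 'the'],
                           'i_d': [1, 6]}).set_index('word')

# idf(t) = ln(N / i_d(t)); words in every book drop to idf 0.
doc_counts['idf'] = np.log(n_books / doc_counts['i_d'])
```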

            i_d       idf
word
laplandish    1  1.791759
moluccas      1  1.791759
molten        1  1.791759
molliten      1  1.791759
molière       1  1.791759

To get the final DataFrame we need to join both DataFrames as follows:
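The join can be sketched like this; a singly-indexed frame can be joined against the matching level of a MultiIndex, so the idf values are broadcast to every (book, word) row:

```python
import pandas as pd

# Hypothetical TF frame indexed by (book, word) and IDF frame by word.
tf = pd.DataFrame({
    'book': ['a'], 'word': ['x'], 'n_w': [6], 'n_d': [8], 'tf': [0.75],
}).set_index(['book', 'word'])
idf = pd.DataFrame({'word': ['x'], 'i_d': [1], 'idf': [0.5]}
                   ).set_index('word')

# Align on the shared 'word' index level.
result = tf.join(idf)
```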


                n_w     n_d        tf  i_d  idf
book     word
dracula  the   8093  166916  0.048485    6  0.0
         and   5976  166916  0.035802    6  0.0
         i     4847  166916  0.029039    6  0.0
         to    4745  166916  0.028427    6  0.0
         of    3748  166916  0.022454    6  0.0

Having now a DataFrame with both TF and IDF values, we can calculate the TF-IDF statistic.
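The final step is a plain column product, sketched here on a small hypothetical frame; note how stop-words with idf 0 vanish automatically:

```python
import pandas as pd

# Hypothetical joined frame with tf and idf columns.
result = pd.DataFrame({
    'book': ['dracula', 'dracula'],
    'word': ['the', 'helsing'],
    'tf':  [0.048485, 0.001100],
    'idf': [0.000000, 1.791759],
}).set_index(['book', 'word'])

# TF-IDF is the product of term frequency and inverse document frequency.
result['tf_idf'] = result['tf'] * result['idf']
```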

                n_w     n_d        tf  i_d  idf  tf_idf
book     word
dracula  the   8093  166916  0.048485    6  0.0     0.0
         and   5976  166916  0.035802    6  0.0     0.0
         i     4847  166916  0.029039    6  0.0     0.0
         to    4745  166916  0.028427    6  0.0     0.0
         of    3748  166916  0.022454    6  0.0     0.0

Again, let’s see the top TF-IDF terms:

                                   tf_idf
book               word
dracula            helsing       0.003467
                   lucy          0.003231
                   mina          0.002619
                   van           0.002126
                   harker        0.001879
frankenstein       clerval       0.001347
                   justine       0.001256
                   felix         0.001142
                   elizabeth     0.000813
                   frankenstein  0.000731
grimms_fairy_tales gretel        0.001667
                   hans          0.001304
                   tailor        0.001174
                   dwarf         0.001055
                   hansel        0.000799
moby_dick          ahab          0.004161
                   whale         0.003873
                   whales        0.002181
                   stubb         0.002101
                   queequeg      0.002036
tom_sawyer         huck          0.005956
                   tom           0.004305
                   becky         0.002655
                   joe           0.002406
                   sid           0.001847
war_and_peace      pierre        0.006100
                   natásha       0.003769
                   rostóv        0.002411
                   prince        0.002318
                   moscow        0.002243

As before we’ve got a set of important words for the given document.

Note that this is more of a demo of Pandas text processing. If you’re considering using TF-IDF in a more production-oriented setting, see some existing solutions like scikit-learn’s TfidfVectorizer.

Note that I’ve just scratched the surface of Pandas’ text processing capabilities. There are many other useful functions, like the match function shown below:
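For example, str.match anchors a regular expression at the start of each string, so it can select terms by prefix; a sketch on a small hypothetical counts frame:

```python
import pandas as pd

# Hypothetical word counts indexed by word.
word_counts = pd.DataFrame({
    'word': ['she', 'the', 'so', 'said'],
    'n':    [6317, 8093, 5380, 5337],
}).set_index('word')

# Keep only the terms beginning with 's'; str.match tests the pattern
# against the start of each index value.
s_words = word_counts[word_counts.index.str.match('s')]
```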

           n
word
s       8341
she     6317
so      5380
said    5337
some    2317
see     1760
such    1456
still   1368
should  1264
seemed  1233


Again, I encourage you to check Pandas manual for more details.

The notebook with the above computations is available for download here.