Finding Word Counts

I thought I might make a little post about how to get some word counts. If you are in the MCW neurology department, go ahead and ssh into my computer to get to a directory with a lot of text files.

ssh $username@noether.neuro.mcw.edu
cd /mnt/data/smazurchuk/intern2crea/bookcorpus/out_txts

There are over 16,000 text files in this directory, each one a free book. To get word counts, we can simply use grep (and wc, the word-count utility)!

The grep command for $word is:

grep -r -o -i -w "$word" | wc -l

Note: $word is allowed to be a string with a space in it (hence the double quotes).

The grep (and wc) options are:

Option  Function
-r      Recursively go through all files in the directory tree
-o      Print only the matched part of each matching line, one match per line
-i      Ignore case
-w      Match whole words only (the match cannot be just part of a word)
wc      Word count utility
-l      Tell wc to count the number of lines (one per match)
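
To make the -o / wc -l pairing concrete, here is a toy check using Python's re module as a stand-in for grep (the sentence and the word "cat" are just made up for illustration):

import re

text = 'The cat sat; a catalog lists cats.'
# findall is the -o analogue (one entry per match); \b...\b is the -w analogue
matches = re.findall(r'\bcat\b', text, flags=re.IGNORECASE)
print(len(matches))  # 1 -- "catalog" and "cats" are not whole-word matches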

The reason for writing the command this way is so that we can use a short Python script to get grep word counts for a whole list of words! Below is the code to accomplish just that:

import os

wordList = ['red', 'blue', 'words']
wordCnt = [0] * len(wordList)  # note: []*len(...) stays empty, so preallocate with zeros
for idx, word in enumerate(wordList):
    wordCnt[idx] = int(os.popen('grep -r -o -i -w "' + word + '" | wc -l').read())
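
As an aside, if you prefer the standard subprocess module over os.popen, a sketch of the same loop might look like this (same grep flags; grep_count is just a hypothetical helper name):

import subprocess

def grep_count(word):
    """Whole-word, case-insensitive count of `word` under the current directory."""
    result = subprocess.run(['grep', '-r', '-o', '-i', '-w', word],
                            capture_output=True, text=True)
    # -o prints one match per line, so counting lines counts matches
    return len(result.stdout.splitlines())

wordCnt = [grep_count(w) for w in wordList]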

Double Counts

The astute reader might notice that some words will get double counted. That is, we don’t want occurrences of “baseball bat” included in the count for “baseball”. This can be remedied by noticing that one string is a strict substring of the other! We can simply compare each word in the list against the other words, and if a is a substring of b, then we subtract the count of b from the count of a. You can verify this with (“baseball”, “baseball bat”), and the corresponding code is here:

for idx, tWord in enumerate(wordList):
    for idx2, word in enumerate(wordList):
        # if tWord is a substring of another list entry, its grep count
        # includes that entry's matches, so remove them
        if tWord in word and idx != idx2:
            wordCnt[idx] = wordCnt[idx] - wordCnt[idx2]
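
As a quick sanity check with made-up counts:

# Hypothetical counts: suppose grep found "baseball" 120 times
# and "baseball bat" 15 times
wordList = ['baseball', 'baseball bat']
wordCnt = [120, 15]
for idx, tWord in enumerate(wordList):
    for idx2, word in enumerate(wordList):
        if tWord in word and idx != idx2:
            wordCnt[idx] = wordCnt[idx] - wordCnt[idx2]
print(wordCnt)  # [105, 15] -- standalone "baseball" occurrences only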

Bigram Counts

In order to calculate frequencies for bigrams that start a word (e.g., the “th” in “this”), we have to modify our regular expression. If we want to only count bigrams that match at the beginning of a word, we can use the \< anchor [1]. We also remove the requirements to ignore case and to match the whole word. Our new command is:

bi_count = [0] * len(bigram_list)  # bigram_list is assumed defined, e.g. ['th', 're', ...]
for idx, bigram in enumerate(bigram_list):
    bi_count[idx] = int(os.popen(r'grep -r -o "\<' + bigram[:2] + '" | wc -l').read())
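
To see what \< buys us, here is a toy check in Python, where a \b placed before the pattern plays the same word-start role (the sample sentence is made up):

import re

# 'th' at the start of a word matches; the 'th' inside 'bath' does not
print(re.findall(r'\bth', 'this thing, not bath'))  # ['th', 'th']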

In order to verify these results, I found a blog post by Peter Norvig where he performed a similar analysis using the Google Books Ngrams dataset [2]. In principle, the only difference between our analyses is that his ignores capitalization. If you download his summarized dataset (linked from his post), the following Python code can be used to extract his counts for a list of bigrams:

import pandas as pd

tbl = pd.read_table('ngrams-all.tsv')
ref_col = (tbl.columns.values == '8/8:8').argmax()  # column of interest
ngrams = [str(k).lower() for k in tbl['1-gram'].tolist()]

vals = []
for word in wordList:
    # word[:2] is the word's leading bigram; look up Norvig's count for it
    vals.append(int(tbl.iloc[ngrams.index(word[:2]), ref_col]))

Visually inspecting the correlation, we find that our methods give very similar results (\(\rho = .96\)).

[Interactive scatter plot: the x-axis shows counts from Google Books, the y-axis shows counts generated as outlined above.]

However, noting that the bigram “th” drives a large part of the correlation, we find the correlation decreases to \(\rho = .89\) when we remove it.
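
For reference, the two correlations can be computed along these lines (a sketch using Pearson's r, and assuming bi_count and vals are aligned entry for entry with bigram_list):

import numpy as np

rho = np.corrcoef(bi_count, vals)[0, 1]

# Drop 'th' from both lists and recompute
keep = [i for i, b in enumerate(bigram_list) if b != 'th']
rho_no_th = np.corrcoef([bi_count[i] for i in keep],
                        [vals[i] for i in keep])[0, 1]
print(rho, rho_no_th)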

Hope that helps, and thanks for reading!


Update (9/3/21) - Speed Analysis

The equivalent Python command to the grep command above is:

import re

regex = re.compile(r'\bbadger\b', re.IGNORECASE)
count = len(regex.findall(big_string))

Where:

  • big_string is all the text loaded as a single string (see the sketch below)
  • \b matches a word boundary (which should do the same as the -w option from above)
  • re.IGNORECASE makes the search case insensitive, like the -i option from above
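
For completeness, here is a minimal sketch of how big_string might be built, assuming the out_txts directory from above and that every file is a plain .txt:

from pathlib import Path

# Read every .txt file under the current directory into one big string
big_string = '\n'.join(p.read_text(errors='ignore')
                       for p in Path('.').rglob('*.txt'))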

Python

%time count = len(regex.findall(big_string))
CPU times: user 1min 57s, sys: 843 ms, total: 1min 58s
Wall time: 1min 56s

Bash

(base) Singularity> time /bin/grep -r -o -i -w badger | wc -l
3799

real	0m54.579s
user	0m34.273s
sys	0m1.982s

As we can see, the Python command takes almost twice as long as just using grep. While it might seem that the grep command could be parallelized, CPU usage averages about 70% while it runs, indicating that the command is actually I/O bound rather than CPU bound. As such, spawning multiple processes would likely decrease speed. It looks like sticking with plain grep is the way to go!

References

  1. Basic Regular Expression (ArubaOS Web Help): https://www.arubanetworks.com/techdocs/ArubaOS_63_Web_Help/Content/ArubaFrameStyles/ESI/Basic_Regular_Expression.htm

  2. Peter Norvig's blog post: https://norvig.com/mayzner.html