BUILDING LARGE DISTANCE MATRICES BASED ON PART OF SPEECH (POS) TAGGING WITH PANDAS

Supposing you were given the task of tokenising a dataset consisting of sentences like so:

[Image: sample of the sentence dataset to be tokenised]

The end result we need is a data file with the above sentences as row titles and the different word tags as columns. As there are around 20,000 rows, we need to find a way to produce the required dataset in a reasonable amount of time.
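To make the target shape concrete, a hypothetical slice of the output (the job titles and tag values here are made up for illustration) would look like this:

job_title                    NN  JJ  VBD
senior front end developer    1   1    0
experienced data analyst      1   1    0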

We can start by looping through the above dataframe. For every sentence, we use NLTK's tokeniser and part-of-speech tagger, which together produce a list of tuples pairing each word with its tag, e.g. ('front', 'NN'). The full tag set can be explored with nltk.help.upenn_tagset().
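As a minimal sketch of that step (assuming NLTK and its tokeniser/tagger data are installed, and using a made-up example sentence):

import nltk

# One-time downloads may be needed:
# nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
sentence = "senior front end developer"   # hypothetical example sentence
tokens = nltk.word_tokenize(sentence)     # split the sentence into words
print(nltk.pos_tag(tokens))               # e.g. [('senior', 'JJ'), ('front', 'NN'), ...]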

From that list, we create a dictionary mapping each tag that appears to 1, and then append the dictionary to a list, building up a list of dictionaries as below:

# Build one dictionary per sentence, marking each tag that appears
tag_dict = {}
for word, tag in tag_list_1:
    tag_dict[tag] = 1
main_list.append(tag_dict)   # append once per sentence

RESULT

[{'NN': 1, 'ADV': 1, 'JJ': 1}, {'VBD': 1, 'JJ': 1, 'RB': 1, 'PRP': 1, 'NN': 1}, ...]
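Putting the loop together end to end, a minimal sketch (assuming df is the original dataframe and the sentences live in df['job_title'], as in the code further down):

import nltk

main_list = []
for sentence in df['job_title']:
    tag_list = nltk.pos_tag(nltk.word_tokenize(sentence))
    tag_dict = {}                  # fresh dictionary per sentence
    for word, tag in tag_list:
        tag_dict[tag] = 1          # mark each tag that appears
    main_list.append(tag_dict)     # one dictionary per sentence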

We can then convert this list to a dataframe. We will have the tags (NN, JJ, PRP, ...) as columns, but we won't have any row titles yet. We can add them by inserting the original sentences as the first column; since the dataframe was built row by row, the order still matches.

Once the sentences have been inserted into the dataframe, we can replace all NaN values with 0, which gives us a dataframe that looks like the image below:

[Image: the resulting dataframe of sentences against tag columns]

This is done by using the code below:

import pandas as pd

# Convert the list of dictionaries to a dataframe
df_amended = pd.DataFrame(main_list)

# Insert the sentences from the original dataframe as the first column (the row titles)
idx = 0
df_amended.insert(loc=idx, column='job_title', value=df['job_title'])

# Replace all NaN values with 0
df_amended.fillna(0, inplace=True)

Feel free to remove any unneeded features (such as stray commas and brackets) and to export the dataframe, as shown below.
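A minimal export sketch (the file name is a placeholder):

# Write the finished tag matrix to disk; the file name is arbitrary
df_amended.to_csv('pos_tag_matrix.csv', index=False)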

Working with Python's native data structures when searching and looping greatly decreases the time spent computing results.

Appending a new dictionary to a list instead of appending a new row to a dataframe is much faster, because each row append creates a new copy of the entire dataframe.
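A rough way to verify this yourself (a sketch, not a rigorous benchmark; DataFrame.append was available in pandas at the time of writing, though it has since been removed in pandas 2.x):

import timeit
import pandas as pd

def build_from_dicts(n=1000):
    rows = [{'NN': 1, 'JJ': 1} for _ in range(n)]   # accumulate plain dicts
    return pd.DataFrame(rows)                       # convert once at the end

def build_by_appending(n=1000):
    df = pd.DataFrame(columns=['NN', 'JJ'])
    for _ in range(n):
        # each append returns a brand-new copy of the whole dataframe
        df = df.append({'NN': 1, 'JJ': 1}, ignore_index=True)
    return df

print(timeit.timeit(build_from_dicts, number=10))    # fast
print(timeit.timeit(build_by_appending, number=10))  # much slower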

Written on November 9, 2017