This is some preliminary analysis on the phoneme dataset. There is a jupyter notebook with the code to generate the plots in this post, but I thought it might be better to simple put the results in here. First, some general propterties of the data:

Number of Patients	87
Number of Runs	230
Number of Patients with >3 runs	18

Note: A run is 80 tasks not 160

Distribution of Scores

Below is the score distrition for the runs (note, this is for 160 when possible)

As a sanity check, I also thought we should look to see the reaction times of all the patients. Plotted here is a histogram for each patient’s reaction time histogram. It is interesting how varied the reaction times are.

Natural Clusters

I thought we should also see of there is any natural clustering in the dataset. To do this, we can take a the score vector for each run and project this onto principle components. In practice, there are specific algorithms which are non-linear and specifically try to cluster the data (specifically, tSNE)

Next, I thought I would look to see if the small clusters are indeed the same people. For this analysis, I took all runs which were from a patient who had at least 4 runs and then color coded them. There are 18 patients who met this criteria, and therefore 18 colors in this plot.

We could also replicate this process using the reaction time data