Ack! Duplicates in the Data!

As I mentioned in a previous post, I compiled the data set that I’m currently working on in PostgreSQL.  To get this massive data set, I had to write a query that was massive by dint of the number of LEFT JOINs in it.  Today I caught myself wondering if I had remembered to add DISTINCT to the SELECT clause in my query, as that would prevent any obvious duplicates.  I looked and, whoops, I hadn’t specified it.  It turns out that I didn’t have too many duplicates (only 135 out of about 395,000 rows), but I was still bothered.

I took a look at the duplicates and found a small batch of them in the main name and contact table, where two records shared the same ID number but one had actual address and contact info while the other was pretty bare (in other words, missing information in various contact fields).  Although there was only one duplicate for each affected record, the many LEFT JOINs that I did multiplied the number of duplicates to get to about 135.

I really didn’t want to export the data again from the database and redo all the calculations I had made, so I had to find some way of cleaning up all these duplicates from within R.  Thankfully, I had turned the many contact fields in the data into dummy variables in view of the data mining that I had to do later on (I love dummy variables!).  This made it possible to distinguish duplicates based on whether they had more or fewer fields filled out, enabling me to remove them, as shown in the script below:


# These column numbers represent fields with name/contact info that I've
# marked with 1s and 0s depending on whether or not there's anything in
# the field.
bio_cols = c(5:21, 23:26)
# Row numbers of all the records with duplicate IDs (duplicated() marks
# the second and later occurrences of each ID)
dupe_id_rows = which(duplicated(big.dataset$ID))
# The actual ID values that were duplicated
dupe_ids = big.dataset$ID[dupe_id_rows]
# For each duplicated record, sum up the number of contact fields that
# have information in them
info_sums = rowSums(big.dataset[dupe_id_rows, bio_cols])
# Next we build a data frame that glues this information together. The
# goal is to isolate those rows representing records with missing info,
# and then delete them from the larger data set
a = data.frame(sums = info_sums, ids = dupe_ids, rows = dupe_id_rows)
# Now we order the data frame by the ID values
a = a[order(a$ids), ]
# For each ID value, the mean number of fields filled out across its
# duplicates (repeated so the averages line up row by row)
a$avg = rep(tapply(a$sums, a$ids, mean), tapply(a$sums, a$ids, length))
# Here we make a column that marks a row for deletion (with a 1 instead
# of a 0) if its number of filled-out fields is less than the mean for
# that ID value. The marked rows are the duplicates that are missing info.
a$del = ifelse(a$sums < a$avg, 1, 0)
# The full list of unique row numbers from the bigger data set that
# constitute duplicates missing information
rows_to_del = a$rows[a$del == 1]
# Here we delete those rows from the bigger data set
big.dataset = big.dataset[-rows_to_del, ]
# Since all remaining dupes are exact duplicates, this line removes them
big.dataset = big.dataset[!duplicated(big.dataset$ID), ]

Well, it worked!  I no longer have a problem with duplication.  Hopefully I’ll remember to use DISTINCT in my SQL queries in the future 🙂

PostgreSQL, Excel, R, and a Really Big Data Set!

At work I’ve started working with the biggest data set I’ve ever seen!  First, let me qualify my use of the term “Big Data”.  The number of rows in the resultant data set (after much transformation and manipulation in PostgreSQL and, to a lesser extent, in Excel) is just 395,928.  This will certainly pale in comparison to truly BIG data sets, but for me it’s huge!  For now, the data I’m working with has 307 variables (but a big chunk of null values).  Just calculate: if there were no null values in this data set, the number of data points would be 395,928 * 307 = 121,549,896!!  Even if 60% of this table consisted of null values, that would still leave 48,619,958 data points.  Wow!

Working with so much data (it came from a database dump), I’ve noticed a few things:

1) PostgreSQL is awesome at handling it.  I really haven’t had to wait too long for each of my queries (create table queries, mainly) to execute.  The longest I had to wait was in fact the final query that I ran today that joined all of the transformed tables that I created to a biographical information table.  I don’t think it took much longer than a minute, at most.

In fact, what I found really helpful in PostgreSQL was recoding some of the qualitative variables using CASE ... WHEN ... THEN ... END clauses.  Those are super handy and are carried out with seemingly little difficulty.
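The same kind of recode can be done on the R side too. Here’s a minimal sketch of the idea using nested ifelse() calls, which mirror a CASE expression (each test plays the role of a WHEN, and the last argument is the ELSE); the column values and labels here are made up for illustration:

```r
# Hypothetical qualitative variable to recode
region_code <- c("ON", "QC", "BC", "ON", "NS")

# Nested ifelse() works like CASE ... WHEN ... THEN ... ELSE ... END
region_name <- ifelse(region_code == "ON", "Ontario",
               ifelse(region_code == "QC", "Quebec",
               ifelse(region_code == "BC", "British Columbia",
                      "Other")))
```

Because ifelse() is vectorized, the whole column gets recoded in one pass, no row-by-row looping required.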

2) Excel 2010 had LOTS of difficulty letting me manipulate my data set once I loaded it up.  Granted, the file was a 176 megabyte CSV file… there’s no shame in whining when you have the weight of the world on your shoulders!

One problem I had in Excel was that, at one point, it stopped immediately redrawing the window every time I switched to a different ribbon.  For example, I would click on the “File” ribbon after being on the “Home” ribbon, and I would still see my data set rather than the “File” ribbon options.  Sizing the Excel window down and dragging it out of view to the bottom of the screen and back seemed to help at times, and waving my mouse cursor over where the relevant buttons should be helped at others.

Another problem came when I went to recode some variables in Excel into dummy variables: it took SO LONG to finish the job!  Granted, I used the standard “Find / Replace” window, which might not be the best option in Excel.  I’ve learned my lesson, though.  I have to either recode in PostgreSQL, or do it in R.
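For the record, the dummy-variable recode that was so slow in Excel is a one-liner in R. Here’s a hedged sketch; the data frame and the column name `email` are hypothetical, standing in for one of the contact fields:

```r
# Hypothetical contact data: some fields filled, some empty or NA
contacts <- data.frame(email = c("a@b.com", "", NA, "c@d.org"),
                       stringsAsFactors = FALSE)

# 1 if the field contains anything, 0 if it's empty or missing
contacts$has_email <- ifelse(!is.na(contacts$email) & contacts$email != "", 1, 0)
```

Since the whole column is processed as a vector, this finishes in a blink even on hundreds of thousands of rows.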

3) Although I haven’t done too much with the data set in R yet, it has handled it with pizzazz!  R loaded the csv file in just under 2 minutes (I was honestly expecting longer).  Once I had it in, I realized that for the variables in my data set where there were no zeros indicating the absence of something, I would need to include them somehow.  I didn’t actually want to modify the data set to include the zeros, as that would add to how much RAM R was using up (it was already using 1.5 gigs of RAM!).  So, I made a simple function that would put zeros into a vector just for the purpose of graphing or statistical testing.

In other words, if I’m doing a boxplot and I need my X variable to contain zeros where there are only NA values, I run the boxplot command, passing the X variable to the function so that the zeros get added temporarily and not permanently (e.g. boxplot(Y ~ zerofy(X))).  This should hopefully prevent me from using too much RAM.
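For anyone curious, a function like the zerofy() described above can be as simple as this sketch (the exact implementation I used may differ, but the idea is the same):

```r
# Return a copy of the vector with NAs replaced by zeros. The original
# column in the data set is untouched, so no extra RAM is held on to
# after the plot or test is done.
zerofy <- function(x) {
  x[is.na(x)] <- 0
  x
}
```

Because R passes a copy into the function, the zero-filled vector exists only for the duration of the call, e.g. boxplot(Y ~ zerofy(X)).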

I haven’t yet run any typical statistical tests on the data set, so I can’t attest to how quickly they work on it, but that will come very very soon.  So far, I’m quite impressed with the open source tools I use and their performance on this data!