Inspiration
I did this as part of the Office Ally challenge for LA Hacks. Patient data is scattered across hospitals and medical facilities in America, and we need a smart way to consolidate it and determine which records belong to the same patient with at least 90% accuracy.
What it does
This project uses Levenshtein distance, also known as edit distance, as its main metric. Levenshtein distance is the minimum number of single-character changes that need to be made to one string before it becomes another. To understand and model the training data given, I used a neural network.
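To make the metric concrete, here is a minimal dynamic-programming implementation of Levenshtein distance (the classic Wagner-Fischer algorithm, not necessarily the exact implementation the repo uses):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b` (Wagner-Fischer DP)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3: k->s, e->i, insert g
```

For example, "kitten" and "sitting" are 3 edits apart, while identical strings have distance 0.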
How I built it
- First, I had to sanitize all the data. This is done in `parse.py` and `prepare.py`, and it involves making everything lowercase, changing all "Male" and "Female" values to simply "m" and "f", converting states to abbreviations, and a few other nuanced steps.
- For a complete picture of the data, I then compared every record to every other record and calculated the Levenshtein distance for each column.
- I used the edit distances for each column as input to a neural network that outputs either a 0 or a 1: 0 means the two records used to calculate the current edit distances do not belong in the same group; 1 means they do.
- To make predictions on unclassified data, I do the exact same thing. Say n rows of unclassified data are fed to the network. Row 1 is given its own group right off the bat, but from there, each row is compared to all of the already-classified rows. During the comparison, the code calculates the edit distance for each column just as it did when training the network, then feeds those edit distances to the trained model. The model generates a 1 or a 0 to signify whether the two rows belong in the same group. If a certain group gets enough ones, the unclassified row is put in that group; if no group gets enough ones, a new group is created. "Enough ones" is defined by what I call the `yes_threshold`: the minimum fraction of a classified group that needs to return 1 for the unclassified row to be put in that group.
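The sanitization and pairwise-comparison steps above can be sketched like this. The column names (`sex`, `state`) and the state map are illustrative assumptions; the real `parse.py`/`prepare.py` handle more cases:

```python
# Minimal Levenshtein (same metric the project uses) so the sketch is self-contained.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Excerpt of a state-abbreviation map; the real scripts would cover all states.
STATE_ABBREV = {"california": "ca", "new york": "ny", "texas": "tx"}

def sanitize(record):
    """Normalize one record roughly the way parse.py / prepare.py do:
    lower-case everything, shorten sex values, abbreviate state names."""
    out = {k: str(v).strip().lower() for k, v in record.items()}
    if out.get("sex") in ("male", "female"):
        out["sex"] = out["sex"][0]  # "male" -> "m", "female" -> "f"
    if out.get("state") in STATE_ABBREV:
        out["state"] = STATE_ABBREV[out["state"]]
    return out

def feature_vector(a, b, columns):
    """Per-column edit distances for one record pair -- the network's input."""
    return [levenshtein(a[c], b[c]) for c in columns]

r1 = sanitize({"first": "John", "sex": "Male", "state": "California"})
r2 = sanitize({"first": "Jon", "sex": "M", "state": "CA"})
print(feature_vector(r1, r2, ["first", "sex", "state"]))  # [1, 0, 0]
```

Each record pair becomes one small vector of distances, which is what makes a simple dense network a reasonable classifier here.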
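The prediction loop with the `yes_threshold` can be sketched as follows. Here `match(a, b)` stands in for the trained Keras model applied to the pair's edit-distance features (a hypothetical interface), and the sketch greedily takes the first group that clears the threshold, which is a simplification of the described logic:

```python
def assign_groups(rows, match, yes_threshold=0.5):
    """Greedy version of the grouping step. `match(a, b) -> 0 or 1` stands in
    for the trained model's verdict on whether two records belong together.
    A row joins the first group in which at least `yes_threshold` of the
    members vote 1; otherwise it starts a new group (so the very first row
    always gets its own group)."""
    groups = []
    for row in rows:
        for group in groups:
            ones = sum(match(row, member) for member in group)
            if ones / len(group) >= yes_threshold:
                group.append(row)
                break
        else:
            groups.append([row])  # no group got enough ones: create a new one
    return groups

# Toy stand-in model: two strings "match" if they share a first letter.
match = lambda a, b: int(a[0] == b[0])
print(assign_groups(["apple", "avocado", "banana", "berry"], match))
# [['apple', 'avocado'], ['banana', 'berry']]
```

One consequence of this scheme is that the number of groups never has to be known in advance, which is exactly why it sidesteps the k-means problem mentioned below.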
Challenges I ran into
One huge challenge was that we did not know how many groups there would be, which ruled out k-means clustering. Sanitizing the data was also a headache because I essentially had to eliminate as much human error as I could.
Accomplishments that I'm proud of
On the sample data, the algorithm classified only 2 (two!) records wrong out of 201, which gives an accuracy of ~99%.
Here are my algorithm's results and the actual results provided by Office Ally. Lines are formatted as `{group id}: {array of patient ids}`. The patient IDs that my algorithm got wrong are 198 and 201.
My results:
1: 1 2 3 4 5
2: 6 7
3: 8 9 10 11 12 13
4: 14 15
5: 16 17 18 19
6: 20
7: 21 22 23
8: 24 25 26
9: 27 28
10: 29 30
11: 31 32 33 34
12: 35
13: 36
14: 37
15: 38 39 40
16: 41
17: 42 43 44 45
18: 46 47
19: 48 49 50 51
20: 52 53 54 55 56
21: 57 58 59 60
22: 61
23: 62 63 64 65 66 67
24: 68 69
25: 70 71 72 73 74
26: 75 76 77 78
27: 79 80 81 82 83
28: 84 85 86
29: 87 88 89 90 91 92
30: 93 94
31: 95 96 97 98 99
32: 100 101 102
33: 103 104 105 106
34: 107 108
35: 109 110 111 112 113
36: 114
37: 115 116 117 118
38: 119 120 121
39: 122 123 124 125 126
40: 127 128 129
41: 130 131 132 133
42: 134 135 136 137
43: 138 139 140 141 142 143
44: 144 145 146
45: 147 148 149 150
46: 151 152 153 154
47: 155 156 157 158
48: 159 160 161
49: 162 163 164 165
50: 166
51: 167 168
52: 169 170 171
53: 172 173 174 175
54: 176 177 178
55: 179 180 181 182
56: 183
57: 184 185 186
58: 187 188 189 190
59: 191 192 193 194
60: 195 196
61: 197 198
62: 199
63: 200 201
Office Ally provided data:
1: 1 2 3 4 5
2: 6 7
3: 8 9 10 11 12 13
4: 14 15
5: 16 17 18 19
6: 20
7: 21 22 23
8: 24 25 26
9: 27 28
10: 29 30
11: 31 32 33 34
12: 35
13: 36
14: 37
15: 38 39 40
16: 41
17: 42 43 44 45
18: 46 47
19: 48 49 50 51
20: 52 53 54 55 56
21: 57 58 59 60
22: 61
23: 62 63 64 65 66 67
24: 68 69
25: 70 71 72 73 74
26: 75 76 77 78
27: 79 80 81 82 83
28: 84 85 86
29: 87 88 89 90 91 92
30: 93 94
31: 95 96 97 98 99
32: 100 101 102
33: 103 104 105 106
34: 107 108
35: 109 110 111 112 113
36: 114
37: 115 116 117 118
38: 119 120 121
39: 122 123 124 125 126
40: 127 128 129
41: 130 131 132 133
42: 134 135 136 137
43: 138 139 140 141 142 143
44: 144 145 146
45: 147 148 149 150
46: 151 152 153 154
47: 155 156 157 158
48: 159 160 161
49: 162 163 164 165
50: 166
51: 167 168
52: 169 170 171
53: 172 173 174 175
54: 176 177 178
55: 179 180 181 182
56: 183
57: 184 185 186
58: 187 188 189 190
59: 191 192 193 194
60: 195 196
61: 197
62: 198
63: 199
64: 200
65: 201
What I learned
I learned about edit distance and other metrics for comparing the similarity of strings. As someone who's relatively new to data science and machine learning, I also learned a lot more about Keras and neural networks.
What's next for Un1fy
I hope to
- sanitize the data even more to get better results
- include more metrics and determine an accurate or dynamic `yes_threshold`
- build a user-friendly web UI
Github repo: https://github.com/gjethwani/la-hacks-patient-matching