Inspiration
I did this as part of the Office Ally challenge for LA Hacks. Patient data is scattered across hospitals and medical facilities in America, and we need a smart way to consolidate it and determine which records belong to the same patient with at least 90% accuracy.
What it does
This project uses Levenshtein distance, also known as edit distance, as its main metric. Levenshtein distance is the minimum number of single-character changes that need to be made to one string before it becomes another. To understand and model the training data given, I used a neural network.
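To make the metric concrete, here is a minimal dynamic-programming implementation of Levenshtein distance (the classic Wagner-Fischer algorithm, not necessarily the exact implementation the repo uses):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b` (Wagner-Fischer DP)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3: k->s, e->i, insert g
```

For example, "kitten" and "sitting" are 3 edits apart, while identical strings have distance 0.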
How I built it
- First, I had to sanitize all the data. This is done in `parse.py` and `prepare.py`, and it involves making everything lowercase, changing all "Male" and "Female" values to simply "m" and "f", converting states to abbreviations, and a few other nuanced steps.
- For a complete picture of the data, I then compared every record to every other record and calculated the Levenshtein distance for each column.
- I used the edit distances for each column as input to a neural network that outputs either a 0 or a 1: 0 means the two records used to calculate the current edit distances do not belong in the same group; 1 means they do.
- To make predictions on unclassified data, I do the exact same thing. Say n rows of unclassified data are fed to the network. Row 1 is given its own group right off the bat, but from there, each row is compared to all of the already-classified rows. During the comparison, the code calculates the edit distance for each column just as it did when training the network, then feeds those edit distances to the trained model. The model generates a 1 or a 0 to signify whether the two rows belong in the same group. If a certain group gets enough ones, the unclassified row is put in that group; if no group gets enough ones, a new group is created. "Enough ones" is defined by what I call the `yes_threshold`: the minimum fraction of a classified group that needs to return 1 for the unclassified row to be put in that group.
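The sanitization and pairwise-comparison steps above can be sketched like this. The column names (`sex`, `state`) and the state map are illustrative assumptions; the real `parse.py`/`prepare.py` handle more cases:

```python
# Minimal Levenshtein (same metric the project uses) so the sketch is self-contained.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

# Excerpt of a state-abbreviation map; the real scripts would cover all states.
STATE_ABBREV = {"california": "ca", "new york": "ny", "texas": "tx"}

def sanitize(record):
    """Normalize one record roughly the way parse.py / prepare.py do:
    lower-case everything, shorten sex values, abbreviate state names."""
    out = {k: str(v).strip().lower() for k, v in record.items()}
    if out.get("sex") in ("male", "female"):
        out["sex"] = out["sex"][0]  # "male" -> "m", "female" -> "f"
    if out.get("state") in STATE_ABBREV:
        out["state"] = STATE_ABBREV[out["state"]]
    return out

def feature_vector(a, b, columns):
    """Per-column edit distances for one record pair -- the network's input."""
    return [levenshtein(a[c], b[c]) for c in columns]

r1 = sanitize({"first": "John", "sex": "Male", "state": "California"})
r2 = sanitize({"first": "Jon", "sex": "M", "state": "CA"})
print(feature_vector(r1, r2, ["first", "sex", "state"]))  # [1, 0, 0]
```

Each record pair becomes one small vector of distances, which is what makes a simple dense network a reasonable classifier here.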
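The prediction loop with the `yes_threshold` can be sketched as follows. Here `match(a, b)` stands in for the trained Keras model applied to the pair's edit-distance features (a hypothetical interface), and the sketch greedily takes the first group that clears the threshold, which is a simplification of the described logic:

```python
def assign_groups(rows, match, yes_threshold=0.5):
    """Greedy version of the grouping step. `match(a, b) -> 0 or 1` stands in
    for the trained model's verdict on whether two records belong together.
    A row joins the first group in which at least `yes_threshold` of the
    members vote 1; otherwise it starts a new group (so the very first row
    always gets its own group)."""
    groups = []
    for row in rows:
        for group in groups:
            ones = sum(match(row, member) for member in group)
            if ones / len(group) >= yes_threshold:
                group.append(row)
                break
        else:
            groups.append([row])  # no group got enough ones: create a new one
    return groups

# Toy stand-in model: two strings "match" if they share a first letter.
match = lambda a, b: int(a[0] == b[0])
print(assign_groups(["apple", "avocado", "banana", "berry"], match))
# [['apple', 'avocado'], ['banana', 'berry']]
```

One consequence of this scheme is that the number of groups never has to be known in advance, which is exactly why it sidesteps the k-means problem mentioned below.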
Challenges I ran into
One huge challenge was that we did not know how many groups there would be, which ruled out k-means clustering. Sanitizing the data was also a headache because I essentially had to eliminate as much human error as I could.
Accomplishments that I'm proud of
On the sample data, the algorithm classified only 2 (two!) records wrong out of 201, which gives an accuracy of ~99%.
Here are my algorithm's results and the actual results provided by Office Ally. Lines are formatted as `{group id}: {array of patient ids}`. The patient IDs that my algorithm got wrong are 198 and 201.
My results:
1: 1 2 3 4 5
2: 6 7
3: 8 9 10 11 12 13
4: 14 15
5: 16 17 18 19
6: 20
7: 21 22 23
8: 24 25 26
9: 27 28
10: 29 30
11: 31 32 33 34
12: 35
13: 36
14: 37
15: 38 39 40
16: 41
17: 42 43 44 45
18: 46 47
19: 48 49 50 51
20: 52 53 54 55 56
21: 57 58 59 60
22: 61
23: 62 63 64 65 66 67
24: 68 69
25: 70 71 72 73 74
26: 75 76 77 78
27: 79 80 81 82 83
28: 84 85 86
29: 87 88 89 90 91 92
30: 93 94
31: 95 96 97 98 99
32: 100 101 102
33: 103 104 105 106
34: 107 108
35: 109 110 111 112 113
36: 114
37: 115 116 117 118
38: 119 120 121
39: 122 123 124 125 126
40: 127 128 129
41: 130 131 132 133
42: 134 135 136 137
43: 138 139 140 141 142 143
44: 144 145 146
45: 147 148 149 150
46: 151 152 153 154
47: 155 156 157 158
48: 159 160 161
49: 162 163 164 165
50: 166
51: 167 168
52: 169 170 171
53: 172 173 174 175
54: 176 177 178
55: 179 180 181 182
56: 183
57: 184 185 186
58: 187 188 189 190
59: 191 192 193 194
60: 195 196
61: 197 198
62: 199
63: 200 201
Office Ally provided data:
1: 1 2 3 4 5
2: 6 7
3: 8 9 10 11 12 13
4: 14 15
5: 16 17 18 19
6: 20
7: 21 22 23
8: 24 25 26
9: 27 28
10: 29 30
11: 31 32 33 34
12: 35
13: 36
14: 37
15: 38 39 40
16: 41
17: 42 43 44 45
18: 46 47
19: 48 49 50 51
20: 52 53 54 55 56
21: 57 58 59 60
22: 61
23: 62 63 64 65 66 67
24: 68 69
25: 70 71 72 73 74
26: 75 76 77 78
27: 79 80 81 82 83
28: 84 85 86
29: 87 88 89 90 91 92
30: 93 94
31: 95 96 97 98 99
32: 100 101 102
33: 103 104 105 106
34: 107 108
35: 109 110 111 112 113
36: 114
37: 115 116 117 118
38: 119 120 121
39: 122 123 124 125 126
40: 127 128 129
41: 130 131 132 133
42: 134 135 136 137
43: 138 139 140 141 142 143
44: 144 145 146
45: 147 148 149 150
46: 151 152 153 154
47: 155 156 157 158
48: 159 160 161
49: 162 163 164 165
50: 166
51: 167 168
52: 169 170 171
53: 172 173 174 175
54: 176 177 178
55: 179 180 181 182
56: 183
57: 184 185 186
58: 187 188 189 190
59: 191 192 193 194
60: 195 196
61: 197
62: 198
63: 199
64: 200
65: 201
What I learned
I learned about edit distance and other metrics for comparing the similarity of strings. As someone who's relatively new to data science and machine learning, I also learned a lot more about Keras and neural networks.
What's next for Un1fy
I hope to
- sanitize the data even more to get better results
- include more metrics and determine an accurate or dynamic `yes_threshold`
- build a user-friendly web UI
Github repo: https://github.com/gjethwani/la-hacks-patient-matching