[P] Deep learning for estimating race and ethnicity from electronic medical records (GitHub + arXiv)

First of all, your marketing is great. You have a nice .ai domain name, a logo, and code on GitHub with an appropriate license. After checking out your profile, it looks like a very well-marketed undergrad project. However, your problem description and your methodology are both flawed. I have experience in both deep learning and large-scale medical data mining [1, 2], and these are some of the issues I see with your approach.

  1. You used closed datasets that are an order of magnitude smaller than those publicly available [3].

  2. Diagnosis codes themselves are far less reliable than ethnicity, so using one to predict the other is futile. This is especially true if you then use that ethnicity information to adjust for confounding, in which case it's all just noise.

  3. The associations between diseases and ethnicities that you identified are all well known, and they have been studied far more rigorously in the literature than is possible with some black-box interpreter.

  4. It's enticing to scream "privacy risk!!!!", as it's a quick way to get attention, but there are no regulations that prevent sharing ethnicity/race information, primarily because withholding it would harm people of that ethnicity by causing crucial factors to be missed. Also, just because X correlates with Y, and Y has not been provided to you, does NOT mean that there is a privacy risk. Otherwise we might as well just randomly sample from a uniform distribution and call it a day.

  5. I understand that Deep Learning is hot right now, but the enormous amount of noise in the labels, together with the unreliability of diagnosis/procedure codes, makes any DL research applied to EMR/EHR data unreliable. Considering the amount of effort you invested, I would recommend working on some other area that offers a better return on investment.

  6. The computation time numbers don't matter; in fact, putting them in your report/paper makes it look like it was written by amateurs (sadly true for most healthcare DL researchers using EHR/EMR data). SVMs have been successfully trained on datasets a thousand times larger than yours. Just because the library you used did not work does not invalidate SVMs; it just shows that your method was flawed, which casts doubt on your entire work.
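To make point 6 concrete, here is a minimal sketch of why training set size alone should not rule out an SVM. It assumes a linear SVM trained with a Pegasos-style stochastic subgradient method, where each update touches one example and costs O(d), so total cost grows linearly with the number of rows. This is an illustration on synthetic data, not the setup from the paper; `make_data`, `pegasos_train`, and `accuracy` are hypothetical helper names, and the shuffled-epoch sampling is a common variant rather than the original with-replacement scheme.

```python
import random


def make_data(n, d, seed=0):
    """Synthetic data: labels in {-1, +1} given by a random hyperplane through the origin."""
    rng = random.Random(seed)
    true_w = [rng.gauss(0, 1) for _ in range(d)]
    data = []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(d)]
        score = sum(wi * xi for wi, xi in zip(true_w, x))
        data.append((x, 1 if score >= 0 else -1))
    return data


def pegasos_train(data, lam=0.01, epochs=5, seed=0):
    """Linear SVM (hinge loss + L2 penalty) via stochastic subgradient steps.

    No bias term is fitted, which is fine here because the synthetic
    hyperplane passes through the origin.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    t = 0
    order = list(range(len(data)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            shrink = 1.0 - eta * lam  # gradient of the L2 penalty
            w = [shrink * wi for wi in w]
            if margin < 1.0:  # hinge loss is active for this example
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def accuracy(w, data):
    """Fraction of examples whose sign(w . x) matches the label."""
    hits = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        hits += pred == y
    return hits / len(data)


if __name__ == "__main__":
    rows = make_data(5000, 10, seed=1)
    train, held_out = rows[:4000], rows[4000:]
    w = pegasos_train(train)
    print("held-out accuracy:", accuracy(w, held_out))
```

Kernel SVM solvers do scale much worse with the number of rows, so a slow run may say more about the solver and kernel chosen than about SVMs as a model class.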

Finally, I would recommend against applying Deep Learning to healthcare unless you are dealing with dense signals such as ECG, pathology slides, CT/MRI, etc. Even with access to medical data on ~40 million patients, I don't use Deep Learning, because the amount of noise is enormous and careful elimination of confounding effects is not possible with DL methods.
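As a rough illustration of the label-noise concern, the sketch below simulates binary labels that are flipped at random and scores an oracle that always predicts the true label. Measured against the noisy labels, even that oracle only reaches about one minus the flip rate. The 20% flip rate and the function name `noisy_label_ceiling` are assumptions chosen for illustration, not estimates of the actual error rate in EMR/EHR codes.

```python
import random


def noisy_label_ceiling(n=100_000, flip_prob=0.2, seed=0):
    """Accuracy of a perfect predictor when scored against randomly flipped labels."""
    rng = random.Random(seed)
    true_labels = [rng.randint(0, 1) for _ in range(n)]
    # Each recorded label is flipped independently with probability flip_prob.
    observed = [1 - y if rng.random() < flip_prob else y for y in true_labels]
    # The oracle predicts the true label, so it "misses" exactly the flipped rows.
    agree = sum(1 for y, o in zip(true_labels, observed) if y == o)
    return agree / n


if __name__ == "__main__":
    print("oracle accuracy vs. noisy labels:", noisy_label_ceiling())
```

Real coding errors are unlikely to be this uniform, so this only shows the ceiling effect and does not model how the noise is actually distributed.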

[1] http://www.deepvideoanalytics.com/

[2] http://www.computationalhealthcare.com/

[3] https://hcup-us.ahrq.gov/
