Biomedical scientists often need to pool data for statistical power. This requires time-intensive retrospective integration. Researchers from DTL partner University Medical Center Groningen (UMCG) and McGill University (Montreal, Canada) recently presented a system to address this challenge in the scientific journal Bioinformatics.
Chao Pang, PhD student at UMCG and first author on the paper, explains: “The sizes and numbers of biobanks, patient registries and other data collections are increasing. Nevertheless, biomedical scientists still often need to pool data for statistical power. We developed MOLGENIS/connect, which is a semi-automatic system to find, match and pool data from different sources.”
MOLGENIS/connect shortlists relevant source attributes from thousands of candidates. It uses ontology-based query expansion to overcome variations in terminology. Next, it generates algorithms that transform source attributes to a common target DataSchema. These include unit conversion, categorical value matching and complex conversion patterns.
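To make these steps concrete, here is a minimal Python sketch of the three ideas described above: expanding a search term with synonyms, shortlisting matching source attributes, and applying simple conversions. It is an illustration under stated assumptions, not the MOLGENIS/connect implementation. The synonym table, the category mapping and all function names are hypothetical, and a small hand-written dictionary stands in for a real ontology.

```python
# Hypothetical synonym table standing in for an ontology lookup (assumption).
SYNONYMS = {
    "hypertension": {"hypertension", "high blood pressure"},
    "height": {"height", "body height", "stature"},
}

# Hypothetical mapping from source category codes to target categories.
SEX_MAP = {"m": "male", "f": "female"}


def expand_query(term):
    """Return the lowercased term plus any synonyms listed for it."""
    key = term.lower()
    return SYNONYMS.get(key, {key})


def shortlist(target_term, source_labels):
    """Keep source attribute labels that contain any expanded term."""
    terms = expand_query(target_term)
    return [
        label
        for label in source_labels
        if any(term in label.lower() for term in terms)
    ]


def cm_to_m(value_cm):
    """Unit conversion: centimetres to metres."""
    return value_cm / 100.0


def match_category(value, mapping):
    """Categorical value matching; returns None when no match is known."""
    return mapping.get(value.strip().lower())


if __name__ == "__main__":
    labels = [
        "High blood pressure diagnosed",
        "Body weight (kg)",
        "History of hypertension",
    ]
    print(shortlist("Hypertension", labels))
    print(cm_to_m(180))
    print(match_category(" M ", SEX_MAP))
```

In this sketch, values that cannot be matched return None rather than a guess, mirroring the semi-automatic approach in which a human reviews what the system could not resolve.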
Measured against algorithms written by human experts, the system auto-generated 27% of the algorithms perfectly. An additional 46% needed only minor editing. Chao Pang: “MOLGENIS/connect will thus reduce the human effort and expertise needed to pool data.”
Source code, binaries and documentation are available as open source under the LGPLv3 license from http://github.com/molgenis/molgenis and www.molgenis.org/connect.