Challenges to using big data in health services research

Given the shift in healthcare toward digital storage of information, there has been an increase in the number of studies using administrative databases. These databases provide a powerful tool to conduct research on outcomes, health services, and epidemiology. However, they have limitations and biases that should be considered. Given the sensitive information regarding patients’ health in the databases, security clearances must be granted before data is accessed. Furthermore, algorithms that link different variables to create a cohort of people with a specific disease are imperfect and may not yield an accurate representation. Because of the large volume of records, a finding may be statistically significant yet clinically unimportant. Despite these limitations, administrative databases provide powerful data that researchers can use to identify gaps in performance and improve the healthcare system.


ethics, privacy, and confidentiality considerations
Administrative databases contain sensitive information about a patient's health, diagnoses, demographics, and place of residence. This information can be used to deduce other factors such as socioeconomic status. 6 Information stored in administrative databases is protected under Canadian laws governing rights and privacy. Legislation safeguards the storage of and access to this information, such as Ontario's Freedom of Information and Protection of Privacy Act (FIPPA) and Personal Health Information Protection Act (PHIPA). 7,8 Since privacy and confidentiality are of utmost importance, thorough clearance checks are required before individuals or organizations can be granted access to the databases, which can delay access to data. 6 To safeguard anonymity, data is often transformed, excluded, and/or encrypted before being stored in the databases. 6 This causes a time lag between when the data is recorded and when it is transferred to the databases. Thus, conducting research in 'real time' may be very challenging. Consequently, studies using administrative databases are limited to retrospective designs. Researchers and clinicians employing a health intervention may not know the results until months or years after the intervention has been completed.
rules and regulations as barriers
RDC, CIHI, and ICES provide guidelines on navigating the steps required to access databases. For example, CIHI allows graduate students working on a project using administrative databases to request access through the Graduate Student Data Access Program (GSDAP). 9 However, additional barriers exist. For example, if the supervisor has funding to conduct the research, students cannot file an application through the GSDAP. 9 Independent researchers affiliated with a university require funding to access databases, which often poses a financial barrier to such research. The minimum funding required to conduct a rudimentary study can reach tens of thousands of dollars, with the average funding required to conduct more thorough research being significantly higher. 11 Additionally, students wanting to access these health databases to conduct research in fulfilment of their independent project, thesis, or dissertation may require a faculty supervisor affiliated with the organization providing access to the databases. This poses another challenge because students must find a supervisor who not only matches their research interests but is also a member of the desired organization. For example, ICES has a student program available only to students supervised by a faculty member affiliated with ICES. RDC and CIHI, on the other hand, do not require supervision by a faculty member affiliated with these organizations before access to the databases is granted, thereby providing easier access. 9,10 Publicly funded faculty members and students can submit a proposal to ICES, which can reduce costs but does not eliminate them entirely.

reliability of the data and code accuracy
Administrative databases are typically intended for financial and administrative management rather than for research purposes. In other words, they were created to monitor utilization, determine the amount of resource consumption, or determine the required capacity for a service. Although administrative data contains a large amount of information that can determine hospital performance and identify where the current healthcare system is lacking, the degree of detail and accuracy of the data within the databases may vary for many reasons. 12 The quality of information in these systems depends on incentives for data reporting, such as financial and systematic support. 13 This has meant that expensive medical procedures are documented more thoroughly than less costly health services, such as ambulatory services. 14 One of the most common types of error in these databases is absent values, which occur when required information is missing or was never entered. Administrative data may therefore be incomplete and may not be the most accurate source of information.
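As a rough illustration of how absent values might be quantified before analysis, the sketch below counts missing entries per field in a small extract. The records, field names, and the rule that blank strings count as missing are all assumptions made for this example; they do not describe the layout of any particular administrative database.

```python
from collections import Counter

# Hypothetical extract: every field name and value here is invented for illustration.
records = [
    {"patient_id": "A1", "diagnosis_code": "J44", "postal_code": "K1A", "discharge_date": "2019-03-02"},
    {"patient_id": "A2", "diagnosis_code": "", "postal_code": "M5V", "discharge_date": "2019-03-05"},
    {"patient_id": "A3", "diagnosis_code": "J44", "postal_code": None, "discharge_date": None},
    {"patient_id": "A4", "diagnosis_code": None, "postal_code": "N6A", "discharge_date": "2019-04-11"},
]


def is_missing(value):
    """Treat None and blank strings as absent values (an assumed convention)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_rates(rows, fields):
    """Return, for each field, the fraction of rows in which it is absent."""
    if not rows:
        raise ValueError("no records to assess")
    counts = Counter()
    for row in rows:
        for field in fields:
            if is_missing(row.get(field)):
                counts[field] += 1
    return {field: counts[field] / len(rows) for field in fields}


fields = ["patient_id", "diagnosis_code", "postal_code", "discharge_date"]
rates = missing_rates(records, fields)
for field, rate in rates.items():
    print(f"{field}: {rate:.0%} missing")
```

Reporting missingness field by field, rather than as a single overall figure, makes it easier to see whether the gaps fall in the variables a study depends on.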
One of the core elements of the administrative data system is the International Statistical Classification of Diseases and Related Health Problems (ICD; the most current version is ICD-11), which is a medical classification list maintained by the World Health Organization (WHO). 1,2,13 Using various algorithms to link specific ICD codes can identify a cohort of patients with the disease of interest, such as chronic obstructive pulmonary disease (COPD). 15 For example, information such as cumulative patient profile, ICD codes, and prescriptions can be linked to identify individuals with a certain type of disease. 16 However, there is a risk of bias, as the algorithms are not perfectly accurate. Accuracy can fluctuate depending on which variables are included or excluded. 17,18 For example, Cooke et al were able to improve the accuracy of an algorithm that identifies patients with COPD by including albuterol, ipratropium, and age as additional variables. 17 One indicator of the accuracy of an algorithm is the positive predictive value (PPV), which refers to the probability that disease is present given a positive result. 19,20 A study by Brenner and Gefeller has shown that PPV changes depending on the prevalence of the disease. 19 Using a simulation study, they showed that PPV dropped from 95% to 60% when the disease prevalence decreased from 90% to 20% even though the algorithm did not change. Therefore, the algorithmically predicted probability of a patient having the disease decreases when disease prevalence decreases. 1,19,20 This is an important issue to consider when conducting research using administrative databases because most validation studies use samples consisting entirely of people with the appropriate ICD code rather than samples drawn from the population. A mixture of people with and without the ICD code will have lower disease prevalence than samples consisting entirely of people with the ICD code and will provide a more accurate representation. 1
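The dependence of PPV on prevalence follows from Bayes' theorem, and a short calculation may help make it concrete. The sketch below is a minimal illustration under a simplifying assumption: the case-finding algorithm is summarized by a fixed sensitivity and specificity. The 90% values are hypothetical and are not the parameters of the Brenner and Gefeller simulation, so the printed figures will not match theirs.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' theorem.

    PPV = true positives / (true positives + false positives),
    with both terms expressed as fractions of the whole sample.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0:
        raise ValueError("PPV is undefined when no one tests positive")
    return true_pos / (true_pos + false_pos)


# Hypothetical algorithm characteristics, assumed constant across samples.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

# The same algorithm applied to samples with progressively lower prevalence.
for prevalence in (0.90, 0.50, 0.20, 0.05):
    value = ppv(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.0%} -> PPV {value:.1%}")
```

The algorithm itself never changes in this calculation; only the make-up of the sample does, which is why a PPV estimated in a sample enriched with coded cases tends to overstate performance in the general population.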

clinical significance
One of the advantages of administrative database research is the large volume of data. However, because very small variations between groups in large databases can produce statistically significant differences, caution should be taken in interpreting the results as clinically significant. 1 Absolute and relative differences should be used instead of the P-value to gauge the clinical importance of a finding. Since confidence intervals (CIs) are generated around the absolute or relative difference between populations, they show both the size of the difference and its precision, and can therefore help distinguish between clinical and statistical significance.
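A worked example may make the distinction clearer. The sketch below compares two hypothetical groups of one million records each, with event rates of 10.2% and 10.0%; the counts are invented for illustration. It uses a simple Wald interval and z-test for a difference in proportions, which is one common choice among several and is assumed here only for brevity.

```python
import math
from statistics import NormalDist


def risk_difference(events_a, n_a, events_b, n_b, level=0.95):
    """Absolute risk difference (A minus B) with a Wald CI and two-sided P-value."""
    p_a = events_a / n_a
    p_b = events_b / n_b
    diff = p_a - p_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Two-sided P-value from the normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se)))
    return diff, ci, p_value


# Hypothetical counts: 102,000 events vs 100,000 events per 1,000,000 records.
diff, ci, p_value = risk_difference(102_000, 1_000_000, 100_000, 1_000_000)

print(f"absolute difference: {diff * 100:.2f} percentage points")
print(f"95% CI: {ci[0] * 100:.2f} to {ci[1] * 100:.2f} percentage points")
print(f"P-value below 0.001: {p_value < 0.001}")
```

Under these assumed counts the P-value is very small, yet the whole interval sits well below half a percentage point. Whether a difference of that size matters is a clinical judgment that the P-value alone cannot settle.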
conclusion
Healthcare is shifting toward digital storage of information, facilitating an increase in studies using administrative databases. These databases provide a powerful tool to conduct studies on outcomes, health services, and epidemiology. However, they have limitations and biases that should be considered when embarking on research. Given the presence of sensitive information in the databases, access can be quite difficult, and the strict policies in place as safeguards pose an additional barrier. Furthermore, algorithms that link variables to create a cohort with a specific disease may not yield the most accurate representation. Because of the large volume of records, results may be statistically significant yet clinically unimportant. Although there are drawbacks to using administrative databases, they still provide real-world data that researchers can use to improve the current healthcare system.