In conversation with Dr. Joseph S. Ross: Utility and issues around clinical datasets, the example of the YODA project
How to cite this article: Arjuman A. Editorial podcast: In conversation with Dr. Joseph S. Ross: Utility and issues around clinical datasets, the example of the YODA project. Indian J Med Res. 2026;163:134-8. doi: 10.25259/IJMR_773_2026
Evidence-based medicine is the primary driver for delivering improved patient outcomes. Critical monitoring of disease trends and outcomes has become cardinal for benchmarking treatment standards and for enabling personalized treatments through evaluation of their safety and efficacy across populations. With today's digitally driven environment providing greater access to global data, bridging patient outcome data from every corner of the world is now a possibility. Machine learning and AI-powered algorithms can enable early diagnosis of diseases, provided the access points for clinical outcome data are unified. In this context, dashboards for research datasets are a viable and feasible solution, particularly for collating data from clinical trials, as they can support diverse, in-depth, and multifocal analyses of the available data. In this podcast, we discuss the utility and issues around developing such a clinical dataset dashboard with Dr. Joseph S. Ross, highlighting the example of the YODA (Yale University Open Data Access) project.
Dr. Ross’s research focuses on healthcare delivery research and policy, including policy analysis and the implementation of ambulatory care services for older adults and other vulnerable populations. He is also extensively involved in addressing issues related to pharmaceutical and medical device regulation, evidence development, postmarket surveillance, and clinical adoption. His scholarly profile boasts an H-index of 93 and more than 700 publications.

- Prof. Joseph S. Ross
- Center for Outcomes Research and Evaluation, Yale New Haven Hospital, New Haven, Connecticut & Section of General Internal Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, Connecticut, USA.
SPOTLIGHT: Dr. Joseph S. Ross
Large clinical datasets are presumed to have greater parametric representativeness. Could you throw some light on the drivers behind segregating clinical datasets, and on the translatability of their analyses and the results thereof?
It’s an interesting question. There are a lot of different types of clinical datasets. Each comes with its own strengths and weaknesses, advantages and disadvantages, depending on the type of research question you plan on asking. That, in turn, helps decide which clinical trial datasets can be used to address specific research questions. However, when people think of observational datasets for asking research questions, they are often thinking of claims data or data from the electronic medical records that health systems collect, or even data such as registries run by insurance payers or product data from manufacturers. Each of these differs in representativeness in terms of the patient population whose data are being collected, as compared to those who actually receive treatment in the real world. So it depends on the type of research question we are asking as to which type of data is going to be the best for it.
Could you please elaborate on that specifically in terms of clinical trial datasets?
Clinical trial datasets, at least with respect to their representation, are not going to be very representative, because clinical trials have certain biases associated with them. First, there is participation bias: individuals have to be willing to participate in a clinical trial, and those types of people, we’ve learned over the years, tend to be a little healthier than the typical individual being treated in the real world. But there are also the inclusion and exclusion eligibility criteria that affect the representativeness of a clinical trial dataset. Often, clinical trials are conducted to test whether a medical product or other intervention is safe and effective. To do that, typically one would try to exclude patients with specific comorbidities that may confound the trial. So people with liver failure, kidney failure, or heart disease won’t be eligible to participate, or perhaps people older than 80, and so on. Such criteria are used intentionally to make the internal validity of the clinical trial as high as possible, but of course that limits its external generalizability.
Clearly, this also touches on the sensitivity of the data! What do you think are the broader challenges in obtaining consent for generating such datasets, keeping in mind patient safety, identification, and eventual de-identification?
That is an excellent question, and one that I believe we really are going to have to struggle with for 5-10 years. For clinical trials, consent is often given as part of participation in the trial and includes sharing the data for others to use after the trial is complete. So, when you think about clinical trial data and consent, it is less of an issue, but for many other observational data sources, it is much more of an issue. For instance, when a medical product manufacturer designs a registry to collect data on everybody getting an intervention, that will often include consent for people to participate. But when somebody uses data from a health system for a research study, it is unlikely that the patients have directly given consent, and that’s because the health system says it will go through all the steps needed to ensure the security and privacy of the patients and will only make anonymized data available. This is done in different ways, including putting ages into bands and not providing addresses, dates of birth, etc. Others will tell you that if you have access to the world’s data, you can find out who is who by cross-referencing with other data platforms. In today’s world, digital data are so readily available, and the techniques to analyze them are increasingly accessible to anyone through AI software, that I think we just have to learn how to deal with that risk.
In the case of observational studies using archived samples, regulatory guidelines dictate obtaining consent, on which there is probably greater adherence in the USA compared to the global south. At the journal's end, we observe that awareness in this context is still limited. What are your experiences?
I think it is similar in the US, to be honest. The datasets described are all digital health data; none are blood/tissue samples or genetic data, but those do require consent. For instance, in the US there is a large health system called the US Department of Veterans Affairs, for individuals who served in the military, with hospitals across the US. It ran a large study collecting genetic data from as many veterans as were willing to participate, and everyone had to prospectively consent to provide genetic samples as part of the work. But you can imagine how these data could be misused, including the implications for health insurance coverage of genetic diseases, particularly in the US. Because we do not have a national health system, these data could determine who gets covered and who doesn’t. For that reason, I think we will always be in a position where we will require consent to use genetic data for research.
What are your thoughts and experiences around the validity of anonymization of data that are deposited in a clinical dataset?
I’ve been involved in a clinical trial data sharing enterprise for more than a decade now, and to my knowledge, no one has ever re-identified an individual working with data of that sort. At Yale we run the YODA project (https://yoda.yale.edu/), where we now make data available from close to 500 completed clinical trials. We have a pharmaceuticals partner, which makes data available from clinical trials of drugs, biologics, medical devices, and consumer products. We have taken additional steps to try to ensure data security and privacy. For instance, when an investigator requests to use the data, we rarely send it to them directly. Instead, we make the data available within a secure data platform with analytic programs embedded, such as R, and they can only use the data within that secure platform, without downloading it. Only the summary results can be downloaded, so these are essentially tables and the estimates from the analyses. We did that in part because, in the world of data privacy and security, when people talk about re-identification, it has to do not so much with the data itself but with the linkages between that data and other datasets. For instance, say someone participated in a clinical trial, and you had data from the health system they happened to go to, and you also had data from their credit card purchases; then you could triangulate to determine when they went to the hospital and what conditions they had, and link them in that way to re-identify somebody. It is rarely one data source that is the culprit for re-identification, unless there is a problem with the anonymization. By making sure that the data stay secure in that platform, it’s just an extra step to ensure their privacy and security.
Is that a best practice recommendation, coming from your experience from your dataset platform?
Yes, but that’s not how all clinical trial data are shared. Today, many investigators will host their dataset as part of publishing and disseminating their work. Usually such publications are not from major multinational companies, which are typically more concerned about liability issues. However, those investigators likely took consent from the participating patients to share their data. Such isolated studies probably collect less information than one might expect, which also makes it more difficult to re-identify somebody.
Based on your experiences from the large clinical dataset made available through the YODA project, what are the lessons learnt thus far in handling it-considering hurdles such as data harmonization, expertise gaps, etc.?
When we launched this platform ∼10 years ago, people had concerns, particularly about whether any company would share its trial data given the potential risk involved. But what we have shown over the years, having found an amazing company partner, is that nothing bad happened. In fact, a lot of good happened by sharing these data. There weren’t investigators just data dredging to find problems that they would then publish in high-impact journals and cause all sorts of concerns for patients, clinicians, or even regulators. Instead, a large group of investigators from around the world took advantage of these available data and conducted several valuable studies. We have done some analysis showing that the work that comes from these investigators is of equally high impact in terms of the journals in which it is published. It is of equal interest in terms of the reader scores available through Altmetric, and these are papers getting cited in clinical guidelines and other policy documents. It has been a great illustration of the value of openness and of making research data a public good that more people can use, rather than a private holding. The world is filled with good ideas, but not everybody has the resources to implement them, and by making these data available to more people, we’ve learnt things that we might not otherwise have.
Are you saying that such datasets could actually promote a greater chance of policy translations?
I believe so! When large companies or even an academic trial group run a big trial, they have a limited amount of resources to disseminate papers. My group has been in this position: we have run trials, there were staffing changes over time, and all the papers that you thought you were going to write don’t get done. Making the data available to more people increases the chances that the secondary end points of interest, for which the data were collected, get studied. Somebody can pool the data that you’ve collected with other data they may have collected to look for adverse events, and some of the common secondary analyses done with these data are actually about understanding the disease better. When you pool together all the placebo arms of clinical trials, you get a much better sense of disease prognosis and of how people’s symptoms and the functional burden of the disease change over time. There are several examples of how making these data available actually translates into clinical and policy impact.
Based on your lessons learnt, would you recommend achieving a multinational collaborative dataset, or do you think these should be regionally segregated, keeping in mind possible issues of diversity or ethnicity/race differences?
In today’s world, I would never recommend any sort of regionalization, because the infrastructure needed to generate these types of sharing efforts can be applied to any trial conducted in any part of the world. The YODA project is not the only clinical trial data sharing platform. There is also Vivli, another non-profit organization, adjunct to Harvard University, that does this, and the National Institutes of Health has had a big data sharing initiative run through the National Heart, Lung, and Blood Institute, where those data are available to other investigators. When I go back and think of the lessons learnt, one thing that comes to mind is that in the early days many people would request the data but didn’t necessarily have the expertise to work with them. There was an assumption that it was easy to analyze the data. In fact, clinical trial datasets are complicated in a different way than observational datasets, where you have thousands or tens of thousands of observations. In a clinical trial you may have only 100 patients, but you have thousands of observations over different points in time, so you have to concatenate the data in order to follow people through. There are other challenges too, and I have a distinct memory of someone requesting data early on and asking, who is going to help me analyze it? We make the data available; we do not analyze it for anyone. However, we have people go through a learning module before they access or request the data, to make sure they have the needed skills. There is a really good group in Europe, based out of the University of Utrecht, that is essentially trying to develop primers for people to better understand and develop expertise in how to share data. So I think we are taking the essential steps forward.
What is the age of the data that would be most valuable for such clinical datasets?
That’s a great question because, quite often, the clinical trial data that we share are quite old, but that doesn’t mean they are not useful. It does raise important questions about what else was happening in the healthcare system at the time, to know how generalizable the data are to today’s care. You need to think about the advantages and disadvantages of using these data versus something more recent. It is possible that for a drug there may be no recent trial, so this may be the best available option.
In the YODA dataset, what is the age of the latest data? How recent is it?
There are data covering ∼50-60 medical products. Some of the trials are old, but our pharma partner will make the trial data available essentially at the same time that they have procured both FDA and EMA approval for the products. So the data for some of the newest drugs that have come out of the company are there, as are the COVID-19 vaccine they developed and all the trials supporting its approvals. But it depends. I would say that our most requested trials are for products that have been on the market much longer: the SGLT2 inhibitors, the biologics used for ulcerative colitis, and the drugs used in schizophrenia. So there is still value and use in them.
Talking of the Pfizer data sharing experience during COVID, for instance, do you think there could be more stringent guidelines strongly urging pharma companies to partner with academic institutions and openly deposit their trial data, even if a trial was not successful?
I believe so! I think there is so much to be learnt from successful and unsuccessful product launches, and this gets to the point I made earlier. A lot of large companies see sharing data simply as a risk; they don’t see the potential benefit that can be derived from it, both in terms of what we learn about the product and the nuanced efficacy and safety insights that may allow us to understand its use in different subgroups, or heterogeneous treatment effects: it could work better for certain groups versus others, or for people taking a different medication or not. But it is also about trust. That is one of the things we really learnt through the YODA project. Making these data available for other investigators to interrogate and to use for their own research really does foster trust in the overall clinical research enterprise.
How do you handle errors around loss to follow-up in the case of trial data, as well as the lack of representativeness in rare diseases, etc.?
I think those kinds of issues are inherent to any clinical trial. The onus is really on the individual using the data for their own research to take them into account: to understand whether the lack of representativeness precludes them from using the data to answer the question they are trying to ask, or whether the loss to follow-up precludes them from doing the analysis they are attempting. All datasets are imperfect; we have yet to identify the perfect dataset. You just have to know what you are working with, understand those limitations, and do your best to address them. There are different ways, analytically, to think about some of those issues, and you give it the best you can.
Does the YODA dashboard provide any information on possible errors in a dataset to nudge an end user in the right direction?
We have tried to develop a number of ways to assist a potential user in understanding the advantages and disadvantages of a given dataset. On each trial’s landing page, when you are trying to decide which data to request for your research proposal, we provide key pieces of information, including how many people were enrolled in the trial, their average age, the proportion of males and females, and the proportion of minority race, as well as links to important metadata: the trial registration page at ClinicalTrials.gov, the trial protocol if one is available, and the primary publication from the trial if it was published. In ∼90% of cases these are published in some way, and if there is a data dictionary, we provide that as well. We hope this information allows investigators to evaluate upfront whether that is the right trial to address their questions.
What in your experience are the legal and privacy requirements in managing such datasets?
As I said before, we have had no issues related to re-identification from the datasets shared through the YODA project. We work with a legal team at the company, and they also contract out to a firm that does the anonymization. All those things cost money; I think that’s the thing to keep in mind. We make these data available for free to any investigator because our pharma partner pays for everything else out of their own pocket, the secure platform, the anonymization efforts, etc., to make such information available for all to use. So there are certain costs, but I think it has proven to be valuable.
What extent of engagement do you envision happening from the global south, particularly India in similar such efforts in the future?
I hope that our conversation leads more people from around the world, including from the global south and India specifically, to request these data, because they are really a great resource, and I want everyone to know about and take advantage of them. We have had a lot of users from around the world. Users from the US, Canada, Europe, and China are more common, but I think if more people know that such data can be used, then, with an understanding of the advantages and the disadvantages/limitations of such data, this can only improve our broader understanding of science. I would be more than happy to share my experiences. About 5 years ago, we published a paper1 that laid out the framework we adopted for sharing data, which could be of great use to people. We are actually preparing another paper now, since it has been 10 years, and we want to update our lessons learnt for others.
References
1. Overview and experience of the YODA Project with clinical trial data sharing after 5 years. Sci Data. 2018;5:180268.