Building Trust in Cancer AI Starts With Better Data
From imaging analysis to personalized treatment, AI is changing cancer care – but only if we build it right.

Artificial intelligence (AI) is rapidly transforming oncology – from enabling earlier, more accurate diagnoses to unlocking complex disease biology at the single-cell level.
In this conversation, Joe Day, senior business development executive (data/AI) at Cancer Research Horizons, speaks with Technology Networks about the most exciting developments in AI-powered cancer research, the critical importance of diverse, high-quality data and what it takes to build patient trust. Day also shares insights into ethical challenges, real-time model refinement and why involving patients at every step of the AI development journey is essential for ensuring impact and equity.
What are some of the most promising ways AI is currently being used to improve cancer diagnosis, treatment or research?
The most advanced application of AI in oncology is in medical imaging, where tools support radiologists in identifying lesions – helping to improve accuracy, reduce workloads and potentially enable earlier detection.
AI is also increasingly being used in digital pathology, with many companies using it to extract detailed phenotypic information about a patient’s disease directly from hematoxylin and eosin (H&E) images. This is particularly exciting where a patient’s tumor genomic or expression profile can be inferred directly from the H&E image, as this could lead to more cost-efficient, accessible companion diagnostics and potentially democratize personalized cancer care.
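To make the idea concrete, here is a minimal sketch of how such a model might be structured – a standard image backbone with a regression head over a gene panel. The architecture, tile size and panel size are illustrative assumptions, not any vendor’s actual pipeline, and a real system would be trained on paired H&E tiles and molecular profiles.

```python
# Minimal sketch: predicting gene-expression values from H&E image tiles.
# Assumes torch and torchvision; all sizes here are illustrative.
import torch
import torch.nn as nn
from torchvision import models

N_GENES = 50  # illustrative size of the expression panel to predict

class HEToExpression(nn.Module):
    """ResNet-18 backbone with a regression head over a gene panel."""
    def __init__(self, n_genes: int = N_GENES):
        super().__init__()
        backbone = models.resnet18(weights=None)  # pretrained weights optional
        backbone.fc = nn.Linear(backbone.fc.in_features, n_genes)
        self.model = backbone

    def forward(self, tiles: torch.Tensor) -> torch.Tensor:
        # tiles: (batch, 3, 224, 224) H&E tiles -> (batch, n_genes) predictions
        return self.model(tiles)

model = HEToExpression()
dummy_tiles = torch.randn(4, 3, 224, 224)  # stand-in for real H&E tiles
print(model(dummy_tiles).shape)            # torch.Size([4, 50])
```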
Looking forward, the next stage is the use of AI to interpret and analyze complex multi-modal spatial and single-cell data. These data are incredibly rich and could reveal cancer biology at the single-cell level in significant detail, but extracting that insight requires powerful, advanced analytics – and AI could be the answer. If AI tools can unlock the value of these novel data modalities, they could improve our understanding of cancer and lead to the identification of novel targets and biomarkers.
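As a rough illustration of the kind of analytics involved, the sketch below runs a bare-bones single-cell clustering workflow – normalization, dimensionality reduction, clustering – on simulated counts. Real analyses typically use dedicated toolkits such as scanpy, and every parameter here is an illustrative assumption.

```python
# Minimal single-cell clustering sketch on simulated counts (cells x genes).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(500, 2000)).astype(float)  # simulated data

# Library-size normalization and log transform, as in standard pipelines
normalized = counts / counts.sum(axis=1, keepdims=True) * 1e4
log_counts = np.log1p(normalized)

# Reduce dimensionality, then group cells into putative populations
embedding = PCA(n_components=30, random_state=0).fit_transform(log_counts)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embedding)
print(np.bincount(labels))  # number of cells per cluster
```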
Why is high-quality, diverse data so critical for training effective AI models in oncology?
AI is only as good as the data you feed it. In oncology, where the outputs of AI models could change diagnoses or treatment decisions with significant impacts on patients’ lives, AI tools must perform at a high level across diverse cases. Although newer techniques promise to reduce the quantity of data needed to train AI systems, high-quality, well-labeled data remains the best way to ensure algorithmic performance.
Importantly, these high-performing algorithms must work for all patients in all scenarios, and therefore, the diversity of the data that the AI is trained on is vital.
Diversity can mean ensuring that the algorithm has exposure to the heterogeneity of disease, or to patients of different races and genders, but it can also come down to technical considerations, such as the manufacturer of the equipment that generated a given dataset.
At Cancer Research Horizons, we have seen the importance of diversity first-hand with our large mammography dataset, OPTIMAM. The team has not only worked to include data collection sites from across the UK – to capture the ethnic diversity of the UK population – but has also made a concerted effort to accurately represent the different equipment manufacturers used in breast screening in the UK. This is crucial for ensuring that the dataset reflects the data an AI might see in the real world, and means the AI is more likely to cope with the clinical and technical differences it will encounter when deployed in the clinic.
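One practical consequence of this is routinely checking model performance per manufacturer, not just overall. The sketch below shows that kind of subgroup evaluation on simulated scores; the vendor names and numbers are placeholders, and in practice the inputs would come from a held-out test set annotated with acquisition metadata.

```python
# Sketch: per-manufacturer evaluation of a classifier on simulated data.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1000
manufacturers = np.array(["VendorA", "VendorB", "VendorC"])[rng.integers(0, 3, n)]
y_true = rng.integers(0, 2, n)                                    # simulated labels
y_score = np.clip(y_true * 0.4 + rng.normal(0.3, 0.25, n), 0, 1)  # simulated scores

# A gap in AUC between vendors would flag a technical-diversity problem
for vendor in np.unique(manufacturers):
    mask = manufacturers == vendor
    auc = roc_auc_score(y_true[mask], y_score[mask])
    print(f"{vendor}: n={mask.sum()}, AUC={auc:.3f}")
```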
Finally, clinicians' trust is crucial in ensuring effective AI implementation. Demonstrating the quality and quantity of data used to train a system will help reinforce confidence in the capabilities of AI and smooth the process of integrating it into clinical systems.
How can researchers ensure that the data they produce is most useful for the training and development of AI tools?
The most important thing is to think about the potential for secondary use of the data from the very start of a research project – at the study design and recruitment stage. Clinical data is crucial for providing context to other forms of data, such as imaging or molecular profiling; without it, a dataset loses significant value for training algorithms to support clinical decisions.
Additionally, when recruiting patients, researchers must ensure that the consent is broad and clearly outlines the potential for commercial entities to access the data and samples collected at the end of the study. This ensures transparency and gives patients the opportunity to opt out if they are not comfortable with this possibility.
As the project progresses, the team needs to think about data collection and ensure that the data is standardized. Huge resources are currently spent on cleaning and curating data so that it can be used to train AI; if research groups collect data in a standardized manner, using clinical data models such as the Observational Medical Outcomes Partnership (OMOP) Common Data Model, we can reduce this downstream curation burden.
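As a flavor of what OMOP-shaped data looks like, the sketch below records a diagnosis using fields from the OMOP Common Data Model. The concept IDs and values are illustrative placeholders rather than a validated mapping.

```python
# Sketch: a diagnosis recorded in OMOP CDM-shaped rows (illustrative values).
person = {
    "person_id": 1001,
    "gender_concept_id": 8532,  # OMOP standard concept (here: female)
    "year_of_birth": 1962,
}

condition_occurrence = {
    "condition_occurrence_id": 5001,
    "person_id": 1001,
    "condition_concept_id": 4112853,     # placeholder breast cancer concept
    "condition_start_date": "2024-03-15",
    "condition_type_concept_id": 32020,  # placeholder provenance code
}

# Because every site records the same fields against standard vocabularies,
# datasets can be pooled and queried without bespoke downstream curation.
print(person["person_id"], condition_occurrence["condition_concept_id"])
```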
Finally, to enable reuse, prospective partners need to be able to understand the details of the dataset; therefore, creating a metadata catalog that describes the data and how it was created is crucial. Thinking upfront about how to capture this information will help in the creation of the catalog.
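A minimal, machine-readable catalog entry might look like the sketch below. The schema and field names are illustrative choices of my own; real catalogs often follow established standards such as DCAT.

```python
# Sketch: an illustrative metadata catalog entry for a dataset.
import json

catalog_entry = {
    "dataset_name": "ExampleMammographyStudy",  # hypothetical dataset
    "description": "Screening mammograms with biopsy-confirmed outcomes.",
    "collection_period": {"start": "2018-01-01", "end": "2023-12-31"},
    "sites": ["Site A", "Site B"],              # placeholder site names
    "modalities": ["digital mammography", "clinical outcomes"],
    "equipment_manufacturers": ["VendorA", "VendorB"],
    "consent_scope": "broad consent including commercial research use",
    "access_process": "application reviewed by a patient governance panel",
}

print(json.dumps(catalog_entry, indent=2))  # publishable alongside the data
```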
What are the ethical considerations around using real-time patient data to train or refine AI models in clinical settings?
The first thing to consider is transparency of use. One of the biggest issues we have heard from patients when discussing the use of data in AI training is that they want to understand how their data is being used. So, if you are aiming to use patient data from a clinical setting to support the training of AI tools, this must be clearly communicated. There are good examples of this happening – for instance, “patient data promises” that outline how data might be used and the protections that will be in place. Skipping this step can significantly damage trust.
The second element to think about is regulation. If we are using data in real time to train and refine algorithms, how are we monitoring the impact on performance? Current regulatory systems are not set up to enable real-time improvements: companies must submit static products for review by regulatory authorities. Companies will therefore need to think through their regulatory strategy and develop clear performance criteria, to ensure that changes to the algorithm drive improvements in patient outcomes and are appropriately reviewed and approved by regulators.
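In practice, such performance criteria can be encoded as an explicit release gate, so a retrained model is only promoted when predefined thresholds are met. The sketch below shows the idea; the metrics and thresholds are illustrative assumptions, not regulatory requirements.

```python
# Sketch: a release gate for a continuously refined model (illustrative).
ACCEPTANCE_CRITERIA = {
    "auc": 0.90,          # candidate must meet or exceed each threshold
    "sensitivity": 0.85,
}

def passes_gate(candidate_metrics: dict) -> bool:
    """Return True only if every predefined criterion is satisfied."""
    return all(
        candidate_metrics.get(metric, 0.0) >= threshold
        for metric, threshold in ACCEPTANCE_CRITERIA.items()
    )

candidate = {"auc": 0.93, "sensitivity": 0.84}  # simulated validation results
if passes_gate(candidate):
    print("Candidate meets criteria: proceed to review and release.")
else:
    print("Candidate blocked: performance criteria not met.")
```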
How can organizations build trust with patients and the public when it comes to using patient-derived data for AI development?
There are two key things that organizations need to do. The first is to build confidence that you have safe, secure mechanisms for enabling data access or sharing. This can be through secure data environments or encrypted data-sharing processes. Whatever approach you choose, it must be robust in ensuring that the data being shared is safe and that patient privacy is maintained.
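For the encrypted-sharing route, the sketch below shows a round trip with symmetric encryption using the widely used cryptography package. It deliberately leaves out key management and exchange, which matter at least as much as the encryption itself.

```python
# Sketch: encrypting a record for sharing with the `cryptography` package.
from cryptography.fernet import Fernet

key = Fernet.generate_key()      # in practice, held in a secure key vault
fernet = Fernet(key)

record = b"pseudonymized patient record"
token = fernet.encrypt(record)   # ciphertext that is safe to transmit
assert fernet.decrypt(token) == record
print("Round trip successful; only key holders can read the data.")
```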
The second is building patient oversight and ensuring transparency – this is a little more difficult. One of the biggest things we heard from patients when discussing the possibility of commercial use of their data was that they wanted to know how their data would be used and that, as long as this use was in the best interests of patients and the public, they were in favor. So, you must develop governance processes that allow patients to review commercial access requests, and be transparent about how data is being used.
At Cancer Research Horizons, we have embedded patient and public engagement into every aspect of our data partnering activities. First, we collaborated with patients to develop a set of guiding principles that form the foundation of all our partnerships. Second, we established governance structures that give patients a voice in every data access request – companies can only use Cancer Research Horizons data if their request is approved by patients and if they agree to publish a non-confidential, plain language summary of their work on our deal registry. Finally, we have formed a strategic patient panel that meets every six months to discuss emerging trends and issues in the field, ensuring we continue to align our actions with the expectations of the community.
Looking ahead, what excites you most about the future of AI in cancer care – and what concerns remain?
As I mentioned earlier, I’m excited about the potential of AI in imaging to improve patient care by unlocking deeper insights from images and boosting healthcare efficiency through reduced workloads and streamlined processes. AI also holds great promise for helping us unravel the complexity of disease. With biology and disease progression requiring multi-modal, high-resolution data, AI is uniquely positioned to extract meaningful insights from these rich but complex datasets.
I’m particularly encouraged by the rise of large language models (LLMs) and their potential to enhance AI explainability. Many current tools face adoption hurdles due to their “black box” nature. LLMs could bridge this gap by interpreting model outputs and communicating reasoning in ways clinicians can understand, which may build trust and accelerate adoption.
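As a loose illustration, the sketch below turns a model’s feature attributions into a prompt that an LLM could translate into clinician-facing language. The feature names and values are simulated, and the actual LLM call is provider-specific, so it is left out.

```python
# Sketch: building a plain-language explanation prompt from attributions.
attributions = {  # illustrative, SHAP-style feature contributions
    "lesion spiculation": 0.42,
    "microcalcification cluster": 0.31,
    "breast density": -0.08,
}

ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
feature_lines = "\n".join(f"- {name}: {value:+.2f}" for name, value in ranked)

prompt = (
    "A screening model flagged this mammogram as suspicious. Explain, in "
    "plain clinical language, which findings drove the score:\n" + feature_lines
)
print(prompt)  # this string would be sent to whichever LLM the clinic uses
```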
However, I do have concerns. One of the biggest is ensuring AI works for everyone. Many datasets used to train AI models lack diversity, raising the risk that tools underperform for ethnic minorities. Initiatives like the SAMBAI project – profiling patients of African heritage as part of Cancer Grand Challenges – are an encouraging step toward addressing this gap.
Another major concern is maintaining public trust. Past initiatives, like care.data, showed how easily trust can be lost. To prevent this, patients and the public must be meaningfully involved in the development, validation and deployment of AI systems, as I discussed earlier.