

Imperial’s Data Learning Group has created a guide for data scientists looking to tackle the common problems faced when working with real-world data.
The paper, published in the Journal of Computational Science, introduces the ‘Data Learning Paradigm’, which combines principles of machine learning, data science and data assimilation to address the imperfections of complex real-world data.
The Imperial team wrote the paper collaboratively after recognising the growing challenges associated with using real-world data in data-driven applications. As computational models grow more complex and data availability increases, data scientists lack a comprehensive list of frameworks to identify, understand, and address the imperfections inherent in real-world data.
According to Data Science Institute Director of Research and lead of the Data Learning Group Dr Rossella Arcucci: “The first paper published in 2025 by the Data Learning group is a service to the growing community of data scientists who face (and mitigate) challenges when working with real-world data.”
The paper categorises data deficiencies into five main types and provides a structured guide for researchers to identify and mitigate these issues, linking the theory to practical applications across various fields.
By outlining current methods and presenting real-world applications from healthcare, environmental modelling, social networks and planetary exploration, the article serves as a comprehensive resource for data scientists seeking to improve data quality and model performance, ultimately enhancing the credibility of insights derived from complex data-driven analysis.
Quality over quantity - ‘Good data’ for better quality models
Artificial intelligence models are a product of the data upon which they are trained, and no data collected from real-world scenarios is perfect, owing to the natural limitations of sensing and collection. Importantly, training a model on more data does not necessarily make it a better model: what is needed is better-quality data.
A ‘good’ model requires high-quality data and a comparatively comprehensive training regime to achieve sufficiently accurate results. High data quality is therefore of the utmost importance to data scientists looking to train a model or perform analysis for a real-world scenario.
In their paper, the Imperial team define good quality data as: “An Accurate reflection of the real world scenario, comprehensively Complete information, from a Reliable source, containing Relevant information and temporally relevant in its Timeliness.”
Data with these characteristics gives a balanced and accurate representation of the scenario in context, so models trained sufficiently and comprehensively on it are more likely to be successful.
Learning to be a good data scientist
Part of being a good data scientist is learning the right questions to ask about your data: Is this enough data for my investigation? Is this data too noisy to be useful? Is this dataset too large? Or too small?
In the paper, the Data Learning Group help early-career data scientists to identify these questions by highlighting the common imperfections inherent in working with real-world data. These imperfections can include the size of the dataset (too big or too small) and the nature of the dataset, whether that be unstructured, incomplete or too noisy.
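As a rough illustration of how two of these questions (is the dataset too small, and is it incomplete?) might be asked in practice, here is a minimal Python sketch. It is not taken from the paper: the function name `summarise_dataset`, the thresholds `min_rows` and `max_missing`, and the sample sensor records are all hypothetical and chosen only for illustration.

```python
def summarise_dataset(rows, min_rows=30, max_missing=0.2):
    """Flag simple size and completeness issues in a list of dict records.

    min_rows and max_missing are illustrative thresholds (assumptions),
    not values recommended by the paper.
    """
    n = len(rows)
    # Collect every field name that appears in at least one record.
    fields = sorted({key for row in rows for key in row})
    # Fraction of records in which each field is absent or None.
    missing = {
        f: (sum(1 for row in rows if row.get(f) is None) / n) if n else 1.0
        for f in fields
    }
    return {
        "n_rows": n,
        "too_small": n < min_rows,
        "missing_fraction": missing,
        "incomplete_fields": [f for f in fields if missing[f] > max_missing],
    }


# Hypothetical sensor readings with gaps.
rows = [
    {"temp": 20.1, "humidity": 0.41},
    {"temp": None, "humidity": 0.43},
    {"temp": 19.8, "humidity": None},
    {"temp": None, "humidity": 0.40},
]

report = summarise_dataset(rows, min_rows=3, max_missing=0.3)
print(report["incomplete_fields"])
```

Sensible thresholds depend on the application, so a check like this is only a starting point for a domain-specific assessment.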

Throughout the paper the group highlight how these problems can be tackled, drawing on examples of their own research in fields such as environmental monitoring, planetary exploration, healthcare analytics, linguistic analysis, social networks, and smart manufacturing.
Often, however, the solutions to common problems in data science are very domain-specific and the best solution will vary depending on the application and purpose of the dataset at hand.
For more information on the Data Learning Group visit their website.
‘Facing and mitigating common challenges when working with real-world data: The Data Learning Paradigm’ by Lever et al., published on 8 January 2025 in Journal of Computational Science.
Image source: Stephen Dawson, Unsplash.
Article text (excluding photos or graphics) © Imperial College London.
Photos and graphics subject to third party copyright used with permission or © Imperial College London.
Reporter
Gemma Ralton
Faculty of Engineering

Contact details
Email: gemma.ralton@imperial.ac.uk