Integrative Data Analysis and Exploratory Data Mining in Biological Knowledge Graphs

Modern life sciences rely on large volumes of data in many different formats, which model, in many different ways, a wide variety of interrelated species and phenomena at multiple scales. In this chapter, we show how to integrate and make sense of this wealth of data through digital applications that leverage knowledge graph models, which are well suited to flexibly connecting heterogeneous information. Furthermore, we discuss the benefits of this approach for data sharing practices, which maximise the opportunities to reuse integrated data in novel analyses and digital applications. KnetMiner, a genetic discovery platform that leverages knowledge graphs built from molecular biology data sources, is used as a representative use case for the concepts described.
Notes
This shortened URL can be used to see the GXA visualisation: https://tinyurl.com/ye3fq8mk
Acknowledgments
This work was supported by the UKRI Biotechnology and Biological Sciences Research Council (BBSRC) through the Designing Future Wheat ISP (BB/P016855/1), the FAIR BBR (BB/S020020/1) and DiseaseNetMiner TRDF (BB/N022874/1). CR and KHP are additionally supported by strategic funding to Rothamsted Research from BBSRC. We thank all past and present members of the KnetMiner Bioinformatics team at Rothamsted for their scientific input and software contributions, especially Joseph Hearnshaw, Martin Castellote, and Richard Holland.
Author information
Authors and Affiliations
- Rothamsted Research, Harpenden, UK: Marco Brandizi, Ajit Singh, Jeremy Parsons, Christopher Rawlings & Keywan Hassani-Pak