Aldo Guzmán, IBM: Topology in Computational Genomics
TLDR
Aldo Guzmán from IBM discussed the application of topology and geometry in computational genomics, focusing on the study of genetic properties and their relation to biological processes. He highlighted the use of algebraic topology tools like persistent homology in analyzing genetic data. Guzmán also covered the historical use of mathematics in biology and ecology, the significance of genotype and phenotype, and how gene expression levels are measured. He detailed the process of using topological signatures for machine learning in predicting phenotypes, such as Parkinson's disease, and the potential of persistent harmonic homology in identifying significant features in data related to diseases like lung cancer and chronic lymphocytic leukemia.
Takeaways
- 🧵 Aldo Guzmán from IBM discussed the application of topology and geometry in computational genomics, focusing on the study of genetic properties and data.
- 📊 Mathematics has been applied in biology and ecology contexts dating back to Fibonacci’s observations on rabbit population growth patterns.
- 🧬 Daniel Bernoulli conducted studies on life expectancy after smallpox inoculation, sparking debates on maximizing life expectancy and other population-related quantities.
- 🌱 Gregor Mendel’s experiments with plants led to the formulation of genetic laws on how information is passed across generations.
- 🔬 Scientific progress has shifted focus from macroscopic observations of populations to microscopic levels, such as studying cells and DNA.
- 🧬 The 20th century saw developments in math and statistics, rigorous treatments of Mendelian ideas, and the discovery of DNA, leading to genome sequencing.
- 💡 In the 21st century, artificial intelligence, machine learning, and neural networks have become important tools in genomics.
- 🌐 The concepts of genotype (hereditary information) and phenotype (observable properties) are central to studying genetic properties.
- 📊 Gene expression levels, which indicate how much a gene is expressed, can be measured through protein presence or messenger RNA, and are crucial for data analysis.
- 🔍 Persistent homology is used to analyze data by constructing filtered simplicial complexes and computing homology to identify topological features.
- 🏥 An application of topology in genomics is predicting phenotypes like Parkinson's disease from gene expression levels using machine learning and topological signatures.
- 📈 Persistence landscapes, a vector representation of persistent homology, can be used with convolutional neural networks to exploit spatial coherence in data for better predictions.
Q & A
What is the main topic of Aldo Guzmán's talk?
-The main topic of Aldo Guzmán's talk is the application of topology and geometry in computational genomics, which involves studying properties related to genetics and multiple types of data.
What is computational genomics?
-Computational genomics is a field that studies properties related to genetics and other 'omics' data types, using tools from algebraic topology to analyze these properties.
What is the significance of Fibonacci's observation in the context of biology?
-Fibonacci's observation of the growth pattern of rabbit populations under certain assumptions is significant as it relates to studies of growth and decline of populations of organisms, which has been a fundamental concept in biology and ecology.
How did Daniel Bernoulli's study contribute to the field of biology?
-Daniel Bernoulli's study contributed by attempting to determine the life expectancy of people after inoculation against smallpox, which led to debates and discussions about maximizing life expectancy and other factors affecting society.
What is the difference between genotype and phenotype in genomics?
-Genotype refers to the entire set of hereditary information or portions of the genome, often called genes, while phenotype refers to the observable properties, such as eye color, that can be directly observed.
What is gene expression and how is it measured?
-Gene expression refers to how much a gene is expressed, which can be measured by the amount of protein present in a cell or by measuring the messenger RNA itself, indicating the level of gene activity.
How does persistent homology relate to the study of genomics?
-Persistent homology is used to analyze filtered simplicial complexes constructed from genomic data, allowing researchers to compute homology and identify topological features that persist across different scales of data.
What is the purpose of using persistent landscapes in the context of machine learning?
-Persistence landscapes are used because they can be treated as elements of a vector space, so standard operations can be performed on them, which is beneficial for feeding the data into machine learning models.
How does harmonic persistent homology help in identifying important features in data?
-Harmonic persistent homology provides a way to select unique representatives from homology equivalence classes, allowing researchers to map features back to the topological space and the original data, identifying which features are important.
What is the potential impact of using harmonic weights in distinguishing subtypes of cancer?
-Using harmonic weights can improve the performance of machine learning models in distinguishing subtypes of cancer by scaling features based on their harmonic weights, which correspond to known genes associated with cancer, thus providing a more accurate representation of the data.
How does the future of computational genomics look in terms of data and methodology?
-The future of computational genomics looks promising with the anticipation of more data, especially with the advent of single-cell sequencing producing vast amounts of information. Additionally, more advanced mathematical and statistical methods, including AI and ML, are being developed to help analyze this data and gain insights into biological processes.
Outlines
🌐 Introduction to Topology in Computational Genomics
Aldo Guzmán-Sáenz from IBM introduces the role of topology in computational genomics. He discusses the intersection of algebraic topology with computational genomics, which involves studying genetic properties across various types of data known as 'omics'. Guzmán highlights the successful application of algebraic topology tools such as persistent homology in this field. He also provides a brief historical context of how mathematics has been applied in biology and ecology, starting with Fibonacci's observation on rabbit population growth, which has its roots in Indian mathematics as early as 200 AD. The narrative continues with the transition from macroscopic observations to microscopic levels, leading to the development of population studies and the understanding of life expectancy, as exemplified by Daniel Bernoulli's study on smallpox inoculation.
🧬 Historical Developments in Biological Mathematics
The paragraph delves into the historical developments in the application of mathematical models to biological studies. It starts with the debate between Daniel Bernoulli and Jean-Baptiste le Rond d'Alembert, which underscores the societal impact of mathematical models. The discussion then shifts to Gregor Mendel's work with plant experiments that led to the formulation of genetic laws, which are fundamental to understanding how information is passed across generations. The paragraph also touches on the advancements in microscopy, mathematical and statistical developments in the 20th century, and the discovery of DNA. It emphasizes the transition from studying populations to focusing on individual levels, such as the bacteria in our gut, which can be seen as both individual and population entities. The paragraph concludes with an explanation of the terms 'genotype' and 'phenotype', which are central to genomics.
🧬 Gene Expression and Its Measurement
This paragraph explains the concept of gene expression and how it is measured. Gene expression levels indicate how much a gene is expressed, which can be determined by measuring the amount of protein present in a cell or by measuring the messenger RNA itself. The process is simplified to illustrate the mechanism of gene expression: DNA in the nucleus is unzipped by enzymes, copied to create messenger RNA, which then leaves the nucleus and is used by ribosomes and transfer RNA to assemble proteins. The paragraph emphasizes the importance of gene expression levels in understanding cell functions and how they can be used to create a table of data with rows representing subjects and columns representing gene expression levels, which can be used for further analysis.
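The table described above can be sketched in a few lines of Python. This is only an illustration: the gene names, subject labels, and expression values are made up, and Euclidean distance is an assumed choice of how to compare subjects, not something the talk specifies.

```python
from math import dist

# Hypothetical toy table: each row is a subject, each column a gene
# expression level. Names and numbers are invented for illustration.
genes = ["GENE_A", "GENE_B", "GENE_C"]
expression = {
    "subject_1": [2.0, 0.5, 7.0],
    "subject_2": [2.0, 0.5, 3.0],
    "subject_3": [5.0, 4.5, 3.0],
}


def pairwise_distances(table):
    """Return {(row_a, row_b): Euclidean distance} for every unordered pair of rows."""
    names = sorted(table)
    return {
        (a, b): dist(table[a], table[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


distances = pairwise_distances(expression)
for pair, value in distances.items():
    print(pair, round(value, 3))
```

Treating each subject as a point in a space with one coordinate per gene is what turns the table into a point cloud, and a distance between subjects is the only ingredient needed to build a filtered complex on top of it.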
🧠 Predicting Phenotypes from Gene Expression
The focus of this paragraph is on using gene expression levels to predict certain phenotypes, such as Parkinson's disease. It describes the process of creating a table of data points and using algebraic topology, specifically persistent homology, to analyze these data points. The paragraph explains the construction of filtered simplicial complexes from the data and the computation of homology. It also includes an animation to illustrate the process of recovering the homotopy type of a simplicial complex using neighborhoods around points. The concept of barcodes is introduced as a way to represent how long holes in the data persist as the scale of the filtration grows. The application of this method to Parkinson's disease is mentioned, highlighting the potential of using topological signatures for patients in machine learning settings.
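To make the idea of a barcode concrete, here is a minimal sketch that computes only the degree-0 bars (connected components) of a Vietoris-Rips filtration on a toy point cloud. It is not the pipeline from the talk: the points are invented, the function name `h0_barcode` is hypothetical, and higher-degree holes would require a boundary-matrix reduction that a dedicated library normally provides.

```python
from itertools import combinations
from math import dist, inf


def h0_barcode(points):
    """Degree-0 persistence bars of the Vietoris-Rips filtration of a point cloud.

    Every point starts its own component at scale 0. An edge enters the
    filtration at the distance between its endpoints; when it joins two
    different components, one of them dies at that scale.
    """
    if not points:
        return []
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    bars = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            bars.append((0.0, length))
    bars.append((0.0, inf))  # the last remaining component never dies
    return bars


# Two tight pairs of points that are far from each other.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.0, 2.0)]
bars = h0_barcode(points)
print(bars)
```

Short bars correspond to points that merge quickly, while the long finite bar records the gap between the two groups; reading a barcode amounts to separating such long-lived features from short-lived ones.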
🔍 Machine Learning and Topological Data Analysis
This paragraph discusses the application of persistent homology in machine learning, specifically for predicting phenotypes from gene expression levels. It describes the process of enriching a point cloud with weights based on sample values and using this to construct a weighted Vietoris-Rips complex. The resulting persistence landscapes, which are elements of a vector space, are then used as inputs for machine learning models. The paragraph emphasizes the use of convolutional neural networks (CNNs) for their ability to exploit spatial coherence in the representation. It also mentions the performance of the model in predicting Parkinson's disease, noting its low false positive rate. The discussion concludes with the idea of mapping features back to the topological space and original data to understand which features were important in the machine learning model.
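As a rough illustration of how a barcode becomes a vector, the sketch below samples landscape functions on a grid using the standard definition: each finite bar (b, d) contributes a "tent" max(0, min(t - b, d - t)), and the k-th layer at t is the k-th largest tent value. The bars, the grid, and the function name `landscape` are assumptions chosen for the example, and infinite bars are assumed to have been removed or truncated beforehand.

```python
def landscape(bars, k, grid):
    """Sample the k-th landscape layer (k = 1 is the outermost) at each t in grid.

    Assumes every bar (birth, death) is finite.
    """
    values = []
    for t in grid:
        tents = sorted(
            (max(0.0, min(t - birth, death - t)) for birth, death in bars),
            reverse=True,
        )
        values.append(tents[k - 1] if k <= len(tents) else 0.0)
    return values


# Invented barcode with two nested bars.
bars = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

layer1 = landscape(bars, 1, grid)
layer2 = landscape(bars, 2, grid)

# Stacking the layers gives a small 2-D array of numbers.
image = [layer1, layer2]
print(image)
```

Because neighboring grid points and neighboring layers carry related values, the stacked array behaves somewhat like an image, which is one way to read the remark about CNNs exploiting spatial coherence. Landscapes sampled on the same grid can also be added or averaged entry by entry.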
🧬 Persistent Harmonic Homology for Feature Mapping
The paragraph introduces persistent harmonic homology, a generalization of harmonic chains to the persistent setting, which allows for the selection of representative cycles in homology. This method provides a way to map features back to the topological space and then to the original data, assigning weights based on the harmonic representatives. The process involves constructing a barcode and then propagating the information to lower orders, which can be used to directly influence the feature space. The paragraph discusses how this method can be applied to different problems and how it has been implemented to get actual results on real data.
🧬 Applications of Harmonic Persistent Homology in Cancer Subtyping
This paragraph discusses the application of harmonic persistent homology in distinguishing subtypes of lung cancer and other cancers using the TCGA dataset. It explains how traditional tools like PCA do not separate subtypes well, but supervised learning and the use of harmonic weights can improve performance. The paragraph also mentions the use of harmonic weights in breast cancer subtyping and chronic lymphocytic leukemia studies. The results show that genes with higher harmonic weights are known to be associated with lung cancer, confirming the method's effectiveness. The paragraph highlights how this approach can help overcome challenges in studies with limited data points and discover new patterns in an unsupervised manner.
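The weighting step can be pictured as a simple rescaling of the feature columns before a classifier or clustering method is applied. The sketch below shows only that rescaling: the weights are invented placeholders, not harmonic weights computed from any barcode, and `scale_features` is a hypothetical helper name.

```python
def scale_features(rows, weights):
    """Multiply each column (one per gene) by its weight."""
    if any(len(row) != len(weights) for row in rows):
        raise ValueError("every row needs exactly one value per weight")
    return [[value * w for value, w in zip(row, weights)] for row in rows]


# Invented expression rows (subjects) and placeholder per-gene weights.
rows = [[1.0, 4.0], [3.0, 8.0]]
weights = [2.0, 0.5]

scaled = scale_features(rows, weights)
print(scaled)
```

After scaling, genes with larger weights contribute more to any distance-based method, so differences in those genes separate subjects more strongly than differences in down-weighted genes.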
📈 Future Directions in Genomic Data Analysis
The final paragraph discusses the future of genomic data analysis, emphasizing the increasing amount of data available, such as single-cell sequencing data. It highlights the importance of advanced mathematical and statistical methods, as well as AI and ML techniques, in handling this data and gaining insights into biological processes. The paragraph concludes by emphasizing the importance of conferences in fostering the development of these advanced methods and the potential for these tools to move research forward.
🌟 Conclusion and Q&A
The speaker concludes the talk by reiterating the importance of Topological Data Analysis (TDA) in the context of the larger picture of connecting various tools to gain insights into biological structures. The speaker leaves the audience with a sense of the potential of these methods to move research forward and invites questions from the audience.
Keywords
💡Topology
💡Computational Genomics
💡Persistent Homology
💡Omics
💡Genotype and Phenotype
💡Gene Expression Levels
💡Simplicial Complex
💡Barcodes
💡Persistence Landscapes
💡Convolutional Neural Networks (CNNs)
💡Harmonic Persistent Homology
Highlights
Aldo Guzmán, a topologist from IBM, discusses the application of topology in computational genomics.
Topology and geometry are used to study properties relating to genetics and multiple types of data in computational genomics.
Algebraic topology tools, such as persistent homology, have been successfully applied in this field.
Mathematics has a history of application in biology, including Fibonacci's observations on rabbit population growth.
Demographic studies, like Daniel Bernoulli's work on life expectancy after smallpox inoculation, show mathematics' impact on society.
Gregor Mendel's experiments led to the formulation of genetic principles passed from parents to offspring.
The 20th century saw developments in math and statistics, including rigorous treatments of Mendelian ideas.
DNA's experimental discovery and genome sequencing have been facilitated by advancements in computing.
In the 21st century, artificial intelligence and machine learning have become integral to genomics.
Genomics studies genotype, the hereditary information, and phenotype, the observable properties like eye color.
Gene expression levels indicate how much a gene is expressed and can be measured through various lab methods.
Data analysis involves creating a table with gene expression levels to predict phenotypes, such as Parkinson's disease.
Persistent homology is used to establish a topological signature for patients, which can be applied in machine learning.
Vietoris-Rips complex construction and barcodes are used to visualize the persistence of holes in topological data.
Persistence landscapes are used to vectorize topological data for machine learning, exploiting spatial coherence.
Convolutional neural networks (CNNs) are used to interpret topological data as images for pattern recognition.
The model's performance in predicting Parkinson's disease was good, with fewer false positives compared to other methods.
Persistent harmonic homology allows mapping features back to the topological space and original data for interpretability.
Harmonic persistent homology representatives can be used to weight features, improving machine learning model performance.
In lung cancer studies, harmonic weights helped distinguish between subtypes, aligning with known literature on gene associations.
Breast cancer subtyping using unsupervised clustering with harmonic weights showed coherent behavior in clusters.
In chronic lymphocytic leukemia, changes in gene expression before and after treatment were identified using harmonic homology.
Future developments in genomics will likely involve more data and advanced mathematical, statistical, AI, and ML methods.