09:30 - 10:30, February 28th (Thu), 2019 (Tentative)
Big data and big computation can be applied to many classes of problems -- but one of the most compelling is /discovery/ within the sciences, medicine, and engineering. Many community-scale efforts, such as genetics data repositories, the Sloan Digital Sky Survey in astronomy, and phylogenetic tree databases in biology, predate the term "big data." Such efforts have been most successful when the data, acquisition methods, and use cases are relatively homogeneous. A key challenge moving forward is sharing across /heterogeneous/ and evolving communities, which requires more than centralized points of data access: it calls for ecosystems of many interoperable big data tools, standards, and repositories. Using our experiences in building open data ecosystems for the neuroscience community as a model, I will describe the major challenges we have encountered, what progress has been made, and key open big data problems.
Zachary Ives is the Chair and Adani President's Distinguished Professor of Computer and Information Science at the University of Pennsylvania. He is a co-founder of Blackfynn, Inc. <http://www.blackfynn.com/>, a company focused on enabling life sciences research and discovery through data integration. Zack's research interests include data integration and sharing, big data analytics, scientific data management, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He is a co-author of the textbook /Principles of Data Integration/, and has received 10-year most-influential paper awards from the International Conference on Data Engineering (2013) and International Semantic Web Conference (2017). He has served as a Program Co-Chair for the ACM SIGMOD conference (2015) as well as an Associate Editor for Proc. VLDB, the VLDB Journal, and the IEEE Transactions on Knowledge and Data Engineering.
09:00 - 10:00, March 1st (Fri), 2019 (Tentative)
Personalized medicine has been hailed as one of the main frontiers for medical research in this century. In the first half of the talk, we will give an overview of our projects that use large, complex data sets for biomarker discovery. In the second half of the talk, we will describe some of the challenges involved in biomarker discovery. One of these challenges is the lack of quality assessment tools for data generated by ever-evolving genomics platforms. We will conclude the talk with an overview of some of the techniques we have developed for data cleansing and pre-processing.
Raymond’s main research area for the past two decades has been data mining, with a specific focus on health informatics and text mining. He has authored over 200 peer-reviewed publications on data clustering, outlier detection, OLAP processing, health informatics and text mining. He is the recipient of two best paper awards: one from the 2001 ACM SIGKDD conference, the premier data mining conference in the world, and one from the 2005 ACM SIGMOD conference, one of the top database conferences worldwide. For the past decade, he has co-led several large-scale genomic projects funded by Genome Canada, Genome BC and industrial collaborators. Since the inception of the PROOF Centre of Excellence, which focuses on biomarker development for end-stage organ failures, he has held the position of Chief Informatics Officer of the Centre. From 2009 to 2014, Dr. Ng was the associate director of the NSERC-funded strategic network on business intelligence.