Please tune in to Episode 4 on Apple Podcasts, Google Podcasts, Amazon Music, Spotify, and Firstory.

Transcripts:

Intro:

How can statistics aid research on the health effects of air pollution and climate change? What is the underlying difference between data science and statistics? What is the Harvard Data Science Initiative? Francesca Dominici is here to answer all these questions for you!

Dr. Dominici obtained her PhD in statistics from the University of Padua in Italy and is currently the Clarence James Gamble Professor of Biostatistics, Population and Data Science at the Harvard T.H. Chan School of Public Health and Co-Director of the Harvard Data Science Initiative. The main focus of her research is to utilize statistical techniques in addressing significant problems in environmental health science, pollution and climate change. Her studies have resulted in a direct and continuous impact on air quality policies, leading to more stringent standards for ambient air quality in the United States. In addition, she has a keen interest in Bayesian causal inference methods for large-scale observational data and their potential application to comparative effectiveness research in cancer.

Beyond her academic achievements, Dr. Dominici also devotes herself to promoting diversity in academia. Let’s dive into this episode to see what she shared with us!

Jocelyn: Hi Francesca, welcome to the Biostatistics podcast. Thank you for coming to the show.

Francesca: Thank you for having me.

Jocelyn: Thank you. So I’m wondering, can you start by telling us a bit about your background and how you became interested in biostatistics?

Francesca: Yes, sure. I would say I probably have an unusual background. I am a first-generation student. No one in my family has a college degree, and still now no one in my extended family has a college degree. I was born in Rome, Italy, and I started my studies at the University La Sapienza in Rome. It was always very clear to me that I love mathematics, but as I started my bachelor’s, I realized that I wanted to do something that was very mathematically oriented but had at the same time the ability to solve important problems. And so I came across the field of statistics in my twenties and it was pretty clear, you know, it was literally love at first sight. It was exactly what I wanted to do. So I finished my college degree at the University La Sapienza in Rome and then I entered a national competition to get into an Italian PhD program. So I started a PhD program in statistics at the University of Padua, Italy. Then I was able to be a visiting student at Duke University and I convinced my Italian PhD program to sponsor me to finish my PhD at Duke. And then I applied for a postdoctoral fellowship position in Biostat at Hopkins and then transitioned to a Biostat tenure-track faculty position. And I was at Hopkins for 12 years and now I’ve been at Harvard for 14 years. So yeah, it’s been a long road from Rome, Italy, to Padua, to Duke University, North Carolina, to Johns Hopkins in Baltimore, and now Cambridge and Boston, Harvard.

Jocelyn: It definitely sounds like such an exciting journey, but I’m also wondering how you switched from just statistics. I guess not exactly switching, but how did you choose Biostats over other areas of applied stats?

Francesca: Yeah, I mean, I would say Biostat decided to choose me, because my training and my PhD are in statistics. My PhD thesis was, I would say, the traditional PhD thesis in statistics. It was on Bayesian hierarchical models. But when I was at Duke University finishing my Italian PhD, I started to get involved in more applied statistics. Again, I always really loved to develop new methodology, but also with a purpose. So I always like to think about, okay, what’s the motivating problem? What is the scientific question? And then how am I going to address this question with statistical thinking? And so even throughout my thesis, there were some examples of applied statistics. And then when I was finishing my Italian PhD at Duke University, I started to apply for postdoctoral fellowships. And that’s where my thesis on Bayesian hierarchical models seemed to fit very well with a research project at Johns Hopkins Biostat on combining information across studies to establish the relationship between air pollution and health. And so when I got the offer from Johns Hopkins, it seemed to me that that was a very good fit, and I happened to be in a Biostat department. And so that’s how I really started my career in Biostat. And I’ve been very happy about that.

Jocelyn: That’s great to hear. I guess one question I always ask all the professors is, how did you pick the research topics that you ended up working on? For example, I know you work on air pollution, health, and climate, on causal inference methods, as well as on comparative effectiveness research. I’m wondering, how did you come to that path?

Francesca: Yeah, I mean, I think in the context of… So when I got my first job as a postdoctoral fellow in the Biostat department at Johns Hopkins, that was to work on a specific research project, which was how you define and develop Bayesian hierarchical models to combine information across several locations to estimate the relationship between air pollution and health. So in a certain way, again, my job description already included the research area that I was working on. And I found that extremely stimulating and, again, it really represented to me the perfect framework, where we had a really important societal problem where we wanted to regulate the level of contaminants in the air. And you can do that by providing statistical evidence that they are harmful to you. So that was really the very good part of it for my research. So I have worked on air pollution and health for many years. Then throughout my career, and that’s the beauty of being a biostatistician, you can always get involved in other areas. So when I came to Harvard, I wanted to explore other areas of research. And so, first of all, I was really interested in getting to understand more about the field of causal inference, because ultimately, when you’re trying to figure out whether or not a contaminant is harmful, or whether or not a policy regulation works, you’re really trying to address a causal question. So I started to learn more about causal inference. And then I also wanted to learn a little bit more about cancer. And so I started several collaborative projects to really understand which of the more advanced treatments in cancer research were beneficial. The interesting thing is that eventually I went back to air pollution and climate research. And so I went back to the area I started to work in, because I really love that.
And especially now with the climate crisis that we are all facing, and the fact that climate and air pollution are really two sides of the same coin, it’s really continued to spark my interest and my passion to train the new generation. So it’s not that I woke up in the morning and I said, OK, should I work on this area? Should I work on this area? I just take the opportunities. And I’m always very welcoming in terms of trying to understand new areas of research. But also, I try to figure out in which area of research, with my interests and my training and my skill, can I really contribute, right? Contribute in a significant way versus making marginal contributions.

Jocelyn: I see. That’s very interesting. When you talked about your work on air pollution and climate change, you mentioned that it is very relevant to policy change. I’m wondering, aside from being a biostatistician, how much of your work is related to the regulatory side of this?

Francesca: Yeah, a huge part, because in the context of regulating air pollution, you know, there is a law which is called the Clean Air Act. The Clean Air Act is a federal law that requires the Environmental Protection Agency to set what’s called the National Ambient Air Quality Standards. These are considered safety standards. And if there is data science evidence that a level of air pollution below the safety standard is still harmful to human health, by law, the EPA has to lower the safety standard, right? So there is, I would say, a perfect opportunity for a data scientist, a statistician, to impact policy, because what you have to do is the best possible statistical analysis to provide evidence on whether or not the safety standards that the EPA is setting are truly safe. And analyzing this data is very complex. These are data that vary in time and space, and there is measurement error, there is confounding, and you have to address the question of causality. So I always like to be laser-focused on developing methodology and analyzing data that can directly inform policy and can be directly translated into new legislation that protects public health.

Jocelyn: I see. Are there any current projects you’re working on that you think will be very interesting to share with our audience?

Francesca: Sure. Yeah. So one project that we’re actively working on is this: as you might know, just last January, 2023, the Environmental Protection Agency announced that they are planning to lower the safety standard for fine particulate matter from 12 micrograms per cubic meter to potentially 10 or 8. And they also want to protect the most marginalized communities. And so we are conducting additional analysis and additional modeling to be able to directly inform these decisions. So should they lower it to 10, to 9, to 8, but also, by lowering it, to what degree are they going to provide a benefit for marginalized communities as well, like people with low income or underrepresented minorities?

Jocelyn: I see. That’s so exciting, and I think it’s very meaningful work, for sure. And when you’re talking, you constantly mention data science and statistics. I learned that you’re also the co-director of the Data Science Initiative at Harvard. I’m wondering, since data science is such a broad term, how do you think statistics falls under that term?

Francesca: So data science is, in my mind, a pretty comprehensive discipline that includes fields like statistical methods and applications, computer science, machine learning, and AI. In my mind, when I think about data science, I really think about the science of the data. And I think about how we extract meaningful, actionable knowledge from the huge amount of “data,” in quotes, that is now available through words, through sensors, through social media, through our interaction with the internet, through the news, through cell phones. Originally, when you think about the field of statistics, the field of statistics, or some parts of it, started with the analysis of experiments, where you would start with a specific hypothesis, and then you would collect data. And then after you collect the data, you have a specific study design, and then you test the hypothesis. But with data science, the science of data, it is actually the opposite. The data comes first. You have this massive amount of data that comes from everywhere and is collected for different purposes. And then you try to extract knowledge from that data. So data science really includes, I would say, the context of what type of information you are seeking; it includes a huge amount of data engineering, how you align, how you clean, how you build this gigantic database. It includes statistical thinking, how you analyze the data, how you develop a model, how you quantify the uncertainty. It includes computer science in terms of how you think about fairness, how you think about neural networks, how you think about prediction. But it’s also important to always think about the ethical use of data. How do we make sure that we don’t propagate biases? How do we make sure that there is not some ill use of what is now ChatGPT? So I think we have to think that data now is the new currency. Data is very powerful.
And so data science is really the discipline that should govern the way we extract information from this data, so that it ultimately benefits society and the world and is not used for ill purposes.

Jocelyn: I see. That’s very insightful. I’m wondering, can you elaborate a little bit more on what exactly this data science initiative you’re trying to build is?

Francesca: Yeah, the Data Science Initiative at Harvard is university-wide, which has been really exciting. It’s an initiative that is really supposed to touch all of the schools at Harvard. So from the law school, the divinity school, design, dental, medicine, public health, arts and sciences, engineering, because it’s almost impossible right now to think about a scientific discipline that doesn’t deal with data in one way or another. And so it’s very broad. But also, I think it’s important that we have a purpose. And the purpose is really to advance data science with the lens of having a positive impact. So we want it to be extremely interdisciplinary. We like to convene faculty across the university. We like to provide funding for research projects, interdisciplinary research projects, always with a lens of how we can leverage this huge amount of data in an ethical and technically rigorous way. So then we can have a positive impact where we can advance our understanding of poverty, of malnutrition. We can combat climate change. We can ensure access to education for everyone. We can figure out the critical causal factors of gun violence, right? So think about it: the great majority of the huge problems that we are facing are problems that can be advanced, if not fully addressed, via data science.

Jocelyn: I see. It sounds like a very comprehensive project, because you can apply data science to many different kinds of fields. Sounds very interesting. So how do you think, I guess, not just biostatistics research but statistics research in general will evolve in the coming years?

Francesca: Well, I think, you know, a very good training of statistics is an essential component of data science. Let me give you an example, I mean, multiple examples. When you have a formal training in statistics, first of all, we are taught about the sampling mechanism that you don’t like, right? But that’s fine, because you can always transition to something different. But I haven’t, I was really lucky. I love the training. I always really loved the technical part of the training. And my first research project happened to be in a research area that I liked. So for me, it was pretty easy. But I think, you know, I think you just have to be aware that the training can be pretty technical.

Jocelyn: I see. That’s very good advice. Thank you for all of the great insights that you gave us, which brings us to the end of this podcast episode. My last question is, what is one question that you wish I had asked, and how would you have answered it? Or it could be anything that you wanted to share that I haven’t asked about.

Francesca: That’s a good question. Well, I think that maybe one thing to always keep in mind, and this is something I’ve been valuing more and more throughout the years, is the importance of reproducibility and the importance of embracing the culture of using a GitHub repository and reproducible code. And there is a tension there, because on one hand, it’s really important to do research that is fully reproducible and accessible by others. On the other hand, there is still, you know, this little fear that people might find mistakes in what you did, which they will, because the bottom line is statistics is complicated. So I hope that we will continue. I think we are on the right track, but I hope we will continue to embrace a culture in biostatistics, in any statistics, where it becomes the norm, and I think it is happening already, that we share the code of everything that we do, but also that we embrace the norm that making a mistake, or not calculating the standard error in the best possible way, is human. And as a community, we should work toward a fully reproducible data science that goes together with constructive criticism and constructive feedback to help each other do better, instead of just criticism.

Jocelyn: For sure. That sounds really important, because I do notice that sometimes when I try to get access to other people’s research and I want to reproduce their results, they don’t really have their code uploaded anywhere. So I have to contact them by email. So it would definitely be more convenient if this awareness of reproducibility were more widespread in the statistics community. Well, thank you, Francesca, for coming to the Biostatistics podcast. It was great talking to you.

Francesca: Thank you very much for the interview and good luck with everything.