Please tune in to Episode 6 on Apple Podcasts, Google Podcasts, Amazon Music, Spotify, and Firstory.

Transcripts:

Intro:

Can you do research while being in an industry setting? How do people do science in the real world? What are some technical and cultural differences between statistics and machine learning?

Jonathan O’Brien will answer all these questions for you. Jonathan obtained his PhD in biostatistics from the University of North Carolina and currently works as a principal data scientist at Calico Life Sciences, a healthcare biotech company that focuses on combating aging and aging-related diseases. His main research interest has been improving the mathematical modeling and downstream analysis of mass spectrometry proteomics experiments, which involves missing data techniques, Bayesian hierarchical modeling, clustering, and compositional data analysis.

It was great talking to Jonathan since he gave me a lot of new perspectives and I learned a lot more about proteomics studies and what he’s working on. Let’s dive into this episode to see what Jonathan has shared with us.

Jocelyn: Welcome Jonathan to the Biostatistics podcast.

Jonathan: Thank you for having me.

Jocelyn: Thank you. I’m really excited to have Jonathan here, on a bit of a personal note, because the things I’m working on involve the Bayesian framework, and I was trying to explore what other topics could be done in the Bayesian framework. I came across proteomics, and Jonathan’s paper, “The effects of non-ignorable missing data on label-free mass spectrometry proteomics experiments,” was one of the first few papers I read. It’s really well written and really comprehensive in terms of the general data structure of proteomics and the common problems that arise in proteomics. He provided a really detailed guide, and they proposed a Bayesian model to deal with the missing data problem. So that was a really interesting paper, and thank you, Jonathan.

Jonathan: Thank you.

Jocelyn: Let’s get started by you talking about your background and how you became interested in biostatistics.

Jonathan: Oh, so before biostatistics. Well I did an undergraduate degree in math and philosophy and a master’s degree in applied math. And then I went out into the working world for a while. This was all in Boulder, Colorado. I moved from Chicago to Boulder for my undergrad and I stayed there for 12 years. I loved it there. But I was working in a very strange place. It was a place called Simplicity. I think the official name was Simplicity International, even though it wasn’t international at all. But it was a very small group of people working out of a retrofitted barn in the mountains. And they managed to sell a product line of environmentally friendly dish soap and laundry detergent into a nationwide distribution in Walmart. And I was tutoring one of the employees and she ended up asking me to come out there and help them out for a while. And they thought I was basically magical because I knew how to use Excel. And they ended up offering me a job. So I worked for them full time for a while. And during that time, I ended up coming down with a case of melanoma. I had it on my leg and it was an intermediate case. We didn’t know if it had spread or not. And if you know about melanoma, once melanoma spreads, it’s one of the really, really bad ones. You really do not want this thing floating around your system. But if you catch it while it’s still in the skin, you cut it out and throw it away. Fine. But I had to spend a few months, four months, three or four months, not knowing whether or not it had spread. It could have gone either way based on the depth of the tumor. And during that time, I just decided I didn’t want to. I didn’t really care about selling soap and I wanted to do something more intellectual and I wanted to learn more about biology and why my body was trying to kill me. 
So with my background in mathematics, applied mathematics, and a little bit of probability and stats, I decided that biostatistics was probably the best way to achieve those goals. And it turned out that the cancer scare was fine. It didn’t spread. I’m here. That’s good. This was a long time ago now. But I ended up just taking all my time and trying to do well on the GREs, and I applied all over the place. And the best program I got into was the University of North Carolina in Biostats, which is one of the big ones in the biostatistics world. So I moved to North Carolina and started my career over. And that was how I got into Biostats.

Jocelyn: And so far, how do you like this career path?

Jonathan: Well, it worked out fantastically for me.

Jocelyn: Well, that’s awesome.

Jonathan: The path I followed is definitely unusual, even for people in Biostats. It continued to be unusual. After my PhD, I ended up doing a postdoc in the Gygi lab, which is one of the big proteomics groups out there. They specialize in the development of technology for something called an isobaric proteomics experiment. And I’ve been working on this. I still work on it. In fact, we’re about to publish what I think will be the most high-profile work I’ve done. We’re in revision right now. And it’s still the statistical modeling of proteomics data. This is a very complicated topic. Some of my initial research at UNC ultimately led to the paper that you read. But then I went and received training in an actual cell biology department at Harvard, and they made me do actual experiments. I had to play with yeast and figure out what was going on with the technology, and just trying to translate and interpret all the things they were saying, and figure out what it meant statistically, was an enormous undertaking, very, very difficult. And that’s basically been the foundation for my career. I went out to the lab, I focused on a problem as they explained these things to me, and I was working side by side with all the people in the lab. It’s much different in a proteomics lab or in the biology world. You have weekly lab meetings. You come in every day and you’re working with a team of people. When I was at UNC Biostatistics, I didn’t talk to anyone. I had people that I worked out problems with, and I’d meet with my advisor once a week. But in terms of day-to-day interaction, it was just a very different world. But that went well. I had a great experience in the lab. Steve was a wonderful postdoctoral mentor and the team out there was incredibly friendly. I enjoyed the experience a lot, and it led to Calico recruiting me to come out to California.
They heard about what I was doing and invited me to give a job talk and ultimately recruited me to come out and continue what I was doing here. So I love it. I get a lot of freedom. I get to work on a lot of challenging problems. And basically, I get to do all the things I set out to do when I first went into biostatistics. I get to work on fun intellectual problems that I think are going to be meaningful, that can actually have an impact. And I get to learn a lot more about the world of science.

Jocelyn: For sure. I guess it sounds very unusual, like you said. But also, we’re very excited for the publication when it comes out. Can you tell us a bit more about any exciting projects or initiatives you’re currently involved in or working on, other than the publication you’re talking about?

Jonathan: Well, I guess I haven’t told you anything about where I am yet, so I should probably start there. Have you heard of or have you looked up Calico Labs? We’re one of the Alphabet companies. Our mission is to study aging and aging-related diseases. From that perspective, nobody here cares about proteomics or statistics. I am a sideshow here, but we have a lot of good scientists. The company is run by biologists and people who are very serious about doing real research to understand the mechanisms that underlie the aging process, which is a very, very difficult problem. And in that capacity, a lot of people want to know about proteins. They want to know what they’re doing. And I am deemed someone who is useful in that regard. So we do have a number of very interesting projects that I contribute to and help with. The projects that I kind of consider my own are much more technical. They’re usually ways to improve technology for doing proteomics experiments. Sometimes they are actually experimental improvements. Sometimes they are statistical modeling changes. There are various ways in which statistical thinking, having a very deep understanding of the properties of the data, can lead to real advantages. Coming from UNC Biostatistics, I always thought of everything as… Let me back up. Coming from a math background, I thought, if something is published, it must be correct. It’s either correct or it’s not, and if it’s not, you’re going to throw it out. And in statistics, you have a way of doing things. You formalize everything. You have to write down your assumptions. You have to write down your parameters and know exactly what’s going on. And I just thought, this is how it has to be. This is how you do science. And then I went into the real world of science. That’s not how anybody does anything.
And in fact, you have to admit that the majority of the time, when you compare doing something quote-unquote the right way with what somebody else did who just had a good understanding of the data and muddled their way through it, it’s not guaranteed at all that doing things the quote-unquote right way is going to make much of a difference. So I really try very hard to find situations where I think it will make a difference. I’m not interested in making 1 to 2% improvements in some simulated data. That’s not something that anybody here cares about. And it’s not something I really care about either. But there are places where inferential thinking is incredibly important, where you can actually make technological breakthroughs by understanding it and doing it well. And one of these places is missing data, which is what that paper you read is about. One of the things that always appealed to me about proteomics as a technology is that it’s incredibly statistical in its nature. You’re randomly sampling ions from an ionized cloud. You are doing inference at many steps of these experiments before you even start talking about experimental design. And you also have possibly the worst missing data problem in science. It’s a really bad one. At the level of the observations that you’re seeing, if you put together a matrix of all your observations, you might notice that 40% of the cells have no values. So, well, what you do in that situation is not going to be a 1 or 2% difference. You’re talking about some people coming up with arbitrary rules for throwing out information on proteins. And at the end of the day, you’re like, wow, that person just threw out 20% of their data. That’s a big, big difference. You’re not seeing 20% of the proteome. Well, it’s much less than that, because we don’t start out seeing the whole proteome. But you’re not seeing a giant chunk of data that maybe, if you had been more careful, you still could.
So these are the types of things I look for, places where my skills, my way of seeing things might actually make a real impact on the technological landscape. See, and I can always talk about that stuff. There’s things I can’t tell you about at Calico, but nobody cares if I talk about statistical modeling or proteomics data. That’s perfectly fine.
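
A minimal sketch of why the missing data described above matter, assuming (as a deliberate simplification that is not stated in the conversation) that a value goes missing whenever its log-intensity falls below a hard detection limit. The numbers, the `detection_limit` value, and the function names are all hypothetical. When missingness depends on the unobserved value itself, which is what "non-ignorable" refers to, averaging only the observed values overstates the abundance:

```python
from statistics import mean


def observe(log_intensities, detection_limit):
    """Return the measurements, with None where the signal is below the limit.

    The hard cutoff is an illustrative assumption; real instruments lose
    low-abundance signals in a less clear-cut way.
    """
    return [x if x >= detection_limit else None for x in log_intensities]


def missing_fraction(observed):
    """Share of cells in the observation list that have no value."""
    return sum(v is None for v in observed) / len(observed)


def naive_mean(observed):
    """Average of the observed values only, ignoring why the rest are missing."""
    return mean(v for v in observed if v is not None)


# Hypothetical log-intensities for one protein across ten samples.
true_values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
observed = observe(true_values, detection_limit=14)

print(missing_fraction(observed))  # 0.4
print(mean(true_values))           # 14.5
print(naive_mean(observed))        # 16.5
```

In this toy setting 40% of the cells are empty, and the average of what remains is two log units above the true average, because the missing values are exactly the low ones.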

Jocelyn: I see. I’m wondering, out of all the omics-related data or related studies, why did you choose proteomics?

Jonathan: Well, partly, it was happenstance and luck. When I was at UNC, there was a pharmacology department that reached out to us asking for help. They had recently purchased a mass spectrometer, and they started doing some experiments, and they were looking at the data analysis, and they did not know what was going on. So I went with a junior faculty member over there, and we sat down and had a meeting with them. And there was some poor postdoc who had been given the task of figuring out how to work this mass spectrometer. In retrospect, no one in the room knew what was going on. It was a big mess. But that was my introduction to the technology. We started playing around with trying to define the data type, trying to figure out what the key properties were, what the challenges were. And this took a while. I worked on this for a couple of months, and I had a very wonderful advisor, Bahjat Qaqish. He’s an incredibly sharp individual and one of the best people I’ve ever been able to just sit down with and talk about statistics. When you’re in the room with him and you’re asking him about statistics and statistical modeling, he’s just fantastic. And this is what we would do. We would sit down one week after the next, and he would challenge me to write down the properties of these data sets. And I remember one day when I finally convinced him that this was an unusual data type. He looked at it and said, huh, I’ve never seen data like this before. And that was probably the point at which I was really hooked. It seemed like, hey, here’s this thing where people haven’t described it right. There’s all these properties. And if you read the proteomics papers, you won’t. It’s a different language. They don’t think about these things. Or that’s not right. They do think about these things. People from the proteomics world have an incredibly strong intuition for the properties of their experiments. They just don’t have the formalized language to write down what’s going on here.
But I can give you an example. I mentioned isobaric proteomics, the thing that’s done at the Gygi lab. The key thing here is that you have a bunch of outcomes that you see in a matrix. And the whole experiment was being described to me. And I thought I had a decent idea of what was going on, because I’d been working with proteomics data for a while by the time I got there. People who had a real understanding of the experimental process kept trying to get me to realize that there was this component where, if you increase the signal in one sample, it necessarily will decrease the signals in the others. I was like, what are you talking about? This was not a small thing. And eventually I realized, as it was explained to me, this is very real. There’s an actual physical constraint on the measurements. One way to think of this is if you’re sampling 200 ions that came from, say, eight different samples because you’re co-isolating samples, that’s the key thing here. You create this cloud. Well, if you tripled the abundance in sample one, you’re still only collecting 200 ions. So there’s a name for this. It’s compositional data analysis. It’s really the proportions you care about. But this is not like most compositional data analysis because the constraint is not at all obvious. You could study these data sets for a hundred years and not figure out that this is compositional data analysis. And I had a good statistician ask me exactly this question. We published a paper called compositional proteomics, something like that. That’s basically the whole idea. We look at this property and we say, hey, you can do exactly what I just said. You can spike a gigantic amount into one of the samples and the actual signals you see from the co-isolated samples are going to decrease. And everyone in the proteomics world understands that and knows that, but none of them really wrote down what that would mean statistically because that’s not how they do things.
But as a statistician, looking at the literature, you’re not going to figure this out unless you happen to stumble upon my paper now. And even then, the paper that I did didn’t provide all the answers. We mostly just showed the problem, showed the property and described it. Similar to the missing data work, we described what I view as the big problem there with the missing data. But the solution we proposed, the Bayesian thing that you’re looking at, that’s not a real solution to any sort of general type of problem. Nobody uses that. And the reason is that it’s incredibly complicated to come up with ways to deal with the problem we described for arbitrary designs. The solution in that Annals of Applied Statistics paper was for, like, one specific situation. And I did it as part of my PhD, which means you have to derive some results and you create code. So I had to write a Gibbs sampler in C++, as students tend to be forced to do these things. Nobody’s ever going to use this. It’s incredibly slow. It’s horrible. In our new work, we actually have a system for these experiments that we think generalizes, not to all possible designs, but to a large class of things. Almost everything that I’ve encountered working at Calico, you can push through our new system. But going back to compositional data analysis, I was in the Gygi lab, which is this real lab that does exist in the world of science. But I was given the freedom to sit around and read books on compositional data and time to think about these things and think about the properties. And that was a lot of fun. And it matters. These things do matter. They really do change things, but you have to know where and when. It turns out that often the ad hoc approaches that people come up with when they really understand the data are actually much better than a formalized approach coming from someone who doesn’t understand the data.
Particularly when the ad hoc approaches have been developed through trial and error in systems that are well understood. And a large part of being able to make improvements, from what I have seen, is figuring out when you’re no longer in that tried and true setting. So for the example of the work we’re currently doing on the isobaric data and the compositional data, well, suppose you have 10 samples that are co-isolated and they’ve got this property that they’re all going to sum up to some number. And what that number actually is, is going to change from one scan to the next, from one sample to the next. So you’re never going to really see that that constraint is there. You only know the constraint’s there because it has to be there physically. You have this tiny trap that only holds so many ions. And so you puzzle through what this means and what this is going to be like. And I remember Steve Gygi asking me, he said, well, what are the implications of this? When we run this experiment, do you think the model you’re creating is going to give results way different from the ones we’ve been getting? And most of the experiments that are run are a single batch of what we’re calling multiplexed samples. You’re putting a bunch of samples together and analyzing them all at once. In one scan, you get measurements from all of these samples. And the answer for doing this in one multiplex experiment is virtually nothing. It’s gonna have very, very little impact on the final results if you’re just looking at one batch of data. And I said to him, but if you combine one batch here with another batch over there, all bets are off. All of a sudden, everything you’re doing could be a disaster. I wouldn’t bet any money that what you’re doing is gonna work well. And it turns out that’s the thing I’ve really been working on, this combining of multiple batches of the data. And that’s what our pending publication is really about.
So again, it was deeply exploring, finding properties of the data, figuring out what they’re all about, and then looking for the case where it’s really gonna matter. And I think that’s paid off. And it’s great that people have let me take all the time to play around and try to make improvements in these places. That’s actually kind of rare. There aren’t many places where people will pay you to do something like that.

Jocelyn: Right. It sounds like a really rewarding experience.
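
The ion-trap constraint described above (a fixed number of ions shared among co-isolated samples) can be illustrated with a small sketch. It assumes, purely for illustration, that each of the 200 collected ions comes from a given sample with probability proportional to that sample's abundance. The function names and abundance values are hypothetical and are not taken from the paper.

```python
import random


def expected_counts(abundances, n_ions=200):
    """Expected ions per sample when the trap holds a fixed number of ions."""
    total = sum(abundances)
    return [n_ions * a / total for a in abundances]


def sample_ion_counts(abundances, n_ions=200, seed=0):
    """Simulate one scan: draw n_ions ions from the pooled cloud.

    Each ion is assigned to sample i with probability proportional to
    abundances[i] (an assumed sampling model, for illustration only).
    """
    rng = random.Random(seed)
    labels = rng.choices(range(len(abundances)), weights=abundances, k=n_ions)
    counts = [0] * len(abundances)
    for i in labels:
        counts[i] += 1
    return counts


baseline = [1.0] * 8        # eight co-isolated samples, equal abundance
spiked = [3.0] + [1.0] * 7  # triple the abundance in sample one only

print(expected_counts(baseline))  # [25.0, 25.0, 25.0, 25.0, 25.0, 25.0, 25.0, 25.0]
print(expected_counts(spiked))    # [60.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0, 20.0]
print(sum(sample_ion_counts(spiked)))  # 200
```

Tripling sample one moves its expected count from 25 to 60, while every other sample drops from 25 to 20 even though nothing about those samples changed. The total stays at 200, so only the proportions carry information.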

Jocelyn: I guess, if you think the work you’re doing is eventually gonna make a huge impact, or is heading that way, how do you think the industry, or specifically the work that you’re doing, will evolve in the coming years? What are some new developments or innovations that you can see on the horizon?

Jonathan: Well, I think that the work we’re describing is gonna be a pretty big change for what you can do with this type of technology. And technologies do change all the time. But to give you some frame of reference here, DNA makes RNA makes proteins, and proteins are often the final product that you wanna study. But most of the world of biology is not built on studying proteins directly. Most of it is coming from exploring genetics, exploring DNA and RNA. Getting into the actual proteins, studying all the post-translational biology, is insanely complicated, just insanely complicated. The complexity of biological systems explodes when you start allowing for post-translational modifications. And forget it if you have to think about structures and complexes that are formed. All of the things that happen in cells at the post-translational level make everything so, so difficult. And I’m saying this from the technological standpoint, but it’s more than the technological standpoint, because the types of biological questions that we are capable of asking are largely determined by the state of technology. And so many questions, so many publications and things that people work on would just go away if you could just look inside a cell and see what was going on. Most of the time we’re just like, what was there? How much of it was there? What went up? What went down? If you could just look, problem solved, I’m out of a job. But you can’t see these things, it’s very difficult. So you have to play all sorts of complicated games to try to figure out what’s happening in these systems. And getting into post-translational biology is not something that’s going to be easy anytime soon. But being able to do this with mass spectrometry is, I believe, very powerful. I think that’s one of the most powerful tools that we have for accomplishing this. And there are papers that have come out recently suggesting other ways you might be able to do it.
I think they’re all a long way off. I think a lot of technological development, a lot of the things people are trying to study will be in this space going forward, because we are getting better at it. The improvements are astonishing. Sometimes I describe this to people and they think mass spectrometry sounds really boring. Like, okay, you’re creating a mass spectrum. This is a bunch of peaks at masses and it just doesn’t sound that interesting. Once you understand what these things are doing, we live in a science fiction fantasy world. This is really amazing stuff. You take a tumor out of a person, you vaporize it. You put it into a cloud of ions and you spin those ions around in such a way that you can figure out everything that was there and what it was doing. This is pure magic. It’s really amazing that you can do this sort of thing. But you can, and we do it all the time. But saying that we do it all the time doesn’t mean that it’s done perfectly. There’s all sorts of problems. And I just mentioned these things that people have known about forever. People have all known for 15, maybe 20 years that combining batches of multiplex data is very difficult. And depending on who you talk to, they may or may not be able to explain why it’s so difficult to do that. And even then, I mean, you can look at post-translational modifications, but it’s very difficult. It’s an incredibly challenging problem to be able to do that sort of thing. And then if you start thinking about how much more there is that you would actually want to know to understand the complexity of human life: structure, incredibly important; dynamics, incredibly important. And this is an area that I’m very interested in for the future. We talk about these things. I already threw out the saying, DNA makes RNA makes protein. It’s the central dogma, but that’s not at all what’s happening.
You actually have, DNA works with metabolites to make RNA, works with metabolites to make proteins that then get translated. And the process never stops. It’s all happening in a flux. You have rates of creation, you have rates of degradation. You have all of these things happening in these very complicated dynamic systems. And the dynamics play a key role in regulation of response to stress, of all sorts of different things. So I think that probably for the rest of my lifetime, I’m going to be watching people strive to push technology forward, to be able to see more and more of what’s actually happening when you look at the totality of the complexity of what’s going on in biology.

Jocelyn: I see. That’s very interesting. And I guess one thing I wanted to ask is, how much biological science knowledge do you need when you’re looking at these problems? Like, do you need to fully understand what is exactly going on in a biological sense? Or is it just enough so that you understand the data structure and you can figure out what is the optimal way to deal with this kind of data?

Jonathan: That’s a really good question. And I can argue, what is needed, nothing is needed. Everything occurs on a scale, right? It’s not like, oh, if you don’t know this, you fail. And if you do, you succeed. But it’s very important to learn as much as you possibly can about biology. And there’s a few reasons for that. Let’s step away from biology for a second. Let’s talk about the data generating process. As a student, I was always told that it is crucial. The most important thing for a statistician to do is to understand the data generating process. Because ultimately what you’re doing is you’re taking a very complex real world situation and trying to map it onto mathematics that includes randomness. Well, if you don’t understand the real world situation, you’re probably not gonna do a very good job at this. It’s difficult enough, even when you do understand. So certainly understanding the data generating process is very important. And understanding the biology is part of that. But additionally, the division of labor is incredibly important. I’ve been working at this company now for five years and I did a postdoc for two years in a cell biology department. I like to think I’ve picked up a fair amount of biology, but compared to the people who’ve dedicated their lives to studying biology, I’m a complete ignoramus. There’s all sorts of things that I don’t understand and where I ask silly questions, but I do ask the silly questions and I find that biologists appreciate that a lot. They like having to explain it, they like seeing that I’m trying to understand what’s going on. And that makes me better at my job. It makes it easier for me to communicate with them. And knowing the types of questions they’re going after and the types of things that they’re trying to achieve also helps me to focus on working on problems that are going to matter. I’ll give you an example.
If somebody has designed an experiment where they’re expecting to create a tenfold change in the abundance of a protein, great, you don’t need me. You don’t need a statistician to see this. You need two functioning eyes and you’re gonna see a giant signal or you’re not. And that’s important to understand because not all problems require statistical thinking. Some do, some don’t. So when you start getting to know how the experiments work, what the ultimate objective is, then you can really start to piece together where you will be most useful. And this is all essential. It’s a nonstop process. I think statisticians should always be learning and always striving to understand the scientific content of what they’re doing. I used to ask myself this question all the time, particularly when I was in the Gygi lab: is it better to really understand formal statistics but not understand the data generating process very well, or to understand the data generating process very well and not understand formal statistics? Because that was often the world that I saw, people coming up with ad hoc solutions, but really well-motivated, really thoughtful ad hoc solutions, versus people writing down equations. But really, you ask them where the data came from, they have no idea whatsoever. That happens all the time. And now I don’t think there’s any question at all. It’s much better to have the understanding of the data, much, much better. You don’t even know what types of things could be going wrong if you’re just making a bunch of assumptions and writing down a formal model. You’ve gotta know, you’ve gotta understand the experiments in order to be effective at your job.

Jocelyn: That makes sense. I think it also makes sense in some of the work I’ve been doing to design trials for certain diseases. I have to understand what’s going on with the disease and what some possible risk factors are that could affect the effect of the new treatment. So I think it’s very important to have some of the clinical or biological knowledge. And I guess aside from your work, if we’re to talk about advice for students or early-career starters: biostatistics is a very collaborative field, and you must have crossed paths with a lot of individuals in the workplace and have extensive experience. So what are some skills and traits of individuals that make you think, yes, this is the kind of person I wanna work with?

Jonathan: For me, I’m thinking a lot about people who are still curious and interested in learning. And I view that as just really essential for what I’m doing, but I have to step back. I don’t do standard biostatistics. If you see a job posting for biostatistics, you’re probably doing clinical trials, and we do hire biostatisticians at my company. And even though I call myself a statistician and I’m trained as a statistician, I’m over here in the basic research world where we’re playing around and doing weird stuff. So it’s not really the standard model for what a biostatistician does. And I think that if you’re in the clinical trials world, you actually should have a very different answer to this question. For example, you gotta really care about doing things the FDA is gonna approve of if you’re in the biostatistics world doing that type of work. I don’t care about that at all. That’s not something I think about. It’s not something I do. So for me, wanting to learn about biology, for example, going back to our last conversation, is immensely important for doing the type of work I’m doing. The type of thing I’ve spent all day today working on requires a lot of time reading papers, not from the statistical world, talking to people in a different universe. And I have to be interested enough in that to spend many months exploring it before I start to figure out what the statistical problems are. And for anyone who is following a similar path or is doing statistics in the basic research environment and the biological environment, I think probably the most important quality is perseverance and a desire to learn and the patience to slowly understand what’s happening and then slowly translate that into something like a statistical model.

Jocelyn: I see. So I guess the question could have been rephrased as: as a researcher, what kind of researchers would you like to work with? And then that would be the answer you just gave us?

Jonathan: Yeah, I’m looking for students, people who are still learning, who are constantly learning and who are curious and honest and hardworking. These are the most important things.

Jocelyn: I see. I think I probably positioned this interview in a way I wasn’t supposed to, because I thought it would be more industry oriented. I think a lot of the stuff we talked about is more on the research side. So the rest of the questions I’ve prepared, I don’t think will be very practical for this interview. For example, what do you wish you knew when you first started your career? I think you always knew you wanted to work on some sort of research problem that could potentially bring impact to the world.

Jonathan: So like I said, it’s not really a, what kind of job do I get? And what do I do at my work? And how do I progress from there? Well, and that’s for sure. You’ve got to realize that there’s a lot of private sector research going on right now. What I’m doing sounds weird. It sounds like I’m very academic. And in many ways I am, but I work for a private company and I have good insurance and all that stuff. So, no, it’s not at all weird. On the contrary, I think it’s really interesting because it’s- But it’s fairly new and we’re not the only ones. You’ve got Facebook or whatever they call themselves now. They have a major, major organization out there. Amazon paid for, they made a huge investment in aging research. So there’s a lot of private sector research that’s happening. So I think that that should be thought of, students should think about that as something that exists that they might want to participate in. And it all falls into the overall biotech landscape. How do you create biotechnology that maybe hopefully someday you’ll translate into something that a pharmaceutical company would want to buy. But the way this works, pharmaceutical companies largely do buy up biotech companies. They largely let the smaller biotech companies come up with the ideas and test things out. And then when it comes time to do some gigantic clinical trial, that’s where you need the size and the infrastructure of an AbbVie or Novartis or one of these big players. So there are a lot of, a lot of interesting places out there where you can still be a researcher even though you’re in the private sector.

Jocelyn: I guess that’s good to know. Because from my past experience or what I talk to people about is either academia or pharmaceutical companies. So I guess we never really came across the situation where other biotech companies might be interested in doing this sort of research. And it’s also related to what you’re studying as a student. So it’s really valuable as well. I guess it will be good for the audience to know that that’s a possible career path as well. So that’s really awesome to hear. Thank you.

Jonathan: Yeah, a lot of people are surprised by that. They think if you’re doing research, it’s strictly academic. It’s not true. We’ve got a person on our team who just left. He got tenure and he left his department the next day. That’s not my story to tell, but these things do happen.

Jocelyn: I see. And I guess as a last question in this podcast, what is one question that you wish I had asked that could be valuable for the audience, and how would you have answered it? It could be on any kind of scale, like any topic would do. One question that I should have asked. Or maybe something that you would personally want to share.

Jonathan: The thing I always see that’s a hot topic is thinking about the differences between, who’s your audience? You’re gonna give this to other people in statistics departments? Well, not specifically statistics departments. I guess anything related to, like, bioinformatics, biostats, or in general, just the healthcare data science world. I think a really important topic and an important thing for people to think about is the cultural and technical differences between statistical thinking and machine learning. Like machine learning is the hot thing. You wanna get a job. I get recruiters who contact me all the time like, oh, do you do deep learning? Well, no, stop, read my resume. I do not. But a lot of people want to do machine learning because of the big successes. AlphaFold, which was developed just down the street from Google has made huge progress in the space of discovering protein structure. I’m sure you’ve seen what’s happening with language models and ChatGPT and all these things. So people are really excited about this stuff. And I see a lot of statisticians wanting to go that way. And I see a lot of statisticians ending up just being kind of bad at what computer scientists do. And that’s not really fair. A lot of the original developments came from statisticians. But there are real cultural differences. There’s a real big difference in the way that people think about problems. And I’m way outnumbered here. I’m in a computing department with two statisticians and a dozen machine learning engineers. So I see this all the time. I see all of these dynamics. And I just think it’s very important to understand the differences, to think about the differences, to think about the underpinnings and how people go about making decisions. And that helps you focus on what your value proposition is as well. And I have adamantly said all along, I’m a statistician, I don’t do machine learning. I focus not on prediction, but on understanding.
I focus on trying to make assumptions about a system that I’m striving to understand as a scientist would. And I’m thinking about how principles of statistical inference and randomness come into play and can help inform the way we interpret experiments. This is a very different thing than focusing on how to optimize prediction, which is what is really happening in the machine learning world. And they’re very good at it. We talked before about the creation of a formalized model and how that compares to just having good intuition. Well, if you define your outcome as prediction and you have a very clear thing that you’re striving to achieve and you have good training data, and you’re writing down your formal statistical model, you’re gonna get crushed by the people who have been focused on how to win the contest. The key thing is understanding when it’s appropriate to try to win a contest and when it’s better to try to tease out various properties and understand them and explain them. There’s a lot of problems in the world and it’s key to know what you do well and where your skills should be applied. That’s my advice, is to think through that very carefully.

Jocelyn: I think that’s a really, really good point to bring up. ’Cause I think some of the PhD students I know who are working on, say, genetics data or omics data, they sort of work on more of a machine learning kind of method, but it’s under the umbrella of statistical science as well. But it just seems like the trend is that they wanna have more of a machine learning oriented career. And sometimes I’m not really sure if that is the necessary thing to do if you’re in the genetics or omics world. I guess today from you, I learned that it’s not the only thing that you can do. And statistical science mainly focuses on the inference, which is-

Jonathan: I would say more than that, I would say most of the time it is absolutely not the thing you should do. You have to understand when these things are gonna work well. They’re going to work best when you have lots of training data or lots of ground truth knowledge, and then you’re going to be able to use that information to learn something that you can then apply later. The underpinnings of the circumstance can’t change much from when you did the training to when you do the prediction. You need a lot of data. So if you think about what I do in quantifying proteomics data, like none of those things are true. You don’t have a lot of ground truth data. The situation changes almost every time you run the experiment and you almost never have good ground truth and you also have very small sample sizes. Experiments are incredibly expensive. Sometimes you’re only looking at three samples versus three samples. Forget it, that entire framework’s not right. You really have to understand the properties. You’ve got to know, hmm, there’s higher variance down here in this region. We can’t trust that. Ooh, there was this missing data pattern. We don’t know how to deal with that. You’ve really got to explore and get in depth there. But moreover, if you go back, I always get this guy’s name wrong. His last name’s Breiman. He was a statistician who did a lot of the development of early tools. He came up with, I think, random forests and a number of other things, but he wrote a paper called, like, “The Two Cultures.” If you look up “two cultures, Breiman,” I’m sure I’m butchering the guy’s name, it’s terrible. But he wrote this paper about the emerging culture of machine learning. And he was an early pioneer in this space. And he was a statistician. And I would encourage everyone to read it. It’s really interesting. And he was pushing back on what was at the time a very dominant statistical culture where everything was about creating parametric models.
And he thought that was dumb. He thought people were missing out on all sorts of great ways to make improvements over what we were doing. But he really did frame everything in terms of black box prediction. And now I think that the successes of those black box prediction algorithms have been so extraordinary that people have almost stopped appreciating the value of not having a black box, of what you’re doing when you really do try to tease apart and understand everything that’s going on under the hood. They’re very different types of thinking. They’re different approaches. And again, it’s a big world. No one approach, no one paradigm or framework is going to be the best thing to do for every circumstance you’ll find yourself in. It’s worth trying to understand the whole landscape and figuring out where the way you think and the skills you have are best applied.

Jocelyn: That’s truly some wise advice. ’Cause I do think I had a brief period when I was confused about this. Should I just go into machine learning somehow just because it’s very popular? But also at the same time, like, is it really statistical thinking if you completely ditch the whole structure? So that’s really good advice. And I’m pretty sure our audience will think more about it as well.

Jonathan: You’re going to be competing with a bunch of CS people who are really from the machine learning world. They’re going to do very well.

Jocelyn: Yeah. I think it’s a big culture where people always try to switch to, I guess, computer science related.

Jonathan: People follow money. The money is here. You get the hot terms, you get the big publications and the papers, you say the key words, you do the old thing, people get confused and annoyed. So it’s not a surprise. There’s, and especially with the latest advances, it’s probably just going to exacerbate, but there’s a lot of, a lot of, well, it doesn’t always work out.

Jocelyn: Right. Well, I guess that concludes our episode. And thank you again for being willing to take the time to talk to me. I will link the paper of Jonathan’s that I read in the description, so please check it out. And we will stay updated on your new publication that’s coming up.

Jonathan: Great. It was my pleasure. Thanks for having me. Thank you.

Jocelyn: Thank you.