50 students. Four schools. One 72-hour data science marathon.

That’s what happens at DataFest, a weekend-long competition thrown each year by the American Statistical Association that pits teams of two to five students against the clock to make sense of a massive, complex, real-life data set.

In March, Colby hosted its first-ever regional event, held virtually, with teams from Bates, Bowdoin, and College of the Atlantic. “I wanted to bring people together from all kinds of majors and get people engaged in data science,” said Mufaddal Ali ’21, founder and president of Q Data Science Group, Colby’s first data science club, who helped organize the event. “We had fifty students from four schools participate in the 72-hour marathon.”

“When I saw the DataFest announcement, I knew I wanted to do it. I love the challenge of working under pressure,” said Emily Larson ’21, whose team won “Best Modelling.” As a computational biology major and statistics minor, she had firsthand experience with data sets like these. “I’ve definitely been data science-focused throughout my time at Colby, and it was fun using those skills in a real-world context.”

Colby’s team of Maddie Carlini ’21, Julia Chahine ’21, Emily Larson ’21, and Lily Matson ’22 focused on the abuse of prescription drugs by Canadian students and found that parental income and history of alcohol abuse were the most common predictors.

Using Real-World Data to Solve Real-World Problems

This year’s data required teams to take a deep dive into the non-medical use of prescription drugs across multiple countries, years, and demographics. “It’s totally open-ended, so each student team tackled completely different topics,” said Jerzy Wieczorek, assistant professor of statistics. “For me, I was just excited to have a chance to practice what I preach. Statistics and data science is such a broad discipline, and we’re teaching students core technical skills like coding and data analysis, but it’s important to remind them that we’re not doing this in a vacuum.”

For many of the students, this was their first real-world data set—filled with frustrating gaps, incomplete figures, and irrelevant noise.

“Even though the data and techniques matter, it’s really the story they can tell and how they bring that data to address a real business problem that makes the difference,” said judge Jamie Warner ’09, vice president, data science at Lincoln Financial Group. “Coming out of academia, a lot of people haven’t seen a bad data set. I was really impressed at how they handled it.”

Sifting Through Data to Find the Golden Question

It all starts with a question.

The problem was, each team had to find that question, slicing and dicing the data into manageable sections to build a model that could provide answers. “We approached it from a public health perspective, looking at the abuse of prescription drugs by Canadian students specifically,” said Larson. “We found that the most common predictors were parental income, history of alcohol abuse, and mental health.”

The team of Muxin Li ’22, Shailin Shah ’21, Michael Yu ’21, and Chris Zhu ’22 won Best Use of External Data for their analysis of seizures from synthetic and non-synthetic drugs.

Taking a typical liberal arts approach, Larson’s team focused on cross-discipline policy implications, drawing on coursework from biology, statistics, and government. “You can do all the fancy modeling you want to, but if it’s not useful to anyone, what’s the point in doing it?” said Larson. “My teammates are all really interested in public health, and we wanted to look at something that would make an impact.”

Other teams looked at non-medical prescription drug use among medical professionals, the opioid crisis in rural Virginia, and the use of crisis counseling and other social service programs in Germany.

A 72-Hour Data Marathon

Professors and alumni from the Mathematics, Computer Science, STS, Psychology, and Biology departments stayed on call through Slack and Zoom to help teams progress over the long weekend. “It was a testament to how important data science is across campus and how it isn’t just some purely technical skill—it’s woven into our academic work and into our lives as citizens in so many different ways,” said Wieczorek.

At the end of the 72-hour sprint, teams presented their work to a judging panel made up of faculty, staff, and industry experts from IBM, Fidelity, BCG Gamma, and others. Winning teams received a $250 cash prize, memberships to the American Statistical Association, and lifetime bragging rights.

“I’m really excited to participate again,” said Warner. “I think it’s such a cool opportunity that hasn’t existed for data folks in the past. I was so impressed to see the level of understanding and the visualizations that added value and made a data story impactful. I’m thrilled at the way Colby is approaching data science in a multi-disciplinary way.”

Warner was so impressed, in fact, that she created a job for Ali at Lincoln—their first undergraduate hire for data science.

This is only the beginning of data science opportunities at Colby. The McVey Data Science Initiative integrates data science into the curriculum across campus, from the humanities to natural sciences and beyond, building into a larger program that gives students critical skills in predictive analytics, AI and machine learning, and data visualization.

“It’s all about community building,” said Wieczorek. “There are so many ways to draw students interested in data science across campus, and we can’t wait to get more people involved next year.”