About Me

My name is Ishita Sarraf. I am majoring in Computer Science with a concentration in Statistics at Grinnell College in Iowa. Originally from Kolkata, India, I am a rising junior and will graduate in May 2025. I am interested in conducting research in data science, artificial intelligence, and machine learning, and I find the intersection of technology with other fields captivating. In the summer of 2022, I worked as a Backend Developer for New Emerging World of Journalism in Mumbai, India, where I developed a pipeline that extracts information from various websites and automated it to run every 5 minutes. For the summer of 2023, I am conducting research in the Information Quality Lab with Dr. Jodi Schneider and Ph.D. mentor Yuanxi Fu at the University of Illinois Urbana-Champaign. I am developing, documenting, and testing reusable pipelines for acquiring and processing the full-text content of scholarly publications through text and data mining. I am interested in going to graduate school to learn more about the various ways of conducting Computer Science research and to make an impact on society through my own research. Apart from computing, I am an avid reader, a mezzo-soprano, and an intermediate Spanish learner.

About My Mentor

My mentor, Dr. Jodi Schneider, is an Associate Professor in the School of Information Sciences at the University of Illinois Urbana-Champaign, where she directs the Information Quality Lab. She studies the science of science through the lens of arguments, evidence, and persuasion. Her long-term research agenda analyzes controversies applying science to public policy; how knowledge brokers influence citizens; and whether controversies are sustained by citizens’ disparate interpretations of scientific evidence and its quality. She holds affiliate appointments in the Beckman Institute, Health Care Engineering Systems Center, European Union Center, Informatics, Center for Health Informatics, and Cline Center for Advanced Social Research at the University of Illinois and the Department of Psychiatry of the University of Illinois Chicago School of Medicine. Her work has been funded by the Alfred P. Sloan Foundation, the European Commission, IMLS, NIH, Science Foundation Ireland, and an NSF CAREER award. You can find more about her research at https://infoqualitylab.org.

My Ph.D. mentor is Yuanxi Fu. You can find more about her work at https://ischool.illinois.edu/people/yuanxi-fu.

About My Project

My project involves constructing a data extraction pipeline that downloads the full texts of scholarly publications to help researchers create their own custom datasets. The pipeline is reusable: given the Digital Object Identifier (DOI) of any scholarly paper, it retrieves the paper’s PDF and XML full texts, if available, and stores them in a database. To retrieve full texts published under various copyright licenses, I used the text and data mining APIs supplied by Crossref, Elsevier, and Wiley. As part of my requirements analysis, I also interviewed researchers who mine and analyze scholarly publications to identify scientific analysis tasks that could be done after the extraction. The full-text extraction pipeline is important because it allows many different datasets of scholarly publications to be built with a single pipeline, so researchers can construct their custom datasets without spending their time sorting out copyright licenses.
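To make the DOI-to-full-text step more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the code of my actual pipeline. It assumes a metadata record shaped like the one the public Crossref REST API returns for a work, where a "link" list carries a "URL" and a "content-type" for each full-text location. The function names, the SQLite table, and the sample record are hypothetical, and the sketch does no downloading: fetching the files themselves, especially through the Elsevier and Wiley APIs, requires credentials and is left out.

```python
import sqlite3
from urllib.parse import quote

# Public Crossref REST API endpoint for a single work.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref metadata URL for a DOI (hypothetical helper)."""
    return CROSSREF_WORKS + quote(doi.strip(), safe="/")


def fulltext_links(work: dict) -> dict:
    """Pick the first PDF and the first XML link from a Crossref-style record."""
    links = {}
    for link in work.get("link", []):
        url = link.get("URL")
        content_type = link.get("content-type", "")
        if not url:
            continue
        if content_type == "application/pdf":
            links.setdefault("pdf", url)
        elif content_type in ("application/xml", "text/xml"):
            links.setdefault("xml", url)
    return links


def save_links(conn: sqlite3.Connection, doi: str, links: dict) -> None:
    """Store one row per (DOI, format); the table layout is an assumption."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fulltext_links ("
        "doi TEXT, format TEXT, url TEXT, PRIMARY KEY (doi, format))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO fulltext_links VALUES (?, ?, ?)",
        [(doi, fmt, url) for fmt, url in links.items()],
    )
    conn.commit()


if __name__ == "__main__":
    # Made-up record standing in for the "message" part of a Crossref response.
    sample_doi = "10.1000/xyz123"
    sample_work = {
        "link": [
            {"URL": "https://example.org/paper.pdf", "content-type": "application/pdf"},
            {"URL": "https://example.org/paper.xml", "content-type": "text/xml"},
            {"URL": "https://example.org/paper.html", "content-type": "text/html"},
        ]
    }
    conn = sqlite3.connect(":memory:")
    save_links(conn, sample_doi, fulltext_links(sample_work))
    print(crossref_url(sample_doi))
    print(conn.execute("SELECT format, url FROM fulltext_links").fetchall())
```

Keeping the link selection separate from the storage step means the same record-parsing logic could be reused no matter which publisher API ends up serving the file.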

My Final Report

My Blog
