Analyzing Open Source Software Ecosystems

This Fall 2023 DataFirst project aims to analyze code commits (source code and patch discussions) in open-source software (OSS) to better understand its ecosystems. A deeper understanding of these ecosystems will help the open-source community identify potential vulnerabilities and define better development practices. We use data from the Linux Kernel Mailing List. Our work focuses on extracting clean messages from the raw data and on performing keyword extraction and summarization on individual messages and patch discussions. Our initial results indicate that most clean messages can be extracted with regular expressions, and that keyword extraction and summarization can be accomplished with the help of large language models such as GPT-4.
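
To illustrate the kind of regular-expression cleaning described above, here is a minimal Python sketch. It is not the project's actual pipeline: the function name clean_message and every pattern below are assumptions chosen for illustration, based on common mailing-list conventions (quoted reply lines starting with ">", "On ... wrote:" attribution lines, a "-- " signature delimiter, and inline diffs starting with "diff --git").

```python
import re

# All patterns are illustrative assumptions, not the project's real expressions.
QUOTED_LINE = re.compile(r"^[ \t]*>.*$", re.MULTILINE)            # quoted reply text
ATTRIBUTION = re.compile(r"^On .+ wrote:[ \t]*$", re.MULTILINE)   # "On <date>, X wrote:"
SIGNATURE = re.compile(r"^-- $.*", re.MULTILINE | re.DOTALL)      # "-- " line and everything after
DIFF_BODY = re.compile(r"^diff --git .*", re.MULTILINE | re.DOTALL)  # inline patch to end of message
BLANK_RUNS = re.compile(r"\n{3,}")                                # three or more newlines in a row


def clean_message(raw: str) -> str:
    """Return the author's own prose from a raw mailing-list message body."""
    text = raw.replace("\r\n", "\n")       # normalize line endings
    text = DIFF_BODY.sub("", text)         # drop the inline patch, if any
    text = SIGNATURE.sub("", text)         # drop the trailing signature block
    text = QUOTED_LINE.sub("", text)       # drop lines quoted from earlier mail
    text = ATTRIBUTION.sub("", text)       # drop the reply attribution line
    text = BLANK_RUNS.sub("\n\n", text)    # collapse the gaps left behind
    return text.strip()


if __name__ == "__main__":
    reply = (
        "On Mon, 2 Oct 2023, Alice wrote:\n"
        "> old text\n"
        "> more\n"
        "\n"
        "Looks good to me.\n"
        "\n"
        "-- \n"
        "Bob\n"
    )
    print(clean_message(reply))
```

A sketch like this handles only the most regular cases. Messages with interleaved replies, unusual attribution formats, or patches sent as attachments would likely need additional patterns or a different approach.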