Analyzing Open Source Software Ecosystems
This Fall 2023 DataFirst project aims to analyze code commits (source code and patch discussions) in open-source software (OSS) to better understand OSS ecosystems. A deeper understanding of these ecosystems will help the open-source community identify potential vulnerabilities and define better development practices. We use data from the Linux Kernel Mailing List. Our work focuses on extracting clean messages from the raw data and on performing keyword extraction and summarization on individual messages and patch discussions. Our initial results indicate that most clean messages can be extracted with regular expressions, and that keyword extraction and summarization can be accomplished with the help of large language models such as GPT-4.
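
To illustrate the kind of regular-expression cleaning described above, here is a minimal sketch in Python. It is not the project's actual pipeline: the function name clean_message and all four patterns are assumptions chosen for illustration, based on common mailing-list conventions (quoted reply lines starting with ">", "On ... wrote:" attribution lines, patch trailers such as Signed-off-by, and a "---" or "diff --git" line marking the start of the patch itself).

```python
import re

# Hypothetical patterns; the expressions actually used in the project
# are not given in the source.
QUOTED_LINE = re.compile(r"^[ \t]*>.*$", re.MULTILINE)
ATTRIBUTION = re.compile(r"^On .+ wrote:[ \t]*$", re.MULTILINE)
TRAILER = re.compile(
    r"^(Signed-off-by|Reviewed-by|Acked-by|Tested-by|Cc):.*$",
    re.MULTILINE,
)
DIFF_START = re.compile(r"^(---[ \t]*$|diff --git )", re.MULTILINE)


def clean_message(raw: str) -> str:
    """Return the human-written text of a mailing-list message body."""
    # Drop everything from the first patch separator or diff header onward.
    match = DIFF_START.search(raw)
    body = raw[:match.start()] if match else raw

    # Remove quoted replies, reply attribution lines, and patch trailers.
    for pattern in (QUOTED_LINE, ATTRIBUTION, TRAILER):
        body = pattern.sub("", body)

    # Collapse the runs of blank lines left behind by the removals.
    body = re.sub(r"\n{3,}", "\n\n", body)
    return body.strip()


if __name__ == "__main__":
    sample = (
        "On Mon, Oct 2, 2023, Alice wrote:\n"
        "> Does this fix the race?\n"
        "\n"
        "Yes, the lock is now taken before the list walk.\n"
        "\n"
        "Signed-off-by: Bob <bob@example.com>\n"
        "---\n"
        " drivers/foo.c | 2 +-\n"
        "diff --git a/drivers/foo.c b/drivers/foo.c\n"
    )
    print(clean_message(sample))
```

A rule-based pass like this is cheap and easy to audit, which fits the observation that most messages can be cleaned this way. Messages that do not follow these conventions would need additional patterns or a different approach.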