Projects | DataFirst

AI Ethics for Smart Health through Smart Watches

Sun, 01 Jan 2023 00:00:00 +0000

Description

Lots of personal data can be obtained from wearable devices such as smart watches. This data can be used to improve health, for example to learn to detect health problems and to check whether people adhere to doctor’s exercise recommendations. This project will conduct a thorough study of the ethical issues in using AI systems in this domain, with recommendations of how AI systems for smart health should be designed with ethical considerations in mind.

Students

Advisors

Yolanda Gil

What students will learn

What kinds of health-related data can be captured through wearable devices, what kinds of analyses are possible, privacy and ethical aspects of personal applications for smart health.

AI/ML assisted fault detection in foundry processed devices

Sun, 01 Jan 2023 00:00:00 +0000

Description

Highly accurate fault detection in foundry produced microelectronics is crucial to ensuring quality of devices that leave the foundry. However, current defect detection flows are human-centric, which produces a bottleneck. The objective of this project is to leverage recent advances in AI/ML to develop automated techniques that can 1) identify manufacturing defects in microelectronics using imagery collected at the foundry, and 2) determine whether the identified defect will impact the performance of the manufactured component.

Students

Advisors

What students will learn

Students will learn about manufacturing defect detection algorithms, machine learning techniques, and microelectronics fabrication.

Analyzing Open Source Software Ecosystems

Sun, 01 Jan 2023 00:00:00 +0000

Description

Open source runs a lot of the world's critical software systems, but there is much that's unknown in how maintainers, developers and other parts of the software ecosystem function. Help us analyze a large corpus of open source data — both source code and patch conversations — to better understand them! We'll study things like rise to influence, authorship styles, malware analysis, topic modeling and social network analysis!

Students

Advisors

What students will learn

We'll touch on using LLMs to parse text messages and analyze code, graph databases, program analysis, and social network analysis, among other skills.

Application of AI, ML and NLP in understanding and preventing a serious aviation safety problem in the US - Runway Safety

Sun, 01 Jan 2023 00:00:00 +0000

Description

This project, which is co-advised by Dr. Yolanda Gil, will use AI/ML/NLP to understand the root causes of one of the serious aviation safety problems in the US: runway incursions. The Aviation Safety Reporting System, which is administered by NASA and is an untapped treasure trove of textual data, will be used for this project.

Students

Advisors

Najmedin Meshkati

What students will learn

Using AI/ML/NLP and working on the data from a major global industry - aviation.

Assessing the California Public Sector Job Market

Sun, 01 Jan 2023 00:00:00 +0000

Description

Public sector institutions at local, state, and federal levels are facing an unprecedented hiring crisis in competition for new talent. Yet there is no systematic understanding of the needs and openings across these levels of government to inform stakeholders such as universities, community colleges, and high schools on the current and emerging hiring trends in what constitutes approximately 15-20% of the entire labor market. In this project, students will develop algorithms that continuously scrape relevant job sites used by these governments to assess both developed and emerging hiring trends by aptitudes, professions, entry levels, mobility, location, and other important attributes. In so doing, the project will inform researchers in public policy, public administration, political science, and labor economics as well as practitioners in government and associated stakeholders.

Students

Advisors

William Resh

What students will learn

Students will learn how to develop and organize labor market data to be used by practitioners and researchers through the construction of a portal that can transform data into usable aggregated statistics and graphs.

Auditing web content promoting eating disorders

Sun, 01 Jan 2023 00:00:00 +0000

Description

The pandemic worsened eating disorders among adolescents, particularly girls. Eating disorders, psychiatric illnesses in which an adolescent tries to control their weight with severe food restriction (anorexia) or purging (bulimia), can be fatal. How much do web content and the algorithms that power web search contribute to eating disorders? Imagine a girl who is unhappy about her weight and looks for dieting tips online. It does not take long for search algorithms to lead her to extreme content that promotes anorexia and extreme weight loss. The goal of this project is to audit the web for potentially harmful content and communities that promote eating disorders.

Awards

Best Interdisciplinary Data Science Teamwork
Highlighted Project

Students

Advisors

Kristina Lerman

Skills Required by the team

Statistics
R
Python
NLP
Machine Learning

Final presentation resources

Final presentation

Automated question type coding of forensic interviews

Sun, 01 Jan 2023 00:00:00 +0000

Description

Question type coding is used in research on forensic interviewing to distinguish between best-practice open-ended questions and the closed-ended and leading questions that interviewers are trained to avoid. Most research teams in the field rely on a time-consuming and labor-intensive method of question type coding whereby a researcher codes every question in the interview, and a second researcher codes a subset to demonstrate inter-rater reliability. We are currently working with a graduate of the Master's in Computer Science program at USC on a project exploring automated question type coding of forensic interviews with victims of child abuse. In collaboration with the student, we have trained a large language model (RoBERTa) to distinguish between question types based on a rudimentary classification system. In the next stage of the project, we aim to fine-tune the model and use zero-shot and few-shot prompting to make distinctions for which there is limited manually coded data.
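Inter-rater reliability for categorical codes is often summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch, assuming two coders labelled the same set of questions; the description does not say which statistic the team uses, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(coder_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for six interview questions (illustrative only).
primary = ["open", "open", "closed", "leading", "closed", "open"]
second = ["open", "closed", "closed", "leading", "closed", "open"]
kappa = cohens_kappa(primary, second)
```

With these made-up labels the coders agree on five of six questions, and kappa comes out lower than the raw agreement rate because some of that agreement is expected by chance.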

Students

Advisors

What students will learn

Students will learn to train and fine-tune large language models.

Bad Writing is "Fine": Tuning an LLM to Suggest Improvements

Sun, 01 Jan 2023 00:00:00 +0000

Description

Prototype an approach to fine-tune a large language model (LLM) to help diagnose areas for improvement in a specific writing product. For example, scientific papers require consistent language, but in creative writing variety matters. Proposed steps are:

Writing Product: Coordinate with project mentors to choose a common and important writing product, such as a position paper or an academic conference paper. Identify/gather a rubric and a corpus.
Inject Bad Writing: For each element of the rubric, develop prompts for generative AI to decrease the quality of writing based on the rubric (i.e., make it worse). This will form a training data set pairing each good example with a version that is worse on certain characteristics.
Fine Tune: Students will be expected to attempt to fine-tune an LLM (e.g., Llama 2) based on this synthetically generated data.
Evaluate: Research whether tuning suggests better domain-specific areas to improve.

This project aligns with ongoing work with the USC Generative AI Center.

Students

Advisors

Benjamin Nye

What students will learn

Generative AI for large language models. Generating synthetic data for a rubric. Fine tuning a large language model, likely using CARC (the on campus computing cluster). Understanding intelligent tutoring system design fundamentals for modeling how experts diagnose issues from novices.

Build a multilingual decipherment system

Sun, 01 Jan 2023 00:00:00 +0000

Description

We will build a working system that can decipher a letter substitution cipher into 14 languages and beyond, based on https://aclanthology.org/2021.acl-long.561/, and then apply it to languages it has never seen.
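For orientation, a letter substitution cipher replaces each plaintext letter with another letter according to a fixed one-to-one key. The sketch below only shows enciphering and inverting with a known key. It is not the decipherment method of the cited paper, which has to recover the plaintext without the key. The function names and sample text are hypothetical.

```python
import random
import string


def make_key(seed):
    """Return a random one-to-one mapping over lowercase ASCII letters."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))


def apply_key(text, key):
    """Substitute each mapped letter; leave all other characters unchanged."""
    return "".join(key.get(ch, ch) for ch in text)


def invert_key(key):
    """Swap the mapping direction so ciphertext letters map back to plaintext."""
    return {cipher: plain for plain, cipher in key.items()}


key = make_key(seed=0)
plaintext = "the quick brown fox"
ciphertext = apply_key(plaintext, key)
recovered = apply_key(ciphertext, invert_key(key))
```

Because the key is a permutation of the alphabet, the ciphertext keeps the letter-frequency profile and word boundaries of the plaintext, and decipherment systems exploit that structure when the key is unknown.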

Students

Advisors

Jonathan May

What students will learn

How to read and understand an NLP paper, unusual applications of transformers, and how to conduct a reproduction study.

Building a Platform for NFL Data Insights

Sun, 01 Jan 2023 00:00:00 +0000

Description

Open source sports data such as the nflverse has led to a massive increase in public sports analytics. But it's still hard to process, subset, visualize and analyze this data. This project will build a general-purpose analysis platform and dashboard, similar to what many teams use internally. Using the nflfastR data, this platform will allow interested individuals to select the play parameters they're interested in, and will provide relevant analysis, visualization and insight. Ideally, we'll set up the dashboard on the internet, and open source the project, allowing others to expand the available datasets, analyses and visualizations.

Students

Advisors

Jeremy Abramson

What students will learn

How to analyze and present insights from NFL play-by-play data

Determinants of spatial variations in broadband quality and prices

Sun, 01 Jan 2023 00:00:00 +0000

Description

There is anecdotal evidence that broadband service offers vary in quality and price along income and racial lines. This project seeks to validate these claims by collecting data about service speeds and prices for all known serviceable locations in Los Angeles County, and merging it with sociodemographic variables from the Census Bureau and other sources. Other researchers have already developed prototype code for scraping data from ISPs' websites, but their study only covered a small percentage of addresses in LA County. The analysis will probe for evidence of a "poverty penalty" whereby residents of poorer areas in Los Angeles are offered higher-cost, lower-quality broadband services.

Awards

Best Presentation
Highlighted Project

Students

Advisors

Hernan Galperin

Skills Required by the team

Python
PyTorch

Final presentation resources

Final presentation

Does Municipal Broadband Deliver as Promised? An examination of broadband pricing and household adoption in areas served by muni networks.

Sun, 01 Jan 2023 00:00:00 +0000

Description

Broadband networks owned and/or operated by local governments ("muni networks") are increasingly seen as a key tool to close the digital divide in Internet availability and adoption. There is, however, only anecdotal evidence about whether muni networks deliver on the promise of more affordable broadband in communities of little interest to traditional ISPs, typically disadvantaged communities. Taking advantage of the greater level of resolution in the new FCC broadband availability maps, this project will examine broadband pricing and adoption at the address level in areas served by muni networks, using a matched sample of comparable areas as a reference point. The goal of the project is to empirically assess whether muni networks are delivering on the promise of more affordable services, and whether this results in more household adoption than expected. The project is a component of an ongoing collaboration with digital equity advocacy organizations.

Students

Advisors

Hernan Galperin

What students will learn

Students will have the opportunity to apply data scraping, organization and analysis skills in the context of policy analysis

Event Forecasting using Efficient and Expressive Temporal Knowledge Graph

Sun, 01 Jan 2023 00:00:00 +0000

Description

Temporal Knowledge Graph (TKG) models incorporate temporal aspects of facts into the learning processes of their graph neural networks (GNNs) to predict temporally conditioned facts. These models capture the temporal dynamics of the facts well and are well-suited for temporally conditioned graph completion tasks. However, many open issues need to be addressed to make these models more practical for real-world applications: (1) Real-world problems usually do not conform to graph completion tasks, which fill in the missing element of a fact; (2) Sparse graphs, caused by a lack of temporal triples for the target task, lead to poor performance; (3) Temporal graphs are inherently dynamic entities that grow and change over time, but most existing models require computationally expensive training from scratch to incorporate these changes. Thus, we propose to design an efficient event forecasting framework that addresses these challenges.

Students

Advisors

Skills Required by the team

Python
Web Crawling

Federated Learning for Neuroscience

Sun, 01 Jan 2023 00:00:00 +0000

Description

Federated learning is an approach to distributed deep learning without sharing data. Multiple sites train a neural network over private data. The parameters of the neural network are shared with a federation controller, but they are encrypted before sharing. Model aggregation is performed under fully homomorphic encryption. We propose to apply federated learning to several problems in neuroscience, such as predicting Alzheimer's, Parkinson's, epilepsy, and autism, possibly over multimodal data.
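The aggregation step can be illustrated with federated averaging, in which the controller computes a weighted mean of the parameters received from each site. This is a minimal sketch under stated assumptions: parameters are plain Python lists, the weights are local sample counts, and encryption is omitted entirely, so it does not represent the homomorphic-encryption pipeline the project describes. All names are hypothetical.

```python
def federated_average(site_params, site_sizes):
    """Weighted mean of per-site parameter vectors, weighted by sample count."""
    if not site_params or len(site_params) != len(site_sizes):
        raise ValueError("need one sample count per site")
    dim = len(site_params[0])
    if any(len(params) != dim for params in site_params):
        raise ValueError("all sites must share the same parameter shape")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    # Average each parameter position across sites, weighting by site size.
    return [
        sum(params[i] * size for params, size in zip(site_params, site_sizes)) / total
        for i in range(dim)
    ]


# Hypothetical parameters from three sites; raw data never leaves a site.
params = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sizes = [100, 100, 200]
global_params = federated_average(params, sizes)
```

Weighting by sample count lets sites with more data contribute more to the shared model; in this toy input the third site holds half of all samples and pulls the average toward its parameters.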

Students

Advisors

Jose-Luis Ambite

What students will learn

Federated learning, machine learning for biomedical applications.

Human Bio-signals as a Function of Indoor Air Quality Control for Human Health in Buildings

Sun, 01 Jan 2023 00:00:00 +0000

Description

After the COVID-19 pandemic, the remote work environment became popular, and many commercial offices have tried to keep both the in-person work environment and the remote work environment while attempting to reduce the size of their workplaces. Even though hot desking systems are increasingly common and even starting to feel like a trend these days, there isn't much data to show how such systems can support occupants' environmental comfort, work productivity, and psychological stability while they are at work. Therefore, this project focuses on investigating how satisfied occupants are with their new workplace platform and how much a hot desking system affects their work productivity, environmental satisfaction, etc. The project's findings will help design this new desking system in a way that will increase occupants' satisfaction with their surroundings and productivity at work without compromising their quality of life in the workplace.

Students

Advisors

Joon-Ho Choi

Skills Required by the team

R
Python
Matlab
Statistics

Identifying Causal Pathways from Online to Offline Systems

Sun, 01 Jan 2023 00:00:00 +0000

Description

Students will learn to collect and analyze data with the goal of identifying causal pathways between online and offline systems. Video games present a rich set of opportunities for this analysis, and we will begin by studying them. We will start with the effect that video games have on culture, movement, and discussion offline. We will then transition to Reddit activity, where certain communities can spur offline action. Students will not only collect data, but also apply state-of-the-art causal detection algorithms alongside PhD students studying the same phenomena.

Students

Advisors

Fred Morstatter

Skills Required by the team

Python

Learning and forgetting in neural networks

Sun, 01 Jan 2023 00:00:00 +0000

Description

In this project, you will examine the mechanisms responsible for forgetting previous tasks in artificial neural networks. You will study how those mechanisms shape the behavior of neural networks learning from heterogeneous data distributions. You will investigate how new information is stored in neural networks by plotting and interpreting the neuron activation patterns. You will also compare different learning schemas, and you will examine how they influence the final loss function landscape.

Students

Advisors

Marcin Abram

What students will learn

How the information is stored in neural networks. How neural networks can forget how to perform previously mastered tasks. How to interpret neural networks (by examining the neuron activation patterns). How to conduct scientific experiments (in the domain of machine learning). How to present and visualize scientific data.

Natural language processing of safety reports in nuclear power plants

Sun, 01 Jan 2023 00:00:00 +0000

Description

This project, which will be co-advised by Dr. Yolanda Gil, will use Natural Language Processing (NLP) techniques to analyze voluminous Diablo Canyon Independent Safety Committee (DCISC) annual reports to identify the role and contribution of "Traits of a Healthy Nuclear Safety Culture", as defined by the Nuclear Regulatory Commission and the Institute of Nuclear Power Operations, in incident causation.

Students

Advisors

Najmedin Meshkati

What students will learn

Application of NLP in the real world, working on very serious and important issues with global applications, in a way that can be generalized and applied to other safety-sensitive technologies.

Natural language processing of safety reports in nuclear plants and aviation

Sun, 01 Jan 2023 00:00:00 +0000

Description

This project will focus on extracting concise structured information about incidents in nuclear plants and in aviation that are currently described in lengthy document reports. By structuring this information and mapping it to safety models and standards, we can help improve their operations.

Awards

Best Data Science Teamwork

Students

Advisors

Skills Required by the team

Data Analysis
SQL
Clinical knowledge
Statistics
Python
AWS Sagemaker

Networked social influence in large-scale networks

Sun, 01 Jan 2023 00:00:00 +0000

Description

We recently published a novel algorithm for measuring influence among people. What's different about it is that it doesn't rely on social media. It takes behavior data, such as might exist within a company's databases, and is able to say "This person is causing this person to do X." This technique is broad and powerful and has many social and business applications. We're also fortunate to have a lot of data from corporations to explore, including a dataset of over 100 million people on charitable giving, several datasets of video game play, and one on commercial travel. What we need are smart students who can learn the algorithm and help us run it, answering questions of both scholarly and commercial interest. We need horsepower, and we're happy to help train, advise and add students to the many publications we see coming out of this effort.

Students

Advisors

Skills Required by the team

Statistics
R
Python
NLP
Machine Learning

Predicting the possibility of escalation of care for specific cohorts admitted to ICU

Sun, 01 Jan 2023 00:00:00 +0000

Description

To predict the possibility of escalation of care for specific cohorts admitted to the Keck ICU, with escalation defined by the following:
• Increase in pressors
• Intubation
• Trip to operating room within 24 hours
• Starting dialysis
The objective is to predict, on a real-time or near-real-time basis, whether someone is likely to require escalated care as defined above. Prediction of the likelihood of such a care escalation need is assumed to be an optional/secondary requirement at this stage.

Students

Advisors

Neil Bahroos

Skills Required by the team

Statistics
R

Pyleoclim: A Python Package for the Analysis of Paleoclimate Data

Sun, 01 Jan 2023 00:00:00 +0000

Description

Paleoclimate timeseries data are crucial to understanding how climate has changed in the past. A major aspect of this work falls under exploratory analysis, and in particular, visualization. Pyleoclim contains many functionalities for timeseries analysis of paleoclimate data and has already been used in teaching and research settings. In the coming months, we are expanding several functionalities of the package to address growing community needs: outlier detection, automated visualizations, and automated checks for the validity of datasets loaded into the package. In addition, these new functionalities will be integrated into tutorials distributed through a Jupyter Book.

Advisors

What students will learn

Timeseries analysis, Python packaging, continuous integration, containerization, GitHub, Jupyter, Binder.

Regular Data: Quality health monitoring while you sit

Sun, 01 Jan 2023 00:00:00 +0000

Description

To create a software-as-a-service data pipeline for collecting health biomarkers via an instrumented toilet seat. This clinical data management system (CDMS) will enable the capture of clinical-grade, high-quality, curated, and consistent data that meets NIH and FDA standards of clinical utility.

Students

Advisors

Francisco Valero-Cuevas

What students will learn

Engineering and software architecture skills to create the data pipeline from instrument signals to health reports compatible with reimbursement, research and clinical data systems.

The value of player tracking technology in assessment of training volume in youth soccer players.

Sun, 01 Jan 2023 00:00:00 +0000

Description

Wearable IMU and GPS sensors are used to estimate training volume in professional sports to ensure adequate preparation for competition. Recommendations to modify training are individualized to meet the health and performance needs of each player. The expense associated with these systems restricts their use in youth sports. More economical alternatives are available, but it is not clear if the data are comparable. This project will compare data collected concurrently to determine if the economical system provides similar conclusions. Players from the LA Galaxy (second and developmental teams) will participate. Continuous data from triaxial accelerometers, gyroscopes and GPS will be collected during practice and competition. Raw data and commercially derived key performance indicators will be compared to understand the agreement between the systems. Acceptable agreement, with comparable interpretation of data with respect to training recommendations, may allow for greater access for young athletes.

Students

Advisors

Susan Sigward

Skills Required by the team

Python
Statistics
Econometrics

Transition of Care

Sun, 01 Jan 2023 00:00:00 +0000

Description

Analytics to determine which tertiary and quaternary care patients would have the best outcomes at Keck Medical Center. Analyze data from incoming transfer patients to look at data quality, evaluate clinical records and predict outcomes.

Awards

Best Website
Highlighted Project

Students

Advisors

Neil Bahroos

Skills Required by the team

Python
Web Scraping

Final presentation resources

Final presentation

Understanding the Relation Between Noise and Bias in Annotated Datasets

Sun, 01 Jan 2023 00:00:00 +0000

Description

When it comes to classification tasks, much previous work has tried to design larger and more complex neural networks. Recently, the data-centric AI line of research has shifted the focus to the quality of the training data. This shift arises from the recognition that the annotations associated with dataset instances can exhibit both noise, stemming from vague instructions or human errors, and bias, arising from differing perspectives among annotators in response to given prompts. In this project, our objective is to bridge the gap between the two lines of research: one dedicated to identifying noisy instances and the other striving to account for the diverse perspectives of annotators. Specifically, we will delve into the domain of offensive text detection datasets, a highly subjective task. Our investigation will center on whether perspectivist classification models have effectively harnessed valuable information from instances flagged as noisy by noise-detection techniques.

Students

Advisors

Negar Mokhberian

What students will learn

The student will learn the importance of individual instances and individual annotations in training the classification models. Each of these datapoints can introduce either useful signal or noise to the model and the student will learn to recognize the difference.

Urban Futures Data Core

Sun, 01 Jan 2023 00:00:00 +0000

Description

Cities are the focal point of economic, social, and environmental challenges and opportunities. To establish USC as a thought leader and partner of choice to tackle the challenges of the urban future, the USC Sol Price School of Public Policy and the USC Marshall School of Business propose establishing an Urban Futures Data Core to serve as a university-wide hub for data analysis and dissemination. Students working on this project will work with all faculty at Price and Marshall to catalogue the publicly available, restricted-use, and self-collected datasets that USC researchers have previously used. They will then create a secure website to track each data source and its data use agreements, dates of availability, and geographic level of granularity. After a data website is constructed, students will have the opportunity to assist with creating geographic visualizations of key indices related to urban futures.

Students

Advisors

Alice Chen

What students will learn

The students will learn about all data sources used in public policy and business, data management, and web design.

Utilizing AI Generated Images for Object Detection and Classification

Sun, 01 Jan 2023 00:00:00 +0000

Description

Developing image-based object detection and classification models requires significant time, resources, and effort. In particular, acquiring a good training dataset is essential. However, in some cases it is very hard to get quality data, such as rare events (e.g., disasters) or settings that are expensive to capture (e.g., faraway places). With the development of generative AI, we might produce synthetic images to enhance the quality of a dataset by filling in for the missing images. Based on our prior work in object detection and classification for smart city applications, we would like to explore the potential of AI-generated images for enhanced object detection and classification.

Students

Advisors

Seon Ho Kim

What students will learn

Image machine learning, object detection

Automatic Discovery of News Articles: Case of Police Misconduct

Sat, 01 Jan 2022 00:00:00 +0000

Description

The PMR (Police Misconduct Registry) is a database of officers who have been terminated or have resigned in lieu of being fired for misconduct. The objective of PMR is to increase the public trust and legitimacy of law enforcement officers serving the community while also helping departments hire the best possible candidates. The PMR is continually populated with all instances of police misconduct anywhere in the United States.

Currently, data entries are manually identified, discovered and registered using public, open-source information, mostly news articles on the web, which critically limits the data collection process. Thus, this project aims at automating the discovery of such data with an efficient identification mechanism. Working with the Price School of Policy, we will implement an automatic identification mechanism to effectively search for police misconduct articles utilizing web crawling/scraping and natural language processing.

Awards

Best Presentation

Students

Advisors

Seon Ho Kim

Skills Required by the team

Python

Characterizing Online Attitudes, Expectations, and Concerns about Novel Medical Treatments

Sat, 01 Jan 2022 00:00:00 +0000

Description

Novel or hypothesized medical treatments, such as COVID-19 vaccines and male contraception, are regularly discussed on social media. For example, the AskReddit subreddit features questions of the form “Would you take [x] if it existed?” Aside from willingness to use these novel treatments, the answers to these questions contain important clues to people’s latent concerns and barriers to adoption of novel medications. Understanding them can provide crucial information about how to introduce, communicate, and counsel about new medications when they come to market. In this project you will use pre-collected Reddit data spanning 10 years to answer questions including: What concerns do individuals have about a novel medication? How do these concerns vary by demographics, such as cultural background? How have these concerns evolved over time? What has caused users to become more or less accepting of the treatment over time?

Awards

Best Presentation

Advisors

Fred Morstatter

Skills Required by the team

Python
Machine Learning

Characterizing the counter-narratives of climate change (Spring - 2022)

Sat, 01 Jan 2022 00:00:00 +0000

Description

Top climate scientists post their findings and views regularly on social media. These very scientists are met with tweets from those with opposing views, often containing vitriolic and false information. It is important that we can identify and characterize these tweets to understand the counter-narratives of climate change. We will address topics including false information, bot campaigns, and harassment.

Advisors

Skills Required by the team

Python
Statistics
Classification
Data Collection

Community Economic Tool

Sat, 01 Jan 2022 00:00:00 +0000

Description

Determining what makes a region “most attractive” for new business will involve more exploratory research in determining what variable(s) are most indicative of potential economic growth opportunities. Additional research will be conducted to identify various predictors of our indicator variable, such as unemployment rate and educational level of tract level residents, neighboring tract residents, and broadband accessibility. That is to say, “What does this region of Miami do?” What industries are the largest employers in each region and are they also the ones generating the most revenue? Once each region has been properly identified, one can identify the dominant predictors of economic growth within this region and compare to other tracts with the same dominant industry structure. This will highlight the role geographic regions play in economic development and growth of various industries.

Advisors

Palak Agarwal

Skills Required by the team

Python
Statistics
Machine Learning

Decoding How Humans Encode Memories

Sat, 01 Jan 2022 00:00:00 +0000

Description

Advancements in closed-loop deep brain stimulation (DBS) have enabled more intelligent autonomy for therapeutic intervention across a wide range of neurologic and psychiatric disorders. The predominant approach relies on control-theoretic approximations of the brain’s complex functional relationships with the external environment, in particular a mapping between targeted stimulation and naturalistic responses of different regions of the brain. However, existing approaches fail to capture the environmental context of neuronal biomarkers. Thus, we leverage a set of IoT sensors to capture the human experience and environmental context, i.e., a subset of human sensory channels, in order to estimate the state of the human brain and provide the foundation for smarter, context-dependent DBS. We explore neural-symbolic approaches that integrate the powerful perception capabilities of deep learning with human logic to reason about the complex dependencies across a heterogeneous set of sensors.

Advisors

Luis Garcia

Skills Required by the team

Python
Deep Learning

Decoding How Humans Encode Memories (Fall - 2022)

Sat, 01 Jan 2022 00:00:00 +0000

Description

In this project, we will work on semantically aligning IoT sensor data and neural data with the human experience. We have developed a sensor platform to record the human experience as patients perform navigational tasks. The goal is to understand what context shifts in the human experience anchor our episodic memories. Each student will have the opportunity to work with different sensing modalities, and develop models for both sensory perception and reasoning.

Students

Advisors

Luis Garcia

Skills Required by the team

Python
PyTorch
Tensorflow

Hot desking system guarantee your productivity? : Investigation of a first-come-first-served workplace system focusing on the occupants’ work productivity and wellness

Sat, 01 Jan 2022 00:00:00 +0000

Description

Since the COVID-19 pandemic, remote work has become popular, and many commercial offices have tried to maintain both in-person and remote work environments while reducing the size of their workplaces. Even though hot desking systems are increasingly common, there is little data showing how such systems can support occupants' environmental comfort, work productivity, and psychological stability while they are at work. This project adopts a commercial office in downtown Los Angeles as a testbed and conducts questionnaire surveys and indoor environmental quality measurements. The project's findings will help design this new desking system in a way that increases occupants' satisfaction with their surroundings and productivity at work without compromising their quality of life in the workplace.

Students

Advisors

Joon-Ho Choi

Skills Required by the team

Machine Learning
Statistics
R
WEKA

Identification and characterization of cross-platform misinformation diffusion

Sat, 01 Jan 2022 00:00:00 +0000

Description

Fringe communities are often the sources of conspiracy theories and extreme ideas. Niche online platforms hosting such communities are suitable incubators for reinforcing questionable stories and ultimately pushing them into the mainstream. Over the years, information pathways from fringe to mainstream media have significantly increased, enabling the proliferation of harmful content. This project aims to develop novel network- and AI-based models for identifying and characterizing information pathways that enable the proliferation of potentially harmful content on online media channels. Mainstream social media (Twitter, Facebook, and Instagram), video streaming platforms (YouTube and Bitchute), niche platforms (Gab, 4chan, and Parler), and messaging apps (Telegram) will be considered to investigate how harmful narratives flow across diverse platforms and to predict those that will gain traction on mainstream media.

Awards

Best Cyberphysical Data Science
Best Data Science Open and Sharing Practices

Students

Advisors

Luca Luceri

Skills Required by the team

Python
Statistics
Data Visualization
Social Network Analysis

Identification of Sustainability-Related Research at USC through Machine Learning and Keyword Mapping

Sat, 01 Jan 2022 00:00:00 +0000

Description

Are you passionate about data science and sustainability? Then this interdisciplinary project is for you! Here, we will develop a machine learning program to classify USC research publications and grants as ‘sustainability-focused’, ‘sustainability-inclusive’ or ‘not-sustainability-related’ by using pre-categorized publication samples. In addition, we will use keyword lists that relate to the 17 UN Sustainable Development Goals (SDGs) to map all research groups at USC as they relate to these SDGs (https://sdgs.un.org/goals). Lastly, we will create an interactive dashboard in R Shiny that will act as a public directory of all research at USC, classified by SDG and by broader sustainability category. As an example, check our GitHub repository for the USC curriculum: https://github.com/USC-Office-of-Sustainability/USC-SDGmap . Your work on this project is critical in boosting sustainability-related research at USC and thereby achieving our Asgmt: Earth Research Goals.
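The keyword-mapping step described above can be illustrated with a minimal Python sketch. It is not the project's implementation (the description names R Shiny for the dashboard and does not give the keyword lists); the three keyword lists below are hypothetical placeholders, and the function name `map_to_sdgs` is an assumption made for illustration.

```python
import re

# Hypothetical keyword lists for illustration only; the project's real
# lists cover all 17 SDGs and are not reproduced here.
SDG_KEYWORDS = {
    "SDG 6: Clean Water and Sanitation": ["water quality", "sanitation", "wastewater"],
    "SDG 7: Affordable and Clean Energy": ["solar", "renewable energy", "wind power"],
    "SDG 13: Climate Action": ["climate change", "greenhouse gas", "carbon emissions"],
}


def map_to_sdgs(text, keywords=SDG_KEYWORDS):
    """Return {goal: [matched keywords]} for every goal with at least one match."""
    lowered = text.lower()
    matches = {}
    for goal, terms in keywords.items():
        # Word boundaries keep "solar" from matching inside "solarium".
        found = [t for t in terms
                 if re.search(r"\b" + re.escape(t) + r"\b", lowered)]
        if found:
            matches[goal] = found
    return matches


abstract = "We model wastewater treatment powered by renewable energy."
print(map_to_sdgs(abstract))
```

Plain keyword matching of this kind is easy to audit but misses synonyms and context, which is one reason the description pairs it with a trained classifier.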

Awards

Best Data Science Collaboration Practices
Best Data Science Teamwork
Highlighted Project

Students

Advisors

Julie V. Hopper

Skills Required by the team

R
Python
Web Scraping
Machine Learning

Final presentation resources

Final presentation

Identifying Catalysts for Online Depolarization

Sat, 01 Jan 2022 00:00:00 +0000

Description

None

Students

Advisors

Kristina Lerman

Intelligent Analytics and Integration of Internet Memes

Sat, 01 Jan 2022 00:00:00 +0000

Description

Internet memes are a popular tool for creatively expressing ideas on the Web. Machine understanding of memes would benefit researchers interested in trends, virality of topics, and hate speech. However, understanding memes is difficult, as it requires combining text, vision, and extensive background knowledge. Recently, we created an Internet Meme Knowledge Graph that contains rich information about thousands of entities. In this project, we will perform extensive profiling of the Internet Meme Knowledge Graph, enrich it with other sources, and store its knowledge into a centralized resource like Wikidata.

Students

Advisors

Skills Required by the team

Python
Knowledge Graphs
Statistics
Wikidata
APIs

Knowledge-powered understanding of diet’s water footprint

Sat, 01 Jan 2022 00:00:00 +0000

Description

Food production and its supply chains are the main cause of water pollution, especially in emerging and developed countries. Understanding the connection between our diet and water pollution is extremely challenging because it requires combining knowledge about food production, supply chains, and transportation. As more and more people seek to live sustainably, there is a need to inform consumers about the planetary impacts of their choices. We propose to construct a knowledge graph and an application that together provide a water footprint calculator for our dietary choices: it computes the water footprint of each ingredient of a meal proposed by the user and suggests alternatives that reduce the water footprint (e.g., replacing beef with pork in a cheesesteak meal will save the planet 1,600 gallons of water).

Awards

Best Data Science Insight

Students

Advisors

Skills Required by the team

Knowledge Graphs
Python
Information Extraction
Software Engineering
Databases

Listen to your body: Human Bio-signals as a Function of Indoor Air Quality Control for Human Health in Buildings

Sat, 01 Jan 2022 00:00:00 +0000

Description

The COVID-19 pandemic has affected the lives of millions of people worldwide. Because COVID-19 can spread through airborne aerosols, raising concerns about how the virus behaves in enclosed spaces, indoor air quality needs to be improved urgently. To estimate the time lag of air quality transmission between indoor and outdoor environments, it is necessary to gather sufficient and reasonably accurate data before analyzing it with machine learning models and statistical tools. This will allow us to investigate the connection between indoor air quality and human physiological responses. Three types of data must be collected: participants' bio-signals, survey results, and the state of the indoor and outdoor environments. The bio-signal data comprise the occupants' heart rate, skin temperature, stress level, and electrodermal activity (EDA).

Students

Advisors

Joon-Ho Choi

Skills Required by the team

R
Machine Learning
Statistics

Machine Learning Enabled Fault Detection and Diagnosis of Quantum Circuits

Sat, 01 Jan 2022 00:00:00 +0000

Description

This is an interdisciplinary data science project that draws on quantum information theory and machine learning. In this project we plan to develop and implement a novel approach to substantially improving the performance of quantum computers using advances in machine-learning-enabled fault detection and diagnosis. We will adapt and further develop existing machine learning protocols to efficiently and reliably detect and diagnose faulty quantum circuits. The protocols are expected to reach beyond the capabilities of the current state of the art in error diagnosis of quantum circuits, and to provide detailed and transparent information about various sources of errors in the quantum circuits with significantly fewer queries to the quantum circuit and considerably fewer repeated experiments. This project will allow students to acquire expertise in topics that span quantum information theory, quantum computing, and machine learning.

Advisors

Amir Kalev

Skills Required by the team

Python
Machine Learning

OSINT Social Networks on GitHub

Sat, 01 Jan 2022 00:00:00 +0000

Description

Open-source intelligence (“OSINT”) is a rapidly growing area of cybersecurity. This project seeks to explore OSINT information available on GitHub. We’ll use the GitHub API and related tools to build networks to try to answer a number of interesting questions, such as “can you tell what software a company uses based on its employees’ networks?”, “do white hat hackers have social networks that look different from those of black hatters?” and others. If you’re interested in cybersecurity, OSINT, social networks, databases, APIs, etc., then this is the project for you!

Advisors

Jeremy Abramson

Skills Required by the team

Python
Databases
GraphQL
Neo4j

Quantum Natural Language Processing for Fake News Identification

Sat, 01 Jan 2022 00:00:00 +0000

Description

Advancements in artificial intelligence, especially neural networks, have enabled more intelligent models that can distinguish between fake and real information. However, these models suffer from over-fitting: a phenomenon where models memorize certain patterns in the dataset instead of understanding the actual underlying task. This prevents the models from generalizing well, especially across domains. Quantum Natural Language Processing (QNLP) is a nascent field in which quantum computers solve NLP problems. It has been shown that QNLP models can solve many of the aforementioned tasks that are difficult for neural networks. This is attributed to the fact that QNLP models naturally incorporate rich linguistic meanings and structure. In this project we will create neural-network-like models for QNLP. This will be done on fact verification datasets, with the goal of improving the quality of fake news identification.

Awards

Best Interdisciplinary Data Science Team

Students

Advisors

Mitch Paul Mithun

Skills Required by the team

Python
Deep Learning
NLP

Scientific Concept Discovery: Using Machine Learning to Advance Scientific Research

Sat, 01 Jan 2022 00:00:00 +0000

Description

Our group focuses on the question of how to design a learning framework that promotes the generalizability of machine learning models. In this project, you will focus on exploring how neural networks acquire information from the training examples and how they learn to solve various physical problems (e.g., emulation of simple quantum systems). The premise of this project is that by observing how a machine learning model learns to solve a specific task, we can learn about the underlying problem itself. As an example, by analyzing the weights of a trained neural network, you can discover non-trivial symmetries of the modeled physical system, determine the relative importance of features, or identify some non-trivial interplay between underlying physical mechanisms. Your task will be to learn various tools for interpreting deep neural networks. You will test them in practice and explore methods that promote model transparency and interpretability.

Awards

Best Project Achievement

Advisors

Marcin Abram

Skills Required by the team

Python
PyTorch
Tensorflow
Bash
Quantum Mechanics

Social media habits of misinformation spreaders

Sat, 01 Jan 2022 00:00:00 +0000

Description

Social media habits represent one of the most common – and controversial – forms of habitual behavior in contemporary society. This project will investigate whether and how social media habits are linked to the spread of misinformation. Specifically, this research aims at understanding whether there are habits that can be identified within social media data that are unique to misinformation spreaders. For example: do these users re-post, reply to, or post content in ways that seem habitual — as opposed to behaviors based on the rewards received from other users (e.g., likes, re-posting)? To perform this analysis, students will closely examine social media data across multiple platforms. The goal of this project will be to develop a model which can infer habit-based vs. non-habitual processes from existing user data, and identify how these processes play a role in the spread of misinformation. This project will be in collaboration with the USC Department of Psychology.

Awards

Best Interdisciplinary Data Science Team
Best Website

Students

Advisors

Skills Required by the team

Statistics
Data Analysis
Data Visualization

Studying Scientific Innovation with Temporal Knowledge Graph Representation Learning

Sat, 01 Jan 2022 00:00:00 +0000

Description

What’s the next big idea, and who’s going to discover it? Our project is trying to understand how researchers make new discoveries and innovate new ideas. To do that we will apply deep learning techniques for temporal knowledge graph learning (RE-NET, CyGNet, HINGE, StarE) to a huge citation network dataset. We have assembled a KG with 260M research papers, 270M authors, 700K fields. To learn representations, our training tasks include citation prediction, author collaboration prediction, and field of study prediction.

Advisors

Skills Required by the team

Python
Knowledge Graphs
PyTorch
Tensorflow
Wikidata

Turning READMEs into Chatbots

Sat, 01 Jan 2022 00:00:00 +0000

Description

Ever get frustrated reading a README? Wish you could just ask someone for help instead of reading pages of documentation, combing through StackOverflow posts, and consulting lecture slides? This ambitious project will build a team of students to convert documentation in README files, ReadTheDocs and other manual pages, and StackOverflow posts into short conversations. These conversations will be used to train a dialogue model like DialoGPT to help create an assistive chatbot that can answer questions about code.

Awards

Best Data Science Teamwork

Students

Advisors

Jay Pujara

Skills Required by the team

Python
PyTorch
Huggingface

Automatically segmenting and describing the human corpus callosum from brain MRIs

Fri, 01 Jan 2021 00:00:00 +0000

Description

The human corpus callosum is the largest pathway connecting the left and right hemispheres of the brain. The shape of the corpus callosum (CC) changes throughout the course of human development, and it can also be altered with respect to disease onset. We can explore the variation in CC shape along the middle of the brain, but we need to extract it reliably first. The lab currently has two methods for extracting the CC, one using only image processing techniques, and another using deep learning (UNet) but these methods do not always extract the CC accurately. The accuracy results often depend on the MRI scanner that was used, or the abnormalities present in the scan. Can we improve the performance of our deep learning model with additional training data? Can we change some processing steps to improve the model? Once we do have an accurate segmentation, then what shape metrics of the CC as a whole, or in parts, are most telling of the underlying biology, such as age and risk for disease?

Awards

Best Interdisciplinary Data Science Project
Best Project Achievement
Best Project Website
Highlighted Project

Students

Advisors

Neda Jahanshad

Skills Required by the team

Python
Deep Learning
Bash
R

Final presentation resources

Final presentation

Comparing Clinical Trials to Improve Cancer Treatments

Fri, 01 Jan 2021 00:00:00 +0000

Description

The goal of this project is to help clinicians find the best course of treatment for a cancer patient based on the latest and most appropriate clinical trials. Because new drugs are appearing increasingly fast, it is hard to keep track of the outcomes of all clinical trials and determine the best treatment. In collaboration with biomedical researchers, we have been developing algorithms to extract information about clinical trials from government websites, to structure the information, and to find the clinical trials that are most relevant for a given patient. We want to improve the algorithms that structure this information and to develop similarity metrics that will help us retrieve and rank clinical trials.

Awards

Best Interdisciplinary Data Science Project

Students

Advisors

Yolanda Gil

Skills Required by the team

Python

COVID-19 misinformation

Fri, 01 Jan 2021 00:00:00 +0000

Description

This new project attempts to understand the interaction between anti-vaxxers (anti-vaccination groups) and alt-right groups on platforms such as Facebook. The goal of this project is to understand how these two types of fringe groups interact over the years, and how their interactions and discourse evolve during the COVID-19 pandemic. It would be interesting to explore the longitudinal patterns of network/discourse co-evolution and how such patterns may change in times of dramatic events. In terms of data, I have access to Facebook’s historical data archive and I have collected a dataset that contains the Facebook posts of a list of anti-vaxxer (n=158) and alt-right (n=183) groups over 10 years (2010-2021). The dataset can be further expanded with additional help.

Awards

Highlighted Project

Students

Advisors

Aimei Yang

Skills Required by the team

Social Network Analysis
NLP

Final presentation resources

Final presentation

Decoding How Humans Encode Memories

Fri, 01 Jan 2021 00:00:00 +0000

Description

Awards

Best Cyberphysical Data Science Project

Students

Advisors

Luis Garcia

Skills Required by the team

Python
Deep Learning

Detecting Biases in College Football Recruiting

Fri, 01 Jan 2021 00:00:00 +0000

Description

College football recruiting is big business. This project aims to build and analyze a comprehensive college football recruiting dataset, to help determine whether there are biases in whom college football coaches recruit and how they recruit them. This dataset will combine college football recruiting data from the web with census and other socioeconomic data, to search for patterns in where and how college football coaches recruit players.

Awards

Highlighted Project

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Web Scraping
SQL
NoSQL

Final presentation resources

Final presentation

Detecting Biases in College Football Recruiting (Spring - 2021)

Fri, 01 Jan 2021 00:00:00 +0000

Description

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
APIs
Databases
Web Scraping
SQL
NoSQL

Discovering and Measuring Biases in Commonsense Knowledge Bases

Fri, 01 Jan 2021 00:00:00 +0000

Description

Common sense knowledge bases are used widely in research, spanning many areas in artificial intelligence, including natural language understanding, computer vision, and planning. However, these resources may contain human biases, which will ultimately be embedded in the resulting AI solution and potentially have negative societal impacts. The extent to which these biases exist is unclear. In this project, you will define several well-motivated biases (location, gender, ethnicity) and measure the extent to which they are represented in ConceptNet.

Awards

Best Data Science Insight
Highlighted Project

Students

Advisors

Skills Required by the team

Python
Statistics
Machine Learning
Clustering
Language Models
Data Analysis

Drought prediction in Southern California using deep learning

Fri, 01 Jan 2021 00:00:00 +0000

Description

Seasonal drought predictions are important for the management of water resources for agriculture, urban consumption… Seasonal forecasts have traditionally been done using a physics-based model. In this project, we will use a deep learning approach for drought forecasting in CA.

Students

Advisors

Deborah Khider

Skills Required by the team

Python
PyTorch
Statistics
Deep Learning

Impacts of Smart Windows on Human’s Bio-Signals

Fri, 01 Jan 2021 00:00:00 +0000

Description

This research will investigate the relationship between humans’ bio-signals and electrochromic windows, which could inform a mechanism for using bio-signals to control the windows. Using wearable and remote sensors, subjects’ bio-signals such as heart rate, skin temperature, and pupil size, and indoor environmental quality factors such as temperature and humidity, can be monitored and analyzed. Finally, machine learning and data analysis will be used to assess the impacts of the windows on humans’ bio-signals.

Students

Advisors

Zihan Wang

Skills Required by the team

Machine Learning
Data Analysis

Investigate the healthy indoor air quality under Covid-19 in Los Angeles based on machine learning

Fri, 01 Jan 2021 00:00:00 +0000

Description

So far, we do not know how much ventilation is needed to effectively prevent COVID-19 infection, and no sensor can measure the coronavirus directly. However, PM2.5 and CO2 are good indicators for estimating COVID-19 virus concentration. Moreover, bio-signals can be used to assess people’s state of health. In my experiment, I will recruit participants and collect indoor environmental data, outdoor environmental data, and human bio-signals. The data will then be analyzed with machine learning to find the correlation between indoor and outdoor environmental factors and the correlation between indoor environmental factors and human factors. I can also find the appropriate range of each indoor air quality factor when people are in a healthy state. Finally, I can control the window to keep indoor CO2 and PM2.5 within that range and so keep people in a healthy state.

Students

Advisors

Minghuan Gong

Skills Required by the team

Python
Statistics

Investigating disparities in the COVID-19 epidemic in Los Angeles County through fine-grained epidemic modeling

Fri, 01 Jan 2021 00:00:00 +0000

Description

Fine-grained epidemiological modeling of the spread of COVID-19 can inform public health policy that accounts for disparities in the risk of exposure, infection, and death across different locations and different demographic groups. In Los Angeles County, disparities in COVID-19 infection rates by neighborhood have been tremendous. Throughout the current large outbreak wave, infection incidence rates in low-income, predominantly Hispanic neighborhoods of East LA have consistently been 10-15 times higher than in wealthier, predominantly white neighborhoods in West LA. Many well-informed hypotheses exist to explain the cause of these disparities in infection, including employment sectors that require leaving homes to work, household density, and behavioral differences across cultures and age groups. But for Los Angeles County, these hypotheses have not been evaluated quantitatively in the context of an epidemic modeling framework.

To explain the disproportionate impact of the virus on disadvantaged demographic groups in Los Angeles County, we are developing a networked multiple-population epidemic model to investigate how epidemic dynamics and infection outcomes differ across fine-grained neighborhoods. Specifically, we will extend an already-developed stochastic SEIR+ disease model that includes healthcare, death, and vaccination compartments into the networked multiple-population framework, which will model movements, contacts, and infection pathways within and between neighborhoods. A key feature of this modeling framework will be the use of dynamic mobility data, derived from US cell phone data, to inform changes in the daily movements of people within and between neighborhoods. This data will provide the basis of a weighted infection-transmissible contact network between neighborhoods. The SEIR disease model is run on top of this contact network, determining infection dynamics across the neighborhoods. The model will allow obtaining estimates of key epidemic quantities including transmission rates (and the time-varying reproductive number, R(t)) and infection fatality rates for each neighborhood, and identifying the neighborhoods driving epidemic spread (through contacts within and across neighborhoods). Furthermore, hierarchical modeling techniques will be used to obtain estimates of infection and fatality rates for substrata representing combinations of ethnicity/race, age, and sex within each neighborhood.
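As a rough illustration of how a stochastic SEIR model can be run on top of a weighted contact network between neighborhoods, here is a minimal Python sketch. It is not the project's model (which is written in R and includes healthcare, death, and vaccination compartments, omitted here); the chain-binomial update, the function names, the two-neighborhood contact matrix, and all parameter values are assumptions chosen only for illustration.

```python
import math
import random


def binomial(n, p, rng):
    """Draw from Binomial(n, p) by summing n Bernoulli trials (adequate for small n)."""
    return sum(1 for _ in range(n) if rng.random() < p)


def step(state, contact, beta, sigma, gamma, rng):
    """Advance all neighborhoods by one day using a chain-binomial SEIR update.

    state   : list of dicts with integer counts for "S", "E", "I", "R"
    contact : contact[i][j] = weight of contacts residents of i have with j
    """
    # Infectious fraction in each neighborhood at the start of the day.
    prevalence = []
    for pop in state:
        total = pop["S"] + pop["E"] + pop["I"] + pop["R"]
        prevalence.append(pop["I"] / total if total > 0 else 0.0)

    new_state = []
    for i, pop in enumerate(state):
        # Force of infection on neighborhood i: contact-weighted prevalence.
        force = sum(contact[i][j] * prevalence[j] for j in range(len(state)))
        p_infect = 1.0 - math.exp(-beta * force)   # S -> E
        p_progress = 1.0 - math.exp(-sigma)        # E -> I
        p_recover = 1.0 - math.exp(-gamma)         # I -> R

        exposed = binomial(pop["S"], p_infect, rng)
        infectious = binomial(pop["E"], p_progress, rng)
        recovered = binomial(pop["I"], p_recover, rng)

        new_state.append({
            "S": pop["S"] - exposed,
            "E": pop["E"] + exposed - infectious,
            "I": pop["I"] + infectious - recovered,
            "R": pop["R"] + recovered,
        })
    return new_state


def simulate(initial, contact, beta, sigma, gamma, days, seed=0):
    """Return the list of daily states, starting with the initial state."""
    rng = random.Random(seed)
    history = [initial]
    for _ in range(days):
        history.append(step(history[-1], contact, beta, sigma, gamma, rng))
    return history


# Hypothetical example: the outbreak starts in neighborhood 0 only.
initial = [
    {"S": 195, "E": 0, "I": 5, "R": 0},
    {"S": 200, "E": 0, "I": 0, "R": 0},
]
contact = [[0.9, 0.1], [0.1, 0.9]]  # mostly within-neighborhood contacts
history = simulate(initial, contact, beta=0.5, sigma=0.2, gamma=0.1, days=60, seed=1)
print(history[-1])
```

In this sketch the contact matrix is fixed; a model driven by dynamic mobility data, as described above, would instead supply a different matrix for each day.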

CKIDS PROJECT TASKS

While the overarching goal of this project is to develop a multiple-population epidemic model for Los Angeles County (LAC) across a network of connected neighborhoods, it is also necessary to maintain a single-population model for LAC as a whole that estimates the epidemic parameters at this larger spatial level. Such a single-population model has been maintained since May 2020 by the USC Biostatistics COVID modeling team. This model serves two important purposes. First, since May 2020 it has supported the LAC Department of Public Health, which has requested updates on key epidemic predictions on a weekly basis. Second, the parameters estimated from the single-population model will serve as prior distributions in the Bayesian parameter estimation framework used in the networked-neighborhood model.

The first task for the CKIDS student will be to re-implement the parameter estimation framework for the existing LAC-level model, such that parameters are estimated each week and fixed for future estimates forward in time. This can be done either through modification of the existing code and parameter estimation framework, written in R and using Approximate Bayesian Computation (ABC), or through a full reimplementation of the modeling code. The second task will be to maintain the model estimates and the website that displays them through weekly updates, using data that comes directly from the LAC Department of Public Health. A third possible task, depending on the interest of the CKIDS student, will be to apply the modeling to California as a whole and to other counties in California (so far it has only been applied to LAC data).
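For readers unfamiliar with Approximate Bayesian Computation, the simplest variant (rejection ABC) can be sketched in a few lines of Python. This is a toy illustration only, not the team's R framework: the deterministic SIR simulator, the uniform prior, the distance function, the tolerance, and all numbers are assumptions made for the example.

```python
import math
import random


def simulate_cases(beta, days=30, population=10000, infected0=10, gamma=0.1):
    """Toy deterministic SIR model returning daily new infections."""
    s, i = float(population - infected0), float(infected0)
    new_cases = []
    for _ in range(days):
        infections = beta * s * i / population
        recoveries = gamma * i
        s -= infections
        i += infections - recoveries
        new_cases.append(infections)
    return new_cases


def distance(simulated, observed):
    """Root-mean-square difference between two case curves."""
    squared = [(a - b) ** 2 for a, b in zip(simulated, observed)]
    return math.sqrt(sum(squared) / len(squared))


def abc_rejection(observed, prior_low, prior_high, n_draws, tolerance, seed=0):
    """Keep prior draws of beta whose simulated curve lies within tolerance."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        beta = rng.uniform(prior_low, prior_high)   # sample from the prior
        if distance(simulate_cases(beta), observed) <= tolerance:
            accepted.append(beta)                   # approximate posterior sample
    return accepted


# Hypothetical "observed" data generated with a known transmission rate.
observed = simulate_cases(beta=0.3)
posterior = abc_rejection(observed, prior_low=0.05, prior_high=0.6,
                          n_draws=5000, tolerance=20.0, seed=42)
print(len(posterior), sum(posterior) / len(posterior))
```

The accepted draws approximate the posterior distribution of the parameter. Under the weekly scheme described above, a summary of one week's accepted draws could inform the prior for the next week, though how the team actually fixes parameters forward in time is not specified here.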

Students

Advisors

Abigail Horn

Skills Required by the team

Computational Simulation
R

Investigating disparities in the COVID-19 epidemic in Los Angeles County through fine-grained epidemic modeling (Spring - 2021)

Fri, 01 Jan 2021 00:00:00 +0000

Description

CKIDS PROJECT TASKS

Students

Advisors

Abigail Horn

Skills Required by the team

R
Statistics
Computational Simulation

Looking at White Hat (?) Hacker Social Networks on Github

Fri, 01 Jan 2021 00:00:00 +0000

Description

Open-source intelligence (“OSINT”) is a rapidly growing area of cybersecurity. This project seeks to explore OSINT information available on GitHub. Specifically, we will build and analyze a dataset comprised of users on GitHub who show a specific interest in GitHub repos related to hacking artifacts. This dataset and social network analysis could help us determine what attributes lead to “black hat” (malicious) cyber actors.

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Databases
APIs
GraphQL
OSINT
Cybersecurity

Looking at White Hat (?) Hacker Social Networks on Github (Spring - 2021)

Fri, 01 Jan 2021 00:00:00 +0000

Description

Open-source intelligence (“OSINT”) is a rapidly growing area of cybersecurity. This project seeks to explore OSINT information available on GitHub. Specifically, we will build and analyze a dataset comprised of users on GitHub who show a specific interest in GitHub repos related to hacking artifacts. This dataset and social network analysis could help us determine what attributes lead to “black hat” — or malicious — cyber actors.

Awards

Best Project Achievement

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Databases
APIs
GraphQL

Machine Learning to Analyze Rock Microstructures

Fri, 01 Jan 2021 00:00:00 +0000

Description

Students will analyze images from optical microscopes that reveal features of materials and microstructures using machine learning techniques. These images have been collected by geologists, who use them to study the rock samples that they collect in the field and determine their properties and origins. We have a baseline system already implemented, and the goal is to improve it with new machine learning techniques, guided by the insights of our collaborating geologists.

Awards

Best Data Science Teamwork

Students

Advisors

Skills Required by the team

Python
Machine Learning
Computer Vision

Mapping the Ethical Concerns Surrounding AI Research

Fri, 01 Jan 2021 00:00:00 +0000

Description

With the recent enthusiasm about algorithmic fairness and responsible AI, many conferences are encouraging or requiring a broader impact section to assess societal harms and benefits of the AI research being presented. In this project, we will analyze the themes of these sections, with a particular focus on the ethical issues being addressed and acknowledged. We will develop tools and methods to evaluate the harms and benefits of the presented research. The goal is to see how the community is helping AI research become less harmful and more beneficial for society. For more background on work in this area, please review this workshop.

Awards

Best Data Science Collaboration Practices

Students

Advisors

Fred Morstatter

Skills Required by the team

Python
AI ethics

Mapping the Uncanny Valley

Fri, 01 Jan 2021 00:00:00 +0000

Description

While many stories involve the friendly and familiar, scary stories across cultures, from Hamlet to Yotsuya Kaidan to Siren Head, involve beings that are almost, but not quite, human. Can these stories give us insight into the “nearly-human” uncanny valley? Initial results from our group say yes! While some research has explored the uncanny valley for images, that research is limited, and the uncanny valley is previously unexplored in text format. If we can extract human emotions surrounding text descriptions, we can exploit an enormous array of data. Our goal this semester is to analyze our objective definitions of “fear” or “creepiness” in a story and test how the similarity of words to “human” makes them more or less creepy. Moreover, we will explore what features of objects make them more or less scary. These findings relate directly to AI and robotics, where our goal is always to improve pleasant interactions and affability in human-computer and human-robot interactions. The students will build on initial work to apply NLP methods to these texts and improve upon existing initial results.

Students

Advisors

Keith Burghardt

Skills Required by the team

Python
NLTK
NLP
Keras

Microtelcos and the Digital Divide in CA

Fri, 01 Jan 2021 00:00:00 +0000

Description

The COVID-19 pandemic has reinvigorated calls to close the digital divide in the US and elsewhere. Without adequate Internet access, households are at a disadvantage in education, jobs, health, and other key dimensions of wellbeing. While local broadband markets are increasingly concentrated, there is also increased interest in exploring the role that small local operators (“microtelcos”) can play in serving low-income and rural communities. These microtelcos range from small wireless cooperatives to mom-and-pop private ISPs to municipal-backed operators. This project seeks a) to map and identify the characteristics of communities where microtelcos are present in CA, and b) to examine whether microtelco presence affects broadband service quality and adoption by businesses and households in the community. The project will use broadband deployment data collected by the CPUC (California Public Utilities Commission) and socioeconomic data from the Census Bureau.

Awards

Best Data Science Poster Award

Students

Advisors

Hernan Galperin

Skills Required by the team

Statistics
GIS
Econometrics

NVISION: Network Visualization Interventions Supporting Interpretation of Objective News

Fri, 01 Jan 2021 00:00:00 +0000

Description

“The current fractured media landscape allows individuals to choose confirming over credible information, and information spreads more quickly online than interventions like fact-checking. Misinformation can be debiased by identifying gaps in mental representations of the world (mental models) and by providing prompt alerts to be vigilant about assessing information (Lewandowsky et al., 2012).

We aim to develop interventions that make media balance salient to users in order to mitigate the spread of misinformation. Social sampling theory holds that our misperceptions of others are explained by the sample of people we encounter (Galesic, Olsson, and Rieskamp, 2012), and we are more likely to link to similar people online (Kossinets & Watts, 2009). Our interventions address limitations in individual views of the media landscape. We aim to attach visualizations of the sharing network to news articles in real time to make these biases explicit.”

Awards

Highlighted Project

Students

Advisors

Skills Required by the team

Data Analysis
Data Collection
Data Visualization

Final presentation resources

Final presentation

Object detection and classification APIs for urban street image analysis

Fri, 01 Jan 2021 00:00:00 +0000

Description

Developing models to analyze images is a demanding task that requires significant time, resources, and effort. Recently, companies such as Amazon and Google have begun providing services that make the modeling process easier, so that even users with little machine learning expertise can use deep learning technologies. Based on our prior work in object detection and classification for smart city applications, we would like to compare and evaluate the process and performance of commercial services using our training datasets. This project will be good practice for understanding the image machine learning modeling process and the advantages and limitations of commercial services for customized learning.

Awards

Best Data Science Presentation
Highlighted Project

Students

Advisors

Seon Ho Kim

Final presentation resources

Final presentation

Object Detection and Classification for Street Cleanliness

Fri, 01 Jan 2021 00:00:00 +0000

Description

In collaboration with the Sanitation Department of LA, IMSC has been developing a framework to automatically detect the cleanliness of streets as well as any special objects in need of removal. The framework makes use of machine learning technology trained on images/videos collected by the city and/or taken by citizens. The images taken by mobile cameras (e.g., LA City’s garbage collection trucks and/or citizens’ smartphones using our own MediaQ App) are transferred to the MediaQ server, then these images can be automatically classified based on predefined cleanliness indexes and object types (such as bulky item, illegal dumping). In this project, we will focus on the detection and classification of homeless encampments in LA streets. Recorded images/videos with GPS location data will be processed and the classification results will be displayed on a map to understand the distribution of homeless people in LA, which is essential data to study the homeless issue.

Students

Advisors

Seon Ho Kim

Skills Required by the team

Python
Machine Learning
Data Visualization
Computer Vision

Omics and Aging in Killifish

Fri, 01 Jan 2021 00:00:00 +0000

Description

Students will analyze aging in the African turquoise killifish, a species with the shortest lifespan of all vertebrates. By analyzing multi-omic data over the lifetime of many individuals, we can begin to understand the cellular changes that reflect aging.

Students

Advisors

Skills Required by the team

Python
R
Machine Learning

Studying the Effects of Genes and Environment in Aging

Fri, 01 Jan 2021 00:00:00 +0000

Description

Students will analyze genomic and environmental data collected through the lifetime of individuals to investigate which genes and external conditions could be associated with aging. The goal of the project will be to reproduce an existing published paper and improve on its results.

Awards

Best Interdisciplinary Data Science Team

Students

Advisors

Skills Required by the team

Python
R
Machine Learning

Tracking health and nutrition signals from social media data (begun Spring 2020)

Fri, 01 Jan 2021 00:00:00 +0000

Description

Food environments (the physical spaces where people acquire and consume food) can profoundly impact diet and related diseases. Effective, robust measures of food environment nutritional quality are required by researchers and policymakers investigating their effects on individual dietary behavior and designing targeted public health interventions. The most commonly used indicators of food environment nutritional quality are limited to measuring the binary presence or absence of entire categories of food outlet type, such as ‘fast-food’ outlets, which can range from burger joints to salad chains. There would be great value in a summarizing indicator of restaurant nutritional quality that exists along a continuum, and which can be applied at the scale of large food environments, for example across Los Angeles County, to make distinctions between diverse restaurants within and across categories of food outlets.

This project will explore the ability to track real-life health and nutrition signals from social media data, focusing on data from Foursquare and Yelp. We will investigate the ability to access menu information from the APIs of these social media platforms, and develop measures to assess the nutritional content of these menus. Multiple aims will be investigated in this project, including scraping data from social media; NLP of menu text, tag, and comment data; developing predictive models of obesity; and more. “Ground truth” data on dietary patterns of LA residents will be available, enabling validation of dietary measures and predictive models built from menu data.

Students

Iris C. Liu

Advisors

Skills Required by the team

Python
R
NLP
Statistical Modeling

Transfer learning for adversarial machine translation

Fri, 01 Jan 2021 00:00:00 +0000

Description

Neural Machine Translation (NMT) is the process of mapping a segment of words from a source language to a target language using neural networks. However, NMT systems rely on large datasets for the source and target languages, and perform poorly on low-resource languages where there is insufficient parallel data. An effective method for improving NMT on low-resource languages is to employ transfer learning, where a model trained on a high-resource language pair is used to initialize training for the low-resource language pair. In this work, we will study the effect of employing transfer learning methods on adversarial machine translation models based on Long Short-Term Memory (LSTM) recurrent neural networks.

Students

Advisors

Mohammad Reza Rajati

Skills Required by the team

Python
PyTorch
Machine Learning
Deep Learning

Turning Cyber Data into Language

Fri, 01 Jan 2021 00:00:00 +0000

Description

Cyber ontologies such as STIX and ATT&CK can represent complex relationships between cyber threat actors, attacks, and infrastructure. While such representations are easily processed by computers, cyber analysts often prefer dealing with written text. Natural language ontologies like FrameNet represent language in a structured manner as well, but frame specifications are often not specific enough for a given domain (like cybersecurity). In this project, students will learn about cybersecurity threat ontologies and build a GUI web app tool that annotates provided cyber threat documents. No previous knowledge of cybersecurity necessary!

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Flask
Streamlit

A Data Challenge for Parkinson’s Disease

Wed, 01 Jan 2020 00:00:00 +0000

Description

This project will assemble a team to participate in the Biomarker and Endpoint Assessment to Track Parkinson’s Disease (BEAT-PD) DREAM Challenge by the Michael J. Fox Foundation and Sage Bionetworks. The challenge is designed to benchmark new methods to predict Parkinson’s disease progression. Teams participating in the Challenge will have access to raw sensor data that can be used to predict individual medication state and symptom severity. Specifically, teams are asked to develop methods to predict on/off medication status, dyskinesia severity, and/or tremor severity.

Awards

Highlighted Project

Students

Advisors

Neda Jahanshad

Skills Required by the team

Data Science

Final presentation resources

Final presentation

A framework for enabling software comparison and classification

Wed, 01 Jan 2020 00:00:00 +0000

Description

The number of scientific products, including scientific software, has been growing steadily in recent years. This growth makes it difficult for researchers to keep up with all the latest code and publications available. A large body of research has attempted to classify similar papers and literature. However, to date there are no good approaches for finding similar or related code. In this project, the students will analyze different unsupervised methods to find scientific software similarities based on a) an automated analysis of their dependencies, and b) a classification of the main functionality of software components.

Students

Advisors

Daniel Garijo

Skills Required by the team

Python
Machine Learning
Sklearn
Data Manipulation

A Knowledge Graph for Cybersecurity Experiments

Wed, 01 Jan 2020 00:00:00 +0000

Description

The DETER cybersecurity testbed has been running experiments for several years, collecting information about intrusions, vulnerabilities, and mitigation strategies. This project will capture cybersecurity experiments as a knowledge graph that can be browsed, queried, and mined to find patterns and create models of cyberattacks.

Awards

Highlighted Project

Students

Advisors

Skills Required by the team

Python
Knowledge Graphs

Final presentation resources

Final presentation

Annotating Paleoclimate Data

Wed, 01 Jan 2020 00:00:00 +0000

Description

Paleoclimate data is highly diverse, requiring different sets of metadata to describe the various datasets. In this project, you will help build an interface to assist researchers in annotating their paleoclimate datasets according to an evolving reporting standard (PaCTS) and downloading them in the Linked Paleo Data (LiPD) format. The interface should be highly interactive (a wizard) to accommodate the diversity of the data. It should also offer editing capabilities for existing datasets (uploaded as LiPD files), check their compliance with PaCTS, plot location information and the time series, download into the LiPD format, and upload to a semantic wiki and/or an SQL database. In addition, the interface should support the use of a recommender system (to be built) to help researchers annotate their datasets.

Awards

Highlighted Project

Students

Advisors

Skills Required by the team

Python
Javascript
Web Technologies

Final presentation resources

Final presentation

Brain morphometry from contrast-enhanced T1-weighted brain MRIs

Wed, 01 Jan 2020 00:00:00 +0000

Description

Cancer remains the second leading cause of death in the US. However, recent advancements have increased cancer survivorship, with survivors now numbering in the tens of millions. Given this, there is tremendous interest in studying cancer-related cognitive impairment (CRCI). CRCI due to chemotherapy, or “chemobrain”, can afflict up to 78% of survivors. The neural substrates of CRCI are unknown, and understanding them may improve survivors’ quality of life. The CRCI neuroimaging literature is still in its infancy, and these studies have used small sample sizes from traditional research-dedicated non-contrast-enhanced (nCE) scans. Because conducting well-powered neuroimaging studies is very expensive, adapting clinical contrast-enhanced (CE) T1w scans could prove useful for CRCI and many other diseases like dementia. The primary objective of this project is to develop a novel deep learning method to generate nCE images from acquired CE T1w scans, allowing accurate brain morphometry and providing a plentiful source of inexpensive neuroimaging data.

Students

Advisors

Skills Required by the team

AI
Familiarity with MRIs

Capturing the Provenance of Data Analysis Using the PROV Standard

Wed, 01 Jan 2020 00:00:00 +0000

Description

Documenting how a result was obtained from data analysis involves documenting the software, software settings, and datasets used to obtain that result so it can be explained properly. The current ASSET interface enables users to document the provenance of data analysis no matter what infrastructure they used (R scripts, sk-learn, etc). This project will focus on capturing provenance records for data science projects and using the W3C PROV standard to export those records. It will also develop tools to mine provenance data to find common patterns of use.

Advisors

Yolanda Gil

Skills Required by the team

Knowledge Graphs
Javascript
Firebase
UI Development

Characterizing the counter-narratives of climate change

Wed, 01 Jan 2020 00:00:00 +0000

Description

Awards

Best Data Collection
Highlighted Project

Students

Advisors

Skills Required by the team

Classification
Data Collection

Characterizing the counter-narratives of climate change (Spring - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

Awards

Best Data Science Teamwork
Highlighted Project

Students

Advisors

Skills Required by the team

Classification
Data Collection

Final presentation resources

Final presentation

Connections within Contemporary Feminism Movements

Wed, 01 Jan 2020 00:00:00 +0000

Description

This project will look at events data collected at several recent feminist social movements and seek to understand their connection to each other. Specifically, it will explore individuals or organizations that played an instrumental role in movement mobilization, relationship brokerage between movements, and building and sustaining activist communities. Previous research suggests that these distinctive movements are often not isolated incidents, but are mobilized by a core group of “leaders” or by similar ideas and frames. The goal is to understand how seemingly disconnected movements relate to one another, which helps to reveal the lasting impact of mediated movements.

Awards

Best Interdisciplinary Data Science Team
Highlighted Project

Students

Advisors

Aimei Yang

Skills Required by the team

Python
Social Network Analysis
Data Mining

Final presentation resources

Final presentation

Data scraping for salary benchmarking

Wed, 01 Jan 2020 00:00:00 +0000

Description

This project will develop a data scraper to collect salary records from a website that provides compensation data for faculty at public universities. When provided a list of faculty names and institutional affiliations, this program will search for the associated records, extract the relevant results, and copy the data into a spreadsheet. The purpose of this project is to explore the feasibility of automating an otherwise time-consuming data collection task required for benchmarking of faculty salaries in relation to peer institutions. This will ultimately facilitate a number of important tasks, including analysis of potential salary disparities within certain disciplines and faculty tracks.

Awards

Highlighted Project

Students

Advisors

Skills Required by the team

Web Scraping

Final presentation resources

Final presentation

Detecting Biases in College Football Recruiting (Fall - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

College football recruiting is big business. This project aims to determine whether there are biases in whom college football coaches recruit and how they recruit them. By creating a comprehensive data set of college recruits and integrating relevant data with current socioeconomic markers (i.e. census data), we hope to determine whether there are patterns in whom and where football coaches recruit, regardless of talent.

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Data Science
Social Science
Mapmaking
Economics

Digital Democracy: Using Social Media to Improve Political Discourse

Wed, 01 Jan 2020 00:00:00 +0000

Description

Politicians in modern democracies across the world have eagerly adopted social media for engaging their constituents, entering into direct dialogs with citizens. From the perspective of political actors, there is a need to continuously gather, monitor, analyze, and visualize politically relevant information from online social media with the goal of improving communication with citizens and voters. The goal of this proposal is to create a tool that enhances interaction and dialogue between political actors and their followers. This will be achieved by creating compact and comprehensive summaries that aggregate and visualize common narratives, thus reducing the cognitive load required to read all the messages and streamlining the dialogue experience.

Students

Advisors

Andrés Abeliuk

Skills Required by the team

NLP
Programming

Digital Democracy: Using Social Media to Improve Political Discourse (Spring - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

Politicians in modern democracies across the world have eagerly adopted social media for engaging their constituents, entering into direct dialogs with citizens. From the perspective of political actors, there is a need to continuously gather, monitor, analyze, and visualize politically relevant information from online social media with the goal of improving communication with citizens and voters. The goal of this proposal is to create a tool that enhances interaction and dialogue between political actors and their followers. This will be achieved by creating compact and comprehensive summaries that aggregate and visualize common narratives, thus reducing the cognitive load required to read all the messages and streamlining the dialogue experience.

Awards

Best Data Science Collaboration Practices
Highlighted Project

Students

Advisors

Andrés Abeliuk

Skills Required by the team

NLP
Programming

Final presentation resources

Final presentation

Disparities in educational achievement

Wed, 01 Jan 2020 00:00:00 +0000

Description

The project will combine socio-economic data from the US Census with college and K-12 performance data to identify correlates of positive educational outcomes. Of specific interest will be assessing how economic inequalities and racial disparities affect educational achievement in different regions of the US.

Awards

Highlighted Project

Students

Advisors

Kristina Lerman

Skills Required by the team

Python
Statistics

Final presentation resources

Final presentation

Gender inclusion in science

Wed, 01 Jan 2020 00:00:00 +0000

Description

This project will measure the representation of women in various scientific disciplines across time (and different countries) and identify institutions that have succeeded in creating a more welcoming environment for women. While there are already studies that use bibliographic data to map the career trajectories of women, they do not focus on the role that institutions (and countries), and their policies, play in retaining female researchers.

The data comes from Microsoft Academic Graph, which contains millions of papers from institutions around the world across many decades. We will use the Ethenea API to extract the gender (and ethnicity) of authors.

Awards

Highlighted Project

Students

Advisors

Skills Required by the team

Python
Statistics

Final presentation resources

Final presentation

Generation of a Sports-based Introductory Data Science Curriculum to Increase Participation of Underrepresented Groups in STEM

Wed, 01 Jan 2020 00:00:00 +0000

Description

As the requirements for success in the workforce become increasingly technical, there is a commensurate need for curricula that can engage and capture the imagination of students, especially those from traditionally underrepresented groups in STEM. One way to reach these groups is via curricula that appeal to contexts with which they are familiar and engaged, such as sports. To that end, this project will explore the development of a sports-based introductory data science curriculum with the goal of engaging students who might otherwise not be interested in pursuing data science as a career. Students will work on generating illustrative code examples and problem sets in Python using sports examples.

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Pedagogy
Social Science

Integration of Frame Semantics to Cyber Ontologies

Wed, 01 Jan 2020 00:00:00 +0000

Description

Cyber ontologies such as STIX and ATT&CK can represent complex relationships between cyber threat actors, attacks, and infrastructure. While such representations are conducive to interoperability between systems, they are often unwieldy for human cyber analysts to deal with directly. Conversely, natural language generation (NLG) frameworks like FrameNet represent language in a structured manner, but frame specifications are often not specific enough for specialized domains (such as cybersecurity). Leveraging and combining the semantic structure of both forms can yield a tool that translates cyber threat data in standard interoperable formats (such as STIX) into human-readable reports via existing NLG frameworks. Working on a project such as this provides an opportunity for significant impact, as the fusion of these two structures could greatly increase both the adoption and the utility of cyber threat ontologies.

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
NLP
Data Science
OSINT
Cybersecurity

Investigations of a Data Science Online Community

Wed, 01 Jan 2020 00:00:00 +0000

Description

The Kaggle.com competition ecosystem is a rich and active community with a designed Progression System that uses performance medals to rank and differentiate users into tiers. However, winning performance medals in Kaggle is more complex than it appears. Users are bound by the available competitions, characteristics of the competition’s problem statement, the quality of their software submissions, and the quality of other competitors (including collaborators). With these factors, one user’s earned “Gold” medal from one competition may have required more effort and a higher quality solution than another user’s earned “Gold” medal in a different competition. This project has great potential to learn about open competitions in data science. Some example questions are: What features help predict whether a user will win a medal in a competition? How can users be clustered and differentiated from one another using their competition patterns and medal-winning solutions? How quickly (in days) will a user win their next competition medal? What is the probability that a user will assemble a team for a competition? What are features that predict high-performing teams? What features help generate teammate recommendations?

Awards

Best Interdisciplinary Data Science Team

Students

Advisors

Marlon Twyman

Skills Required by the team

Python
R
Statistics
Machine Learning
Matlab

Machine Learning to Analyze Rock Microstructures (Fall - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

Awards

Best Project Presentation
Highlighted Project

Students

Advisors

Yolanda Gil

Skills Required by the team

Python
Machine Learning
Image Analysis

Mapping the impacts of climate change across LA County

Wed, 01 Jan 2020 00:00:00 +0000

Description

We are increasingly hyper-aware of the effects of climate change. But we are not always aware of how hyper-local those effects can be. Across the 2,500 square miles of Los Angeles County, the impact of climate change is playing out in different ways. For example, some areas are experiencing more frequent spikes in extreme temperatures, while others are not. We propose a project that would fuse together several different datasets in order to map how temperature changes and other variables are hitting some corners of Los Angeles harder than others. Often, these areas are inhabited by people facing numerous other inequities, such as poor healthcare access. By examining several years’ worth of hourly average temperatures from thousands of spots across Los Angeles County, and combining that with other datasets, such as tree cover, cases of asthma, and so forth, it is possible to create an interactive map that illustrates where the impacts of climate change are most acute. This project would be published by Annenberg’s Crosstown publishing outlet and would be distributed widely. The project would have immediate practical applications and could inform policy decisions on issues such as where to place parks and green spaces.

Awards

Best Interdisciplinary Data Science Team
Highlighted Project

Students

Advisors

Gabriel Kahn

Skills Required by the team

Mapping the Uncanny Valley (Fall - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

While many stories involve the friendly and familiar, scary stories across cultures, from Hamlet to Yotsuya Kaidan to Siren Head, involve beings that are almost, but not quite, human. Can these stories give us insight into the “nearly-human” uncanny valley? And are the most popular stories at the nadir of this valley? In this project we aim to explore the uncanny valley through analysis of several thousand stories posted on Reddit over a decade. These data contain stories that cross a range of topics, and include user comments and story scores. We will explore the prevalence of the monsters over time, and explore whether there are optimal characteristics of these monsters that make them so scary. While some research has explored the uncanny valley for images, the research is limited and virtually unexplored in text format. The students will build on initial work by the advisor to apply NLP methods to these texts and improve upon existing initial results.

Awards

Best Cyberphysical Data Science Team
Highlighted Project

Students

Advisors

Keith Burghardt

Skills Required by the team

Python
NLTK
NLP
Keras
Gensim

Microtelcos and the Digital Divide in CA (Spring - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

The broadband access market is increasingly dominated by a few large ISPs. However, small and medium-size operators (“microtelcos”) are critical to connectivity in low-income and rural communities in CA, serving markets of little interest to large operators. The primary goal of this project is to combine broadband infrastructure deployment data from the CPUC (California Public Utilities Commission) and socioeconomic data from the Census Bureau to understand the characteristics of the communities served by microtelcos, and to analyze whether the presence of a microtelco operator contributes to higher levels of connectivity in the community. The main technical challenge is to combine spatial data found in CPUC files with census block level data provided by the Census Bureau. This is part of an ongoing research program called Connected Communities and Inclusive Growth (CCIG).

Advisors

Hernan Galperin

Skills Required by the team

GIS
R
Stata

Modeling Uncertainty in Drought Products

Wed, 01 Jan 2020 00:00:00 +0000

Description

Droughts can have a substantial impact on agricultural systems and human livelihood. A Python package to calculate various drought indices is being developed. In this project, you will expand on this package and develop methods to test the sensitivity of the models to various input datasets and parameters. In addition, you will develop post-processing code to determine the return period of the drought (is it a 1 in 20 yr event or a 1 in 5 yr event?).
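
To make the return-period question concrete, here is a minimal sketch that estimates empirical return periods from a series of annual drought severities. It assumes the Weibull plotting position (exceedance probability = rank / (n + 1)), which is one common convention and not necessarily what the package uses; the function name and severity values are hypothetical.

```python
def empirical_return_periods(annual_severity):
    """Return an empirical return period (in years) for each year's severity.

    Assumption: Weibull plotting position. The most severe year gets rank 1,
    its exceedance probability is rank / (n + 1), and the return period is
    the reciprocal, (n + 1) / rank. Ties are ranked in input order.
    """
    n = len(annual_severity)
    # Indices of years ordered from most to least severe
    order = sorted(range(n), key=lambda i: annual_severity[i], reverse=True)
    periods = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        periods[idx] = (n + 1) / rank
    return periods

# Hypothetical severity index for 9 years (larger = drier)
severity = [0.2, 1.5, 0.7, 2.4, 0.1, 0.9, 1.1, 0.3, 0.5]
periods = empirical_return_periods(severity)
```

With nine years of data, the driest year is estimated as a 1 in 10 yr event and the second driest as a 1 in 5 yr event. Short records cap the longest return period that can be estimated this way, which is one reason fitted distributions are often preferred.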

Advisors

Deborah Khider

Skills Required by the team

Python
Statistics
Probability Theory

Modelling Spatiotemporal Relationships between Waste Water Injection and Induced Seismicity

Wed, 01 Jan 2020 00:00:00 +0000

Description

Induced seismicity refers to earthquakes caused by human activity, such as disposing of wastewater by injecting it into the subsurface. This project will focus on spatiotemporal statistics to model space-time relationships between injected wastewater and induced earthquakes. The model will incorporate space-time data pertaining to seismic activity and associated human systems to create forecasts of induced earthquakes.

Advisors

Orhun Aydin

Skills Required by the team

Python
R
Statistics
GIS
Data Analysis
Programming

Predicting Effective Tax Rate of Publicly-Traded Firms

Wed, 01 Jan 2020 00:00:00 +0000

Description

The purpose of this project is to analyze business firms’ text disclosures to determine whether those disclosures are related to firms’ tax rates. In doing so, we first capture information about the text and then relate that text information to quantitative information, using statistical modeling. So far, we have generated and used some bags of words to capture information that we expect will provide insight into the tax rates that those firms incur. Our knowledge acquisition approach for gathering those bags of words was to interview an expert. We then counted the number of occurrences of those words in our text, and used statistical models to relate the number of those occurrences to different measures of tax rates. We find that those bags of words are statistically significantly related to measures of the tax rates that firms pay. In addition, we find that “tax specific bags of words” work “better” than “generic accounting bags of words.”
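
The counting step described above can be sketched as follows. This is only an illustration under stated assumptions: the bag of words, the sample disclosure text, and the function name are all made up, and the tokenization (lowercased alphabetic words) is a simplification of whatever the project actually used.

```python
import re
from collections import Counter

def count_bag_terms(text, bag):
    """Count occurrences of each term in `bag` within `text`.

    Matching is case-insensitive and on whole alphabetic words only.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return {term: counts[term] for term in bag}

# Hypothetical tax-specific bag of words and a made-up disclosure snippet
tax_bag = ["tax", "deferred", "jurisdiction"]
disclosure = "Deferred tax assets rose. The tax rate varies by jurisdiction."

term_counts = count_bag_terms(disclosure, tax_bag)
total_hits = sum(term_counts.values())
```

Per-firm counts like `total_hits` could then serve as explanatory variables in a statistical model of a tax-rate measure.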

Awards

Highlighted Project

Students

Advisors

Daniel O'Leary

Skills Required by the team

Statistical Analysis
Text Analysis

Final presentation resources

Final presentation

Social Graph Analysis and Attribution of Software Exploit Contributors Using GitHub

Wed, 01 Jan 2020 00:00:00 +0000

Description

Attribution of cyber threat actors is an increasingly important and difficult problem. One potential mitigation is the early detection of potential threat actors via analysis of open-source intelligence (OSINT). This project will analyze the social graph of users who contribute to, follow, star, and otherwise interact with proof-of-concept CVE implementations and other relevant potentially malicious (e.g. software vulnerability) repositories on GitHub.

Students

Advisors

Jeremy Abramson

Skills Required by the team

Python
Data Science
Cybersecurity
Graph Analysis

Team Dynamics in Online Multiplayer Games

Wed, 01 Jan 2020 00:00:00 +0000

Description

Competitive online multiplayer team games such as CounterStrike, PUBG, or League of Legends are extremely popular. Multiple teams of professional players compete in hundreds of tournaments yearly. Player transfer between teams is common. The goal of the project is to measure the effects of player transfers and to answer some of the questions such as: How does a new player affect the team’s performance?; How does the change of a team affect a player’s performance? The world of online games can be used as a fruitful area for tackling more fundamental questions on human society and collaboration dynamics in different settings.

Students

Advisors

Goran Muric

Skills Required by the team

Python

Team Dynamics in Online Multiplayer Games (Spring - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

Competitive online multiplayer team games such as CounterStrike, PUBG, or League of Legends are extremely popular. Multiple teams of professional players compete in hundreds of tournaments yearly. Player transfer between teams is common. The goal of the project is to measure the effects of player transfers and to answer some of the questions such as: How does a new player affect the team’s performance?; How does the change of a team affect a player’s performance? The world of online games can be used as a fruitful area for tackling more fundamental questions on human society and collaboration dynamics in different settings.

Awards

Best Data Collection
Highlighted Project

Students

Advisors

Goran Muric

Skills Required by the team

Python

Final presentation resources

Final presentation

Text Analysis, Social Networks and Crowdsourcing

Wed, 01 Jan 2020 00:00:00 +0000

Description

The purpose of this project is to analyze a crowdsourcing setting for both the sentiment and other categories of meaning in the text, and the roles and impact of a network of contributors on the votes and potentially on the content.

Awards

Highlighted Project

Students

Advisors

Skills Required by the team

Social Network Analysis
Text Analysis

Final presentation resources

Final presentation

The aging individual brain

Wed, 01 Jan 2020 00:00:00 +0000

Description

Deep learning models are now able to predict how an individual’s face will age in a very realistic manner, ensuring that key individual features, such as eye color, are maintained while more age-related features, such as skin texture, are altered to be representative of a desired age group. However, little work has been done to predict how an individual’s brain will age. Such models may be able to help predict early signatures of neurodegenerative disorders. The goal of this project is to test several models that realistically predict how a given individual’s brain will look at any age in mid to late adulthood. Deep learning based methods, image processing based methods, or both are encouraged. Students will work with a dataset of over 20,000 brain scans of individuals aged 45-80, approximately 1000 of whom have a second scan after two years. This is currently an active project in the lab, and students will join researchers already working on this problem to further explore and improve methodology.

Students

Advisors

Neda Jahanshad

Skills Required by the team

Python
Statistics
Deep Learning
Git

Towards Automated Understanding of Scientific Software

Wed, 01 Jan 2020 00:00:00 +0000

Description

Data science projects require knowledge of software that changes rapidly. As a result, scientists spend hours reading lengthy documentation and manuals instead of advancing their scientific fields. In this project, we aim to automatically extract relevant aspects of scientific software (e.g., what it does, how to install it, how to operate it, or how to cite it) from documentation and code using machine learning techniques. The students will build on an existing baseline of classifiers and try to improve the existing results.

Awards

Best Data Science Open and Sharing Practices
Best Project Presentation
Highlighted Project

Students

Advisors

Daniel Garijo

Skills Required by the team

Python
Knowledge Graphs
Machine Learning
Sklearn

Final presentation resources

Final presentation

Tracking health and nutrition signals from social media data

Wed, 01 Jan 2020 00:00:00 +0000

Description

This project will explore the ability to track real-life health and nutrition signals from social media data, focusing on data from Instagram and Foursquare. We will investigate the quality of Instagram posts as a source of data for measurements of dietary patterns and nutrition quality, focusing on spatial, textual, and (new this semester) image content of posts linked to food outlets in Los Angeles, as well as nutritional content analysis of menus available online. Multiple aims will be investigated in this project, including: scraping data from social media; NLP of tags, comments, and menu data; image analysis; predictive models and social network analysis; and more. Also new this semester: “ground truth” data on dietary patterns of LA residents will be available, enabling validation of dietary measures and predictive models built from Instagram posts.

The project will build on the DataFest 2019 project and will expand its scope to access up-to-date data from Instagram, in particular data with images, the underlying social network, and more timely posts (which requires data scraping).

Students

Advisors

Skills Required by the team

Python
Machine Learning
NLP
Statistical Analysis
Social Network Analysis
Sentiment Analysis
R
Image Analysis

Tracking health and nutrition signals from social media data (Spring - 2020)

Wed, 01 Jan 2020 00:00:00 +0000

Description

Awards

Best Project Achievement
Highlighted Project

Students

Advisors

Skills Required by the team

Python
R
Machine Learning
NLP
Statistical Analysis
Social Network Analysis
Sentiment Analysis
Image Analysis

Final presentation resources

Final presentation

Turning Library Collections into Data Science Challenges and Resources

Wed, 01 Jan 2020 00:00:00 +0000

Description

Libraries, museums, and archives hold unique collections that may be very useful for data science. These collections include photographs, videos, letters, and other artifacts that could give unique insights when analyzed. In this project, students will work with the USC Libraries to identify existing collections that would be potentially interesting as targets for data science, describe those collections in collaboration with the USC Libraries so they can be promoted as data science resources, and create APIs and other access mechanisms for data science researchers on campus and beyond.

Awards

Best Interdisciplinary Data Science Team

Students

Advisors

Skills Required by the team

Data Science

User-centered building design preference assessment to develop data-driven interactive architectural design guideline models

Wed, 01 Jan 2020 00:00:00 +0000

Description

In many architectural design scenarios, architects and clients inevitably spend a lot of time reaching design agreements due to a lack of understanding of the client’s design needs and preferences. An architectural design process could be significantly expedited and simplified if modeling software could accurately extract the user’s preferred design features and integrate them into the design process. In this project, we addressed the challenges of demonstrating a stochastic model that considers the user’s physiological responses and subjective design perceptions by using data analytic methods. This technical principle exploits personal design preferences and adapts them to the design process so that an architecture project can be completed effectively.

Students

Advisors

Joon-Ho Choi

Skills Required by the team

Python
Statistics
Machine Learning
Sklearn
WEKA
R

Using Biomedical Researcher Judgments to Predict Clinical Trial Outcomes

Wed, 01 Jan 2020 00:00:00 +0000

Description

Human patients should only be assigned to experimental medical treatments when investigators are truly uncertain about the novel treatment’s clinical utility. As such, the outcomes of clinical trials are difficult to predict by design. The goal of this project is to work toward building a predictive model of clinical trials. The first step is to categorize treatments based on their history and diseases based on their treatability, using FDA records among other data sources. In collaboration with the Biomedical Ethics Unit at McGill University, we have collected many probability predictions about scientific and operational outcomes of newly registered clinical trials. When pre-processing is completed, we will begin building a model to predict the judgments of medical experts based on several trial and researcher characteristics. This model can be used to assess whether medical researchers are biased in their judgments about their own trials. Finally, we aim to assemble these components into a model that predicts the outcomes of clinical trials by accounting for the history of the treatment, the treatability of the disease, and the judgments of medical researchers, adjusted for revealed biases.

Advisors

Daniel Benjamin

Skills Required by the team

Classification
Predictive modeling
Data Collection

Worldwide Survey Estimates of Maternal Bereavement

Wed, 01 Jan 2020 00:00:00 +0000

Description

Infant and child mortality rates have been steadily declining worldwide over the last fifty years. Without reservation, these trends represent good news for children and for their parents, but the link between child mortality and parents’ experiences remains loosely defined. Documenting global inequality in maternal bereavement offers a window into how health disparities directly affect the lives and well-being of mothers. In this project, we will offer the first global analysis of the prevalence of bereaved mothers by leveraging data collected between 2010 and 2018 from 168 countries. I request student support to expand current survey coverage. Students will work to identify and access public-use, nationally representative reproductive history survey data for select European, Asian, and Latin American countries to supplement current data coverage, and to offer direct estimates to compare with indirectly derived ones based on current fertility and mortality levels. Students will work to adapt code used for other surveys to generate descriptive statistics of the prevalence of ever-bereaved mothers in each country. Students will also work to improve and supplement the illustration of key study findings.

Awards

Best Data Science Teamwork
Highlighted Project

Students

Advisors

Emily Smith-Greenaway

Skills Required by the team

Stata
Statistics

A visual analytic toolkit for cultural biases

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project will result in a visual analytics toolkit that will enable social scientists to understand the cultural groups and biases at play in a social dataset. News, books, and social media all contain biases that stem from the cultural background of the author(s). We have developed algorithms to identify the cultural groups at play in an arbitrary dataset, as well as natural language processing approaches that can discover the biases of each group. This project would help put these tools into the hands of social scientists by displaying the output of these algorithms in novel visualizations.

Students

Advisors

Fred Morstatter

Skills Required by the team

Python
Javascript

A workshop Tutorial on R and R Studio for Environmental Sciences Curriculum

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project created an electronic notebook using R and R Studio for an environmental sciences curriculum. The notebook will be used to teach undergraduate students advanced statistical analysis of population health in Fall 2019.

Students

Huy Nghiem

Advisors

Analyzing Paleoclimate Data

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project focused on a causality analysis of paleoclimate time series using Pyleoclim, which is a Python package geared towards the analysis and visualization of paleoclimate data. Future work includes exploring and testing additional algorithms for time series analysis.

Students

Han Wu

Advisors

Deborah Khider

Automated generation of paper authors

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project will result in an open-source software tool that will have general applicability for scientific publications. Papers with hundreds of authors are not uncommon in science, and it often takes many weeks to compile an author list in the desired order with proper affiliations and acknowledgments. We have implemented an algorithm that generates the author information for a paper based on the type of contribution of each author within the ENIGMA neuroscience consortium. This project would extend this software to read in compiled spreadsheets or forms and extract information about universities and other institutions from structured web sources, to interoperate with widely-used frameworks such as Wikidata.

Awards

Best Data Science Open and Sharing Practices

Students

Advisors

Neda Jahanshad

Skills Required by the team

Python
RDF
UI Development

Automated time series analysis

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project will result in a Python package for automated time series analysis. Based on the characteristics of the data, you will design functions that (1) perform essential tasks in data cleaning and select appropriate methodologies, (2) implement various algorithms currently not supported through pandas and scikit-learn, and (3) create appropriate visualizations.

Students

Advisors

Deborah Khider

Skills Required by the team

Python
PyPI

Behavioral Context Recognition

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project studied behavioral patterns by analyzing data from personal sensors collected from 60 subjects. This data can be used to predict activities and infer people’s lifestyles and habits.

Students

Ian Myoungsu Choi

Building an open catalog of integrated datasets for Los Angeles

Tue, 01 Jan 2019 00:00:00 +0000

Description

While many open data efforts have successfully exposed public data on the web, it is often complicated to determine how these records can be integrated with each other (due to heterogeneous IDs, uncertainty about how to place them on a map, etc.). In this project, the student will leverage novel techniques for integrating, registering, and connecting datasets with overlapping elements. The student will visualize the results using interactive maps.

Students

Advisors

Daniel Garijo

Skills Required by the team

Python

Building Sports Data Knowledge Graphs

Tue, 01 Jan 2019 00:00:00 +0000

Description

Public sports data is often spread across many differing sources, creating issues of entity resolution and record linkage. Knowledge graphs are a popular conceptual technology for storing, fusing, and querying information from such disparate sources. This project will focus on building a sports data knowledge graph from various open data and asset (e.g. video) sources and APIs, based on a Wikidata infrastructure.

Advisors

Jeremy Abramson

Skills Required by the team

Python
Unix System
SPARQL

Capturing provenance of data analyses

Tue, 01 Jan 2019 00:00:00 +0000

Description

Documenting how a result was obtained from data analysis involves documenting the software, software settings, and datasets used to obtain that result so it can be explained properly. This project will design and develop a user interface for specifying provenance records using W3C standards. The interface will enable users to document the provenance of data analysis no matter what infrastructure they used (R scripts, scikit-learn, etc.).

Students

Rahul Jeswani

Advisors

Yolanda Gil

Skills Required by the team

Javascript
Firebase

Creating and visualizing a linked knowledge base of crime data

Tue, 01 Jan 2019 00:00:00 +0000

Description

A lot of data is available on the web in tabular form, but it is difficult to manipulate and visualize without significant effort. In this project, we aim to test a novel framework created at ISI to build and visualize knowledge bases. The objective is to create a knowledge base that extends other resources on the Web, such as Wikidata or Wikipedia, and to visualize the results using interactive maps and plots.

Students

Advisors

Daniel Garijo

Skills Required by the team

Python
Knowledge Representation
RDF

Crosstown

Tue, 01 Jan 2019 00:00:00 +0000

Description

Giorgos and John are developing a machine learning system to automatically detect and prioritize abnormalities in crime data and alert journalists to them. Through this project, they want to help journalists identify interesting stories in data. The data is rich in features, and by using general feature engineering.

Students

Advisors

Data infrastructure for USC

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project develops software and data resources for USC students. Software resources include tools to process and analyze specific types of data (e.g. social networks, images, text), data preparation tools, and machine learning libraries. Data resources include thematic data repositories, such as urban LA data, environmental LA data, and entertainment LA data.

Students

Advisors

Yolanda Gil

Skills Required by the team

Open Source Software Development
Data Services

Data Mining Over Past Climates

Tue, 01 Jan 2019 00:00:00 +0000

Description

Estimates of climate variations over the past 1,000 years play an increasing role in climate assessments. A key quantity to derive from them is the transient climate response (TCR), which quantifies the warming expected from slowly rising CO2 concentrations. TCR helps constrain the climate models used to predict the future evolution of Earth’s climate. In this project, you will help design an efficient workflow to estimate TCR from existing paleoclimate datasets and emerging statistical methods.

Students

Advisors

Skills Required by the team

Python

Detecting deep fakes

Tue, 01 Jan 2019 00:00:00 +0000

Description

The spread of misinformation has become a significant problem, raising the importance of relevant detection methods. While there are different manifestations of misinformation, this project focuses on detecting face manipulations in videos. We exploit the temporal dynamics of videos with recurrent networks.

Students

Shenoy Pratik Gurudatt

Advisors

Wael Abd-Almageed

Skills Required by the team

Programming

Enhancing Thermal Control in Buildings

Tue, 01 Jan 2019 00:00:00 +0000

Description

Mengqi presented her study about thermal control using Excel, Minitab, and Python for data analysis. She collected data in a lab setting and tried two models, a group model and an individual model. She found that a group model, which consisted of 20 subjects, did not work well, while individual models gave better results.

Awards

Highlighted Project

Students

Mengqi Jia

Advisors

Joon-Ho Choi

Final presentation resources

Final presentation

Foster Care Children: Administrative Data and Computational Methods

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project uses population-based administrative data, including birth, medical, and education records, to study child welfare services. The research topics include: a family-level analysis of first births and sibling re-reports in the foster care system; identifying mothers who subsequently gave birth after the termination of parental rights; modeling the child protective services system using Markov models; and predicting risks for aging youth.

Students

Eunhye Ahn

Advisors

Emily Putnam-Hornstein

Final presentation resources

Final presentation

Game Data and Social Capital

Tue, 01 Jan 2019 00:00:00 +0000

Description

Natalie and Calvin look at social capital from two angles: social capital as a predictor and social capital as an outcome. First, they want to observe whether people who display social capital exhibit certain characteristics or behavior. Second, they want to study if people who are interested in a specific topic will exhibit certain types of social capital.

Students

Advisors

Dmitri Williams

Learning to Connect: Modeling Social Network Dynamics and Evolution by Imitation Learning

Tue, 01 Jan 2019 00:00:00 +0000

Description

In this research, we aim to model how human players make connection decisions in an online game where players are free to add or delete a friend, as well as join a clan.

Students

Yiley Zeng

Advisors

Lighting Control in Buildings for Visual Comfort

Tue, 01 Jan 2019 00:00:00 +0000

Description

Lingkai collected data in a controlled setting and studied lighting control in buildings for visual comfort. For data preprocessing, he used Excel and MathWorks; for data analysis, he used Python and scikit-learn.

Awards

Highlighted Project

Students

Lingkai Cen

Advisors

Joon-Ho Choi

Final presentation resources

Final presentation

Measuring Pollution Benefits from Congestion Pricing Initiatives

Tue, 01 Jan 2019 00:00:00 +0000

Description

Using real-time big data from Los Angeles freeways on traffic and Aclima data on pollution measurements, this project will estimate the links between speed and pollution. Estimating this relationship properly is crucial for knowing the benefits that congestion pricing may generate in terms of pollution reduction. Computer Science methods will be used to guide the choice of policy intervention and guide prediction.

Advisors

Antonio Bento

Skills Required by the team

Machine Learning
R

Measuring population-level nutrition and dietary habits from Instagram

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project will investigate the quality of Instagram textual posts as a source of data for measurements of dietary patterns and nutrition quality, focusing on spatial and textual features of posts linked to food outlets. Using an Instagram dataset of all geo-located posts at food outlets in Los Angeles for 3 months in 2014, this project will investigate whether Instagram posts, despite implicit biases (and to the extent possible, accounting for these biases), can provide a representative health signal, informative of the quality of population nutrition and dietary patterns at a highly-resolved (e.g. census tract level) spatial scale.

Awards

Best Data Science Collaboration Practices

Students

Advisors

Skills Required by the team

Social Network Analysis
Statistical Modeling

Mining Side Effects in Cancer Treatment

Tue, 01 Jan 2019 00:00:00 +0000

Description

SideEffects is a resource that lets cancer patients access treatment and side-effect information tailored to their treatment and disease history. The app sources content from clinical data, the National Cancer Institute, social media, and user input.

Awards

Best Project Presentation

Students

Advisors

Modeling the career trajectory of music artists

Tue, 01 Jan 2019 00:00:00 +0000

Description

Many musicians, from up-and-comers to established artists, rely heavily on performing live to promote and disseminate their music. To advertise live shows, artists often use concert discovery platforms that make it easier for their fans to track tour dates. In this project, we ask whether digital traces of musical performances generated on those platforms can be used to understand the career trajectories of artists. We have constructed a dataset by cross-referencing data from such platforms (Songkick and Discogs). In this project, you will identify and explore patterns that can be used to identify successful musicians.

Advisors

Fred Morstatter

Skills Required by the team

Python

Modeling Uncertainty in Drought Data

Tue, 01 Jan 2019 00:00:00 +0000

Description

Awards

Best Project Achievement

Students

Advisors

Deborah Khider

Skills Required by the team

Python

Multiplayer Game’s Solo Players

Tue, 01 Jan 2019 00:00:00 +0000

Description

Even when engaging in a multiplayer online game, some players play by themselves. Do Own is interested in investigating personality, motivation, and behavioral patterns of social network isolates.

Students

Do Own (Donna) Kim

Advisors

Dmitri Williams

Final presentation resources

Final presentation

Overview of Multiplayer Games Dataset

Tue, 01 Jan 2019 00:00:00 +0000

Description

Online multiplayer games provide a wealth of data that can be used to study human behaviors. Professor Williams describes the kinds of questions that can be investigated with rich datasets of online game player actions, interactions, and targeted survey questions. Students in his group focused on a wide range of projects that use this data to study a range of human behaviors.

Advisors

Dmitri Williams

Tell us where it hurts

Tue, 01 Jan 2019 00:00:00 +0000

Description

LA Care has a mission to “provide access to quality health care for Los Angeles County’s vulnerable and low-income communities and residents and to support the safety net required to achieve that purpose.” In the many coordinated activities LA Care conducts to provide a comprehensive health insurance safety net, it collects massive amounts of healthcare data. With advances in analytics enabled by AI approaches (e.g. predictive modeling, machine learning, model refinement and validation), the organization is looking for ways to mine and analyze its data to drive optimization and improvement of product development, marketing techniques and business strategies. Students will work with stakeholders throughout the organization to identify opportunities for leveraging company data to drive business solutions. The ability to identify and address “pain points” will depend on the skills that students bring to the project.

Students

Advisors

Skills Required by the team

R
Python
Machine Learning
Javascript
Data Mining

Tracking Coastal Change at Catalina Island

Tue, 01 Jan 2019 00:00:00 +0000

Description

Since 1992, the USC Wrigley Institute for Environmental Studies ‘Catalina Conservation Divers’ have been collecting underwater biological and environmental data from coastal ocean sites around Catalina Island, California. In cooperation with the USC Wrigley Institute, the CCD team (made up of community scientists and volunteer SCUBA divers) conducts quarterly surveys of marine species and benthic water temperatures at various depths and locations. The Wrigley Institute has been collecting and archiving this data for years, and the data has not been holistically studied to date. We need assistance in analyzing data for trends across location, ocean depth, and time.

Awards

Best Data Science Poster

Students

Advisors

Skills Required by the team

Statistics
Programming

Understanding human environmental perceptions using multi-biometric signals in the built environment

Tue, 01 Jan 2019 00:00:00 +0000

Description

A building occupant is always surrounded by several indoor environmental quality (IEQ) elements, such as thermal, visual, air, and acoustic conditions. As a result, the user’s environmental comfort and work productivity are significantly affected by the IEQ conditions, especially in residential, office, and educational facilities. This research investigates the relationships between the user’s IEQ comfort perceptions, IEQ conditions, and his/her biometric signals, in order to identify individual IEQ perception as a function of single or combined bio-signals and their changes. The study outcome has the potential to be integrated with existing building mechanical/electrical control systems to enhance the user’s IEQ conditions while contributing to his/her comfort and well-being in the built environment.

Students

Advisors

Skills Required by the team

Python
Data Mining
SPSS
R

Understanding Internet Communities through Videogames

Tue, 01 Jan 2019 00:00:00 +0000

Description

Online multiplayer games provide a wealth of data that can be used to study human behaviors. Many questions can be investigated with rich datasets of online game player actions, interactions, and targeted survey questions. We have a wide range of ongoing student projects that use this data to study a range of human behaviors.

Students

Advisors

Skills Required by the team

Who is the Best Game Mentor?

Tue, 01 Jan 2019 00:00:00 +0000

Description

This project explores the influence of the personality of game players who become mentors on mentoring outcomes using machine learning. The project will use survey data to analyze mentors’ extraversion and agreeableness as well as mentees’ game performance and churn rates.

Students

Joo-Wha Hong

Advisors

Dmitri Williams

Who is the One staying?

Tue, 01 Jan 2019 00:00:00 +0000

Description

Motivation plays a strong role as a moderator in the relationship between gamers’ in-game performance and their enjoyment and churn, respectively. The research question for this project is: what is the relationship between players’ competitiveness and their level of enjoyment/churn?

Students

Advisors

Dmitri Williams