Data Assessment

Introduction

This Data Assessment Document gives an overview of the data used in our project. It serves as an interim report in the project's initial stages and is intended to evaluate and refine our understanding of that data.

Data Overview and Examples

This section gives an overview of an Automated Question Type Coding dataset.

Data Accessibility

Our dataset supports the first large language model trained on authentic forensic interviews and court transcripts. It comprises 349,033 utterances drawn from 1,851 transcripts: forensic interviews conducted in California from 2004 to 2022 (1,435 transcripts, ages 3-17, M = 7.81) and court trials in Los Angeles County Court from 1997 to 2001 (416 transcripts, ages 4-17, M = 11.85). The transcripts were manually coded for question type by coders trained to achieve high reliability.

Data Formats

The data is provided in comma-separated values (CSV) format.
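As a minimal sketch of how such a CSV file might be read, the following uses Python's standard csv module. The column names (transcript_id, utterance, question_type), the label strings, and the sample rows are hypothetical placeholders for illustration; they are not the dataset's actual schema or contents.

```python
import csv
import io

# Hypothetical placeholder rows. The column names and label strings are
# assumptions for illustration, not the dataset's actual schema or data.
SAMPLE_CSV = """transcript_id,utterance,question_type
t001,"Tell me everything that happened, from the beginning.",invitation
t001,Did you go to school that day?,other
t002,What color was the car?,other
"""


def load_utterances(source):
    """Read utterance rows from an open CSV file-like object into a list of dicts."""
    reader = csv.DictReader(source)
    return [row for row in reader]


rows = load_utterances(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["transcript_id"], row["question_type"], row["utterance"])
```

For a file on disk, the same function could be called on a handle opened with open(path, newline="", encoding="utf-8"), where path is whatever location the real file has.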

Data Challenges

A notable challenge in our dataset is the low frequency of invitations, which may make it difficult to train models effectively. This scarcity calls for deliberate strategies so that automated question type coding systems remain robust and accurate on less common question types.
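One common way to compensate for a rare class is to weight each label inversely to its frequency during training. The sketch below illustrates that general heuristic only; it is not a description of the approach used in this project, and the label names and counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per label computed as total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {label: total / (num_classes * count) for label, count in counts.items()}


# Hypothetical label list: one invitation among ten utterances.
labels = ["invitation"] + ["other"] * 9
weights = inverse_frequency_weights(labels)
print(weights)
```

With these weights, an error on the rare label contributes more to a weighted loss than an error on the frequent one. Other options, such as oversampling the rare class, may suit the data equally well.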
