```mermaid
graph TD
    A[Start] --> B[Load Log Files]
    B --> C[Parse Filenames]
    C --> D[Apply Exclusion Rules]
    D --> E[Export Included/Excluded Lists]
    E --> F[Load Included Log Files]
    F --> G[Parse and Extract Data]
    G --> H[Insert Data into DuckDB]
    H --> I[Load and Apply Corrections - CB only]
    I --> J[End]
    classDef process fill:#f9f,stroke:#333,stroke-width:2px;
    class A,B,C,D,E,F,G,H,I,J process;
```
## About This Repository

This repository is designed and built by Michael Booth of DataBooth in collaboration with the Research Leader, Cathrynne Henshall.
## Purpose
The provided notebooks and supporting code are part of a data analysis pipeline for horse behavioral experiments. The primary goal is to process and analyse log files generated during these experiments, and to store this data in databases, for both Reward Prediction Error (RPE) and Cognitive Bias (CB) experiments.
The repository also contains the code for controlling the experiments performed on the Raspberry Pi (see the `src` directory).
## High Level Workflow
The provided code and notebooks focus on two types of experiments: RPE and CB. The process is essentially the same for both, with minor differences in data structure (e.g. an additional `ResponseCBs` table for CB experiments).
- **Logfile Reconciliation**
  - Purpose: Identify which log files should be included in or excluded from the analysis for each experiment type.
  - Steps:
    - Load and Parse Log Files: Load log files and parse their filenames to extract metadata.
    - Apply Exclusion Rules: Exclude log files based on predefined rules (e.g., test runs, bad data).
    - Export Lists: Export lists of included and excluded log files for further processing.
- **Logfile to Database**
  - Purpose: Load the reconciled log files into a DuckDB database for each experiment type.
  - Steps:
    - Load Log Files: Load the log files determined to be included during the reconciliation step.
    - Parse and Extract Data: Extract relevant data from each log file and structure it for database insertion.
    - Insert Data into Database: Insert the parsed data into a DuckDB database.
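The insert step can be sketched as follows. To keep the example dependency-free it uses the stdlib `sqlite3` module; the actual pipeline writes to DuckDB (whose Python API, `duckdb.connect(...)`, accepts the same SQL for these statements). The schema and values are illustrative.

```python
import sqlite3

# Stand-in for a DuckDB connection; the CREATE/INSERT SQL is identical in DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Trials (
        TrialID INTEGER PRIMARY KEY,
        ExperimentID INTEGER,
        TrialNumber INTEGER,
        StartToneDateTime TEXT
    )
""")

# Rows as parsed from a log file (illustrative values)
parsed_trials = [
    (1, 1, 1, "2023-01-01 09:00:00"),
    (2, 1, 2, "2023-01-01 09:05:00"),
]
conn.executemany("INSERT INTO Trials VALUES (?, ?, ?, ?)", parsed_trials)

n_trials = conn.execute("SELECT COUNT(*) FROM Trials").fetchone()[0]
```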
- **Database Queries**
  - Purpose: Cross-check the data in the databases and provide examples of typical queries for each experiment type.
  - Steps:
    - Set Up Database Connection: Connect to the DuckDB database.
    - Run SQL Queries: Execute SQL queries to verify data integrity and demonstrate typical data retrieval operations.
    - Provide Query Examples: Showcase various SQL queries that might be useful for future analysis.
Note: These notebooks do not perform specific data analysis at this stage. They serve as a foundation for future analysis, which may be conducted in additional notebooks or external tools.
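A typical cross-check query of the kind these notebooks demonstrate is a trials-per-experiment count. The schema and values below are contrived for illustration (using stdlib `sqlite3`; the same SQL runs against the project's DuckDB database).

```python
import sqlite3

# Minimal stand-in schema for the RPE database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Experiments (ExperimentID INTEGER PRIMARY KEY, SubjectName TEXT);
    CREATE TABLE Trials (TrialID INTEGER PRIMARY KEY, ExperimentID INTEGER, TrialNumber INTEGER);
    INSERT INTO Experiments VALUES (1, 'Bella'), (2, 'Duke');
    INSERT INTO Trials VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")

# How many trials does each experiment's subject have?
rows = conn.execute("""
    SELECT e.SubjectName, COUNT(t.TrialID) AS n_trials
    FROM Experiments e
    LEFT JOIN Trials t ON t.ExperimentID = e.ExperimentID
    GROUP BY e.SubjectName
    ORDER BY e.SubjectName
""").fetchall()
```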
- **Database Consistency Checks**
  - Purpose: Perform consistency checks on the local DuckDB CB database created using `logfile-to-database-CB.ipynb`.
  - Steps:
    - Run Integrity Checks: Execute a series of SQL queries to verify the integrity of the CB database:
      - Check for orphaned records (foreign key constraint violations)
      - Check for duplicate primary keys
      - Check for NULL values in primary key columns
      - Check for records with ID = 0
      - Check for consistency in trial numbering within experiments
      - Verify that all trials have associated events
      - Check for mismatches in SessionType between ExperimentCBs and TrialCBs
    - Display Results: Show the results of each query, highlighting any potential issues with data integrity.
This notebook serves as a crucial step in the data analysis pipeline, ensuring the consistency and integrity of the database before proceeding with further analysis. It helps identify any potential issues that may have occurred during the data import process or due to inconsistencies in the original log files.
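Two of the checks above (orphaned records and duplicate primary keys) can be expressed as portable SQL. The sketch below uses stdlib `sqlite3` to stay dependency-free, with data contrived to trigger each check; the notebook runs equivalent queries in DuckDB.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrialCBs (TrialID INTEGER, ExperimentID INTEGER);
    CREATE TABLE EventCBs (EventID INTEGER, TrialID INTEGER);
    INSERT INTO TrialCBs VALUES (1, 1), (2, 1), (2, 1);   -- duplicate TrialID 2
    INSERT INTO EventCBs VALUES (10, 1), (11, 99);        -- orphaned TrialID 99
""")

# Orphaned records: events whose TrialID has no matching trial
orphans = conn.execute("""
    SELECT e.EventID FROM EventCBs e
    LEFT JOIN TrialCBs t ON t.TrialID = e.TrialID
    WHERE t.TrialID IS NULL
""").fetchall()

# Duplicate primary keys in TrialCBs
dupes = conn.execute("""
    SELECT TrialID, COUNT(*) FROM TrialCBs
    GROUP BY TrialID HAVING COUNT(*) > 1
""").fetchall()
```

Here `orphans` flags event 11 and `dupes` flags the doubled TrialID 2.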
- **Corrections to CB Database**
  - Purpose: For CB experiments only, load manual corrections into the CB database.
  - Steps:
    - Load the corrections from the Corrections workbook (field, current value, corrected value).
    - Insert these into a CB database table.
    - Apply corrections via a new view on the data.
- **Apply Corrections to CB Database**
  - Purpose: For CB experiments only, apply the manual corrections to the CB database.
  - Steps:
    - Load the corrections from the Corrections table (field, current value, corrected value).
    - Apply corrections to create new "corrected" views on the data.
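The "corrected view" mechanism can be sketched as a view that prefers a corrected value where one exists and falls back to the original otherwise. The schema below is a simplified, hypothetical stand-in (using stdlib `sqlite3`; the real views are defined in the `sql/` files against DuckDB).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TrialCBs (TrialID INTEGER PRIMARY KEY, ResponseType TEXT);
    CREATE TABLE Corrections (TrialID INTEGER, CorrectedResponseType TEXT);
    INSERT INTO TrialCBs VALUES (1, 'Go'), (2, 'NoGo');
    INSERT INTO Corrections VALUES (2, 'Go');  -- trial 2's response is corrected

    -- COALESCE picks the corrected value where a correction row exists
    CREATE VIEW TrialCBsCorrected AS
    SELECT t.TrialID,
           COALESCE(c.CorrectedResponseType, t.ResponseType) AS ResponseType,
           CASE WHEN c.TrialID IS NULL THEN 'original' ELSE 'corrected' END AS CorrectedFlag
    FROM TrialCBs t
    LEFT JOIN Corrections c ON c.TrialID = t.TrialID;
""")

corrected = conn.execute(
    "SELECT TrialID, ResponseType, CorrectedFlag FROM TrialCBsCorrected ORDER BY TrialID"
).fetchall()
```

Because corrections live in their own table and are applied in a view, the original imported data is never overwritten.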
## Process Flow Diagram

### Explanation of the Diagram
- Start: The process begins with the loading of log files.
- Load Log Files: Log files are loaded from a specified directory.
- Parse Filenames: Filenames are parsed to extract metadata such as subject name and experiment type.
- Apply Exclusion Rules: Predefined rules are applied to exclude certain log files.
- Handle Specific Files: Additional criteria are used to include or exclude specific log files.
- Export Included/Excluded Lists: Lists of included and excluded log files are exported for further processing.
- Load Included Log Files: The included log files are loaded for detailed analysis.
- Parse and Extract Data: Relevant data is extracted from each log file and structured for database insertion.
- Insert Data into DuckDB: The parsed data is inserted into a DuckDB database.
- Load and Apply Corrections (CB only): For CB experiments only, manual corrections are loaded and applied.
- End: The process concludes.
This workflow ensures that only relevant and high-quality log files are included in the analysis, and the data is structured and stored in a database for both RPE and CB experiments. The process is designed to be flexible and repeatable for both experiment types, with minor adjustments to account for the differences in data structure. The database queries serve as a foundation for future in-depth analysis, providing a way to verify data integrity and demonstrate typical data retrieval operations.
## Code Features
The following custom modules are used to assist with the data transformations and analysis:
- Logfiles Module (`logfiles.py`): Handles the loading, parsing, and management of log files.
- Project Module (`project.py`): Manages project configuration, directory setup, and database initialisation.
- Subject Module (`subject.py`): Loads and processes subject information from external files, providing crucial data for both RPE and CB experiments.
- Utils Module: Provides utility functions for displaying class definitions.
The key custom classes to assist with the analysis are:
### Project Class
The Project class is like a central organiser for your research project. It helps you keep everything tidy and in the right place. Here’s what it does:
- Sets up your project’s structure: It creates folders for your data, results, and other important files.
- Manages your database: It sets up and maintains a database (one per experiment type, i.e. RPE or CB) where you can store all your experimental data.
- Keeps track of important information: It remembers things like what type of experiment you’re running and where all your files are located.
- Provides useful tools: It offers methods to help you do common tasks, like creating links to your files or exporting data.
Think of the Project class as your personal assistant for your research project. It helps you stay organised and provides you with the tools you need to manage your experiments efficiently.
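A minimal sketch of the shape such a class might take is below. The attribute names and directory layout are hypothetical; the real `Project` class in `project.py` does considerably more (database initialisation, exports, file links).

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Project:
    """Illustrative organiser: one project per experiment type."""
    experiment_type: str          # "RPE" or "CB"
    root: Path = Path(".")

    @property
    def data_dir(self) -> Path:
        # e.g. data/CB — hypothetical layout for illustration
        return self.root / "data" / self.experiment_type

    @property
    def database_path(self) -> Path:
        # One database per experiment type, e.g. data/CB/CB.duckdb
        return self.data_dir / f"{self.experiment_type}.duckdb"
```

Keeping paths derived from a single root means the whole project can be relocated without touching the analysis code.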
### Subject Class
The Subject class represents an individual participant in your experiment - in this case, a horse. It’s like a digital profile for each horse in your study. Here’s what it does:
- Stores information about the horse: It keeps track of details like the horse’s name, cohort, and any other relevant characteristics.
- Manages experimental data: It helps you organise and access the data collected for each horse during the experiments.
- Provides easy access to subject-specific information: It allows you to quickly retrieve information about a particular horse.
Think of the Subject class as a digital folder for each horse in your study. It keeps all the information about that horse in one place, making it easy for you to access and manage data for individual subjects.
Together, these classes help you organise your research project, manage your data efficiently, and keep track of information about each participant in your study. They’re designed to make your research process smoother and more organised.
### Logfile Classes
`Logfile` Class: This is like a digital version of a physical log file. It can:
- Read and store the contents of a log file
- Parse the filename to extract important information
- Interpret the contents of the file to create `Experiment`, `Trial`, and `Event` objects (for both RPE and CB experiments)
`Logs` Class: This is like a file cabinet for all your log files. It can:
- Find all the log files in a specific folder
- Load them into `Logfile` objects
- Keep track of which files are included in or excluded from analysis
### Experimental Classes
These are also located in `logfiles.py`.
`Experiment` and `ExperimentCB` Classes: These are like digital record cards for each experiment. They store important information about an experiment, such as:
- Who participated (the subject)
- When it happened
- What type of experiment it was
- Any comments or notes
- The name of the log file

The `ExperimentCB` class is specifically for cognitive bias experiments and includes some extra details.
`Trial` and `TrialCB` Classes: These represent individual trials within an experiment. They keep track of:
- When the trial started and ended
- What number trial it was
- For cognitive bias trials (`TrialCB`), additional information like the type of response and direction
`Event` and `EventCB` Classes: These capture specific moments or actions during a trial. They record:
- What happened (the type of event)
- When it happened
- How long into the trial it occurred
`ResponseCB` Class (CB experiments only): This is specific to cognitive bias experiments and records the subject's responses, including:
- When the response occurred
- How long it took (response time)
These classes work together to organise and make sense of the data from your experiments. They turn raw log files into structured data that's easier to analyse and understand. The `Logfile` class does the hard work of interpreting each file, while the `Logs` class manages the collection of all your files.
### Corrected views
While the "corrected" views are defined in the `corrections-to-database-CB.ipynb` and `database-apply-corrections-CB.ipynb` notebooks, they have also been extracted into `sql/corrections-to-database-CB.sql` and `sql/database-apply-corrections-CB.sql` for ease of reference.
## Database details
### RPE - Entity Relationship Diagram
```mermaid
erDiagram
    Experiments {
        int ExperimentID PK "Primary Key"
        varchar SubjectName
        int SessionNumber
        varchar ExperimentType
        text Comment
        varchar StartDateTime
        varchar LogFileName
        varchar MeasurementFileName
    }
    Parameters {
        int ParameterID PK "Primary Key"
        int ExperimentID FK "Foreign Key"
        varchar ParameterName
        varchar ParameterValue
    }
    Trials {
        int TrialID PK "Primary Key"
        int ExperimentID FK "Foreign Key"
        int TrialNumber
        datetime StartToneDateTime
    }
    Events {
        int EventID PK "Primary Key"
        int TrialID FK "Foreign Key"
        varchar EventType
        datetime EventDateTime
    }
    Experiments ||--o{ Parameters : "has"
    Experiments ||--o{ Trials : "has"
    Trials ||--o{ Events : "has"
```
### CB - Entity Relationship Diagrams (ERD)
The Entity-Relationship Diagram (ERD) below can be summarised as follows:
- Core Data Tables:
  - `ExperimentCBs`: The main table, storing information about each experiment, including the subject's name, experiment date, and type of experiment.
  - `TrialCBs`: Contains information about individual trials within each experiment. It's connected to `ExperimentCBs` because each experiment consists of multiple trials.
  - `ResponseCBs`: Records the responses given during each trial. It's linked to `TrialCBs` because each trial can have a response.
  - `EventCBs`: Stores events that occur during trials. It's also connected to `TrialCBs`, as events are associated with specific trials.
- Supporting Tables:
  - `SubjectCBs`: Holds information about the subjects (e.g. horses) involved in the experiments. It's connected to `ExperimentCBs`, as each experiment involves a specific subject.
  - `Corrections`: A special table containing information about corrections or additions to be made to the existing data (a consolidated table of the information in the Corrections workbook).
- Correction Process:
  - `CorrectionsSplit`: This view breaks down the `Corrections` table into more specific pieces of information.
  - `AddExperiments` and `ReformattedAddExperiments`: These views identify and format new experiments to be added to the database. (It turns out there is only one experiment to add.)
  - `AddTrials`: This view identifies new trials that need to be added to existing experiments.
  - `CorrectTrials`: This view shows what corrections need to be made to existing trials.
- Corrected Data Views:
  - `ExperimentCBsCorrected`: Combines the original experiment data with any new experiments that need to be added.
  - `TrialCBsCorrected`: Shows all trials, both original and new, with any necessary corrections applied.
  - `ResponseCBsCorrected`: Presents all responses, both original and new, with any necessary corrections applied.
  - `EventCBsCorrected`: Shows all events, both original and new, with any necessary additions for new trials.
The overall data flow is:
- Original data is stored in the core tables (`ExperimentCBs`, `TrialCBs`, `ResponseCBs`, `EventCBs`).
- Corrections and additions are specified in the `Corrections` table.
- Various views process these corrections and additions.
- The final corrected views (`ExperimentCBsCorrected`, `TrialCBsCorrected`, `ResponseCBsCorrected`, `EventCBsCorrected`) present the updated and corrected data for use in subsequent analysis.
```mermaid
erDiagram
    ExperimentCBs ||--o{ TrialCBs : contains
    TrialCBs ||--o{ ResponseCBs : has
    TrialCBs ||--o{ EventCBs : has
    ExperimentCBs ||--o{ SubjectCBs : aggregates
    Corrections ||--|{ CorrectionsSplit : splits
    Corrections ||--o{ AddExperiments : identifies
    Corrections ||--o{ AddTrials : identifies
    Corrections ||--o{ CorrectTrials : informs
    Corrections ||--o{ ResponseCBsCorrected : informs
    ExperimentCBs ||--o{ ExperimentCBsCorrected : corrects
    TrialCBs ||--o{ TrialCBsCorrected : corrects
    ResponseCBs ||--o{ ResponseCBsCorrected : corrects
    EventCBs ||--o{ EventCBsCorrected : corrects
    ExperimentCBsCorrected ||--o{ TrialCBsCorrected : contains
    TrialCBsCorrected ||--o{ ResponseCBsCorrected : has
    TrialCBsCorrected ||--o{ EventCBsCorrected : has
    AddExperiments ||--o{ ReformattedAddExperiments : reformats
    ReformattedAddExperiments ||--o{ ExperimentCBsCorrected : adds
    AddTrials ||--o{ TrialCBsCorrected : adds
    CorrectTrials ||--o{ TrialCBsCorrected : informs
    CorrectionsSplit ||--o{ CorrectTrials : informs
    CorrectionsSplit ||--o{ ResponseCBsCorrected : informs
    ExperimentCBs {
        int ExperimentID PK
        string LogFileName
    }
    TrialCBs {
        int TrialID PK
        int ExperimentID FK
        int TrialNumber
        string ResponseType
    }
    ResponseCBs {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
    }
    EventCBs {
        int EventID PK
        int TrialID FK
    }
    Corrections {
        string LogFilename
        int TrialNumber
        string ResponseType
        float ResponseTime
    }
    CorrectionsSplit {
        string LogFilename
        int TrialNumber
        string TableToCorrect
        string FieldToCorrect
        string ValueToCorrect
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string DataSource
    }
    TrialCBsCorrected {
        int TrialID PK
        string CorrectedFlag
    }
    ResponseCBsCorrected {
        int ResponseID PK
        string CorrectedFlag
    }
    EventCBsCorrected {
        int EventID PK
        string CorrectedFlag
    }
    AddExperiments {
        string LogFilename
        string DataSource
    }
    ReformattedAddExperiments {
        int ExperimentID PK
        string DataSource
    }
    AddTrials {
        int ExperimentID FK
        int TrialNumber
        string CorrectedFlag
    }
    CorrectTrials {
        int TrialID PK
        string CorrectedFlag
    }
```
#### ExperimentCBsCorrected

This view combines original experiment data with new experiments that need to be added:
- It starts with the original experiment data from `ExperimentCBs`.
- It looks at a list of new experiments (`AddExperiments`), derived from the `Corrections` table, that need to be added.
- These new experiments are reformatted (`ReformattedAddExperiments`) to match the structure of existing experiments.
- The view then combines the original experiments with these new, reformatted experiments.
- It also looks up information from the `SubjectCBs` table to fill in details about the properties of the new experiments.
```mermaid
erDiagram
    ExperimentCBs ||--|{ ExperimentCBsCorrected : "original data"
    AddExperiments ||--|{ ReformattedAddExperiments : reformats
    ReformattedAddExperiments ||--|{ ExperimentCBsCorrected : "adds new experiments"
    Corrections ||--|{ AddExperiments : "identifies new experiments"
    SubjectCBs ||--o{ ReformattedAddExperiments : "provides subject info"
    ExperimentCBs {
        int ExperimentID PK
        string LogFileName
        string DataSource
    }
    AddExperiments {
        string LogFilename
        string DataSource
    }
    ReformattedAddExperiments {
        int ExperimentID PK
        string LogFilename
        string DataSource
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string LogFileName
        string DataSource
    }
    Corrections {
        string LogFilename
    }
    SubjectCBs {
        string SubjectName PK
        int SubjectNumber
        string Direction
    }
```
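The combine step at the heart of this view is a `UNION ALL` of the original table and the reformatted additions, tagged by source. The sketch below uses a heavily simplified schema via stdlib `sqlite3` (the full view also joins `SubjectCBs` for subject details and runs in DuckDB).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ExperimentCBs (ExperimentID INTEGER, LogFileName TEXT);
    CREATE TABLE ReformattedAddExperiments (ExperimentID INTEGER, LogFileName TEXT);
    INSERT INTO ExperimentCBs VALUES (1, 'bella_cb.log');
    INSERT INTO ReformattedAddExperiments VALUES (2, 'duke_cb.log');

    -- Original rows plus added rows, each tagged with its DataSource
    CREATE VIEW ExperimentCBsCorrected AS
    SELECT ExperimentID, LogFileName, 'original' AS DataSource FROM ExperimentCBs
    UNION ALL
    SELECT ExperimentID, LogFileName, 'added' AS DataSource FROM ReformattedAddExperiments;
""")

experiments = conn.execute(
    "SELECT * FROM ExperimentCBsCorrected ORDER BY ExperimentID"
).fetchall()
```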
#### TrialCBsCorrected

This view handles existing trials, any corrections needed, and new trials to be added (due to any new experiments):
- It starts with the original trial data from `TrialCBs`.
- It applies corrections to existing trials (`CorrectTrials`), fixing the `ResponseType`.
- It also adds completely new trials (`AddTrials`) for new experiments.
- The view combines the corrected existing trials and the new trials.
- It ensures that each trial has a unique identifier.
```mermaid
erDiagram
    TrialCBs ||--|{ CorrectTrials : "corrects existing trials"
    Corrections ||--|{ AddTrials : "identifies new trials"
    CorrectTrials ||--|{ TrialCBsCorrected : combines
    AddTrials ||--|{ TrialCBsCorrected : combines
    ExperimentCBsCorrected ||--o{ AddTrials : "links new trials"
    ExperimentCBsCorrected ||--o{ CorrectTrials : "links existing trials"
    ResponseCBs ||--o{ CorrectTrials : "provides ResponseTime for info"
    TrialCBs {
        int TrialID PK
        int ExperimentID FK
        string ResponseType
    }
    CorrectTrials {
        int TrialID PK
        int ExperimentID FK
        string ResponseType
        string CorrectedFlag
    }
    AddTrials {
        int ExperimentID FK
        int TrialNumber
        string ResponseType
        string CorrectedFlag
    }
    TrialCBsCorrected {
        int TrialID PK
        int ExperimentID FK
        string ResponseType
        string CorrectedFlag
    }
    Corrections {
        string LogFilename
        int TrialNumber
        string ResponseType
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string LogFileName
    }
    ResponseCBs {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
    }
```
#### ResponseCBsCorrected

This view deals with the responses recorded during trials:
- It starts with the original response data from `ResponseCBs`.
- It applies any corrections to existing responses as necessary, fixing `ResponseTime`.
- It adds new responses for trials in new experiments.
- The view combines the corrected existing responses and the new responses.
- It uses information from the `TrialCBsCorrected` and `ExperimentCBsCorrected` views to ensure everything lines up correctly.
```mermaid
erDiagram
    ResponseCBs ||--|{ ExistingResponses : "corrects existing responses"
    Corrections ||--|{ NewResponses : "adds new responses"
    ExistingResponses ||--|{ ResponseCBsCorrected : combines
    NewResponses ||--|{ ResponseCBsCorrected : combines
    TrialCBsCorrected ||--o{ NewResponses : "links new responses"
    TrialCBsCorrected ||--o{ ExistingResponses : "links existing responses"
    ExperimentCBsCorrected ||--o{ NewResponses : "links to experiments"
    CorrectionsSplit ||--o{ ExistingResponses : "provides correction info"
    ResponseCBs {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
    }
    ExistingResponses {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
        string CorrectedFlag
    }
    NewResponses {
        int TrialID FK
        float ResponseTime
        string CorrectedFlag
    }
    ResponseCBsCorrected {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
        string CorrectedFlag
    }
    Corrections {
        string LogFilename
        int TrialNumber
        float ResponseTime
    }
    TrialCBsCorrected {
        int TrialID PK
        int ExperimentID FK
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string LogFileName
    }
    CorrectionsSplit {
        string LogFilename
        int TrialNumber
        string TableToCorrect
        string FieldToCorrect
        string ValueToCorrect
    }
```
#### EventCBsCorrected

This view handles events that occur during trials:
- It starts with the original event data from `EventCBs`.
- It keeps all existing events as is.
- For new trials that were added, it creates placeholder events (`NewTrialEvents`).
- The view combines the existing events and the placeholder events for new trials.
```mermaid
erDiagram
    EventCBs ||--|{ ExistingEvents : "includes existing events"
    TrialCBsCorrected ||--|{ NewTrialEvents : "generates placeholder events"
    ExistingEvents ||--|{ EventCBsCorrected : combines
    NewTrialEvents ||--|{ EventCBsCorrected : combines
    EventCBs {
        int EventID PK
        int TrialID FK
        string EventType
        datetime EventTime
        float ElapsedTime
    }
    ExistingEvents {
        int EventID PK
        int TrialID FK
        string EventType
        datetime EventTime
        float ElapsedTime
        string CorrectedFlag
    }
    NewTrialEvents {
        int TrialID FK
        string CorrectedFlag
    }
    EventCBsCorrected {
        int EventID PK
        int TrialID FK
        string EventType
        datetime EventTime
        float ElapsedTime
        string CorrectedFlag
    }
    TrialCBsCorrected {
        int TrialID PK
        string CorrectedFlag
    }
```
### Mappings of dataclasses to database tables

Mappings between the data classes and their corresponding DuckDB database tables:
#### RPE
- `Experiment`: Maps to the `Experiments` table, storing details about the experiment such as subject name, session number, and parameters.
- `Trial`: Maps to the `Trials` table, representing individual trials within an experiment with start and end times.
- `Event`: Maps to the `Events` table, capturing specific events that occur during trials, including event types and timings.
#### CB
- `ExperimentCB`: Maps to the `ExperimentCBs` table; similar to `Experiment` but includes a session type field for cognitive bias experiments.
- `TrialCB`: Maps to the `TrialCBs` table, representing trials in cognitive bias experiments with additional fields for response type and criteria.
- `ResponseCB`: Maps to the `ResponseCBs` table, recording responses specific to cognitive bias trials, including response times.
- `EventCB`: Maps to the `EventCBs` table, capturing events during cognitive bias trials with relevant timing information.
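One way such dataclass-to-table mappings can be kept in sync is to let a dataclass's fields drive INSERT statement generation. The `Trial` class below is an illustrative stand-in, not the actual definition in `logfiles.py` (which has more fields).

```python
from dataclasses import dataclass, fields

@dataclass
class Trial:
    """Illustrative dataclass mirroring the Trials table's columns."""
    TrialID: int
    ExperimentID: int
    TrialNumber: int
    StartToneDateTime: str

def insert_sql(obj, table: str) -> str:
    """Generate a parameterised INSERT whose columns match the dataclass fields."""
    cols = [f.name for f in fields(obj)]
    placeholders = ", ".join("?" for _ in cols)
    return f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
```

Because the column list comes from the dataclass itself, adding a field to the class automatically extends the generated SQL.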