About This Repository

This repository was designed and built by Michael Booth of DataBooth in collaboration with the research leader, Cathrynne Henshall.

Purpose

The provided notebooks and supporting code are part of a data analysis pipeline for horse behavioural experiments. The primary goal is to process and analyse the log files generated during these experiments, and to store the resulting data in databases for both Reward Prediction Error (RPE) and Cognitive Bias (CB) experiments.

The repository also contains the code for controlling the experiments, which runs on the Raspberry Pi (see the src directory).

High Level Workflow

The provided code and notebooks focus on two types of experiments: RPE and CB. The process is essentially the same for both, with minor differences in data structure (e.g. an additional ResponseCBs table for CB experiments).

  1. Logfile Reconciliation:

Purpose: Identify which log files should be included or excluded from the analysis for each experiment type.

Steps:

  • Load and Parse Log Files: Load log files and parse their filenames to extract metadata.
  • Apply Exclusion Rules: Exclude log files based on predefined rules (e.g., test runs, bad data).
  • Export Lists: Export lists of included and excluded log files for further processing.
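The reconciliation step can be sketched as follows. Note that the filename pattern and exclusion rules here are hypothetical stand-ins; the real ones are defined in logfiles.py and the notebooks.

```python
import re
from typing import Optional

# Hypothetical filename pattern: <subject>_<type>_<YYYYMMDD-HHMMSS>.log
# (the real pattern is defined in logfiles.py and may differ)
FILENAME_RE = re.compile(
    r"(?P<subject>[A-Za-z]+)_(?P<experiment_type>RPE|CB)_(?P<timestamp>\d{8}-\d{6})\.log"
)

# Illustrative exclusion rule: drop test runs (the real rules are project-specific)
EXCLUDED_SUBJECTS = {"test"}

def parse_log_filename(name: str) -> Optional[dict]:
    """Extract metadata from a log filename; None if it doesn't match the pattern."""
    match = FILENAME_RE.fullmatch(name)
    return match.groupdict() if match else None

def reconcile(filenames):
    """Split filenames into (included, excluded) lists for export."""
    included, excluded = [], []
    for name in filenames:
        meta = parse_log_filename(name)
        if meta is None or meta["subject"].lower() in EXCLUDED_SUBJECTS:
            excluded.append(name)
        else:
            included.append(name)
    return included, excluded
```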
  2. Logfile to Database:

Purpose: Load the reconciled log files into a DuckDB database for each experiment type.

Steps:

  • Load Log Files: Load the log files determined to be included from the reconciliation step.
  • Parse and Extract Data: Extract relevant data from each log file and structure it for database insertion.
  • Insert Data into Database: Insert the parsed data into a DuckDB database.
  3. Database Queries:

Purpose: Cross-check the data in the databases and provide examples of typical queries for each experiment type.

Steps:

  • Setup Database Connection: Connect to the DuckDB database.
  • Run SQL Queries: Execute SQL queries to verify data integrity and demonstrate typical data retrieval operations.
  • Provide Query Examples: Showcase various SQL queries that might be useful for future analysis.

Note: These notebooks do not perform specific data analysis at this stage. They serve as a foundation for future analysis, which may be conducted in additional notebooks or external tools.

  4. Database Consistency Checks:

Purpose: Perform consistency checks on the local DuckDB CB database created using logfile-to-database-CB.ipynb.

Steps:

  • Run Integrity Checks: Execute a series of SQL queries to verify the integrity of the CB database:
    • Check for orphaned records (foreign key constraint violations)
    • Check for duplicate primary keys
    • Check for NULL values in primary key columns
    • Check for records with ID = 0
    • Check for consistency in trial numbering within experiments
    • Verify that all trials have associated events
    • Check for mismatches in SessionType between ExperimentCBs and TrialCBs
  • Display Results: Show the results of each query, highlighting any potential issues with data integrity.

This notebook serves as a crucial step in the data analysis pipeline, ensuring the consistency and integrity of the database before proceeding with further analysis. It helps identify any potential issues that may have occurred during the data import process or due to inconsistencies in the original log files.

  5. Corrections to CB database:

Purpose: For CB experiments only, load manual corrections to the CB database.

Steps:

  • Load the corrections from the Corrections workbook (field, current value, corrected value).
  • Insert these into a CB database table.
  • Apply corrections to a new view on the data.
  6. Apply corrections to CB database:

Purpose: For CB experiments only, apply the manual corrections to the CB database.

Steps:

  • Load the corrections from the Corrections table (field, current value, corrected value).
  • Apply corrections to create new “corrected” views on the data.

Process Flow Diagram

graph TD
    A[Start] --> B[Load Log files]
    B --> C[Parse filenames]
    C --> D[Apply Exclusion rules]
    D --> E[Export Included/Excluded lists]
    E --> F[Load Included Log files]
    F --> G[Parse and extract data]
    G --> H[Insert data into `DuckDB`]
    H --> I[Load and Apply Corrections - CB only]
    I --> J[End]

    classDef process fill:#f9f,stroke:#333,stroke-width:2px;
    class A,B,C,D,E,F,G,H,I,J process;

Explanation of the Diagram

  1. Start: The process begins with the loading of log files.
  2. Load Log Files: Log files are loaded from a specified directory.
  3. Parse Filenames: Filenames are parsed to extract metadata such as subject name and experiment type.
  4. Apply Exclusion Rules: Predefined rules are applied to exclude certain log files, and additional criteria are used to include or exclude specific files.
  5. Export Included/Excluded Lists: Lists of included and excluded log files are exported for further processing.
  6. Load Included Log Files: The included log files are loaded for detailed analysis.
  7. Parse and Extract Data: Relevant data is extracted from each log file and structured for database insertion.
  8. Insert Data into DuckDB: The parsed data is inserted into a DuckDB database.
  9. Load and Apply Corrections (CB only): Manual corrections are loaded and applied to the CB database.
  10. End: The process concludes.

This workflow ensures that only relevant and high-quality log files are included in the analysis, and the data is structured and stored in a database for both RPE and CB experiments. The process is designed to be flexible and repeatable for both experiment types, with minor adjustments to account for the differences in data structure. The database queries serve as a foundation for future in-depth analysis, providing a way to verify data integrity and demonstrate typical data retrieval operations.

Code Features

The following (custom) modules are used to assist with the data transformations and analysis:

  • Logfiles Module (logfiles.py): Handles the loading, parsing, and management of log files.
  • Project Module (project.py): Manages project configuration, directory setup, and database initialisation.
  • Subject Module (subject.py): Loads and processes subject information from external files, providing crucial data for both RPE and CB experiments.
  • Utils Module: Provides utility functions for displaying class definitions.

The key custom classes to assist with the analysis are:

Project Class

The Project class is like a central organiser for your research project. It helps you keep everything tidy and in the right place. Here’s what it does:

  1. Sets up your project’s structure: It creates folders for your data, results, and other important files.
  2. Manages your database: It sets up and maintains a database (by experiment type i.e. RPE or CB) where you can store all your experimental data.
  3. Keeps track of important information: It remembers things like what type of experiment you’re running and where all your files are located.
  4. Provides useful tools: It offers methods to help you do common tasks, like creating links to your files or exporting data.

Think of the Project class as your personal assistant for your research project. It helps you stay organised and provides you with the tools you need to manage your experiments efficiently.
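As an illustration of the idea only (the real attribute and method names in project.py may differ), the Project class can be sketched as:

```python
from dataclasses import dataclass
from pathlib import Path

# Illustrative sketch of the Project idea; not the actual project.py API.
@dataclass
class Project:
    name: str
    experiment_type: str  # "RPE" or "CB"
    root: Path = Path(".")

    @property
    def data_dir(self) -> Path:
        return self.root / "data"

    @property
    def database_path(self) -> Path:
        # One DuckDB database per experiment type
        return self.data_dir / f"{self.experiment_type.lower()}.duckdb"
```

With this shape, each notebook only needs to construct a Project once and ask it for paths, rather than hard-coding directories.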

Subject Class

The Subject class represents an individual participant in your experiment - in this case, a horse. It’s like a digital profile for each horse in your study. Here’s what it does:

  1. Stores information about the horse: It keeps track of details like the horse’s name, cohort, and any other relevant characteristics.
  2. Manages experimental data: It helps you organise and access the data collected for each horse during the experiments.
  3. Provides easy access to subject-specific information: It allows you to quickly retrieve information about a particular horse.

Think of the Subject class as a digital folder for each horse in your study. It keeps all the information about that horse in one place, making it easy for you to access and manage data for individual subjects.

Together, these classes help you organise your research project, manage your data efficiently, and keep track of information about each participant in your study. They’re designed to make your research process smoother and more organised.

Logfile Classes

  1. Logfile Class: This is like a digital version of a physical log file. It can:
  • Read and store the contents of a log file
  • Parse the filename to extract important information
  • Interpret the contents of the file to create Experiment, Trial, and Event objects (for both RPE and CB experiments)
  2. Logs Class: This is like a file cabinet for all your log files. It can:
  • Find all the log files in a specific folder
  • Load them into Logfile objects
  • Keep track of which files are included or excluded from analysis
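The Logs idea can be sketched as follows (illustrative only; the real class in logfiles.py also parses each file into Logfile objects):

```python
from pathlib import Path

# Illustrative sketch: discover log files and track inclusion status.
class Logs:
    def __init__(self, folder, excluded=None):
        self.excluded = set(excluded or [])
        self.files = sorted(Path(folder).glob("*.log"))

    @property
    def included(self):
        return [f for f in self.files if f.name not in self.excluded]

    @property
    def excluded_files(self):
        return [f for f in self.files if f.name in self.excluded]
```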

Experimental Classes

These are also located in logfiles.py.

  1. Experiment and ExperimentCB Classes: These are like digital record cards for each experiment. They store important information about an experiment, such as:
  • Who participated (the subject)
  • When it happened
  • What type of experiment it was
  • Any comments or notes
  • The name of the log file

The ExperimentCB class is specifically for cognitive bias experiments and includes some extra details.

  2. Trial and TrialCB Classes: These represent individual trials within an experiment. They keep track of:
  • When the trial started and ended
  • What number trial it was
  • For cognitive bias trials (TrialCB), additional information like the type of response and direction
  3. Event and EventCB Classes: These capture specific moments or actions during a trial. They record:
  • What happened (the type of event)
  • When it happened
  • How long into the trial it occurred
  4. ResponseCB Class (CB experiments only): This is specific to cognitive bias experiments and records the subject’s responses, including:
  • When the response occurred
  • How long it took (response time)

These classes work together to organise and make sense of the data from your experiments. They turn raw log files into structured data that’s easier to analyse and understand. The Logfile class does the hard work of interpreting each file, while the Logs class manages the collection of all your files.

Corrected views

While the “corrected” views are defined in the corrections-to-database-CB.ipynb and database-apply-corrections-CB.ipynb notebooks, they have also been extracted into sql/corrections-to-database-CB.sql and sql/database-apply-corrections-CB.sql for ease of reference.

Database details

RPE - Entity Relationship Diagram

erDiagram
    Experiments {
        int ExperimentID PK "Primary Key"
        varchar SubjectName
        int SessionNumber
        varchar ExperimentType
        text Comment
        varchar StartDateTime
        varchar LogFileName
        varchar MeasurementFileName
    }

    Parameters {
        int ParameterID PK "Primary Key"
        int ExperimentID FK "Foreign Key"
        varchar ParameterName
        varchar ParameterValue
    }

    Trials {
        int TrialID PK "Primary Key"
        int ExperimentID FK "Foreign Key"
        int TrialNumber
        datetime StartToneDateTime
    }

    Events {
        int EventID PK "Primary Key"
        int TrialID FK "Foreign Key"
        varchar EventType
        datetime EventDateTime
    }

    Experiments ||--o{ Parameters : "has"
    Experiments ||--o{ Trials : "has"
    Trials ||--o{ Events : "has"

CB - Entity Relationship Diagrams (ERD)

The Entity-Relationship Diagram (ERD) below can be summarised as follows:

  1. Core Data Tables:
    • ExperimentCBs: This is the main table that stores information about each experiment. It includes details like the subject’s name, experiment date, and type of experiment.
    • TrialCBs: This table contains information about individual trials within each experiment. It’s connected to ExperimentCBs because each experiment consists of multiple trials.
    • ResponseCBs: This table records the responses given during each trial. It’s linked to TrialCBs because each trial can have a response.
    • EventCBs: This table stores events that occur during trials. It’s also connected to TrialCBs as events are associated with specific trials.
  2. Supporting Tables:
    • SubjectCBs: This table holds information about the subjects (e.g. horses) involved in the experiments. It’s connected to ExperimentCBs as each experiment involves a specific subject.
    • Corrections: A special table that consolidates the information in the Corrections workbook, describing corrections or additions that need to be made to the existing data.
  3. Correction Process:
    • CorrectionsSplit: This view breaks down the Corrections table into more specific pieces of information.
    • AddExperiments and ReformattedAddExperiments: These views identify and format new experiments that need to be added to the database. It turns out there is only one experiment to add.
    • AddTrials: This view identifies new trials that need to be added to existing experiments.
    • CorrectTrials: This view shows what corrections need to be made to existing trials.
  4. Corrected Data Views:
    • ExperimentCBsCorrected: This view combines the original experiment data with any new experiments that need to be added.
    • TrialCBsCorrected: This view shows all trials, including both original and new ones, with any necessary corrections applied.
    • ResponseCBsCorrected: This view presents all responses, including both original and new ones, with any necessary corrections applied.
    • EventCBsCorrected: This view shows all events, including both original and new ones, with any necessary additions for new trials.

The overall data flow is:

  1. Original data is stored in the core tables (ExperimentCBs, TrialCBs, ResponseCBs, EventCBs).
  2. Corrections and additions are specified in the Corrections table.
  3. Various views process these corrections and additions.
  4. The final corrected views (ExperimentCBsCorrected, TrialCBsCorrected, ResponseCBsCorrected, EventCBsCorrected) present the updated and corrected data for use in subsequent analysis.

erDiagram
    ExperimentCBs ||--o{ TrialCBs : contains
    TrialCBs ||--o{ ResponseCBs : has
    TrialCBs ||--o{ EventCBs : has
    ExperimentCBs ||--o{ SubjectCBs : aggregates
    Corrections ||--|{ CorrectionsSplit : splits
    Corrections ||--o{ AddExperiments : identifies
    Corrections ||--o{ AddTrials : identifies
    Corrections ||--o{ CorrectTrials : informs
    Corrections ||--o{ ResponseCBsCorrected : informs
    ExperimentCBs ||--o{ ExperimentCBsCorrected : corrects
    TrialCBs ||--o{ TrialCBsCorrected : corrects
    ResponseCBs ||--o{ ResponseCBsCorrected : corrects
    EventCBs ||--o{ EventCBsCorrected : corrects
    ExperimentCBsCorrected ||--o{ TrialCBsCorrected : contains
    TrialCBsCorrected ||--o{ ResponseCBsCorrected : has
    TrialCBsCorrected ||--o{ EventCBsCorrected : has
    AddExperiments ||--o{ ReformattedAddExperiments : reformats
    ReformattedAddExperiments ||--o{ ExperimentCBsCorrected : adds
    AddTrials ||--o{ TrialCBsCorrected : adds
    CorrectTrials ||--o{ TrialCBsCorrected : informs
    CorrectionsSplit ||--o{ CorrectTrials : informs
    CorrectionsSplit ||--o{ ResponseCBsCorrected : informs

    ExperimentCBs {
        int ExperimentID PK
        string LogFileName
    }

    TrialCBs {
        int TrialID PK
        int ExperimentID FK
        int TrialNumber
        string ResponseType
    }

    ResponseCBs {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
    }

    EventCBs {
        int EventID PK
        int TrialID FK
    }

    Corrections {
        string LogFilename
        int TrialNumber
        string ResponseType
        float ResponseTime
    }

    CorrectionsSplit {
        string LogFilename
        int TrialNumber
        string TableToCorrect
        string FieldToCorrect
        string ValueToCorrect
    }

    ExperimentCBsCorrected {
        int ExperimentID PK
        string DataSource
    }

    TrialCBsCorrected {
        int TrialID PK
        string CorrectedFlag
    }

    ResponseCBsCorrected {
        int ResponseID PK
        string CorrectedFlag
    }

    EventCBsCorrected {
        int EventID PK
        string CorrectedFlag
    }

    AddExperiments {
        string LogFilename
        string DataSource
    }

    ReformattedAddExperiments {
        int ExperimentID PK
        string DataSource
    }

    AddTrials {
        int ExperimentID FK
        int TrialNumber
        string CorrectedFlag
    }

    CorrectTrials {
        int TrialID PK
        string CorrectedFlag
    }

ExperimentCBsCorrected

This view combines original experiment data with new experiments that need to be added:

  • It starts with the original experiment data from ExperimentCBs.
  • It looks at a list of new experiments (AddExperiments), derived from the Corrections table, that need to be added.
  • These new experiments are reformatted (ReformattedAddExperiments) to match the structure of existing experiments.
  • The view then combines the original experiments with these new, reformatted experiments.
  • It also looks up information from the SubjectCBs table to fill in some details about the properties of the new experiments.

erDiagram
    ExperimentCBs ||--|{ ExperimentCBsCorrected : "original data"
    AddExperiments ||--|{ ReformattedAddExperiments : reformats
    ReformattedAddExperiments ||--|{ ExperimentCBsCorrected : "adds new experiments"
    Corrections ||--|{ AddExperiments : "identifies new experiments"
    SubjectCBs ||--o{ ReformattedAddExperiments : "provides subject info"

    ExperimentCBs {
        int ExperimentID PK
        string LogFileName
        string DataSource
    }
    AddExperiments {
        string LogFilename
        string DataSource
    }
    ReformattedAddExperiments {
        int ExperimentID PK
        string LogFilename
        string DataSource
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string LogFileName
        string DataSource
    }
    Corrections {
        string LogFilename
    }
    SubjectCBs {
        string SubjectName PK
        int SubjectNumber
        string Direction
    }

TrialCBsCorrected

This view handles existing trials, applying any corrections needed, as well as new trials that must be added (due to new experiments):

  • It starts with the original trial data from TrialCBs.
  • It applies corrections to existing trials (CorrectTrials), fixing the ResponseType.
  • It also adds completely new trials (AddTrials) for new experiments.
  • The view combines the corrected existing trials and the new trials.
  • It ensures that each trial has a unique identifier.

erDiagram
    TrialCBs ||--|{ CorrectTrials : "corrects existing trials"
    Corrections ||--|{ AddTrials : "identifies new trials"
    CorrectTrials ||--|{ TrialCBsCorrected : combines
    AddTrials ||--|{ TrialCBsCorrected : combines
    ExperimentCBsCorrected ||--o{ AddTrials : "links new trials"
    ExperimentCBsCorrected ||--o{ CorrectTrials : "links existing trials"
    ResponseCBs ||--o{ CorrectTrials : "provides ResponseTime for info"

    TrialCBs {
        int TrialID PK
        int ExperimentID FK
        string ResponseType
    }
    CorrectTrials {
        int TrialID PK
        int ExperimentID FK
        string ResponseType
        string CorrectedFlag
    }
    AddTrials {
        int ExperimentID FK
        int TrialNumber
        string ResponseType
        string CorrectedFlag
    }
    TrialCBsCorrected {
        int TrialID PK
        int ExperimentID FK
        string ResponseType
        string CorrectedFlag
    }
    Corrections {
        string LogFilename
        int TrialNumber
        string ResponseType
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string LogFileName
    }
    ResponseCBs {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
    }

ResponseCBsCorrected

This view deals with the responses recorded during trials:

  • It starts with the original response data from ResponseCBs.
  • It applies any corrections to existing responses as necessary, fixing ResponseTime.
  • It adds new responses for trials in new experiments.
  • The view combines the corrected existing responses and the new responses.
  • It uses information from the TrialCBsCorrected and ExperimentCBsCorrected views to ensure everything lines up correctly.

erDiagram
    ResponseCBs ||--|{ ExistingResponses : "corrects existing responses"
    Corrections ||--|{ NewResponses : "adds new responses"
    ExistingResponses ||--|{ ResponseCBsCorrected : combines
    NewResponses ||--|{ ResponseCBsCorrected : combines
    TrialCBsCorrected ||--o{ NewResponses : "links new responses"
    TrialCBsCorrected ||--o{ ExistingResponses : "links existing responses"
    ExperimentCBsCorrected ||--o{ NewResponses : "links to experiments"
    CorrectionsSplit ||--o{ ExistingResponses : "provides correction info"

    ResponseCBs {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
    }
    ExistingResponses {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
        string CorrectedFlag
    }
    NewResponses {
        int TrialID FK
        float ResponseTime
        string CorrectedFlag
    }
    ResponseCBsCorrected {
        int ResponseID PK
        int TrialID FK
        float ResponseTime
        string CorrectedFlag
    }
    Corrections {
        string LogFilename
        int TrialNumber
        float ResponseTime
    }
    TrialCBsCorrected {
        int TrialID PK
        int ExperimentID FK
    }
    ExperimentCBsCorrected {
        int ExperimentID PK
        string LogFileName
    }
    CorrectionsSplit {
        string LogFilename
        int TrialNumber
        string TableToCorrect
        string FieldToCorrect
        string ValueToCorrect
    }

EventCBsCorrected

This view handles events that occur during trials:

  • It starts with the original event data from EventCBs.
  • It keeps all existing events as is.
  • For new trials that were added, it creates placeholder events (NewTrialEvents).
  • The view combines the existing events and the placeholder events for new trials.

erDiagram
    EventCBs ||--|{ ExistingEvents : "includes existing events"
    TrialCBsCorrected ||--|{ NewTrialEvents : "generates placeholder events"
    ExistingEvents ||--|{ EventCBsCorrected : combines
    NewTrialEvents ||--|{ EventCBsCorrected : combines

    EventCBs {
        int EventID PK
        int TrialID FK
        string EventType
        datetime EventTime
        float ElapsedTime
    }
    ExistingEvents {
        int EventID PK
        int TrialID FK
        string EventType
        datetime EventTime
        float ElapsedTime
        string CorrectedFlag
    }
    NewTrialEvents {
        int TrialID FK
        string CorrectedFlag
    }
    EventCBsCorrected {
        int EventID PK
        int TrialID FK
        string EventType
        datetime EventTime
        float ElapsedTime
        string CorrectedFlag
    }
    TrialCBsCorrected {
        int TrialID PK
        string CorrectedFlag
    }

Mappings of dataclasses to database tables

Mappings between the data classes and their corresponding DuckDB database tables:

RPE

  1. Experiment: Maps to the Experiments table, storing details about the experiment such as subject name, session number, and parameters.

  2. Trial: Maps to the Trials table, representing individual trials within an experiment with start and end times.

  3. Event: Maps to the Events table, capturing specific events that occur during trials, including event types and timings.

CB

  1. ExperimentCB: Maps to the ExperimentCBs table, similar to Experiment but includes a session type field for cognitive bias experiments.

  2. TrialCB: Maps to the TrialCBs table, representing trials in cognitive bias experiments with additional fields for response type and criteria.

  3. ResponseCB: Maps to the ResponseCBs table, recording responses specific to cognitive bias trials, including response times.

  4. EventCB: Maps to the EventCBs table, capturing events during cognitive bias trials with relevant timing information.
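One way such a mapping can work (illustrative only; the real insertion code is in the notebooks and logfiles.py) is to derive the table name and column list from the dataclass itself, so the class and its table stay in step:

```python
from dataclasses import dataclass, fields
import datetime

# Illustrative Trial dataclass; the real classes carry more fields.
@dataclass
class Trial:
    TrialID: int
    ExperimentID: int
    TrialNumber: int
    StartToneDateTime: datetime.datetime

def insert_sql(obj) -> str:
    """Build a parameterised INSERT for the table named <ClassName>s."""
    table = type(obj).__name__ + "s"  # Trial -> Trials, ExperimentCB -> ExperimentCBs
    cols = [f.name for f in fields(obj)]
    placeholders = ", ".join("?" for _ in cols)
    return f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
```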