Corrections to database - Cognitive Bias (CB)

Purpose of This Notebook

This notebook serves to load data from the manually recorded corrections spreadsheet into a table in the CB database so that they can be applied to the data loaded from the log files.

We create a new Corrections table which is a replica of the data in the Corrections workbook (.xlsx).

We also create a view SubjectCBs.

Setup project & directories

This is the init_notebook_mode cell from ITables v2.1.4
(you should not see this message - is your notebook trusted?)

INFO     | Data directory: /Users/mjboothaus/code/github/databooth/horse-logic/data
INFO     | Data directory purpose: Parent directory for raw and processed data
INFO     | Sql directory: /Users/mjboothaus/code/github/databooth/horse-logic/sql
INFO     | Sql directory purpose: Store SQL scripts
INFO     | Output directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks/results/CB
INFO     | Output directory purpose: Store output files and results by experiment type
INFO     | Logfiles directory: /Users/mjboothaus/code/github/databooth/horse-logic/data/results/zips/cb_data
INFO     | Logfiles directory purpose: Store for the raw log files
INFO     | Notebooks directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks
INFO     | Notebooks directory purpose: Jupyter notebooks for performing analysis
INFO     | Database file path: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO     | Database purpose: Main project databases (outputs) by experiment type
INFO     | Project initialised (CB): config defined in project_config.yaml
INFO     | GitHub repo: https://github.com/DataBooth/horse-logic
INFO     | Connected to database: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO     | DuckDB version: v1.1.3
INFO     | Dropped 14 views.
INFO     | Dropped 0 non-default tables.

Process Corrections workbook (lookup log files from subject/horse name and date)

Corrections workbook: docs/from_CH/Exp1 Errors_Corrections_Template_26Oct2024.xlsx

Worksheets: ['CBHD_Times', 'CBCSU_Times']

SubjectCBs view

While there is not explicitly a Subjects information table created in earlier workbooks, for reference, we can create one from the ExperimentCBs table.

NOTE: There is no SubjectNumber = 20 in this table as all of the SubjectName = Olive experiments were excluded from the analysis.

INFO     | Exported table: SubjectCBs. 21 total rows exported to XLSX across 1 sheets.

results/CB/SubjectCBs.xlsx

Table: SubjectCBs

SubjectName	SubjectNumber	Cohort	Direction	TotalExperiments
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Low Cardinality columns for SubjectCBs

column_name	distinct_count	unique_values_with_counts
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Load and extract the Corrections data into the CB database

We store the tables of corrections (for HD & CSU) in the Corrections table and then apply them to create new views (in each row of the spreadsheet we have specified the unique log filename (i.e. Experiment) that the correction(s) apply to.

Summary of Approach:

The original data which is stored in the ExperimentCBs, TrialCBs, ResponseCBs and EventCBs tables does not get changed by this Corrections notebook.
The Corrections table contains the information about what corrections need to be made, but it doesn’t directly modify the original data tables and is applied via Views.

i.e. this approach maintains data integrity by:

Preserving the original experimental data in ExperimentCBs, TrialCBs, ResponseCBs and EventCBs.
Storing correction information separately in the Corrections table.
Creating views ExperimentCBsCorrected, TrialCBsCorrected, ResponseCBsCorrected and EventCBsCorrected that combine the original data with the corrections (via some intermediate views), providing corrected versions of the data for analysis without modifying the initial source tables.

`Corrections` table

The Corrections table primary purpose is to store and manage manual corrections to the experimental data, ensuring data accuracy and integrity.

Source: - It is derived from the worksheets CBHD_Times (HD cohort) and CBCSU_Times (CSU cohort) in the workbook labelled Corrections workbook above.

Purpose: - Data Accuracy: To address and correct any discrepancies or errors in the recorded experimental data (via the log files generated by the RPi). - Flexibility: To allow researchers to adjust data points based on observed behaviour that may not have been accurately captured by the automated system. Often this was obtained by watching the videos of experiments that were not faithfully captured by the RPi. - Traceability: To maintain a record of all manual changes made to the original data, ensuring transparency in the research process. - Quality Control: To improve the overall quality of the dataset by incorporating human expertise and observations.

Structure Overview: - Cohort: Indicates the subject group (i.e., HD, CSU) - inferred from the worksheet name. - ExperimentDate: The date of the experiment. - SessionType: Type of session conducted. - SubjectName: Name of the subject involved. - TrialNumber: Identifier for specific trials within an experiment. - TrialType: The nature of the trial (e.g., positive or negative). - ResponseTime: Time taken for the subject to respond. - ResponseType: Type of response given (e.g., GO, NOGO). - Comment: Notes regarding the correction. - LogFilename: The associated log file for reference. - CorrectTable: Specifies which database table(s) requires correction (e.g. ResponseCBs). - CorrectField: Indicates which field(s) in the table is being corrected (e.g. ResponseTime). - CurrentValue: The original recorded value before correction. - CorrectValue: The new value(s) to be applied after correction.

The last 4 fields can be comma separated lists of items if there are more than one field that is being corrected for a particular trial with an experiment.

`Corrections` table

Purpose: Stores manual corrections and additions to experiments (replica of Corrections workbook).
Approach: Flat table structure with columns for various experiment details and correction information.
Structure: Single table with columns for experiment metadata and correction data.
Features: Includes fields for identifying experiments, trials, and specific corrections to be made from Excel workbook.

Additional Notes:

The number of each of these match those in the workbook. ✅
Note that the CurrentValue (with all values of CH, CH) from the Corrections table is not being used so dropping.

INFO     | Exported table: Corrections. 306 total rows exported to XLSX across 1 sheets.

results/CB/Corrections.xlsx

Table: Corrections

Cohort	ExperimentDate	SessionType	SubjectName	TrialNumber	TrialType	ResponseTime	ResponseType	Comment	LogFilename	CorrectTable	CorrectField	CorrectValue
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Low Cardinality columns for Corrections

column_name	distinct_count	unique_values_with_counts
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Check for any duplicates in Corrections table.

┌────────────────┬─────────────┬─────────────┬─────────────┬─────────────┬───────┐
│ ExperimentDate │ SessionType │ SubjectName │ TrialNumber │ LogFilename │ Count │
│      date      │   varchar   │   varchar   │    int32    │   varchar   │ int64 │
├────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴───────┤
│                                     0 rows                                     │
└────────────────────────────────────────────────────────────────────────────────┘

Let’s look for ResponseType or ResponseTime is NULL.

con.sql("SELECT COUNT (*) FROM Corrections WHERE ResponseType IS NULL")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            0 │
└──────────────┘

con.sql("SELECT COUNT (*) FROM Corrections WHERE ResponseTime IS NULL")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            0 │
└──────────────┘

con.sql("SELECT COUNT(*) FROM Corrections WHERE TRIM(CorrectValue) = ',' ")

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            0 │
└──────────────┘

Let’s examine the total count of corrections by Cohort and similarly by Cohort and SubjectName. ✅

┌─────────┬────────────────┐
│ Cohort  │ nCorrectionRow │
│ varchar │     int64      │
├─────────┼────────────────┤
│ HD      │             61 │
│ CSU     │            245 │
└─────────┴────────────────┘

Cohort	SubjectName	nCorrectionRow
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Identify any filenames in the Corrections workbook that don’t match filenames in the ExperimentCBs table.

┌────────────────────────────────────────────────┐
│               NoLogfileAvailable               │
│                    varchar                     │
├────────────────────────────────────────────────┤
│ NO_LOGFILE_AVAILABLE_CSU_2023-11-17_CBF3_yoshi │
└────────────────────────────────────────────────┘

As expected there is only the one Experiment where we are creating a psuedo-log filename to insert/add this into the database.

Let’s also look at the breakdown of number of corrections by cohort, experiment date, session type and subject.

project.view_and_export_sql(sql, "Corrections_by_Cohort_Date_SessionType_Subject")

INFO     | Exported query results: Corrections_by_Cohort_Date_SessionType_Subject. 67 total rows exported to XLSX across 1 sheets.

results/CB/Corrections_by_Cohort_Date_SessionType_Subject.xlsx

Query Results: Corrections_by_Cohort_Date_SessionType_Subject

Cohort	ExperimentDate	SessionType	SubjectName	nCorrectionRow
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Corrections Split into individual corrections

So as all of the “corrections” have two comma-separated values whether there is a true correction or not we get two rows for each (so originally there are 308 rows in the Corrections table and 616 rows in the CorrectionsSplit view.) ✅

sql = """ 
CREATE OR REPLACE VIEW CorrectionsSplit AS
SELECT 
    *,
    TRIM(UNNEST(STRING_TO_ARRAY(CorrectTable, ','))) AS TableToCorrect,
    TRIM(UNNEST(STRING_TO_ARRAY(CorrectField, ','))) AS FieldToCorrect,
    TRIM(UNNEST(STRING_TO_ARRAY(CorrectValue, ','))) AS ValueToCorrect
FROM Corrections;
"""

con.sql(sql)

┌──────────────────┐
│ nCorrection_XLSX │
│      int64       │
├──────────────────┤
│              306 │
└──────────────────┘

┌───────────────────┐
│ nCorrectionSingle │
│       int64       │
├───────────────────┤
│               612 │
└───────────────────┘

INFO     | Exported table: CorrectionsSplit. 612 total rows exported to XLSX across 1 sheets.

results/CB/CorrectionsSplit.xlsx

Table: CorrectionsSplit

Cohort	ExperimentDate	SessionType	SubjectName	TrialNumber	TrialType	ResponseTime	ResponseType	Comment	LogFilename	CorrectTable	CorrectField	CorrectValue	TableToCorrect	FieldToCorrect	ValueToCorrect
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Low Cardinality columns for CorrectionsSplit

column_name	distinct_count	unique_values_with_counts
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

`CorrectionWithExistingValues` table

Purpose: Compares correction values with existing values in the database.
Approach: Joins CorrectionsSplit with relevant tables to fetch existing values.
Structure: View that extends CorrectionsSplit with additional columns for existing values and comparison results.
Features: Provides a comparison between correction values and current database values.

INFO     | Exported table: CorrectionWithExistingValues. 612 total rows exported to XLSX across 1 sheets.

results/CB/CorrectionWithExistingValues.xlsx

Table: CorrectionWithExistingValues

Cohort	ExperimentDate	SessionType	SubjectName	TrialNumber	TrialType	ResponseTime	ResponseType	Comment	LogFilename	CorrectTable	CorrectField	CorrectValue	TableToCorrect	FieldToCorrect	ValueToCorrect	ExistingResponseType	ExistingResponseTime	ComparisonResult
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Low Cardinality columns for CorrectionWithExistingValues

column_name	distinct_count	unique_values_with_counts
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

┌────────────────┐
│ nNotCorrection │
│     int64      │
├────────────────┤
│            235 │
└────────────────┘

┌─────────────┐
│ nCorrection │
│    int64    │
├─────────────┤
│         377 │
└─────────────┘

Actual Corrections

sql = """
CREATE OR REPLACE VIEW ActualCorrections AS
SELECT *
FROM CorrectionWithExistingValues
WHERE ComparisonResult = 'NOT EQUAL';
"""

con.sql(sql)

project.view_and_export_table("ActualCorrections")

INFO     | Exported table: ActualCorrections. 377 total rows exported to XLSX across 1 sheets.

results/CB/ActualCorrections.xlsx

Table: ActualCorrections

Cohort	ExperimentDate	SessionType	SubjectName	TrialNumber	TrialType	ResponseTime	ResponseType	Comment	LogFilename	CorrectTable	CorrectField	CorrectValue	TableToCorrect	FieldToCorrect	ValueToCorrect	ExistingResponseType	ExistingResponseTime	ComparisonResult
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

Low Cardinality columns for ActualCorrections

column_name	distinct_count	unique_values_with_counts
Loading ITables v2.1.4 from the `init_notebook_mode` cell... (need help?)

So there are 386 actual corrections to make (and 230 “corrections” which are not actual corrections).

INFO     | Database connection closed

Purpose of This Notebook

Setup project & directories

Process Corrections workbook (lookup log files from subject/horse name and date)

SubjectCBs view

Table: SubjectCBs

Low Cardinality columns for SubjectCBs

Load and extract the Corrections data into the CB database

Corrections table

Corrections table

Table: Corrections

Low Cardinality columns for Corrections

Query Results: Corrections_by_Cohort_Date_SessionType_Subject

Corrections Split into individual corrections

Table: CorrectionsSplit

Low Cardinality columns for CorrectionsSplit

CorrectionWithExistingValues table

Table: CorrectionWithExistingValues

Low Cardinality columns for CorrectionWithExistingValues

Actual Corrections

Table: ActualCorrections

Low Cardinality columns for ActualCorrections

`Corrections` table

`Corrections` table

`CorrectionWithExistingValues` table