INFO | Data directory: /Users/mjboothaus/code/github/databooth/horse-logic/data
INFO | Data directory purpose: Parent directory for raw and processed data
INFO | Sql directory: /Users/mjboothaus/code/github/databooth/horse-logic/sql
INFO | Sql directory purpose: Store SQL scripts
INFO | Output directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks/results/CB
INFO | Output directory purpose: Store output files and results by experiment type
INFO | Logfiles directory: /Users/mjboothaus/code/github/databooth/horse-logic/data/results/zips/cb_data
INFO | Logfiles directory purpose: Store for the raw log files
INFO | Notebooks directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks
INFO | Notebooks directory purpose: Jupyter notebooks for performing analysis
INFO | Database file path: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO | Database purpose: Main project databases (outputs) by experiment type
INFO | Project initialised (CB): config defined in project_config.yaml
INFO | GitHub repo: https://github.com/DataBooth/horse-logic
INFO | Connected to database: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO | DuckDB version: v1.1.3
INFO | Dropped 14 views.
INFO | Dropped 0 non-default tables.
Purpose of This Notebook
This notebook serves to load data from the manually recorded corrections spreadsheet into a table in the CB database so that they can be applied to the data loaded from the log files.
We create a new Corrections
table which is a replica of the data in the Corrections workbook (.xlsx).
We also create a view SubjectCBs
.
Setup project & directories
Process Corrections workbook (lookup log files from subject/horse name and date)
Corrections workbook: docs/from_CH/Exp1 Errors_Corrections_Template_26Oct2024.xlsx
Worksheets: ['CBHD_Times', 'CBCSU_Times']
SubjectCBs view
While there is not explicitly a Subjects information table created in earlier workbooks, for reference, we can create one from the ExperimentCBs
table.
NOTE: There is no SubjectNumber
= 20 in this table as all of the SubjectName
= Olive experiments were excluded from the analysis.
INFO | Exported table: SubjectCBs. 21 total rows exported to XLSX across 1 sheets.
Table: SubjectCBs
SubjectName | SubjectNumber | Cohort | Direction | TotalExperiments |
---|---|---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Low Cardinality columns for SubjectCBs
column_name | distinct_count | unique_values_with_counts |
---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Load and extract the Corrections data into the CB database
We store the tables of corrections (for HD & CSU) in the Corrections
table and then apply them to create new views (in each row of the spreadsheet we have specified the unique log filename (i.e. Experiment) that the correction(s) apply to.
Summary of Approach:
- The original data which is stored in the
ExperimentCBs
,TrialCBs
,ResponseCBs
andEventCBs
tables does not get changed by this Corrections notebook. - The
Corrections
table contains the information about what corrections need to be made, but it doesn’t directly modify the original data tables and is applied via Views.
i.e. this approach maintains data integrity by:
- Preserving the original experimental data in
ExperimentCBs
,TrialCBs
,ResponseCBs
andEventCBs
. - Storing correction information separately in the
Corrections
table. - Creating views
ExperimentCBsCorrected
,TrialCBsCorrected
,ResponseCBsCorrected
andEventCBsCorrected
that combine the original data with the corrections (via some intermediate views), providing corrected versions of the data for analysis without modifying the initial source tables.
Corrections
table
The Corrections
table primary purpose is to store and manage manual corrections to the experimental data, ensuring data accuracy and integrity.
Source: - It is derived from the worksheets CBHD_Times
(HD cohort) and CBCSU_Times
(CSU cohort) in the workbook labelled Corrections workbook
above.
Purpose: - Data Accuracy: To address and correct any discrepancies or errors in the recorded experimental data (via the log files generated by the RPi). - Flexibility: To allow researchers to adjust data points based on observed behaviour that may not have been accurately captured by the automated system. Often this was obtained by watching the videos of experiments that were not faithfully captured by the RPi. - Traceability: To maintain a record of all manual changes made to the original data, ensuring transparency in the research process. - Quality Control: To improve the overall quality of the dataset by incorporating human expertise and observations.
Structure Overview: - Cohort: Indicates the subject group (i.e., HD, CSU) - inferred from the worksheet name. - ExperimentDate: The date of the experiment. - SessionType: Type of session conducted. - SubjectName: Name of the subject involved. - TrialNumber: Identifier for specific trials within an experiment. - TrialType: The nature of the trial (e.g., positive or negative). - ResponseTime: Time taken for the subject to respond. - ResponseType: Type of response given (e.g., GO, NOGO). - Comment: Notes regarding the correction. - LogFilename: The associated log file for reference. - CorrectTable: Specifies which database table(s) requires correction (e.g. ResponseCBs). - CorrectField: Indicates which field(s) in the table is being corrected (e.g. ResponseTime). - CurrentValue: The original recorded value before correction. - CorrectValue: The new value(s) to be applied after correction.
The last 4 fields can be comma separated lists of items if there are more than one field that is being corrected for a particular trial with an experiment.
Corrections
table
- Purpose: Stores manual corrections and additions to experiments (replica of Corrections workbook).
- Approach: Flat table structure with columns for various experiment details and correction information.
- Structure: Single table with columns for experiment metadata and correction data.
- Features: Includes fields for identifying experiments, trials, and specific corrections to be made from Excel workbook.
Additional Notes:
- The number of each of these match those in the workbook. ✅
- Note that the
CurrentValue
(with all values ofCH, CH
) from the Corrections table is not being used so dropping.
INFO | Exported table: Corrections. 306 total rows exported to XLSX across 1 sheets.
Table: Corrections
Cohort | ExperimentDate | SessionType | SubjectName | TrialNumber | TrialType | ResponseTime | ResponseType | Comment | LogFilename | CorrectTable | CorrectField | CorrectValue |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Low Cardinality columns for Corrections
column_name | distinct_count | unique_values_with_counts |
---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Check for any duplicates in Corrections
table.
┌────────────────┬─────────────┬─────────────┬─────────────┬─────────────┬───────┐
│ ExperimentDate │ SessionType │ SubjectName │ TrialNumber │ LogFilename │ Count │
│ date │ varchar │ varchar │ int32 │ varchar │ int64 │
├────────────────┴─────────────┴─────────────┴─────────────┴─────────────┴───────┤
│ 0 rows │
└────────────────────────────────────────────────────────────────────────────────┘
Let’s look for ResponseType
or ResponseTime
is NULL
.
"SELECT COUNT (*) FROM Corrections WHERE ResponseType IS NULL") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
"SELECT COUNT (*) FROM Corrections WHERE ResponseTime IS NULL") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
"SELECT COUNT(*) FROM Corrections WHERE TRIM(CorrectValue) = ',' ") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 0 │
└──────────────┘
Let’s examine the total count of corrections by Cohort and similarly by Cohort and SubjectName. ✅
┌─────────┬────────────────┐
│ Cohort │ nCorrectionRow │
│ varchar │ int64 │
├─────────┼────────────────┤
│ HD │ 61 │
│ CSU │ 245 │
└─────────┴────────────────┘
Cohort | SubjectName | nCorrectionRow |
---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Identify any filenames in the Corrections workbook that don’t match filenames in the ExperimentCBs table.
┌────────────────────────────────────────────────┐
│ NoLogfileAvailable │
│ varchar │
├────────────────────────────────────────────────┤
│ NO_LOGFILE_AVAILABLE_CSU_2023-11-17_CBF3_yoshi │
└────────────────────────────────────────────────┘
As expected there is only the one Experiment where we are creating a psuedo-log filename to insert/add this into the database.
Let’s also look at the breakdown of number of corrections by cohort, experiment date, session type and subject.
"Corrections_by_Cohort_Date_SessionType_Subject") project.view_and_export_sql(sql,
INFO | Exported query results: Corrections_by_Cohort_Date_SessionType_Subject. 67 total rows exported to XLSX across 1 sheets.
Query Results: Corrections_by_Cohort_Date_SessionType_Subject
Cohort | ExperimentDate | SessionType | SubjectName | nCorrectionRow |
---|---|---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Corrections Split into individual corrections
So as all of the “corrections” have two comma-separated values whether there is a true correction or not we get two rows for each (so originally there are 308 rows in the Corrections
table and 616 rows in the CorrectionsSplit
view.) ✅
= """
sql CREATE OR REPLACE VIEW CorrectionsSplit AS
SELECT
*,
TRIM(UNNEST(STRING_TO_ARRAY(CorrectTable, ','))) AS TableToCorrect,
TRIM(UNNEST(STRING_TO_ARRAY(CorrectField, ','))) AS FieldToCorrect,
TRIM(UNNEST(STRING_TO_ARRAY(CorrectValue, ','))) AS ValueToCorrect
FROM Corrections;
"""
con.sql(sql)
┌──────────────────┐
│ nCorrection_XLSX │
│ int64 │
├──────────────────┤
│ 306 │
└──────────────────┘
┌───────────────────┐
│ nCorrectionSingle │
│ int64 │
├───────────────────┤
│ 612 │
└───────────────────┘
INFO | Exported table: CorrectionsSplit. 612 total rows exported to XLSX across 1 sheets.
Table: CorrectionsSplit
Cohort | ExperimentDate | SessionType | SubjectName | TrialNumber | TrialType | ResponseTime | ResponseType | Comment | LogFilename | CorrectTable | CorrectField | CorrectValue | TableToCorrect | FieldToCorrect | ValueToCorrect |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Low Cardinality columns for CorrectionsSplit
column_name | distinct_count | unique_values_with_counts |
---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
CorrectionWithExistingValues
table
- Purpose: Compares correction values with existing values in the database.
- Approach: Joins CorrectionsSplit with relevant tables to fetch existing values.
- Structure: View that extends CorrectionsSplit with additional columns for existing values and comparison results.
- Features: Provides a comparison between correction values and current database values.
INFO | Exported table: CorrectionWithExistingValues. 612 total rows exported to XLSX across 1 sheets.
Table: CorrectionWithExistingValues
Cohort | ExperimentDate | SessionType | SubjectName | TrialNumber | TrialType | ResponseTime | ResponseType | Comment | LogFilename | CorrectTable | CorrectField | CorrectValue | TableToCorrect | FieldToCorrect | ValueToCorrect | ExistingResponseType | ExistingResponseTime | ComparisonResult |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Low Cardinality columns for CorrectionWithExistingValues
column_name | distinct_count | unique_values_with_counts |
---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
┌────────────────┐
│ nNotCorrection │
│ int64 │
├────────────────┤
│ 235 │
└────────────────┘
┌─────────────┐
│ nCorrection │
│ int64 │
├─────────────┤
│ 377 │
└─────────────┘
Actual Corrections
= """
sql CREATE OR REPLACE VIEW ActualCorrections AS
SELECT *
FROM CorrectionWithExistingValues
WHERE ComparisonResult = 'NOT EQUAL';
"""
con.sql(sql)
"ActualCorrections") project.view_and_export_table(
INFO | Exported table: ActualCorrections. 377 total rows exported to XLSX across 1 sheets.
Table: ActualCorrections
Cohort | ExperimentDate | SessionType | SubjectName | TrialNumber | TrialType | ResponseTime | ResponseType | Comment | LogFilename | CorrectTable | CorrectField | CorrectValue | TableToCorrect | FieldToCorrect | ValueToCorrect | ExistingResponseType | ExistingResponseTime | ComparisonResult |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
Low Cardinality columns for ActualCorrections
column_name | distinct_count | unique_values_with_counts |
---|---|---|
Loading ITables v2.1.4 from the init_notebook_mode cell...
(need help?) |
So there are 386 actual corrections to make (and 230 “corrections” which are not actual corrections).
INFO | Database connection closed