Logfiles to CB database

Experimental data analysis (Oct / Nov 2023)

Authors
Affiliations

Cathrynne Henshall

Michael J. Booth

Published

Tue Nov 12, 2024 11:38 AM

Abstract

Parsing of the log files into a DuckDB database for analysis.

Purpose of This Notebook

This notebook serves as an exploratory tool for examining the log files produced during the horse behavioural experiments conducted in October and November 2023. It facilitates experimentation with text parsing techniques on the files before they are imported into a database. The primary objective is:

To experiment with regular expressions (regex) for extracting the pertinent data and fields from the log files (a minimal sketch of the idea follows below).
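As a flavour of the parsing problem, here is a minimal regex sketch for pulling the components out of a filename such as Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log (illustrative only; the production pattern lives in logfiles.py):

import re
from datetime import datetime

# Illustrative pattern only: groups for the ISO timestamp, subject name,
# session number and experiment type encoded in the filename.
FILENAME_RE = re.compile(
    r"^Experiment_"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)_"
    r"(?P<subject>[a-z]+)_"
    r"(?P<session>\d+)_"
    r"(?P<exp_type>.+)\.log$"
)

m = FILENAME_RE.match("Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log")
if m:
    print(m["subject"], int(m["session"]), m["exp_type"])
    print(datetime.fromisoformat(m["timestamp"]))
# atom 2 Training - fixed
# 2023-10-06 12:49:44.174442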

Setup project & directories

INFO     | Data directory: /Users/mjboothaus/code/github/databooth/horse-logic/data
INFO     | Data directory purpose: Parent directory for raw and processed data
INFO     | Sql directory: /Users/mjboothaus/code/github/databooth/horse-logic/sql
INFO     | Sql directory purpose: Store SQL scripts
INFO     | Output directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks/results/CB
INFO     | Output directory purpose: Store output files and results by experiment type
INFO     | Logfiles directory: /Users/mjboothaus/code/github/databooth/horse-logic/data/results/zips/cb_data
INFO     | Logfiles directory purpose: Store for the raw log files
INFO     | Notebooks directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks
INFO     | Notebooks directory purpose: Jupyter notebooks for performing analysis
INFO     | Database file path: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO     | Database purpose: Main project databases (outputs) by experiment type
INFO     | Project initialised (CB): config defined in project_config.yaml
INFO     | Connected to database: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO     | DuckDB version: v1.1.2
INFO     | Tables created successfully from /Users/mjboothaus/code/github/databooth/horse-logic/sql/create_cb_experiment_tables_ddb.sql
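For reference, the connection and table-creation steps above amount to something like the following sketch (paths shortened; the real ones are resolved from project_config.yaml by the project setup):

import duckdb

con = duckdb.connect("data/Experiments_CB_2023_Q4.ddb")

# Execute the DDL script statement by statement.
with open("sql/create_cb_experiment_tables_ddb.sql") as f:
    for stmt in f.read().split(";"):
        if stmt.strip():
            con.execute(stmt)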
Get Subject info
INFO     | Loaded subject info from: /Users/mjboothaus/code/github/databooth/horse-logic/docs/from_CH/Cohort data for MB.xlsx
INFO     | Subject count: 22
INFO     | Sorted subject names:
    apollo, ash, atom, bonnie, clover, dodge, dougie, dusty, filly, freya, george, gio, jelly, molly, mowgli, myrtle, 
    nix, olive, pumba, smudge, teddy, yoshi

Load logfiles from CSV list of included files

As determined in logfiles-reconciliation-CB.ipynb

# NB: all_included_files.csv appears to have no header row, so the default
# read treats the first logfile name as a column label (hence 182 rows below
# for 183 files); passing header=None would keep every row.
df = pd.read_csv(OUTPUT_DIR / "all_included_files.csv")
print(df.head())
print(df.columns)
print(df.index)
   0 Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log
0  1  Experiment_2023-10-06T13:21:53.770221_ash_2_Tr...               
1  2  Experiment_2023-10-06T14:14:41.653330_mowgli_1...               
2  3  Experiment_2023-10-06T14:48:40.419276_teddy_1_...               
3  4  Experiment_2023-10-06T15:37:08.619329_dodge_1_...               
4  5  Experiment_2023-10-06T16:43:49.516057_filly_4_...               
Index(['0', 'Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log'], dtype='object')
RangeIndex(start=0, stop=182, step=1)
INFO     | Logs: 183 log files in /Users/mjboothaus/code/github/databooth/horse-logic/data/results/zips/cb_data
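For the database-loading step later in the notebook, the included filenames can be pulled into a plain list, e.g. (a sketch; exactly which column holds the filenames depends on how the CSV was read):

# Hypothetical: the filenames sit in the last column of df.
logfiles_to_process = df.iloc[:, -1].dropna().tolist()
print(len(logfiles_to_process))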

Explore regex parsing of a single example log file

The analysis code is in logfiles.py. Here we explore parsing the log information for the specific example defined below.

Example log file for regex parsing: Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log
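The exploration below works with a Logfile object along these lines (a sketch; Logfile and its parse methods are defined in logfiles.py, and LOGFILES_DIR stands in for the logfiles directory from the setup above):

from logfiles import Logfile

log_eg = Logfile(
    f"{LOGFILES_DIR}/Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log"
)
exp_details = log_eg.parse_filename_components_cog_bias()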

Data structures

These are created as dataclasses in Python (in logfiles.py) and as tables in SQL (sql/create_cb_experiment_tables_ddb.sql).

Experiment table

from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional

@dataclass
class ExperimentCB:
    ExperimentID: int
    Cohort: str
    SubjectName: str
    SubjectNumber: int
    SessionNumber: int
    SessionType: str  # "training" or "test"
    ExperimentType: str
    Direction: str  # new field, looked up from the subject info
    Comment: Optional[str]
    DateTime: datetime
    LogFileName: str
    Parameters: Dict
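The matching DuckDB table looks roughly like this (a sketch with assumed column types; the actual DDL is in sql/create_cb_experiment_tables_ddb.sql):

con.sql("""
    CREATE TABLE IF NOT EXISTS ExperimentCBs (
        ExperimentID   INTEGER,
        Cohort         VARCHAR,
        SubjectName    VARCHAR,
        SubjectNumber  INTEGER,
        SessionNumber  INTEGER,
        SessionType    VARCHAR,   -- "training" or "test"
        ExperimentType VARCHAR,
        Direction      VARCHAR,
        Comment        VARCHAR,
        DateTime       TIMESTAMP,
        LogFileName    VARCHAR,
        Parameters     VARCHAR    -- serialised parameter dict (type assumed)
    )
""")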

e.g. parsed experiment details and comment from the example log file:

ExperimentCB(ExperimentID=None,
             Cohort=None,
             SubjectName='atom',
             SubjectNumber=None,
             SessionNumber=2,
             SessionType='test',
             ExperimentType='Training - fixed',
             Direction=None,
             Comment='First session',
             DateTime=datetime.datetime(2023, 10, 6, 12, 49, 44, 174442),
             LogFileName='Experiment_2023-10-06T12:49:44.174442_atom_2_Training '
                         '- fixed.log',
             Parameters=None)

Trials and Events tables

log_eg.parse_filename_components_cog_bias().LogFileName
'Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log'
@dataclass
class TrialCB:
    TrialID: int
    ExperimentID: int
    TrialNumber: int
    ResponseType: str
    TrialSubType: str
    CriterionType: Optional[str]
    CriterionCount: Optional[int]
    TrialStartTime: datetime
    TrialEndTime: datetime
    # Direction and SessionType fields removed (now held on ExperimentCB)

@dataclass
class ResponseCB:
    ResponseID: int
    TrialID: int
    ResponseTime: Optional[float]
    EventDateTime: datetime

@dataclass
class EventCB:
    EventID: int
    TrialID: int
    EventType: str
    EventTime: datetime
    ElapsedTime: float

e.g. parsing of the trials and events info from the example log file:

log_eg.parse_comments()
'First session'
experiment_df
Columns: ExperimentID, Cohort, SubjectName, SubjectNumber, SessionNumber, SessionType, ExperimentType, Direction, Comment, DateTime, LogFileName, Parameters
(interactive ITables view not rendered here)

trials_df
Columns: TrialID, ExperimentID, TrialNumber, ResponseType, TrialSubType, CriterionType, CriterionCount, TrialStartTime, TrialEndTime
(interactive ITables view not rendered here)

responses_df
Columns: ResponseID, TrialID, ResponseTime, EventDateTime
(interactive ITables view not rendered here)

events_df
Columns: EventID, TrialID, EventType, EventTime, ElapsedTime
(interactive ITables view not rendered here)
INFO     | Example of parsing of the Trial/Event info for one sample experiment. 19 rows exported to CSV.
INFO     | Example of parsing of the Trial/Event info for one sample experiment. 71 total rows exported to XLSX across 4 sheets.
INFO     | Additional log files excluded as bad data. 0 rows exported to CSV.
INFO     | List of logfiles to be loaded to database. 183 rows exported to CSV.

Bringing this all together and loading the data into the DuckDB database

def parse_and_load_logfiles_to_database(
    logfiles_to_process: List[str], con, subject_df: pd.DataFrame, debug=False
):
    """Parse each log file and insert the Experiment, Trial, Response and
    Event records into the database, returning the total row counts and a
    list of any skipped log files."""
    skipped_logfiles = []
    experiment_id = 0
    trial_id = 0
    response_id = 0
    event_id = 0

    for log_filename in sorted(logfiles_to_process):
        logger.info(f"Processing: {experiment_id+1} - {logs.path}/{log_filename}")
        logfile = Logfile(f"{logs.path}/{log_filename}")

        try:
            exp_details = logfile.parse_filename_components_cog_bias()
        except ValueError as e:
            logger.warning(f"Skipping {log_filename}: {str(e)}")
            skipped_logfiles.append(log_filename)
            continue

        # Look up additional Experiment details from subject_df
        if exp_details.SubjectName:
            subject_row = subject_df[subject_df["subject_name"] == exp_details.SubjectName]
            if not subject_row.empty:
                exp_details.Cohort = subject_row["Cohort"].iloc[0]
                exp_details.SubjectNumber = int(subject_row["subject_number"].iloc[0])
                exp_details.Direction = subject_row["direction"].iloc[0]
            else:
                logger.warning(
                    f"Subject {exp_details.SubjectName} not found in subject_df. Skipping this file."
                )
                skipped_logfiles.append(log_filename)
                continue
        else:
            logger.warning(f"Unable to determine SubjectName for {log_filename}. Skipping this file.")
            skipped_logfiles.append(log_filename)
            continue

        exp_details.Parameters = logfile.parse_parameters()

        _, trials_df, responses_df, events_df = logfile.parse_trials_and_events_cog_bias()
        if trials_df is None or responses_df is None or events_df is None:
            logger.warning(f"Skipping {log_filename} due to parsing error")
            skipped_logfiles.append(log_filename)
            continue

        # Insert experiment record
        experiment_id += 1
        exp_details.ExperimentID = experiment_id
        logfile.insert_record_to_database(con, exp_details)
        last_experiment_id = logfile.get_last_insert_id(con, "ExperimentCBs")
        # assert experiment_id == last_experiment_id, f"Experiment ID mismatch: {experiment_id} != {last_experiment_id}"

        first_trial_id = trial_id
        first_response_id = response_id
        first_event_id = event_id

        # Process and insert Trials
        for i_trial, trial in trials_df.iterrows():
            trial_id += 1
            local_trial_id = trial["TrialID"]
            trial["TrialID"] = trial_id
            trial["ExperimentID"] = exp_details.ExperimentID
            logfile.insert_record_to_cb_database(con, trial.to_dict(), "TrialCBs")
            # last_trial_id = logfile.get_last_insert_id(con, "TrialCBs")
            # assert trial_id == last_trial_id, f"Trial ID mismatch: {trial_id} != {last_trial_id}"

            # Process and insert Responses
            for i_response, response in responses_df[
                responses_df["TrialID"] == local_trial_id
            ].iterrows():
                response_id += 1
                response["ResponseID"] = response_id
                response["TrialID"] = trial_id
                logfile.insert_record_to_cb_database(con, response.to_dict(), "ResponseCBs")
                last_response_id = logfile.get_last_insert_id(con, "ResponseCBs")
                # assert response_id == last_response_id, f"Response ID mismatch: {response_id} != {last_response_id}"

            # Process and insert Events for this trial
            for i_event, event in events_df[events_df["TrialID"] == local_trial_id].iterrows():
                event_id += 1
                event["EventID"] = event_id
                event["TrialID"] = trial_id
                logfile.insert_record_to_cb_database(con, event.to_dict(), "EventCBs")
                last_event_id = logfile.get_last_insert_id(con, "EventCBs")
                # assert event_id == last_event_id, f"Event ID mismatch: {event_id} != {last_event_id}"

        if debug:
            logger.info(f"IDs for Experiment {experiment_id}:")
            logger.info(f"  First Trial ID: {first_trial_id}")
            logger.info(f"  First Response ID: {first_response_id}")
            logger.info(f"  First Event ID: {first_event_id}")
            logger.info(f"  Last Trial ID: {trial_id}")
            logger.info(f"  Last Response ID: {response_id}")
            logger.info(f"  Last Event ID: {event_id}")

    total_experiments = logfile.get_last_insert_id(con, "ExperimentCBs")
    total_trials = logfile.get_last_insert_id(con, "TrialCBs")
    total_responses = logfile.get_last_insert_id(con, "ResponseCBs")
    total_events = logfile.get_last_insert_id(con, "EventCBs")

    return total_experiments, total_trials, total_responses, total_events, skipped_logfiles
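The counts below come from invoking this function along the following lines (a sketch; the variable names logfiles_to_process, con and subject_df are from the cells above):

n_experiments, n_trials, n_responses, n_events, skipped = parse_and_load_logfiles_to_database(
    logfiles_to_process, con, subject_df
)
print(f"Number of Experiments: {n_experiments}")
print(f"Number of Trials: {n_trials}")
print(f"Number of Responses: {n_responses}")
print(f"Number of Events: {n_events}")
print(f"Number of Skipped files: {len(skipped)}")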
Number of Experiments: 183
Number of Trials: 3171
Number of Responses: 1735
Number of Events: 6161
Number of Skipped files: 0
INFO     | Exported database tables to: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks/results/CB/export_tables/all_tables.xlsx
# Cross-check queries

con.sql("SHOW TABLES;")
┌───────────────┐
│     name      │
│    varchar    │
├───────────────┤
│ EventCBs      │
│ ExperimentCBs │
│ ResponseCBs   │
│ TrialCBs      │
└───────────────┘
con.sql("SELECT COUNT(*) FROM ExperimentCBs")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│          183 │
└──────────────┘
con.sql("SELECT COUNT(*) FROM TrialCBs")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         3171 │
└──────────────┘
con.sql("SELECT COUNT(*) FROM ResponseCBs")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         1735 │
└──────────────┘
con.sql("SELECT COUNT(*) FROM EventCBs")
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         6161 │
└──────────────┘
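Beyond raw counts, a quick referential-integrity sketch (an anti-join, not part of the original cross-checks) can confirm that no child rows are orphaned, e.g. for trials:

con.sql("""
    SELECT COUNT(*) AS orphan_trials
    FROM TrialCBs t
    LEFT JOIN ExperimentCBs e ON t.ExperimentID = e.ExperimentID
    WHERE e.ExperimentID IS NULL
""")  # expect 0, since ExperimentID is assigned during loading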
con.sql("SELECT DISTINCT ExperimentType FROM ExperimentCBs ORDER BY ExperimentType")
┌────────────────────────────┐
│       ExperimentType       │
│          varchar           │
├────────────────────────────┤
│ Test Type 1                │
│ Test Type 2                │
│ Test Type 3                │
│ Training - fixed           │
│ Training randomised Type 1 │
│ Training randomised Type 2 │
└────────────────────────────┘
con.sql("SELECT DISTINCT SubjectName FROM ExperimentCBs ORDER BY SubjectName")
┌─────────────┐
│ SubjectName │
│   varchar   │
├─────────────┤
│ apollo      │
│ ash         │
│ atom        │
│ bonnie      │
│ clover      │
│ dodge       │
│ dougie      │
│ dusty       │
│ filly       │
│ freya       │
│ george      │
│ gio         │
│ jelly       │
│ molly       │
│ mowgli      │
│ myrtle      │
│ nix         │
│ pumba       │
│ smudge      │
│ teddy       │
│ yoshi       │
├─────────────┤
│   21 rows   │
└─────────────┘
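Note that only 21 of the 22 loaded subjects appear above: olive has no rows in ExperimentCBs. A quick sketch to confirm which subjects are missing (assuming the subject_df loaded earlier, with its subject_name column):

db_subjects = set(
    con.sql("SELECT DISTINCT SubjectName FROM ExperimentCBs").df()["SubjectName"]
)
missing = sorted(set(subject_df["subject_name"]) - db_subjects)
print(missing)  # expected: ['olive']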
# Sample of 100 rows from the TrialCBs table

con.sql("SELECT * FROM TrialCBs LIMIT 100").df()
Columns: TrialID, ExperimentID, TrialNumber, ResponseType, TrialSubType, CriterionType, CriterionCount, TrialStartTime, TrialEndTime
(interactive ITables view not rendered here)
INFO     | Database connection closed