INFO | Data directory: /Users/mjboothaus/code/github/databooth/horse-logic/data
INFO | Data directory purpose: Parent directory for raw and processed data
INFO | Sql directory: /Users/mjboothaus/code/github/databooth/horse-logic/sql
INFO | Sql directory purpose: Store SQL scripts
INFO | Output directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks/results/CB
INFO | Output directory purpose: Store output files and results by experiment type
INFO | Logfiles directory: /Users/mjboothaus/code/github/databooth/horse-logic/data/results/zips/cb_data
INFO | Logfiles directory purpose: Storage for the raw log files
INFO | Notebooks directory: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks
INFO | Notebooks directory purpose: Jupyter notebooks for performing analysis
INFO | Database file path: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO | Database purpose: Main project databases (outputs) by experiment type
INFO | Project initialised (CB): config defined in project_config.yaml
INFO | Connected to database: /Users/mjboothaus/code/github/databooth/horse-logic/data/Experiments_CB_2023_Q4.ddb
INFO | DuckDB version: v1.1.2
INFO | Tables created successfully from /Users/mjboothaus/code/github/databooth/horse-logic/sql/create_cb_experiment_tables_ddb.sql
Purpose of This Notebook
This notebook serves as an exploratory tool for examining the log files produced during the horse behavioural experiments conducted in October and November 2023. It facilitates experimentation with text parsing techniques on the files before they are imported into a database. The primary objective is:
To conduct experiments with regular expressions (regex) aimed at extracting pertinent data and fields from the log files, as sketched below.
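For example, each log filename encodes the experiment timestamp, subject name, session number and experiment type. A minimal regex sketch for pulling these components apart (the pattern and group names here are illustrative, not the pattern used in logfiles.py):

```python
import re
from datetime import datetime

# Illustrative pattern only -- the actual parsing lives in logfiles.py.
FILENAME_PATTERN = re.compile(
    r"Experiment_(?P<timestamp>[\d\-T:.]+)_(?P<subject>[a-z]+)_(?P<session>\d+)_(?P<experiment_type>.+)\.log"
)

m = FILENAME_PATTERN.match("Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log")
if m:
    print(datetime.fromisoformat(m["timestamp"]))  # 2023-10-06 12:49:44.174442
    print(m["subject"], m["session"], m["experiment_type"])  # atom 2 Training - fixed
```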
Setup project & directories
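The directory and database locations logged above come from project_config.yaml (per the initialisation message). A minimal sketch of how such a config might be loaded; the key names here are illustrative assumptions, not taken from the actual file:

```python
from pathlib import Path

import yaml

# Sketch only: load project_config.yaml and resolve the directories logged above.
# The key names (data_dir, sql_dir, output_dir, logfiles_dir, database_file)
# are assumptions for illustration.
with open("project_config.yaml") as f:
    config = yaml.safe_load(f)

DATA_DIR = Path(config["data_dir"])
SQL_DIR = Path(config["sql_dir"])
OUTPUT_DIR = Path(config["output_dir"])
LOGFILES_DIR = Path(config["logfiles_dir"])
DATABASE_FILE = DATA_DIR / config["database_file"]
```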
### Get Subject info
INFO | Loaded subject info from: /Users/mjboothaus/code/github/databooth/horse-logic/docs/from_CH/Cohort data for MB.xlsx
INFO | Subject count: 22
INFO | Sorted subject names:
apollo, ash, atom, bonnie, clover, dodge, dougie, dusty, filly, freya, george, gio, jelly, molly, mowgli, myrtle,
nix, olive, pumba, smudge, teddy, yoshi
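A rough sketch of the subject-info load above (the spreadsheet path comes from the INFO message; DOCS_DIR, the sheet layout and the column names are assumptions based on how subject_df is used in the loader function below):

```python
import pandas as pd
from pathlib import Path

# Sketch only: read the cohort spreadsheet into subject_df.
# DOCS_DIR is a hypothetical path variable; the column names (subject_name,
# subject_number, Cohort, direction) follow their use in the loader function below.
DOCS_DIR = Path("/Users/mjboothaus/code/github/databooth/horse-logic/docs")  # assumed
subject_df = pd.read_excel(DOCS_DIR / "from_CH" / "Cohort data for MB.xlsx")

print(f"Subject count: {len(subject_df)}")
print(", ".join(sorted(subject_df["subject_name"])))
```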
Load logfiles from CSV list of included files
As determined in logfiles-reconciliation-CB.ipynb
```python
df = pd.read_csv(OUTPUT_DIR / "all_included_files.csv")
print(df.head())
print(df.columns)
print(df.index)
```
   0  Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log
0  1  Experiment_2023-10-06T13:21:53.770221_ash_2_Tr...
1  2  Experiment_2023-10-06T14:14:41.653330_mowgli_1...
2  3  Experiment_2023-10-06T14:48:40.419276_teddy_1_...
3  4  Experiment_2023-10-06T15:37:08.619329_dodge_1_...
4  5  Experiment_2023-10-06T16:43:49.516057_filly_4_...
Index(['0', 'Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log'], dtype='object')
RangeIndex(start=0, stop=182, step=1)
Note that all_included_files.csv was written without a header row, so pandas reads the first filename as a column name: the 182 data rows plus that header filename account for the 183 log files reported below.
INFO | Logs: 183 log files in /Users/mjboothaus/code/github/databooth/horse-logic/data/results/zips/cb_data
Explore regex parsing of a single example log file
Analysis code is in logfiles.py. Explore parsing of the log information for the specific example defined below.
Example log file for regex parsing: Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log
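A minimal sketch of how this example file might be opened and parsed with the Logfile class from logfiles.py (logs.path, the Logfile class and the parse methods appear later in this notebook; the example_logfile variable name is illustrative):

```python
# Sketch: open the example log file and parse its filename components.
example_logfile = "Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log"
log_eg = Logfile(f"{logs.path}/{example_logfile}")

exp_details = log_eg.parse_filename_components_cog_bias()
print(exp_details.SubjectName, exp_details.SessionNumber, exp_details.ExperimentType)
```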
Data structures
These are created as dataclasses in Python (in logfiles.py) and as TABLES in SQL (sql/create_experiment_tables_ddb.sql).
Experiment table
```python
@dataclass
class ExperimentCB:
    ExperimentID: int
    Cohort: str
    SubjectName: str
    SubjectNumber: int
    SessionNumber: int
    SessionType: str  # "training" or "test"
    ExperimentType: str
    Direction: str  # New field
    Comment: Optional[str]
    DateTime: datetime
    LogFileName: str
    Parameters: Dict
```
e.g. Parse experiment details and comment from log file
ExperimentCB(ExperimentID=None,
Cohort=None,
SubjectName='atom',
SubjectNumber=None,
SessionNumber=2,
SessionType='test',
ExperimentType='Training - fixed',
Direction=None,
Comment='First session',
DateTime=datetime.datetime(2023, 10, 6, 12, 49, 44, 174442),
LogFileName='Experiment_2023-10-06T12:49:44.174442_atom_2_Training '
'- fixed.log',
Parameters=None)
Trials and Events tables
log_eg.parse_filename_components_cog_bias().LogFileName
'Experiment_2023-10-06T12:49:44.174442_atom_2_Training - fixed.log'
```python
@dataclass
class TrialCB:
    TrialID: int
    ExperimentID: int
    TrialNumber: int
    ResponseType: str
    TrialSubType: str
    CriterionType: Optional[str]
    CriterionCount: Optional[int]
    TrialStartTime: datetime
    TrialEndTime: datetime
    # Direction and SessionType fields removed


@dataclass
class ResponseCB:
    ResponseID: int
    TrialID: int
    ResponseTime: Optional[float]
    EventDateTime: datetime


@dataclass
class EventCB:
    EventID: int
    TrialID: int
    EventType: str
    EventTime: datetime
    ElapsedTime: float
```
e.g. Parsing of trials and events info from an example log file
log_eg.parse_comments()
'First session'
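The trials, responses and events dataframes shown below come from the trial/event parser. A minimal sketch of that call, mirroring how it is unpacked in the loader function later in this notebook (the first return value is discarded there, so its contents are not assumed here):

```python
# Sketch: parse trials, responses and events for the example log file.
# The unpacking mirrors its use in parse_and_load_logfiles_to_database() below.
_, trials_df, responses_df, events_df = log_eg.parse_trials_and_events_cog_bias()

print(len(trials_df), len(responses_df), len(events_df))
```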
experiment_df
ExperimentID | Cohort | SubjectName | SubjectNumber | SessionNumber | SessionType | ExperimentType | Direction | Comment | DateTime | LogFileName | Parameters |
---|---|---|---|---|---|---|---|---|---|---|---|
trials_df
TrialID | ExperimentID | TrialNumber | ResponseType | TrialSubType | CriterionType | CriterionCount | TrialStartTime | TrialEndTime |
---|---|---|---|---|---|---|---|---|
responses_df
ResponseID | TrialID | ResponseTime | EventDateTime |
---|---|---|---|
events_df
EventID | TrialID | EventType | EventTime | ElapsedTime |
---|---|---|---|---|
INFO | Example of parsing of the Trial/Event info for one sample experiment. 19 rows exported to CSV.
INFO | Example of parsing of the Trial/Event info for one sample experiment. 71 total rows exported to XLSX across 4 sheets.
INFO | Additional log files excluded as bad data. 0 rows exported to CSV.
INFO | List of logfiles to be loaded to database. 183 rows exported to CSV.
Bringing this all together and putting the data into the DuckDB database
```python
def parse_and_load_logfiles_to_database(
    logfiles_to_process: List[str], con, subject_df: pd.DataFrame, debug=False
):
    skipped_logfiles = []
    experiment_id = 0
    trial_id = 0
    response_id = 0
    event_id = 0

    for log_filename in sorted(logfiles_to_process):
        logger.info(f"Processing: {experiment_id+1} - {logs.path}/{log_filename}")
        logfile = Logfile(f"{logs.path}/{log_filename}")

        try:
            exp_details = logfile.parse_filename_components_cog_bias()
        except ValueError as e:
            logger.warning(f"Skipping {log_filename}: {str(e)}")
            skipped_logfiles.append(log_filename)
            continue

        # Look up additional Experiment details from subject_df
        if exp_details.SubjectName:
            subject_row = subject_df[subject_df["subject_name"] == exp_details.SubjectName]
            if not subject_row.empty:
                exp_details.Cohort = subject_row["Cohort"].iloc[0]
                exp_details.SubjectNumber = int(subject_row["subject_number"].iloc[0])
                exp_details.Direction = subject_row["direction"].iloc[0]
            else:
                logger.warning(
                    f"Subject {exp_details.SubjectName} not found in subject_df. Skipping this file."
                )
                skipped_logfiles.append(log_filename)
                continue
        else:
            logger.warning(f"Unable to determine SubjectName for {log_filename}. Skipping this file.")
            skipped_logfiles.append(log_filename)
            continue

        exp_details.Parameters = logfile.parse_parameters()

        _, trials_df, responses_df, events_df = logfile.parse_trials_and_events_cog_bias()
        if trials_df is None or responses_df is None or events_df is None:
            logger.warning(f"Skipping {log_filename} due to parsing error")
            skipped_logfiles.append(log_filename)
            continue

        # Insert experiment record
        experiment_id += 1
        exp_details.ExperimentID = experiment_id
        logfile.insert_record_to_database(con, exp_details)
        last_experiment_id = logfile.get_last_insert_id(con, "ExperimentCBs")
        # assert experiment_id == last_experiment_id, f"Experiment ID mismatch: {experiment_id} != {last_experiment_id}"

        first_trial_id = trial_id
        first_response_id = response_id
        first_event_id = event_id

        # Process and insert Trials
        for i_trial, trial in trials_df.iterrows():
            trial_id += 1
            local_trial_id = trial["TrialID"]
            trial["TrialID"] = trial_id
            trial["ExperimentID"] = exp_details.ExperimentID
            logfile.insert_record_to_cb_database(con, trial.to_dict(), "TrialCBs")
            # last_trial_id = logfile.get_last_insert_id(con, "TrialCBs")
            # assert trial_id == last_trial_id, f"Trial ID mismatch: {trial_id} != {last_trial_id}"

            # Process and insert Responses
            for i_response, response in responses_df[
                responses_df["TrialID"] == local_trial_id
            ].iterrows():
                response_id += 1
                response["ResponseID"] = response_id
                response["TrialID"] = trial_id
                logfile.insert_record_to_cb_database(con, response.to_dict(), "ResponseCBs")
                last_response_id = logfile.get_last_insert_id(con, "ResponseCBs")
                # assert response_id == last_response_id, f"Response ID mismatch: {response_id} != {last_response_id}"

            # Process and insert Events for this trial
            for i_event, event in events_df[events_df["TrialID"] == local_trial_id].iterrows():
                event_id += 1
                event["EventID"] = event_id
                event["TrialID"] = trial_id
                logfile.insert_record_to_cb_database(con, event.to_dict(), "EventCBs")
                last_event_id = logfile.get_last_insert_id(con, "EventCBs")
                # assert event_id == last_event_id, f"Event ID mismatch: {event_id} != {last_event_id}"

        if debug:
            logger.info(f"IDs for Experiment {experiment_id}:")
            logger.info(f"  First Trial ID: {first_trial_id}")
            logger.info(f"  First Response ID: {first_response_id}")
            logger.info(f"  First Event ID: {first_event_id}")
            logger.info(f"  Last Trial ID: {trial_id}")
            logger.info(f"  Last Response ID: {response_id}")
            logger.info(f"  Last Event ID: {event_id}")

    total_experiments = logfile.get_last_insert_id(con, "ExperimentCBs")
    total_trials = logfile.get_last_insert_id(con, "TrialCBs")
    total_responses = logfile.get_last_insert_id(con, "ResponseCBs")
    total_events = logfile.get_last_insert_id(con, "EventCBs")

    return total_experiments, total_trials, total_responses, total_events, skipped_logfiles
```
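A sketch of how this function might be called for the included log files; the counts that follow are the actual output. Here, logfiles_to_process is assumed to be the list of 183 included filenames exported above as the "List of logfiles to be loaded to database":

```python
# Sketch: load the included log files into DuckDB and report the totals.
(n_experiments, n_trials, n_responses,
 n_events, skipped) = parse_and_load_logfiles_to_database(
    logfiles_to_process, con, subject_df, debug=False
)

print(f"Number of Experiments: {n_experiments}")
print(f"Number of Trials: {n_trials}")
print(f"Number of Responses: {n_responses}")
print(f"Number of Events: {n_events}")
print(f"Number of Skipped files: {len(skipped)}")
```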
Number of Experiments: 183
Number of Trials: 3171
Number of Responses: 1735
Number of Events: 6161
Number of Skipped files: 0
INFO | Exported database tables to: /Users/mjboothaus/code/github/databooth/horse-logic/notebooks/results/CB/export_tables/all_tables.xlsx
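A rough sketch of the kind of export that could produce the all_tables.xlsx workbook above, one sheet per table; only the output path is taken from the log message, the ExcelWriter usage is an assumption:

```python
# Sketch: export each DuckDB table to one sheet of an Excel workbook.
export_path = OUTPUT_DIR / "export_tables" / "all_tables.xlsx"
export_path.parent.mkdir(parents=True, exist_ok=True)

with pd.ExcelWriter(export_path) as writer:
    for table in ["ExperimentCBs", "TrialCBs", "ResponseCBs", "EventCBs"]:
        con.sql(f"SELECT * FROM {table}").df().to_excel(writer, sheet_name=table, index=False)
```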
# Cross-check queries
"SHOW TABLES;") con.sql(
┌───────────────┐
│ name │
│ varchar │
├───────────────┤
│ EventCBs │
│ ExperimentCBs │
│ ResponseCBs │
│ TrialCBs │
└───────────────┘
"SELECT COUNT(*) FROM ExperimentCBs") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 183 │
└──────────────┘
"SELECT COUNT(*) FROM TrialCBs") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 3171 │
└──────────────┘
"SELECT COUNT(*) FROM ResponseCBs") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 1735 │
└──────────────┘
"SELECT COUNT(*) FROM EventCBs") con.sql(
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 6161 │
└──────────────┘
"SELECT DISTINCT ExperimentType FROM ExperimentCBs ORDER BY ExperimentType") con.sql(
┌────────────────────────────┐
│ ExperimentType │
│ varchar │
├────────────────────────────┤
│ Test Type 1 │
│ Test Type 2 │
│ Test Type 3 │
│ Training - fixed │
│ Training randomised Type 1 │
│ Training randomised Type 2 │
└────────────────────────────┘
"SELECT DISTINCT SubjectName FROM ExperimentCBs ORDER BY SubjectName") con.sql(
┌─────────────┐
│ SubjectName │
│ varchar │
├─────────────┤
│ apollo │
│ ash │
│ atom │
│ bonnie │
│ clover │
│ dodge │
│ dougie │
│ dusty │
│ filly │
│ freya │
│ george │
│ gio │
│ jelly │
│ molly │
│ mowgli │
│ myrtle │
│ nix │
│ pumba │
│ smudge │
│ teddy │
│ yoshi │
├─────────────┤
│ 21 rows │
└─────────────┘
# Sample of 100 rows from the TrialCBs table
"SELECT * FROM TrialCBs LIMIT 100").df() con.sql(
TrialID | ExperimentID | TrialNumber | ResponseType | TrialSubType | CriterionType | CriterionCount | TrialStartTime | TrialEndTime |
---|---|---|---|---|---|---|---|---|
INFO | Database connection closed