Automation with doit#
This repository uses doit to manage data pulls, documentation builds, and other repeatable tasks. Think of it as a Python-friendly build system that handles dependencies automatically.
Installation#
Create and activate the workshop environment, then install the packages:
conda create -n finm python=3.12
conda activate finm
pip install -r requirements.txt
This installs doit alongside the other packages used in the workshop within an isolated Python 3.12 environment.
Primary Usage: Run Everything#
doit
This single command runs all tasks in the proper order based on their dependencies and targets. doit automatically:
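Determines which tasks need to run by checking whether their file dependencies have changed (doit compares MD5 checksums by default and can be configured to use timestamps instead)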
Executes tasks in the correct order
Skips tasks whose targets are already up-to-date
Handles the entire pipeline: data pulls → processing → documentation build
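To make the up-to-date check concrete, here is a minimal sketch of how a task declares its inputs and outputs. The file names and the copy_file helper are hypothetical and are not taken from this repository's dodo.py; only the file_dep, targets, and actions keys are doit's own.

```python
# dodo.py (illustrative sketch; file names and the copy action are hypothetical)
import shutil


def copy_file(src, dst):
    """Stand-in action: copy src to dst."""
    shutil.copyfile(src, dst)


def task_make_excerpt():
    """Rebuild the excerpt only when its input file has changed."""
    return {
        "file_dep": ["_data/raw.csv"],      # hypothetical input file
        "targets": ["_data/excerpt.csv"],   # hypothetical output file
        "actions": [(copy_file, ["_data/raw.csv", "_data/excerpt.csv"])],
    }
```

With a task like this, doit should skip the action on later runs as long as the input is unchanged and the target still exists.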
Parallel Execution#
Speed up the build with parallel task execution:
doit -n 4 # Run up to 4 tasks in parallel
This is especially useful when you have multiple independent data processing or documentation tasks.
What Gets Built#
Running doit executes the default tasks defined in dodo.py:
Data Pipeline: Pulls CRSP data from WRDS (or generates synthetic data as fallback), then creates excerpts
Documentation: Builds and publishes the Sphinx documentation
Output artifacts:
_data/crsp_streamlit_excerpt.csv: Data for Streamlit apps
_data/crsp_streamlit_excerpt.parquet: Efficient data format
_data/crsp_data_metadata.json: Data pipeline metadata
docs/: Published documentation site
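A quick way to confirm that a run produced the artifacts listed above is a small standard-library check. This is a sketch only: the paths are copied from the list, while the missing_artifacts helper is a hypothetical name and not part of the repository.

```python
from pathlib import Path

# Paths copied from the artifact list; the helper below is hypothetical.
EXPECTED_ARTIFACTS = [
    "_data/crsp_streamlit_excerpt.csv",
    "_data/crsp_streamlit_excerpt.parquet",
    "_data/crsp_data_metadata.json",
    "docs",
]


def missing_artifacts(root="."):
    """Return the expected artifacts that do not exist under root."""
    base = Path(root)
    return [p for p in EXPECTED_ARTIFACTS if not (base / p).exists()]


if __name__ == "__main__":
    missing = missing_artifacts()
    print("All artifacts present" if not missing else f"Missing: {missing}")
```

Run it from the repository root after doit finishes. An empty result means every listed artifact exists.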
Rebuilding from Scratch#
doit clean # Remove all generated files
doit # Rebuild everything
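doit clean only removes what each task has opted into cleaning. As a minimal sketch of that opt-in (the task and file names here are hypothetical and not taken from this repository), setting "clean": True asks doit to delete the task's targets:

```python
# dodo.py (illustrative sketch; task and file names are hypothetical)
from pathlib import Path


def write_report(targets):
    """Write a placeholder file to the task's first target."""
    Path(targets[0]).write_text("report\n")


def task_report():
    return {
        "actions": [write_report],   # doit passes the task's targets to the action
        "targets": ["report.txt"],
        "clean": True,               # lets `doit clean` delete report.txt
    }
```

Tasks without a clean entry are left untouched by doit clean, so whether "all generated files" are removed depends on how dodo.py declares each task.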
Advanced Usage: Individual Tasks#
While doit handles everything automatically, you can inspect or run individual tasks when needed:
doit list # Show all available tasks
Key tasks include:
pull_crsp_data: WRDS data pull and processing
show_crsp_excerpt_info: Display data summary
build_docs: Generate Sphinx documentation
publish_docs: Copy docs for GitHub Pages
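For orientation, tasks like these could be chained in dodo.py roughly as follows. This is a sketch under assumptions: the task names come from the list above, but the echo actions, the dependency chain, and the choice of default task are placeholders, not the repository's actual definitions.

```python
# dodo.py (illustrative sketch; actions, dependencies, and defaults are assumptions)

# Tasks that run when `doit` is called with no arguments.
DOIT_CONFIG = {"default_tasks": ["publish_docs"]}


def task_pull_crsp_data():
    return {"actions": ["echo pulling data"]}


def task_build_docs():
    # task_dep makes doit run pull_crsp_data before this task.
    return {"actions": ["echo building docs"], "task_dep": ["pull_crsp_data"]}


def task_publish_docs():
    # task_dep makes doit run build_docs before this task.
    return {"actions": ["echo publishing docs"], "task_dep": ["build_docs"]}
```

Under these assumptions, a bare doit would run pull_crsp_data, then build_docs, then publish_docs, because each task names its predecessor in task_dep.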
Run a specific task:
doit pull_crsp_data # Run just the CRSP data pipeline
Force a task to rerun:
doit forget pull_crsp_data # Clear task status
doit pull_crsp_data # Run again
Tips#
After running doit, restart your Streamlit apps to see fresh data
Use doit -v 2 for verbose output during debugging (the -n option only sets the number of parallel processes)
The dodo.py file defines all task dependencies and build rules
By using doit as your primary command, you ensure all components stay synchronized, which is ideal for rapid iteration during the workshop.