misc_tools#

A collection of miscellaneous tools useful in a variety of situations (not specific to the current project).

Module Contents#

Functions#

merge_stats

Provide statistics to assess the completeness of the merge.

dataframe_set_difference

Gives the rows that appear in dff but not in df

freq_counts

Like value_counts, but normalized to give frequencies. This is a Polars function; df is a Polars DataFrame.

move_column_inplace

https://stackoverflow.com/a/58686641

move_columns_to_front

Move a list of columns cols so that they appear first

weighted_average

Simple calculation of weighted average.

groupby_weighted_average

Faster method for calculating grouped weighted average.

groupby_weighted_std

Method for calculating grouped weighted standard deviation.

weighted_quantile

Very close to numpy.percentile, but supports weights.

groupby_weighted_quantile

This can already be accomplished with weighted_quantile, as demonstrated above. This function is for convenience.

load_date_mapping

calc_check_digit

Calculate the check digit for an 8-digit CUSIP. This function is taken from arthurdejong/python-stdnum.

convert_cusips_from_8_to_9_digit

_with_lagged_column_no_resample

This can be easily accomplished with the shift method.

with_lagged_columns

Add lagged columns to a dataframe, respecting frequency of the data.

leave_one_out_sums

Compute leave-one-out sums.

get_most_recent_quarter_end

Take a datetime and find the most recent quarter end date

get_next_quarter_start

Take a datetime and find the start date of the next quarter

get_end_of_current_month

Take a datetime and find the last date of the current month and also reset time to zero.

get_end_of_current_quarter

Take a datetime and find the last date of the current quarter and also reset time to zero.

add_vertical_lines_to_plot

plot_weighted_median_with_distribution_bars

Plot the weighted median of a variable over time. Optionally, plot the 25th and 75th percentiles

_demo

Data#

API#

misc_tools.merge_stats(df_left, df_right, on=[])#

Provide statistics to assess the completeness of the merge.

To assess the completeness of the merge, this function counts the number of unique elements in the indices of the left and right dataframes. (First, ensure that the indices of left and right are sets of unique elements.) It produces the following statistics:

- 'union': number of elements in the union of the indices
- 'intersection': number of elements in the intersection of the indices
- 'union-intersection': difference between union and intersection (unmatched symmetric)
- 'intersection/union': percentage of symmetric matches
- 'left': number of elements in the left index
- 'right': number of elements in the right index
- 'left-intersection': number of excess elements in the left index
- 'right-intersection': number of excess elements in the right index
- 'intersection/left': percentage matched relative to the total in the left index
- 'intersection/right': percentage matched relative to the total in the right index
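For illustration, a minimal usage sketch (assuming the function compares the dataframes' indices and returns the statistics above, e.g. as a pandas Series):

```
import pandas as pd

df_left = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])
df_right = pd.DataFrame({"y": [4, 5, 6]}, index=["b", "c", "d"])

stats = merge_stats(df_left, df_right)
# With these inputs, 'union' would be 4, 'intersection' 2, and
# 'intersection/union' 0.5 (or 50%, depending on normalization).
```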

misc_tools.dataframe_set_difference(dff, df, library='pandas', show='rows_and_numbers')#

Gives the rows that appear in dff but not in df

Example

rows = dataframe_set_difference(dff, df)

misc_tools.freq_counts(df, col=None, with_count=True, with_cum_freq=True)#

Like value_counts, but normalized to give frequencies. This is a Polars function; df is a Polars DataFrame.

Example

df.filter(
    (pl.col("fdate") > pl.datetime(2020,1,1)) &
    (pl.col("bus_dt") == pl.col("fdate")) 
).pipe(freq_counts, col="bus_tenor_bin")
misc_tools.move_column_inplace(df, col, pos=0)#

https://stackoverflow.com/a/58686641

Use pos=0 to move the column to the front.
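A usage sketch (assuming, per the name, that df is modified in place):

```
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
move_column_inplace(df, "c", pos=0)
# Expected column order afterwards: ['c', 'a', 'b']
```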

misc_tools.move_columns_to_front(df, cols=[])#

Move a list of columns cols so that they appear first
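A usage sketch (assuming in-place modification, as with move_column_inplace):

```
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2], "c": [3], "d": [4]})
move_columns_to_front(df, cols=["c", "d"])
# Expected column order afterwards: ['c', 'd', 'a', 'b']
```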

misc_tools.weighted_average(data_col=None, weight_col=None, data=None)#

Simple calculation of weighted average.

Examples

>>> df_nccb = pd.DataFrame({
...     'rate': [2, 3, 2],
...     'start_leg_amount': [100, 200, 100]},
... )
>>> weighted_average(data_col='rate', weight_col='start_leg_amount', data=df_nccb)
2.5

misc_tools.groupby_weighted_average(data_col=None, weight_col=None, by_col=None, data=None, transform=False, new_column_name='')#

Faster method for calculating grouped weighted average.

From: https://stackoverflow.com/a/44683506

Examples

>>> df_nccb = pd.DataFrame({
...     'trade_direction': ['RECEIVED', 'RECEIVED', 'DELIVERED'],
...     'rate': [2, 3, 2],
...     'start_leg_amount': [100, 200, 100]},
... )
>>> groupby_weighted_average(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction')
trade_direction
DELIVERED   2.00
RECEIVED    2.67
dtype: float64

misc_tools.groupby_weighted_std(data_col=None, weight_col=None, by_col=None, data=None, ddof=1)#

Method for calculating grouped weighted standard deviation.

From: https://stackoverflow.com/a/72915123

Examples

>>> df_nccb = pd.DataFrame({
...     'trade_direction': ['RECEIVED', 'RECEIVED', 'RECEIVED', 'RECEIVED',
...         'DELIVERED', 'DELIVERED', 'DELIVERED', 'DELIVERED'],
...     'rate': [2, 2, 2, 3, 2, 2, 2, 3],
...     'start_leg_amount': [300, 300, 300, 0, 200, 200, 200, 200]},
... )
>>> groupby_weighted_std(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction', ddof=1)
trade_direction
DELIVERED   0.50
RECEIVED    0.00
dtype: float64
>>> np.std([2,2,2,3], ddof=1)
0.5
>>> np.std([2,2,2], ddof=1)
0.0
>>> groupby_weighted_std(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction', ddof=0)
trade_direction
DELIVERED   0.43
RECEIVED    0.00
dtype: float64
>>> np.std([2,2,2,3])
0.4330127018922193
>>> np.std([2,2,2])
0.0

misc_tools.weighted_quantile(values, quantiles, sample_weight=None, values_sorted=False, old_style=False)#

Very close to numpy.percentile, but supports weights.

Parameters

- values: numpy.array with data
- quantiles: array-like with the quantiles needed
- sample_weight: array-like of the same length as values
- values_sorted: bool, default False; if True, avoids sorting the initial array
- old_style: if True, corrects the output to be consistent with numpy.percentile

Returns

numpy.array with computed quantiles.

Notes

quantiles should be in [0, 1]!

FROM: https://stackoverflow.com/a/29677616

NOTE: a grouped weighted quantile can be computed like this:

median_SD_spread = data.groupby('date').apply(
    lambda x: weighted_quantile(x['rate_SD_spread'], 0.5, sample_weight=x['Volume']))
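A direct (ungrouped) call might look like the sketch below; with equal weights the result should agree with the ordinary median:

```
import numpy as np

values = np.array([1.0, 2.0, 9.0])
weights = np.array([1.0, 1.0, 1.0])
weighted_quantile(values, [0.5], sample_weight=weights)
# Expected: array([2.]) -- the equal-weight median
```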
misc_tools.groupby_weighted_quantile(data_col=None, weight_col=None, by_col=None, data=None, transform=False, new_column_name='')#

This can already be accomplished with weighted_quantile, as demonstrated above. This function is for convenience.

misc_tools.load_date_mapping(data_dir=None, add_remaining_days_in_year=True, add_estimated_historical_days=True, historical_start='2016-01-01', add_estimated_future_dates=True, future_end='2092-01-01')#
misc_tools._alphabet#

'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#'

misc_tools.calc_check_digit(number)#

Calculate the check digit for an 8-digit CUSIP. This function is taken from arthurdejong/python-stdnum.
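A sketch of a call (assuming, as in python-stdnum, that the check digit is returned as a string):

```
calc_check_digit("03783310")
# Expected: '0' -- Apple's full 9-digit CUSIP is 037833100
```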

misc_tools.convert_cusips_from_8_to_9_digit(cusip_8dig_series)#
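Presumably this appends the check digit from calc_check_digit to each element of the series; a hedged sketch:

```
import pandas as pd

cusips_8 = pd.Series(["03783310", "17275R10"])
convert_cusips_from_8_to_9_digit(cusips_8)
# Expected: a Series of 9-digit CUSIPs such as '037833100' and '17275R102'
```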
misc_tools._with_lagged_column_no_resample(df=None, columns_to_lag=None, id_columns=None, lags=1, date_col='date', prefix='L')#

This can be easily accomplished with the shift method. For example:

df.groupby("id")["value"].shift(1)

This function is to remind myself of this older method.

misc_tools.with_lagged_columns(df=None, column_to_lag=None, id_column=None, lags=1, date_col='date', prefix='L', freq=None, resample=True)#

Add lagged columns to a dataframe, respecting frequency of the data.

Examples

>>> a=[
... ["A",'1990/1/1',1],
... ["A",'1990/2/1',2],
... ["A",'1990/3/1',3],
... ["B",'1989/12/1',12],
... ["B",'1990/1/1',1],
... ["B",'1990/2/1',2],
... ["B",'1990/3/1',3],
... ["B",'1990/4/1',4],
... ["B",'1990/6/1',6]]

>>> df=pd.DataFrame(a,columns=['id','date','value'])
>>> df['date']=pd.to_datetime(df['date'])

>>> df
  id       date  value
0  A 1990-01-01      1
1  A 1990-02-01      2
2  A 1990-03-01      3
3  B 1989-12-01     12
4  B 1990-01-01      1
5  B 1990-02-01      2
6  B 1990-03-01      3
7  B 1990-04-01      4
8  B 1990-06-01      6

>>> df_lag = _with_lagged_column_no_resample(df=df, columns_to_lag=['value'], id_columns=['id'], lags=1)
>>> df_lag
  id       date  value  L1_value
0  A 1990-01-01      1       NaN
1  A 1990-02-01      2      1.00
2  A 1990-03-01      3      2.00
3  B 1989-12-01     12       NaN
4  B 1990-01-01      1     12.00
5  B 1990-02-01      2      1.00
6  B 1990-03-01      3      2.00
7  B 1990-04-01      4      3.00
8  B 1990-06-01      6      4.00

The issue with leaving out the resample is that the lagged value
for 1990-06-01 is 4.0, but it should be NaN. This is because the
value for group B for 1990-05-01 is missing.

Rather, it should look like this:

>>> df_lag = with_lagged_columns(df=df, column_to_lag='value', id_column='id', lags=1, freq="MS", resample=True)
>>> df_lag
   id       date  value  L1_value
2   A 1990-01-01   1.00       NaN
4   A 1990-02-01   2.00      1.00
6   A 1990-03-01   3.00      2.00
8   A 1990-04-01    NaN      3.00
1   B 1989-12-01  12.00       NaN
3   B 1990-01-01   1.00     12.00
5   B 1990-02-01   2.00      1.00
7   B 1990-03-01   3.00      2.00
9   B 1990-04-01   4.00      3.00
11  B 1990-05-01    NaN      4.00
13  B 1990-06-01   6.00       NaN

Some valid frequencies are:

# "B": Business Day
# "D": Calendar day
# "W": Weekly
# "M": Month end
# "BME": Business month end
# "MS": Month start
# "BMS": Business month start
# "Q": Quarter end
# "BQ": Business quarter end
# "QS": Quarter start
# "BQS": Business quarter start
# "A" or "Y": Year end
# "BA" or "BY": Business year end
# "AS" or "YS": Year start
# "BAS" or "BYS": Business year start
# "H": Hourly
# "T" or "min": Minutely
# "S": Secondly
# "L" or "ms": Milliseconds
# "U": Microseconds
# "N": Nanoseconds

as seen here: https://business-science.github.io/pytimetk/guides/03_pandas_frequency.html

misc_tools.leave_one_out_sums(df, groupby=[], summed_col='')#

Compute leave-one-out sums:

\(x_{i,\ell} = \sum_{\ell' \neq \ell} w_{i,\ell'}\)

This is helpful for constructing the shift-share instruments in Borusyak, Hull, Jaravel (2022).

Examples

>>> df = pd.DataFrame({
...     'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
...     'B' : ['one', 'one', 'one', 'two', 'two', 'two'],
...     'C' : [1, 5, 5, 2, 5, 3],
...     'D' : [2.0, 5., 8., 1., 2., 9.],
...     'LOO_Sum_C_groupby_B': [10, 6, 6, 8, 5, 7]
...                })
>>> LOO_Sum_C_groupby_B = df.groupby(['B'])['C'].transform(lambda x: x.sum() - x)
>>> pd.testing.assert_series_equal(
...     df['LOO_Sum_C_groupby_B'],
...     df.groupby(['B'])['C'].transform(lambda x: x.sum() - x),
...     check_names=False)

>>> s = leave_one_out_sums(df, groupby=['B'], summed_col='C')
>>> pd.testing.assert_series_equal(
...     df['LOO_Sum_C_groupby_B'],
...     s,
...     check_names=False)

misc_tools.get_most_recent_quarter_end(d)#

Take a datetime and find the most recent quarter end date

>>> d = pd.to_datetime('2019-10-21')
>>> get_most_recent_quarter_end(d)
datetime.datetime(2019, 9, 30, 0, 0)

misc_tools.get_next_quarter_start(d)#

Take a datetime and find the start date of the next quarter

>>> d = pd.to_datetime('2019-10-21')
>>> get_next_quarter_start(d)
datetime.datetime(2020, 1, 1, 0, 0)

misc_tools.get_end_of_current_month(d)#

Take a datetime and find the last date of the current month and also reset time to zero.

>>> d = pd.to_datetime('2019-10-21')
>>> get_end_of_current_month(d)
Timestamp('2019-10-31 00:00:00')

>>> d = pd.to_datetime('2023-03-31 12:00:00')
>>> get_end_of_current_month(d)
Timestamp('2023-03-31 00:00:00')

From https://stackoverflow.com/a/13565185

misc_tools.get_end_of_current_quarter(d)#

Take a datetime and find the last date of the current quarter and also reset time to zero.

>>> d = pd.to_datetime('2019-10-21')
>>> get_end_of_current_quarter(d)
datetime.datetime(2019, 12, 31, 0, 0)

# TODO: Note that the behavior below may be unwanted. Might consider
# fixing it in the future.
>>> d = pd.to_datetime('2023-03-31 12:00:00')
>>> get_end_of_current_quarter(d)
datetime.datetime(2023, 3, 31, 0, 0)

misc_tools.add_vertical_lines_to_plot(start_date, end_date, ax=None, freq='Q', adjust_ticks=True, alpha=0.1, extend_to_nearest_quarter=True)#
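A usage sketch (parameter meanings inferred from the signature; the date arguments are assumed to accept anything pandas can parse as a datetime):

```
import matplotlib.pyplot as plt
import pandas as pd

fig, ax = plt.subplots()
# ...plot a time series on ax first...
add_vertical_lines_to_plot(
    start_date=pd.to_datetime("2019-01-01"),
    end_date=pd.to_datetime("2020-12-31"),
    ax=ax,
    freq="Q",   # presumably one line per quarter
    alpha=0.1,
)
```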
misc_tools.plot_weighted_median_with_distribution_bars(data=None, variable_name=None, date_col='date', weight_col=None, percentile_bars=True, percentiles=[0.25, 0.75], rolling_window=1, rolling=False, rolling_min_periods=None, rescale_factor=1, ax=None, add_quarter_lines=True, ylabel=None, xlabel=None, label=None)#

Plot the weighted median of a variable over time. Optionally, plot the 25th and 75th percentiles

Examples
--------

```
ax = plot_weighted_median_with_distribution_bars(
        data=df,
        variable_name='rate_SD_spread',
        date_col='date',
        weight_col='Volume',
        percentile_bars=True,
        percentiles=[0.25, 0.75],
        rolling_window=5,
        rescale_factor=100,
        ax=None,
        add_quarter_lines=True,
        ylabel=None,
        xlabel=None,
        label='Median Spread'
        )
plt.title('Volume-weighted median rate spread (bps)\nShaded region shows 25/75 percentiles')
other_bbg['2019-10-21':].plot(ax=ax)
plt.legend()

fig, ax = plt.subplots()
ax = plot_weighted_median_with_distribution_bars(
        data=df,
        variable_name='rate_SD_spread',
        date_col='date',
        weight_col='Volume',
        percentile_bars=True,
        percentiles=[0.25, 0.75],
        rolling_window=5,
        rescale_factor=100,
        ax=ax,
        add_quarter_lines=True,
        ylabel=None,
        xlabel=None,
        label=None
        )
plt.title('Volume-weighted median rate spread (bps)\nShaded region shows 25/75 percentiles')
other_bbg['2019-10-21':].plot(ax=ax)
plt.legend()
```

Notes
-----
rolling_window=1 means that there is no rolling aggregation applied.
misc_tools._demo()#