misc_tools#
Collection of miscellaneous tools useful in a variety of situations (not specific to the current project).
Module Contents#
Functions#
| Function | Description |
|---|---|
| merge_stats | Provide statistics to assess the completeness of the merge. |
| dataframe_set_difference | Gives the rows that appear in dff but not in df. |
| freq_counts | Like value_counts, but normalizes to give frequencies. Polars function; df is a polars dataframe. |
| move_columns_to_front | Move a list of columns so that they appear first. |
| weighted_average | Simple calculation of weighted average. |
| groupby_weighted_average | Faster method for calculating grouped weighted average. |
| groupby_weighted_std | Method for calculating grouped weighted standard deviation. |
| weighted_quantile | Very close to numpy.percentile, but supports weights. |
| groupby_weighted_quantile | Convenience wrapper; this can already be accomplished with weighted_quantile. |
| calc_check_digit | Calculate the check digit for the 8-digit CUSIP. Taken from arthurdejong/python-stdnum. |
| _with_lagged_column_no_resample | This can be easily accomplished with the shift method. |
| with_lagged_columns | Add lagged columns to a dataframe, respecting frequency of the data. |
| leave_one_out_sums | Compute leave-one-out sums. |
| get_most_recent_quarter_end | Take a datetime and find the most recent quarter end date. |
| get_next_quarter_start | Take a datetime and find the start date of the next quarter. |
| get_end_of_current_month | Take a datetime and find the last date of the current month and also reset time to zero. |
| get_end_of_current_quarter | Take a datetime and find the last date of the current quarter and also reset time to zero. |
| plot_weighted_median_with_distribution_bars | Plot the weighted median of a variable over time. Optionally, plot the 25th and 75th percentiles. |
Data#
API#
- misc_tools.merge_stats(df_left, df_right, on=[])#
Provide statistics to assess the completeness of the merge.
To assess the completeness of the merge, this function counts the number of unique elements in the index of the left and right dataframes. It produces the following:
(First, ensure that the index of left and right are sets of unique elements.)
- 'union': number of elements in the union of the indices
- 'intersection': number of elements in the intersection of the indices
- 'union-intersection': difference between union and intersection (unmatched symmetric)
- 'intersection/union': percentage of symmetric matched
- 'left': number of elements in the left index
- 'right': number of elements in the right index
- 'left-intersection': number of excess elements in the left index
- 'right-intersection': number of excess elements in the right index
- 'intersection/left': percentage matched based on the total in the left index
- 'intersection/right': percentage matched based on the total in the right index
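A hedged illustration of how the statistics above can be computed from two dataframes; this is a minimal sketch of the idea, not necessarily the function's exact implementation, and the example dataframes are hypothetical:
```
import pandas as pd

# Compare the unique index values of two dataframes after setting `on` as the index.
df_left = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
df_right = pd.DataFrame({"id": [2, 3, 4], "y": ["a", "b", "c"]})

left = set(df_left.set_index(["id"]).index.unique())
right = set(df_right.set_index(["id"]).index.unique())
union = left | right
intersection = left & right

stats = pd.Series({
    "union": len(union),
    "intersection": len(intersection),
    "union-intersection": len(union) - len(intersection),
    "intersection/union": len(intersection) / len(union),
    "left": len(left),
    "right": len(right),
    "left-intersection": len(left) - len(intersection),
    "right-intersection": len(right) - len(intersection),
    "intersection/left": len(intersection) / len(left),
    "intersection/right": len(intersection) / len(right),
})
print(stats)
```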
- misc_tools.dataframe_set_difference(dff, df, library='pandas', show='rows_and_numbers')#
Gives the rows that appear in dff but not in df
Example
rows = dataframe_set_difference(dff, df)
- misc_tools.freq_counts(df, col=None, with_count=True, with_cum_freq=True)#
Like value_counts, but normalizes to give frequencies. This is a Polars function; df is a polars dataframe.
Example
df.filter(
    (pl.col("fdate") > pl.datetime(2020, 1, 1))
    & (pl.col("bus_dt") == pl.col("fdate"))
).pipe(freq_counts, col="bus_tenor_bin")
- misc_tools.move_column_inplace(df, col, pos=0)#
https://stackoverflow.com/a/58686641
Use pos=0 to move to the front
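The linked answer's approach is pop-and-reinsert. A sketch of that idea (illustrative name; not necessarily this module's exact code):
```
# Remove the column from the frame, then re-insert it at the desired integer position.
def move_column_inplace_sketch(df, col, pos=0):
    col_data = df.pop(col)          # removes the column, modifying df in place
    df.insert(pos, col, col_data)   # re-insert it at position `pos`
```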
- misc_tools.move_columns_to_front(df, cols=[])#
Move a list of columns cols so that they appear first.
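A minimal sketch of the reordering idea (illustrative name; not necessarily the exact implementation):
```
# Put the listed columns first, followed by the remaining columns in their original order.
def move_columns_to_front_sketch(df, cols=[]):
    remaining = [c for c in df.columns if c not in cols]
    return df[cols + remaining]
```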
- misc_tools.weighted_average(data_col=None, weight_col=None, data=None)#
Simple calculation of weighted average.
Examples
>>> df_nccb = pd.DataFrame({
...     'rate': [2, 3, 2],
...     'start_leg_amount': [100, 200, 100]},
... )
>>> weighted_average(data_col='rate', weight_col='start_leg_amount', data=df_nccb)
2.5
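For reference, the same quantity can be computed directly with NumPy; this is an equivalent illustration, not necessarily how the function is implemented:
```
import numpy as np
import pandas as pd

df_nccb = pd.DataFrame({'rate': [2, 3, 2], 'start_leg_amount': [100, 200, 100]})
# Weighted average = sum(w * x) / sum(w)
np.average(df_nccb['rate'], weights=df_nccb['start_leg_amount'])  # 2.5
```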
- misc_tools.groupby_weighted_average(data_col=None, weight_col=None, by_col=None, data=None, transform=False, new_column_name='')#
Faster method for calculating grouped weighted average.
From: https://stackoverflow.com/a/44683506
Examples
>>> df_nccb = pd.DataFrame({
...     'trade_direction': ['RECEIVED', 'RECEIVED', 'DELIVERED'],
...     'rate': [2, 3, 2],
...     'start_leg_amount': [100, 200, 100]},
... )
>>> groupby_weighted_average(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction')
trade_direction
DELIVERED    2.00
RECEIVED     2.67
dtype: float64
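The linked Stack Overflow technique is, roughly, sum(w * x) per group divided by sum(w) per group. A sketch of that computation (not necessarily the exact implementation):
```
import pandas as pd

df_nccb = pd.DataFrame({
    'trade_direction': ['RECEIVED', 'RECEIVED', 'DELIVERED'],
    'rate': [2, 3, 2],
    'start_leg_amount': [100, 200, 100]},
)
# Build the weighted-value column, then divide group sums.
tmp = df_nccb.assign(_wx=df_nccb['rate'] * df_nccb['start_leg_amount'])
g = tmp.groupby('trade_direction')
g['_wx'].sum() / g['start_leg_amount'].sum()
# DELIVERED    2.000000
# RECEIVED     2.666667
```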
- misc_tools.groupby_weighted_std(data_col=None, weight_col=None, by_col=None, data=None, ddof=1)#
Method for calculating grouped weighted standard deviation.
From: https://stackoverflow.com/a/72915123
Examples
>>> df_nccb = pd.DataFrame({
...     'trade_direction': ['RECEIVED', 'RECEIVED', 'RECEIVED', 'RECEIVED',
...                         'DELIVERED', 'DELIVERED', 'DELIVERED', 'DELIVERED'],
...     'rate': [2, 2, 2, 3, 2, 2, 2, 3],
...     'start_leg_amount': [300, 300, 300, 0, 200, 200, 200, 200]},
... )
>>> groupby_weighted_std(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction', ddof=1)
trade_direction
DELIVERED    0.50
RECEIVED     0.00
dtype: float64
>>> np.std([2,2,2,3], ddof=1)
0.5
>>> np.std([2,2,2], ddof=1)
0.0
>>> groupby_weighted_std(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction', ddof=0)
trade_direction
DELIVERED    0.43
RECEIVED     0.00
dtype: float64
>>> np.std([2,2,2,3])
0.4330127018922193
>>> np.std([2,2,2])
0.0
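The examples above are consistent with treating the weights as frequency-style weights with a degrees-of-freedom correction. One formulation that reproduces these outputs (offered as an interpretation of the examples, not necessarily the exact code) is \(\sigma_w = \sqrt{\dfrac{\sum_i w_i (x_i - \bar{x}_w)^2}{\frac{N' - \text{ddof}}{N'} \sum_i w_i}}\), where \(\bar{x}_w\) is the weighted mean and \(N'\) is the number of nonzero weights in the group. With equal weights this reduces to the ordinary sample standard deviation, matching the np.std comparisons shown.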
- misc_tools.weighted_quantile(values, quantiles, sample_weight=None, values_sorted=False, old_style=False)#
Very close to numpy.percentile, but supports weights.
Parameters
- values : numpy.array with data
- quantiles : array-like with the quantiles needed
- sample_weight : array-like of the same length as values
- values_sorted : bool, default False; if True, will avoid sorting of the initial array
- old_style : if True, will correct output to be consistent with numpy.percentile

Returns
numpy.array with computed quantiles.
Notes
quantiles should be in [0, 1]!
FROM: https://stackoverflow.com/a/29677616
Note that a groupby weighted quantile can look like this:
median_SD_spread = data.groupby('date').apply(
    lambda x: weighted_quantile(x['rate_SD_spread'], 0.5, sample_weight=x['Volume'])
)
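A sketch of the weighted-quantile idea behind the linked Stack Overflow answer (illustrative name; not necessarily identical to this module's implementation): sort by value, build the weighted empirical CDF, then interpolate at the requested quantiles.
```
import numpy as np

def weighted_quantile_sketch(values, quantiles, sample_weight=None):
    values = np.asarray(values, dtype=float)
    quantiles = np.atleast_1d(quantiles).astype(float)
    if sample_weight is None:
        sample_weight = np.ones_like(values)
    sample_weight = np.asarray(sample_weight, dtype=float)

    # Sort observations and their weights by value.
    sorter = np.argsort(values)
    values, sample_weight = values[sorter], sample_weight[sorter]

    # Weighted CDF evaluated at each sorted observation, then linear interpolation.
    weighted_cdf = np.cumsum(sample_weight) - 0.5 * sample_weight
    weighted_cdf /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_cdf, values)

weighted_quantile_sketch([1, 2, 3, 4], 0.5, sample_weight=[1, 1, 1, 1])  # array([2.5])
```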
- misc_tools.groupby_weighted_quantile(data_col=None, weight_col=None, by_col=None, data=None, transform=False, new_column_name='')#
This can already be accomplished with weighted_quantile, as demonstrated above. This function is for convenience.
- misc_tools.load_date_mapping(data_dir=None, add_remaining_days_in_year=True, add_estimated_historical_days=True, historical_start='2016-01-01', add_estimated_future_dates=True, future_end='2092-01-01')#
- misc_tools._alphabet#
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#'
- misc_tools.calc_check_digit(number)#
Calculate the check digit for the 8-digit CUSIP. This function is taken from arthurdejong/python-stdnum.
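For reference, the CUSIP check-digit routine in python-stdnum looks essentially like the following (reproduced from memory, so treat it as a sketch): map each character to its position in _alphabet, double every second value, sum the resulting decimal digits, and take the tens' complement.
```
_alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#'

def calc_check_digit(number):
    """Calculate the CUSIP check digit for an 8-character identifier."""
    # Map each character to its _alphabet position, doubling every second value,
    # then concatenate the decimal digits of those products.
    number = ''.join(
        str((1, 2)[i % 2] * _alphabet.index(n)) for i, n in enumerate(number))
    # Sum all digits and return the amount needed to reach the next multiple of 10.
    return str((10 - sum(int(n) for n in number)) % 10)

calc_check_digit('03783310')  # '0' -> the full 9-digit CUSIP is '037833100' (Apple)
```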
- misc_tools.convert_cusips_from_8_to_9_digit(cusip_8dig_series)#
- misc_tools._with_lagged_column_no_resample(df=None, columns_to_lag=None, id_columns=None, lags=1, date_col='date', prefix='L')#
This can be easily accomplished with the shift method. For example,
df.groupby("id")["value"].shift(1)
This function is to remind myself of this older method.
- misc_tools.with_lagged_columns(df=None, column_to_lag=None, id_column=None, lags=1, date_col='date', prefix='L', freq=None, resample=True)#
Add lagged columns to a dataframe, respecting frequency of the data.
Examples
>>> a=[
...     ["A",'1990/1/1',1],
...     ["A",'1990/2/1',2],
...     ["A",'1990/3/1',3],
...     ["B",'1989/12/1',12],
...     ["B",'1990/1/1',1],
...     ["B",'1990/2/1',2],
...     ["B",'1990/3/1',3],
...     ["B",'1990/4/1',4],
...     ["B",'1990/6/1',6]]
>>> df=pd.DataFrame(a,columns=['id','date','value'])
>>> df['date']=pd.to_datetime(df['date'])
>>> df
  id       date  value
0  A 1990-01-01      1
1  A 1990-02-01      2
2  A 1990-03-01      3
3  B 1989-12-01     12
4  B 1990-01-01      1
5  B 1990-02-01      2
6  B 1990-03-01      3
7  B 1990-04-01      4
8  B 1990-06-01      6
>>> df_lag = _with_lagged_column_no_resample(df=df, columns_to_lag=['value'], id_columns=['id'], lags=1)
>>> df_lag
  id       date  value  L1_value
0  A 1990-01-01      1       NaN
1  A 1990-02-01      2      1.00
2  A 1990-03-01      3      2.00
3  B 1989-12-01     12       NaN
4  B 1990-01-01      1     12.00
5  B 1990-02-01      2      1.00
6  B 1990-03-01      3      2.00
7  B 1990-04-01      4      3.00
8  B 1990-06-01      6      4.00

The issue with leaving out the resample is that the lagged value for 1990-06-01 is 4.0, but it should be NaN. This is because the value for group B in 1990-05-01 is missing. Rather, it should look like this:

>>> df_lag = with_lagged_columns(df=df, column_to_lag='value', id_column='id', lags=1, freq="MS", resample=True)
>>> df_lag
   id       date  value  L1_value
2   A 1990-01-01   1.00       NaN
4   A 1990-02-01   2.00      1.00
6   A 1990-03-01   3.00      2.00
8   A 1990-04-01    NaN      3.00
1   B 1989-12-01  12.00       NaN
3   B 1990-01-01   1.00     12.00
5   B 1990-02-01   2.00      1.00
7   B 1990-03-01   3.00      2.00
9   B 1990-04-01   4.00      3.00
11  B 1990-05-01    NaN      4.00
13  B 1990-06-01   6.00       NaN
Some valid frequencies are
# "B": Business Day # "D": Calendar day # "W": Weekly # "M": Month end # "BME": Business month end # "MS": Month start # "BMS": Business month start # "Q": Quarter end # "BQ": Business quarter end # "QS": Quarter start # "BQS": Business quarter start # "A" or "Y": Year end # "BA" or "BY": Business year end # "AS" or "YS": Year start # "BAS" or "BYS": Business year start # "H": Hourly # "T" or "min": Minutely # "S": Secondly # "L" or "ms": Milliseconds # "U": Microseconds # "N": Nanoseconds
as seen here: https://business-science.github.io/pytimetk/guides/03_pandas_frequency.html
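A rough sketch of the resample-then-lag idea shown in the example above (illustrative name and helper; not the function's exact implementation): reindex each group to a regular frequency so that missing periods become explicit NaN rows, then shift within the group.
```
import pandas as pd

def lag_with_resample_sketch(df, column_to_lag, id_column, date_col="date",
                             lags=1, freq="MS", prefix="L"):
    out = []
    for key, g in df.groupby(id_column):
        g = g.set_index(date_col).asfreq(freq)   # insert NaN rows for missing periods
        g[id_column] = key                       # asfreq leaves the id NaN on the new rows
        g[f"{prefix}{lags}_{column_to_lag}"] = g[column_to_lag].shift(lags)
        out.append(g.reset_index())
    return pd.concat(out, ignore_index=True)
```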
- misc_tools.leave_one_out_sums(df, groupby=[], summed_col='')#
Compute leave-one-out sums,
\(x_{i,\ell} = \sum_{\ell' \neq \ell} w_{i, \ell'}\)
This is helpful for constructing the shift-share instruments in Borusyak, Hull, Jaravel (2022).
Examples
>>> df = pd.DataFrame({
...     'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
...     'B' : ['one', 'one', 'one', 'two', 'two', 'two'],
...     'C' : [1, 5, 5, 2, 5, 3],
...     'D' : [2.0, 5., 8., 1., 2., 9.],
...     'LOO_Sum_C_groupby_B': [10, 6, 6, 8, 5, 7]
... })
>>> LOO_Sum_C_groupby_B = df.groupby(['B'])['C'].transform(lambda x: x.sum() - x)
>>> pd.testing.assert_series_equal(
...     df['LOO_Sum_C_groupby_B'],
...     df.groupby(['B'])['C'].transform(lambda x: x.sum() - x),
...     check_names=False)
>>> s = leave_one_out_sums(df, groupby=['B'], summed_col='C')
>>> pd.testing.assert_series_equal(
...     df['LOO_Sum_C_groupby_B'],
...     s,
...     check_names=False)
- misc_tools.get_most_recent_quarter_end(d)#
Take a datetime and find the most recent quarter end date
>>> d = pd.to_datetime('2019-10-21')
>>> get_most_recent_quarter_end(d)
datetime.datetime(2019, 9, 30, 0, 0)
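One simple way to get this behavior (a sketch with an illustrative name, not necessarily the function's implementation): step back to the first day of the current quarter and subtract one day.
```
import datetime

def most_recent_quarter_end_sketch(d):
    # First month of the current quarter (1, 4, 7, or 10), then back up one day.
    first_month_of_quarter = 3 * ((d.month - 1) // 3) + 1
    quarter_start = datetime.datetime(d.year, first_month_of_quarter, 1)
    return quarter_start - datetime.timedelta(days=1)

most_recent_quarter_end_sketch(datetime.datetime(2019, 10, 21))
# datetime.datetime(2019, 9, 30, 0, 0)
```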
- misc_tools.get_next_quarter_start(d)#
Take a datetime and find the start date of the next quarter
>>> d = pd.to_datetime('2019-10-21')
>>> get_next_quarter_start(d)
datetime.datetime(2020, 1, 1, 0, 0)
- misc_tools.get_end_of_current_month(d)#
Take a datetime and find the last date of the current month and also reset time to zero.
>>> d = pd.to_datetime('2019-10-21')
>>> get_end_of_current_month(d)
Timestamp('2019-10-31 00:00:00')
>>> d = pd.to_datetime('2023-03-31 12:00:00')
>>> get_end_of_current_month(d)
Timestamp('2023-03-31 00:00:00')
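A sketch consistent with the examples above (illustrative name; not necessarily the exact implementation): drop the time component, then roll forward to the month end.
```
import pandas as pd

def end_of_current_month_sketch(d):
    # normalize() resets the time to midnight; MonthEnd(0) rolls forward to the
    # month end and leaves a date that is already at month end unchanged.
    return d.normalize() + pd.offsets.MonthEnd(0)

end_of_current_month_sketch(pd.to_datetime('2023-03-31 12:00:00'))
# Timestamp('2023-03-31 00:00:00')
```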
- misc_tools.get_end_of_current_quarter(d)#
Take a datetime and find the last date of the current quarter and also reset time to zero.
>>> d = pd.to_datetime('2019-10-21')
>>> get_end_of_current_quarter(d)
datetime.datetime(2019, 12, 31, 0, 0)

# TODO: Note that the behavior below may be unwanted. Might consider
# fixing in the future.
>>> d = pd.to_datetime('2023-03-31 12:00:00')
>>> get_end_of_current_quarter(d)
datetime.datetime(2023, 3, 31, 0, 0)
- misc_tools.add_vertical_lines_to_plot(start_date, end_date, ax=None, freq='Q', adjust_ticks=True, alpha=0.1, extend_to_nearest_quarter=True)#
- misc_tools.plot_weighted_median_with_distribution_bars(data=None, variable_name=None, date_col='date', weight_col=None, percentile_bars=True, percentiles=[0.25, 0.75], rolling_window=1, rolling=False, rolling_min_periods=None, rescale_factor=1, ax=None, add_quarter_lines=True, ylabel=None, xlabel=None, label=None)#
Plot the weighted median of a variable over time. Optionally, plot the 25th and 75th percentiles
Examples
```
ax = plot_weighted_median_with_distribution_bars(
    data=df,
    variable_name='rate_SD_spread',
    date_col='date',
    weight_col='Volume',
    percentile_bars=True,
    percentiles=[0.25, 0.75],
    rolling_window=5,
    rescale_factor=100,
    ax=None,
    add_quarter_lines=True,
    ylabel=None,
    xlabel=None,
    label='Median Spread'
)
plt.title('Volume-weighted median rate spread (bps)\nShaded region shows 25/75 percentiles')
other_bbg['2019-10-21':].plot(ax=ax)
plt.legend()

fig, ax = plt.subplots()
ax = plot_weighted_median_with_distribution_bars(
    data=df,
    variable_name='rate_SD_spread',
    date_col='date',
    weight_col='Volume',
    percentile_bars=True,
    percentiles=[0.25, 0.75],
    rolling_window=5,
    rescale_factor=100,
    ax=ax,
    add_quarter_lines=True,
    ylabel=None,
    xlabel=None,
    label=None
)
plt.title('Volume-weighted median rate spread (bps)\nShaded region shows 25/75 percentiles')
other_bbg['2019-10-21':].plot(ax=ax)
plt.legend()
```
Notes
rolling_window=1 means that there is no rolling aggregation applied.
- misc_tools._demo()#