misc_tools#
Collection of miscellaneous tools useful in a variety of situations (not specific to the current project).
Module Contents#
Functions#
| Function | Description |
|---|---|
| merge_stats | Provide statistics to assess the completeness of the merge. |
| dataframe_set_difference | Gives the rows that appear in dff but not in df. |
| freq_counts | Like value_counts, but normalizes to give frequencies. Polars function; df is a polars dataframe. |
| move_columns_to_front | Move a list of columns so that they appear first. |
| weighted_average | Simple calculation of weighted average. |
| groupby_weighted_average | Faster method for calculating grouped weighted average. |
| groupby_weighted_std | Method for calculating grouped weighted standard deviation. |
| weighted_quantile | Very close to numpy.percentile, but supports weights. |
| groupby_weighted_quantile | Convenience wrapper; this can already be accomplished with weighted_quantile. |
| calc_check_digit | Calculate the check digit for the 8-digit CUSIP. Taken from arthurdejong/python-stdnum. |
| _with_lagged_column_no_resample | This can be easily accomplished with the shift method. |
| with_lagged_columns | Add lagged columns to a dataframe, respecting frequency of the data. |
| leave_one_out_sums | Compute leave-one-out sums. |
| get_most_recent_quarter_end | Take a datetime and find the most recent quarter end date. |
| get_next_quarter_start | Take a datetime and find the start date of the next quarter. |
| get_end_of_current_month | Take a datetime and find the last date of the current month and also reset time to zero. |
| get_end_of_current_quarter | Take a datetime and find the last date of the current quarter and also reset time to zero. |
| plot_weighted_median_with_distribution_bars | Plot the weighted median of a variable over time. Optionally, plot the 25th and 75th percentiles. |
Data#
API#
- misc_tools.merge_stats(df_left, df_right, on=[])#
Provide statistics to assess the completeness of the merge.
To assess the completeness of the merge, this function counts the number of unique elements in the index of the left and right dataframes. It produces the following:
(First, ensure that the index of left and right are sets of unique elements.)
- 'union': number of elements in the union of the indices
- 'intersection': number of elements in the intersection of the indices
- 'union-intersection': difference between union and intersection (unmatched symmetric)
- 'intersection/union': percentage of symmetric matched
- 'left': number of elements in the left index
- 'right': number of elements in the right index
- 'left-intersection': number of excess elements in the left index
- 'right-intersection': number of excess elements in the right index
- 'intersection/left': percentage matched based on the total in the left index
- 'intersection/right': percentage matched based on the total in the right index
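A hedged illustration of how the statistics above can be computed from two dataframes; this is a minimal sketch of the idea, not necessarily the function's exact implementation, and the example dataframes are hypothetical:
```
import pandas as pd

# Compare the unique index values of two dataframes after setting `on` as the index.
df_left = pd.DataFrame({"id": [1, 2, 3], "x": [10, 20, 30]})
df_right = pd.DataFrame({"id": [2, 3, 4], "y": ["a", "b", "c"]})

left = set(df_left.set_index(["id"]).index.unique())
right = set(df_right.set_index(["id"]).index.unique())
union = left | right
intersection = left & right

stats = pd.Series({
    "union": len(union),
    "intersection": len(intersection),
    "union-intersection": len(union) - len(intersection),
    "intersection/union": len(intersection) / len(union),
    "left": len(left),
    "right": len(right),
    "left-intersection": len(left) - len(intersection),
    "right-intersection": len(right) - len(intersection),
    "intersection/left": len(intersection) / len(left),
    "intersection/right": len(intersection) / len(right),
})
print(stats)
```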
- misc_tools.dataframe_set_difference(dff, df, library='pandas', show='rows_and_numbers')#
Gives the rows that appear in dff but not in df
Example
rows = dataframe_set_difference(dff, df)
- misc_tools.freq_counts(df, col=None, with_count=True, with_cum_freq=True)#
Like value_counts, but normalizes to give frequencies. This is a Polars function; df is a polars dataframe.
Example
df.filter(
    (pl.col("fdate") > pl.datetime(2020, 1, 1))
    & (pl.col("bus_dt") == pl.col("fdate"))
).pipe(freq_counts, col="bus_tenor_bin")
- misc_tools.move_column_inplace(df, col, pos=0)#
https://stackoverflow.com/a/58686641
Use pos=0 to move to the front
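The linked answer's approach is pop-and-reinsert. A sketch of that idea (illustrative name; not necessarily this module's exact code):
```
# Remove the column from the frame, then re-insert it at the desired integer position.
def move_column_inplace_sketch(df, col, pos=0):
    col_data = df.pop(col)          # removes the column, modifying df in place
    df.insert(pos, col, col_data)   # re-insert it at position `pos`
```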
- misc_tools.move_columns_to_front(df, cols=[])#
Move a list of columns cols so that they appear first.
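A minimal sketch of the reordering idea (illustrative name; not necessarily the exact implementation):
```
# Put the listed columns first, followed by the remaining columns in their original order.
def move_columns_to_front_sketch(df, cols=[]):
    remaining = [c for c in df.columns if c not in cols]
    return df[cols + remaining]
```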
- misc_tools.weighted_average(data_col=None, weight_col=None, data=None)#
Simple calculation of weighted average.
Examples
>>> df_nccb = pd.DataFrame({
...     'rate': [2, 3, 2],
...     'start_leg_amount': [100, 200, 100]},
... )
>>> weighted_average(data_col='rate', weight_col='start_leg_amount', data=df_nccb)
2.5
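For reference, the same quantity can be computed directly with NumPy; this is an equivalent illustration, not necessarily how the function is implemented:
```
import numpy as np
import pandas as pd

df_nccb = pd.DataFrame({'rate': [2, 3, 2], 'start_leg_amount': [100, 200, 100]})
# Weighted average = sum(w * x) / sum(w)
np.average(df_nccb['rate'], weights=df_nccb['start_leg_amount'])  # 2.5
```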
- misc_tools.groupby_weighted_average(data_col=None, weight_col=None, by_col=None, data=None, transform=False, new_column_name='')#
Faster method for calculating grouped weighted average.
From: https://stackoverflow.com/a/44683506
Examples
>>> df_nccb = pd.DataFrame({
...     'trade_direction': ['RECEIVED', 'RECEIVED', 'DELIVERED'],
...     'rate': [2, 3, 2],
...     'start_leg_amount': [100, 200, 100]},
... )
>>> groupby_weighted_average(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction')
trade_direction
DELIVERED    2.00
RECEIVED     2.67
dtype: float64
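The linked Stack Overflow technique is, roughly, sum(w * x) per group divided by sum(w) per group. A sketch of that computation (not necessarily the exact implementation):
```
import pandas as pd

df_nccb = pd.DataFrame({
    'trade_direction': ['RECEIVED', 'RECEIVED', 'DELIVERED'],
    'rate': [2, 3, 2],
    'start_leg_amount': [100, 200, 100]},
)
# Build the weighted-value column, then divide group sums.
tmp = df_nccb.assign(_wx=df_nccb['rate'] * df_nccb['start_leg_amount'])
g = tmp.groupby('trade_direction')
g['_wx'].sum() / g['start_leg_amount'].sum()
# DELIVERED    2.000000
# RECEIVED     2.666667
```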
- misc_tools.groupby_weighted_std(data_col=None, weight_col=None, by_col=None, data=None, ddof=1)#
Method for calculating grouped weighted standard deviation.
From: https://stackoverflow.com/a/72915123
Examples
>>> df_nccb = pd.DataFrame({
...     'trade_direction': ['RECEIVED', 'RECEIVED', 'RECEIVED', 'RECEIVED',
...                         'DELIVERED', 'DELIVERED', 'DELIVERED', 'DELIVERED'],
...     'rate': [2, 2, 2, 3, 2, 2, 2, 3],
...     'start_leg_amount': [300, 300, 300, 0, 200, 200, 200, 200]},
... )
>>> groupby_weighted_std(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction', ddof=1)
trade_direction
DELIVERED    0.50
RECEIVED     0.00
dtype: float64
>>> np.std([2,2,2,3], ddof=1)
0.5
>>> np.std([2,2,2], ddof=1)
0.0
>>> groupby_weighted_std(data=df_nccb, data_col='rate', weight_col='start_leg_amount', by_col='trade_direction', ddof=0)
trade_direction
DELIVERED    0.43
RECEIVED     0.00
dtype: float64
>>> np.std([2,2,2,3])
0.4330127018922193
>>> np.std([2,2,2])
0.0
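The examples above are consistent with treating the weights as frequency-style weights with a degrees-of-freedom correction. One formulation that reproduces these outputs (offered as an interpretation of the examples, not necessarily the exact code) is \(\sigma_w = \sqrt{\dfrac{\sum_i w_i (x_i - \bar{x}_w)^2}{\frac{N' - \text{ddof}}{N'} \sum_i w_i}}\), where \(\bar{x}_w\) is the weighted mean and \(N'\) is the number of nonzero weights in the group. With equal weights this reduces to the ordinary sample standard deviation, matching the np.std comparisons shown.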
- misc_tools.weighted_quantile(values, quantiles, sample_weight=None, values_sorted=False, old_style=False)#
Very close to numpy.percentile, but supports weights.
Parameters
- values : numpy.array with data
- quantiles : array-like with the quantiles needed
- sample_weight : array-like of the same length as values
- values_sorted : bool, default False; if True, will avoid sorting of the initial array
- old_style : if True, will correct output to be consistent with numpy.percentile

Returns
numpy.array with computed quantiles.
Notes
quantiles should be in [0, 1]!
FROM: https://stackoverflow.com/a/29677616
Note that a groupby weighted quantile can look like this:
median_SD_spread = data.groupby('date').apply(
    lambda x: weighted_quantile(x['rate_SD_spread'], 0.5, sample_weight=x['Volume'])
)
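A sketch of the weighted-quantile idea behind the linked Stack Overflow answer (illustrative name; not necessarily identical to this module's implementation): sort by value, build the weighted empirical CDF, then interpolate at the requested quantiles.
```
import numpy as np

def weighted_quantile_sketch(values, quantiles, sample_weight=None):
    values = np.asarray(values, dtype=float)
    quantiles = np.atleast_1d(quantiles).astype(float)
    if sample_weight is None:
        sample_weight = np.ones_like(values)
    sample_weight = np.asarray(sample_weight, dtype=float)

    # Sort observations and their weights by value.
    sorter = np.argsort(values)
    values, sample_weight = values[sorter], sample_weight[sorter]

    # Weighted CDF evaluated at each sorted observation, then linear interpolation.
    weighted_cdf = np.cumsum(sample_weight) - 0.5 * sample_weight
    weighted_cdf /= np.sum(sample_weight)
    return np.interp(quantiles, weighted_cdf, values)

weighted_quantile_sketch([1, 2, 3, 4], 0.5, sample_weight=[1, 1, 1, 1])  # array([2.5])
```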
- misc_tools.groupby_weighted_quantile(data_col=None, weight_col=None, by_col=None, data=None, transform=False, new_column_name='')#
This can already be accomplished with weighted_quantile, as demonstrated above. This function is for convenience.
- misc_tools.load_date_mapping(data_dir=None, add_remaining_days_in_year=True, add_estimated_historical_days=True, historical_start='2016-01-01', add_estimated_future_dates=True, future_end='2092-01-01')#
- misc_tools._alphabet#
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#'
- misc_tools.calc_check_digit(number)#
Calculate the check digit for the 8-digit CUSIP. This function is taken from arthurdejong/python-stdnum.
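For reference, the CUSIP check-digit routine in python-stdnum looks essentially like the following (reproduced from memory, so treat it as a sketch): map each character to its position in _alphabet, double every second value, sum the resulting decimal digits, and take the tens' complement.
```
_alphabet = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ*@#'

def calc_check_digit(number):
    """Calculate the CUSIP check digit for an 8-character identifier."""
    # Map each character to its _alphabet position, doubling every second value,
    # then concatenate the decimal digits of those products.
    number = ''.join(
        str((1, 2)[i % 2] * _alphabet.index(n)) for i, n in enumerate(number))
    # Sum all digits and return the amount needed to reach the next multiple of 10.
    return str((10 - sum(int(n) for n in number)) % 10)

calc_check_digit('03783310')  # '0' -> the full 9-digit CUSIP is '037833100' (Apple)
```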
- misc_tools.convert_cusips_from_8_to_9_digit(cusip_8dig_series)#
- misc_tools._with_lagged_column_no_resample(df=None, columns_to_lag=None, id_columns=None, lags=1, date_col='date', prefix='L')#
This can be easily accomplished with the shift method. For example,
df.groupby("id")["value"].shift(1)
This function is to remind myself of this older method.
- misc_tools.with_lagged_columns(df=None, column_to_lag=None, id_column=None, lags=1, date_col='date', prefix='L', freq=None, resample=True)#
Add lagged columns to a dataframe, respecting frequency of the data.
Examples
>>> a=[
...     ["A",'1990/1/1',1],
...     ["A",'1990/2/1',2],
...     ["A",'1990/3/1',3],
...     ["B",'1989/12/1',12],
...     ["B",'1990/1/1',1],
...     ["B",'1990/2/1',2],
...     ["B",'1990/3/1',3],
...     ["B",'1990/4/1',4],
...     ["B",'1990/6/1',6]]
>>> df=pd.DataFrame(a,columns=['id','date','value'])
>>> df['date']=pd.to_datetime(df['date'])
>>> df
  id       date  value
0  A 1990-01-01      1
1  A 1990-02-01      2
2  A 1990-03-01      3
3  B 1989-12-01     12
4  B 1990-01-01      1
5  B 1990-02-01      2
6  B 1990-03-01      3
7  B 1990-04-01      4
8  B 1990-06-01      6
>>> df_lag = _with_lagged_column_no_resample(df=df, columns_to_lag=['value'], id_columns=['id'], lags=1)
>>> df_lag
  id       date  value  L1_value
0  A 1990-01-01      1       NaN
1  A 1990-02-01      2      1.00
2  A 1990-03-01      3      2.00
3  B 1989-12-01     12       NaN
4  B 1990-01-01      1     12.00
5  B 1990-02-01      2      1.00
6  B 1990-03-01      3      2.00
7  B 1990-04-01      4      3.00
8  B 1990-06-01      6      4.00

The issue with leaving out the resample is that the lagged value for 1990-06-01 is 4.0, but it should be NaN. This is because the value for group B in 1990-05-01 is missing. Rather, it should look like this:

>>> df_lag = with_lagged_columns(df=df, column_to_lag='value', id_column='id', lags=1, freq="MS", resample=True)
>>> df_lag
   id       date  value  L1_value
2   A 1990-01-01   1.00       NaN
4   A 1990-02-01   2.00      1.00
6   A 1990-03-01   3.00      2.00
8   A 1990-04-01    NaN      3.00
1   B 1989-12-01  12.00       NaN
3   B 1990-01-01   1.00     12.00
5   B 1990-02-01   2.00      1.00
7   B 1990-03-01   3.00      2.00
9   B 1990-04-01   4.00      3.00
11  B 1990-05-01    NaN      4.00
13  B 1990-06-01   6.00       NaN
Some valid frequencies are
# "B": Business Day # "D": Calendar day # "W": Weekly # "M": Month end # "BME": Business month end # "MS": Month start # "BMS": Business month start # "Q": Quarter end # "BQ": Business quarter end # "QS": Quarter start # "BQS": Business quarter start # "A" or "Y": Year end # "BA" or "BY": Business year end # "AS" or "YS": Year start # "BAS" or "BYS": Business year start # "H": Hourly # "T" or "min": Minutely # "S": Secondly # "L" or "ms": Milliseconds # "U": Microseconds # "N": Nanoseconds
as seen here: https://business-science.github.io/pytimetk/guides/03_pandas_frequency.html
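A rough sketch of the resample-then-lag idea shown in the example above (illustrative name and helper; not the function's exact implementation): reindex each group to a regular frequency so that missing periods become explicit NaN rows, then shift within the group.
```
import pandas as pd

def lag_with_resample_sketch(df, column_to_lag, id_column, date_col="date",
                             lags=1, freq="MS", prefix="L"):
    out = []
    for key, g in df.groupby(id_column):
        g = g.set_index(date_col).asfreq(freq)   # insert NaN rows for missing periods
        g[id_column] = key                       # asfreq leaves the id NaN on the new rows
        g[f"{prefix}{lags}_{column_to_lag}"] = g[column_to_lag].shift(lags)
        out.append(g.reset_index())
    return pd.concat(out, ignore_index=True)
```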
- misc_tools.leave_one_out_sums(df, groupby=[], summed_col='')#
Compute leave-one-out sums,
\(x_{i,\ell} = \sum_{\ell' \neq \ell} w_{i, \ell'}\)
This is helpful for constructing the shift-share instruments in Borusyak, Hull, Jaravel (2022).
Examples
>>> df = pd.DataFrame({
...     'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
...     'B' : ['one', 'one', 'one', 'two', 'two', 'two'],
...     'C' : [1, 5, 5, 2, 5, 3],
...     'D' : [2.0, 5., 8., 1., 2., 9.],
...     'LOO_Sum_C_groupby_B': [10, 6, 6, 8, 5, 7]
... })
>>> LOO_Sum_C_groupby_B = df.groupby(['B'])['C'].transform(lambda x: x.sum() - x)
>>> pd.testing.assert_series_equal(
...     df['LOO_Sum_C_groupby_B'],
...     df.groupby(['B'])['C'].transform(lambda x: x.sum() - x),
...     check_names=False)
>>> s = leave_one_out_sums(df, groupby=['B'], summed_col='C')
>>> pd.testing.assert_series_equal(
...     df['LOO_Sum_C_groupby_B'],
...     s,
...     check_names=False)
- misc_tools.get_most_recent_quarter_end(d)#
Take a datetime and find the most recent quarter end date
>>> d = pd.to_datetime('2019-10-21')
>>> get_most_recent_quarter_end(d)
datetime.datetime(2019, 9, 30, 0, 0)
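One simple way to get this behavior (a sketch with an illustrative name, not necessarily the function's implementation): step back to the first day of the current quarter and subtract one day.
```
import datetime

def most_recent_quarter_end_sketch(d):
    # First month of the current quarter (1, 4, 7, or 10), then back up one day.
    first_month_of_quarter = 3 * ((d.month - 1) // 3) + 1
    quarter_start = datetime.datetime(d.year, first_month_of_quarter, 1)
    return quarter_start - datetime.timedelta(days=1)

most_recent_quarter_end_sketch(datetime.datetime(2019, 10, 21))
# datetime.datetime(2019, 9, 30, 0, 0)
```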
- misc_tools.get_next_quarter_start(d)#
Take a datetime and find the start date of the next quarter
>>> d = pd.to_datetime('2019-10-21')
>>> get_next_quarter_start(d)
datetime.datetime(2020, 1, 1, 0, 0)
- misc_tools.get_end_of_current_month(d)#
Take a datetime and find the last date of the current month and also reset time to zero.
>>> d = pd.to_datetime('2019-10-21')
>>> get_end_of_current_month(d)
Timestamp('2019-10-31 00:00:00')
>>> d = pd.to_datetime('2023-03-31 12:00:00')
>>> get_end_of_current_month(d)
Timestamp('2023-03-31 00:00:00')
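A sketch consistent with the examples above (illustrative name; not necessarily the exact implementation): drop the time component, then roll forward to the month end.
```
import pandas as pd

def end_of_current_month_sketch(d):
    # normalize() resets the time to midnight; MonthEnd(0) rolls forward to the
    # month end and leaves a date that is already at month end unchanged.
    return d.normalize() + pd.offsets.MonthEnd(0)

end_of_current_month_sketch(pd.to_datetime('2023-03-31 12:00:00'))
# Timestamp('2023-03-31 00:00:00')
```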
- misc_tools.get_end_of_current_quarter(d)#
Take a datetime and find the last date of the current quarter and also reset time to zero.
>>> d = pd.to_datetime('2019-10-21')
>>> get_end_of_current_quarter(d)
datetime.datetime(2019, 12, 31, 0, 0)

# TODO: Note that the behavior below may be unwanted. Might consider
# fixing in the future.
>>> d = pd.to_datetime('2023-03-31 12:00:00')
>>> get_end_of_current_quarter(d)
datetime.datetime(2023, 3, 31, 0, 0)
- misc_tools.add_vertical_lines_to_plot(start_date, end_date, ax=None, freq='Q', adjust_ticks=True, alpha=0.1, extend_to_nearest_quarter=True)#
- misc_tools.plot_weighted_median_with_distribution_bars(data=None, variable_name=None, date_col='date', weight_col=None, percentile_bars=True, percentiles=[0.25, 0.75], rolling_window=1, rolling=False, rolling_min_periods=None, rescale_factor=1, ax=None, add_quarter_lines=True, ylabel=None, xlabel=None, label=None)#
Plot the weighted median of a variable over time. Optionally, plot the 25th and 75th percentiles
Examples
```
ax = plot_weighted_median_with_distribution_bars(
    data=df,
    variable_name='rate_SD_spread',
    date_col='date',
    weight_col='Volume',
    percentile_bars=True,
    percentiles=[0.25, 0.75],
    rolling_window=5,
    rescale_factor=100,
    ax=None,
    add_quarter_lines=True,
    ylabel=None,
    xlabel=None,
    label='Median Spread'
)
plt.title('Volume-weighted median rate spread (bps)\nShaded region shows 25/75 percentiles')
other_bbg['2019-10-21':].plot(ax=ax)
plt.legend()

fig, ax = plt.subplots()
ax = plot_weighted_median_with_distribution_bars(
    data=df,
    variable_name='rate_SD_spread',
    date_col='date',
    weight_col='Volume',
    percentile_bars=True,
    percentiles=[0.25, 0.75],
    rolling_window=5,
    rescale_factor=100,
    ax=ax,
    add_quarter_lines=True,
    ylabel=None,
    xlabel=None,
    label=None
)
plt.title('Volume-weighted median rate spread (bps)\nShaded region shows 25/75 percentiles')
other_bbg['2019-10-21':].plot(ax=ax)
plt.legend()
```
Notes
rolling_window=1 means that there is no rolling aggregation applied.
- misc_tools._demo()#