This function VITAL for portfolio backtesting is now in Python, written with the help of chatGPT4

Posted on May 3, 2023 by Ilya Kipnis in R bloggers | 0 Comments

[This article was first published on R – QuantStrat TradeR, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

So, it’s been a little while. But after a couple of years of some grunt work analytics jobs *and* consulting for a $1B AUM fund, I’ve decided that I had a bit more in the tank to share as far as quant content creation–quantent creation (?)–goes.

And a function I’ve searched for in Python for a long time now, but never finding it in a proper capacity is one that we’ve seen used time and again on this blog–R’s Return.portfolio.

For those uninitiated, Return.portfolio in R is essentially the workhorse of portfolio-level asset allocation backtests. What it does can be summarized in a sentence: given an xts/Python series dataframe of weights with m (E.G. 10) assets, and an xts/Python series dataframe with returns of m assets, Return.portfolio will compute drifted portfolio returns, along with beginning of period weights, end of period weights, beginning/end of period values, contribution to returns, and in the case of the Python variant that I wrote, two-way turnover.

Now, why is this function vital to have in Python, in my opinion?

Simple: because onto my experience with the language, if you want to do any financial research with Python, you need to learn some arcane and convoluted syntax such as zipline or quantconnect, and in some cases, that depends on working through another website altogether (in the case of quantconnect), or in the case of zipline, a library developed by a company that no longer exists (quantopian), so in either case, using either of these libraries may not be the best of ideas, even *if* you put in the who-knows-how-many-hours to learn a bunch of extra syntax fighting for real estate in your head.

Instead, what porting R’s Return.portfolio into Python does is to allow the same exact numpy/scipy/pandas ye-olde-classic-data-science syntax to not only work with backtesting, but also to seamlessly link up with whatever other Python library you may want to use, such as cvxpy for your portfolio optimization solutions (having used this library on past consulting arrangements, I am *very* impressed).

Now, here’s the link to my github for the code for Return.portfolio in Python–or, rather, since the dot operator has a special use in Python, the Return_portfolio function.

But for the record, here’s the Python code, including the endpoints function.

def endpoints(df, on = "M", offset = 0):
    """
    Returns index of endpoints of a time series analogous to R's endpoints
    function.
    Takes in:
        df -- a dataframe/series with a date index
         
        on -- a string specifying frequency of endpoints
         
        (E.G. "M" for months, "Q" for quarters, and so on)
         
        offset -- to offset by a specified index on the original data
        (E.G. if the data is daily resolution, offset of 1 offsets by a day)
        This is to allow for timing luck analysis. Thank Corey Hoffstein.
    """
     
    # to allow for familiarity with R
    # "months" becomes "M" for resampling
    if len(on) > 3:
        on = on[0].capitalize()
     
    # get index dates of formal endpoints
    ep_dates = pd.Series(df.index, index = df.index).resample(on).max()
     
    # get the integer indices of dates that are the endpoints
    date_idx = np.where(df.index.isin(ep_dates))
     
    # append last day to match R's endpoints function
    # remember, Python is indexed at 0, not 1
    #date_idx = np.insert(date_idx, 0, 0)
    date_idx = np.append(date_idx, df.shape[0]-1)
    if offset != 0:
        date_idx = date_idx + offset
        date_idx[date_idx < 0] = 0
        date_idx[date_idx > df.shape[0]-1] = df.shape[0]-1
    out = np.unique(date_idx)
    return out  

def compute_weights_and_returns(subset_returns, weights):
  rep_weights = np.tile(weights, (len(subset_returns), 1))
  cum_subset_weights = np.cumprod(1 + subset_returns, axis=0) * rep_weights
  EOP_Weight = cum_subset_weights / np.sum(cum_subset_weights, axis=1).to_numpy()[:, np.newaxis]
  cum_subset_weights_bop = cum_subset_weights / (1 + subset_returns)
  BOP_Weight = cum_subset_weights_bop / np.sum(cum_subset_weights_bop, axis=1).to_numpy()[:, np.newaxis]
  portf_returns_subset = pd.DataFrame(np.sum(subset_returns.values * BOP_Weight, axis=1), 
                                        index=subset_returns.index, columns=['Portfolio.Returns'])
  return [portf_returns_subset, BOP_Weight, EOP_Weight]


# Return.portfolio.geometric from R
def Return_portfolio(R, weights=None, verbose=True, rebalance_on='months'):
  """

    Parameters
    ----------
    R : a pandas series of asset returns
    weights : a vector or pandas series of asset weights.
    verbose : a boolean specifying a verbose output containing:
        portfolio returns,
        beginning of period weights and values, end of period weights and values,
        asset contribution to returns, and two-way turnover calculation
    rebalance_on : a string specifying rebalancing frequency if weights are passed in as a vector.

    Raises
    ------
    ValueError
        Number of asset weights must be equal to the number of assets.

    Returns
    -------
    TYPE
        See verbose parameter for True value, otherwise just portfolio returns.

    """
  
  # make sure original object isn't overridden
  R = R.copy()
  weights = weights.copy()
  
  # impute NAs in returns
  if R.isna().sum().sum() > 0:
    R.fillna(0, inplace=True)
    wn.warn("NAs detected in returns. Imputing with zeroes.")
    
    # if no weights provided, create equal weight vector
  if weights is None:
    weights = np.repeat(1/R.shape[1], R.shape[1])
    wn.warn("Weights not provided, assuming equal weight for rebalancing periods.")
    
    # if weights aren't passed in as a data frame (they're probably a list)
    # turn them into a 1 x num_assets data frame
  if type(weights) != pd.DataFrame:
    weights = pd.DataFrame(weights)
    weights = weights.transpose()
    
    # error checking for same number of weights as assets
  if weights.shape[1] != R.shape[1]:
        raise ValueError("Number of weights is unequal to number of assets. Correct this.")
    
    
  # if there's a row vector of weights, create a data frame with  the desired 
  # rebalancing schedule --  also add in the very first date into the schedule
  if weights.shape[0] == 1:
        if rebalance_on is not None:
            ep = endpoints(R, on = rebalance_on)
            first_weights = pd.DataFrame(np.array(weights), index=[R.index[0]], columns=R.columns)
            weights = pd.concat([first_weights, pd.DataFrame(np.tile(weights, 
                        (len(ep), 1)), index=R.index[ep], columns=R.columns)], axis=0)
            #weights = pd.DataFrame(np.tile(weights, (len(ep), 1)), 
            #                       index = ep.index, columns = R.columns)
            weights.index.name = R.index.name
        else:
            weights = pd.DataFrame(weights, index=[R.index[0]], columns=R.columns)
    
  original_weight_dim = weights.shape[1] # save this for computing two-way turnover

    
  if weights.isna().sum().sum() > 0:
    weights.fillna(0, inplace=True)
    wn.warn("NAs detected in weights. Imputing with zeroes.")
    
  residual_weights = 1 - weights.sum(axis=1)
  if abs(residual_weights).sum() != 0:
    print("One or more periods do not have investment equal to 1. Creating residual weights.")
    weights["Residual"] = residual_weights
    R["Residual"] = 0
    
  if weights.shape[0] != 1:
    portf_returns, bop_weights, eop_weights = [], [], []
    for i in range(weights.shape[0]-1):
      subset = R.loc[(R.index >= weights.index[i]) & (R.index <= weights.index[i+1])]
      if i >= 1: # first period include all data
                # otherwise, final period of previous period is included, so drop it
        subset = subset.iloc[1:]
      subset_out = compute_weights_and_returns(subset, weights.iloc[i,:])
      subset_out[0].columns = ["Portfolio.Returns"]
      portf_returns.append(subset_out[0])
      bop_weights.append(subset_out[1])
      eop_weights.append(subset_out[2])
    portf_returns = pd.concat(portf_returns, axis=0)
    bop_weights = pd.concat(bop_weights, axis=0)
    eop_weights = pd.concat(eop_weights, axis=0)
  else: # only one weight allocation at the beginning and just drift the portfolio
    out = compute_weights_and_returns(R, weights)
    portf_returns = out[0]; portf_returns.columns = ['Portfolio.Returns']
    bop_weights = out[1]
    eop_weights = out[2]
    
    
  pct_contribution = R * bop_weights
  cum_returns = (1 + portf_returns).cumprod()
  eop_value = eop_weights * pd.DataFrame(np.tile(cum_returns, (1, eop_weights.shape[1])), 
                                        index=eop_weights.index, columns=eop_weights.columns)
  bop_value = bop_weights * pd.DataFrame(np.tile(cum_returns/(1+portf_returns), (1, bop_weights.shape[1])), 
                                        index=bop_weights.index, columns=bop_weights.columns)

  turnover = (np.abs(bop_weights.iloc[:, :(original_weight_dim-1)] - eop_weights.iloc[:, :(original_weight_dim-1)].shift(1))).sum(axis=1).dropna()
  turnover = pd.DataFrame(turnover, index=eop_weights.index[1:], columns=['Two-way turnover'])

  out = [portf_returns, pct_contribution, bop_weights, eop_weights, bop_value, eop_value, turnover]
  out = {k: v for k, v in zip(['returns', 'contribution', 'BOP.Weight', 'EOP.Weight', 'BOP.Value', 'EOP.Value', 'Two.Way.Turnover'], out)}
  if verbose:
        return out
  else:
        return portf_returns

Rather than go into it line by line, those interested can read the code, but ultimately, a lot of it is basically seeing if the weights were passed in as a data frame of individually customized weights (E.G. on January, my weights were such and such, and on February, they shifted to some other such and such), or if the weights were passed in as a vector, or Python list, if someone just wants to rebalance a portfolio (E.G. a classic buy-and-hold 60/40), and if some portfolio weights don’t add up to 1. Essentially, a fair bit of bookkeeping. (Speaking of, I’m not sure what happened to the programming language customization block here on wordpress, so, I sincerely encourage readers to check my github for this code).

Now, here’s the interesting part which motivated this post:

I didn’t write most of this code in Python. Or rather, not initially.

I wrote it in R, first, to understand it, with R’s vectorization, where I’m a bit more comfortable. (As an aside, as far as languages for quantitative finance go, I think R has the much more advanced buy-side libraries such as PortfolioAnalytics or Quantstrat compared to Python, as they were written by high-level industry practitioners, whereas Python’s finance ecosystem seems to be…fairly threadbare, consisting of various islands of libraries that don’t really play together all that well.)

Here’s the R code that I did to rewrite Peter Carl’s Return.portfolio.geometric (the underlying function that runs Return.portfolio that I’ve used all these years):

compute_weights_and_returns <- function(subset, weights) {
  rep_weights <- matrix(nrow=nrow(subset), ncol = ncol(subset), weights, byrow = TRUE)
  cum_subset_weights <- cumprod(1+subset) * rep_weights
  EOP.Weight <- cum_subset_weights/rowSums(cum_subset_weights)
  cum_subset_weights_bop <- cum_subset_weights/(1+subset)
  BOP.Weight <- cum_subset_weights_bop/rowSums(cum_subset_weights_bop)
  portf_returns_subset <- xts(rowSums(subset * BOP.Weight), order.by=index(subset))
  return(list(portf_returns_subset, BOP.Weight, EOP.Weight))
}

Return.portfolio.geometric.ilya <- function(R, weights, verbose = TRUE, rebalance_on = 'months') {
  
  if(sum(is.na(R)) > 0) {
    R[is.na(R)] <- 0
    warning("NAs detected in returns. Imputing with zeroes.")
  }
  
  if(missing(weights)) {
    weights <- rep(1/ncol(R), ncol(R))
    warning("Weights not provided, assuming equal weight for rebalancing periods.")
  }
  
  # if weights passed in as vector
  if(is.null(dim(weights))) {
    if(!is.null(rebalance_on)) {
      ep <- endpoints(R)
      first_weights <- xts(t(matrix(weights)), order.by=index(R)[1])
      weights <- rbind(first_weights,
                           xts(matrix(nrow=length(index(R)[ep]), 
                                      ncol = length(weights), weights, byrow=TRUE),
                             order.by = index(R)[ep]))
      # get the first day of all the weights, make sure it's unique
      # weights <- weights[!duplicated(index(weights)),]
    } else {
      weights <- xts(t(matrix(weights)), order.by=index(R)[1])
    }
    colnames(weights) <- colnames(R)
  } 
  
  original_weight_dim <- ncol(weights) # save this for computing two-way turnover
  
  if(original_weight_dim != ncol(R)) {
    stop("Number of weights is unequal to number of assets. Correct this.")
  }
  
  if(sum(is.na(weights)) > 0) {
    weights[is.na(weights)] <- 0
    warning("NAs detected in weights. Imputing with zeroes.")
  }
  
  residual_weights <- 1-rowSums(weights)
  if(sum(abs(residual_weights)) != 0) {
    warning("One or more periods do not have investment equal to 1. Creating residual weights.")
    weights$Residual <-  residual_weights
    R$Residual <- 0
  }
  if(nrow(weights) != 1) {
    portf_returns <- bop_weights <- eop_weights <- list()
    for(i in 1:(nrow(weights)-1)) {
      subset <- R[paste((index(weights)[i]),index(weights)[i+1], sep = "::"),]
      if(i >= 2) { # first period include all data
        # otherwise, final period of previous period is included, so drop it
        subset <- subset[-1,]
      }
      subset_out <- compute_weights_and_returns(subset, weights[i,])
      colnames(subset_out[[1]]) <- "Portfolio.Returns"
      portf_returns[[i]] <- subset_out[[1]]
      bop_weights[[i]] <- subset_out[[2]]
      eop_weights[[i]] <- subset_out[[3]]
    }
    portf_returns <- do.call(rbind, portf_returns)
    bop_weights <- do.call(rbind, bop_weights)
    eop_weights <- do.call(rbind, eop_weights)
  } else { # only one weight allocation at the beginning and just drift the portfolio
    out <- compute_weights_and_returns(R, weights)
    portf_returns <- out[[1]]; colnames(portf_returns) <- "Portfolio.Returns"
    bop_weights <- out[[2]]
    eop_weights <- out[[3]]
  }
  pct_contribution <- R * bop_weights
  cum_returns <- cumprod(1+portf_returns)
  eop_value <- eop_weights * matrix(nrow=nrow(eop_weights), ncol = ncol(eop_weights), 
                                    cum_returns, byrow= FALSE) 
  bop_value <- bop_weights * matrix(nrow=nrow(bop_weights), ncol = ncol(bop_weights),
                                    cum_returns/(1+portf_returns), byrow = FALSE)
  
  # add turnover computation because that's what this whole exercise is about
  turnover <- na.omit(xts(rowSums(abs(bop_weights[,1:original_weight_dim] - 
                                        lag(eop_weights[,1:original_weight_dim]))), 
                          order.by = index(eop_weights)))
  
  colnames(turnover) <- "Two-way turnover"
  
  out <- list(portf_returns, pct_contribution, bop_weights, 
              eop_weights, bop_value, eop_value, turnover)
  names(out) <- c("returns", "contribution", "BOP.Weight", 
                  "EOP.Weight", "BOP.Value", "EOP.Value", "Two.Way.Turnover")
  if(verbose) {
    return(out)
  } else {
    return(portf_returns)
  }
}

Now, how did I go from the above to the code at the beginning of this post?

ChatGPT4.

That is, I translated my R code into Python code using ChatGPT4 by simply having it translate, block by block, my R code into Python.

Now, was it perfect? No, actually. I had to run the code block by block, and do a little bit of debugging here and there–including adding the functionality for the Return_portfolio function to make deep copies of the returns and weights, instead of modifying the original data structure passed into the function, since Python by default is a pass by reference language, not pass by value (ugh). However, this sort of “Rosetta Stone” functionality inherent in ChatGPT4 is occasionally useful.

Now, when I first tried prompting chatGPT4 to just translate R’s functions, without guiding it step by step, it was a total disaster. Contrary to popular belief, ChatGPT4 is *not* some all-powerful, all-perfect, all-knowing, and all-wise (as the late, great, George Carlin once said) system. But occasionally, it can be useful.

Furthermore, as someone that’s used AI to try my hand at image generation, I also consider those systems (such as StableDiffusion, or Leonardo.AI) to be…somewhat impressive, if in their infancy right now (it gets much more difficult to have multiple subjects interact on one image–I.E. if you want to describe one person in an image, that’s one thing, but if you want to describe two, it isn’t like you can make an image of one–say, save it with a name, such as Alice.jpeg, make an image of a second, save it to another name, such as Bob.jpeg, and then say “Alice being hugged by Bob”). Of course, I’ve definitely been aware about the copyright hubbub brewing over the image generating AI space, but as someone that’s given away plenty of code over the past 8 or so years, including code that has been augmented to run $87 million at a $1B AUM fund, I’m pretty squarely on the open source side, even if seeing my code being responsible for running more than $100 million in AUM, and not getting paid residuals on the management fees, has discouraged me from continuing to share strategy research (I was always of the opinion that anything I shared using freely available Yahoo data was something that large investment houses had already passed over and run a more sophisticated variant of–apparently, this isn’t the case, so giving away strategy code for free nowadays seems like a very suspect thing to do).

In any case, using image-generating AI has allowed me to make artwork the likes of which I never dreamed of creating before, so I’m of the opinion that a few people taking the side of IP hoarding corporations (such as big pharma hoarding insulin patents, or Disney continuing to milk Mickey Mouse) can go kick rocks, and that IP law has long overreached. That said, I already have StableDiffusion running on my machine, coupled with some more sophisticated models built on top of it (Dreamshaper v5, Deliberate v2, etc.), and still have a lot to learn about it–though I’ll probably need a more powerful machine than a gaming laptop to really crank images out.

Now, my thoughts on AI? So far, I don’t think that AI alone is going to pull a South Park and DERK YERRR JERBZ. At least if you’re a skilled professional.

Not alone, anyway. And for what it’s worth, as a quantitative research analyst, I found that I’d fire ChatGPT4 within a few hours, owing to the artifact of AI hallucinations. As the phrase goes, trust *but verify*. I think we’re still far, far away from ChatGPT4 being able to read a strategy paper on Quantpedia, replicating it in R or Python, running the code, and replicating the results. For that matter, I think that any strategy paper submission or blog post is incomplete without data and code, for immediate replication, ready to go. It’s the 21st century. Github or git out. A minimum reproducible example (MRE) as a demo of the concept (even a toy one), in my opinion, is vastly more useful than 25 pages of mathematical derivation. Though that’s just me speaking given my background of hands-on engineering and actually *doing things* rather than merely theorizing about them.

Okay, so…this post seems to have gotten long in the tooth. In other news, I’m in the job market, currently, and as I’ve shown, I have a fairly solid command of Python, having several years of off and on experience with it. And I’m also wondering if a written blog is the best way to actually communicate so many of my findings, or if YouTube is the way to go (though I don’t know much about editing videos).

So…yeah. Follow if you like the post, and most importantly:

Try to lose money…less than I do.

Thanks for reading.

My LinkedIn.

To leave a comment for the author, please follow the link and comment on their blog: R – QuantStrat TradeR.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

R-bloggers

R news and tutorials contributed by hundreds of R bloggers

This function VITAL for portfolio backtesting is now in Python, written with the help of chatGPT4

Related

Related

Never miss an update! Subscribe to R-bloggers to receive e-mails with the latest R posts. (You will not see this message again.)

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)