Analyzing Customer Data from Square

[This article was first published on Blog-rss on, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

The Square Data Model

Whether you own your own business or consult for a business using Square to capture payment data, Square can offer some amazing opportunities to gain insights by leveraging their Connect v1 & v2 APIs. The Square data backend operates much like a CRM system that holds information about transactions that customers make when purchasing items offered at a location. Naturally, there are API endpoints for each of the object types italicized above (Locations, Customers, Transactions, and Items) and many more endpoints for administrative tasks (employees, roles, timecards, refunds, etc.).


While working with Square data I decided to develop an R package squareupr that makes it easier to retrieve Square data from R so that you can focus on the analysis. After installing the squareupr package you must authenticate by supplying a personal access token (PAT) or using an OAuth 2.0 flow. You can find your PAT by logging into the Square Dashboard -> Apps -> My Apps. Create an app or click “Manage App” if you’ve already created one and there you should see your personal access token:

# The squareupr package is not yet available on CRAN so you must install from GitHub
# install.packages("devtools")


# authenticate using your Personal Access Token (PAT)
sq_auth(personal_access_token = "sq-Th1s1sMyPers0nalAcessT0ken")

The package also offers OAuth 2.0 authentication. More information is available here.

Pulling Transaction Data

As mentioned above there are endpoints for every major type of data stored by Square. The API documentation does a very good job at laying out how requests should be made to each of those endpoints. One thing to note is that it is important to first pull down the location details for your business because the location is often required when searching for things like transactions and items.

our_locations <- sq_list_locations()
our_locations$name <- "{HIDDEN}"
our_locations %>% select(id, name, address, timezone, 
                        capabilities, status, created_at)
#> # A tibble: 5 x 7
#>   id            name     address  timezone  capabilities status created_at
#>   <chr>         <chr>    <list>   <chr>     <list>       <chr>  <chr>     
#> 1 46FYN9N9RQS54 {HIDDEN} <list [… America/… <list [1]>   ACTIVE 2017-04-2…
#> 2 DRDCJ2X8E2PMV {HIDDEN} <list [… America/… <list [1]>   ACTIVE 2016-09-2…
#> 3 8T1TYXE840S00 {HIDDEN} <list [… America/… <list [1]>   ACTIVE 2016-09-2…
#> 4 1AWPRVVVFWGQF {HIDDEN} <list [… America/… <list [1]>   ACTIVE 2017-04-1…
#> 5 50X1GNAWEC8V0 {HIDDEN} <list [… America/… <list [1]>   ACTIVE 2017-03-0…

Now that you have the location ids if you would like to pull all of the transactions during a given timeframe, you would use the function sq_list_transactions().

# list all transactions for our 2nd location on May 11, 2018
# by default, if a date is provided with no time, then the time component is set to midnight
our_transactions <- sq_list_transactions(location = our_locations$id[2], 
                                         begin_time = as.Date('2018-05-11'), 
                                         end_time = as.Date('2018-05-12'))
#> # A tibble: 245 x 6
#>    id          location_id  created_at    tenders product client_id       
#>    <chr>       <chr>        <chr>         <list>  <chr>   <chr>           
#>  1 bUjFGVjBvN… DRDCJ2X8E2P… 2018-05-12T0… <list … REGIST… D5528FBA-E5DE-4…
#>  2 5PZP31N5Zs… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… A3A1FF51-325A-4…
#>  3 BTrGydD6he… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… 2B3D32EB-8E58-4…
#>  4 XsqOAHl68z… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… C50AF3D7-BE32-4…
#>  5 vmLRzrwByS… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… 52E40E1B-2333-4…
#>  6 pTbzQApZW7… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… 962766FF-1436-4…
#>  7 lnE20zklpP… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… A02191CC-9AC9-4…
#>  8 DSumrqQW0L… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… 1135FF4F-9B89-4…
#>  9 tPwFXetIwe… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… 0D95E79D-B44C-4…
#> 10 bqUuFrzH71… DRDCJ2X8E2P… 2018-05-11T2… <list … REGIST… 48FD6A49-80A9-4…
#> # ... with 235 more rows

At first glance there does not appear to be very much detail on the transaction record. However, the tender field represents a method of payment used in a Square transaction so it contains information regarding the amount of money paid in total, in Square fees, and tip. The tender field even contains information regarding the customer_id and credit card information. In the following I will loop through the transactions in April 2018 and determine the total spend for each customer in the dataset. Note: The call to sq_list_transactions may take a couple minutes to complete if you are pulling thousands of transactions.

april_transactions <- sq_list_transactions(location = our_locations$id[2], 
                                           begin_time = as.Date('2018-04-01'), 
                                           end_time = as.Date('2018-05-01'))

In order to extract the customer ID and money spent I create a function that checks for the tender object on the transaction and if it exists tries to extract the data into a tbl_df. When I supply this function as map_df(extract_cust_info_func) %>% I get the data from each transaction stacked into a single tbl_df that’s ready to analyze.

# create a function that will extract out just the customer id and money spent
extract_cust_info_func <- function(x){
    tibble(customer_id = sq_null_to_na(x$tender[[1]]$customer_id),
           money_spent = sq_null_to_na(x$tender[[1]]$amount_money$amount))  
  } else {
    tibble(customer_id = NA_character_, 
           money_spent = NA_integer_)

april_customer_spend <- april_transactions %>% 
  transpose() %>%
  # pull out just the information we want from each transaction
  map_df(extract_cust_info_func) %>%
  group_by(customer_id) %>%
  summarize(total_spend = sum(money_spent, na.rm=TRUE))

#> # A tibble: 208 x 2
#>    customer_id                total_spend
#>    <chr>                            <int>
#>  1 064HFDQG0N52AHDBSG00C1BAC8        1700
#>  2 07VNWH1V4S6W4W2EJ4AN7SEJNR           0
#>  3 08M453QNJ97BCT97SM09TN7QK4           0
#>  4 08XE43X8FS0MPX8P2W4N0DEQY0         350
#>  5 0CZ78CVRW12V7AZET8S3S82GGW         675
#>  6 0G1J81148H42GGMTMQKRWSJHGC           0
#>  7 0V1Y1BX23WYRK889ERVBE2T0KM         900
#>  8 13HHFBFZTD33RX0RSJNAZQKV5M           0
#>  9 1BFCHB9MK91GQ39HTD7QK6R7ZR           0
#> 10 1DR0AK5GKX57H9ER9SG1JF01P0        2900
#> # ... with 198 more rows

The amounts in april_customer_spend may seem large, but the Square APIs return money as integers that represent whole cents. If you divide by 100, then you will have the money amounts in dollars.

Spend by Customer Group

Square has this concept of “groups” that customers belong to. These groups can be fashioned to do marketing campaigns complete with email blasts. In our analysis let’s further determine which groups these customers belong to. The Square API has an endpoint to retrieve one customer at a time; however, with large lists you may get subjected to rate limiting. Rate limiting is errors on your requests because too many are coming from the same application or access token. I would recommend pulling down the entire list of customers with sq_list_customers() and then matching them up in R.

cust_groups <- sq_list_customers() %>%
  select(id, groups) %>%
  sq_extract_cust_groups() %>%
  # filter to only the groups designated automatically by Square

#> # A tibble: 13,444 x 3
#>    id                             
#>    <chr>                      <chr>                     <chr>             
#>  2 58MK9F1HQ5447D1QZDX60NHTP4 CQ689YH4KCJMY.CHURN_RISK  Lapsed            
#>  3 58MK9F1HQ5447D1QZDX60NHTP4 CQ689YH4KCJMY.REACHABLE   Reachable         
#>  5 MBSJA4QV4WX6N2XV8WV9VJJTG8 CQ689YH4KCJMY.REACHABLE   Reachable         
#>  6 ZCBZJ234217KTV812WX4DP2404 CQ689YH4KCJMY.REACHABLE   Reachable         
#>  8 FKEMR8KZCN3BH98RV78PKHKQ1R CQ689YH4KCJMY.LOYAL       Regulars          
#> 10 QASM1G54VX0QN2S15YS6KHEFCC CQ689YH4KCJMY.LOYAL       Regulars          
#> # ... with 13,434 more rows

Now that we know the customer memberships, let’s join back with the transaction data to determine the average total spend across the different membership groups.

cust_groups %>% 
  # bring in the spend data
  inner_join(., april_customer_spend, by=c('id'='customer_id')) %>%
  # group by the customer groups to find average spend per group
  group_by( %>%
  summarize(avg_spend = mean(total_spend / 100, na.rm=TRUE))
#> # A tibble: 6 x 2
#>              avg_spend
#>   <chr>                        <dbl>
#> 1 Cards on File                569. 
#> 2 Lapsed                        20.3
#> 3 Loyalty Participants          13.5
#> 4 Nearing a Loyalty Reward      15.3
#> 5 Reachable                     51.0
#> 6 Regulars                      27.4

Issues with the APIs

What is great about this analysis is that we can use the Square APIs to quickly and reliably pull down transaction data, match it to customer information and see how certain customer groupings or campaigns are performing. However, I did find some quirks and drawbacks that I wish the Square team would consider:

Cannot Request Specific Fields

First, it would be nice to implement, as part of the query parameters, the ability to only return certain fields (e.g. the id and groups fields from the Customer object). This would help for two reasons: 1) It would improve the speed since only the required data would be passed across and 2) For analysts who do not want to deal with personally identifiable information (even in RAM) you would not be forced to pull information like phone number, email, names, and even credit card information.

Cannot Update Customer Groups Programmatically

Second, the API will not allow you to update the customer groups programmatically. This means that you are stuck using the dashboard to create and assign customers to the groups you want to analyze. It is a drag when you want to create a reproducible research workflow to assign customer groups but the API hinders your ability to do so.

Customer ID Not On Transactions

Finally, it appears that the customer ID associated to a transaction is not reliably captured on every transaction’s tender object. This means that you cannot get a complete picture of all the transactions. I believe part of this is an issue with the merchant capturing the customer information at the point of sale, but also partly Square’s fault since they seem to be able to piece everything together in their dashboard. However, this is not the case with the APIs. Overall, the Square APIs are a rich data resource for helping run a business and they should only get better with time as development progresses.

To leave a comment for the author, please follow the link and comment on their blog: Blog-rss on offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)