# Intermediate SQL for Data Science – Analytical Functions Deep Dive

**r – Appsilon | End to End Data Science Solutions**, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

## Intermediate SQL for Data Science

Running data queries in the database can offer significant speed improvements over doing so in R or Python. There’s no need to drag the entire dataset to memory and run the calculations once the loading completes. The runtime differences can be drastic, depending on the dataset size.

That’s why SQL is your best friend when it comes to larger datasets. And there’s no better way to improve your SQL skills than going over a couple of intermediate concepts, wrapped in a little something called **analytical functions**. These functions perform computations over multiple rows, but they also return multiple rows. We’ll go over a couple of them in this tutorial.

This article is structured as follows:

## Database and Data Introduction

This article assumes you have the PostgreSQL database installed and configured. It also assumes basic knowledge of SQL, so words like `SELECT`

, `WHERE`

, `BETWEEN`

, `FROM`

, and the others shouldn’t feel new.

With regards to the data, we’ll use a small table called ORDERS which you can download from here. The previous URL contains SQL code for generating and populating three tables, so please execute it before proceeding.

If you did everything correctly, the following `SELECT`

statement:

Should yield the following dataset:

Which means you’re ready to proceed.

## Running Averages

If you have any experience with SQL, it’s likely you’re familiar with aggregation functions such as `SUM`

, `AVG`

, `MIN`

, and `MAX`

. It’s also likely you’ve used them in the `GROUP BY`

clause. As it turns out, you can also use them in `ORDER BY`

to obtain a running total, average, minimum, or maximum.

Let’s go over a concrete example to make this more clear. You want to monitor the states of your sales agent, and want to see their performance in the third quarter of 2008. To do so, you can calculate the running average revenue and the total revenue obtained.

Here’s the code:

And here are the results:

And that’s how easy it is! Let’s proceed to the next one.

## First Values

In Postgres, you can use the `FIRST_VALUE`

analytical function to return the value of a specified column from the first row of the window frame. Similarly, you can use the `LAST_VALUE`

and `NTH_VALUE`

functions.

In our Orders table example. you could use the `FIRST_VALUE`

function to check the date gap between the first and the next purchase per customer. Here’s the code for doing so:

Here are the results:

Take a moment and think of all use cases when functionality like this could be useful. More than a handful, I’m sure.

Let’s proceed to the next one.

## Leads and Lags

As the name suggests, the `LEAD`

function fetches the value of a specific attribute from the next row and returns it in the current row. It takes two arguments:

`COLUMN_NAME`

– name of the attribute from which the next value is fetched`INDEX`

– number of rows relative to the current one

On the other hand, the `LAG`

function does the opposite. It fetches the value for a column of interest from the previous `INDEX`

rows.

Here’s an example – we want to find out what is the last highest amount for which an order has been sold. Here’s the code:

And here are the results:

Let’s cover one more analytical function before calling it a day.

## Ranking

In PostgreSQL, you can use `RANK`

and `DENSE_RANK`

as numbering functions. They are here to assign an integer value to a row and are particularly useful when you have to find the nth highest or lowest record from the table.

The two functions are a bit different when it comes to assigning integer values. `DENSE_RANK`

will return consecutive ranks, while `RANK`

will return ranking in such a way where a rank is skipped in case of a tie.

For example, ranking with `DENSE_RANK`

would return (1, 2, 2, 3), while ranking with `RANK`

would return (1, 2, 2, 4) – hence a skipped rank due to a tie.

Let’s see this in action – we want to find the second highest order values for each month. Here’s the code:

And here are the results:

And that’s just enough for today. Let’s wrap things up in the next section.

## Conclusion

And there you have it – a handful of analytical functions to take your SQL and database knowledge to the next level. These are particularly useful in data science, as most of the time the worthy insights are hidden, and the only way to obtain them is through some creative data manipulation.

Analytical functions provide a perfect way for doing so.

## Learn More

- What Can I Do With R? 6 Essential R Packages for Programmers
- How to Analyze Data with R: A Complete Beginner Guide to dplyr
- Hands-on R and dplyr – Analyzing the Gapminder Dataset
- 7 Must-Have Skills to Get a Job as a Data Scientist
- Tableau vs. R Shiny: Which Excel Alternative is Right For You?

**Appsilon is hiring for remote roles! See our ****Careers**** page for all open positions, including a Senior React Developer and ****Senior R Shiny Developers****. Join Appsilon and work on groundbreaking projects with the world’s most influential Fortune 500 companies.**

Article Intermediate SQL for Data Science – Analytical Functions Deep Dive comes from Appsilon | End to End Data Science Solutions.

**leave a comment**for the author, please follow the link and comment on their blog:

**r – Appsilon | End to End Data Science Solutions**.

R-bloggers.com offers

**daily e-mail updates**about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.