Time series visualizations with wind turbine energy data in R

October 27, 2018
By

(This article was first published on R Programming – DataScience+, and kindly contributed to R-bloggers)

    Categories

    1. Visualizing Data

    Tags

    1. Data Visualisation
    2. R Programming
    3. Time Series

    One of the sectors with a huge demand for data science/analysis is the energy sector. A branch of this sector where demand is high is the green wind energy turbine sector. In this analysis, you will learn to do a time series wind turbine analysis in R.

    Read packages into R library

    First we need to load the R packages into our R library:

    # Load R packages into the library
    library(plyr)
    library(dplyr)
    library(plotly)
    library(ggplot2)
    library(skimr)
    library(reshape2)
    library(lubridate)
    

    Next it is time to load the dataset and do some data variable changes. We will also look at som descriptive statistics with the skimr package. We use open source wind turbine data:

    # Read dataset into R
    data<-read.csv(file="/WIND_VXE_2013.csv", header=TRUE)
    # New optimized column names
    new.names<-c("date_time","T1_Possible_Power","T2_Possible_Power","T3_Possible_Power","T4_Possible_Power","T5_Possible_Power","T6_Possible_Power","T7_Possible_Power","T1_Total_Active_Power","T2_Total_Active_Power","T3_Total_Active_Power","T4_Total_Active_Power","T5_Total_Active_Power","T6_Total_Active_Power","T7_Total_Active_Power","mean_wind_mps", "min_wind_mps", "max_wind_mps", "cum_energy_delivered_kwh")
    # Data management
    cbind(names(data), new.names)
    names(data)<-new.names
    # Data type
    Cumulative<-subset(data, select=c(1,19))
    Possible<-subset(data, select=1:8)
    Active<-subset(data, select=c(1,9:15))
    Wind<-subset(data, select=c(1,16:18))
    # Data selection
    dat<-data 
    dim(dat)
    str(dat)
    skim(dat)
    Skim summary statistics
     n obs: 52560 
     n variables: 19 
    
    -- Variable type:factor --------------------------------------------------------
      variable missing complete     n n_unique                     top_counts ordered
     date_time       0    52560 52560    52560 1/1: 1, 1/1: 1, 1/1: 1, 1/1: 1   FALSE
    
    -- Variable type:integer -------------------------------------------------------
                     variable missing complete     n       mean         sd      p0        p25       p50        p75    p100      hist
     cum_energy_delivered_kwh      59    52501 52560 5052061.08 2745939.75      78 2741038    5152314   7152592      1e+07 ????????
            T1_Possible_Power     725    51835 52560     475.87     297.55      -3     191        518       772        850 ????????
        T1_Total_Active_Power     725    51835 52560 4608851.25  827476.83 3109970 3895622    4744043   5262894.5    6e+06 ????????
            T2_Possible_Power     710    51850 52560     483.49     295.1       -3     204        530       774        850 ????????
        T2_Total_Active_Power     710    51850 52560 2189450.03  884214.18  609852 1391641.25 2341906   2894952    3690817 ????????
            T3_Possible_Power     927    51633 52560     496.81     297.6       -2     214        552       792        850 ????????
        T3_Total_Active_Power     927    51633 52560 4870725.97   9e+05    3254759 4066341      5e+06   5586471    6413840 ????????
            T4_Possible_Power     654    51906 52560     517.58     305.05      -3     222        598       819        850 ????????
        T4_Total_Active_Power     654    51906 52560   5e+06     923361.64 3341303 4168934.75 5159420.5 5736883    6575858 ????????
            T5_Possible_Power     685    51875 52560     495.31     306.21      -3     192        553       805        850 ????????
        T5_Total_Active_Power     685    51875 52560 4827586.6   886361.55 3230186   4e+06      5e+06   5531891    6326853 ????????
            T6_Possible_Power     652    51908 52560     489.35     297         -2     206        537       786        850 ????????
        T6_Total_Active_Power     652    51908 52560 4903692.83  909486.89 3264175 4085929    5057922.5 5624684.75 6451012 ????????
            T7_Possible_Power     710    51850 52560     471.98     302.52      -6     180        497       785        850 ????????
        T7_Total_Active_Power     710    51850 52560 4712335.03  874375.89 3136754 3919522.75 4875311.5 5410487.5  6193951 ????????
    
    -- Variable type:numeric -------------------------------------------------------
          variable missing complete     n  mean   sd p0 p25  p50  p75 p100     hist
      max_wind_mps       3    52557 52560 10.55 4.59  0 7.6 11.1 13.9 31.5 ????????
     mean_wind_mps       3    52557 52560  8.08 3.81  0 5.6  8.5 10.9 18.8 ????????
      min_wind_mps       3    52557 52560  5.6  3.02  0 3.6  6    7.8 15.8 ????????
    

    There are some missing values in the dataset – let us clean this up:

    # Remove missing values
    dat<-na.omit(dat)
    skim(dat)
    Skim summary statistics
     n obs: 51183 
     n variables: 19 
    
    -- Variable type:factor --------------------------------------------------------
      variable missing complete     n n_unique                     top_counts ordered
     date_time       0    51183 51183    51183 1/1: 1, 1/1: 1, 1/1: 1, 1/1: 1   FALSE
    
    -- Variable type:integer -------------------------------------------------------
                     variable missing complete     n       mean         sd      p0       p25     p50       p75    p100     hist
     cum_energy_delivered_kwh       0    51183 51183   5e+06    2757020.28      78 2753214.5 5140652 7198371     1e+07 ????????
            T1_Possible_Power       0    51183 51183     478.24     296.22      -3     195       522     773       850 ????????
        T1_Total_Active_Power       0    51183 51183 4603062.95  828267.92 3109970 3887096   4740818 5258988.5   6e+06 ????????
            T2_Possible_Power       0    51183 51183     485.82     293.68      -3     209       533     774       850 ????????
        T2_Total_Active_Power       0    51183 51183 2182348.55  884595.86  609852 1381887.5 2338389 2888334   3690817 ????????
            T3_Possible_Power       0    51183 51183     496.98     297.04      -2     215       551     792       850 ????????
        T3_Total_Active_Power       0    51183 51183 4865350.4    9e+05    3254759 4063042.5   5e+06 5576852   6413840 ????????
            T4_Possible_Power       0    51183 51183     520.57     303.3       -3     229       601     820       850 ????????
        T4_Total_Active_Power       0    51183 51183   5e+06     923805.37 3341303 4159034.5 5155586 5730673.5 6575858 ????????
            T5_Possible_Power       0    51183 51183     497.98     304.76      -3     197       557     805       850 ????????
        T5_Total_Active_Power       0    51183 51183 4821143.7   886888.84 3230186   4e+06     5e+06 5527166.5 6326853 ????????
            T6_Possible_Power       0    51183 51183     491.93     295.43      -2     212       540     787       850 ????????
        T6_Total_Active_Power       0    51183 51183 4895574.95  909954.68 3264175 4074391.5 5053650 5617882.5 6451012 ????????
            T7_Possible_Power       0    51183 51183     474.86     301.2       -6     185       502     786       850 ????????
        T7_Total_Active_Power       0    51183 51183 4705353.38  874167.71 3136754 3910777.5 4872279 5406292   6193951 ????????
    
    -- Variable type:numeric -------------------------------------------------------
          variable missing complete     n  mean   sd p0 p25  p50  p75 p100     hist
      max_wind_mps       0    51183 51183 10.59 4.57  0 7.7 11.1 13.9 31.5 ????????
     mean_wind_mps       0    51183 51183  8.11 3.8   0 5.7  8.6 10.9 18.8 ????????
      min_wind_mps       0    51183 51183  5.62 3.01  0 3.7  6    7.8 15.8 ????????
    

    Great – data is ready for more computing:

    # energy sentout in each timeblock
    n<-length(dat$cum_energy_delivered_kwh)
    a<-dat$cum_energy_delivered_kwh[1:n-1]
    b<-dat$cum_energy_delivered_kwh[2:n]
    diff<-b-a
    dat$energy_sentout_10min_kwh<-c(diff,0)
    # kinetic energy in the wind at each windspeed
    rho=1.225 
    area=2174 
    turbines=7 
    c<-(1/2)*rho*area
    dat$wind_power_kw<-c*(dat$mean_wind_mps)^3*turbines/1000 
    dat$wind_energy_10min_kwh<-c*(dat$mean_wind_mps)^3*turbines/(1000*6) 
    # compute betz limit
    betz.coef<- 16/27
    dat$betz_limit_10min_kwh<-dat$wind_energy_10min_kwh*betz.coef
    # turbine efficiency
    dat$turbine_eff<-dat$energy_sentout_10min_kwh/dat$wind_energy_10min_kwh
    # total Possible Power
    uncurtailed_power<-apply(X=dat[,2:8], MARGIN=1, FUN=sum)
    dat$uncurtailed_10min_kwh<-(uncurtailed_power)/6
    # curtailment
    dat$curtailment_10min_kwh<-dat$uncurtailed-dat$energy_sentout_10min_kwh
    # Na due to 0 division
    inf<-which(dat$turbine_eff == "Inf")
    dat$turbine_eff[inf]<-0
    

    Now it is time to format data to time and create energy data to analysis:

    # Make conversion to time date variables
    dat$date_time1<-as.POSIXlt(dat$date_time, format="%m/%d/%y")
    dat$date_time<-as.POSIXlt(dat$date_time, format="%m/%d/%y %H:%M")
    # Create date variables
    dat$ymd<-ymd(dat$date_time1)
    dat$day<-day(dat$date_time1)
    dat$week<-week(dat$date_time1)
    dat$month<-month(dat$date_time1)
    dat$year<-year(dat$date_time1)
    dat$yw <- as.numeric(paste(dat$year, dat$week, sep = ""))
    dat$hour<-hour(dat$date_time)
    # outlier detection
    datv<-as.data.frame(lapply(dat[c(2:26)], as.numeric)) # Make variables numeric for cor analysis
    colnames(dat)
    OUTresult <- scores(datv, type="t", prob=0.95) #t test, probability is 0.95
    skim(OUTresult)
    # Create energy data
    energy<-subset(dat, select=c("ymd", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh"))
    energy<-ddply(energy, .(ymd), numcolwise(sum))
    

    Now we begin making the time series visualization analysis. Before we do this we deal with outliers:

    # Visualization energy & time
    test<-melt(energy, id.vars=("ymd"))
    test<-na.omit(test)
    # Remove outliers
    remove_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs=c(.025, .975), na.rm = na.rm, ...)
      H <- 1.5 * IQR(x, na.rm = na.rm)
      y <- x
      y[x < (qnt[1] - H)]  (qnt[2] + H)] <- NA
      y}
    testo <- as.data.frame(remove_outliers(test$value))
    names(testo)[names(testo) == "remove_outliers(test$value)"] <- "valueo"
    test1<-cbind(test,testo)
    

    Next we will look at some general visualization plots for data:

    # ggplot2 Histogram  
    ggplot(data = test1,aes (x = valueo))+ geom_histogram(color="darkblue", fill="lightblue")
    ggplot(data = test1,aes (x = valueo,fill= variable))+ geom_histogram()
    # ggplot2 - density plot
    ggplot(data = test1,aes(x = valueo)) + geom_density(fill = 'green',color = 'blue')
    ggplot(data = test1,aes(x = valueo,fill = variable)) + geom_density()
    # boxplot
    boxplot(valueo ~ variable, col="darkgreen", data=test1)
    

    The above coding gives us the following plots:

    Histograms


    Density plots


    Boxplot

    Visualization graph – day data

    Now it is time for the first visualization graph analysis on day data:

    # ggplot2
    ggplot(test1, aes(x=ymd, y=valueo/10^3, group=variable, color=variable, linetype=variable)) +
      geom_line() +
      scale_y_continuous(name="MWh per day") + 
      labs(title="Energy Timeseries") +
      theme_classic()  +
      scale_x_discrete(breaks=test1$day[seq(1, 360, by=60)], labels=abbreviate)
    

    The above coding gives us the following time series graph:

    Visualization graph – monthly data

    Now we will look at monthly data:

    # Monthly format
    energy<-subset(dat, select=c("day","week", "month", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh"))
    energy<-ddply(energy, .(day, week, month), numcolwise(sum))
    

    Let us make a visualization graph of the data:

    # Visualization of energy and time by month - data management
    energy<-subset(dat, select=c("day","week", "month", "energy_sentout_10min_kwh", "wind_energy_10min_kwh", "betz_limit_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh"))
    energy<-ddply(energy, .(day, week, month), numcolwise(sum))
    test<-melt(energy, id.vars=c("day", "week", "month"))
    levels(test$month) <- month.name[1:12]
    # Visualization of energy and time by month
    ggplot(test, aes(x=day, y=value/10^3, group=variable, colour=variable, linetype=variable)) +
      geom_line() +
      facet_wrap(~month, scales="free") +
      scale_y_continuous(name="MWh per day") + 
      labs(title="Monthwise Energy Timeseries") +
      theme_classic() +
      scale_x_discrete(breaks=NULL)
    

    This gives us the following monthly plot:

    Visualization graph – weekly data

    Let us look at energy and time on weekly basis:

    # Data on weekly basis
    # Total energy, per week
    energy<-subset(dat, select=c("week", "energy_sentout_10min_kwh", "uncurtailed_10min_kwh", "curtailment_10min_kwh", "betz_limit_10min_kwh"))
    energy<-ddply(energy, .(week), numcolwise(sum))
    test<-melt(energy, id.vars=("week"))
    # Remove outliers
    remove_outliers <- function(x, na.rm = TRUE, ...) {
      qnt <- quantile(x, probs=c(.025, .975), na.rm = na.rm, ...)
      H <- 1.5 * IQR(x, na.rm = na.rm)
      y <- x
      y[x < (qnt[1] - H)]  (qnt[2] + H)] <- NA
      y}
    testo <- as.data.frame(remove_outliers(test$value))
    names(testo)[names(testo) == "remove_outliers(test$value)"] <- "valueo"
    test1<-cbind(test,testo)
    # ggplot2 energy vs time week
    ggplot(test1, aes(x=week, y=valueo/10^3, group=variable, colour=variable, linetype=variable)) +
      geom_line() +
      scale_y_continuous(name="MWh per week") + 
      labs(title="Energy Timeseries") +
      theme_classic() +
      scale_x_discrete(breaks=levels(test1$week)[seq(1,52, by=8)], labels=abbreviate)
    

    This gives us the following plot:

    References

    1. Using plyr in R – CRAN.R-project.org
    2. Using dplyr in R – CRAN.R-project.org
    3. Using ggplot2 in R – CRAN.R-project.org
    4. Using plotly in R – CRAN.R-project.org
    5. Using skimr in R – CRAN.R-project.org
    6. Using reshape2 in R – CRAN.R-project.org
    7. Using lubridate in R – CRAN.R-project.org

    Related Post

    1. Visualizations for credit modeling in R
    2. Decision Trees and Random Forests in R
    3. Add value to your visualizations in R
    4. Visualize your Portfolio’s Performance and Generate a Nice Report with R
    5. Exploring San Francisco Bay Area’s Bike Share System

    To leave a comment for the author, please follow the link and comment on their blog: R Programming – DataScience+.

    R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...



    If you got this far, why not subscribe for updates from the site? Choose your flavor: e-mail, twitter, RSS, or facebook...

    Comments are closed.

    Search R-bloggers


    Sponsors

    Never miss an update!
    Subscribe to R-bloggers to receive
    e-mails with the latest R posts.
    (You will not see this message again.)

    Click here to close (This popup will not appear again)