(This article was first published on **Econometrics By Simulation**, and kindly contributed to R-bloggers)

I found myself easily convinced by the strength of his arguments, yet I was also curious how he produced sample data that fit his statistical argument so perfectly. Given that he had only 11 data points, I am inclined to think he played around with the data by hand until it fit his needs. This is suggested by the lack of precision in the statistics of the generated data (Anscombe’s quartet).

If he could do it by hand, I should be able to do it with an algorithm!

**Method 1 – randomly draw some points then select the remaining – fail**

The idea was to randomly draw the first 10 points, then solve for the 11th point. This approach quickly fails, however, because it relies too heavily on the 11th point. Say the mean of the first 10 draws was unusually low, at 8. To bring the sample mean back to the target of 9, the 11th point would need to be 19 (since 10 × 8 + 19 = 11 × 9). Then you must somehow manage the variance, which you know is already going to be blown up by the presence of that extreme 11th value.
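To make the failure concrete, here is a minimal R sketch of the balancing arithmetic (my own illustration, not the original code; the targets Mean(X) = 9 and Var(X) = 7.5 are taken from the parameters used later in the post):

```r
set.seed(7)
target_mean <- 9

# Draw the first 10 points near the target distribution
x10 <- rnorm(10, mean = 9, sd = sqrt(7.5))

# Solve for the 11th point that forces the sample mean to exactly 9:
# sum(x10) + x11 = 11 * 9
x11 <- 11 * target_mean - sum(x10)

x <- c(x10, x11)
mean(x)   # exactly 9
var(x)    # uncontrolled, and inflated whenever x11 is extreme
```

The mean constraint is easy to satisfy, but there is no remaining degree of freedom to control the variance.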

**Method 2 – use optimization to select points which match the desired outcome – fail**

**Method 3 – brute force, randomly generate data – fail**

The intent of this approach was to generate data *close* to the target parameters, then modify individual data points to match the desired properties exactly.
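A brute-force version of the first step can be sketched as follows (my own illustration, not the original code): redraw repeatedly and keep the candidate whose sample moments land closest to the targets.

```r
set.seed(1)
best <- NULL
best_err <- Inf

for (i in 1:5000) {
  x <- rnorm(11, mean = 9, sd = sqrt(7.5))
  # Distance of the sample moments from the targets
  err <- abs(mean(x) - 9) + abs(var(x) - 7.5)
  if (err < best_err) {
    best_err <- err
    best <- x
  }
}

c(mean(best), var(best))  # close to the targets, but never exact
```

Even after thousands of draws the moments only approximate the targets, which is why this method fails and point-by-point modification becomes necessary.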

**Method 4 – modify random data to meet parameter specifications**

The trick is to define *a* to be a multiplicative scalar for x, rescaling the randomly drawn values so that the sample moments hit the targets exactly:

Mean(X) = 9, Var(X) = 7.5, B0 = 3, B1 = 0.5, Cor(X, Y) = 0.8.
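In R, this rescaling construction can be sketched as follows (illustrative code, not the original implementation; the error variance needed to hit Cor(X, Y) = 0.8 is derived from Var(U) = B1² · Var(X) · (1/ρ² − 1)):

```r
set.seed(42)
n  <- 50
b0 <- 3; b1 <- 0.5; mx <- 9; vx <- 7.5; rho <- 0.8

# Draw x, then rescale so the sample mean and variance are exact
x <- rnorm(n)
x <- (x - mean(x)) / sd(x) * sqrt(vx) + mx

# Draw an error term and replace it with its OLS residuals on x,
# which makes it exactly orthogonal to x with exactly zero mean
u <- rnorm(n)
u <- resid(lm(u ~ x))

# Scale u so the correlation comes out exactly at rho:
# var(u) must equal b1^2 * vx * (1/rho^2 - 1)
u <- u / sd(u) * sqrt(b1^2 * vx * (1 / rho^2 - 1))

y <- b0 + b1 * x + u
coef(lm(y ~ x))   # intercept 3 and slope 0.5, up to floating point
cor(x, y)         # 0.8, up to floating point
```

Because u is exactly orthogonal to x with zero mean, the OLS estimates equal B0 and B1 exactly, no matter what distribution the raw draws came from.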

**Sample Data – Using Anscombe’s Parameters**

`Table 1:` The regression output below is **exactly** identical regardless of how the data is generated.

```
Call: lm(formula = y ~ x, data = xy8)

Residuals:
     Min       1Q   Median       3Q      Max
-2.30594 -0.99280  0.02465  0.91005  2.77910

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x            0.50000    0.05413   9.238 3.18e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared: 0.64,  Adjusted R-squared: 0.6325
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12
```

Figure 1: Graphs 1–4 are recreations of Anscombe’s quartet; graphs 5 and 6 are new.

**Sample Data – Using Negative Slope Parameters**

`Table 2:`

```
Call: lm(formula = y ~ x, data = xy1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3862 -0.6586 -0.2338  0.5721  3.6159

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.00000    0.51854   5.785 5.32e-07 ***
x           -0.50000    0.05413  -9.238 3.18e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.257 on 48 degrees of freedom
Multiple R-squared: 0.64,  Adjusted R-squared: 0.6325
F-statistic: 85.33 on 1 and 48 DF,  p-value: 3.18e-12
```

Figure 2: Same as Figure 1 except B1 = -0.5.

**Summary**

Supplementary data like this, presented in graphs, will likely not be the basis on which the validity of your statistical arguments is judged, but it will add credibility.

__CODE__

Find my code for generating exact linear relationships between X and Y, regardless of the dependence of the errors U on X (u|x).
