(This article was first published on **Econometrics by Simulation**, and kindly contributed to R-bloggers)

So I have been having difficulty getting my Stata code to look the way I want it to when I post it to my blog. To alleviate this condition I have written an HTML encoder in R. I don't know much about HTML, so it is likely a little clunkier in terms of tags than it needs to be. It still needs some work, but I thought I would post what I have so far in case others would like to use the code to format their Stata code, or modify it to format any language of their choosing. I would like to build a Shiny app in which all the user need do is paste the code and submit it, but that will be for a future post. Here is the code, along with an example using my Stata post from July 30th. You can also find the code on GitHub. Please feel free to submit possible solutions to the two technical hurdles I discuss in my code (the inability to find and format numbers, and the difficulty of finding and formatting punctuation).

R-Code

# A Stata HTML formatter in R

# Load up your Stata do file.

txt <- readLines(

"https://raw.github.com/EconometricsBySimulation/2013-07-30-Diff-n-Diff/master/Code.do")

# First substitute out all of the < and > characters, which can be misinterpreted as

# tags in HTML.

txt <- gsub("<", "&lt;", txt)

txt <- gsub(">", "&gt;", txt)

# Choose the formatting tags you would like applied to each field type.

comment.start <- '<span style="color: #669933">'

comment.end <- '</span>'

# I would like to auto format all numbers but I have not yet been able to figure

# out how to do this.

num.start <- '<span style="color: #990000"><b>'

num.end <- '</b></span>'

punc.start <- '<span style="color: #0000FF">'

punc.end <- '</span>'

command1.start <- '<span style="color: #0000CC"><b>'

command1.end <- '</b></span>'

command2.start <- '<span style="color: #9900FF">'

command2.end <- '</span>'

command3.start <- '<span style="color: #990033">'

command3.end <- '</span>'

# I am not sure exactly where I got these command lists

stata.commands1 <- unlist(strsplit(readLines(

"https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/C1.txt")

, split=" "))

stata.commands2 <- unlist(strsplit(readLines(

"https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/C2.txt")

, split=" "))

stata.commands3 <- unlist(strsplit(readLines(

"https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/C3.txt")

, split=" "))

punc <- unlist(strsplit(readLines(

"https://raw.github.com/EconometricsBySimulation/RFormatter/master/Stata/Punc.txt") , split=" "))

# I want to figure out how to highlight the punctuation as well but I am having trouble

# with that.

# for (v in punc) txt<- gsub(v,

# paste0(punc.start,v,punc.end), txt)
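A sketch of one possible fix, since the post asks for suggestions here. My guess at why the loop above fails: most punctuation symbols are regex metacharacters, and replacing them one at a time also re-matches the `=`, `/`, and quotes inside tags inserted by earlier passes. A single `gsub()` with a character class and a backreference sidesteps both problems, because one pass never rescans its own replacements. The sample line and the symbol set are illustrative, not the contents of Punc.txt:

```r
# Wrap each punctuation symbol in the punc tags in a single pass.
# The character class treats ( ) = + * / , - as literals.
punc.start <- '<span style="color: #0000FF">'
punc.end <- '</span>'
line <- "gen y = (x + 1)"
line <- gsub("([()=+*/,-])", paste0(punc.start, "\\1", punc.end), line)
```

Note that this pass should run before any other HTML tags are inserted, or the slashes in the already-inserted `</span>` tags would themselves get wrapped.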

# Create a vector to tell R to ignore entire lines.

comment <- rep(0, length(txt))

# '*' Star comment recognizer

for (i in grep("[*]", txt)) {

# Break each line apart to discover if the first symbol which is not a space is a *

txt2 <- strsplit(txt[i], split=" ")[[1]]

if (txt2[txt2!=""][1]=="*") {

txt.rep <- paste(c(comment.start,txt[[i]],comment.end), collapse="")

txt[[i]] <- txt.rep

comment[i] <- 1

}

}
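For what it is worth, the strsplit() loop above can also be written as one vectorized test (my rewrite, not the original code): a star comment is simply a line whose first non-space character is a `*`. The tags are as defined earlier; the sample vector stands in for the do file:

```r
# Flag star-comment lines and wrap them in the comment tags in one pass
comment.start <- '<span style="color: #669933">'
comment.end <- '</span>'
txt <- c("* a comment", "  * indented comment", "gen x = 1")
is.star <- grepl("^[[:space:]]*[*]", txt)
txt[is.star] <- paste0(comment.start, txt[is.star], comment.end)
comment <- as.numeric(is.star)
```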

# '//' Comment recognizer

for (i in (grep("//", txt))) if (comment[i]==0) {

txt2 <- strsplit(txt[i], split=" ")[[1]]

comment.place <- grep("//", txt2)[1]-1

txt.rep <- paste(c(if (comment.place > 0) txt2[1:comment.place], comment.start,

txt2[(comment.place + 1):length(txt2)], comment.end), collapse=" ")

txt[[i]] <- txt.rep

}
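An alternative sketch of the same step (my rewrite): a single sub() per line can wrap everything from the first "//" to the end of the line, with no word splitting needed. Tags and the sample line are as in the post:

```r
# Wrap everything from the first // to the end of the line in comment tags
comment.start <- '<span style="color: #669933">'
comment.end <- '</span>'
line <- "expand 2, gen(t) // double our data"
line <- sub("(//.*)$", paste0(comment.start, "\\1", comment.end), line)
```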

# Format stata commands that match each list

# "\\<",v,"\\>" ensures only entire word matches

# are used.

for (v in stata.commands1) txt[comment==0]<-

gsub(paste0("\\<",v,"\\>"),

paste0(command1.start,v,command1.end),

txt[comment==0])

for (v in stata.commands2) txt[comment==0]<-

gsub(paste0("\\<",v,"\\>"),

paste0(command2.start,v,command2.end),

txt[comment==0])

for (v in stata.commands3) txt[comment==0]<-

gsub(paste0("\\<",v,"\\>"),

paste0(command3.start,v,command3.end),

txt[comment==0])
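The three loops above differ only in their word list and tags, so they can share one small helper function (my refactoring, not part of the original post):

```r
# Wrap every whole-word match of each command in the given start/end tags
tag.words <- function(txt, words, start, end) {
  for (v in words)
    txt <- gsub(paste0("\\<", v, "\\>"), paste0(start, v, end), txt)
  txt
}

# e.g. with the command1 tags defined above:
command1.start <- '<span style="color: #0000CC"><b>'
command1.end <- '</b></span>'
out <- tag.words("gen x = 1", c("gen", "set"), command1.start, command1.end)
```

Each of the three passes then becomes one call, e.g. `txt[comment==0] <- tag.words(txt[comment==0], stata.commands1, command1.start, command1.end)`.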

# This is my attempt at highlighting all numbers that are not words.

# It did not work.

# stackoverflow topic: http://stackoverflow.com/questions/18160131/replacing-numbers-r-regular-expression

# txt <- gsub(".*([[:digit:]]+).*", paste0(num.start,"\\1",num.end), txt)
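Since the post asks for help here, a sketch of why the attempt fails and one possible fix: the greedy `.*` on either side consumes the whole line, so any line containing a digit collapses into just its last digit wrapped in tags. A lookbehind that refuses matches glued to a letter, digit, dot, or underscore highlights numbers without hitting names like eta1 or v2. The num tags are as defined above; the sample line is from the example do file:

```r
# Wrap standalone numbers (including decimals like .5) in the num tags;
# perl = TRUE enables the (?<!...) lookbehind
num.start <- '<span style="color: #990000"><b>'
num.end <- '</b></span>'
line <- "replace eta1 = eta1 + treat*1 + t*.5"
line <- gsub("(?<![[:alnum:]._])([0-9]*\\.?[0-9]+)",
             paste0(num.start, "\\1", num.end), line, perl = TRUE)
```

Like the punctuation pass, this should run before any HTML tags are inserted, or the digits in the color codes would be re-matched.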

# Add tags to the end and beginning to help control the general format.

txt <- c('<pre><span style="font-family: monospace">',txt,

'\nFormatted By <a href="http://www.econometricsbysimulation.com">EconometricsbySimulation.com</a>',

'</span></pre>')

# Copy formatted HTML to the clipboard.

writeClipboard(paste(txt, collapse="\n"))
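One caveat: writeClipboard() only exists on Windows. On other platforms, writing the result to a file is a portable alternative (the sample vector and file name here are my choice, standing in for the vector built above):

```r
# Write the formatted HTML to a file instead of the clipboard
txt <- c("<pre>", "gen x = 1", "</pre>")  # stands in for the formatted vector
html <- paste(txt, collapse = "\n")
out <- file.path(tempdir(), "formatted.html")
writeLines(html, out)
```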

Stata code formatting example:

(https://raw.github.com/EconometricsBySimulation/2013-07-30-Diff-n-Diff/master/Code.do)

**clear**

set **obs** 4000

**gen** id = _n

**gen** eta1 = rnormal()

**gen** eta2 = rnormal()

* Generate 5 irrelevant factors that might affect each of the

* different responses on the pretest

**gen** f1 = rnormal()

**gen** f2 = rnormal()

**gen** f3 = rnormal()

**gen** f4 = rnormal()

**gen** f5 = rnormal()

* Now let's apply the treatment

**expand** 2, **gen**(t) // double our data

**gen** treat=0

replace treat=1 if ((id<=_N/4)&(t==1))

* Now let's generate our changes in etas

replace eta1 = eta1 + treat*1 + t*.5

replace eta2 = eta2 + treat*.5 + t*1

* Finally we generate our pre and post test responses

**gen** v1 = f1*.8 + eta1*1 + eta2*.4 // eta1 has more loading on

**gen** v2 = f2*1.5 + eta1*1 + eta2*.3 // the first few questions

**gen** v3 = f3*2 + eta1*1 + eta2*1

**gen** v4 = f4*1 + eta1*.2 + eta2*1 // eta2 has more loading on

**gen** v5 = f5*1 + eta2*1 // the last few questions

* END Simulation

* Begin Estimation

**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0

predict L1 L2, latent

**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1

predict L12 L22, latent

replace L1 = L12 if t==1

replace L2 = L22 if t==1

* Now let's see if our latent predicted factors are correlated with our true factors.

corr eta1 eta2 L1 L2

* We can see already that we are having problems.

* I am no expert on SEM so I don't really know what is going wrong except

* that eta1 is reasonably highly correlated with L1 and L2, and

* eta2 is less highly correlated with L1 and L2; each factor is

* correlated with both about equally, which is not what we want.

* Well too late to stop now. Let's do our diff in diff estimation.

* In this case we can easily accomplish it by generating one more variable.

* Let's do a seemingly unrelated regression form to make a single joint estimator.

sureg (L1 t id treat) (L2 t id treat)

* Now we have estimated the effect of the treatment given a control for the

* time effect and individual differences. Can we be sure of our results?

* Not quite. We are treating L1 and L2 like observed variables rather than

* random variables we estimated. We need to adjust our standard errors to

* take this into account. The easiest way though computationally intensive is

* to use a bootstrap routine.

* This is how it is done. Same as above but we will use temporary variables.

cap program drop SEMdnd

program define SEMdnd

tempvar L1 L2 L12 L22

**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==0

predict `L1' `L2', latent

**sem** (L1 -> v1 v2 v3 v4 v5) (L2 -> v1 v2 v3 v4 v5) if t==1

predict `L12' `L22', latent

replace `L1' = `L12' if t==1

replace `L2' = `L22' if t==1

sureg (`L1' t id treat) (`L2' t id treat)

drop `L1' `L2' `L12' `L22'

**end**

SEMdnd // Looking good

* This should do it though I don't have the machine time available to wait

* for it to finish.

bs , **rep**(200) cluster(id): SEMdnd

Formatted By EconometricsbySimulation.com
