
If you use MySQL, there's a default schema called 'information_schema' which contains lots of information about your schemas and tables, among other things. Recently I wanted to know whether a table I use for storing the results of a large number of experiments was anywhere near maxing out. To cut a brief story even shorter, the answer was "not even close" and could be found in 'information_schema.TABLES'. Not being one to pass up an opportunity to procrastinate, I went on to write a short script to produce a global overview of the entire database.

information_schema.TABLES contains, among others, the following fields: TABLE_SCHEMA, TABLE_NAME, TABLE_ROWS, AVG_ROW_LENGTH, DATA_LENGTH and MAX_DATA_LENGTH. We can first have a look at the relative sizes of the schemas with the MySQL query "SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA".
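As a small aside, DATA_LENGTH is reported in bytes, so the raw sums can be hard to read at a glance. A variant of the same query can do the conversion to megabytes on the MySQL side (the column alias SCHEMA_MB below is my own choice, not a standard name):

```r
# Same aggregation as above, but converting bytes to megabytes in SQL.
# The alias SCHEMA_MB is illustrative only.
query <- "SELECT TABLE_SCHEMA,
                 ROUND(SUM(DATA_LENGTH)/1024/1024, 1) AS SCHEMA_MB
          FROM information_schema.TABLES
          WHERE TABLE_SCHEMA != 'information_schema'
          GROUP BY TABLE_SCHEMA"
```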

Code:

library("ggplot2") # You'll need ggplot2 0.9 for this.
library("reshape2")
library("RMySQL")

connection <- dbConnect(MySQL(), user="username", password="XXXXXX", host="127.0.0.1", port=3306, dbname='')

query <- "SELECT TABLE_SCHEMA,SUM(DATA_LENGTH) SCHEMA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA"
result <- dbGetQuery(connection, query)
result$TABLE_SCHEMA <- reorder(result$TABLE_SCHEMA, result$SCHEMA_LENGTH)
p <- ggplot(result) + geom_bar(aes(x=TABLE_SCHEMA, y=SCHEMA_LENGTH), stat="identity") + coord_flip()
p <- p + xlab("") + ylab("Size")
p <- p + opts(title="Schemas' size") # opts() was replaced by ggtitle()/theme() in later ggplot2 releases
print(p)

And for the whole overview, let's break each schema down by tables:

Code:

query <- "SELECT TABLE_SCHEMA,TABLE_NAME,TABLE_ROWS,DATA_LENGTH FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema'"
result <- dbGetQuery(connection, query)
result <- within(result, TABLE_NAME <- factor(TABLE_NAME, levels=sort(unique(TABLE_NAME), decreasing=TRUE)))
p <- ggplot(result) + geom_bar(aes(x=TABLE_NAME, y=DATA_LENGTH), stat="identity") + coord_flip() + facet_wrap(~TABLE_SCHEMA, scales='free')
p <- p + xlab("") + ylab("Size")
p <- p + opts(title="Tables' size")
print(p)

Also, using AVG_ROW_LENGTH and MAX_DATA_LENGTH, and assuming a relatively constant row length, we can derive the maximum number of rows that a table can hold, which gives us an estimate of how much space there is left:

Code:

query <- "SELECT TABLE_SCHEMA,TABLE_NAME,100*TABLE_ROWS/FLOOR(MAX_DATA_LENGTH/AVG_ROW_LENGTH) AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema'"
# query <- "SELECT TABLE_SCHEMA,TABLE_NAME,RAND(42)*100 AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema'"
result <- dbGetQuery(connection, query)
result$LEFTOVER <- 100 - result$USED
result <- within(result, TABLE_NAME <- factor(TABLE_NAME, levels=sort(unique(TABLE_NAME), decreasing=TRUE)))
result <- melt(result, id.vars=c("TABLE_SCHEMA","TABLE_NAME"), variable.name='TYPE', value.name='PROPORTION', na.rm=TRUE)
p <- ggplot(result)
p <- p + geom_bar(aes(x=TABLE_NAME, y=PROPORTION, fill=TYPE), stat='identity')
p <- p + coord_flip() + facet_wrap(~TABLE_SCHEMA, scales='free')
p <- p + scale_fill_manual(values=c("USED"='#DD0000', "LEFTOVER"='#AAAAAA'))
p <- p + xlab('') + ylab('') + opts(title="Tables' usage")
print(p)

query <- "SELECT TABLE_SCHEMA, MAX(100*TABLE_ROWS/FLOOR(MAX_DATA_LENGTH/AVG_ROW_LENGTH)) AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA"
# query <- "SELECT TABLE_SCHEMA, MAX(100*RAND(42)) AS USED FROM information_schema.TABLES WHERE TABLE_SCHEMA!='information_schema' GROUP BY TABLE_SCHEMA"
result <- dbGetQuery(connection, query)
result$LEFTOVER <- 100 - result$USED
result$TABLE_SCHEMA <- reorder(result$TABLE_SCHEMA, result$USED)
result <- melt(result, id.vars=c("TABLE_SCHEMA"), variable.name='TYPE', value.name='PROPORTION', na.rm=TRUE)
p <- ggplot(result)
p <- p + geom_bar(aes(x=TABLE_SCHEMA, y=PROPORTION, fill=TYPE), stat='identity')
p <- p + coord_flip()
p <- p + scale_fill_manual(values=c("USED"='#DD0000', "LEFTOVER"='#AAAAAA'))
p <- p + xlab("") + ylab("") + opts(title="Largest usage")
print(p)

dbDisconnect(connection)

Unless you are using very large tables, those last two graphs should come out pretty much all gray. You can check that the colouring works by using the commented out queries instead, which use random values for the estimates.
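To make the estimate concrete, here is the same arithmetic done by hand in R on made-up figures (a 2-million-row table, a 100-byte average row, and a MAX_DATA_LENGTH of 2^48 - 1 bytes, which is what MyISAM typically reports with the default 6-byte data pointer; all values here are hypothetical):

```r
# Hypothetical figures, for illustration only.
max_data_length <- 2^48 - 1   # MAX_DATA_LENGTH
avg_row_length  <- 100        # AVG_ROW_LENGTH, assumed roughly constant
table_rows      <- 2e6        # TABLE_ROWS

max_rows <- floor(max_data_length / avg_row_length)  # estimated row capacity
used     <- 100 * table_rows / max_rows              # the USED column above
# used comes out as a tiny fraction of a percent: "not even close" to full
```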

About dbConnect(): I left it inline here to make things easier to replicate, but I normally call a simple function which is just a wrapper around it, with my username and password baked in. That way my credentials live in one single place instead of being scattered all over my scripts.
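Such a wrapper can be as simple as the following sketch (the function name and the suggestion to source it from a private file are my own, not from any package):

```r
library("RMySQL")

# Hypothetical helper: keeps credentials in one place.
# Source this from a private file kept out of version control,
# rather than pasting credentials into every analysis script.
myConnect <- function(dbname = '') {
  dbConnect(MySQL(),
            user     = "username",   # your MySQL user
            password = "XXXXXX",     # your password
            host     = "127.0.0.1",
            port     = 3306,
            dbname   = dbname)
}

# Usage: connection <- myConnect(); ...; dbDisconnect(connection)
```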

PS: This is my first anniveRsary! I’ve been using R for a year now. And I’m certainly planning to carry on.