Articles by msuzen

Collaborative data science: High level guidance for ethical scientific peer reviews

May 12, 2020 | msuzen

Preamble. Figure: Catalan Castellers collaborating (Wikipedia). The availability of distributed code tracking tools and associated collaborative tools makes it much easier to build collaborative scientific tools and products. This is now especially important in data science, which is applied in many different industries as a de facto standard. Essentially ...

[Read more...]

Teaching to machines: What is learning in machine learning entails?

November 16, 2017 | msuzen

Preamble. Figure 1: The oldest learning institution in the world, the University of Bologna (Source: Wikipedia). Machine Learning (ML) is now a de facto skill for every quantitative job, and almost every industry has embraced it, even though the fundamentals of the field are not new at all. However, what does it mean to teach ... [Read more...]

Understanding overfitting: an inaccurate meme in supervised learning

August 16, 2017 | msuzen

Preamble. There is a lot of confusion among practitioners regarding the concept of overfitting. A kind of urban legend or meme, a piece of folklore, seems to be circulating in data science and allied fields with the following statement: Applying cross-validation prevents overfitting and a good out-of-sample performance, low ...

[Read more...]

Post-statistics: Lies, damned lies and data science patents

August 5, 2017 | msuzen

Figure: US Patent (Wikipedia). Statistics is such an important field in our daily lives nowadays. The emerging, 50-year-old field of data science, now applied to almost every human activity, or post-statistics, a kind of post-rock, fusing operations research, data mining, software and performance engineering and of course a multitude of fields ...

[Read more...]

Pitfalls in pseudo-random number sampling at scale with Apache Spark

June 15, 2017 | msuzen

In many data science applications and in academic research, techniques involving Bayesian inference are now commonly used. One of the basic operations in Bayesian inference techniques is drawing instances from a given statistical distribution. This is, of course, the well-known pseudo-random number sampling. The most commonly used methods first generate a uniform random number ...

[Read more...]

Practical Kullback-Leibler (KL) Divergence: Discrete Case

January 7, 2017 | msuzen

KL divergence (Kullback-Leibler57) or KL distance is a non-symmetric measure of the difference between two probability distributions. It is related to mutual information and can be used to measure the association between two random variables. Figure: Distance between two distributions (Wikipedia). In this short tutorial, I show how to compute KL divergence ...

[Read more...]
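As a rough illustration of the discrete case mentioned in this teaser, the following Python sketch computes $D(p \| q) = \sum_i p_i \log(p_i / q_i)$ for two probability vectors. It is not the code from the post; the function name `kl_divergence` and the example vectors are assumptions made here for illustration only.

```python
import math


def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) in nats (hypothetical helper)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0 by convention
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total


# Illustrative distributions (assumed, not taken from the post).
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # D(p || q)
print(kl_divergence(q, p))  # D(q || p), generally a different value
```

Swapping the arguments generally changes the result, which is the non-symmetry the teaser refers to.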

Understanding the empirical law of large numbers and the gambler’s fallacy

August 1, 2016 | msuzen

One misconception in our understanding of statistics, a counter-intuitive fallacy, appears in the assumption that a law of averages exists. Imagine we toss a fair coin many times; most people would think that the number of heads and tails would be balanced over the ...

[Read more...]
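The coin-toss thought experiment in this teaser can be sketched with a small Python simulation. It is only an illustration under assumed names (`coin_toss_summary`, the seed, and the toss counts are not from the post): the proportion of heads settles near 0.5 as the number of tosses grows, while the absolute gap between heads and tails is under no obligation to shrink.

```python
import random


def coin_toss_summary(n_tosses, seed=42):
    """Toss a fair coin n_tosses times (hypothetical helper).

    Returns the proportion of heads and the absolute gap |heads - tails|.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    tails = n_tosses - heads
    return heads / n_tosses, abs(heads - tails)


for n in (100, 10_000, 1_000_000):
    proportion, gap = coin_toss_summary(n)
    print(n, proportion, gap)
```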

Economy and dynamic modelling: Haavelmo’s approach

July 25, 2016 | msuzen

Updated on 25 August 2017. Preamble: Predictions using dynamic modelling. Machine Learning and Neural Networks are not the only ways to do data science or AI. There are other techniques to explore, for example, from quantitative economics. Apart from Game Theory, dynamic modelling could be suitable for many prediction problems, especially the ones ... [Read more...]

Economy and dynamic modelling: Haavelmo’s approach

July 25, 2016 | msuzen

Econometrics aims at estimating observables in the economy and their inter-dependencies, and at testing the estimates against economic reality. A quantitative approach to expressing these inter-dependencies appears as simultaneous equations, i.e., a system of linear equations. This is a mathematical structure of economic relationships that was made possible with ... [Read more...]

S-shaped data: Smoothing with quasibinomial distribution

January 16, 2016 | msuzen

Figure 1: Synthetic data and fitted curves. S-shaped distributed data can be found in many applications. Such data can be approximated with the logistic distribution function [1]. The cumulative distribution function of the logistic distribution is the logistic function, i.e., the inverse of the logit. To demonstrate this, in this short example, after generating synthetic data, ... [Read more...]

Scale back or transform back multiple linear regression coefficients: Arbitrary case with ridge regression

April 10, 2015 | msuzen

Summary. In the common case in data science or machine learning applications, different features or predictors manifest themselves on different scales. This can make it difficult to interpret the resulting coefficients of a linear regression, such as one featur... [Read more...]

Euclid Algorithm for Set of Integers: ‘Reduce’ vs. trees in R

May 7, 2014 | msuzen

The Euclid Algorithm computes the greatest common divisor (GCD) of two natural numbers $x_{1}$ and $x_{2}$, denoted by $GCD(x_{1}, x_{2})$, i.e., the largest integer that divides both $x_{1}$ and $x_{2}$. Solution is proposed by ... [Read more...]
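A minimal Python sketch of the idea in this teaser, assuming the set version simply folds the two-number algorithm over the list (the post itself works in R, and the names `gcd` and `gcd_of_set` are illustrative choices made here, not the author's):

```python
from functools import reduce


def gcd(a, b):
    """Euclid's algorithm for two natural numbers."""
    while b:
        a, b = b, a % b  # replace (a, b) with (b, a mod b) until b is 0
    return a


def gcd_of_set(numbers):
    """GCD of several integers by folding gcd pairwise over the list."""
    return reduce(gcd, numbers)


print(gcd_of_set([12, 18, 30]))  # 6
```

The fold works because $GCD(x_{1}, x_{2}, x_{3}) = GCD(GCD(x_{1}, x_{2}), x_{3})$.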

Particle approximation to probability density functions: Dirac delta function representation

January 17, 2014 | msuzen

In the previous post, I briefly showed the idea of using the Dirac delta function for discrete data representation. In the second example there, the histogram locations for a given set of points are presented as spike trains, whereas heights are somehow... [Read more...]

Demystify Dirac delta function for data representation on discrete space

November 20, 2013 | msuzen

The Dirac delta function is an important tool in Fourier analysis. It is used routinely, especially in electrodynamics and signal processing. A function over a set of data points is often shown with a delta function representation. A novice reader relyin... [Read more...]

A technique for doing parameterized unit test in R: Case study with stock price data analysis

September 13, 2013 | msuzen

Ensuring the quality and correctness of statistical or scientific software in general constitutes one of the main responsibilities of scientific software developers and scientists who provide code to solve a specific computational task. Sometimes tasks can be mission critical, for example, in drug trials, clinical research or designing ... [Read more...]

Metaprogramming in R with an example: Beating lazy evaluation

September 5, 2013 | msuzen

Functional languages allow us to treat functions as types. This brings the distinct advantage of being able to write code that generates further code; this practice is generally known as metaprogramming. As a functional language, the R project provides ... [Read more...]

Practicing static typing in R: Prime directive on trusting our functions with object oriented programming

June 13, 2013 | msuzen

John Chambers, the creator of the S language from which R is derived, said in one of his books, Software for Data Analysis: Programming with R: ...This places an obligation on all creators of software to program in such a way that the computations ca... [Read more...]

R-bloggers

R news and tutorials contributed by hundreds of R bloggers
