At every SQL community event that has a cluster of sessions dedicated to BI or analytics, people ask me, “Which one would you recommend?” or “Which one do you prefer?”
Questions about recommendations and preferences are, in my opinion, the hardest ones. Not because I don’t know my own preferences, but because by imposing yours you inadvertently shape someone else’s taste. And expressing your taste through someone else’s taste is even harder.
My initial reaction is a counter-question: why are you asking this? Simply because my curiosity goes beyond the question of preferring A over B. Most of the time, the answer I get is “because everyone is asking this question” or “because someone said this and someone else said that”. In none of the cases, and I mean literally none, did I get the response I would love to get, such as “we are running this algorithm and there are issues…” or “this library suits us better…”. So the community is mainly focused on asking which one is better, instead of asking whether R or Python can do the job. And I can assure you that both can do the job! Period.
Imagine I ask you whether you would prefer an Apple iPhone over a Samsung Galaxy. Or whether you would prefer a BMW over an Audi. In all these cases, both phones or both cars will get the job done. So will Python or R, R or Python. So instead of asking which one I prefer, ask yourself: which one suits my environment better? If your background is more statistics and less programming, take R; if you are more into programming and less into statistics, take Python. In both cases you will get to results faster with the language that matches your background. If you ask me whether you can do gradient boosting or ANOVA or MDS in Python or in R, the answer will be yes, you can do all of them in either language.
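To make the "both can do the job" point concrete, here is a minimal sketch of a one-way ANOVA F statistic computed with nothing but the Python standard library. The function name `one_way_anova_f` and the three sample groups are made up for illustration; in practice you would more likely reach for `aov()` in R or `scipy.stats.f_oneway` in Python, which also give you the p-value.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """Return the F statistic of a one-way ANOVA over the given groups.

    Hypothetical helper for illustration only; it does not compute a p-value.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up measurements for three groups.
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
group_c = [5.0, 6.0, 7.0]

f_stat = one_way_anova_f(group_a, group_b, group_c)
print(f"F statistic: {f_stat:.2f}")
```

The same few lines of arithmetic could be written in R just as easily, which is rather the point: the algorithm is not what separates the two languages.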
The important questions are therefore the ones about what will give you fast results, easier adaptation and adoption, a better fit into your environment, and less impact on your daily tasks.
Some might say R is a child’s-play language, while Python is a real programming language. Others might say Python is so complex that you have to program everything, whereas in R everything is ready. And so on and on. All these allegations have some truth, but to fully understand them, I guess one needs to understand the background of the people making them. Obviously, Python, in comparison to R, is a more general-purpose scripting and programming language, so the number of packages is roughly 10x higher than for R. Both come with a variety of packages, giving users specific functions, classes and procedures to produce their results. R, on the other hand, has had its moment in the past couple of years and its community grew rapidly, whereas the Python community is in its steady phase.
When you are deciding which one to select, here are some questions to be answered:
- how big is my corporate environment and how many end users will I have
- who is the end user and how will the end user handle the results
- what is the current general knowledge of the language
- which statistical and predictive algorithms will the company be using
- would there be a need for parallel and distributed on-prem computations
- if needed, do we need to connect (or copy/paste) the code to the cloud
- how fast can the company adopt the language and the amount of effort needed
- which language would fit easier with existing BI stack and visualization tools
- how is your data centralized and siloed and which data sources are you using
- governance and provenance issues
- installation, distribution of the core engine and packages
- selection and the costs of IDE and GUI
- corporate support and SLA
- possibility to connect to different data sources
- release dates of the most useful packages
- community support
- third party tools and additional programs for easier usage of the language
- total cost of using the language once completely in place
- assess the risk of using GNU/open source software
After answering these questions, I implore you to run stress and load tests against your datasets and databases to see which one performs better.
All in all, both languages, when used for statistical and predictive analysis, also have a couple of annoyances that should be addressed:
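A stress test does not have to be elaborate to be useful. The sketch below is one possible shape for the Python side, using only the standard library: it times the same workloads over growing data sizes and reports the best and median wall-clock time. The workloads (`sort_workload`, `summary_workload`), the data sizes and the random data are placeholders I made up; you would swap in your own queries and datasets, and write the equivalent harness in R (for example with `system.time()`) to compare.

```python
import random
import statistics
import time


def time_workload(workload, data, repeats=5):
    """Run workload(data) several times; return best and median seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload(data)
        timings.append(time.perf_counter() - start)
    return {"best": min(timings), "median": statistics.median(timings)}


# Hypothetical workloads standing in for your real analysis steps.
def sort_workload(values):
    return sorted(values)


def summary_workload(values):
    return statistics.fmean(values), min(values), max(values)


random.seed(42)  # fixed seed so repeated runs use the same data
results = {}
for size in (1_000, 10_000, 50_000):
    data = [random.random() for _ in range(size)]
    for name, workload in (("sort", sort_workload), ("summary", summary_workload)):
        results[(name, size)] = time_workload(workload, data)

for (name, size), timing in results.items():
    print(f"{name:>8} n={size:>6}: best {timing['best']:.6f}s, "
          f"median {timing['median']:.6f}s")
```

Reporting the median alongside the best run helps spot noisy measurements; a single timing on a shared machine tells you very little.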
- memory limitations (unless spilling to disk)
- language specifics (e.g., R is case-sensitive, Python is indent-sensitive, and both will annoy you)
- parallel and distributed computations (CPU utilization, multi-threading)
- multi-OS running environment
- cost of GUI/IDE
- engine and package dependencies and versioning
- and others
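The memory limitation in particular can often be worked around by processing data as a stream instead of loading it all at once. As a minimal sketch, the function below computes the count, mean and sample variance in a single pass using Welford's online algorithm, so only three numbers are kept in memory. The names `running_stats` and `simulated_rows` are made up for this example, and the generator merely stands in for rows read lazily from a file or a database cursor.

```python
def running_stats(values):
    """Single-pass count, mean and sample variance (Welford's algorithm)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else float("nan")
    return count, mean, variance


def simulated_rows(n):
    """Hypothetical stand-in for rows fetched lazily from a data source."""
    for i in range(1, n + 1):
        yield float(i)


count, mean, variance = running_stats(simulated_rows(100))
print(f"n={count}, mean={mean:.2f}, variance={variance:.2f}")
```

The same idea applies in R; neither language forces you to hold the whole dataset in memory, but in both you have to design for it deliberately.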
So the next time you ask yourself which one is better (bigger, faster, more stable, …), or overhear that conversation in the community, start by asking questions about your needs and the effort required to adopt the language. Otherwise, as I always add, learn both. It does not hurt to learn and use both (at least for statistical and predictive purposes).