I interviewed Wes McKinney, the creator of pandas and author of Python for Data Analysis. I conducted this interview to offer an illustrative look at open source software, now that I am actively contributing to the arrow R package. Wes is a very active collaborator on that package and a co-creator of the multi-platform library Apache Arrow.
1. What would you recommend to users who claim that the X programming language is better than the Y programming language? (I’ve used X to avoid putting Python/R/etc 1st)
I think the “language wars” are pretty counterproductive. Different languages have different strengths, and ultimately what matters is what tool will make you most effective as an individual and enable you to work productively with your colleagues. Which language is “better” for solving a problem can depend a great deal on the particular context and person.
2. How can Arrow increase cooperation between people who use different programming languages?
Apache Arrow provides a language-agnostic in-memory data format for bulk data interchange and analytical computing. It enables programming languages to exchange data with each other without having to pay the usual serialization / data conversion penalties. Using Arrow, we’ve been able to achieve zero-copy interoperability between Java and Python or Python and R, so that applications can run code written in any of those languages simultaneously on the same data. Historically, creating polyglot systems was a huge headache for developers because of these issues.
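To make the difference between serialization and zero-copy sharing concrete, here is a minimal sketch that uses only the Python standard library. It is an analogy, not the Arrow API: `array` and `memoryview` stand in for a columnar buffer and a second consumer of that buffer, and the variable names are made up for illustration.

```python
import json
from array import array

# A contiguous column of 64-bit floats, standing in for one column of a table.
column = array("d", [1.5, 2.5, 3.5, 4.5])

# Serialization route: the values are converted to text and parsed back,
# so the consumer ends up with an independent copy.
copied = json.loads(json.dumps(column.tolist()))

# Zero-copy route: the consumer receives a view over the same memory.
shared = memoryview(column)

# The producer updates the buffer in place.
column[0] = 99.0

print(copied[0])             # 1.5: the copy no longer matches the source
print(shared[0])             # 99.0: the view reads the producer's memory directly
print(shared.obj is column)  # True: no second buffer was allocated
```

Arrow applies the same idea across language runtimes by standardizing the memory layout itself, so that a consumer in another language can read the buffers in place.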
3. How can Arrow help to make scientific results reproducible?
Arrow is partly intended to simplify fast data access and encourage the use of binary data formats like Parquet over text-based formats like CSV and JSON. By enabling scientists to spend less time thinking about how they deal with data management and data access, they can spend more time on other software engineering problems necessary to make their research results more reproducible.
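One reason typed binary formats simplify data access can be sketched with the standard library alone. Parquet is not part of the standard library, so this example uses a fixed `struct` record layout as a stand-in for a typed binary format. The record layout and sample rows are assumptions made for illustration. Real Parquet files are columnar and store their schema in the file metadata.

```python
import csv
import io
import struct

# Sample rows: an integer id and a floating-point measurement.
rows = [(1, 0.1), (2, 0.25), (3, 1e-9)]

# Text route: CSV stores characters only, so every value is read back as a
# string and the reader must guess or re-declare the types.
text_buf = io.StringIO()
csv.writer(text_buf).writerows(rows)
text_buf.seek(0)
from_csv = [tuple(r) for r in csv.reader(text_buf)]

# Binary route: a fixed record layout (little-endian int32 + float64) acts as
# a schema, so values come back with their original types.
record = struct.Struct("<id")
blob = b"".join(record.pack(i, x) for i, x in rows)
from_binary = list(record.iter_unpack(blob))

print(from_csv[0])          # ('1', '0.1')
print(from_binary[0])       # (1, 0.1)
print(from_binary == rows)  # True
```

With the text format, every analysis script has to repeat the type conversion step, which is one more place where two runs of the same study can diverge.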
4. How can Arrow help industry?
It’s been our intent for Arrow to provide a reliable in-memory analytics SDK to make it easier for application developers to build fast applications that work on large tabular datasets. The Arrow data format spares industry developers from having to develop custom data formats for their applications, so they can just use Arrow “off the shelf” as the basis of their in-memory computing and interoperability with various programming languages. One of the most successful use cases for Arrow has been as an on-ramp to the open source data science system. Rather than building custom data connectors to data science tools like pandas, companies integrate with Arrow (which is generally much easier) and get fast connectivity to pandas for free.
5. What is the greatest challenge you’ve faced in keeping Arrow open source while also running a sustainable company behind its development?
Most companies are not accustomed to providing substantial funding to open source developers, at least the kind of funding that is necessary to be able to pay competitive salaries. So before founding Ursa Computing (which is a startup with venture investors), I spent a lot of time working with the lawyers and accountants at our sponsors to set up a suitable business relationship to enable us to operate a small full time development team. Of course, I was responsible for doing the accounting and invoicing, too!
Since Arrow is a part of the Apache Software Foundation, whether it remains open source is not up to us. The project is owned by the community and will certainly live on even if we were to move on to do something else in the future. That’s pretty unlikely, but it’s good to know that substantially more contributions are coming from non-Ursa folks than Ursa folks nowadays.
6. Do you have any advice for companies that don’t trust open source?
One of the best ways to help is to do what you can to provide funding to make open source development more sustainable. One of the biggest risks to open source projects is their best developers and maintainers moving on to other projects because they can’t justify continuing to work on their projects for free. Lack of adequate maintenance can lead to security flaws or other serious bugs going unfixed, which can create unpredictable risks for the businesses that depend on them.
7. What are, in your opinion, the greatest challenges for statistics, data science, and ML/AI?
For a long time I have been concerned about how much software innovation has lagged behind hardware innovation. We have very powerful computing hardware on our laptops and even smartphones now, but we don’t have as much software that can fully take advantage of that hardware. One of the purposes of Arrow is to reduce the complexity of the problem space by enabling software developers to focus their efforts on creating algorithms for a common data format and enabling those algorithms to be shared across a wider collection of programming languages and use cases.
8. Do you have an opinion about blockchain?
I’m a proponent of blockchain technologies and cryptocurrencies, and while there are frequently concerns cited about the energy use of Bitcoin, I believe the energy use and environmental harm of other aspects of our energy and financial infrastructure can be much worse. I’m interested to see more crossover in the future between the blockchain world and the data science / ML / AI world.
9. Is there something that, in your opinion, constitutes a large blocker for today’s computational tools and needs urgent attention before blockchain or cryptocurrencies?
The need to store and efficiently process vast amounts of data is more important now than ever. Improving the performance and lowering the net energy use through less wasteful computing is one of the best things we can do to help. Arrow is all about more efficient, less wasteful analytical computing, so we hope that what we are doing is part of the solution.
10. How do you think that tools like Arrow, DuckDB and other new promising tools for analytics are going to interact with each other and transform today’s world?
One of the hopes for Arrow is to enable computational systems to be more interoperable and straightforward to plug together so it’s easier to create heterogeneous application pipelines. If Arrow serves as the common data medium between different systems, then developers have to think less about the mechanics of interoperability such as data serialization. We’re actively working with DuckDB, for example, to improve its Arrow integration.
11. What advice can you give to those who want to venture into data science but don’t know where to start?
I would recommend finding an introductory book on data science tools (like “R for Data Science” or “Python for Data Analysis”; sorry, couldn’t help myself!) and then finding a problem you’re interested in solving, such as exploring some of the many open / public datasets out there. I believe one of the best ways to learn is by doing. As you find yourself needing to do different things with the data, you can find plenty of resources online on Stack Overflow and elsewhere to see how other people have solved similar problems.