Yesterday and today I attended the data2day, a conference about Big Data, Machine Learning and Data Science in Heidelberg, Germany. Topics and workshops covered a range of topics surrounding (big) data analysis and Machine Learning, like Deep Learning, Reinforcement Learning, TensorFlow applications, etc. Distributed systems and scalability were a major part of a lot of the talks as well, reflecting the growing desire to build bigger and more complex models that can’t (or would take too long to) run on a single computer. Most of the application examples were built in Python but one talk by Andreas Prawitt was specifically titled “Using R for Predictive Maintenance: an example from the TRUMPF Laser GmbH”. I also saw quite a few graphs that were obviously made with ggplot!
The keynote lecture on Wednesday about Blockchains for AI was given by Trent McConaghy. Blockchain technology is based on a decentralized system of storing and validating data and changes in data. It experiences a huge hype at the moment but it is only starting to gain track in Data Science and Machine Learning as well. I therefore found it a very fitting topic for the keynote lecture! Trent and his colleagues at BigchainDB are implementing an “internet-scale blockchain database for the world” – the Interplanetary Database (IPDB).
“IPDB is a blockchain database that offers decentralized control, immutability and the creation and trading of digital assets. […] As a database for the world, IPDB offers decentralized control, strong governance and universal accessibility. IPDB relies on “caretaker” organizations around the world, who share responsibility for managing the network and governing the IPDB Foundation. Anyone in the world will be able to use IPDB. […]” https://blog.bigchaindb.com/ipdb-announced-as-public-planetary-scale-blockchain-database-7a363824fc14
He presented a number of examples where blockchain technology for decentralized data storage/access can be beneficial to Machine Learning and AI, like exchanging data from self-driving cars, of online market places and for generating art with computers. You can learn more about him here:
The other talks were a mix of high- and low-level topics: from introductions to machine learning, Apache Spark and data analysis with Python to (distributed) data streaming with Kappa architecture or Apache Kafka, containerization with Docker and Kubernetes, data archiving with Apache Cassandra, relevance tuning with Solr and much more. While I spent most of the time at my company’s conference stand, I did hear three of the talks. I summarize each of them below…
- Scalable Machine Learning with Apache Spark for Fraud Detection
In this first talk I heard, Dr. Patrick Baier and Dr. Stanimir Dragiev presented their work at Zalando. They built a scalable machine learning framework with Apache Spark, Scala and AWS to assess and predict fraud in online transactions. Zalando is a German online store that sells clothes, shoes and accessories. Normally, they allow registered customers to buy via invoice, i.e. they receive their ordered items before they pay them. This leaves them vulnerable to fraud where item are not paid for. The goal of their data science team is to use customer and basket data to obtain a probability score for how likely a transaction is going to be fraudulent. High-risk payment options, like invoice, can then be disabled in transactions with high fraud probability. To build and run such machine learning models, the Zalando data science team uses a combination of Spark, Scala, R, AWS, SQL, Python, Docker, etc. In their workflow, they use a combination of static and dynamic features, imputing missing values and building a decision model. In order to scale their modeling workflow to process more requests, use more data in training, update models more frequently and/or run more models, they described a workflow that uses Spark, Scala and Amazon Web Services (AWS). Spark’s machine learning library can be used for modeling and scaled horizontally by increasing the number of clusters on which to run the models. Scala provides multi-threading functionality and AWS is used for storing data in S3 and extending computation power depending on changing needs. Finally, they include a model inspector into their workflow to assure comparability of training and test data and check that the model is behaving as expected. Problems that they are dealing with include highly unbalanced data (which is getting even worse the better their models work as they keep reducing the number of fraud cases), delayed labeling due to the long process of the transactions, seasonality in data.
- Sparse Data: Don’t Mind the Gap!
In this talk, my colleagues from codecentric Dr. Daniel Pape and Dr. Michael Plümacher showed an example from ad targeting of how to deal with sparse data. Sparse data occurs in many areas, e.g. as rare events over a long period of time or in areas where there are many items and few occurrences per item, like in recommender systems or in natural language processing (NLP). In ad targeting, the measure of success is the rate of the click-through rate (CRT): this is the number of clicks on a given advertisement displayed to a user on a website divided by the total number of advertisements, or impressions. Because financial revenue comes from a high CTR, advertisements should be placed in a way that maximizes their chance of being clicked, i.e. we want to recommend advertisements for specific users that match their interests or are of actual relevance. Sparsity come into play with ad targeting because the number of clicks is very low compared to two metrics: a) from all the potential ads that a user could see, only a small proportion is actually shown to her/him and b) of the ads that a user sees, she/he only clicks on very few. This means that, a CTR matrix of advertisements x targets will have very few combinations that have been clicked (the mean CTR is 0.003) and contain many missing values. The approach they took was to impute the missing values and predict for each target/user the most similar ads from the imputed CTR matrix. This approach worked well for a reasonably large data set but it didn’t perform so well with smaller (and therefore even sparser) data. They then talked about alternative approaches, like grouping users and/or ads into groups in order to reduce the sparsity of the data. Their take-home messages were that 1) there is no one-size-fits-all solution, what works depends on the context and 2) if the underlying data is of bad quality, the results will be sub-optimal – no matter how sophisticated the model.
- Distributed TensorFlow with Kubernetes
In the third talk, another colleague of mine from codecentric, Jakob Karalus, explained in detail how to set up a distributed machine learning modelling set-up with TensorFlow and Kubernetes. TensorFlow is used to build neural networks in a graph-based manner. Distributed and parallel machine learning can be necessary when training big neural networks with a lot of training data, very deep neural networks, with complex parameters, grid search for hyper-parameter tuning, etc. A good way to build neural networks in a controlled and stable environment is to use Docker containers. Kubernetes is a container orchestration tool that can set up distribution of nodes from our TensorFlow modeling container. Setting up this distributed system is quite complex, though and Jakob recommended to try to stay on one CPU/GPU as long as possible.