Keeping Users Safe While Collecting Data
I caught a mention of this project by Pete Warden on Four Short Links today. If his name sounds familiar, he’s the creator of the DSTK, an O’Reilly author, and now works at Google. A decidedly clever and decent chap.
The project goal is noble: crowdsource and make a repository of open speech data for researchers to make a better world. Said sourcing is done by asking folks to record themselves saying “Yes”, “No” and other short words.
As I meandered over the blog post I looked in horror at the URL for the application that did the recording: https://open-speech-commands.appspot.com/.
Why would the goal of the project combined with that URL give pause? Read on!
You’ve Got Scams!
Picking up the phone and saying something as simple as ‘Yes’ has been a major scam this year. Once attackers have a recording of your voice, they can replay it at phone prompts. Because it really is your voice, the “evidence” is harder to refute, and the recording can fool recognition systems that listen for your actual voice.
As the chart above shows, the Better Business Bureau has logged over 5,000 of these scams this year (searching for ‘phishing’ and ‘yes’). You can play with the data (a bit; the package needs work) in R with scamtracker.
Now, these are “analog” attacks (i.e. a human spends time socially engineering a human). Bookmark this as you peruse section 2.
Integrity Challenges in 2017
I “trust” Pete’s intentions, but I sure don’t trust open-speech-commands.appspot.com (and, you shouldn’t either). Why? Go visit https://totally-harmless-app.appspot.com. It’s a Google App Engine app I made for this post. Anyone can make an appspot app, and the https is meaningless as far as integrity & authenticity go, since I’m running on Google’s infrastructure but I’m not Google.
You can’t really trust most SSL/TLS sessions as far as site integrity goes anyway. Let’s Encrypt put the final nail in the coffin with their Certs Gone Wild! initiative. With super-recent browser updates you can almost trust your eyes again when it comes to URLs, but you should be very wary of entering your info — especially uploading voice, prints or eye/face images — into any input box on any site if you aren’t 100% sure it’s a legit site that you trust.
Tracking the Trackers
If you don’t know that you’re being tracked 100% of the time on the internet then you really need to read up on the modern internet.
In many cases your IP address can directly identify you. In most cases your device & browser profile — which most commercial sites log — can directly identify you. So, just visiting a web site means that it’s highly likely that web site can know that you are both not a dog and are in fact you.
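To make the “device & browser profile” point concrete, here is a minimal sketch of how a site could fold a few request attributes into a stable identifier. The function name and the four attributes are assumptions chosen for illustration; real trackers tend to combine many more signals, and nothing here describes what any particular site does.

```python
import hashlib


def visitor_fingerprint(ip: str, user_agent: str, accept_language: str, screen: str) -> str:
    """Hash a handful of request attributes into a short, repeatable identifier.

    Hypothetical helper: the attribute set is an assumption for illustration.
    """
    # Join with a separator that is unlikely to appear inside the values,
    # so ("ab", "c") and ("a", "bc") do not collide.
    joined = "\x1f".join([ip, user_agent, accept_language, screen])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


if __name__ == "__main__":
    # Two "visits" with identical attributes produce the same identifier,
    # even though no name or login was ever supplied.
    first = visitor_fingerprint("203.0.113.7", "ExampleBrowser/1.0", "en-US", "1920x1080")
    second = visitor_fingerprint("203.0.113.7", "ExampleBrowser/1.0", "en-US", "1920x1080")
    print(first == second)
```

The point is not the hash itself but that the inputs arrive with every request, whether or not you typed anything into a form.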
Still Waiting for the “So, What?”
Many states and municipalities have engaged in awareness campaigns to warn citizens about the “Say ‘Yes’” scam. Asking someone to record themselves saying ‘Yes’ into a random web site pretty much negates that advice.
Folks like me regularly warn about trust on the internet. I could have cloned the functionality of the original site to open-speech-commmands.appspot.com. Did you even catch the 3rd ‘m’ there? Even without that, it’s an appspot.com domain. Anyone can set one up.
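The extra ‘m’ is easy for eyes to miss but trivial for code to catch. Below is a minimal sketch, assuming you already know the hostname you meant to visit, that flags near-miss lookalikes with a plain Levenshtein edit distance. The function names and the distance threshold of 2 are assumptions for illustration, not an established standard.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic row-by-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def classify_host(host: str, expected: str, max_distance: int = 2) -> str:
    """Return 'match', 'lookalike' or 'different' for host vs. the expected name.

    Hypothetical helper; max_distance is an arbitrary illustrative threshold.
    """
    host = host.strip().lower().rstrip(".")
    expected = expected.strip().lower().rstrip(".")
    if host == expected:
        return "match"
    if edit_distance(host, expected) <= max_distance:
        return "lookalike"
    return "different"


if __name__ == "__main__":
    expected = "open-speech-commands.appspot.com"
    for candidate in (expected,
                      "open-speech-commmands.appspot.com",
                      "totally-harmless-app.appspot.com"):
        print(candidate, classify_host(candidate, expected))
```

Keep in mind that a “match” only tells you the hostname is the one you expected. On a shared domain like appspot.com it says nothing about who actually runs the app.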
Even if the site doesn’t ask for your name or other info and just asks for your ‘Yes’, it can know who you are. In fact, when you’re enabling the microphone to do the recording, it could even take a picture of you if it wanted to (and you’d likely not know or not object since it’s for SCIENCE!).
So, in the worst case scenario a malicious entity could be asking you for your ‘Yes’, tying it right to you and then executing the post-scam attacks that were being performed in the analog version.
But, go so far as to assume this is a legit site with good intentions. Do you really know what’s being logged when you commit your voice info? If the data were mishandled, it would be just as easy to tie the voice files back to you (assuming a certain level of data logging).
The “so what” is not really a warning to users but a message to researchers: You need to threat model your experiments and research initiatives, especially when innocent end users are potentially being put at risk. Data is the new gold, diamonds and other precious bits that attackers are after. You may think you’re not putting folks at risk and aren’t even a hacker target, but how you design data gathering can reinforce good or bad behaviour on the part of users. It can strengthen sound security messages or tear them down. And, you and your data may be more of a target than you really know.
Reach out to interdisciplinary colleagues to help threat model your data collection, storage and dissemination methods to ensure you aren’t putting yourself or others at risk.
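One small thing a threat model might lead to is storing less. The sketch below assumes a hypothetical submission dictionary that arrives with the spoken word, the audio bytes and some request metadata, and keeps only an allow-listed subset under a random identifier. The field names and the function are assumptions for illustration, not how any real collection site works.

```python
import secrets
from datetime import datetime, timezone

# Hypothetical allow-list: the only submitted fields this sketch retains.
KEPT_FIELDS = ("word", "audio")


def build_storage_record(submission: dict) -> dict:
    """Copy only allow-listed fields and attach a random, unlinkable clip id."""
    record = {field: submission[field] for field in KEPT_FIELDS}
    # Random id rather than anything derived from the IP, browser or time of day.
    record["clip_id"] = secrets.token_hex(16)
    # Keep the date only; a precise timestamp is easier to join against server logs.
    record["received_on"] = datetime.now(timezone.utc).date().isoformat()
    return record


if __name__ == "__main__":
    submission = {
        "word": "yes",
        "audio": b"\x00\x01",
        "ip": "203.0.113.7",                 # present in the request, never stored
        "user_agent": "ExampleBrowser/1.0",  # present in the request, never stored
    }
    print(sorted(build_storage_record(submission)))
```

This only cuts the metadata link. The voice itself is still identifying, which is exactly why the collection, storage and dissemination steps all deserve a look.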
FIN
Pete did the right thing:
and, I’m sure the site will be on a “proper” domain soon. When it is, I’ll be one of the first in line to help make a much-needed open data set for research purposes.