The Tens Of Millions Of Faces Training Facial Recognition; You’ll Soon Be Able To Search For Yourself

In a stiflingly hot lecture tent at CCCamp on Friday, Adam Harvey took to the stage to discuss the huge datasets being used by groups around the world to train facial recognition software. These faces come from a variety of sources, and soon Adam and his research collaborator Jules LaPlace will release a tool that makes these datasets searchable, allowing you to figure out if your face is among the horde.

Facial recognition is the new hotness, recently bubbling up to the consciousness of the general public. In fact, when boarding a flight from Detroit to Amsterdam earlier this week I was required to board the plane not by showing a passport or boarding pass, but by pausing in front of a facial recognition camera which subsequently printed out a piece of paper with my name and seat number on it (although it appears I could have opted out, that was not disclosed by Delta Airlines staff at the time). Anecdotally this gives passengers the feeling that facial recognition is robust and mature, but Adam mentions that this is not the case and that, removed from highly controlled environments, the accuracy of recognition is closer to an abysmal 2%.

Images are only effective in these datasets when the interocular distance (the distance between the pupils of your eyes) is a minimum of 40 pixels. But over the years this minimum resolution has been moving higher and higher, with the current standard trending toward 300 pixels. The increase is not surprising as it follows a similar curve to the resolution available from digital cameras. The number of faces available in data sets has also increased along a similar curve over the years.

Adam’s talk recounted the availability of face and person recognition datasets and it was a wild ride. Of note are datasets by the names of Brainwash Cafe, Duke MTMC (multi-target, multi-camera), Microsoft Celeb, Oxford Town Centre, and the Unconstrained College Students dataset. Faces in these databases were harvested without consent and that has led to four of them being removed, but of course, they’re still available as what is once on the Internet may never die.

The Microsoft Celeb set is particularly egregious as it used the Bing search engine to harvest faces (oh my!) and has associated names with them. Lest you think you’re not a celeb and therefore safe, in this case celeb means anyone who has an internet presence. That’s about 10 million faces. Adam used two examples of past CCCamp talk videos that were used as a source for adding the speakers’ faces to the dataset. It’s possible that this is in violation of GDPR so we can expect to see legal action in the not too distant future.

Your face might be in a dataset, so what? In their research, Adam and Jules tracked geographic locations and other data to establish who has downloaded and is likely using these sets to train facial recognition AI. It’s no surprise that the National University of Defense Technology in China is among the downloaders. In the case of US intelligence organizations, it’s much easier to know they’re using some of the sets because they funded some of the research through organizations like IARPA. These sets are being used to train up military-grade face recognition.

What are we to do about this? Unfortunately what’s done is done, but we do have options moving forward. Be careful of how you license images you upload: substantial data was harvested through loopholes in licenses on platforms like Flickr, or through use that people agreed to in EULAs on platforms like Facebook. Adam’s advice is to stop populating the internet with faces, which is why I’ve covered his with the Jolly Wrencher above. Alternatively, you can limit image resolution so interocular distance is below the forty-pixel threshold. He also advocates for changes to Creative Commons that let you choose to grant or withhold use of your images in training sets like these.
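For the curious, the resolution-limiting idea boils down to a bit of arithmetic. Here is a minimal sketch, assuming you have already measured the interocular distance of the largest face in the photo (by hand in an image editor, or with a landmark detector of your choice). The function name, the one-pixel safety margin, and the example numbers are illustrative assumptions and not something from the talk, and staying under the threshold is no guarantee that an image is useless for training.

```python
import math


def size_below_threshold(width, height, interocular_px, threshold_px=40, margin_px=1):
    """Return a (width, height) that keeps the interocular distance under threshold_px.

    Hypothetical helper: interocular_px is the measured pupil-to-pupil distance
    in the original image, in pixels. The 40 px default is the minimum quoted in
    the article; margin_px is an assumed safety margin.
    """
    if interocular_px <= 0:
        raise ValueError("interocular_px must be positive")

    target_px = threshold_px - margin_px
    if interocular_px <= target_px:
        # Already below the threshold, no downscaling needed.
        return width, height

    # Uniform scale factor that shrinks the eye distance to the target.
    scale = target_px / interocular_px
    # Round down so the resized face can never end up above the target.
    return math.floor(width * scale), math.floor(height * scale)


if __name__ == "__main__":
    # A 4000x3000 photo with eyes 312 px apart needs to shrink to 500x375.
    print(size_below_threshold(4000, 3000, 312))
```

The resulting dimensions can be passed to whatever resize tool you already use before uploading.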

Adam’s talk, MegaPixels: Face Recognition Training Datasets, will be available to view online by the time this article is published.

45 thoughts on “The Tens Of Millions Of Faces Training Facial Recognition; You’ll Soon Be Able To Search For Yourself”

  1. Why people are so paranoid over this is just simple paranoia. So some system can figure out things about me, so what.
    So they use the security cameras at work to see that I went to the bathroom 6 times during my work day, and got 3 cups of coffee. Picked my nose while walking in the hallway. Was at the gas station at 7:32AM, and tracked everything else I did and where I was at during the day. So what…. I am human and do human things, how is that going to be destructive to my life.
    In this day and age where EVERYTHING is tracked do you really think you’re going to be able to hide? You can’t stop what’s already started now.

    1. Metadata matters.
      Systems like these and apathy to your metadata make it possible for government & other malicious actors to stalk you for no reason. This represents an erosion of your privacy and your civil / human rights.
      Just because I don’t care if the justice department or my bank knows how many times I scratched my butt any given day, I don’t want them to just have that information. If this were a person following you around all day tallying your actions you could press charges for stalking, harassment, or similar crimes. But somehow a computer system with a million eyes doing it isn’t a crime?

      No, if you want to stalk me, get a warrant.

      1. for me it’s far more criminal to let the bad people roam the streets freely where they can hurt me, so i say track all the people all the time everywhere, just remove the bad ones from the society

        1. That kind of thinking paves the way for the abuse we’re already seeing. From there we get the broken thinking that stockpiles 0-days and wants to put backdoors in crypto.
          Treating everyone like criminals erodes trust and ruins society.

        2. Very funny! Works great (for you) until you suddenly match one of the criteria for ‘removal’ ;-). Could be something you did a decade ago that was totally innocent at the time. Could be something you find totally normal today but tomorrow someone decides that it’s a reason for ‘removal’. Perhaps you’re driving the ‘wrong’ car, wearing the ‘wrong’ colored clothes, visiting the ‘wrong’ restaurant, who knows… Could also be due to an error, i.e., you happen to look like the ‘wrong’ person…

          1. sure, for the error i will understand (ok, i will be very angry). by the way, it is far better to punish ten innocents rather than to let it slide for just one criminal. as for what is normal right now in the society, i think everybody can know for sure, but i understand the dark side of it also, which comes from the non-perfect implementation (it is very interesting that “my” system can be fatal in case of any error, even a small one, but the current one, which is performing not too well, can survive a lot of errors)

          2. Holy crap alfcoder. First of all, if your system has so many false positives that you’d end up punishing 10 innocent people and 1 criminal then that system has too many errors to exist. I’d rather let one criminal get away with something than wrongly punish 10 innocent people. Holy crap.

        3. Who is a bad guy? Someone that murders people or steals or sets fires? What about someone who is ideologically opposed to the people in power? Oops, you managed to piss off someone in power. Now they are going to search your metadata for something that you did wrong so they can punish you.

        4. In places where this kind of technology is already weaponized and in use for controlling the population, “the bad ones in society” include environmentalists, lawyers, conscientious objectors, and minorities and their religious beliefs. According to certain standards, these are the criminals, not innocents. By other standards, this is the edge of life in 1984.

          1. And this is why, as good citizens, it is our moral obligation to confound, confuse, and generally add noise to the wholesale collection and persistent storage of this data.

            Better still, act to instill in general society a sense that surveillance like this is a gross indecency to life.

    2. If some evil organisation wishes to stalk YOU then yes, normal CCTV is a threat, but they’d have to employ someone to do a lot of hunting through various footage. Facial recognition is much more dangerous because they can just set an algorithm loose to work out your movements with minimal effort on their part. This minimal effort also enables them to stalk EVERYBODY without needing a prohibitively large workforce to review what are mostly utterly tedious vids.

    3. It’s all a matter of scaling. An evil organisation trying to stalk you with normal camera footage has to employ some poor soul to sit in front of a bank of screens running through hours of archive footage. This is expensive and time consuming, so it can only be used for selected targeted stalking victims, however well resourced the evil organisation is. With facial recognition a computer can do all that detective work, cheaply and fast. Now it’s possible for the big evil to stalk everybody at once without needing to hire huge numbers of archive footage watchers.

    4. Better yet, would you want your bank to know how many times you visit certain stores and how you spend cash? Would you like that information to influence whether you qualify for a loan or not?

      Mike, take your argument to the logical conclusion: if you don’t care about your digital or physical privacy, give me your email password and a key to your home. No? Why not?

    5. So literally you do not care who has access to your metadata.

      So you would have no issue with metadata being directly or indirectly (oops, sorry we were hacked) bulk sold to B&E specialists (Breaking and Entry ) for a few cents, telling where you live, when you are there, what time you leave daily, projected timelines from spending patterns for when you leave on holidays, and when you will probably return.

      Or in 20 years’ time an insurance company, unable to access the exact details of a trip you probably took to an STD clinic 14 years previously (you were actually visiting a friend that lived nearby, but the metadata they bought from a taxi company was not accurate enough), decides to double your premiums because similar data on file for people living near where you lived at the time indicated an outbreak of hepatitis B.

      I’ve already seen junk mails arrive +/-2 days from when a new baby was born, and that was in the 80’s. And I’ve even seen the targeted junk mail arrive before the mother even knew that she was pregnant.

      I will admit that the level of data collected and analysis is much much scarier today, but it does not mean that people should blindly accept it.

      My real concern is how the metadata collected today will be used at some future date. Unlike paper trails which were bulky, costly to store and most of which was only kept for at most seven years, today’s storage is so cheap, it makes sense to collect everything and skim it now and reprocess it at some distant future date. My concern is for the things that you could never predict, how exactly the cross-correlated metadata will be abused. One bit of metadata is totally insignificant, but millions of items cross-correlated will always produce unforeseen consequences.

    6. Because Spotify and Amazon Music already know who you’re voting for (Nielsen 360, 2018) and Pandora has been tailoring political ads based on these inferences for years (sfgate, 2014)

      Because the stores you shop at know you or someone you’re shopping for is pregnant before you do (Target, 2012).

      Because your car’s telematics system prompted for consent when it was first booted, and the dealer prep tech probably just clicked OK before handing you the car. Or you did, since you don’t care. Ever since, it’s been reporting and selling your whereabouts. The concert venues you park at, the protests you might be at, the friends’ houses you visit, and this is easy to correlate with who else is in the same places at the same times.

      If you’re up to something, it’s very easy to infer.

      If Paul Revere’s horse had been reporting his whereabouts, King George would’ve been able to suss him out long before any pesky terrorist actions got started.

      Now add in facial recognition. It’s no longer inference.

      If your life is so boring that that doesn’t unnerve you, I feel sorry for you. History will not know your name.

    7. Dystopia marches onward amid people’s total lack of an imagination for how things could go wrong. This is so naive and stupid. For one: which is it? That there’s nothing to fear, or that it’s inevitable so we may as well bend over and take it? You seem to be scattered between the two.

      There are some things which can’t be un-invented, and we’ll be stuck with their consequences forever. Maybe paranoia is a healthy attitude in these cases.

    8. What is not a crime or reason for getting fired right now (like two guys loving each other) may very well be a crime or a reason for getting fired next year. Fascism is seriously on the rise, not only in the USA but also in Europe.
      Hell, the orange fascist already wants to strip employment rights from people who love their own gender.

      Maybe you’ve been seen with a known gay person. Are you gay? We saw you at the gay guy’s house, why should we think you’re straight? Well, we have an appointment scheduled with HR. HR also thinks you’ve been using the toilet a bit much. Maybe we should talk about the scheduled extension of your contract…

      1. Fascism is not ‘on the rise’. What a joke. I don’t think you know what fascism is. Apparently the people in power are not who you voted for. And you need to stop the bleeding by labeling your opponents as anti-gay, fascist, nationalist, you name it, in order to get back your grip on power. That might have worked in 2010. It will not work now. “Vote for me in 2020 (you racists and bigots)….” Funny how the people labeling everybody the last few years now need the votes of those they hate. Good luck.

  2. I followed the link “pausing in front of a facial recognition camera”, which leads to a DHS site where the explanation says “CBP discards all photos of U.S. Citizens within 12 hours of identity verification.” At first, that statement made me feel better. But then I wondered, that doesn’t really say that they don’t perform image analysis and store the extracted features in their database, does it? So they wouldn’t be able to run a newer extraction/learning algorithm on the image next year, but still, the “effects” of the image capture could very well end up in their DB.

  3. “Facial recognition is the new hotness”

    Ugh. That’s one way to put it I guess. Can we retire this one—especially for things like this which aren’t exactly something that’s receiving enthusiastic adoption by most people? More like people are being forced to adopt its use whether they like it or not. This isn’t some hot trend, it’s well-deserved controversy.

  4. Mike is a hacker, associated with a website that caters to hackers; how long before we never see posts from Mike here again? ;) Hackaday could probably make a fair amount of money by producing and selling Jolly Wrencher masks.

  5. When a malicious perpetrator gains access to your information, something you may have done that’s not necessarily bad but it gets released or used as blackmail from the perpetrator could damage your life. A potential employer may see it and disagree with what it is. There goes that job. Being constantly tracked could put you in a position where your every move could be predicted, when you’re most vulnerable and they’ll get you for everything you have.

  6. The People’s Republic of China.
    There, I’ve said it, the 800 pound gorilla in the room.
    Their “Social Credit” system is linked to facial recognition, (with Google’s help) and is tracking their citizens in their daily lives.

    1. It is coming here. Funny how the guy screaming bloody murder above about a fascist bogeyman has no concern about the Silicon Valley behemoth catering to the Chinas of the world. https://www.fastcompany.com/90394048/uh-oh-silicon-valley-is-building-a-chinese-style-social-credit-system … “The New York State Department of Financial Services announced earlier this year that life insurance companies can base premiums on what they find in your social media posts.”

      1. Help them out by PAYING to put a listening device and video camera in multiple places in your home in order to accomplish totally trivial things you could do with other devices you already own. Just dumb. While you’re at it, completely automate your home with IOT. There’s a great apartment hack scene in Mr. Robot about that and a great episode in the recent X-Files season, too.

  7. I thought uploading fitbit data to the cloud was basically creating a la carte kidney harvesting options for motivated organ harvesters who just had to wait for the fit individual to run past at the right time… Now they can put a face to the kidney too.

  8. Imagine being a Starbucks employee and having a Voight-Kampff attached to your Point of Sale machine. I used to get a report on how many times a shift I hit the delete button on my POS. Imagine getting feedback on smiles per hour or customer EQ response.
