Preventing AI Plagiarism With .ASS Subtitling

January 25, 2025

Around two years ago, the world was inundated with news about how generative AI or large language models would revolutionize the world. At the time it was easy to get caught up in the hype, but in the intervening months these tools have done little in the way of productive work outside of a few edge cases, and mostly serve to burn tons of cash while turning the Internet into even more of a desolate wasteland than it was before. They do this largely by regurgitating human creations like text, audio, and video into inferior simulacrums and, if you still want to exist on the Internet, there’s basically nothing you can do to prevent this sort of plagiarism. Except feed the AI models garbage data like this YouTuber has started doing.

At least as far as YouTube is concerned, the worst offenders of AI plagiarism work by downloading the video’s subtitles, passing them through some sort of AI model, and then generating another YouTube video based off of the original creator’s work. Most subtitle files are the fairly straightfoward .srt filetype which only allows for timing and text information. But a more obscure subtitle filetype known as Advanced SubStation Alpha, or .ass, allows for all kinds of subtitle customization like orientation, formatting, font types, colors, shadowing, and many others. YouTuber [f4mi] realized that using this subtitle system, extra garbage text could be placed in the subtitle filetype but set out of view of the video itself, either by placing the text outside the viewable area or increasing its transparency. So now when an AI crawler downloads the subtitle file it can’t distinguish real subtitles from the garbage placed into it.

[f4mi] created a few scripts to do this automatically so that it doesn’t have to be done by hand for each one. It also doesn’t impact the actual subtitles on the screen for people who need them for accessibility reasons. It’s a great way to “poison” AI models and make it at least harder for them to rip off the creations of original artists, and [f4mi]’s tests show that it does work. We’ve actually seen a similar method for poisoning data sets used for emails long ago, back when we were all collectively much more concerned about groups like the NSA using automated snooping tools in our emails than we were that machines were going to steal our creative endeavors.

Thanks to [www2] for the tip!

57 thoughts on “Preventing AI Plagiarism With .ASS Subtitling”

osmarks says:

January 25, 2025 at 10:24 am

This won’t stop autotranscription from audio. And a right-thinking person wants to be in more training data for obvious reasons (wider exposure, shifting the text prior).

Report comment

Reply
1. Lord Elpus says:
  
  January 25, 2025 at 10:30 am
  
  If they’re using autotranscription then they’re already poisoning themselves, saving us the hassle.
  
  Report comment
  
  Reply
  1. osmarks says:
    
    January 27, 2025 at 8:41 am
    
    We had pretty good things for this (Whisper) back in 2022. I don’t think it’s a real problem now.
    
    Report comment
    
    Reply
2. Pat says:
  
  January 25, 2025 at 10:35 am
  
  “for obvious reasons (wider exposure”
  
  So it’s not copyright infringement if EVERYONE does it and it’s big enough… then it’s advertising?
  
  Anything that can increase the amount of effort that companies or creators need to go through is awesome, because it increases the possibility that laws can be passed to make this garbage explicitly illegal as opposed to trusting courts containing people who apparently can’t comprehend “if I just steal ENOUGH people’s work it can’t be stealing!”
  
  Report comment
  
  Reply
  1. Rick C says:
    
    January 26, 2025 at 9:19 am
    
    “So it’s not copyright infringement if EVERYONE does it and it’s big enough… then it’s advertising?”
    
    I’ve never heard of any AI emissions crediting whose work it’s using so any exposure a given person gets from having his content slurped up seems to be “none”.
    
    Report comment
    
    Reply
    1. JASON P MASCIO says:
      
      January 26, 2025 at 10:20 pm
      
      That’s absolutely incorrect factually!
      
      Report comment
      
      Reply
3. David says:
  
  January 25, 2025 at 12:02 pm
  
  Your rationale for wanting to be in training data is nigh-incomprehensible. Exposure isn’t relevant in a context that lacks attribution, and “shifting the text prior” is plain word salad when used WRT plagiarism.
  
  Perhaps you shouldn’t have farmed out your internet commenting to an LLM just yet.
  
  Report comment
  
  Reply
  1. osmarks says:
    
    January 27, 2025 at 8:40 am
    
    The LLM version of me (https://docs.osmarks.net/hypha/autogollark) has unfortunately not been deployed for general internet use due to various engineering issues.
    
    I don’t mean that they’ll talk about me with my name attached – though this does happen to a few people who happen to be particularly distinctive in their context. I mean exposure to my ideas – they are more likely to say things I would say, at least on the topics I have written about. “Shifting the text prior” is a more general version of this (see also https://www.lesswrong.com/s/N7nDePaNabJdnbXeE/p/vJFdjigzmcXMhNTsx): LLMs model agents and agency by modelling (some composite of many) humans. It is probably the case that the prior probability (non-RLHF) LLMs assign to any particular person being the author of the text they are completing is roughly proportional to the frequency of that person (and people semantically close to them) in the training data. So if you’re overrepresented in the training data, LLMs doing things are more likely to do the kind of things you would do, so you have more influence on the world.
    
    Report comment
    
    Reply
    1. phuzz says:
      
      January 28, 2025 at 1:39 am
      
      You sound like you’re ready to be introduced to Roko’s Basilisk: https://en.wikipedia.org/wiki/Roko%27s_basilisk
      Or possibly you should go outside and touch grass.
      
      Report comment
      
      Reply
4. Me says:
  
  January 25, 2025 at 5:41 pm
  
  Exposure is what you die of. No thank you.
  
  Report comment
  
  Reply
5. Norintha says:
  
  January 26, 2025 at 4:39 am
  
  “Wider exposure” how to tell me you aren’t a creative without saying you hate to think.
  
  Report comment
  
  Reply
6. Mike says:
  
  January 26, 2025 at 2:31 pm
  
  “Exposure”? Since when does AI ever credit the sources it steals from?
  
  Report comment
  
  Reply
  1. JASON P MASCIO says:
    
    January 26, 2025 at 10:27 pm
    
    It will generate the truth when manipulation has been applied; When you directly tell it too, for instance; (System: “list sources in e.g. MLA, APA, CHI, HVD”)
    
    Report comment
    
    Reply
  2. osmarks says:
    
    January 27, 2025 at 8:28 am
    
    To the things I say, not specifically my name.
    
    Report comment
    
    Reply
7. Jay Cop says:
  
  January 26, 2025 at 6:11 pm
  
  Autotranscription is easy to attack with adversarial examples. Pipe your audio through one before upload.
  
  Report comment
  
  Reply
  1. osmarks says:
    
    January 27, 2025 at 8:26 am
    
    I don’t think the widely used adversarial things actually work.
    
    Report comment
    
    Reply
Helena says:

January 25, 2025 at 11:13 am

Have you read this book? Because from the Amazon description it sounds like it’s just this guy advocating/speculating about how AI could be used. Does it actually have any examples of benefit? Not one-off tests, but cases where AI has been deployed for a while and shown to improve patient outcomes?

I also read the reviews to try to get a better sense of what’s in the book and it’s weird how they’re all fake… they’re all just “***** Compelling read!” and then they repeat the description of the book, either in their own words or literally just copied from above. But that seems about right for AI anyway.

Report comment

Reply
1. Agammamon says:
  
  January 25, 2025 at 11:33 am
  
  Odds are both the book itself and the post by Ostracus are AI-generated spam.
  
  And probably my post right here is too.
  
  Report comment
  
  Reply
  1. YoDrTentacles says:
    
    January 25, 2025 at 3:44 pm
    
    Dishonest reviews on Amazon?!? The nerve, sir.
    
    Report comment
    
    Reply
David says:

January 25, 2025 at 11:19 am

This trick won’t last long: It should be easy enough to write a program that will take the fancied-up .ass-format text, simulate how it would appear on the screen, then use only the text that is “inside the visible area of the screen” and “displayed prominently enough to be seen when displayed over the video itself” (enough contrast, large enough, etc.) as input to the AI.

Report comment

Reply
1. Agammamon says:
  
  January 25, 2025 at 11:34 am
  
  Evolution in action.
  
  Report comment
  
  Reply
2. Dan says:
  
  January 25, 2025 at 1:48 pm
  
  But unless everyone does this, they won’t bother – and the poisoning will also have legible effect.
  
  Report comment
  
  Reply
  1. Oli says:
    
    January 26, 2025 at 10:11 am
    
    Exactly. This isn’t meant to be used at scale. It’s for individual creators who are fed up with AI cash farms on YouTube using their content.
    
    Report comment
    
    Reply
Dave says:

January 25, 2025 at 12:37 pm

Make an .ass out of AI.

Report comment

Reply
1. Darey says:
  
  January 26, 2025 at 3:14 pm
  
  😉😄😄😄😄😄
  
  Report comment
  
  Reply
Goat says:

January 25, 2025 at 12:43 pm

Sounds like they have a lot of easy ways to get around this. They could lower the weights of subtitle data, they could run a check of audio transcription against the data before judging whether it should be used or not, or they could just ignore it completely. The most important part of video training is the video itself, so this isn’t as damaging as people think it is. There still aren’t any video models that include audio, and they get most their information from context on metadata outside of subtitle information.

Report comment

Reply
Charles Springer says:

January 25, 2025 at 1:01 pm

I wouldn’t call anything an “edge case”. It is the inbred offspring of “use case” and “corner case”. Maybe outlier will do. Or boundary instance or boundary sample.

Report comment

Reply
Nathan says:

January 25, 2025 at 2:44 pm

Those are some sus reviews. There’s no way those aren’t paid or generated.

Report comment

Reply
𐂀 𐂅 says:

January 25, 2025 at 3:53 pm

Yet another naive and already out of date scheme, will become increasingly futile too as reasoning models (text based) are already here and the next generation over the coming months will have visual reasoning. If you don’t want machines learning from your content, in the same way that humans do, you only have one (flawed) option, some nonexistent law that allows you to discriminate, plus a platform that cooperates with you and only shows your content to verified humans and at a rate that is match to human perception speeds. Even then you will have people with AI agents interacting with your content using the human’s credentials and you can’t stop that without getting into legal hot water for discriminating against people with disabilities.

Time you got your head around what just happened (r.e. DeepSeek etc.) we have just experienced a profound paradigm shift where human intelligence’s value has just imploded. Once your value to society, and the rewards it offered you, were closely correlated with your intelligence, this is no longer the case.

Report comment

Reply
1. Blah says:
  
  January 26, 2025 at 6:14 am
  
  Value to society has never been correlated with intelligence. What’s more important is how much people like you and perceive you as intelligent. Hence why names like Thomas Edison and Elon Musk and Mark Zuckerberg are known world-round, despite the fact that these guys are actually dumb as rocks and terrible engineers
  
  Report comment
  
  Reply
  1. 𐂀 𐂅 says:
    
    January 29, 2025 at 6:27 pm
    
    That is entirely false from an economic POV, only an insane person would claim otherwise.
    
    Report comment
    
    Reply
    1. Pat says:
      
      January 30, 2025 at 7:57 am
      
      see, proving the point right there
      
      Report comment
      
      Reply
Balders says:

January 25, 2025 at 4:56 pm

Ass? Sounds so wrong 😂

Report comment

Reply
1. Happymeal says:
  
  January 26, 2025 at 10:07 am
  
  Haha, my thoughts exactly! Though, it sounds like something a hacker would come up with, so it’s kind of amazing at the same time.
  
  Report comment
  
  Reply
2. Gil Perkins says:
  
  January 26, 2025 at 6:42 pm
  
  Back in the day , when getting anime from different places, I knew I was in for a real treat font and color-wise if it came with .ass subtitles. The modern industry still hasn’t caught up.
  
  Report comment
  
  Reply
3. Johnu says:
  
  January 27, 2025 at 1:30 am
  
  I’ve seen a suggestion that the term”AI” is replaced with “Plagiarised Information Synthesis System” which is a more fitting acronym.
  
  Report comment
  
  Reply
Zellers says:

January 25, 2025 at 5:03 pm

I’m pretty confident someone out there has built some accessibility solution for vision impaired people and is pretty pissed by this approach.

Report comment

Reply
1. Hobo Lobo says:
  
  January 25, 2025 at 6:25 pm
  
  I entirely understand if people don’t have time or desire to watch a video, but I’m still surprised how rarely that prevents them from commenting on things that are not only explicitly and repeatedly addressed in said video, but also literally mentioned in the summary article on hackaday.
  
  Shockingly, [f4mi] seems to be quite knowledgeable of all the shortcomings of her approach and as conscientious of the accessibility and other legitimate uses of her youtube videos, as you would hopefully expect of someone who not only seems to be quite smart and thoughtful but also literally does this for a living.
  
  Report comment
  
  Reply
  1. Aaron Myles Landwehr says:
    
    January 26, 2025 at 9:24 am
    
    I don’t believe screen readers for the visually impaired are really addressed in the article or the video. I’m not 100 percent sure on the latter since I listened to it on 1.5x speed and jumped pass all of the filler as marked by sponsorblock. All that’s stated is that it doesn’t affect people who need subtitles for accessibility reasons. How so? I’m not sure. I think maybe the author of the poisoning may thinks that screen readers for videos just digitally process the visual data and read the visible text? If so that’s definitely not universally the case and probably isn’t common. The reality is that screen reader software implementations for videos may just actively read the srt regardless of whether it’s visually on screen.
    
    It didn’t appear to me that really any of these types of shortcomings are addressed. On the face of it, of a screen reader could bypass the garbage data with the article/videos poisoning approach then the only thing data miners would have to do is the same bypass during a preprocess phase. If such screen readers cannot bypass the garbage data then the approach shouldn’t be used because it makes accessibility that much more difficult. Either case would seem to make this approach a dead end in practice.
    
    As a side note, I wish people on the Internet would stop implying others should spend 20 to 30 minutes watching a video accompanying a summary article on every article i read on every website. You are trying to distance yourself from the fact that you are implying this but you are indeed still implying to the other commenter. The funny thing is that the video is mostly filler and has maybe 3 to 5 minutes of actual content similar to many other videos on YouTube designed to maximize ad revenue and algorithm views. Practically speaking, people have better things to do than spend large swaths of time watching the accompanying videos on every article they read. It shouldn’t be necessary for the context anyway. I don’t know why folks are surprised that someone wouldn’t do so.
    
    Moreover, let’s suppose the author did address the issues with visual impairedness in detail in the video. Why on earth wouldn’t you just say that, what they said that addressed it in the video, and what timestamp it was addressed at? Imagine writing an essay or paper arguing something without providing any evidence for said argument. That’s equivalent writing a comment telling someone they are wrong without providing any details showing otherwise and telling someone to read the article or watch the video. You presumably put in the effort to watch the video so why is your comment as little effort as possible? Be kind and provide the details if you see going to say something. Don’t ask people to do all of the leg work themselves unnecessarily if you are making a claim.
    
    This kind of behavior happens on summaries of scientific papers all of the time as well. People respond to others telling them that what they asked/said is addressed in the paper, as if it’s not reasonable that a person didn’t read a 4 to 10 page paper, and then don’t even bother to provide where it’s addressed or how. It’s just bad commenting behavior… Once again, be kind and provide the details if you are going to say something or make a claim
    
    Report comment
    
    Reply
    1. Steve G says:
      
      January 26, 2025 at 9:43 pm
      
      Finally, some sense.
      
      Report comment
      
      Reply
    2. JASON P MASCIO says:
      
      January 26, 2025 at 10:39 pm
      
      👏🤌🫵
      
      Report comment
      
      Reply
mac says:

January 25, 2025 at 9:52 pm

Cool hack. I’m in love with that thumbnail.

Report comment

Reply
TG says:

January 25, 2025 at 11:13 pm

Glad to hear that in future history books “ASS” will have its place in the strange saga of adversarial anti-AI arms races (that inevitably will fail)

Report comment

Reply
1. xX_CoolDude_Xx says:
  
  May 23, 2025 at 8:18 am
  
  turns off global internet access, goes back to landlines, lives in the 50’s
  
  Report comment
  
  Reply
pelrun says:

January 25, 2025 at 11:42 pm

Everything old is new again. Remember pre-Google when search engines worked off simple keyword matching of the content of a webpage, and people would do keyword stuffing in the site text, changing the text colour to background so it wouldn’t be obvious to the end user?

Yeah, it didn’t work great then either.

Report comment

Reply
Ewald says:

January 26, 2025 at 2:40 am

The first paragraph is clearly a comment-bait, and if it isn’t he clearly hasn’t followed the news on AI. I work at a hospital and we already use many AI/LLM applications that save time and money and/or improve quality; and this is only still the beginning.
The new top AI models are often trained on high quality synthetic data so he’s probably two years late with his countermeasure.

Report comment

Reply
IronMew says:

January 26, 2025 at 4:54 am

“mostly serve to burn tons of cash while turning the Internet into even more of a desolate wasteland than it was before” I’m disappointed. I expect this kind of sensationalist garbage from normie media desperately hunting for fear-induced clicks, not Hackaday. We’ve all seen every tool ever made used for good as well as evil, and I’d expect the editors to know better than most that blame as well as merit should be given to the users, not the tools themselves.

Report comment

Reply
1. Ostracus says:
  
  January 26, 2025 at 6:02 am
  
  Maybe they’re equating AI with cryptocurrency. An easy mistake to make.
  
  Report comment
  
  Reply
2. John says:
  
  January 26, 2025 at 7:43 pm
  
  I just copied that quote to tear it apart, too. There have been plenty of people who have figured out how to use AI productively. The author’s denial of this or their inability to figure out how to use AI effectively does not prove the proclaimed point.
  
  Two things I know this author never does: language translation and computer programming. AI has helped me tremendously in both, and I know other people who claim the same.
  
  Report comment
  
  Reply
Gordon M Shephard Jr says:

January 26, 2025 at 5:13 am

What about the .bho (b*** hole) extension? Or .dhd? (d*** head) .cnt .bch .lsr .wkr .gth .stfu .esad

Report comment

Reply
1. Jii says:
  
  January 26, 2025 at 11:22 am
  
  “Butt” is now a word that needs censorship? Did you know that ass is also a synonym for donkey?
  
  And what exactly is your point anyway?
  
  Report comment
  
  Reply
Anon says:

January 26, 2025 at 6:48 am

I think it’s hilarious that people really think this stuff is AI. What a trendy buzzword. The algorithms being used are less intelligent than a game AI from twenty years ago. There’s nothing “generative” about it.

Report comment

Reply
1. Oli says:
  
  January 26, 2025 at 10:14 am
  
  Haven’t read the article, only watched her video, but she’s very specifically talking about people scraping her subtitles, then using GenAI to rewrite her script. That’s very much generative.
  
  Report comment
  
  Reply
  1. John says:
    
    January 26, 2025 at 7:45 pm
    
    And if you think game AI from 20 years ago was better, you don’t know anything about AI today nor game AI from 20 years ago. What a parrot.
    
    Report comment
    
    Reply
KaidenP says:

January 26, 2025 at 7:40 am

This seems good in theory but what about people who are both hearing impaired and vision impaired? I can’t say for certain, but maybe they would use software that transcribed the subtitles into brail? Would this software then get confused?

Report comment

Reply
1. xX_CoolDude_Xx says:
  
  May 23, 2025 at 8:16 am
  
  dude.. they would see or hear the video, audio, or subtitles…
  
  Report comment
  
  Reply
Dracolytch says:

January 29, 2025 at 8:55 am

“but in the intervening months these tools have done little in the way of productive work outside of a few edge cases”

A few edge cases? This stuff has literally transformed the way I work. I can voice-dictate notes, and have it organize and turn those notes into something semi-professional. Is it a fully finished product? Hell no… But it’s 80%-90% there with essentially zero effort on my part.

Code? Sure, it codes like an entry-level programmer. That’s fine. Because it can do it in any language that I want. Sure, I haven’t used C in a decade, but I understand enough to be able to read the output and see if it does what I want. Same with Excel formulas… I don’t remember a lot of that, but I can communicate what I want and have an AI write it for me.

Linux is another one… I don’t use it for my daily driver anymore, and I’m rusty as hell, but just yesterday it saved me HOURS trying to figure out a service initialization order issue for an app on a Raspberry Pi.

Yes, there are valid ethical concerns. Yes, the output can be improved… But to say that it’s only useful for a few edge cases is denying the value of these systems, which is only doing yourself a disservice. Burying your head in the sand isn’t going to address that.

~D

Report comment

Reply

Hackaday

Preventing AI Plagiarism With .ASS Subtitling

57 thoughts on “Preventing AI Plagiarism With .ASS Subtitling”

Leave a ReplyCancel reply

Search

Never miss a hack

If you missed it

Analog Surround Sound Was Everywhere, But You Probably Didn’t Notice

The Channel Crossing Bridge That Never Was

Built-In Batteries: A Daft Idea With An Uncertain Future

What Happened To Running What You Wanted On Your Own Machine?

Ore Formation: Return Of The Revenge Of The Fluids

Our Columns

Know Audio: Lossy Compression Algorithms And Distortion

Exploding The Mystical Craftsman Myth

Hackaday Links: October 26, 2025

Get Ready For Supercon

Hackaday Podcast Episode 343: Double Component Abuse, A Tinkercad Twofer, And A Pair Of Rants

57 thoughts on “Preventing AI Plagiarism With .ASS Subtitling”

Leave a ReplyCancel reply

Search

Never miss a hack

Subscribe

If you missed it

Our Columns