Dump A Code Repository As A Text File, For Easier Sharing With Chatbots

April 14, 2024

Some LLMs (Large Language Models) can act as useful programming assistants when provided with a project’s source code, but experimenting with this can get a little tricky if the chatbot has no way to download from the internet. In such cases, the code must be provided by either pasting it into the prompt or uploading a file manually. That’s acceptable for simple things, but for more complex projects, it gets awkward quickly.

To make this easier, [Eric Hartford] created github2file, a Python script that outputs a single text file containing the combined source code of a specified repository. This text file can be uploaded (or its contents pasted into the prompt), making it much easier to share code with chatbots.
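
To give a feel for the idea, here is a minimal sketch of combining a project's source files into one block of text. It is not github2file's actual code: it assumes the repository is already checked out locally, and the `dump_repo` name, the `# File:` header format, and the default extension list are all made up for illustration.

```python
from pathlib import Path
from typing import Iterable


def dump_repo(root: str, extensions: Iterable[str] = (".py",)) -> str:
    """Concatenate every matching file under `root` into one string.

    Hypothetical helper for illustration only; not part of github2file.
    """
    root_path = Path(root)
    wanted = {ext.lower() for ext in extensions}
    parts = []
    # sorted() keeps the output order stable between runs
    for path in sorted(root_path.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in wanted:
            continue
        # Prefix each file with its relative path so the chatbot
        # can tell where one file ends and the next begins.
        rel = path.relative_to(root_path).as_posix()
        text = path.read_text(encoding="utf-8", errors="replace")
        parts.append(f"# File: {rel}\n{text.rstrip()}\n")
    return "\n".join(parts)
```

Writing the result to disk is then a one-liner, such as `Path("repo.txt").write_text(dump_repo("my_project"), encoding="utf-8")`, where both names are placeholders.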

It’s becoming increasingly clear that these tools represent more of an evolution than a revolution, and there are useful roles chatbots can play in programming. Some available chatbots have coding in mind; others do not. But hackers being hackers, we naturally want to experiment for ourselves regardless of a product’s intended uses, and a tool like this makes it easier to do that. Just remember that their work is, for now, often at the intern level.

16 thoughts on “Dump A Code Repository As A Text File, For Easier Sharing With Chatbots”

Hamish says:

April 14, 2024 at 6:26 am

Doing this to a github repo makes sense, and is a definite hack for the current gen of LLM systems. I have a hotkey’d script in VS Code that copies every file with some specified extensions into the clipboard with the filename and directory prepended as a comment. I can’t wait for this not to be necessary though!

zoobab says:

April 14, 2024 at 6:52 am

Tried with a C project, ended up with an empty file. I think it only supports Python or Go projects.

Cyna says:

April 14, 2024 at 7:17 am

So you supply a bot with all the source which it can then “creatively” incorporate into others’ questions about similar solutions? Without licence and without looking like the original, of course. Brilliant…

1. JoMill says:
  
  April 14, 2024 at 1:04 pm
  
  My thoughts exactly.
  
2. Josh says:
  
  April 14, 2024 at 2:43 pm
  
  You can choose to run your own open source model, or trust a service that doesn’t use your data for training.
  
  1. Ian says:
    
    April 21, 2024 at 3:42 pm
    
    “open source”…no.
    That doesn’t mean you can do anything you want with the code.
    
    Does the model give attribution to every project it stole source from in training?
    
    AI generation is theft.
    At best, it is simply “laundering” material so you feel okay about it.
    
limroh says:

April 14, 2024 at 7:32 am

> … but experimenting with this can get a little tricky if the chatbot has no way to download from the internet.

Not reading the article properly. Brilliant…

Ramy Mohamed says:

April 14, 2024 at 9:00 am

This function in the Python script only knows the file extensions for projects written in Python or Go, so the tool won’t work for projects in any other language.
from typing import List

def get_language_extensions(language: str) -> List[str]:
    language_extensions = {
        "python": [".py", ".pyw"],
        "go": [".go"],
    }
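
The quoted excerpt stops before the function returns anything, so the following is only a sketch of how such a lookup could be rounded out. The `"c"` entry and the explicit error for unknown languages are hypothetical additions for illustration, not something the script is known to do.

```python
from typing import Dict, List

LANGUAGE_EXTENSIONS: Dict[str, List[str]] = {
    "python": [".py", ".pyw"],
    "go": [".go"],
    # Hypothetical addition, not in the quoted snippet:
    "c": [".c", ".h"],
}


def get_language_extensions(language: str) -> List[str]:
    """Return the file extensions for a language, case-insensitively."""
    key = language.lower()
    if key not in LANGUAGE_EXTENSIONS:
        # Failing loudly is an assumption made here; it avoids silently
        # producing an empty output file for an unsupported language.
        supported = ", ".join(sorted(LANGUAGE_EXTENSIONS))
        raise ValueError(
            f"unsupported language {language!r}; supported: {supported}"
        )
    return LANGUAGE_EXTENSIONS[key]
```

With a table like this, supporting another language means adding one dictionary entry.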

Neverm|nd says:

April 14, 2024 at 10:24 am

Ok guys, I haven’t had the chance to try it yet, but you could run OpenDevin entirely locally… Nothing dives off into the cloud…

HaHa says:

April 14, 2024 at 1:28 pm

Anybody know of particularly bad github repos we can use to poison the LLM well?
Dead huge steaming piles of abandoned code with only obsolete and incorrect comments…

How about MySQL? I know it stinks to heaven as a tool, never had a reason to look at the guts.
What’s the _worst_ JS library? The one with hard coded references to obsolete versions of other, almost as bad, libraries?

How to search GitHub for such crap?
Large codebase + maximum 1 contributor for time + maximum historic contributors + bugs + least/worst commented + least average time for prospective coders to go from ‘curious’ to ‘run away’.
Write that standard expression.

What if we train a coding LLM with TempleOS source?
Will it go crazier?
Is that how Skynet is born?
That’s probably a bad idea.

1. Nick says:
  
  April 15, 2024 at 11:54 pm
  
  Just take existing code and randomly swap lines between functions, and files.
  
  LLMs aren’t checking that the code compiles. Nor do they check that imports are valid.
  
  All they do is fuzzy pattern matching between files, function names, and variable names.
  
2. Fancy a Snack says:
  
  April 16, 2024 at 10:26 am
  
  Somebody dig up Elon’s code from his X/Paypal days…
  
  1. HaHa says:
    
    April 16, 2024 at 3:45 pm
    
    Still butthurt over the twitter files?
    
    Just deny they exist.
    Stick fingers in ears and go ‘LALALA’.
    SOP
    
dustin says:

April 14, 2024 at 4:50 pm

I made this, which works with a GitHub repo, a zip file, or a folder on your PC. GH link at top left.

https://gh-repo-dl.cottonash.com/

Greg A says:

April 14, 2024 at 6:21 pm

kind of solves a non-problem. the reason it’s tricky to feed large amounts of source into these systems is that they can’t do anything meaningful with large amounts of source. it’s like replacing the output fuse on a power supply to increase its current capacity. it won’t work.

Charlie says:

April 15, 2024 at 3:33 am

I dove into this a bit a few weeks ago. If you can feed the entire repository to the LLM, then yes, do this. But on the not-so-off chance that there is too much content for the input, you need to go a different route. It also matters how the chatbot handles this input, and formatting matters. Google has a pretty basic repository that converts your content into embeddings, and the relevant chunks based on the weights are fed to Gemini along with your prompt. But it’s still only feeding the model pieces of the content, and if the content is related in ways more than just similarity of text, it won’t be able to make strategic connections. There are more advanced methods emerging and I have a feeling that OpenAI uses some of these when you feed it files directly, but still those are limited in size. Seems like we’re still a ways out from being able to dump in large amounts of content for real analysis.
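
The chunk-retrieval idea can be sketched in a few lines. This is a toy under loud assumptions: plain word counts stand in for real embeddings, and the `embed`, `cosine`, and `top_chunks` names are invented here rather than taken from Google's repository or any other tool.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def top_chunks(chunks: List[str], query: str, k: int = 2) -> List[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)
    return ranked[:k]
```

Only the selected chunks would be sent along with the prompt, which is exactly the limitation described: code that is related by structure rather than by shared words may never be retrieved.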


Hackaday
