Dump A Code Repository As A Text File, For Easier Sharing With Chatbots

Some LLMs (Large Language Models) can act as useful programming assistants when provided with a project’s source code, but experimenting with this can get a little tricky if the chatbot has no way to download from the internet. In such cases, the code must be provided by either pasting it into the prompt or uploading a file manually. That’s acceptable for simple things, but for more complex projects, it gets awkward quickly.

To make this easier, [Eric Hartford] created github2file, a Python script that outputs a single text file containing the combined source code of a specified repository. This text file can be uploaded (or its contents pasted into the prompt), making it much easier to share code with chatbots.
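
To give a rough idea of how such a tool works, here is a minimal sketch of the same concept in Python (not the actual github2file code): walk an already-cloned repository, keep files with whitelisted extensions, and write them into one text file with a path header before each file. The extension list, skip list, and output filename below are placeholders.

    # Minimal sketch of the idea behind github2file, not the real script.
    import os

    EXTENSIONS = (".py", ".go")          # placeholder whitelist
    SKIP_DIRS = {".git", "__pycache__"}  # directories not worth sending to a chatbot

    def dump_repo(repo_path: str, output_file: str) -> None:
        """Concatenate whitelisted source files into one text file with path headers."""
        with open(output_file, "w", encoding="utf-8") as out:
            for root, dirs, files in os.walk(repo_path):
                dirs[:] = [d for d in dirs if d not in SKIP_DIRS]
                for name in sorted(files):
                    if not name.endswith(EXTENSIONS):
                        continue
                    path = os.path.join(root, name)
                    rel = os.path.relpath(path, repo_path)
                    out.write(f"# ---- {rel} ----\n")
                    with open(path, encoding="utf-8", errors="replace") as src:
                        out.write(src.read())
                    out.write("\n\n")

    if __name__ == "__main__":
        dump_repo(".", "combined_source.txt")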

It’s becoming increasingly clear that these tools represent more of an evolution than a revolution, and there are useful roles chatbots can play in programming. Some available chatbots have coding in mind. Others do not. But hackers being hackers, we naturally want to experiment for ourselves regardless of a product’s intended uses, and a tool like this makes it easier to do that. Just remember their work — for now — is often at the intern level.

16 thoughts on “Dump A Code Repository As A Text File, For Easier Sharing With Chatbots”

  1. Doing this to a github repo makes sense, and is a definite hack for the current gen of LLM systems. I have a hotkey’d script in VS Code that copies every file with some specified extensions into the clipboard with the filename and directory prepended as a comment. I can’t wait for this not to be necessary though!
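
    If anyone wants to roll their own outside of VS Code, here is a rough Python sketch of that clipboard trick. The pyperclip library and the extension list are just my assumptions; swap in whatever clipboard route and file types you actually use.

    # Rough sketch only: grab files with certain extensions, prepend the relative
    # path as a comment, and put the whole lot on the clipboard.
    import os
    import pyperclip  # third-party clipboard helper, assumed here

    EXTENSIONS = (".py", ".js")  # placeholder extension list

    def copy_sources(root: str = ".") -> None:
        parts = []
        for dirpath, _dirs, files in os.walk(root):
            for name in sorted(files):
                if name.endswith(EXTENSIONS):
                    path = os.path.join(dirpath, name)
                    rel = os.path.relpath(path, root)
                    with open(path, encoding="utf-8", errors="replace") as f:
                        parts.append(f"# {rel}\n{f.read()}")
        pyperclip.copy("\n\n".join(parts))

    if __name__ == "__main__":
        copy_sources()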

  2. So you supply a bot with all the source, which it can then “creatively” incorporate into others’ questions about similar solutions? Without licence and without looking like the original, of course. Brilliant…

      1. “open source”…no.
        That doesn’t mean you can do anything you want with the code.

        Does the model give attribution to every project it stole source from in training?

        AI generation is theft.
        At best, it is simply “laundering” material so you feel okay about it.

  3. This method in the Python script only covers projects written in Python or Go, so it won’t work for repositories in any other language:
    def get_language_extensions(language: str) -> List[str]:
        language_extensions = {
            "python": [".py", ".pyw"],
            "go": [".go"],
        }
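
    If you want it to grab other languages, extending that mapping looks like the obvious tweak. Here is a sketch of how that could look; the extra extension lists are examples of mine, not something pulled from the actual script:

    # Sketch of a more general mapping; the extension lists here are illustrative.
    from typing import List

    LANGUAGE_EXTENSIONS = {
        "python": [".py", ".pyw"],
        "go": [".go"],
        "c": [".c", ".h"],
        "rust": [".rs"],
        "javascript": [".js", ".mjs"],
    }

    def get_language_extensions(language: str) -> List[str]:
        try:
            return LANGUAGE_EXTENSIONS[language.lower()]
        except KeyError:
            raise ValueError(f"Unsupported language: {language}") from None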

  4. Anybody know of particularly bad github repos we can use to poison the LLM well?
    Dead huge steaming piles of abandoned code with only obsolete and incorrect comments…

    How about mySQL? I know it stinks to heaven as a tool, never had a reason to look at the guts.
    What’s the _worst_ JS library? The one with hard coded references to obsolete versions of other, almost as bad, libraries?

    How to search Github for such crap?
    Large codebase + maximum 1 contributor at a time + maximum historic contributors + bugs + least/worst commented + least average time for prospective coders to go from ‘curious’ to ‘run away’.
    Write that search expression.

    What if we train a coding LLM with TempleOS source?
    Will it go crazier?
    Is that how Skynet is born?
    That’s probably a bad idea.

    1. Just take existing code and randomly swap lines between functions and files.

      LLMs aren’t checking that the code compiles. Nor do they check that imports are valid.

      All they do is fuzzy pattern matching between files, function names, and variable names.

  5. kind of solves a non-problem. the reason it’s tricky to feed large amounts of source into these systems is that they can’t do anything meaningful with large amounts of source. it’s like replacing the output fuse on a power supply to increase its current capacity. it won’t work.

  6. I dove into this a bit a few weeks ago. If you can feed the entire repository to the LLM, then yes, do this. But on the not-so-off chance that there is too much content for the input, you need to go a different route. How the chatbot handles that input also matters, and so does formatting. Google has a pretty basic repository that converts your content into embeddings, and the most relevant chunks, based on the weights, are fed to Gemini along with your prompt. But it’s still only feeding the model pieces of the content, and if the content is related in ways beyond simple textual similarity, it won’t be able to make strategic connections. There are more advanced methods emerging, and I have a feeling that OpenAI uses some of these when you feed it files directly, but those are still limited in size. Seems like we’re still a ways out from being able to dump in large amounts of content for real analysis.
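
    For anyone curious what that retrieval step looks like, here is a toy sketch of the chunk-and-retrieve idea. The "embedding" is just a bag-of-words vector and the chunking is naive line slicing, purely for illustration; real systems use learned embeddings and smarter splitting.

    # Toy sketch of retrieval-augmented prompting: split source into chunks,
    # score each chunk against the question, and keep the best few for the prompt.
    import math
    import re
    from collections import Counter

    def embed(text: str) -> Counter:
        # Crude stand-in for a real embedding model: word counts.
        return Counter(re.findall(r"[a-zA-Z_]+", text.lower()))

    def cosine(a: Counter, b: Counter) -> float:
        dot = sum(a[t] * b[t] for t in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def top_chunks(source: str, question: str, chunk_lines: int = 20, k: int = 3):
        lines = source.splitlines()
        chunks = ["\n".join(lines[i:i + chunk_lines]) for i in range(0, len(lines), chunk_lines)]
        q = embed(question)
        return sorted(chunks, key=lambda c: cosine(embed(c), q), reverse=True)[:k]

    # The selected chunks get pasted into the prompt ahead of the question,
    # which is roughly what the embedding-based front ends do for you.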
