Living In The (LLM) Past

In the early days of AI, a common example program was the hexapawn game. This extremely simplified version of a chess program learned to play with your help. When the computer made a bad move, you’d punish it. However, people quickly realized they could punish good moves to ensure they always won against the computer. Large language models (LLMs) seem to know “everything,” but everything is whatever happens to be on the Internet, seahorse emojis and all. That got [Hayk Grigorian] thinking, so he built TimeCapsule LLM, an AI trained only on historical data.

Sure, you could tell a modern chatbot to pretend it was in, say, 1875 London and answer accordingly. However, you have to remember that chatbots are statistical in nature, so they could easily slip in modern knowledge. Since TimeCapsule only knows data from 1875 and earlier, it will happily tell you that travel to the moon is impossible, for example. If you ask a traditional LLM to roleplay, it will often hint at things you know to be true but that no one of that particular time period would have known.

Chatting with ChatGPT and telling it that it was a person living in Glasgow in 1200 limited its knowledge somewhat. Yet it was also able to hint about North America and the existence of the atom. Granted, the Norse apparently found North America around the year 1000, and Democritus wrote about indivisible matter in the fifth century. But that knowledge would not have been widespread among common people in the year 1200. Training on period texts would surely give a better representation of a historical person.

The model uses texts from 1800 to 1875 published in London. In total, there is about 90 GB of text files in the training corpus. Is this practical? There is academic interest in recreating period-accurate models to study history. Some also see it as a way to track both biases of the period and contrast them with biases found in data today. Of course, unlike the Internet, surviving documents from the 1800s are less likely to have trivialities in them, so it isn’t clear just how accurate a model like this would be for that sort of purpose.
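At its core, building such a corpus comes down to a publication-date cutoff. Here is a minimal sketch of that filtering step, assuming each document carries publication-year metadata; the actual TimeCapsule pipeline is not described here, so the names and structure below are purely illustrative:

```python
# Hypothetical sketch: keep only documents published inside the target
# window (the real TimeCapsule corpus tooling may work quite differently).
CUTOFF_YEAR = 1875

def in_period(doc: dict, start: int = 1800, end: int = CUTOFF_YEAR) -> bool:
    """Return True only for documents dated within the target window."""
    year = doc.get("year")
    return year is not None and start <= year <= end

docs = [
    {"title": "The Times leader", "year": 1860},
    {"title": "Wireless telegraphy primer", "year": 1902},  # post-cutoff: dropped
    {"title": "Undated pamphlet", "year": None},            # unknown date: dropped
]

corpus = [d for d in docs if in_period(d)]
```

Note that documents with unknown dates are dropped rather than guessed at, since a single anachronistic text could leak later knowledge into the model.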

Instead of reading the news, LLMs can write it. Just remember that the statistical nature of LLMs makes them easy to manipulate during training, too.


Featured Art: Royal Courts of Justice in London about 1870, Public Domain

23 thoughts on “Living In The (LLM) Past”

  1. If they did this training, while doing a dedicated parallel course of training on similar modern texts, with a lot of personal and biographical and (especially?) professional data about the authors of the modern training texts, it is not that far-fetched to think that a chatbot which is representative of a typical author of the antique era could be produced. A common street person, probably not so much,

    though there is plenty of sensational garbage in our culture to mirror the broadsheets and ballads of the middle ages as well, I would think.

    1. Which begs the question: how many diaries of ancestors are forgotten in attics and never digitized? Previously those had little value, but for this project they would. Certainly if you look at gender bias: diaries from average women of that time would improve the model. Hopefully those diaries will be rescued and digitized so those personal writing styles and memories won’t be lost.

    2. “it is not that far-fetched to think that a chatbot which is representative of a typical author of the antique era could be produced.”

      Uh. No. It’d produce author-like output from a typical author of that time. Authors don’t talk like they write. They’re not trying to be conversational. Even when you’re writing letters back and forth with someone, you’re not writing the way you converse.

      The best examples you could get would be from plays, but even there it’s limited.

  2. This is great, both as a DIY project in its own right, and also as a helpful way to show how the sausage is made. Too many of my technical friends are duping themselves into believing LLMs are experiencing the world, despite demonstrably neither learning nor mutating. Looking at how the training data is processed to become a model helps dismiss the illusion.

  3. I find this concept fascinating. Who knows what knowledge lies buried in pre-1900 text. If enough newspapers, books, letters, and periodicals were assembled, there might actually be enough data to train a model that performs well.

    1. I was originally going to say that this wouldn’t happen, as we’ve refined and grown our knowledge over time. However, I just recalled that someone discovered something decades before it found any use, and at the time it was only thought of as a curiosity. It was someone much later rediscovering the work and realising it could solve a massive modern problem. I’m heading to bed now; it might come to me in my sleep.

  4. I can see a lot of interesting uses for this. It’s always been fun to speculate about the “Great Man” vs “social inevitability” theories of history. We could explore with this by giving an 1800s LLM an interface it can call to “perform a physics experiment” (i.e. the 1800s LLM describes an experimental setup and a modern LLM responds with realistic experiment results). Then see whether the 1800s LLMs can “invent” quantum mechanics on their own.

    If it gets there relatively quickly, using “experiments” similar to those of Bohr, Planck, Rutherford et al., then we would have some evidence that quantum mechanics was a logical next step as opposed to a miraculous leap.

    Of course, it was difficult and expensive to get published in the 1800s, so the training corpus doesn’t exactly reflect the full milieu of the time.
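    The two-model setup proposed above could be sketched as a simple dialogue loop: the period-limited model proposes an experiment, and a modern model plays the role of the laboratory, reporting back realistic results. Both models are stubbed out with placeholder functions here, and every name is hypothetical:

```python
# Illustrative sketch of the proposed loop. Real LLM calls would replace
# these stand-in functions; nothing here reflects an actual implementation.
def period_model(transcript):
    """Stand-in for the 1800s-trained LLM: propose the next experiment."""
    return f"experiment {len(transcript) + 1}: vary the cathode voltage"

def simulator_model(proposal):
    """Stand-in for the modern LLM acting as 'nature'/the laboratory."""
    return f"observed result for '{proposal}'"

def run_dialogue(rounds=3):
    """Alternate proposals and results, accumulating a transcript."""
    transcript = []
    for _ in range(rounds):
        proposal = period_model(transcript)
        transcript.append((proposal, simulator_model(proposal)))
    return transcript
```

    Feeding the growing transcript back to the period model is what would let it (in principle) build on earlier results, which is the interesting part of the thought experiment.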

    1. +1

      The LLMs can’t simulate the wandering mind of a dreamer or a novelist, though.
      For this to happen, it would need to read between the lines and apply knowledge and experience from totally different fields.
      Including children’s stories, lullabies and fairy tales. ;)
      And even that doesn’t suffice, maybe, because some real people had real hallucinations that might have inspired them (ahem).

      As is, the resulting LLM will rather be a mirror of the scientific society of a given era, at best.
      A manifestation of things that have been accepted as valid and true by the masses of the day.
      It will be the equivalent of common knowledge of the ambitious readers of newspapers and scientific books of the given time, maybe.

      Which, on its own, can turn out to be very interesting and entertaining, though.
      Especially if it manages to talk as eloquently as these ladies and gentlemen used to do.
      Assuming that users of today still have the mental capacity to follow. 🥲

    2. as Bohr, Planck, Rutherford et al

      The fact that you’re listing multiple people here is already kinda disproving the “great man” theory…

      The argument goes that the mode of development shifts from individuals to groups as social complexity goes up. It was easier for a single inventor to make groundbreaking discoveries when humanity knew basically nothing, as opposed to today when there’s too much knowledge for any one person to hold in their head.

    3. Nope. No chance. Not a prayer.

      The reason why something like this is going to be inherently limited is that “written works of the 1800s” is not the Internet. Not just in terms of scale, but in terms of what they were intended for. We use the Internet for basic communication. Like, extremely basic. You email people that are 10 feet from you.

      Nowadays papers get published for practically anything, because it’s cheap and easy and it’s the best way to document and preserve information. They did not do that then. Most learning and development happened in person, which is why you got development in clusters and bursts. You didn’t communicate via papers. You published results in papers. But the development was either in person or private communication.

      Go and read papers from the 1800s. They’re out there, and they’ve been translated. They don’t read like modern papers. No plots. Few equations. Lots of “obviously” and “it is clear” and “can easily see.” You get tons of results that happen shortly after physical society meetings. And you don’t have records of communication there. Because they were in person. It’s even worse if you look at actual experimental descriptions and such. Good luck even understanding what they were doing. You weren’t expected to be able to replicate stuff from papers.

      I mean, LLMs just absolutely suck at understanding modern research and they’ve got the benefit of a much larger dataset. The place where they’re most useful in modern science is essentially data mining – if you try to find meaning from something that’s totally not understood by scientists now, there’s just nothing there. Because there’s nothing to find.

    4. A significant factor I believe LLMs would struggle with here is taking into account the manufacturing abilities of the time, particularly in terms of accuracy and consistency.

      Measuring anything & everything with any level of precision was so much more difficult at the time; just consider relying on the Greenwich Time Lady to set your clock.

  5. This is great. Mainstream LLMs certainly are bad at anachronisms. This would actually serve as a good model to train bigger models against to learn how to not have that problem.

    It’s also very close to another thing we need: an LLM trained only on public domain works. It just needs to be put through some proper reinforcement learning to be helpful.

  6. [list games]

    Chess
    Checkers
    Backgammon
    Poker
    Fighter Combat
    Guerilla Engagement
    Desert Warfare
    Air-To-Ground Actions
    Theaterwide Tactical Warfare
    Theaterwide Biotoxin and Chemical Warfare
    Global Thermonuclear War
