ChatGPT, Bing, And The Upcoming Security Apocalypse

Most security professionals will tell you that it’s a lot easier to attack code systems than it is to defend them, and that this is especially true for large systems. The white hat’s job is to secure each and every point of contact, while the black hat’s goal is to find just one that’s insecure.

Whether black hat or white hat, it also helps a lot to know how the system works and exactly what it’s doing. When you’ve got the source code, either because it’s open-source, or because you’re working inside the company that makes the software, you’ve got a huge advantage both in finding bugs and in fixing them. In the case of closed-source software, the white hats arguably have the offsetting advantage that they at least can see the source code, and peek inside the black box, while the attackers cannot.

Still, if you look at the number of security issues raised weekly, it’s clear that even in the case of closed-source software, where the defenders should have the largest advantage, that offense is a lot easier than defense.

So now put yourself in the shoes of the poor folks who are going to try to secure large language models like ChatGPT, the new Bing, or Google’s soon-to-be-released Bard. They don’t understand their machines. Of course they know how the work inside, in the sense of cross multiplying tensors and updating weights based on training sets and so on. But because the billions of internal parameters interact in incomprehensible ways, almost all researchers refer to large language models’ inner workings as a black box.

And they haven’t even begun to consider security yet. They’re still worried about how to construct obscure background prompts that prevent their machines from spewing hate speech or pornographic novels. But as soon as the machines start doing something more interesting than just providing you plain text, the black hats will take notice, and someone will have to figure out defense.

Indeed, this week, we saw the first real shot across the bow: a hack to make Bing direct users to arbitrary (bad) webpages. The Bing hack requires the user to already be on a compromised website, so it’s maybe not very threatening, but it points out a possible real security difference between Bing and ChatGPT: Bing gives you links to follow, and that makes it a juicy target.

We’re right on the edge of a new security landscape, because even the white hats are facing a black box in the AI. So far, what ChatGPT and Codex and other large language models are doing is trivially secure – putting out plain text – but Bing is taking the first dangerous steps into doing something more useful, both for users and black hats. Given the ease with which people have undone OpenAI’s attempts to keep ChatGPT in its comfort zone, my guess is that the white hats will have their hands full, and the black-box nature of the model deprives them of their best hope. Buckle your seatbelts.

Simultaneous Invention, All The Time?

As Tom quipped on the podcast this week, if you have an idea for a program you’d like to write, all you have to do is look around on GitHub and you’ll find it already coded up for you. (Or StackOverflow, or…) And that’s probably pretty close to true, at least for really trivial bits of code. But it hasn’t always been thus.

I was in college in the mid 90s, and we had a lab of networked workstations that the physics majors could use. That’s where I learned Unix, and where I had the idea for the simplest program ever. It took the background screen color, in the days before wallpapers, and slowly random-walked it around in RGB space. This was set to be slow enough that anyone watching it intently wouldn’t notice, but fast enough that others occasionally walking by my terminal would see a different color every time. I assure you, dear reader, this was the very height of wit at the time.

With the late 90s came the World Wide Web and the search engine, and the world got a lot smaller. For some reason, I was looking for how to set the X terminal background color again, this time searching the Internet instead of reading up in a reference book, and I stumbled on someone who wrote nearly exactly the same random-walk background color changer. My jaw dropped! I had found my long-lost identical twin brother! Of course, I e-mailed him to let him know. He was stoked, and we shot a couple funny e-mails back and forth riffing on the bizarre coincidence, and that was that.

Can you imagine this taking place today? It’s almost boringly obvious that if you search hard enough you’ll find another monkey on another typewriter writing exactly the same sentence as you. It doesn’t even bear mentioning. Heck, that’s the fundamental principle behind Codex / CoPilot – the code that you want to write has been already written so many times that it will emerge as the most statistically likely response from a giant pattern-matching, word-word completion neural net model.

Indeed, stop me if you’ve read this before.

All About USB-C: Manufacturer Sins

People experience a variety of problems with USB-C. I’ve asked people online about their negative experiences with USB-C, and got a wide variety of responses, both on Twitter and on Mastodon. In addition to that, communities like r/UsbCHardware keep a lore of things that make some people’s experience with USB-C subpar.

In engineering and hacking, there’s unspoken things we used to quietly consider as unviable. Having bidirectional power and high-speed data on a single port with thousands of peripherals, using nothing but a single data pin – if you’ve ever looked at a schematic for a proprietary docking connector attempting such a feat, you know that you’d find horrors beyond comprehension. For instance, MicroUSB’s ID pin quickly grew into a trove of incompatible resistor values for anything beyond “power or be powered”. Laptop makers had to routinely resort to resistor and one-wire schemes to make sure their chargers aren’t overloaded by a laptop assuming more juice than the charger can give, which introduced a ton of failure modes on its own.

When USB-C was being designed, the group looked through chargers, OTG adapters, display outputs, docking stations, docking stations with charging functions, and display outputs, and united them into a specification that can account for basically everything – over a single cable. What could go wrong?

Of course, device manufacturers found a number of ways to take everything that USB-C provides, and wipe the floor with it. Some of the USB-C sins are noticeable trends. Most of them, I’ve found, are manufacturers’ faults, whether by inattention or by malice; things like cable labelling are squarely in the USB-C standard domain, and there’s plenty of random wear and tear failures.

I don’t know if the USB-C standard could’ve been simpler. I can tell for sure that plenty of mistakes are due to device and cable manufacturers not paying attention. Let’s go through the notorious sins of USB-C, and see what we can learn. Continue reading “All About USB-C: Manufacturer Sins”

Copyright Data, But Do It Right

Copyright law is a triple-edged sword. Historically, it has been used to make sure that authors and rock musicians get their due, but it’s also been extended to the breaking point by firms like Disney. Strangely, a concept that protected creative arts got pressed into duty in the 1980s to protect the writing down of computer instructions, ironically a comparatively few bytes of BIOS code. But as long as we’re going down this strange road where assembly language is creative art, copyright law could also be used to protect the openness of software as well. And doing so has given tremendous legal backbone to the open and free software movements.

So let’s muddy the waters further. Looking at cases like the CDDB fiasco, or the most recent sale of ADSB Exchange, what I see is a community of people providing data to an open resource, in the belief that they are building something for the greater good. And then someone comes along, closes up the database, and sells it. What prevents this from happening in the open-software world? Copyright law. What is the equivalent of copyright for datasets? Strangely enough, that same copyright law.

Data, being facts, can’t be copyrighted. But datasets are purposeful collections of data. And just like computer programs, datasets can be licensed with a restrictive copyright or a permissive copyleft. Indeed, they must, because the same presumption of restrictive copyright is the default.

I scoured all over the ADSB Exchange website to find any notice of the copyright / copyleft status of their dataset taken as a whole, and couldn’t find any. My read is that this means that the dataset is the exclusive property of its owner. The folks who were contributing to ADSB Exchange were, as far as I can tell, contributing to a dataset that they couldn’t modify or redistribute. To be a free and open dataset, to be shared freely, copied, and remixed, it would need a copyleft license like Creative Commons or the Open Data Commons license.

So I’ll admit that I’m surprised to have not seen permissive licenses used around community-based open data projects, especially projects like ADSB Exchange, where all of the software that drives it is open source. Is this just because we don’t know enough about them? Maybe it’s time for that to change, because copyright on datasets is the law of the land, no matter how absurd it may sound on the face, and the closed version is the default. If you want your data contributions to be free, make sure that the project has a free data license.

Now ChatGPT Can Make Breakfast For Me

The world is abuzz with tales of the ChatGPT AI chatbot, and how it can do everything, except perhaps make the tea. It seems it can write code, which is pretty cool, so if it can’t make the tea as such, can it make the things I need to make some tea? I woke up this morning, and after lying in bed checking Hackaday I wandered downstairs to find some breakfast. But disaster! Some burglars had broken in and stolen all my kitchen utensils! All I have is my 3D printer and laptop, which curiously have little value to thieves compared to a set of slightly chipped crockery. What am I to do!

Never Come Between A Hackaday Writer And Her Breakfast!

OK Jenny, think rationally. They’ve taken the kettle, but I’ve got OpenSCAD and ChatGPT. Those dastardly miscreants won’t come between me and my breakfast, I’m made of sterner stuff! Into the prompt goes the following query:

"Can you write me OpenSCAD code to create a model of a kettle?" Continue reading “Now ChatGPT Can Make Breakfast For Me”

Speak To The Machine

If you own a 3D printer, CNC router, or basically anything else that makes coordinated movements with a bunch of stepper motors, chances are good that it speaks G-code. Do you?

If you were a CNC machinist back in the 1980’s, chances are very good that you’d be fluent in the language, and maybe even a couple different machines’ specialized dialects. But higher level abstractions pretty quickly took over the CAM landscape, and knowing how to navigate GUIs and do CAD became more relevant than knowing how to move the machine around by typing.

a Reprap Darwin
Reprap Darwin: it was horrible, but it was awesome.

Strangely enough, I learned G-code in 2010, as the RepRap Darwin that my hackerspace needed some human wranglers. If you want to print out a 3D design today, you have a wealth of convenient slicers that’ll turn abstract geometry into G-code, but back in the day, all we had was a mess of Python scripts. Given the state of things, it was worth learning a little G-code, because even if you just wanted to print something out, it was far from plug-and-play.

For instance, it was far easier to just edit the M104 value than to change the temperature and re-slice the whole thing, which could take an appreciable amount of time back then. Honestly, we were all working on the printers as much as we were printing. Knowing how to whip up some quick bed-levelling test scripts and/or demo objects in G-code was just plain handy. And of course the people writing or tweaking the slicers had to know how to talk directly to the machine.

Even today, I think it’s useful to be able to speak to the machine in its native language. Case in point: the el-quicko pen-plotter I whipped together two weekends ago was actually to play around with Logo, the turtle language, with my son. It didn’t take me more than an hour or so to whip up a trivial Logo-alike (in Python) for the CNC: pen-up, pen-down, forward, turn, repeat, and subroutine definitions. Translating this all to machine moves was actually super simple, and we had a great time live-drawing with the machine.

So if you want to code for your machine, you’ll need to speak its language. A slicer is great for the one thing it does – turning an STL into G-code, but if you want to do anything a little more bespoke, you should learn G-code. And if you’ve got a 3D printer kicking around, certainly if it runs Marlin or similar firmware, you’ve got the ideal platform for exploration.

Does anyone else still play with G-code?

Irreproducible, Accumulative Hacks

Last weekend, I made an incredibly accurate CNC pen-plotter bot in just 20 minutes, for a total expenditure of $0. How did I pull this off? Hacks accumulate.

In particular, the main ingredients were a CNC router, some 3D-printed mounts that I’d designed and built for it, and a sweet used linear rail that I picked up on eBay as part of a set a few years back because it was just too good of a deal. If you had to replicate this build exactly, it would probably take a month or two of labor and cost maybe $2,000 on top of that. Heck, just tuning up the Chinese 6040 CNC machine alone took me four good weekends and involved replacing the stepper motors. Continue reading “Irreproducible, Accumulative Hacks”