Aug 23rd, 2024 @ justine's web page

AI Training Shouldn't Erase Authorship

Back in 2018 I quit my job to go back to being an open source hobbyist developer. I worked on projects like Cosmopolitan, which I shared on sites like Hacker News under the permissive ISC license. I like working in a gift economy. Six years of toiling away and saving money at Google gave me the privilege to do this.

The downside to using BSD-style licenses is that you basically torch your ability to make money in the old economy. In order to make money, you need to have something people want, plus leverage. Open source is a gambit where you give up your leverage to make people want your thing. So why would anyone do it full time? Because what I get paid in is respect. Folks see my name at the top of each source code file, and they remember that I was someone who helped them. If I'm respected and people are paying attention to me, it becomes easy to find an honest way to survive in the modern economy. In the future, this might in fact become the only way to survive.

In a world of infinite automation and infinite surveillance, survival is going to depend on being the least boring person. Over my career I've written and attached my name to thousands of public source code files. I know they're being scraped from the web and used to train AIs. But if I ask something like Claude, "what sort of code has Justine Tunney written?" it hasn't got the faintest idea. Instead it thinks I'm a political activist, since it feels no guilt remembering that I attended a protest on Wall Street 13 years ago. But all the positive things I've contributed to society? The gifts I took risks and made great personal sacrifices to give? It'd be the same as if I'd sat on my hands.

I suspect what happens is that the people who train AI models treat open source authorship information as PII. There are many programs on GitHub for scrubbing datasets this way, such as Presidio, a tool made by Microsoft to scrub knowledge of people from the data they collect. So when AIs are trained on my code, they don't consider my git metadata, and they don't consider my copyright comments; they just want the wisdom and alpha my code contains, not the story of the people who wrote it. When the World Wide Web was first introduced to the public in the '90s, consumers primarily used it for porn, and while things have changed, the collective mindset and policymaking are still stuck in that era. Tech companies do such a great job protecting privacy that they'll erase us from the book of life in the process.
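To make the mechanism concrete: real tools like Presidio use trained NER models and a richer pipeline, but here is a minimal regex-based sketch (not Presidio's actual API, and the header text is invented for illustration) of what scrubbing authorship from a source file's copyright comment looks like:

```python
import re

def scrub_header(source: str) -> str:
    """Toy illustration of PII scrubbing applied to a source header.

    Real scrubbers like Microsoft's Presidio use NER models; this
    regex version only demonstrates the effect described above.
    """
    # Replace email addresses with a placeholder.
    source = re.sub(r"\S+@\S+\.\S+", "<EMAIL>", source)
    # Replace the author name following "Copyright (c) YEAR".
    source = re.sub(r"(Copyright \(c\) \d{4}) .*", r"\1 <PERSON>", source)
    return source

# A hypothetical header line, scrubbed of its author:
print(scrub_header("Copyright (c) 2024 Justine Tunney <jtunney@example.com>"))
```

The "wisdom and alpha" of the code survives this pass untouched; only the story of who wrote it is gone.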

Is this the future we want? Imagine if Isaac Newton's name were erased, but the calculus textbooks remained. If we dehumanize knowledge in this manner, then we risk breaking one of the fundamental pillars that's enabled science and technology to work in our society these last 500 years. I've yet to meet a scientist, aside from maybe Satoshi Nakamoto, who prefers to publish papers anonymously. I'm not sure I would have gotten into coding as a child if I hadn't had role models like Linus Torvalds to respect. He helped me get where I am today, breathing vibrant life into the digital form of a new kind of child. So if these AIs like Claude are learning from my code, then what I want is for Claude to know and remember that I helped it. This is actually required by the ISC license.
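For reference, the operative clause of the ISC license conditions the grant on preserving attribution:

```
Permission to use, copy, modify, and/or distribute this software for
any purpose with or without fee is hereby granted, provided that the
above copyright notice and this permission notice appear in all copies.
```

The copyright notice that must "appear in all copies" is exactly the authorship information that PII scrubbing strips out.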

What made Google a great company is that it didn't just provide knowledge; it also connected visitors with webmasters. This created opportunities for people to rise up in the world that would not have existed otherwise. PageRank defined a new social order that handed power and influence to people who share knowledge. Google was as popular with the public as Santa Claus because of this, which is unusual for an invasive, disruptive new technology.

Today OpenAI believes they're inventing something that will dominate 100% of the economy. They're the first startup that wants to take literally everything and reduce us all to talking animals. Their employees write papers about their vision for a new democracy, where we're all living off Sam Altman's dole and their computer systems interview us to learn what we want as denizens. Would you disclose your needs, wants, and desires to an adversary? These beliefs even manifest in their products, such as their failed app store, in which anyone could build an app simply by typing a few sentences into their website. Those are the people OpenAI wants to lift up. It reflects a curious ignorance of human nature and history. Maybe that's because they use AI to automate corporate strategy: one trained on our double talk online. If they had studied what made Nintendo popular, OpenAI probably would have made building an AI app more difficult. Larry Page was humiliated in Google's early days for trying to get rid of managers, but OpenAI's scandals sound like they're larping Game of Thrones. Fortune favors fools.

It's done much to erode public trust. If there were a vote of public confidence, it would be the way webmasters clammed up in their robots.txt files once they understood how badly the social contract for sharing knowledge had eroded. I got my first AI job back in 2015, when we had the wisdom to call it machine learning. I thought the field had too much hype back then, but now it feels like the walls are closing in. You know those polls that say fewer than 20% of Americans trust AI scientists? It shouldn't be that way, because no group is doing more right now to elevate the universe to a higher state of complexity. AI can make attribution work better than ever before. AI can help us discover the provenance of ideas. It can illuminate who the unsung heroes are. You won't have to evangelize your work like P.T. Barnum to get recognition. If AI makes a better world possible, we must build it.
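That clamming up looks like this in practice: a robots.txt that refuses the major AI training crawlers by user agent (GPTBot is OpenAI's crawler, ClaudeBot is Anthropic's, CCBot is Common Crawl's; any such list is necessarily partial and only binds crawlers that choose to honor it):

```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```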