The OpenAI hack serves as a reminder that AI startups are gold mines for cybercriminals

There's no reason to worry that the recently disclosed breach of OpenAI's systems exposed your private ChatGPT conversations. The attack itself appears to have been shallow, which is some comfort, but it serves as a reminder that AI companies have quickly become some of the most attractive targets for cybercriminals.

Jul 6, 2024

The hack, which former OpenAI employee Leopold Aschenbrenner mentioned on a recent podcast, was covered in greater depth by The New York Times. Although Aschenbrenner called it a "major security incident," anonymous company sources told the Times that the hacker only gained access to an employee discussion forum. (I contacted OpenAI for confirmation and comment.)


No security breach should be treated as trivial, and eavesdropping on internal OpenAI development discussions certainly has value. But it is a far cry from a hacker gaining access to internal systems, models in progress, confidential roadmaps, and so on.

Even so, it ought to worry us, and not just because China or other adversaries might overtake us in the AI arms race. The simple fact is that these AI companies have become gatekeepers to enormous amounts of extremely valuable data.

Consider three kinds of data that OpenAI, and to a lesser extent a handful of other AI companies, has created or has access to: high-quality training data, bulk user interactions, and customer data.

It's unclear exactly what training data the firms have, because they are so secretive about their hoards. But it is a mistake to assume their datasets are merely enormous piles of scraped web data. Yes, they use web scrapers and datasets like the Pile, but turning that raw data into something that can train a model like GPT-4o is an enormous undertaking, one that can be only partially automated and that demands a huge number of human labor hours.
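To make that curation effort concrete, here is a minimal sketch of the kind of automated filtering pass such a pipeline might run before human reviewers take over. The thresholds, the heuristics, and the `clean_corpus` helper are illustrative assumptions, not a description of OpenAI's actual process.

```python
import hashlib
import re

def looks_like_noise(text: str) -> bool:
    """Heuristic quality checks; real pipelines use many more signals."""
    if len(text) < 200:                  # too short to be useful prose
        return True
    letters = sum(c.isalpha() for c in text)
    if letters / len(text) < 0.6:        # mostly markup, numbers, or junk
        return True
    if re.search(r"(.)\1{20,}", text):   # long runs of one repeated character
        return True
    return False

def clean_corpus(docs):
    """Deduplicate and filter raw scraped documents (hypothetical helper).

    This automates the easy part; judging relevance, licensing, and
    factual quality still falls to human reviewers downstream.
    """
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.strip().lower().encode()).hexdigest()
        if digest in seen:               # drop exact duplicates
            continue
        seen.add(digest)
        if not looks_like_noise(doc):
            yield doc

raw_docs = [
    "<html><body>BUY NOW!!!</body></html>",             # junk: short, markup-heavy
    "A long, coherent paragraph of real prose. " * 10,  # plausible training text
]
print(list(clean_corpus(raw_docs)))
```

Even with passes like this, the hard judgments about relevance, licensing, and quality remain manual, which is where those human labor hours go.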

Some machine learning engineers have conjectured that, of all the factors that go into building a large language model (or, perhaps, any transformer-based system), dataset quality is the most important. That is why a model trained on Reddit and Twitter will never be as smart as one trained on every published work of the last century. (It is also likely why OpenAI reportedly used sources of dubious legality, such as copyrighted books, in its training data, despite its claims to the contrary.)


