OpenAI has created a tool to transcribe YouTube videos and collect data for GPT-4 training

The OpenAI team transcribed more than 1 million hours of YouTube videos and used the resulting text to train the GPT-4 model, The New York Times reported.

To do this, OpenAI researchers created a speech recognition tool called Whisper, which transcribed the audio from YouTube videos into text.

OpenAI took this step after running into a shortage of training data at the end of 2021. The company had exhausted the material available to it but still needed a large amount of data.

According to people familiar with the matter, some OpenAI employees discussed whether transcribing videos and using the resulting text could violate YouTube’s rules.

In the end, the OpenAI team transcribed more than 1 million hours of YouTube videos and fed the resulting text into GPT-4. Notably, OpenAI president Greg Brockman personally helped collect the videos, according to the same sources.

Recently, YouTube CEO Neal Mohan said in an interview with Bloomberg that using YouTube videos to train OpenAI’s Sora video model would be a violation of YouTube’s terms of service.

Building such systems depends on having enough data to train models that can instantly generate text, images, audio, and video resembling what humans create.