Українська правда

Artificial intelligence can be trained without violating copyrights, study finds


AI companies claim that it is impossible to build their models without training on copyrighted material. As it turns out, it is entirely possible, just very difficult, The Washington Post reports.

To prove this, a group of researchers created a new model, less powerful but much more ethical, trained exclusively on open-source data and public-domain materials.

The research involved scientists from 14 institutions, including MIT, Carnegie Mellon University, and the University of Toronto. Nonprofits such as the Vector Institute and the Allen Institute for AI also joined the project.

The researchers collected 8 TB of ethically obtained data, including 130,000 books from the US Library of Congress, and used it to train a large language model (LLM) with 7 billion parameters. The model's performance was roughly comparable to that of Meta's Llama 2-7B from 2023. However, the authors did not publish a comparison with the most powerful modern models.

The data preparation process was tedious. Much of the information was not readable by automated tools, so it had to be manually reviewed and annotated.

"We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people," said co-author Stella Biderman.

It was also difficult to determine which license applied to each source.

This research is unlikely to change the strategies of large companies: it is more profitable for them to build more powerful models at lower cost. But it does add a new and powerful counterargument to the disputes over copyright in AI.
