Monetizing High-Quality Data
The high-quality data used to train models is being monetized, and that is inevitable. Society will have to arrive at some policy for uses that genuinely lie in the public interest. But commercial model builders should pay, and so will we, the commercial consumers.
‘The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol’
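For readers unfamiliar with the Robots Exclusion Protocol mentioned in the quote: a site publishes a robots.txt file stating which crawlers may fetch which paths. The following is a minimal sketch of how such a restriction might be checked, using Python's standard urllib.robotparser module. The crawler name ExampleAIBot, the example.com URL, and the robots.txt contents are hypothetical and chosen only for illustration; they are not taken from the study quoted above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical URL
    for agent in ("ExampleAIBot", "SomeBrowser"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

Note that the protocol is advisory: it states the publisher's wishes, and compliance depends on the crawler choosing to honor them.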