Curated Datasets for Language Models
Green shoots in the development of curated model-training datasets built from open-source and licensed content, from Wired: https://lnkd.in/eqEDEnNA
“Fairly Trained announced that it has awarded its first certification (https://lnkd.in/eFaStpGh) for a large language model built without copyright infringement…
[The model is called] KL3M and was developed by Chicago-based legal tech consultancy startup 273 Ventures, using a curated training dataset of legal, financial, and regulatory documents… for ‘risk-averse’ clients like law firms.
On Wednesday, researchers released what they claim is the largest available AI dataset for language models composed purely of public domain content. Common Corpus, as it is called… has been posted to the open source AI platform Hugging Face” (https://lnkd.in/ePaWxsQX)