Aug 19, 2023·1 min read

AI2 Unveils 'Dolma', a Groundbreaking Open Dataset for Training Advanced Language Models

The Allen Institute for AI (AI2) has launched 'Dolma', an expansive, free-to-use dataset that marks a significant step toward openness in AI training.

Language models such as GPT-4 and Claude are now widely used and play a critical role, yet the data used to train them remains largely secret. The Allen Institute for AI (AI2) aims to change that with 'Dolma', an expansive, accessible text dataset that is free to use and open to in-depth inspection. The release is intended to steer AI research toward a more open and transparent path.

Dolma, whose name stands for 'Data to feed OLMo's Appetite', is designed to serve as the basis for AI2's planned open language model, OLMo. AI2's researchers hold that the AI research community should be free to access and modify not just the model but also the dataset it is built on, a view embodied in the creation of Dolma.

In a blog post, AI2 researcher Luca Soldaini explains the choice of sources and the methodology the team used to make the dataset suitable for AI use. The dataset, which Soldaini calls a 'data artifact', is the first release from the OLMo project; a more detailed paper on the work is in preparation.

Unlike organizations such as OpenAI and Meta, which keep most details of their training datasets proprietary, AI2 has taken a different and, one might argue, more ethical and democratic route. The details of commonly used AI datasets often escape public scrutiny, and there has been speculation in the AI research community that some of this data is obtained by ethically or legally questionable means, possibly including piracy.

Dolma is far from the first open dataset, but it surpasses its predecessors in size, at about 3 trillion tokens (a token being the AI field's unit for measuring content volume), and in the simplicity and clarity of its terms of use and rights. Dolma is released under the 'ImpACT' license for medium-risk artifacts, which requires users to provide contact information, describe their intended use cases, and disclose any creations built with the Dolma dataset. Any such derivative must be distributed under the same license, and users must agree not to apply Dolma in prohibited areas, including surveillance or disinformation.

If personal information finds its way into the dataset despite AI2's precautions, the organization provides a removal request mechanism to protect privacy, though it covers specific instances only and is not a blanket opt-out. Dolma signifies a move toward openness, transparency, and ethical data sourcing in AI development, which can facilitate advances in the field. Tools like AppMaster's no-code platform, which likewise supports greater accessibility and transparency in app development, can complement these advances.
