FAQ - CDLA

Permissive Version 2.0 FAQ

Why create a new data agreement?

With the ascent of machine learning and its reliance on high-quality data, there is a need for data agreements which facilitate data sharing both for data providers and data recipients and create a predictable path for training ML models. Recent agreements, such as the CDLA-Permissive-1.0, CDLA-Sharing-1.0, and the O-UDA-1.0, have made good strides towards these goals, but we believe recent developments in the law and the evolving needs of machine learning make it desirable to have an even simpler, more streamlined approach.

This new agreement, whose use we hope will supersede that of its predecessors, the CDLA-Permissive-1.0 and the O-UDA-1.0, has been crafted with the following objectives in mind:

A short, straightforward data agreement which enables data sharing and data innovation in a responsible way;
The use of data under the agreement carries no obligation;
The sharing of data under the agreement is operationally simple and requires nothing more than making available the text of the agreement;
Training an ML model based on data under the agreement, even copyrighted data, creates no obligation under the agreement for the use or the distribution of the trained model or for the insights it generates;
A data provider-friendly agreement, as the sharing of data under the agreement is designed to limit the liability of the data provider(s).

What are some anticipated typical use cases?

We envision that this agreement is suitable for situations where the original data provider created a data set, is reasonably confident in its ability to share the data set, and wants to be clear that the results from computational analysis of the data are not restricted. Some examples of typical use cases are provided below:

Releasing corporate data that does not contain personal data;
Training an AI model based on a dataset under CDLA-Permissive-2.0;
Receiving a dataset under CDLA-Permissive-2.0, adding new features to it and sharing the augmented dataset under CDLA-Permissive-2.0;
Merging a dataset under the CDLA-Permissive-2.0 with another dataset that might be under another license.

What is the difference between Data and Results?

“Data” is what you receive under the agreement, which you might augment with your own features.

If you train an ML model on it, that would typically be considered a Result. However, as ML models continue to evolve, some models may return parts of the underlying Data in response to certain inputs. We think that this should not count as sharing Data, and we want to make sure that the use of these models would not implicate even the modest obligations for sharing Data that is provided under the CDLA-Permissive-2.0.

Is CDLA-Permissive-2.0 open?

We believe CDLA-Permissive-2.0 meets the Open Definition. It permits everything described in Section 2.1 of the Open Definition and only imposes conditions approved by Section 2.2, namely making available the license text, including a preservation of the disclaimers of warranties and liability.

Does CDLA-Permissive-2.0 require retaining or reproducing attribution notices when sharing Data?

In CDLA-Permissive-1.0, one of the requirements when sharing Data was to preserve all credit and attribution notices. Based on feedback from the community during the CDLA-Permissive-2.0 drafting process, it became clear that there was a desire to eliminate this as a requirement in the license agreement itself.

Although “attribution-style” provisions may be common in permissive open source software licenses, they do add a process step that (even if minor) may introduce complexity into the resharing of open data sets. Additionally, as technologies continue to evolve beyond what the CDLA drafters might anticipate today, it is unclear whether typical ways of sharing attributions for open source software will analogize well to open data sharing.

That said, nothing about CDLA-Permissive-2.0 is meant to imply that recipients of Data under CDLA-Permissive-2.0 should not provide attribution about the sources of the data. Attribution will often be important for appropriate norms in communities, and understanding where an open data set originated is often a key aspect of why it has value. The CDLA-Permissive-2.0 simply does not make it a condition of sharing Data that is received under the license agreement.

I have shared datasets under CDLA-Permissive-1.0. Can I “upgrade” these agreements to CDLA-Permissive-2.0?

Yes. Pursuant to section 7.5 of CDLA-Permissive-1.0, Datasets previously shared under CDLA-Permissive-1.0 can be shared under future versions of CDLA Permissive published by the Community Data License Agreement workgroup under The Linux Foundation.

If I receive Data under CDLA-Permissive-2.0, can I redistribute it under different terms?

Yes – just as you can distribute software under a permissive license in a proprietary software product (subject to retention of required notices), you can redistribute the permissively licensed data in a larger aggregation under any terms you desire.

Permissive and Sharing Version 1.0 FAQ

What is the difference between the Sharing and the Permissive versions of the Agreement?

The primary difference relates to Your obligations if You decide to Publish Data that You Receive under the Agreement. The Sharing version of the Agreement requires You to Publish that Data, and any Enhanced Data, under the terms of the Sharing version of the Agreement – similar to a copyleft open source license. (See below for the distinction between Data / Enhanced Data, versus Results that are not subject to the Sharing requirements.)

The Permissive version of the Agreement, by contrast, allows Data and Enhanced Data to be Published under different terms, subject to notice and attribution requirements – similar to a permissive open source license.

The name of the document is Community Data License Agreement. Is it a License or an Agreement?

It is both. The Data Providers and the individuals or companies that receive Data are entering into an agreement. Part of that agreement is the grant of a license to Use or Publish the Data that is being made available. Given the variety of types of Data that could be licensed under the CDLA, there may also be legal protections accorded to some or all of that Data, but those protections may vary by type of Data and by country. For example, a convenient aggregation of common weather statistics made available in a standard format may not be protectable by copyright in some countries. The Agreement is intended to ensure that all parties give and receive uniform, predictable rights in Data, regardless of jurisdictional differences in legal protections under applicable law.

Can you describe the kinds of data projects that you envision might be developed with the advent of this agreement?

The CDLA drafting group had in mind use cases for supporting large-scale datasets that are constantly changing, such as those that support machine learning, streaming data or artificial intelligence systems. Another may be streaming data from an IoT device framework and bridging public and private organizations that have different requirements. We wanted to ensure that data providers and users had clarity about their ability to curate, use, and share data with the goal of enabling the creation of open, collaborative data communities.

The other aspect that’s important to understand is that the CDLA was created to help communities curate and build data together in a similar way to how open source software is developed. We envision data experts reviewing and maintaining data repositories, and we entrust them with control over how they govern adding or removing data, what systems to use for storing data, what additional representations contributors should make, what data is appropriate, and how to handle issues that are more challenging for their industry or community, such as personally identifiable information (PII) that may be in data. Different communities may wish to structure different arrangements in their governance model for dealing with these concerns, and the CDLA does not mandate particular approaches.

The context document provides an in-depth discussion of the role that we envision the CDLA playing to enable collaboration by a data community. As described there, the CDLA can serve as the inbound license and/or the outbound license for a data community.

What happens if You fail to comply with the license conditions in Section 3 of the Agreement?

Failure to comply with the license conditions triggers the termination provisions in a manner very similar to the way, for example, a short form permissive open source license works. If You comply with the license conditions, You are licensed. If You fail to comply, and fail to cure within a reasonable period after You become aware of the noncompliance, You are no longer licensed.

Is “Publish” the same as “distribution” under other copyleft licenses?

No. Although the concepts are analogous, to Publish Data under the CDLA includes any method of making the Data available or accessible to others that enables them to Use that Data, and is therefore somewhat broader than the concept of distribution in the context of many open source software licenses. To be Published, Data does not have to be physically distributed in order for others to make Use of it. Remote access permits Use of the Data, for example, and therefore its provision constitutes Publication – even if it might not be considered “distribution” under some open source software license agreements. Likewise, Data can be Published to a third party without giving that third party control over the Data. If the third party can study the Data, or undertake any other aspect of Use of the Data, the Data has been Published to that third party. However, if the third party does not have access to the Data itself, but only to Results from the Data, then the Data has not been Published to that third party.

Data is not Published, however, if all of the individuals who can make Use of the Data are employed by, or contractors of, the same Entity. The first Publication of the Data would be to someone outside of the affiliated companies included in the definition of Entity.

Section 3.2 of the Sharing version of the Agreement says that I may not impose certain restrictions or restrict anyone who Receives the Data. What if there are laws or regulations that prohibit those activities?

You are not imposing those legal restrictions on others. Section 7.1 makes it clear that each individual and entity is responsible for compliance with whatever laws apply to them. If they cannot Publish the Data without violating a law that applies to them, You have not imposed any restriction on them.

Section 3.2 of the Sharing version of the Agreement mentions a “Ledger” that can be designated by a project. What is this referring to?

A key objective for the drafters of the Community Data License Agreements is that the Agreements should function within a broader ecosystem of organizations and projects that create, curate, maintain and provide many different types of Data. In furtherance of this objective, the Agreements contemplate that some projects may choose to establish official digital records to be used to record and store (1) the Data itself, and/or (2) grants, contributions and licenses to Data. This may include, for example, provenance metadata regarding the sources of the Data.

The Sharing version of the Agreement defines the term “Ledger” in Section 1.7 to mean these digital records. In Section 3.2(b), it states that if a project has designated a Ledger for these purposes, then an Entity who Receives Data cannot restrict or deter others (such as those to whom it Publishes Data) from recording in the project’s Ledger either the Data itself, or grants of rights in the Data.

You may note that the Permissive version of the Agreement does not include the definition of the term “Ledger,” since the Permissive version does not include the corresponding restrictions from Section 3.2(b) of the Sharing version – and since the term “Ledger” is not used elsewhere in the Agreement. This omission should not be taken to imply that Ledgers could not be similarly relevant or used for projects that elect to operate under a Permissive version of the Agreement.

What if the law prohibits Publication of the Data?

Section 7.1 makes it clear that each individual and entity is responsible for compliance with whatever laws apply to them. There is nothing in the Agreement which requires that You Publish any Data. If You decide to Publish, You have certain obligations under the Agreement, and You may have certain separate obligations under applicable law.

In the Sharing version of the Agreement, why are Results excluded from the sharing obligations that apply to Enhanced Data?

The Agreement creates an important category of works that are produced from analysis of the Data that is received under the Agreement. Analysis of Data is defined in the Agreement as “Computational Use of Data,” and “Results” are the outcomes or outputs that You obtain from Your Computational Use of Data.

Results are separate works from the Data licensed under the Agreement, and therefore are free of any obligation to Publish them under the Agreement – if You choose to publish them at all. You never have any obligation to share Results if You do not want to.

On the other hand, if You want to share Results from Data Received under the Sharing version of the Agreement, then You may include them with the Data You Publish and the Results will be considered Data, just like any other Data that is Published under the Agreement. Or, You may Publish them separately, under an agreement of Your choosing.

Results may include de minimis amounts of Data. What does that mean?

The assurance that no one will be obligated to share the Results of their analysis is an important feature of the Agreement and one that recipients and Data Providers will rely upon in choosing the Sharing version of the CDLA. In order to avoid concerns that a Result will lose its status as a Result if any part of the original Data might be included in the output of the analysis, the Agreement makes it clear that Results may include some Data, but not more than an insignificant amount. The goal of the de minimis exception is to preserve the ability of Results to include small snippets of the Data that was analyzed, while precluding the ability to create Results that embody so much of the Data that they effectively replace it.

De minimis has both a quantitative and qualitative aspect. What is quantitatively de minimis cannot be defined as a fixed amount applicable in all instances. It will vary based on the quantity of Data that has been Received under the Agreement and on which the Computational Use is based. A Computational Use which reproduces a significant subset of the Data that could be a substitute for the Data itself is not producing Results. To also meet the qualitative threshold as de minimis, the Data that is included will not have been selected based on its universal value, but only based on its value for this particular analysis. For example, a Computational Use which generates a subset of the Data based on how often that Data has been accessed in general for purposes not related to this inquiry is not producing Results.

Does the Agreement require each Data Provider to make a representation that the Data does not include any personal or confidential information?

No. Each Data Provider represents that Publication of the Data that it Publishes does not violate any privacy or confidentiality obligation undertaken by that Data Provider. If You choose to Publish Data that You have Received under the Agreement, You are not asked to make a representation that no other Data Provider has included Data that is subject to a privacy or confidentiality obligation that was undertaken by that Data Provider.

Does that mean that You can pass along Data when You know that someone else has inserted personal or confidential information into that Data? No. Each Data Provider represents that the Data Provider has exercised reasonable care to assure that the Data it Publishes was obtained from others with the right to Publish the Data under this Agreement. Furthermore, although the Agreement may contain no requirement to make representations on behalf of other Data Providers, You are still required to comply with all applicable laws in Publishing and Using Data Received under the Agreement.

Why doesn’t the CDLA have a choice of law provision?

The CDLA is intended to be an agreement that can be used throughout the world. Since Data may be licensed from Data Providers located in many countries, the Working Group opted not to specify a law or jurisdiction in favor of encouraging global adoption of the Agreement. A similar choice has been made in many open source software licenses to omit choice of law or choice of forum provisions.

Who is the Working Group?

The Working Group was formed by a group of Linux Foundation members and is comprised of internal legal counsel at a number of those companies plus external counsel invited to assist in the initial drafting process. The goal of the Working Group was and continues to be to create and act as the steward for a family of “open” data agreements that facilitate broad data sharing and open community development around a broad variety of data types, involving a wide variety of commercial, non-profit, academic and governmental entities. That family of agreements presently includes the Community Data License Agreement – Sharing and the Community Data License Agreement – Permissive. The Working Group will evolve over time.

There are other open data licenses. What is unique about the CDLA that warranted creating another license framework?

Other families of data licenses are also excellent, well-drafted agreements. We drafted the CDLA licenses in response to evolving use cases, taking into account what we’ve learned through experience with open source software licensing. Here are a few of the drafting group’s guiding principles in drafting the CDLA:

The CDLA licenses are intended to cover datasets as a whole as well as their individual contents in short, straightforward agreements that unambiguously cover the rights to use and to publish.
The CDLA explicitly distinguishes between the data provided under the CDLA (and additions or modifications to it), which are subject to the CDLA’s terms – and “Results” obtained by processing or analyzing that data. The CDLA does not impose obligations or restrictions on Results. This is particularly relevant for use cases for AI and machine learning systems where data is transformed through what we define as Computational Use.
The CDLA intentionally gives the data user a baseline level of confidence about their rights to use CDLA-licensed data. By publishing a CDLA dataset, the provider makes a representation that if they’ve added any data to it, they’ve used reasonable care regarding the source of that data and regarding not undertaking conflicting obligations. Users of CDLA-licensed data will have increased assurance about being able to use and publish that data themselves, while the CDLA also reflects that the data user of course remains subject to applicable laws.

What is wrong with using _____ license?

The CDLA was not created in response to any issues with other existing licenses, and its creation is not meant to suggest that there are clear issues with any other license. The CDLA was developed to focus on evolving data sharing needs and to create a clear framework of rights that communities can operate under. (See the question above.)

The CDLA is also not an attempt to fix issues with other licenses – in fact we didn’t start with any other license as the model or base but rather went through a requirements gathering process to understand the use cases under which people were struggling with sharing data. Each license has its own merits, target use cases and constituency of users, and that’s great. However, we did identify there were credible groups of data creators and users who were looking for a new, simplified framework that covers both sharing and permissive uses, and that’s what led to the CDLA being created.