Cloud storage is increasingly used for storing and serving publicly available data, ranging from web pages to various datasets. This post focuses on the storage of datasets.
Cloud storage services are usually run by major cloud providers, which bundle storage with numerous additional features. Whoever stores the data typically bears the storage costs. Hosting the data yourself is possible, but it complicates maintenance because you must manage your own infrastructure. Cloud services alleviate this by delegating maintenance to specialized organizations.
Imagine you have a dataset that is of interest to the public. You pay a cloud provider to host it, possibly making it accessible only to specific users. These users benefit from the data without incurring any costs, which leaves you either to find other funding, such as public tax money, or to restrict access through a paywall, thereby compromising the openness of the data.
Despite these challenges, cloud storage remains a convenient and efficient solution because the data owner has little infrastructure of their own to run and maintain.
In this scenario, storage costs are paid by someone other than the actual users. While the data itself doesn’t get “depleted,” someone still covers these ongoing expenses, hoping to benefit in the long term. For instance, governments fund open data repositories because the innovation stemming from such data can advance society and the economy more than the cost of storage and public access.
However, maintaining public and open datasets in the cloud is expensive, and the datasets may not always deliver tangible value. They may also benefit entities that do not contribute financially.
Is there a more sustainable model that doesn’t require data paywalls yet ensures storage costs are covered by those who use the data? One possible approach is to offer an optional payment at the time of download, so that users can contribute voluntarily.
This payment could support the storage costs or even compensate the data creators. Data that isn’t financially supported could be phased out over time, giving priority to more popular or better-supported datasets. Less popular datasets could remain available as long as contributions still cover their storage costs, while inactive data could be moved to cold storage.
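The policy described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Dataset` fields, the `choose_tier` function, the tier names, and the per-GB prices are all hypothetical and do not correspond to any particular provider's pricing or API.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float
    monthly_contributions: float  # voluntary payments received this month
    monthly_downloads: int


def storage_cost(ds: Dataset, price_per_gb: float) -> float:
    """Monthly cost of keeping the dataset in a tier with the given price."""
    return ds.size_gb * price_per_gb


def choose_tier(ds: Dataset, hot_price: float = 0.02, cold_price: float = 0.004) -> str:
    """Pick a storage tier from usage and contributions.

    The prices are placeholder values (currency units per GB per month),
    assumed purely for illustration.
    """
    # Inactive data goes to cold storage regardless of funding.
    if ds.monthly_downloads == 0:
        return "cold"
    # Contributions cover fast storage: keep the dataset readily available.
    if ds.monthly_contributions >= storage_cost(ds, hot_price):
        return "hot"
    # Contributions only cover the cheaper tier.
    if ds.monthly_contributions >= storage_cost(ds, cold_price):
        return "cold"
    # Not financially supported: a candidate for being phased out over time.
    return "phase-out-candidate"
```

For example, a 100 GB dataset with a few downloads and contributions that cover only the cheaper tier would be assigned to cold storage rather than removed. A real policy would likely look at longer time windows than a single month before phasing anything out.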
It’s crucial to maintain a free access option since the full value of open data often isn’t immediately apparent and may not be directly profitable for those who download it.
Users might contribute financially to the data they use if it adds value to their projects, voluntarily supporting its continued availability. This poses a dilemma: a contribution brings the donor no immediate additional benefit and may even benefit competitors.
This is where data provenance becomes essential. Consider AI models, which must specify the datasets used during training. It would be ethically appropriate for these model developers to support the datasets’ original creators and maintainers, ensuring data availability and integrity for verification purposes.
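As a rough illustration of provenance with integrity checking, the sketch below records a content hash for each dataset a model was trained on and later checks whether the available data still matches. The record layout and the function names `make_provenance_record` and `verify_datasets` are assumptions made for this example, not an established standard; only `hashlib.sha256` is a real library call.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()


def make_provenance_record(model_name: str, datasets: dict[str, bytes]) -> dict:
    """Build a record listing each training dataset with its content hash."""
    return {
        "model": model_name,
        "datasets": [
            {"name": name, "sha256": sha256_hex(content)}
            for name, content in datasets.items()
        ],
    }


def verify_datasets(record: dict, available: dict[str, bytes]) -> list[str]:
    """Return the names of datasets that are missing or no longer match."""
    failed = []
    for entry in record["datasets"]:
        content = available.get(entry["name"])
        if content is None or sha256_hex(content) != entry["sha256"]:
            failed.append(entry["name"])
    return failed
```

An empty list from `verify_datasets` means every dataset named in the record is still available and unchanged. The same list of dataset names could also tell a model developer which maintainers to support.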
Decentralized cloud storage and Web3 technologies offer potential solutions through their immutability guarantees and support for transparent transactions. We are eager to explore these technologies further as viable options for a commons cloud storage framework.