Educational Data Sets
There are many online sources available that provide data sets for educational and study purposes. They cover many different areas (physics, statistics, medicine, etc.) and many tools and frameworks provide wrappers for them and allow you to easily play around with the data sets. In many cases, the technical tasks of fetching the data from the server, parsing it and installing it are completely transparent for users.
We recently added similar functionality to LabPlot and plan to release it as part of the upcoming 2.8 version. The initial implementation was done by Ferencz Kovács during Google Summer of Code 2019.
Besides the actual development work, Ferencz spent a large amount of time documenting the data sets that we plan to make available in the application. To describe this metadata, we introduced a JSON-based format that is documented in our repository. The metadata files contain all the relevant information, such as the name of the data set, its description and its URL, as well as the parameters (for example, the separating characters) required to properly parse the file fetched from the Internet.
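To give a rough idea of how such metadata can drive the import, here is a minimal Python sketch. The field names (`name`, `description`, `url`, `separator`, `has_header`), the example URL and the sample data are assumptions made up for illustration. They are not LabPlot's actual schema, which is documented in the repository.

```python
import csv
import io
import json

# Hypothetical metadata entry. The field names are illustrative assumptions,
# not LabPlot's actual schema.
METADATA = """
{
  "name": "example_heights",
  "description": "Heights and weights of a small sample (made-up example).",
  "url": "https://example.org/data/heights.csv",
  "separator": ";",
  "has_header": true
}
"""


def parse_dataset(raw_text, meta):
    """Split downloaded text into column names and rows using the metadata."""
    reader = csv.reader(io.StringIO(raw_text), delimiter=meta["separator"])
    rows = [row for row in reader if row]  # skip empty lines
    if meta.get("has_header", False) and rows:
        return rows[0], rows[1:]
    # Without a header line, generate generic column names.
    width = len(rows[0]) if rows else 0
    return [f"Column {i + 1}" for i in range(width)], rows


meta = json.loads(METADATA)
# Stand-in for the file that would be fetched from meta["url"].
sample = "height;weight\n170;65\n182;80\n"
columns, rows = parse_dataset(sample, meta)
```

The point of the sketch is that the parser itself stays generic: everything specific to a data set, such as the separator, comes from the metadata entry.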
At the moment we have five data set collections documented and available in LabPlot:
- JSE Data Archive is a data archive provided by the Journal of Statistics Education.
- R data sets is a collection of over 1300 data sets that were originally distributed alongside the statistical software environment R and some of its add-on packages.
- Australasian Data and Story Library (OzDASL) is a library of data sets and associated stories maintained mostly by teachers of statistics in Australia and New Zealand with the emphasis given to data sets with an Australasian context.
- StatLib is an archive maintained at Carnegie Mellon University.
- The Data And Story Library (DASL) is a library maintained by Data Description, creator of the data analysis and exploration software “Data Desk”.
Ideally, we’d enable the infrastructure on store.kde.org so that interested users can upload and share new collections with other users, but we are not completely there yet. The current plan is to document the collections mentioned above as comprehensively as possible and ship them directly with the next release of LabPlot. Having over 2000 data sets documented is already a great starting point for data exploration.
We extended the import menu and there is a new “Import from Data Set Collection” entry now:
In the import dialog for data sets, you can access all documented collections. The data sets are grouped thematically in categories and sub-categories and it is possible to search in either all collections, or in each collection individually to narrow down the results. By clicking the “OK” button, you import the data for the selected data set into a new spreadsheet in the project:
As shown in the video above, when you select one of the data sets, its detailed description is shown. A concise description for every data set is part of the metadata mentioned above. Additionally, some libraries provide more detailed descriptions for (almost) every data set, which helps you identify the proper data set for your teaching purposes. The JSE library is an especially good example here with its “Story behind the data” and “Pedagogical notes”:
For such collections we also store the “detailed description URL” in the metadata and fetch the extended text from the Internet at run-time.
Downloaded files are stored on the hard disk. The next time you import the same data set into LabPlot, it is fetched from the local cache. In the application settings we show some statistics about the number and the size of the downloaded files and also allow you to clear the cache.
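The caching behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not LabPlot's implementation: the `DatasetCache` class, the hashed file names and the injected `fetch` callable are all assumptions chosen to keep the example self-contained and runnable without network access.

```python
import hashlib
import tempfile
from pathlib import Path


class DatasetCache:
    """Keeps downloaded data set files in a local directory (illustrative)."""

    def __init__(self, directory, fetch):
        self.directory = Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable taking a URL and returning bytes

    def _path_for(self, url):
        # Hash the URL so that any URL maps to a safe file name.
        return self.directory / hashlib.sha256(url.encode("utf-8")).hexdigest()

    def get(self, url):
        """Return the file content, downloading it only on a cache miss."""
        path = self._path_for(url)
        if not path.exists():
            path.write_bytes(self.fetch(url))
        return path.read_bytes()

    def stats(self):
        """Return (number of cached files, total size in bytes)."""
        files = [p for p in self.directory.iterdir() if p.is_file()]
        return len(files), sum(p.stat().st_size for p in files)

    def clear(self):
        """Delete all cached files."""
        for p in self.directory.iterdir():
            if p.is_file():
                p.unlink()


calls = []


def fake_fetch(url):
    # Stand-in for a real download; records each call so misses can be counted.
    calls.append(url)
    return b"a;b\n1;2\n"


cache = DatasetCache(tempfile.mkdtemp(), fake_fetch)
first = cache.get("https://example.org/data/heights.csv")   # downloaded
second = cache.get("https://example.org/data/heights.csv")  # served from cache
```

Here `stats()` provides the kind of numbers shown in the settings dialog (file count and total size), and `clear()` corresponds to emptying the cache.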
Adding new data set collections or extending and improving the already existing ones is quite easy and is documented in our repository. If you are aware of any nice data sets or collections and would like to contribute them, please let us know!