Introduction
Digital Linguistics (DLx) is the science of digital data management for linguistics, including the digital storage, representation, manipulation, and dissemination of linguistic data. It is concerned with how to represent linguistic data in digital form and with best practices for working with that data, taking full advantage of the modern Open Web Platform (OWP).
The Digital Linguistics project at the University of California, Santa Barbara has four primary aims:
-
Define a standard data format for storing linguistic data in computer-readable form. This format should be platform- and software-independent (i.e. not restricted to use with Windows or specific software like ELAN), and encode the linguistic concepts that linguists are already familiar with, while still maintaining flexibility and human readability. You can read more about this format here.
-
Provide scripts and libraries that make it easy for developers to work with data in DLx format and create tools and software that use that format. All DLx scripts are open source, and contributions from the community are very welcome. Check out the DLx developer page here.
-
Create a variety of web-based tools that allow linguists to more easily enter, search, and manage their data. The first of these tools, an app for managing lexicons, is under development. All of the DLx tools will be open source, and contributions from the community are very welcome.
-
Educate linguists about best practices in linguistic data management, and how they can apply the concepts of Digital Linguistics to their own work. Click here to begin learning about the use of Digital Linguistics principles in documentary linguistics.
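To make the first aim more concrete, here is a minimal sketch of what storing a lexical entry as plain, software-independent JSON text might look like. The field names (`lemma`, `transcription`, `partOfSpeech`, `senses`, `gloss`) are assumptions chosen for illustration and are not taken from the DLx specification; consult the format documentation for the actual schemas.

```python
import json

# A hypothetical lexicon entry. Every field name here is an illustrative
# assumption, not part of the official DLx format.
entry = {
    "lemma": "dog",
    "transcription": "dɔɡ",
    "partOfSpeech": "noun",
    "senses": [{"gloss": "domesticated canine"}],
}

# Serialize to plain text. ensure_ascii=False keeps IPA characters readable
# instead of escaping them, and indent=2 keeps the file easy to read by eye.
text = json.dumps(entry, ensure_ascii=False, indent=2, sort_keys=True)

# Any JSON-aware tool, on any platform, can read the same text back.
restored = json.loads(text)

print(text)
```

Because the result is ordinary text, it can be opened in any editor, tracked in version control, and parsed by any programming language with a JSON library.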
Austin Principles of Data Citation
The Austin Principles for Data Citation in Linguistics are a set of guidelines that enable linguists to make informed decisions regarding the accessibility and transparency of their research data. (source)
The Data Format for Digital Linguistics both aligns with and helps foster the Austin Principles:
-
Importance
Data should be considered legitimate, citable products of research.
the data on which linguistic analyses are based are of fundamental importance to the field and should be treated as such […] Linguistic data should be citable and cited
The Digital Linguistics enterprise is premised on the idea that primary linguistic data is of fundamental importance to the field of linguistics, and that the management of such data merits careful attention.
-
Credit & Attribution
Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data.
The Data Format for Digital Linguistics includes recommendations regarding how to store information about the people who contributed to the production and management of the data, and their role(s) in the process.
-
Evidence
Linguists should cite the data upon which scholarly claims are based. In order for data to be citable, it should be stored in an accessible location, preferably a data archive or other trusted repository.
The DLx format, being web-compatible, makes it possible to publish data sets on the web at stable URLs for consumption and citation. Since the JSON format used by DLx is a simple text format, it can easily be stored in any archive or online database.
-
Unique Identification
A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.
Each of the DLx linguistic schemas allows for both URL and ID fields, which serve as unique identifiers, so that each linguistic object can be permanently associated with a unique URL and/or a unique ID in a database.
-
Access
Data citations should facilitate access to the data themselves and to such associated metadata […] as are necessary for both humans and machines to make informed use of the referenced data.
Linguistic data should be as open as possible, in order to facilitate reproducibility; and as closed as necessary, to honor relevant ethical, legal, and speaker community constraints.
Because Digital Linguistics uses the JSON format, data is both human- and machine-readable. The Data Format for Digital Linguistics also makes recommendations regarding how to indicate access rights to various kinds of linguistic data, at every level of granularity.
-
Persistence
Unique identifiers, and metadata describing the data, and its disposition, should persist — even beyond the lifespan of the data they describe.
The low cost of cloud storage today makes it possible to continue storing metadata about the objects of language documentation even when the data itself no longer exists, is no longer public, or becomes too large for its repository.
-
Specificity & Verifiability
Data citations should make it easy for a curious reader to find the specific datum or subset of data within the larger dataset that support a claim.
Citations should specify which version of the data is being referenced.
The Data Format for Digital Linguistics provides a convention for assigning human-readable keys to each piece of data, at every level of granularity, allowing interested parties to easily reference or look up data at the level of the text, utterance, word, morpheme, and phoneme.
-
Interoperability & Flexibility
Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability.
Citation standards developed for linguistics need to meet the needs of the research community, while also meeting the principles described above.
The Data Format for Digital Linguistics is highly interoperable because it is based on JSON, which has become a de facto standard for data interchange on the web. Using web-based tools for Digital Linguistics also means that user interfaces can be designed to let users interact with the data in the ways that are most comfortable to them. Using open source tools and software also facilitates the creation of new tools to meet the needs of users.
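Several of the points above (contributor roles, URL and ID fields, access rights, and human-readable keys at each level of granularity) can be pictured together in one small sketch. Everything in it is an assumption made for illustration: the field names, the dotted key convention, the `example.org` URL, the placeholder contributor names, and the helper functions `find_by_key` and `cite` are hypothetical and are not defined by the DLx format.

```python
from typing import Any, Optional

# Hypothetical text record. Field names and the key convention are
# illustrative assumptions, not the official DLx schema.
text_record = {
    "id": "3f0c1f6e-0000-4000-8000-000000000001",  # placeholder database ID
    "url": "https://example.org/texts/story-1",    # placeholder stable URL
    "key": "story-1",
    "version": "1.0.0",
    "access": "public",
    "contributors": [
        {"name": "A. Speaker", "roles": ["speaker"]},
        {"name": "B. Linguist", "roles": ["transcriber", "translator"]},
    ],
    "utterances": [
        {
            "key": "story-1.u1",
            "transcription": "the dogs bark",
            "words": [
                {"key": "story-1.u1.w1", "transcription": "the"},
                {
                    "key": "story-1.u1.w2",
                    "transcription": "dogs",
                    "morphemes": [
                        {"key": "story-1.u1.w2.m1", "transcription": "dog"},
                        {"key": "story-1.u1.w2.m2", "transcription": "-s"},
                    ],
                },
                {"key": "story-1.u1.w3", "transcription": "bark"},
            ],
        }
    ],
}


def find_by_key(node: Any, key: str) -> Optional[dict]:
    """Depth-first search for the first object whose "key" field equals `key`."""
    if isinstance(node, dict):
        if node.get("key") == key:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_by_key(child, key)
        if found is not None:
            return found
    return None


def cite(record: dict, key: str) -> str:
    """Build a citation string naming the datum and the dataset version."""
    if find_by_key(record, key) is None:
        raise KeyError(f"no object with key {key!r}")
    return f"{record['url']}#{key} (version {record['version']})"


word = find_by_key(text_record, "story-1.u1.w2")
print(word["transcription"])
print(cite(text_record, "story-1.u1.w2"))
```

Looking data up by key rather than by position means a citation would remain valid if utterances or words were reordered, and including the version in the citation string reflects the principle that citations should state which version of the data is referenced.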