Digital Linguistics (DLx) is the science of digital data management for linguistics, including the digital storage, representation, manipulation, and dissemination of linguistic data. It concerns itself with how to represent linguistic data in digital form, as well as best practices for working with that data, taking full advantage of the modern Open Web Platform (OWP).
The Digital Linguistics project at the University of California, Santa Barbara has three primary aims:
Define a standard data format for storing linguistic data in computer-readable form. This format should be platform- and software-independent (i.e. not restricted to use with Windows or specific software like ELAN), and encode the linguistic concepts that linguists are already familiar with, while still maintaining flexibility and human readability. You can read more about this format here.
Create a variety of web-based tools that allow linguists to more easily enter, search, and manage their data. All of the DLx tools are open source, and contributions from the community are very welcome. Click here to get started using DLx tools with your own data.
Educate linguists about best practices in linguistic data management, and how they can apply the concepts of Digital Linguistics to their own work. Click here to begin learning about the use of Digital Linguistics principles in documentary linguistics.
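As a rough illustration of the first aim, the sketch below stores a small utterance as JSON text and reads it back using only Python's standard library. The field names (transcription, translation, words, gloss) are assumptions chosen for illustration and are not taken from the DLx specification.

```python
import json

# Hypothetical utterance record. The field names are illustrative
# assumptions, not the actual DLx schema.
utterance = {
    "transcription": "los gatos duermen",
    "translation": "the cats sleep",
    "words": [
        {"transcription": "los", "gloss": "the.PL"},
        {"transcription": "gatos", "gloss": "cats"},
        {"transcription": "duermen", "gloss": "sleep.3PL"},
    ],
}

# Serialize to plain text. ensure_ascii=False keeps non-ASCII characters
# (common in transcriptions) readable instead of escaping them.
text = json.dumps(utterance, ensure_ascii=False, indent=2)

# Any JSON parser on any platform can read the same text back.
restored = json.loads(text)
```

Because the stored form is plain text, the same file can be opened in a text editor, tracked in version control, or loaded by tools written in other languages.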
Austin Principles of Data Citation
The Austin Principles for Data Citation in Linguistics are a set of guidelines that enable linguists to make informed decisions regarding the accessibility and transparency of their research data (source).
The Data Format for Digital Linguistics both aligns with and helps foster the Austin Principles:
Importance
The Digital Linguistics enterprise is premised on the idea that primary linguistic data is of fundamental importance to the field of linguistics, and that the management of such data merits careful attention.
Credit & Attribution
The Data Format for Digital Linguistics includes recommendations regarding how to store information about the people who contributed to the production and management of the data, and their role(s) in the process.
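One way such contributor information could be stored and queried is sketched below. The contributors list, its name and roles fields, and the role labels are all hypothetical and serve only to show how roles might be recorded per person.

```python
from collections import defaultdict

# Hypothetical contributor records; field names and role labels are
# assumptions for illustration, not the DLx recommendations themselves.
contributors = [
    {"name": "Speaker A", "roles": ["speaker"]},
    {"name": "Researcher B", "roles": ["recorder", "transcriber"]},
    {"name": "Researcher C", "roles": ["transcriber", "translator"]},
]

def people_by_role(contributors):
    """Group contributor names under each role they held."""
    index = defaultdict(list)
    for person in contributors:
        for role in person["roles"]:
            index[role].append(person["name"])
    return dict(index)

credits = people_by_role(contributors)
```

Recording roles per person, rather than a single flat list of names, lets a citation or acknowledgment state who did what.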
Evidence
The DLx format, being web-compatible, makes it possible to publish data sets on the web at stable URLs for consumption and citation. Since the JSON format used by DLx is a simple text format, it can easily be stored in any archive or online database.
Unique Identification
Each of the DLx linguistic schemas provides URL and ID fields that serve as unique identifiers, so that each linguistic object can be permanently associated with a unique URL, a unique ID in a database, or both.
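A minimal sketch of how unique identifiers might be used for lookup follows. The id and url field names, the example.org addresses, and the build_index helper are assumptions made for illustration.

```python
def build_index(objects, field):
    """Map each object's identifier to the object, rejecting duplicates."""
    index = {}
    for obj in objects:
        value = obj.get(field)
        if value is None:
            continue  # objects without this identifier are skipped
        if value in index:
            raise ValueError(f"duplicate {field}: {value}")
        index[value] = obj
    return index

# Hypothetical lexeme records with placeholder identifiers.
lexemes = [
    {"id": "lex-001", "url": "https://example.org/lexemes/lex-001", "lemma": "gato"},
    {"id": "lex-002", "url": "https://example.org/lexemes/lex-002", "lemma": "dormir"},
]

by_id = build_index(lexemes, "id")
by_url = build_index(lexemes, "url")
```

Raising an error on a repeated identifier is one possible design choice; it surfaces collisions early instead of silently overwriting a record.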
Access
Because Digital Linguistics uses the JSON format, its data is both human- and machine-readable. The Data Format for Digital Linguistics also makes recommendations regarding how to indicate access rights to various kinds of linguistic data, at every level of granularity.
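The sketch below shows one way access rights at several levels of granularity could be resolved. The access field, its values, and the rule that the most specific level wins are all assumptions for illustration and are not described in the format itself.

```python
def effective_access(*levels, default="private"):
    """Return the access value from the most specific level that sets one.

    Levels are passed from least to most specific, for example
    (text, utterance, word). This inheritance rule is an assumption.
    """
    for obj in reversed(levels):
        if "access" in obj:
            return obj["access"]
    return default

# Hypothetical objects at three levels of granularity.
text = {"title": "Story 1", "access": "public"}
utterance = {"transcription": "los gatos duermen", "access": "restricted"}
word = {"transcription": "gatos"}  # no access field of its own

word_access = effective_access(text, utterance, word)
text_access = effective_access(text)
```

Here the word has no setting of its own, so it takes the value of the utterance that contains it, while the text as a whole remains public.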
Persistence
The low cost of cloud storage today makes it possible to keep storing metadata about the objects of language documentation even when the data itself no longer exists, is no longer public, or becomes too large for its repository.
Specificity & Verifiability
The Data Format for Digital Linguistics provides a convention for assigning human-readable keys to each piece of data, at every level of granularity, allowing interested parties to easily reference or look up data at the level of the text, utterance, word, morpheme, and phoneme.
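To make the idea of keyed lookup concrete, the sketch below resolves a dotted key to a text, utterance, or word. The key syntax ("story1.2.1"), the 1-based numbering, and the utterances and words field names are hypothetical; the actual DLx key convention may differ.

```python
def resolve(corpus, key):
    """Resolve a dotted key such as 'story1.2.1' to the object it names."""
    parts = key.split(".")
    obj = corpus[parts[0]]  # text, looked up by its key
    if len(parts) > 1:
        obj = obj["utterances"][int(parts[1]) - 1]  # 1-based utterance number
    if len(parts) > 2:
        obj = obj["words"][int(parts[2]) - 1]  # 1-based word number
    return obj

# Hypothetical corpus containing one text with two utterances.
corpus = {
    "story1": {
        "title": "Story 1",
        "utterances": [
            {
                "transcription": "los gatos duermen",
                "words": [
                    {"transcription": "los"},
                    {"transcription": "gatos"},
                    {"transcription": "duermen"},
                ],
            },
            {
                "transcription": "el perro corre",
                "words": [
                    {"transcription": "el"},
                    {"transcription": "perro"},
                    {"transcription": "corre"},
                ],
            },
        ],
    }
}

second_utterance = resolve(corpus, "story1.2")
first_word = resolve(corpus, "story1.2.1")
```

The same pattern could be extended to morphemes and phonemes by adding further segments to the key.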
Interoperability & Flexibility
The Data Format for Digital Linguistics is highly interoperable because it is based on JSON, which has become a de facto standard format for interchanging data on the web. Using web-based tools for Digital Linguistics also means that user interfaces can be designed to let users interact with the data in the ways that are most comfortable to them. Using open source tools and software also facilitates the creation of new tools to meet the needs of users.