Introduction

Digital Linguistics (DLx) is the science of digital data management for linguistics, including the digital storage, representation, manipulation, and dissemination of linguistic data. It concerns itself with how to represent linguistic data in digital form, as well as best practices for working with that data, using all the benefits of the modern Open Web Platform (OWP).

The Digital Linguistics project at the University of California, Santa Barbara has three primary aims:

Austin Principles of Data Citation

The Austin Principles for Data Citation in Linguistics are a set of guidelines that enable linguists to make informed decisions regarding the accessibility and transparency of their research data. (source) The Data Format for Digital Linguistics both aligns with and helps foster the Austin Principles:

  1. Importance

    Data should be considered legitimate, citable products of research.
    The data on which linguistic analyses are based are of fundamental importance to the field and should be treated as such […] Linguistic data should be citable and cited.

    The Digital Linguistics enterprise is premised on the idea that primary linguistic data is of fundamental importance to the field of linguistics, and that the management of such data merits careful attention.

  2. Credit & Attribution

    Data citations should facilitate giving scholarly credit and normative and legal attribution to all contributors to the data.

    The Data Format for Digital Linguistics includes recommendations regarding how to store information about the people who contributed to the production and management of the data, and their role(s) in the process.

  3. Evidence

    Linguists should cite the data upon which scholarly claims are based. In order for data to be citable, it should be stored in an accessible location, preferably a data archive or other trusted repository.

    The DLx format, being web-compatible, makes it possible to publish data sets on the web at stable URLs for consumption and citation. Since the JSON format used by DLx is a simple text format, it can easily be stored in any archive or online database.

  4. Unique Identification

    A data citation should include a persistent method for identification that is machine actionable, globally unique, and widely used by a community.

    Each of the DLx linguistic schemas provides both a URL field and an ID field that serve as unique identifiers, so that each linguistic object can be permanently associated with a unique URL, a unique ID in a database, or both.

  5. Access

    Data citations should facilitate access to the data themselves and to such associated metadata […] as are necessary for both humans and machines to make informed use of the referenced data.
    Linguistic data should be as open as possible, in order to facilitate reproducibility; and as closed as necessary, to honor relevant ethical, legal, and speaker community constraints.

    The use of JSON format for Digital Linguistics means that data is both human- and machine-readable. The Data Format for Digital Linguistics also makes recommendations regarding how to indicate access rights to various kinds of linguistic data, at every level of granularity.

  6. Persistence

    Unique identifiers, and metadata describing the data, and its disposition, should persist — even beyond the lifespan of the data they describe.

    The low cost of cloud storage today makes it possible to continue storing metadata about the objects of language documentation even when the data itself no longer exists, is no longer public, or has become too large for its repository.

  7. Specificity & Verifiability

    Data citations should make it easy for a curious reader to find the specific datum or subset of data within the larger dataset that support a claim.
    Citations should specify which version of the data is being referenced.

    The Data Format for Digital Linguistics provides a convention for assigning human-readable keys to each piece of data, at every level of granularity, allowing interested parties to easily reference or look up data at the level of the text, utterance, word, morpheme, and phoneme.

  8. Interoperability & Flexibility

    Data citation methods should be sufficiently flexible to accommodate the variant practices among communities, but should not differ so much that they compromise interoperability.
    Citation standards developed for linguistics need to meet the needs of the research community, while also meeting the principles described above.

    The Data Format for Digital Linguistics is highly interoperable because it is based on JSON, which has become the standard format for interchanging data on the web. Using web-based tools for Digital Linguistics also means that user interfaces can be designed to allow users to interact with the data in the ways that are most comfortable to them. Using open source tools and software also facilitates the creation of new tools to meet the needs of users.
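
Several of the points above (contributor roles, URL and ID fields, access rights, human-readable keys, and JSON as the interchange format) can be illustrated together in a small sketch. Every field name, key pattern, and value below is a hypothetical placeholder chosen for illustration, not taken from the published DLx schemas; the sketch only assumes that the data is ordinary JSON.

```python
import json

# Hypothetical text record. The field names ("id", "url", "key", "access",
# "contributors", "utterances", "words") are illustrative assumptions,
# NOT the official DLx schema.
text = {
    "id": "a1b2c3",                          # unique ID in a database (made-up value)
    "url": "https://example.org/texts/T1",   # stable URL (placeholder domain)
    "key": "T1",                             # human-readable key (assumed convention)
    "access": "public",                      # access rights for the whole text
    "contributors": [
        {"name": "A. Speaker", "roles": ["speaker"]},
        {"name": "B. Linguist", "roles": ["transcriber", "translator"]},
    ],
    "utterances": [
        {
            "key": "T1.U1",
            "access": "public",              # access can also be stated per utterance
            "words": [
                {"key": "T1.U1.W1", "transcription": "hello"},
                {"key": "T1.U1.W2", "transcription": "world"},
            ],
        }
    ],
}


def find_by_key(node, key):
    """Depth-first search for the first object whose "key" field equals `key`."""
    if isinstance(node, dict):
        if node.get("key") == key:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None  # strings, numbers, etc. cannot contain keyed objects
    for child in children:
        found = find_by_key(child, key)
        if found is not None:
            return found
    return None


# Because the record is plain JSON, it survives a round trip through text,
# which is what makes it easy to archive or publish at a URL.
serialized = json.dumps(text, ensure_ascii=False, indent=2)
restored = json.loads(serialized)

word = find_by_key(restored, "T1.U1.W2")
```

Here `find_by_key(restored, "T1.U1.W2")` returns the second word object, so a citation that includes such a key could point a reader to a specific word rather than to the text as a whole.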