Awesome Linguistics
A curated list of anything remotely related to linguistics, sorted in alphabetical order.
Programming
Libraries, frameworks, and applications useful for building language-processing software.
Platforms and toolkits
- CLARIN-D web tools - Tools for analysing research data.
- CorpusExplorer - Software for corpus linguists and text/data mining enthusiasts. The CorpusExplorer combines over 50 interactive visualizations under a user-friendly interface.
- Haxe-linguistics - Early-stage linguistic analysis and natural language processing library for Haxe.
- Natural - General natural language tools for Node.js.
- Natural Language ToolKit (NLTK) - The most complete platform for building Python programs to work with human language data.
- Snowball - A language in which stemming algorithms can be easily represented.
- spaCy - Industrial-strength Natural Language Processing in Python.
- Mate Tools - Available as a web service via WebLicht.
- UBIAI - Easy-to-use text annotation tool for teams with comprehensive auto-annotation features. Supports NER, relations, and document classification, as well as OCR annotation for invoice labeling.
- textblob-de - German language support for TextBlob; an alternative to spaCy (see above).
- tyo - A utility for finding Typo-Bridges.
- UralicNLP - An open source Python library for processing morphologically rich and, for the most part, endangered Uralic languages. It can do morphological analysis, generation, lemmatization, disambiguation and lexical lookup for a great many Uralic languages.
Algorithms
- Stemming algorithms for various European languages - Various stemming algorithms from Snowball.
- The Porter Stemmer Algorithm - The ‘official’ home page for distribution of the Porter Stemming Algorithm, written and maintained by its author, Martin Porter.
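
For readers new to the idea, stemming reduces inflected words to a common base form by applying suffix rules. The snippet below is a minimal sketch of that idea only: it is a toy suffix stripper, not the Porter or Snowball algorithm, and the `SUFFIXES` list, the `toy_stem` function, and the `min_stem` parameter are hypothetical names chosen for illustration.

```python
# Toy suffix-stripping stemmer: illustrative only, NOT the Porter or Snowball algorithm.
# SUFFIXES is a hypothetical rule list, ordered so longer suffixes are tried first.
SUFFIXES = ("ingly", "edly", "ing", "ed", "ly", "es", "s")


def toy_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        # Only strip when enough of the word remains to be a plausible stem.
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered


if __name__ == "__main__":
    for w in ("walking", "walked", "walks", "boxes", "is"):
        print(w, "->", toy_stem(w))
```

Real stemmers such as those linked above add measure conditions, multi-step rule passes, and language-specific exceptions, so their output will differ from this sketch on many words.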
Data sets
- EuroRomCom Data - JSON formatted Pan-Romance word lists.
- Araneum Germanicum
- CEHugeWebCorpus - German corpus based on CommonCrawl
- Digitales Wörterbuch der deutschen Sprache (DWDS)
- GC4 Corpus (CommonCrawl)
- IDS Corpora - German Reference Corpus
- Leipzig Corpora Collection - Sampled sentences in different languages.
- SdeWaC - Large German web corpus.
- C-WEP
- DysList - A list of dyslexic errors.
- Falko
- Litkey
- OpinionSpam
Resources
- Low Resource Languages - A list of resources for conservation, development, and documentation of low resource (human) languages.
- Language Science Press - A born-digital, scholar-led open access publisher in linguistics.