

A Benchmark Dataset for Legal Language Understanding in English


LexGLUE's contribution to the state of the art is twofold. First, it combines and refines seven large document datasets into one easy-to-access corpus. Second, thanks to the authors' thoughtful packaging, the datasets can be integrated into Hugging Face's transformer library for training and evaluation with just a couple of lines of code.

According to the authors:

By unifying and facilitating the access to a set of law-related datasets and tasks, we hope to attract not only more NLP experts, but also more interdisciplinary researchers (e.g., law doctoral students willing to take NLP courses). More broadly, we hope LexGLUE will speed up the adoption and transparent evaluation of new legal NLP methods and approaches in the commercial sector too. Indeed, there have been many commercial press releases in the legal-tech industry on high-performing systems, but almost no independent evaluation of the performance of machine learning and NLP-based tools. A standard publicly available benchmark would also allay concerns of undue influence in predictive models, including the use of metadata which the relevant law expressly disregards.

LexGLUE's seven constituent datasets together contain over 100,000 training instances, primarily for multi-label or multi-class text classification tasks.

[Figure: LexGLUE Summary]
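The distinction between multi-class and multi-label classification determines how the target for each document is encoded. The following is a minimal sketch of that difference, not code from the LexGLUE authors: the helper names `one_hot` and `multi_hot` and the class count of 5 are hypothetical and chosen only for illustration.

```python
def one_hot(label, num_classes):
    """Multi-class target: exactly one class applies to the document."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def multi_hot(labels, num_classes):
    """Multi-label target: zero, one, or several classes may apply."""
    vec = [0] * num_classes
    for label in labels:
        vec[label] = 1
    return vec


# Hypothetical label space with 5 classes (not the size of any LexGLUE task).
NUM_CLASSES = 5

single_issue = one_hot(2, NUM_CLASSES)          # one issue area per opinion
two_articles = multi_hot([0, 3], NUM_CLASSES)   # two articles apply to one case
no_articles = multi_hot([], NUM_CLASSES)        # a case where no label applies
```

In a multi-label task an empty label list is a valid target, which is why a model for such a task typically scores each class independently rather than picking a single best class.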

The seven included datasets and tasks are:

  1. ECtHR Task A: includes over 11,000 cases from the European Court of Human Rights (ECtHR), each mapped to the provisions of the European Convention on Human Rights (ECHR) that were violated. Task A takes a list of case facts and outputs a list of violated articles, so trained models learn to match fact patterns to violations of the ECHR.
  2. ECtHR Task B: uses the same training data and the same inputs (a list of case facts) as ECtHR Task A, except that the outputs are the allegedly violated articles rather than the violated ones.
  3. SCOTUS: maps thousands of U.S. Supreme Court opinions to one of 14 issue areas, such as Criminal Procedure, Civil Rights, and Economic Activity. This permits the training and evaluation of models that classify opinions by issue area.
  4. EUR-LEX: includes over 65,000 EU laws, each mapped to the corresponding legal concepts assigned by the EU’s Publications Office.
  5. LEDGAR: includes a subset limited to the 100 most common contract provision labels, drawn from a dataset of 850,000 contract provisions labelled with 12,500 different categories.
  6. UNFAIR-ToS: includes thousands of sentences from terms of service, each labeled with whichever of 8 types of unfair contractual terms under European consumer law apply to it.
  7. CaseHOLD: consists of around 53,000 multiple-choice questions on the holdings of U.S. court cases.
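The Hugging Face integration described earlier can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' own code: the Hub identifier `coastalcph/lex_glue` (older examples use the short name `lex_glue`) and the seven configuration names are assumed to match what is currently published on the Hugging Face Hub, the task-type annotations are a reading of the list above, and the helper `load_lexglue_task` is a hypothetical wrapper around the `datasets.load_dataset` function.

```python
# Assumed configuration names for the seven tasks, with the task type
# suggested by the descriptions above.
LEXGLUE_CONFIGS = {
    "ecthr_a": "multi-label",
    "ecthr_b": "multi-label",
    "scotus": "multi-class",
    "eurlex": "multi-label",
    "ledgar": "multi-class",
    "unfair_tos": "multi-label",
    "case_hold": "multiple-choice",
}


def load_lexglue_task(config_name, split="train", hub_id="coastalcph/lex_glue"):
    """Hypothetical helper: download one LexGLUE task from the Hugging Face Hub.

    The hub_id default is an assumption; adjust it if the dataset has moved.
    """
    if config_name not in LEXGLUE_CONFIGS:
        known = ", ".join(sorted(LEXGLUE_CONFIGS))
        raise ValueError(f"Unknown task {config_name!r}; expected one of: {known}")
    # Imported lazily so the task table above is usable without the library.
    from datasets import load_dataset

    return load_dataset(hub_id, config_name, split=split)
```

With the `datasets` package installed and network access available, `load_lexglue_task("scotus")` would be expected to return the training split of the SCOTUS task. Validating the name first keeps typos from triggering a download attempt.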


Open Source: Yes
Paid Support: No
API: Hugging Face Library Integration



Tech Stack


Project Developer(s)


Michael Bommarito

Adjunct Professor of Law at Michigan State University

From the Stanford CodeX Website: Michael Bommarito is a former CodeX fellow. He is an Adjunct Professor of Law at Michigan State University and Head of Research at the ReInventLaw Laboratory. His research interests include natural language processing, machine learning, decision science, optimization, visualization, modeling, and policy, especially as applied to law and finance.


Dirk Hartung

Founder and Executive Director, Center for Legal Technology and Data Science at Bucerius Law School

From Stanford's directory: Dirk Hartung is the founder and Executive Director of the Center for Legal Technology and Data Science at Bucerius Law School in Hamburg, Germany. He is the Co-Academic Director for the Bucerius Summer Program in Legal Technology and Operations and Bucerius Legal Technology Essentials. He develops the technology curriculum for this leading German law school. He is writing a PhD on digital lawyering under unauthorized practice of law regimes.


Ion Androutsopoulos

Professor of AI in the Department of Informatics of the AUEB

From Ion's personal website: I am Professor of Artificial Intelligence (AI) in the Department of Informatics of the Athens University of Economics and Business (AUEB), and head of AUEB's Natural Language Processing Group. I am also Scientific Advisor of the AI Centre of Excellence in Document Intelligence at NCSR "Demokritos", and Adjunct Researcher of the Institute for the Management of Information Systems (Digital Curation Unit) at the Research Centre "Athena".


Abhik Jana

Postdoctoral research associate at Universität Hamburg

From Abhik's personal website: I am a postdoctoral research associate at Universität Hamburg working under the supervision of Professor Chris Biemann. I am currently working on HILANO project which deals with the anonymization of sensitive data.


Daniel Martin Katz

Professor of Law, Chicago-Kent College of Law

From his personal website: Research Interests include legal informatics, applied legal technology, law & economics, legal & regulatory complexity, artificial intelligence, artificial intelligence & law, machine learning & natural language processing, complex systems, network science, governance, financial regulation, financial technology, quantitative finance, quantitative modeling of litigation and jurisprudence, economics of the professions, blockchain & crypto infrastructure and the overall impact of information technology, analytics and automation on the future of society.


Nikolaos Aletras

Lecturer in NLP, Computer Science Department at the University of Sheffield

From his Sheffield profile page: Nikos Aletras is a Lecturer in Natural Language Processing (NLP) in the Computer Science Department at the University of Sheffield, co-affiliated with the Machine Learning (ML) group. Previously, he was a research scientist at Amazon (Core ML and Alexa) and a research associate at UCL, Department of Computer Science, Media Futures Group. He completed a PhD in NLP at the University of Sheffield. His research interests are in NLP, Machine Learning and Data Science. He develops text analysis methods to solve problems in other scientific areas such as (computational) social and legal science.


Ilias Chalkidis

Post-doctoral researcher at the Department of Computer Science at University of Copenhagen

From Ilias' personal website: I am a post-doctoral researcher at the Department of Computer Science at University of Copenhagen (CoAStaL NLP Group). I recently received my Ph.D. from the Department of Informatics at Athens University of Economics and Business. My expertise is in Legal Natural Language Processing (LegalNLP), also known as Legal Intelligence. I have been a reviewer for ACL venues (ACL/EMNLP/NAACL 2020-2021) and reputable journals, such as AI & Law, PeerJ, ACM Computing Surveys, and Computer Speech & Language. I have also served and currently serve in the program committees of AI and NLP workshops targeting legal applications (AI4LEGAL, NLLP).