microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank
Version: 2
HuggingFace · Last updated July 2025

LLMLingua-2-Bert-base-Multilingual-Cased-MeetingBank

This model was introduced in the paper LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression (Pan et al., 2024). It is a BERT multilingual base model (cased) fine-tuned to perform token classification for task-agnostic prompt compression: the predicted probability $p_{preserve}$ of each token $x_i$ serves as the metric for deciding which tokens to keep. The model is trained on an extractive text compression dataset constructed with the methodology proposed in LLMLingua-2, using training examples from MeetingBank (Hu et al., 2023) as the seed data. You can evaluate the model on downstream tasks such as question answering (QA) and summarization over compressed meeting transcripts using this dataset. For more details, please check the project pages of LLMLingua-2 and the LLMLingua Series.
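At inference time, the compression criterion itself is simple to state: rank tokens by $p_{preserve}$ and keep the top fraction given by the target rate. Below is a minimal sketch with made-up probabilities; the real pipeline additionally handles chunking, forced tokens, and sub-token alignment.

```python
import math

def compress_by_preserve_prob(tokens, p_preserve, rate):
    """Keep the ceil(rate * n) tokens with the highest preserve
    probability, in their original order (toy illustration)."""
    n_keep = max(1, math.ceil(rate * len(tokens)))
    # token indices ranked by preserve probability, highest first
    ranked = sorted(range(len(tokens)), key=lambda i: p_preserve[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [tok for i, tok in enumerate(tokens) if i in keep]

tokens = ["So", ",", "um", ",", "revise", "the", "timeline"]
probs = [0.20, 0.50, 0.10, 0.45, 0.90, 0.40, 0.95]
print(compress_by_preserve_prob(tokens, probs, rate=0.5))
# → [',', ',', 'revise', 'timeline']
```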

Usage

from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True
)

original_prompt = """John: So, um, I've been thinking about the project, you know, and I believe we need to, uh, make some changes. I mean, we want the project to succeed, right? So, like, I think we should consider maybe revising the timeline.
Sarah: I totally agree, John. I mean, we have to be realistic, you know. The timeline is, like, too tight. You know what I mean? We should definitely extend it.
"""
results = compressor.compress_prompt_llmlingua2(
    original_prompt,
    rate=0.6,
    force_tokens=['\n', '.', '!', '?', ','],
    chunk_end_tokens=['.', '\n'],
    return_word_label=True,
    drop_consecutive=True
)

print(results.keys())
print(f"Compressed prompt: {results['compressed_prompt']}")
print(f"Original tokens: {results['origin_tokens']}")
print(f"Compressed tokens: {results['compressed_tokens']}")
print(f"Compression rate: {results['rate']}")

# get the annotated results over the original prompt
word_sep = "\t\t|\t\t"
label_sep = " "
lines = results["fn_labeled_original_prompt"].split(word_sep)
annotated_results = []
for line in lines:
    # rsplit guards against words that themselves contain the label separator
    word, label = line.rsplit(label_sep, 1)
    # list of tuples: (word, '+') if the token is preserved, (word, '-') otherwise
    annotated_results.append((word, '+') if label == '1' else (word, '-'))
print("Annotated results:")
for word, label in annotated_results[:10]:
    print(f"{word} {label}")
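The `'+'`/`'-'` labels can also be used to rebuild the compressed text yourself. A minimal sketch over a hypothetical annotated list in the format produced above:

```python
# hypothetical (word, label) pairs in the format produced above
annotated = [("So", "-"), (",", "+"), ("um", "-"), ("revise", "+"), ("timeline", "+")]

# keep only the words marked '+' (preserved by the compressor)
compressed = " ".join(word for word, label in annotated if label == "+")
print(compressed)  # → ", revise timeline"
```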

Citation

@article{wu2024llmlingua2,
    title = "{LLML}ingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression",
    author = "Zhuoshi Pan and Qianhui Wu and Huiqiang Jiang and Menglin Xia and Xufang Luo and Jue Zhang and Qingwei Lin and Victor Ruhle and Yuqing Yang and Chin-Yew Lin and H. Vicky Zhao and Lili Qiu and Dongmei Zhang",
    url = "https://arxiv.org/abs/2403.12968",
    journal = "ArXiv preprint",
    volume = "abs/2403.12968",
    year = "2024",
}

microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank powered by Hugging Face Inference Toolkit

Send Request

You can use cURL or any REST client to send a request to the AzureML endpoint with your AzureML token.
curl <AZUREML_ENDPOINT_URL> \
    -X POST \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{"inputs":"My name is Sarah Jessica Parker but you can call me Jessica"}'
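The same request can be sent from Python. A sketch using only the standard library to build the payload; `<AZUREML_ENDPOINT_URL>` and `<AZUREML_TOKEN>` remain placeholders for your own values:

```python
import json

url = "<AZUREML_ENDPOINT_URL>"  # placeholder: your AzureML endpoint
headers = {
    "Authorization": "Bearer <AZUREML_TOKEN>",  # placeholder: your AzureML token
    "Content-Type": "application/json",
}
body = json.dumps(
    {"inputs": "My name is Sarah Jessica Parker but you can call me Jessica"}
)
# with the `requests` package installed and real values filled in:
# response = requests.post(url, headers=headers, data=body)
# print(response.json())
print(body)
```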

Supported Parameters

  • inputs (string): The input text data.
  • parameters (object):
    • ignore_labels (string[]): A list of labels to ignore.
    • stride (integer): The number of overlapping tokens between chunks when splitting the input text.
    • aggregation_strategy (string): One of the following:
      • none: Do not aggregate tokens.
      • simple: Group consecutive tokens with the same label into a single entity.
      • first: Similar to "simple", but also preserves word integrity (uses the label predicted for the first token in a word).
      • average: Similar to "simple", but also preserves word integrity (uses the label with the highest score, averaged across the word’s tokens).
      • max: Similar to "simple", but also preserves word integrity (uses the label with the highest score across the word’s tokens).
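To see how the word-level strategies can disagree, here is a toy sketch with made-up sub-token scores for a single word (illustration only, not the library's implementation):

```python
# made-up per-sub-token scores for one word split into three sub-tokens;
# each entry maps label -> score
subtoken_scores = [
    {"KEEP": 0.40, "DROP": 0.60},
    {"KEEP": 0.90, "DROP": 0.10},
    {"KEEP": 0.30, "DROP": 0.70},
]

def strategy_first(scores):
    # label predicted for the first sub-token of the word
    return max(scores[0], key=scores[0].get)

def strategy_average(scores):
    # average each label's score across sub-tokens, then pick the best label
    labels = scores[0].keys()
    return max(labels, key=lambda lab: sum(s[lab] for s in scores) / len(scores))

def strategy_max(scores):
    # label of the single highest-scoring sub-token prediction
    best = max(scores, key=lambda s: max(s.values()))
    return max(best, key=best.get)

print(strategy_first(subtoken_scores))    # → DROP
print(strategy_average(subtoken_scores))  # → KEEP
print(strategy_max(subtoken_scores))      # → KEEP
```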
Check the full API specification in the Hugging Face Inference documentation.
Model Specifications
License: Apache-2.0
Last Updated: July 2025
Provider: HuggingFace