Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Paper
•
1908.10084
•
Published
•
12
This is a sentence-transformers model finetuned from sentence-transformers/all-MiniLM-L6-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel
(1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("sgadagin/fine_tuned_sbert")
# Run inference
sentences = [
'Security Event : Hack In Paris (16-17 June, 2011)\n\n\nHack In Paris is an international and corporate security event that will take place in Disneyland Paris® fromJune 16th to 17th of 2011. Please refer to the homepage to get up-to-date information about the event.\n\nTopics\nThe following list contains major topics the conference will cover. Please consider submitting even if the subject of your research is not listed here.\nAdvances in reverse engineering\nVulnerability research and exploitation\nPenetration testing and security assessment\nMalware analysis and new trends in malicous codes\nForensics, IT crime & law enforcement\nPrivacy issues: LOPPSI, HADOPI, …\nLow-level hacking (console security & mobile devices)\nRisk management and ISO 27001\nDates\nJanuary 20: CFP announced\nMarch 30: Submission deadline\nApril 15: Notification sent to authors\nApril 17: Program announcement\nJune 16-17: Hack In Paris\nJune 18: Nuit du Hack\nMore Information: https://hackinparis.com\n\n',
'Security Event : Hack In Paris (16-17 June, 2011)\n\n\nHack In Paris is an international and corporate security event that will take place in Disneyland Paris® fromJune 16th to 17th of 2011. Please refer to the homepage to get up-to-date information about the event.\n\nTopics\nThe following list contains major topics the conference will cover. Please consider submitting even if the subject of your research is not listed here.\nAdvances in reverse engineering\nVulnerability research and exploitation\nPenetration testing and security assessment\nMalware analysis and new trends in malicous codes\nForensics, IT crime & law enforcement\nPrivacy issues: LOPPSI, HADOPI, …\nLow-level hacking (console security & mobile devices)\nRisk management and ISO 27001\nDates\nJanuary 20: CFP announced\nMarch 30: Submission deadline\nApril 15: Notification sent to authors\nApril 17: Program announcement\nJune 16-17: Hack In Paris\nJune 18: Nuit du Hack\nMore Information: https://hackinparis.com\n\n',
'Google is going to shut down its social media network Google+ after the company suffered a massive data breach that exposed the private data of hundreds of thousands of Google Plus users to third-party developers.\n\nAccording to the tech giant, a security vulnerability in one of Google+\'s People APIs allowed third-party developers to access data for more than 500,000 users, including their usernames, email addresses, occupation, date of birth, profile photos, and gender-related information.\n\nSince Google+ servers do not keep API logs for more than two weeks, the company cannot confirm the number of users impacted by the vulnerability.\n\nHowever, Google assured its users that the company found no evidence that any developer was aware of this bug, or that the profile data was misused by any of the 438 developers that could have had access.\n"However, we ran a detailed analysis over the two weeks prior to patching the bug, and from that analysis, the Profiles of up to 500,000 Google+ accounts were potentially affected. Our analysis showed that up to 438 applications may have used this API," Google said in blog post published today.\nThe vulnerability was open since 2015 and fixed after Google discovered it in March 2018, but the company chose not to disclose the breach to the public—at the time when Facebook was being roasted for Cambridge Analytica scandal.\n\nThough Google has not revealed the technical details of the security vulnerability, the nature of the flaw seems to be something very similar to Facebook API flaw that recently allowed unauthorized developers to access private data from Facebook users.\n\nBesides admitting the security breach, Google also announced that the company is shutting down its social media network, acknowledging that Google+ failed to gain broad adoption or significant traction with consumers.\n"The consumer version of Google+ currently has low usage and engagement: 90 percent of Google+ user sessions are less than five seconds," Google said.\nIn response, the company has decided to shut down Google+ for consumers by the end of August 2019. However, Google+ will continue as a product for Enterprise users.\n\nGoogle Introduces New Privacy Controls Over Third-Party App Permissions\n\nAs part of its "Project Strobe," Google engineers also reviewed third-party developer access to Google account and Android device data; and has accordingly now introduced some new privacy controls.\n\nWhen a third-party app prompts users for access to their Google account data, clicking "Allow" button approves all requested permissions at once, leaving an opportunity for malicious apps to trick users into giving away powerful permissions.\nBut now Google has updated its Account Permissions system that asks for each requested permission individually rather than all at once, giving users more control over what type of account data they choose to share with each app.\n\nSince APIs can also allow developers to access users\' extremely sensitive data, like that of Gmail account, Google has limited access to Gmail API only for apps that directly enhance email functionality—such as email clients, email backup services and productivity services.\n\nGoogle shares fell over 2 percent to $1134.23 after the data breach reports.\n\n',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
sentence_0, sentence_1, and label| sentence_0 | sentence_1 | label | |
|---|---|---|---|
| type | string | string | int |
| details |
|
|
|
| sentence_0 | sentence_1 | label |
|---|---|---|
U.S. online fashion retailer SHEIN has admitted that the company has suffered a significant data breach after unknown hackers stole personally identifiable information (PII) of almost 6.5 million customers. |
U.S. online fashion retailer SHEIN has admitted that the company has suffered a significant data breach after unknown hackers stole personally identifiable information (PII) of almost 6.5 million customers. |
1 |
A location based Social Networking platform with 45 million users,'Foursquare' was vulnerable to the primary email address disclosed. |
A location based Social Networking platform with 45 million users,'Foursquare' was vulnerable to the primary email address disclosed. |
1 |
Earlier this week Dropbox team unveiled details of three critical vulnerabilities in Apple macOS operating system, which altogether could allow a remote attacker to execute malicious code on a targeted Mac computer just by convincing a victim into visiting a malicious web page. |
Earlier this week Dropbox team unveiled details of three critical vulnerabilities in Apple macOS operating system, which altogether could allow a remote attacker to execute malicious code on a targeted Mac computer just by convincing a victim into visiting a malicious web page. |
3 |
SoftmaxLossmulti_dataset_batch_sampler: round_robinoverwrite_output_dir: Falsedo_predict: Falseeval_strategy: noprediction_loss_only: Trueper_device_train_batch_size: 8per_device_eval_batch_size: 8per_gpu_train_batch_size: Noneper_gpu_eval_batch_size: Nonegradient_accumulation_steps: 1eval_accumulation_steps: Nonetorch_empty_cache_steps: Nonelearning_rate: 5e-05weight_decay: 0.0adam_beta1: 0.9adam_beta2: 0.999adam_epsilon: 1e-08max_grad_norm: 1num_train_epochs: 3max_steps: -1lr_scheduler_type: linearlr_scheduler_kwargs: {}warmup_ratio: 0.0warmup_steps: 0log_level: passivelog_level_replica: warninglog_on_each_node: Truelogging_nan_inf_filter: Truesave_safetensors: Truesave_on_each_node: Falsesave_only_model: Falserestore_callback_states_from_checkpoint: Falseno_cuda: Falseuse_cpu: Falseuse_mps_device: Falseseed: 42data_seed: Nonejit_mode_eval: Falseuse_ipex: Falsebf16: Falsefp16: Falsefp16_opt_level: O1half_precision_backend: autobf16_full_eval: Falsefp16_full_eval: Falsetf32: Nonelocal_rank: 0ddp_backend: Nonetpu_num_cores: Nonetpu_metrics_debug: Falsedebug: []dataloader_drop_last: Falsedataloader_num_workers: 0dataloader_prefetch_factor: Nonepast_index: -1disable_tqdm: Falseremove_unused_columns: Truelabel_names: Noneload_best_model_at_end: Falseignore_data_skip: Falsefsdp: []fsdp_min_num_params: 0fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}fsdp_transformer_layer_cls_to_wrap: Noneaccelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}deepspeed: Nonelabel_smoothing_factor: 0.0optim: adamw_torchoptim_args: Noneadafactor: Falsegroup_by_length: Falselength_column_name: lengthddp_find_unused_parameters: Noneddp_bucket_cap_mb: Noneddp_broadcast_buffers: Falsedataloader_pin_memory: Truedataloader_persistent_workers: Falseskip_memory_metrics: Trueuse_legacy_prediction_loop: Falsepush_to_hub: Falseresume_from_checkpoint: Nonehub_model_id: Nonehub_strategy: every_savehub_private_repo: Nonehub_always_push: Falsegradient_checkpointing: Falsegradient_checkpointing_kwargs: Noneinclude_inputs_for_metrics: Falseinclude_for_metrics: []eval_do_concat_batches: Truefp16_backend: autopush_to_hub_model_id: Nonepush_to_hub_organization: Nonemp_parameters: auto_find_batch_size: Falsefull_determinism: Falsetorchdynamo: Noneray_scope: lastddp_timeout: 1800torch_compile: Falsetorch_compile_backend: Nonetorch_compile_mode: Nonedispatch_batches: Nonesplit_batches: Noneinclude_tokens_per_second: Falseinclude_num_input_tokens_seen: Falseneftune_noise_alpha: Noneoptim_target_modules: Nonebatch_eval_metrics: Falseeval_on_start: Falseuse_liger_kernel: Falseeval_use_gather_object: Falseaverage_tokens_across_devices: Falseprompts: Nonebatch_sampler: batch_samplermulti_dataset_batch_sampler: round_robin| Epoch | Step | Training Loss |
|---|---|---|
| 1.0684 | 500 | 1.2186 |
| 2.1368 | 1000 | 1.145 |
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
Base model
sentence-transformers/all-MiniLM-L6-v2