omarkamali
posted an update 7 days ago
Exciting updates to the Wikipedia Monthly dataset for November! 🚀

・ Fixed a bug so that infobox leftovers and other wiki markers such as __TOC__ are now removed
・ New Python package, wikisets (https://pypi.org/project/wikisets): a dataset builder with efficient sampling so you can seamlessly combine the languages you want for any date (ideal for pretraining data, but works for any purpose)
・ Moved the pipeline to a large server. Costs are much higher, but reliability and predictability are better (let me know if you'd like to sponsor this!).
・ Dataset sizes are unfortunately missing for this month due to shenanigans with the migration, but should be back in December's update.
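
For anyone curious what stripping markers like __TOC__ can look like, here is a minimal sketch. It is not the pipeline's actual code: the function name `strip_magic_words` and the regex are my own assumptions, based on MediaWiki behavior switches being written as uppercase words wrapped in double underscores.

```python
import re

# Assumption: behavior switches such as __TOC__ or __NOTOC__ are uppercase
# letters wrapped in double underscores. Lowercase names like __init__ are
# deliberately left alone.
MAGIC_WORD = re.compile(r"__[A-Z]+__")


def strip_magic_words(text: str) -> str:
    """Remove behavior switches and tidy the whitespace they leave behind.

    Hypothetical helper for illustration only.
    """
    cleaned = MAGIC_WORD.sub("", text)
    # Collapse runs of spaces/tabs created by the removal.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned)
    # Trim leading/trailing whitespace, including a now-empty first line.
    return cleaned.strip()


if __name__ == "__main__":
    sample = "__TOC__\nIntro text __NOTOC__ here."
    print(strip_magic_words(sample))
```

A plain regex like this only handles the double-underscore markers; infobox leftovers are template markup and would need a proper wikitext parser.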

Check out the dataset:
omarkamali/wikipedia-monthly