2020 has been a tough year for everyone, but staying in for 8 months at a stretch also brought in a wave of opportunities, and Girl Script extended was the cherry on the cake!
Girl Script Summer of Code is a 3-month Open Source program conducted every summer by Girl Script Foundation. It aims to help beginners get started with Open Source, but this time they also came up with an extended version because of the worldwide pandemic.
The forms were out in August, and I got the selection mail one afternoon. I was excited for the 3-month journey that awaited.
The extended version, unlike any other Open Source program, was about more than just contributions. We worked in groups of 50 to develop projects required by the industry, which means the projects addressed real requirements of the IT sector. ✨
I chose the project that resembled my previous undertakings, titled “How Many Topics?”. It was a research-based project and required Natural Language Processing, Python, and Mathematics.
Topic modeling aims to discover the underlying thematic structure in a text corpus, and the top terms appearing in each topic are treated as the output. Despite the diversity of topic modeling algorithms that have been proposed, a common challenge in successfully applying these techniques is the selection of an appropriate number of topics for a given corpus.
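To show what "top terms per topic" looks like in practice, here is a tiny Python sketch. The topic names, terms, and weights are made up for illustration, and `top_terms` is a hypothetical helper, not something from the project's code.

```python
def top_terms(term_weights, n=3):
    """Return the n highest-weighted terms of one topic.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Made-up term weights for two topics (illustrative values only).
topics = {
    "topic_0": {"goal": 0.9, "team": 0.7, "match": 0.6, "vote": 0.1},
    "topic_1": {"vote": 0.8, "party": 0.75, "election": 0.5, "team": 0.05},
}

# Each topic is summarised by its ranked list of top terms.
descriptors = {name: top_terms(weights) for name, weights in topics.items()}
print(descriptors)
```

A real topic model would produce these weights from the corpus; only the ranking step is shown here.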
We worked on a research topic where we proposed a term-centric stability analysis strategy to address this issue. The idea is that a model with an appropriate number of topics will be more robust to perturbations in the data.
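One way to picture the stability idea is to compare the top terms of a reference model with the top terms of models trained on perturbed samples of the data. The Python sketch below is only an illustration under my own assumptions: it uses Jaccard similarity between term sets and a brute-force best matching of topics, and the names `jaccard`, `agreement`, and `stability` are hypothetical. The exact measure used in the project is not described here.

```python
from itertools import permutations


def jaccard(terms_a, terms_b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement(topics_a, topics_b):
    """Mean Jaccard similarity under the best one-to-one topic matching.

    Brute force over permutations, so only suitable for a small number
    of topics (an assumption made to keep the sketch dependency-free).
    """
    k = len(topics_a)
    if k != len(topics_b):
        raise ValueError("both models must have the same number of topics")
    best = 0.0
    for perm in permutations(range(k)):
        score = sum(jaccard(topics_a[i], topics_b[j]) for i, j in enumerate(perm)) / k
        best = max(best, score)
    return best


def stability(reference, sample_runs):
    """Average agreement between a reference model and each perturbed run."""
    return sum(agreement(reference, run) for run in sample_runs) / len(sample_runs)


# Illustrative top-term lists (made-up data), two topics per model.
reference = [["goal", "match", "team"], ["vote", "party", "election"]]
run_1 = [["party", "vote", "election"], ["team", "goal", "league"]]

print(agreement(reference, run_1))               # 0.75
print(stability(reference, [run_1, reference]))  # 0.875
```

Under these assumptions, you would compute this score for several candidate numbers of topics and prefer the values where the score stays high across the perturbed samples.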
The project followed this pipeline:
1. Dataset Collection
2. Applying different Topic Modeling algorithms
3. Proof of Work for Algorithms
4. Fine-tuning the parameters for topic modeling
5. Deducing Mathematical Equation
6. Reaching a generic algorithm