Pallav and Alokeveer's Work.
The documentation is uploaded in PDF format. To view the documentation as a Google Doc, click here.
- Write scraping scripts in the Scripts folder. Save the scraped data in the Data folder
- Run Filtered_Data/cumulative_data_with_keyword_count.py
- Run Filtered_Data/title_ngrams.py to check that no related keywords have been missed
- Run Filtered_Data/final_data.py to get the shortlisted blogs
- Run Filtered_Data/summary.py to store all blog text in final_data.csv
- Run Filtered_Data/recall.py to get the recall score
- Test various analytic techniques in the Analytics folder. Save outputs in the Outputs folder
- Scraping scripts are present in the Scripts folder.
- Scraped data will be saved in CSV format in the Data folder.
- Before running scripts that use Selenium, the user will have to update the driver path in the script.
- For all scraping scripts, choose column names from the following list only: ["Blog Title", "Blog Date", "Blog Catchphrase", "Blog Category", "Blog Link", "Author Name", "Author Profile Link", "Thumbnail Link", "Thumbnail Credit"]. If you decide to include a new column name that is not listed here, add it to the list above so others know to use it in the future. Otherwise, the filtered data may contain redundant columns.
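As a convention check, a small helper like the following (a sketch, not part of the repo) could flag unapproved column names before a scraper writes its CSV:

```python
# Approved column names, copied from the list above.
ALLOWED_COLUMNS = {
    "Blog Title", "Blog Date", "Blog Catchphrase", "Blog Category",
    "Blog Link", "Author Name", "Author Profile Link",
    "Thumbnail Link", "Thumbnail Credit",
}

def unapproved_columns(columns):
    """Return any column names that are not in the approved list."""
    return sorted(set(columns) - ALLOWED_COLUMNS)
```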
- Custom modules are present in the Modules folder. Add the following code to scripts to import custom modules:

```python
import os
import sys

modules_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "../Modules"))
if modules_path not in sys.path:
    sys.path.insert(1, modules_path)
```

- Data files read by the scripts are present in the Read_Files folder.
- Each line in Read_Files/keywords.txt is treated as a keyword, used for the naive sorting.
- Read_Files/stopwords.txt contains a stopword list, and Read_Files/stopwords_cleaned.txt contains the same words in regex-cleaned form.
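The exact cleaning used to produce stopwords_cleaned.txt is not documented here; one plausible sketch is escaping regex metacharacters so each stopword can be dropped safely into a pattern:

```python
import re

def regex_clean(words):
    """Escape regex metacharacters in each stopword (hypothetical helper)."""
    return [re.escape(w.strip()) for w in words if w.strip()]
```

For example, a plain word like "the" is unchanged, while "a.m." becomes "a\.m\.".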
- The Filtered_Data folder contains the filtering scripts along with the sorted, cumulated, and filtered data.
- Filtered_Data/cumulative_data_with_keyword_count.py reads all CSV files from the Data folder and filters them based on keywords read from Read_Files/keywords.txt.
- Filtered_Data/final_data.py sorts and returns the top results in cumulative_data_with_keyword_count.csv. If required, the user will have to update the maximum number of output rows in the script. The following scoring criteria are used:
  - Start with an initial score of 0
  - If a year is present in the Blog Title, remove the blog if the year is >= 2010 and <= 2021. If 2022 is present, the blog cannot be removed in later steps
  - If the Blog Date or the Blog/Thumbnail Link contains a year < 2021, remove the blog if permitted
  - +1 for each unique keyword present in the title
  - +1 additionally if "summer", "spring", or "2022" is present in the title
  - +1 additionally if "2022" is present in the blog link
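The criteria above can be sketched as a scoring function. This is an illustrative reconstruction, not the actual final_data.py code; all names are assumptions, and it treats a "2022" title as exempt from every removal step:

```python
import re

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def score_row(title, blog_date, blog_link, thumbnail_link, keywords):
    """Return the blog's score, or None if the blog is removed."""
    protected = "2022" in title  # a 2022 title can't be removed in later steps
    # Remove if the title mentions a year between 2010 and 2021
    if not protected and any(2010 <= int(y) <= 2021 for y in YEAR_RE.findall(title)):
        return None
    # Remove if the Blog Date or Blog/Thumbnail Link contains a year < 2021
    if not protected:
        for field in (blog_date, blog_link, thumbnail_link):
            if any(int(y) < 2021 for y in YEAR_RE.findall(field or "")):
                return None
    title_lower = title.lower()
    score = 0
    # +1 for each unique keyword present in the title
    score += sum(1 for kw in set(keywords) if kw.lower() in title_lower)
    # +1 additionally for "summer", "spring", or "2022" in the title
    if any(tok in title_lower for tok in ("summer", "spring", "2022")):
        score += 1
    # +1 additionally for "2022" in the blog link
    if "2022" in (blog_link or ""):
        score += 1
    return score
```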
- Filtered_Data/title_ngrams.py generates n-grams from the blog titles of all positively scored data in cumulative_data_with_keyword_count.csv. Plots are saved in the Ngram_Histogram_Plots folder.
- Filtered_Data/divide_csv.py divides final_data.csv into two parts and saves them in the Analytics folder for document analytics.
- Filtered_Data/recall.py calculates the recall score for blog links in Filtered_Data/cumulative_data_with_keyword_count.csv based on the links present in Read_Files/fashion_intern_forecasting_website_list.csv
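Recall here presumably means the fraction of reference links that survive filtering; a minimal sketch, assuming links are compared as exact strings:

```python
def recall_score(filtered_links, reference_links):
    """Fraction of reference links that appear among the filtered links."""
    reference = set(reference_links)
    if not reference:
        return 0.0
    return len(reference & set(filtered_links)) / len(reference)
```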
- Filtered_Data/summary.py goes through all blogs in Filtered_Data/final_data.csv and appends the HTML text obtained to new columns. This allows efficient access to the blog text.
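The exact extraction method used by summary.py isn't specified here; a standard-library sketch of pulling visible text out of fetched HTML (the real script may use a library such as BeautifulSoup instead) looks like:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes from an HTML document (stdlib-only sketch)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep non-whitespace text content only
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```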
- The Analytics folder contains document analytics.
- All outputs are saved in the Outputs folder.
- Execution of Analytics/Pallav/textrank/vogue_test.ipynb requires GloVe embeddings. Heads up: these word embeddings are 822 MB. Extract all files and place them inside Read_Files/glove_embeddings/. The files can be downloaded here.
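Once extracted, GloVe files are plain text with one word followed by its vector components per line. A hypothetical loader (the function name and exact file layout are assumptions; the notebook may load the embeddings differently):

```python
import numpy as np

def load_glove(path):
    """Load a GloVe .txt file into a dict of word -> float32 vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings
```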