Curience-Work

Pallav and Alokeveer's Work.
The documentation is uploaded in PDF format. To view the documentation as a Google Doc, click here.

Workflow

  1. Write scraping scripts in the Scripts folder. Save the scraped data in the Data folder
  2. Run Filtered_Data/cumulative_data_with_keyword_count.py
  3. Run Filtered_Data/title_ngrams.py to check that no related keywords have been missed
  4. Run Filtered_Data/final_data.py to get shortlisted blogs
  5. Run Filtered_Data/summary.py to store all blog text in final_data.csv
  6. Run Filtered_Data/recall.py to get recall score
  7. Test various analytic techniques in the Analytics folder. Save outputs in the Outputs folder
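Assuming each script runs standalone from the repository root, the filtering steps above (2–6) can be driven in order by a small runner like this — the runner itself is hypothetical, only the script paths come from the workflow:

```python
import subprocess
import sys

# Filtering steps from the workflow above, in order
# (paths assumed relative to the repository root)
STEPS = [
    "Filtered_Data/cumulative_data_with_keyword_count.py",
    "Filtered_Data/title_ngrams.py",
    "Filtered_Data/final_data.py",
    "Filtered_Data/summary.py",
    "Filtered_Data/recall.py",
]

def run_pipeline():
    for script in STEPS:
        # check=True stops the pipeline as soon as one step fails
        subprocess.run([sys.executable, script], check=True)

if __name__ == "__main__":
    run_pipeline()
```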

Info

  • Scraping scripts are present in the Scripts folder.
    • Scraped data will be saved in CSV format in the Data folder.
    • Before running scripts that use Selenium, update the driver path in the script.
    • For all scraping scripts, choose column names from the following list only: ["Blog Title", "Blog Date", "Blog Catchphrase", "Blog Category", "Blog Link", "Author Name", "Author Profile Link", "Thumbnail Link", "Thumbnail Credit"]. If you include a new column name not listed here, add it to the list above so others use it in the future; otherwise the filtered data may contain redundant columns.
  • Custom modules are present in the Modules folder. Add the following code to a script to make them importable:
    import os
    import sys

    # Make the Modules folder importable regardless of where the script is run from
    modules_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "../Modules"))
    if modules_path not in sys.path:
        sys.path.insert(1, modules_path)
    
  • Files read by the scripts are present in the Read_Files folder.
    • Each line in Read_Files/keywords.txt is treated as a keyword, used for the naive sorting.
    • Read_Files/stopwords.txt contains a stopword list, and Read_Files/stopwords_cleaned.txt contains the same words in regex-cleaned form.
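The exact cleaning applied in stopwords_cleaned.txt isn't documented here; one plausible reading is that each stopword is escaped so the whole list can be joined into a single regex alternation. A sketch under that assumption (function names are illustrative):

```python
import re

def stopword_pattern(stopwords):
    # Escape each word so regex metacharacters match literally, then
    # join into one alternation bounded by word boundaries
    escaped = sorted(map(re.escape, stopwords), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)

def strip_stopwords(text, stopwords):
    # Replace stopwords with spaces, then collapse the whitespace
    cleaned = stopword_pattern(stopwords).sub(" ", text)
    return " ".join(cleaned.split())
```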
  • Filtered_Data folder contains the filtering scripts and the sorted, cumulative, filtered data.
    • Filtered_Data/cumulative_data_with_keyword_count.py reads all CSV files from the Data folder and filters them based on keywords read from Read_Files/keywords.txt.
    • Filtered_Data/final_data.py sorts cumulative_data_with_keyword_count.csv and returns the top results. If required, update the maximum number of output rows in the script. The following scoring criteria are used:
      • Start with an initial score of 0
      • If a year between 2010 and 2021 appears in the Blog Title, remove the blog. If 2022 appears in the title, the blog can't be removed in this or any later step
      • If the Blog Date or the Blog/Thumbnail Link contains a year earlier than 2021, remove the blog (unless protected by 2022)
      • +1 for each unique keyword present in the title
      • +1 additionally if "summer", "spring", or "2022" is present in the title
      • +1 additionally if "2022" is present in the blog link
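A rough sketch of the scoring rules above — the real logic lives in final_data.py, and the date/thumbnail-link year checks are omitted here; function and variable names are illustrative:

```python
import re

def score_blog(title, link, keywords):
    # Sketch of the scoring in final_data.py (title-year rule only)
    t = title.lower()
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", t)]
    protected = 2022 in years  # a 2022 title can never be removed
    if not protected and any(2010 <= y <= 2021 for y in years):
        return None  # blog removed
    score = 0
    # +1 for each unique keyword present in the title
    score += sum(1 for kw in set(k.lower() for k in keywords) if kw in t)
    # +1 if the title mentions "summer", "spring", or "2022"
    if any(tok in t for tok in ("summer", "spring", "2022")):
        score += 1
    # +1 if the blog link mentions "2022"
    if "2022" in link:
        score += 1
    return score
```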
    • Filtered_Data/title_ngrams.py generates n-grams from the blog titles of all positively scored rows in cumulative_data_with_keyword_count.csv. Plots are saved in the Ngram_Histogram_Plots folder.
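For reference, n-gram generation over a tokenized title reduces to a sliding window — a minimal sketch, not the counting or plotting code in title_ngrams.py:

```python
def ngrams(tokens, n):
    # Slide a window of size n across the token list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```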
    • Filtered_Data/divide_csv.py divides final_data.csv into two parts and saves them in the Analytics folder for document analytics.
    • Filtered_Data/recall.py calculates the recall score for blog links in Filtered_Data/cumulative_data_with_keyword_count.csv based on links present in Read_Files/fashion_intern_forecasting_website_list.csv
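The recall computation presumably reduces to set overlap between the filtered links and the reference list — a sketch of that definition; the script's actual CSV/column handling may differ:

```python
def recall_score(retrieved_links, reference_links):
    # Recall = fraction of reference links that survived filtering
    reference = set(reference_links)
    if not reference:
        return 0.0
    return len(reference & set(retrieved_links)) / len(reference)
```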
    • Filtered_Data/summary.py goes through all blogs in Filtered_Data/final_data.csv and appends the HTML text obtained as new columns. This allows for efficient access to blog text.
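The text-extraction step in summary.py can be approximated with the standard library alone — the repo likely uses a scraping library instead, so treat this as a sketch of the idea rather than the actual implementation:

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def blog_text(html):
    # Strip tags and collapse whitespace to get plain blog text
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(p.strip() for p in parser.parts if p.strip())
```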
  • Analytics folder contains document analytics.
    • All outputs are saved in Outputs folder.
    • Executing Analytics/Pallav/textrank/vogue_test.ipynb requires GloVe embeddings. Heads up: these word embeddings are 822 MB. Extract all files and place them inside Read_Files/glove_embeddings/. Files can be downloaded here.

All Requirements
