Pallav and Alokeveer's Work.
The documentation is uploaded in PDF format. To view the documentation as a Google Doc, click here.
- Write scraping scripts in the Scripts folder. Save the scraped data in the Data folder
- Run Filtered_Data/cumulative_data_with_keyword_count.py
- Run Filtered_Data/title_ngrams.py to check that no related keywords have been missed
- Run Filtered_Data/final_data.py to get the shortlisted blogs
- Run Filtered_Data/summary.py to store all blog text in final_data.csv
- Run Filtered_Data/recall.py to get the recall score
- Test various analytic techniques in the Analytics folder. Save outputs in the Outputs folder
- Scraping scripts are present in the Scripts folder.
- Scraped data will be saved in CSV format in the Data folder.
- Before running scripts that use Selenium, the user will have to update the driver path in the script.
- For all scraping scripts, choose column names from the following list only: ["Blog Title", "Blog Date", "Blog Catchphrase", "Blog Category", "Blog Link", "Author Name", "Author Profile Link", "Thumbnail Link", "Thumbnail Credit"]. If you decide to include a new column name that is not listed here, add it to the list above so others know to use it in the future. Otherwise, the filtered data may contain redundant columns.
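As a convention check, a small helper like the following (a sketch, not part of the repo) could flag unapproved column names before a scraper writes its CSV:

```python
# Approved column names, copied from the list above.
ALLOWED_COLUMNS = {
    "Blog Title", "Blog Date", "Blog Catchphrase", "Blog Category",
    "Blog Link", "Author Name", "Author Profile Link",
    "Thumbnail Link", "Thumbnail Credit",
}

def unapproved_columns(columns):
    """Return any column names that are not in the approved list."""
    return sorted(set(columns) - ALLOWED_COLUMNS)
```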
- Custom modules are present in the Modules folder. Add the following code to scripts to import custom modules:

```python
import os
import sys

modules_path = os.path.abspath(os.path.join(os.path.dirname(__file__), "../Modules"))
if modules_path not in sys.path:
    sys.path.insert(1, modules_path)
```

- Data files read by the scripts are present in the Read_Files folder.
- Each line in Read_Files/keywords.txt is treated as a keyword, used for the naive sorting.
- Read_Files/stopwords.txt contains a stopword list, and Read_Files/stopwords_cleaned.txt contains the same words in regex-cleaned form.
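The exact cleaning used to produce stopwords_cleaned.txt is not documented here; one plausible sketch is escaping regex metacharacters so each stopword can be dropped safely into a pattern:

```python
import re

def regex_clean(words):
    """Escape regex metacharacters in each stopword (hypothetical helper)."""
    return [re.escape(w.strip()) for w in words if w.strip()]
```

For example, a plain word like "the" is unchanged, while "a.m." becomes "a\.m\.".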
- The Filtered_Data folder contains the filtering scripts along with the sorted, cumulated, and filtered data.
- Filtered_Data/cumulative_data_with_keyword_count.py reads all CSV files from the Data folder and filters them based on keywords read from Read_Files/keywords.txt.
- Filtered_Data/final_data.py sorts and returns the top results in cumulative_data_with_keyword_count.csv. If required, the user will have to update the maximum number of output rows in the script. The following scoring criteria are used:
  - Start with an initial score of 0
  - If a year is present in the Blog Title, remove the blog if the year is >= 2010 and <= 2021. If 2022 is present, the blog cannot be removed in later steps
  - If the Blog Date or the Blog/Thumbnail Link contains a year < 2021, remove the blog if permitted
  - +1 for each unique keyword present in the title
  - +1 additionally if "summer", "spring", or "2022" is present in the title
  - +1 additionally if "2022" is present in the blog link
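The criteria above can be sketched as a scoring function. This is an illustrative reconstruction, not the actual final_data.py code; all names are assumptions, and it treats a "2022" title as exempt from every removal step:

```python
import re

YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

def score_row(title, blog_date, blog_link, thumbnail_link, keywords):
    """Return the blog's score, or None if the blog is removed."""
    protected = "2022" in title  # a 2022 title can't be removed in later steps
    # Remove if the title mentions a year between 2010 and 2021
    if not protected and any(2010 <= int(y) <= 2021 for y in YEAR_RE.findall(title)):
        return None
    # Remove if the Blog Date or Blog/Thumbnail Link contains a year < 2021
    if not protected:
        for field in (blog_date, blog_link, thumbnail_link):
            if any(int(y) < 2021 for y in YEAR_RE.findall(field or "")):
                return None
    title_lower = title.lower()
    score = 0
    # +1 for each unique keyword present in the title
    score += sum(1 for kw in set(keywords) if kw.lower() in title_lower)
    # +1 additionally for "summer", "spring", or "2022" in the title
    if any(tok in title_lower for tok in ("summer", "spring", "2022")):
        score += 1
    # +1 additionally for "2022" in the blog link
    if "2022" in (blog_link or ""):
        score += 1
    return score
```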
- Filtered_Data/title_ngrams.py generates n-grams from the blog titles of all positively scored data in cumulative_data_with_keyword_count.csv. Plots are saved in the Ngram_Histogram_Plots folder.
- Filtered_Data/divide_csv.py divides final_data.csv into two parts and saves them in the Analytics folder for document analytics.
- Filtered_Data/recall.py calculates the recall score for blog links in Filtered_Data/cumulative_data_with_keyword_count.csv based on the links present in Read_Files/fashion_intern_forecasting_website_list.csv
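Recall here presumably means the fraction of reference links that survive filtering; a minimal sketch, assuming links are compared as exact strings:

```python
def recall_score(filtered_links, reference_links):
    """Fraction of reference links that appear among the filtered links."""
    reference = set(reference_links)
    if not reference:
        return 0.0
    return len(reference & set(filtered_links)) / len(reference)
```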
- Filtered_Data/summary.py goes through all blogs in Filtered_Data/final_data.csv and appends the HTML text obtained to new columns. This allows efficient access to the blog text.
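The exact extraction method used by summary.py isn't specified here; a standard-library sketch of pulling visible text out of fetched HTML (the real script may use a library such as BeautifulSoup instead) looks like:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes from an HTML document (stdlib-only sketch)."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep non-whitespace text content only
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```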
- The Analytics folder contains document analytics.
- All outputs are saved in the Outputs folder.
- Execution of Analytics/Pallav/textrank/vogue_test.ipynb requires GloVe embeddings. Heads up: these word embeddings are 822 MB. Extract all files and place them inside Read_Files/glove_embeddings/. The files can be downloaded here.
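Once extracted, GloVe files are plain text with one word followed by its vector components per line. A hypothetical loader (the function name and exact file layout are assumptions; the notebook may load the embeddings differently):

```python
import numpy as np

def load_glove(path):
    """Load a GloVe .txt file into a dict of word -> float32 vector."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            embeddings[word] = np.asarray(values, dtype="float32")
    return embeddings
```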