Contentmine pipeline

I have a database with a list of PMIDs. I want to mine the text of all open-access articles in this list and get the most frequently used terms, keywords, and subjects.

I have tested getpapers and seen how powerful and efficient it is at getting papers. I then moved on to quickscrape and tried downloading PDFs based on the URL list in the eupmc_fulltext_html_urls.txt file that getpapers outputs.
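To make the goal concrete, here is a rough Python sketch of the term counting I am after. It is only an illustration under assumptions: it assumes getpapers has already saved the full text as XML, with one subdirectory per paper containing a `fulltext.xml` file, and the names `count_terms`, `tokenize`, and `STOPWORDS` are made up for this example and are not part of any ContentMine tool.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Tiny illustrative stop-word list (an assumption; a real run needs a fuller one).
STOPWORDS = {
    "the", "and", "for", "with", "that", "this", "from", "were", "was",
    "are", "have", "has", "not", "but", "which", "their", "been", "also",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens of 3+ letters that are not stop words."""
    return [w for w in re.findall(r"[a-z]{3,}", text.lower()) if w not in STOPWORDS]


def count_terms(output_dir):
    """Count term frequencies across every <paper>/fulltext.xml under output_dir.

    The directory layout (one subdirectory per paper, each holding a
    fulltext.xml file) is an assumption about the getpapers output.
    """
    counts = Counter()
    for xml_path in Path(output_dir).glob("*/fulltext.xml"):
        try:
            root = ET.parse(xml_path).getroot()
        except ET.ParseError:
            # Skip files that are not well-formed XML rather than abort the run.
            continue
        # itertext() yields all text nodes, so the markup itself is ignored.
        counts.update(tokenize(" ".join(root.itertext())))
    return counts
```

With a getpapers output directory called, say, `my_output` (a hypothetical name), `count_terms("my_output").most_common(20)` would list the twenty most frequent terms. This counts plain words only, so it is probably cruder than what the ContentMine tools do.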

Seeing that I can use the -p option in a getpapers query to download PDFs, my question is: why should I use quickscrape?

Also, in this [video](https://www.youtube.com/watch?v=5lYzOZ2Cv_I), from the 1:29 mark, Peter Murray-Rust is able to skim through PDFs quite easily. How does he do that? I am using an Ubuntu 14.04 LTS box; how can I skim through PDFs like that on Ubuntu?

Still on the video, at the 2:23 mark, Peter Murray-Rust writes what looks like Java code to filter the files for sequences and key terms. Which tool is he using to do that? Is it part of the ContentMine API?

I am not sure whether what I have written above qualifies as an issue, but I am really keen to understand ContentMine and how best to use it for my project.

Thanks

AM

Contentmine pipeline #63