Contentmine pipeline #63

@alexmaina

I have a database containing a list of PMIDs. I want to mine the text of all open-access articles in this list and extract the most frequently used terms, keywords, and subjects.
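To make the goal concrete, the kind of term counting I have in mind could be sketched roughly as below. This is only an illustration under my own assumptions: the function name `top_terms` and the tiny stopword list are hypothetical, it works on plain strings that have already been extracted from the articles, and it is not part of getpapers, quickscrape, or any other ContentMine tool.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real run would need a fuller one.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "for", "with", "on"}


def top_terms(texts, n=10):
    """Return the n most frequent non-stopword terms across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep alphabetic tokens of three or more letters.
        tokens = re.findall(r"[a-z]{3,}", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up sample sentences standing in for extracted article text.
    sample = [
        "Malaria parasites in the blood of patients with malaria.",
        "Blood samples and malaria diagnosis.",
    ]
    print(top_terms(sample, n=2))
```

In practice the input strings would come from the full text of each downloaded article rather than from hard-coded samples.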

I have tested getpapers and seen how powerful and efficient it is at retrieving papers. I then moved on to quickscrape and tried downloading PDFs from the URL list in the eupmc_fulltext_html_urls.txt file that getpapers outputs.

Since I can use the -p option in a getpapers query to download PDFs, why should I use quickscrape?

Also, in this video, from the 1:29 mark, Peter-Murray skims through PDFs quite easily. How does he do that? I am using an Ubuntu 14.04 LTS box; how can I skim through PDFs like that on Ubuntu?

Still on the video: at the 2:23 mark, Peter-Murray writes what looks like Java code to filter the files for sequences and key terms. Which tool is he using to do that? Is it part of the ContentMine API?

I am not sure whether the above qualifies as an issue, but I am really keen to understand ContentMine and how best to use it for my project.

Thanks

AM
