Finding a string in the books of a specific folder

Printing a string, with its sentence, from thousands of books in a folder

Imagine you have a folder somewhere on your computer that holds thousands of books as txt or docx files. One day you are researching a specific topic and would like to search those books for a word, for example COVID or Corona, just to see whether it was ever mentioned before. The code below tells Python to go to that folder, open every book in it, find every line that contains 'Corona', and print that line together with the name of the book where it appears. As written, the code only reads the .txt files.

 

>>> import os

>>> os.chdir('') # put the path of your book folder between the quotes; an empty string raises an error

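The function below reads the folder from a variable, so store the path of your bookstore there (use '.' if you already moved into the folder with os.chdir):

>>> bookstore = ''  # path of your book folder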
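Because os.walk by default skips a folder it cannot list instead of raising an error, a wrong path simply produces no results. A small check before searching can make that mistake visible. This is only a sketch; the helper name check_book_folder is my own and not part of the original code.

```python
import os

def check_book_folder(path):
    """Return the absolute path of the folder, or raise if it is not a folder."""
    # os.path.isdir is False for an empty string and for a path that does not exist
    if not os.path.isdir(path):
        raise NotADirectoryError(f'Not a folder: {path!r}')
    return os.path.abspath(path)
```

You could then write bookstore = check_book_folder('path/to/books'), where the path is a placeholder for your own folder.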

>>> def finding_string_in_bookstore(x):
    # Walk through the bookstore folder and every subfolder inside it
    for root, dirs, files in os.walk(bookstore):
        for filename in files:
            if filename.endswith('.txt'):
                with open(os.path.join(root, filename), 'r', encoding='utf-8') as f:
                    text = f.read()
                # Split the book into lines; each line is treated as one sentence
                sent_text = text.splitlines()
                # Append the matches to the result file; the with block closes it properly
                with open('result_of_my_research_in_book_store.txt', 'a+', encoding='utf-8') as out:
                    for w in sent_text:
                        if x in w:
                            # partition gives (text before x, x, text after x)
                            print(w.partition(x), filename, file=out)

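Writing the output to a text file and reading it there is better than printing it in the shell, where the results can be overwhelming. Note that the result file is created in the current working directory; if that is the book folder itself, a later run will search the result file too.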
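The same search can also be written as a function that takes the folder as a parameter and returns the matches, which makes it easy to try on a small folder first. This is a minimal sketch under my own assumptions: the name find_in_books, its parameters, and the choice of errors='replace' are illustrative, not part of the code above.

```python
import os

def find_in_books(folder, word, extension='.txt'):
    """Return (filename, line) pairs for every line that contains word."""
    matches = []
    for root, dirs, files in os.walk(folder):
        for filename in files:
            if not filename.endswith(extension):
                continue
            path = os.path.join(root, filename)
            # errors='replace' keeps the loop going if a book is not valid UTF-8
            with open(path, 'r', encoding='utf-8', errors='replace') as f:
                # Read line by line so a large book is never held in memory at once
                for line in f:
                    if word in line:
                        matches.append((filename, line.strip()))
    return matches
```

Returning a list instead of printing lets you decide afterwards whether to write the matches to a file, count them, or inspect them in the shell.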

What if you wanted to see where two strings meet in the same sentence across those thousands of books in your bookstore?

Let's say we want to find the sentences that mention both corona and virus, or both corona and disease.

We change this line of the code:

>>> if x in w:

by replacing it with if ' corona ' in w and ' disease ' in w: or if ' corona ' in w and ' virus ' in w:

It's important to keep the spaces around each word inside the string to avoid unexpected output, for example looking for corona and getting coronate or some other longer word that begins with corona. The comparison is also case-sensitive, so ' corona ' will not match ' Corona '.
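Padding the word with spaces has a limit: it misses the word at the very start or end of a line, or when it is followed by punctuation such as "corona,". One alternative, which is a different technique from the one used above, is a regular expression with word boundaries. A minimal sketch; the function name contains_all_words is my own.

```python
import re

def contains_all_words(sentence, words):
    """True if every word appears in sentence as a whole word, ignoring case."""
    return all(
        # \b matches only at the edge of a word, so 'corona' does not match 'coronate'
        re.search(r'\b' + re.escape(word) + r'\b', sentence, flags=re.IGNORECASE)
        for word in words
    )

print(contains_all_words('Corona, the virus, spread fast.', ['corona', 'virus']))     # True
print(contains_all_words('They will coronate the virus king.', ['corona', 'virus']))  # False
```

In the search function, the test on each line would then become if contains_all_words(w, ['corona', 'virus']):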

In my next post, I will share my experience with sentence analysis using the numpy and pandas packages.

