I did a lot of experiments to decide the best way to go about it. I tried most of the open-source named entity recognition models to compare which worked best, but in the end, I decided that none were good enough. Luckily for us, the Harry Potter fandom page contains a list of characters in the first book. We also know in which chapter each character first appeared, which will help us disambiguate the characters even further.

Armed with this knowledge, we will use spaCy's rule-based matcher to find all mentions of a character. Once we have found all the occurrences of entities, the only thing left is to define the co-occurrence metric and store the results in Neo4j. We will use the same co-occurrence threshold as was used in the Game of Thrones extraction: if two characters appear within 14 words of each other, we will assume they have interacted somehow and store the number of those interactions as the relationship weight.

The steps include preprocessing the book text (co-reference resolution) and entity recognition with spaCy's rule-based matching. I have prepared a Google Colab notebook if you want to follow along.

As mentioned, we will begin by scraping the characters in the book Harry Potter and the Philosopher's Stone. The list of characters by chapter is available under the CC-BY-SA license, so we don't have to worry about any copyright infringement. We now have a list of characters with information on the chapter in which each first appeared.
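To make the co-occurrence rule concrete, here is a minimal sketch in plain Python. It is not the code from the notebook. It assumes the entity matching step has already produced a list of `(token_index, character_name)` pairs, and it reads "within 14 words" as a token-index difference of at most 14; the function name `cooccurrence_weights` and the sample mentions are made up for illustration. Storing the result in Neo4j is left out.

```python
from collections import Counter


def cooccurrence_weights(mentions, max_distance=14):
    """Count how often two different characters are mentioned close together.

    mentions: iterable of (token_index, character_name) pairs (hypothetical
    output of the entity matching step).
    max_distance: largest token-index difference that still counts as an
    interaction (assumed interpretation of "within 14 words").
    """
    ordered = sorted(mentions)  # sort by token position
    weights = Counter()
    for i, (pos_a, name_a) in enumerate(ordered):
        for pos_b, name_b in ordered[i + 1:]:
            if pos_b - pos_a > max_distance:
                break  # later mentions are even further away
            if name_a != name_b:
                # Sort the pair so (A, B) and (B, A) share one key.
                weights[tuple(sorted((name_a, name_b)))] += 1
    return weights


# Made-up mentions, only to show the shape of the input and output.
sample_mentions = [
    (0, "Harry"),
    (5, "Ron"),
    (12, "Hermione"),
    (40, "Harry"),
    (50, "Hagrid"),
]
weights = cooccurrence_weights(sample_mentions)
for (first, second), weight in sorted(weights.items()):
    print(first, second, weight)
```

Each key in `weights` is an undirected character pair, and the count is what would be stored as the relationship weight. Sorting the pair is one possible design choice; it treats the interaction as symmetric.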