Context analyzers for news stories?

I'm not really an expert or anything, so there might be a better way, but it sounds like the problem can be split into three steps: convert each article's title and content to a string, process the string, and organize the output into a useful form.

Scraping article content isn't really a mystery. Just google how to do it, then turn your scraper loose on the news sources you think are relevant.

The second step is obviously easier said than done. I would approach it by generating a keyword list and searching for keywords in articles. You can manually generate this list or you can use some articles you think exemplify what you are looking for and generate the keyword list from that. Or both.
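Here's a rough sketch of what I mean, not a finished tool. It builds a keyword list from a few example articles by counting the most frequent words, then checks which keywords show up in a new article. The function names and the tiny stop-word list are made up for illustration, and it only handles single-word keywords.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
              "for", "is", "it", "its", "with", "at", "by"}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def keywords_from_examples(example_articles, top_n=10):
    """Return the top_n most frequent non-stop-words across the examples."""
    counts = Counter()
    for article in example_articles:
        counts.update(w for w in tokenize(article) if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]

def find_keywords(article, keywords):
    """Return the set of (single-word) keywords that appear in the article."""
    words = set(tokenize(article))
    return {k for k in keywords if k.lower() in words}
```

You could merge the generated list with a hand-written one before calling `find_keywords`, which covers the "or both" case.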

Organizing the output is trivial, but you have a few options depending on how you complete the previous step. If you stop searching as soon as you find one keyword, you can just list the matching articles in no particular order. If you find all the keywords in each article, you can sort the list by what you guess to be the most relevant, tag each article with its keywords, etc.
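As a sketch of the second option, assuming your articles are in a dict of title to text and your keywords are lowercase single words (both are my assumptions), you could score each article by total keyword hits and tag it with the keywords it contains. Using the hit count as the relevance guess is just one simple choice.

```python
import re

def keyword_counts(text, keywords):
    """Count how often each keyword occurs as a whole word in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {k: words.count(k) for k in keywords if k in words}

def rank_articles(articles, keywords):
    """Return (title, score, tags) tuples, highest score first.

    Articles with no keyword hits are left out entirely.
    """
    results = []
    for title, text in articles.items():
        counts = keyword_counts(text, keywords)
        if counts:
            results.append((title, sum(counts.values()), sorted(counts)))
    results.sort(key=lambda r: r[1], reverse=True)
    return results
```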

By the way, there are a few very simple optimizations you could do to greatly speed this up. For example, an article is very unlikely to mention your CEO or other executives without also mentioning the company name, so you only need to search for people's names when the company is mentioned. That will greatly reduce the lookup time of the initial keyword search. Be careful with this though: just because it's unlikely doesn't mean it's impossible.
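A minimal sketch of that shortcut is below. The function names are hypothetical, and it uses plain case-insensitive substring matching to keep things short.

```python
def mentions(text, terms):
    """Return the terms found in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return [t for t in terms if t.lower() in lowered]

def search_article(text, company_names, people_names):
    """Search for people only when a company name is present."""
    companies = mentions(text, company_names)
    if not companies:
        # Shortcut: skip the people search. This misses any article that
        # names an executive without naming the company.
        return []
    return companies + mentions(text, people_names)
```

For instance, `search_article("Jane Doe spoke at a conference", ["Acme"], ["Jane Doe"])` returns an empty list, which is exactly the kind of miss to watch out for.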

To be honest, I wouldn't do any optimizations unless it takes a really long time to run. Odds are this won't take very long. I once wrote a program in Python that listed every word in a text file, counted the number of occurrences and displayed the top words, calculated average paragraph length and word length, estimated the reading grade level, and displayed a graph of quotations per page (to give an idea of when dialogue was happening). It took a couple of minutes to run through the King James Bible, so I'd be shocked if your program took very long to run.
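To give an idea of how little code the word-counting part needs, here's a small sketch. It isn't the program I wrote back then, just the top-words and average-word-length pieces redone with `collections.Counter`, and the function name is made up.

```python
import re
from collections import Counter

def text_stats(text, top_n=5):
    """Return the top_n most common words and the average word length."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return counts.most_common(top_n), avg_len
```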

/r/Python Thread