Multi-threading vs Multi-processing. Databases. Adapting a Python script to utilise EC2 instances.

I know that SQLite isn't necessarily the most performant database, and I wondered if I should try to migrate to something else. I'm not really sure how necessary this is. I'm using SQLAlchemy to query the database.
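For context, a minimal sketch of the kind of SQLAlchemy read being described; the passages table, its columns, and the database filename are assumed stand-ins, not details from the thread:

    from sqlalchemy import create_engine, text

    engine = create_engine("sqlite:///passages.db")

    with engine.connect() as conn:
        # Stream rows instead of loading the whole table into memory.
        for row in conn.execute(text("SELECT id, body FROM passages")):
            print(row.id, row.body)  # stand-in for the real per-passage work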

If you can avoid using a database at all, it might be faster. There isn't much advantage to using a database here unless this is meant to be a persistent process with lots of data continually coming in that needs to be processed. It sounds like you could just store each passage on a single line (or delimited in some other fashion) in a text file on disk, then read one line at a time and feed it to your multiprocessing (work queue?) architecture to do the work.

I haven't benchmarked a DB query against a sequential file read like this, but I'm almost certain reading from the file will be significantly faster. I'm assuming you would do this step in a linear, single-process way; your multiprocess workers almost certainly won't be able to keep up with a single-process read anyway, so there's no point spawning multiple processes for this step.
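A minimal sketch of that single-reader, multi-worker layout, assuming one passage per line in a passages.txt file; analyse() is a hypothetical stand-in for whatever the real per-passage work is:

    from multiprocessing import Pool

    def analyse(passage):
        # Hypothetical stand-in for the real per-passage work.
        return len(passage.split())

    def read_passages(path):
        # Single-process reader: lazily yields one passage per line.
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n")

    if __name__ == "__main__":
        with Pool() as pool:
            # imap() consumes the generator lazily; chunksize cuts down
            # on inter-process communication overhead.
            for result in pool.imap(analyse, read_passages("passages.txt"), chunksize=100):
                print(result)

pool.imap() keeps the reading in a single process and streams work out to however many workers Pool() spawns (one per CPU core by default).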

I'm curious about how best to store the results. Can I continue to write to a text file, or would it be better to connect to another database and store the results there?

I guess it depends on the end use. Do you, or other people, need the results in a database for lots of read/write operations? If this is a one-off processing task, then I would just store the results as a text file.
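If a text file is enough, the results can be written from the parent process in the same loop, which sidesteps concurrent-write problems entirely; this reuses analyse() and read_passages() from the sketch above, and results.txt is just a placeholder name:

    from multiprocessing import Pool

    if __name__ == "__main__":
        # Reuses analyse() and read_passages() from the earlier sketch.
        # The parent process is the sole writer, so plain writes are
        # safe even though the per-passage work runs in parallel.
        with Pool() as pool, open("results.txt", "w", encoding="utf-8") as out:
            for result in pool.imap(analyse, read_passages("passages.txt"), chunksize=100):
                out.write(f"{result}\n")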
