
Question

Right now, I have a system that generates about 20 million objects per day. I want to retain a year's worth of data. I need to store these so a data team can analyze them and make predictions based on them. Since the volume is so high, I can tolerate some loss of objects; it isn't a big deal. In fact, the data team regularly undersamples the data anyway to make it workable.

The objects are streamed from servers to a single machine over TLS connections. The client is written in Python but could be rewritten at any point.

My first attempt at storing these objects was writing JSON into a file on the disk. The files are rotated every 30 minutes so we don't wind up with huge files or a large number of small files.

This became very large very quickly, so I switched to writing compressed files using the bz2 module. This saves more than 90% of the disk space, since the JSON is highly compressible.
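
A minimal sketch of that rotate-and-compress approach (the file naming and rotation details here are illustrative, not the exact production code):

```python
import bz2
import json
import os
import time

ROTATE_SECONDS = 30 * 60  # rotate every 30 minutes


class RotatingBz2Writer:
    """Append JSON objects, one per line, to a bz2-compressed file
    that is rotated on a fixed time interval."""

    def __init__(self, directory="."):
        self.directory = directory
        self.opened_at = None
        self.fh = None

    def _rotate_if_needed(self):
        now = time.time()
        if self.fh is None or now - self.opened_at >= ROTATE_SECONDS:
            if self.fh is not None:
                self.fh.close()
            name = time.strftime("objects-%Y%m%d-%H%M%S.jsonl.bz2",
                                 time.gmtime(now))
            self.fh = bz2.open(os.path.join(self.directory, name),
                               "wt", encoding="utf-8")
            self.opened_at = now

    def write(self, obj):
        """Serialize one object and append it to the current file."""
        self._rotate_if_needed()
        self.fh.write(json.dumps(obj) + "\n")

    def close(self):
        if self.fh is not None:
            self.fh.close()
            self.fh = None
```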

Unfortunately, this means reading anything from the files requires decompression first. So my third attempt writes an additional uncompressed file with summary data. This is basic stuff like the timestamps of the first and last messages in the file.
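
A sketch of what such a sidecar file might look like, with illustrative field names and file suffix:

```python
import json


def write_summary(data_path, first_time, last_time, count):
    """Write an uncompressed sidecar file next to a compressed data file.
    The field names and the '.summary.json' suffix are illustrative."""
    summary = {
        "data_file": data_path,
        "first_time": first_time,  # timestamp of the first message in the file
        "last_time": last_time,    # timestamp of the last message in the file
        "count": count,
    }
    with open(data_path + ".summary.json", "w", encoding="utf-8") as fh:
        json.dump(summary, fh)
```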

The data team is using the summary files to determine which files to process. This obviously speeds things up a great deal, since they can often eliminate 90% of the dataset before they even start decompressing files. After that point, though, answering any question still involves scanning the whole file. Obviously, I can add more summary data to allow more questions to be answered without decompression.
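
A sketch of that selection step, assuming the sidecar format above:

```python
import glob
import json


def files_for_window(window_start, window_end, summary_glob="*.summary.json"):
    """Return the compressed data files whose [first_time, last_time] range
    overlaps the query window, so only those need to be decompressed."""
    matches = []
    for path in glob.glob(summary_glob):
        with open(path, encoding="utf-8") as fh:
            summary = json.load(fh)
        if (summary["last_time"] >= window_start
                and summary["first_time"] <= window_end):
            matches.append(summary["data_file"])
    return matches
```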

At this point, I'm damn close to writing a pseudo-database. I don't really want to go down that path yet as I'm hoping it isn't necessary.

So far I think my requirements are the following

I need to persist objects to disk. Essentially, each file would just be a massive array with no real ordering among the objects. An object is a key-value mapping from a set of keys to a set of values (1 to 1). The only types I care about are strings, lists, dictionaries, integers, and floating point numbers.
I need to be able to index the file somehow, preferably when the file is closed.
I need to be able to query the file using some sort of language and get an answer using those indices.
I don't need mutability, but it doesn't matter if a particular solution allows for it.

Is there anything that can do something like this?

Explanation / Answer

You could use a compressed table with MySQL/InnoDB:

Persisted to disk
Compressed
Indexes
SQL queries

It is mutable, though. Performance could probably be better with a non-mutable (append-only) database engine.
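
A minimal sketch of that setup, assuming the PyMySQL driver and illustrative database, table, and column names (compressed InnoDB tables need innodb_file_per_table, which is on by default in recent MySQL versions):

```python
import json
import pymysql  # driver choice is an assumption; any MySQL DB-API driver works

# Illustrative schema: one row per object, compressed InnoDB storage, and an
# index on the message timestamp so time-range queries use the index instead
# of scanning every row.
DDL = """
CREATE TABLE IF NOT EXISTS objects (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    created DATETIME NOT NULL,
    payload MEDIUMTEXT NOT NULL,
    KEY idx_created (created)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8
"""

conn = pymysql.connect(host="localhost", user="ingest",
                       password="secret", database="telemetry")
with conn.cursor() as cur:
    cur.execute(DDL)
    # Store each object as its JSON-serialized key/value mapping.
    obj = {"source": "server-1", "value": 42.0}
    cur.execute(
        "INSERT INTO objects (created, payload) VALUES (%s, %s)",
        ("2016-01-01 12:00:00", json.dumps(obj)),
    )
conn.commit()
conn.close()
```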
