Most Java file processing solutions either involve a lot of boilerplate or don’t handle concurrency, backpressure, or metrics well out of the box. I needed something fast, clean, and production-friendly — so I built this.
Key features:
Multi-threaded line/batch processing using a configurable thread pool
Producer/consumer model with built-in backpressure
Buffered, asynchronous writing with optional auto-flush
Live metrics: memory usage, throughput, thread times, queue stats
Simple builder API — minimal setup to get going
Output metrics to JSON, CSV, or human-readable format
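As a rough illustration of the producer/consumer model with backpressure mentioned above, here is a minimal JDK-only sketch. It is not Samchika's actual implementation: the class name `BackpressureSketch`, the method `process`, and its parameters are assumptions made up for this example. The idea is that a bounded `ArrayBlockingQueue` makes the producer block in `put` whenever the workers fall behind.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.ToIntFunction;

// Hypothetical sketch, not the library's code.
final class BackpressureSketch {
    // Distinct sentinel object telling a worker that no more lines will arrive.
    private static final String POISON = new String("POISON");

    /** Feeds lines through a bounded queue to worker threads; returns the sum of handler results. */
    static long process(List<String> lines, int workers, int capacity, ToIntFunction<String> handler) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        AtomicLong total = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            for (int i = 0; i < workers; i++) {
                pool.execute(() -> {
                    try {
                        while (true) {
                            String line = queue.take();
                            if (line == POISON) {
                                return; // identity check, so a real line "POISON" is still processed
                            }
                            total.addAndGet(handler.applyAsInt(line));
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            for (String line : lines) {
                queue.put(line); // blocks while the queue is full: this is the backpressure
            }
            for (int i = 0; i < workers; i++) {
                queue.put(POISON); // one sentinel per worker
            }
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return total.get();
    }
}
```

For example, `BackpressureSketch.process(List.of("a", "bb", "ccc"), 2, 1, String::length)` sums the line lengths with two workers and a queue of capacity 1. The sketch does not handle a handler that throws; a production version would need to.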
Use cases:
Large CSV or log file parsing
ETL pre-processing
Line-by-line filtering and transformation
Batch preparation before ingestion
I’d really appreciate your feedback — feature ideas, performance improvements, critiques, or whether this solves a real problem for others. Thanks for checking it out!
- https://github.com/MayankPratap/Samchika/blob/ebf45acad1963d...
for(int i=0;i<10000; ++i){
    // do nothing just compute hash again and again.
    hash = str.hashCode();
}
"do nothing" is correct, "again and again" not so much. Java caches the hash code for Strings, and since the JIT knows that (at least in recent versions[1]), it might even remove this loop entirely.
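If the goal of that loop is to simulate CPU work, one option is to recompute the hash by hand and feed each round into a returned value. The following is only a sketch under that assumption; `HashWorkSketch`, `manualHash`, and `simulateWork` are hypothetical names, not part of the project. It makes the work harder for the JIT to discard but does not guarantee it, so a harness such as JMH is still the safer choice for real measurements.

```java
// Hypothetical sketch, not taken from the repository.
final class HashWorkSketch {
    /** Recomputes the documented String hash (h = 31*h + c) without relying on the cached value. */
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    /** Runs `rounds` passes and returns a value that depends on every pass. */
    static int simulateWork(String s, int rounds) {
        int acc = 0;
        for (int r = 0; r < rounds; r++) {
            acc = 31 * acc + manualHash(s) + r; // result is used, so the pass is not dead code
        }
        return acc;
    }
}
```

Returning the accumulated value (and using it at the call site) is what keeps the loop from being trivially dead code.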
- Guys. I love you all. I did not expect such quality feedback.
I will try to incorporate most of your feedback. Your comments have given me much to learn.
This project was started to just learn more about multithreading in a practical way. I think I succeeded with that.
- A note on the name.
The nasal "m" takes on the form of the nasal in the row/class of the letter that follows it. As "ñ" is the nasal of the "c" class, the "m" becomes "ñ".
Writing Sanskrit terms using the roman script without using something like IAST/ISO-15919 is a pain in the neck. They are going to be mispronounced one way or the other. I try to get the ISO-15919 form and strip away everything that is not a-z.
So, सञ्चिका (sañcikā) = sancika
You probably want to keep the "ch," as the average English speaker is not going to remember that the "c" is the "ch" of "cheese" and not "see."
- It would be even more amazing if it had tests. It's already pretty good.
- Perhaps I misunderstand something but doesn't reading from a file require a system call? And when there is a system call, the context switches? So wouldn't using multiple threads to read from a file mean that they can't really read in parallel anyway because they block each other when executing that system call?
- Am I wrong in thinking that this is duplicating lines in memory repeatedly when buffering lines into batches, and then submitting batches to threads? And then again when calling the line processor? Seems like it might be a memory hog
- I have a CONTRIBUTING.md with guidelines for pull requests, in case any of you can spare some of your precious time to make changes to the library.
- Do you have a benchmark comparison with other similar tools?
- Does it handle line breaks inside quotes in CSV? Frankly, I don't think it's possible to reliably process CSV in a multi-threaded manner.
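To make the concern concrete, here is a small sketch assuming RFC 4180-style quoting, where a newline inside double quotes belongs to the field. `CsvRecordSketch` and `splitRecords` are hypothetical names and say nothing about how the library itself behaves. The quote state at any position depends on every character before it, which is why the input cannot simply be cut at arbitrary newlines and handed to threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits CSV text into records, ignoring newlines inside quotes.
final class CsvRecordSketch {
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles twice, leaving the state unchanged.
                inQuotes = !inQuotes;
            }
            if (c == '\n' && !inQuotes) {
                records.add(current.toString()); // record boundary
                current.setLength(0);
            } else {
                current.append(c); // includes newlines that sit inside quotes
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // last record without trailing newline
        }
        return records;
    }
}
```

For the input `"id,note\n1,\"hello\nworld\"\n2,plain\n"`, a plain `split("\n")` yields four pieces, while `splitRecords` yields three records. Splitting into records sequentially and then processing the records in parallel is one way to combine the two.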
- Please don't do this.
Have the OS handle memory paging and buffering for you and then use Java's parallel algorithms to do concurrent processing.
Create a "MappedByteBuffer" and mmap the file into memory.
If the file is too large, use an "AsynchronousFileChannel" and asynchronously read + process segments of the buffer.
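A minimal sketch of the mmap approach, assuming a file smaller than 2 GiB and Java 11 or later. `MmapSketch`, `countNewlines`, and `demo` are hypothetical names chosen for this example, and counting newline bytes stands in for whatever per-line processing is actually needed.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.stream.IntStream;

// Hypothetical sketch of mapping a file and scanning it with a parallel stream.
final class MmapSketch {
    /** Counts '\n' bytes in a file (under 2 GiB) by mapping it read-only. */
    static long countNewlines(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            if (size > Integer.MAX_VALUE) {
                throw new IllegalArgumentException("file too large for a single mapping");
            }
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
            // Absolute get(int) leaves the buffer position untouched, so the
            // parallel readers do not interfere through shared position state.
            return IntStream.range(0, (int) size)
                    .parallel()
                    .filter(i -> buffer.get(i) == '\n')
                    .count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Usage example: writes `content` to a temp file and counts its newlines. */
    static long demo(String content) {
        try {
            Path tmp = Files.createTempFile("mmap-sketch", ".txt");
            tmp.toFile().deleteOnExit();
            Files.writeString(tmp, content);
            return countNewlines(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Larger files would need several mappings or the segment-wise reads suggested above; this sketch does not cover that case.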
- An ArrayList for huge numbers of add operations is not performant. LinkedList will see your list throughput performance at least double. There are other optimisations you can do but in a brief perusal this stood out like a sore thumb.