I have a file with 65,000,000 lines, that is about 2gb in size.

I want to read this file in N lines at a time, perform a db insert operation, and then read the next N, with N being, say, 1000 in this case. Insert order doesn't matter, so synchronous is fine.

What's the best way of doing this? I've only found was to either load in 1 line at a time, or methods that read the whole file into memory. Sample code below, that I've been using to read the file one line at a time. :

var singleFileParser = (file, insertIntoDB) => {
    var lr = new LineByLineReader(file);
    lr.on('error', function(err) {
        // 'err' contains error object
        console.error(err);
        console.error("Error reading file!");
    });

    lr.on('line', function(line) {
        insertIntoDB(line);
    // 'line' contains the current line without the trailing newline character.
    });

    lr.on('end', function() {
        // All lines are read, file is closed now.
    });
};
    
Have you thought about writing a little script to divide the file into smaller pieces? Just a thought. – archae0pteryx 5 hours ago

2 Answers 2

Lines can only be parsed one at a time by someone. So, if you want 10 at once, then you just collect them one at a time until you have collected 10 and then process the 10.

I did not think Jarek's code quite worked right so here's a different version that collects 10 lines into an array and then calls dbInsert():

var tenLines = [];
lr.on('line', function(line) {
    tenLines.push(line);
    if (tenLines.length === 10) {
        lr.pause();
        dbInsert(<yourSQL>, function(error, returnVal){
            if (error) {
                // some sort of error handling here
            }
            tenLines = [];
            lr.resume();
        });
     }
});
// process last set of lines in the tenLines buffer (if any)
lr.on('end', function() {
    if (tenLines.length !== 0) {
        // process last set of lines
        dbInsert(...);
    }
});

Jarek's version seems to call dbInsert() on every line event rather than only every 10th line event and did not process any left over lines at the end of the file if they weren't a perfect multiple of 10 lines long.

    
Yes, it had that problem. I made the exact change you did. Thanks! – JonLuca 4 hours ago
    
@JonLuca - I made an additional edit because the previous code was not processing the last set of lines unless it happened to be an exact multiple of 10. I also removed the cnt variable since it is not necessary as we can just use tenLines.length as the counter. – jfriend00 4 hours ago

Something like this should do

var cnt = 0;
var tenLines = [];
lr.on('line', function(line) {
    tenLines.push(line);
    if (++cnt >= 10) {
         lr.pause();
         // prepare your SQL statements from tenLines
         dbInsert(<yourSQL>, function(error, returnVal){
            cnt = 0;
            tenLines = [];
            lr.resume();
        });
     }
});
    
@JonLuca - Yep, all lines have to be parsed one at a time by someone, so if you want 10 before you do a database operation, then just collect lines until you have 10 and then do your database op. – jfriend00 5 hours ago
1  
Wonderful, works extremely well! Can't believe I didn't think of this. Was still thinking about being thread safe, forgot Node is single threaded. Thanks! – JonLuca 5 hours ago
    
@JonLuca - Hmmm. I'm thinking this has a race condition. When you call dbInsert(), it can be sometime in the future before the completion callback is called. That means other lr.on(...) events can flow. Then, the completion callback will arrive and you will trouce both the cnt and tenLines array. You need to grab your ten lines (probably copy them into another array) and set the cnt back to zero and the array back to empty BEFORE you start your db operation. – jfriend00 5 hours ago
    
I don't see that. I am assuming that when lr.pause() is called no events will fire on lr.on(...) until lr.resume() is called. So, then next time you see a line it should be after cnt=0; tenLines = []; have been executed in the callback. Let me think about it though. – Jarek Kulikowski 4 hours ago
    
OK, I didn't see the lr.pause() (that's kind of a semi-hidden way to code it). But, how does this not call dbInsert() on every line event? Don't you need an if statement that only calls dbInsert() when cnt === 10? – jfriend00 4 hours ago