I think that the initial post is missing some information:
- what database are you using? - maybe it is something specific to a given db. SQLite works differently than MySQL.
- how many rows? - if it is below 100k then I would not expect any major problems. 500k? This is the range where things start to matter. Though, it also depends on the time budget. If you insert 10 million rows and your budget is 1 second, then 'it will not work'.
- do you insert a single row at a time? - your code suggests so: `element` (singular) as a variable name. This is definitely NOT what you want to do. I think that if you were doing things in batches then you would have mentioned it in the post, so I assume: 'no'.
- do files/rows need to be inserted in a specific order? - you mention parallel tasks, so I assume 'no'. This is very important for design/performance of the whole operation.
- multiple files with hundreds of rows OR a single file with millions of rows? - having multiple files allows you to process multiple files at the same time.
In general SQL is designed to process multiple rows at the same time. If your db supports BULK INSERT (or similar operations) then go for it. Most databases already have some solution for 'inserting millions of rows very fast'. If your driver supports it then just use it.
Otherwise you should try to reduce round trips to the server - insert multiple rows at the same time. This also means that you are 'compressing' (or more precisely: putting rows into a SQL protocol data frame) multiple rows at a time. From my experience decreasing the number of operations is more important than the size of a single operation. This means that it is better (by a lot) to do 1000 operations with 1000 rows each than 1_000_000 operations with 1 row each. You mention that the compression is a problem, but are you already batching stuff, or just sending a single row at a time?
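A minimal sketch of what I mean by batching (the `Connection` protocol, `execute` method and `logs` table are made up for illustration - substitute whatever your driver exposes; the point is one statement with many value tuples instead of one statement per row):

```swift
// Hypothetical driver API - your driver will have its own.
protocol Connection {
    func execute(_ sql: String, parameters: [Any]) async throws
}

/// One round trip for the whole batch:
/// INSERT INTO logs (ts, message) VALUES (?, ?), (?, ?), …
func insertBatch(_ rows: [(ts: Int, message: String)], on db: Connection) async throws {
    guard !rows.isEmpty else { return }
    // Mind your database's limit on parameters per statement
    // (for example SQLite caps the number of '?' placeholders).
    let tuples = Array(repeating: "(?, ?)", count: rows.count).joined(separator: ", ")
    let sql = "INSERT INTO logs (ts, message) VALUES \(tuples)"
    let parameters: [Any] = rows.flatMap { [$0.ts, $0.message] }
    try await db.execute(sql, parameters: parameters)
}
```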
Also, you mention that the 'database is thread safe', but is your database driver thread safe? Can the driver properly 'do many things at the same time'? I have seen a case where the driver opened multiple connections to the same db, which promoted the transaction to a distributed transaction (!). Though this would not be a 'compression' problem.
Tbh. whatever you do on the box in terms of CPU should not matter, because in just a second you will be doing networking IO. Usually insertion speeds should be in tens of thousands of rows per second (at least; this number is for a VERY weak box). Have you tried to import similar data using some different tool? What were the times? Maybe it is a bug in the SQL driver?
IMPORTANT: You may need to do the whole operation in a transaction! I don't see you mentioning it.
IMPORTANT-2: In certain kinds of business you are not allowed to do things in parallel, because this may reorder operations.
sliemeobn (Simon Leeb) wrote:
> I noticed your `db.store` is not async -> thou shalt not block on a concurrency runtime thread ; )
How can you make a `sync` operation exposed by a database driver `async`? At some point it has to block:
- Driver sends data
- Driver waits for ACK

Your driver needs to support `async` for the whole operation to be `async`. In local terms: if `read` is sync then you can't make it `async`; as soon as you call the function it will block the thread.
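For completeness: the usual workaround is to park the blocking call on a thread you own, so it at least does not occupy one of the cooperative pool's threads. It is still not 'real' `async` - some thread is blocked for the whole call. A minimal sketch, where `blockingStore` is a hypothetical stand-in for the driver's sync call (and `Row` is the row type from the pseudo-code below):

```swift
import Dispatch

// Hypothetical stand-in for the driver's sync call.
func blockingStore(rows: [Row]) { /* send data, wait for ACK */ }

func store(rows: [Row]) async {
    await withCheckedContinuation { continuation in
        // A thread from the global pool is blocked for the whole call;
        // we only avoid blocking a cooperative runtime thread.
        DispatchQueue.global(qos: .utility).async {
            blockingStore(rows: rows)
            continuation.resume()
        }
    }
}
```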
When batching SQL writes the database should not be that big of a problem. Even a single thread should be able to perform 'sufficiently' well below 500k rows (though it also depends on the data; indexes (indices?) and triggers also have a cost). But you have not mentioned the database you are using / estimated number of rows / time budget, so it is hard to tell.

Pseudo-code:
```swift
func main(files: [File]) async {
    var timer = Timer()
    timer.start()

    // If you need everything in 1 transaction.
    // Though remember that long transactions are BAD!
    // (like really, really BAD)
    let db = Database(…)
    await db.beginTransaction()

    let results = await withTaskGroup(of: FileReadResult.self) { group in
        // Assuming that your business rules allow you to process files in parallel.
        // In some scenarios (banks, stocks, ERP, finance) this may not be allowed.
        //
        // I think that it is perfectly fine to start 20 tasks at the same time.
        // Maybe if you have more files then you should do something (counting semaphore?).
        // You are doing IO operations (file reads, db writes), those are a few
        // orders of magnitude heavier than `Task`.
        //
        // Also, do not use `processorCount`, the whole point of Swift structured
        // concurrency is to not care about such stuff.
        for file in files {
            group.addTask {
                let reader = FileReader(file: file, db: db)
                return await reader.run()
            }
        }

        // FileReadResult = statistics for a single file (row count, duration etc…).
        var acc = [FileReadResult]()
        for await result in group {
            acc.append(result)
        }
        return acc
    }

    await db.endTransaction()
    timer.end()

    let duration = timer.duration.secondsString
    // Log 'results/duration' somewhere.
    // How many rows in total? How long did it take? etc…
}

struct FileReadResult {
    /// How many rows were processed?
    let rowCount: Int
    /// How long did the read take?
    let duration: Duration
}

actor FileReader {
    private static let partitionCount = 5000

    private let file: File
    private let db: Database

    /// Read file and send to database.
    func run() async -> FileReadResult {
        var partition = [Row]()
        partition.reserveCapacity(Self.partitionCount)

        // Your file reading logic.
        for row in self.file {
            partition.append(row)

            if partition.count == Self.partitionCount {
                await self.db.insert(rows: partition)
                partition.removeAll(keepingCapacity: true)
            }
        }

        // Do not forget the last, partially filled partition.
        if !partition.isEmpty {
            await self.db.insert(rows: partition)
        }

        return …
    }
}

actor Database {
    /// Insert multiple rows at the same time.
    func insert(rows: [Row]) async {
        // Depends on the database.
        // Nothing that we can help with, if we do not know the db.
    }

    /// Remember to keep your transaction as small as possible!
    func beginTransaction() async {}

    /// In real life this may throw!
    func endTransaction() async {}
}
```
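Regarding the 'counting semaphore?' comment above: with a task group you do not actually need a semaphore. The usual pattern is to seed the group with N tasks and add the next file only when one finishes. A sketch, reusing the (hypothetical) `FileReader`/`FileReadResult` from the pseudo-code:

```swift
func process(files: [File], db: Database, maxConcurrent: Int = 20) async -> [FileReadResult] {
    await withTaskGroup(of: FileReadResult.self, returning: [FileReadResult].self) { group in
        var iterator = files.makeIterator()
        var acc = [FileReadResult]()

        // Seed the group with at most 'maxConcurrent' tasks.
        for _ in 0..<maxConcurrent {
            guard let file = iterator.next() else { break }
            group.addTask { await FileReader(file: file, db: db).run() }
        }

        // Every time a task finishes, start the next file (if any),
        // so at most 'maxConcurrent' files are in flight.
        while let result = await group.next() {
            acc.append(result)
            if let file = iterator.next() {
                group.addTask { await FileReader(file: file, db: db).run() }
            }
        }
        return acc
    }
}
```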