Importing .xyz files SLOW
cycloxslug65 post(s)
#25-May-22 21:16

We collected and processed bathymetric survey data in a sonar program called Hypack and exported that data into .xyz files (3x3 ft tiles) for a small lake. File sizes range from 80 MB to 400 MB, 6 files total, all local on a relatively fast SSD, 16 GB memory, Intel i7 @ 2.9 GHz. The 80 MB file took >4 min to import (and the progress bar looked like it wasn't doing anything for most of that time), but it did eventually show up. I'm 10 min into importing the 160 MB file.

This seems exceptionally slow...am I doing something wrong? I just dragged the file into the project pane (similar speed if I use the file>import command). I'm going to let it run overnight, but it would be really nice to harness the advertised speed of Manifold with large data files...

oeaulong

408 post(s)
#25-May-22 21:34

I feel that this may be typical for importing text-based files. There are a whole lot of conversions going on to read in the coordinates and convert the output to the appropriate types. I am afraid that you are, in fact, harnessing the speed of Mfd. Once the data is imported, working with it should be much more efficient. I think there are some discussions in the forum about image importing, especially for floating-point images, where the initial import is lengthy yet subsequent processing is dramatically faster. I cannot find them, though.

oeaulong

408 post(s)
#25-May-22 22:20

It appears from the Hypack manuals that the Export routine can write directly to .SHP format. I would do that to get your points, and once they are inside Manifold, you can interpolate them into an image component.

cycloxslug65 post(s)
#27-May-22 19:28

Sadly, our access to Hypack was limited to our on the water survey time, and the vendor recommended .xyz as the most portable output from that work. We are in the process of acquiring another sidescan/bathymetric software package (SonarWiz).

However, you and tjhb are on to something with importing as points... see below.

dale

591 post(s)
#03-Jun-22 04:45

off topic, another SonarWiz user here.

tjhb

9,976 post(s)
#25-May-22 21:38

You know that XYZ stores 3 text numbers for each pixel. How many pixels?

>4 min and >10 min don't sound like bad times to me.

Did you have other export options from Hypack?

Did you also try importing the XYZ file(s) as CSV?
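
As a rough way to answer the "how many pixels?" question without reading the whole file, one could estimate the record count from the file size and the average line length of a small sample. This is only a back-of-envelope sketch in Python: the function name `estimate_record_count` is made up for illustration, and it assumes one "x y z" record per line, ASCII text, and a single-byte newline.

```python
def estimate_record_count(sample_lines, file_size_bytes):
    """Estimate how many 'x y z' records a text file holds.

    sample_lines: a few representative lines from the file (no newlines).
    file_size_bytes: total size of the file on disk.
    Assumes one record per line and a 1-byte line terminator.
    """
    # Byte length of each non-empty sample line, plus 1 for the newline.
    sizes = [len(line.encode("utf-8")) + 1 for line in sample_lines if line.strip()]
    if not sizes:
        raise ValueError("need at least one non-empty sample line")
    average = sum(sizes) / len(sizes)
    return round(file_size_bytes / average)


if __name__ == "__main__":
    # Hypothetical sample line; real Hypack output may look different.
    sample = ["1234567.89 7654321.01 -12.34"]
    print(estimate_record_count(sample, 80 * 1024 * 1024))
```

Each of those records means three text-to-number conversions on import, which is where much of the time goes.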

cycloxslug65 post(s)
#27-May-22 19:44

Sadly, no access to Hypack - what I've got is what I've got, at least for the moment.

I did however take your suggestion to import as csv (through the not-really-intuitive link as data source method - not sure why the standard import dialog doesn't allow specification of the data structure), which allowed me to specify the delimiters (spaces).

This was MUCH faster (about 4x) and the progress bars actually showed progress! I used the workflow described here (https://manifold.net/doc/mfd9/example__import_csv_and_create_a_drawing.htm) to generate points and then the TRANSFORM>Interpolate command to generate my raster surface.

Incidentally, this seemed to work better than importing the .xyz files directly, because the xyz files generated sparse tiles (most pixels had no stored depth value - I'm sure there is an interpolation method but I couldn't figure it out from searching the manual).
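
For readers who want to see the idea behind the points-then-interpolate workflow outside Manifold, here is a minimal Python sketch. It is not what Manifold's Interpolate command does internally (the thread does not say which method it uses); inverse distance weighting is used here purely as a simple stand-in. The names `parse_points` and `idw` are hypothetical, and the parser assumes headerless lines of three whitespace-separated numbers.

```python
import math


def parse_points(text):
    """Parse whitespace-delimited 'x y z' lines into (x, y, z) tuples.

    Lines that do not have exactly three fields are skipped.
    Assumes no header row; a non-numeric field raises ValueError.
    """
    points = []
    for line in text.splitlines():
        fields = line.split()  # split() collapses runs of spaces and tabs
        if len(fields) != 3:
            continue
        x, y, z = map(float, fields)
        points.append((x, y, z))
    return points


def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate of z at (x, y).

    A stand-in interpolator for illustration only: it visits every point,
    so it is far too slow for millions of soundings.
    """
    numerator = 0.0
    denominator = 0.0
    for px, py, pz in points:
        distance = math.hypot(px - x, py - y)
        if distance == 0.0:
            return pz  # query location coincides with a measured point
        weight = 1.0 / distance ** power
        numerator += weight * pz
        denominator += weight
    return numerator / denominator


if __name__ == "__main__":
    pts = parse_points("0 0 10\n2   0 20\n")
    print(idw(pts, 1.0, 0.0))
```

Because the interpolator estimates a value at any location, it fills the cells that a direct XYZ-to-raster import leaves empty.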

adamw

10,011 post(s)
#31-May-22 13:32

In general, the reason XYZ is so slow is that the file does not tell us: 'hey, here is an image that is 2000x4000, and here are the pixel values, starting from top left and going right then down: v1 v2 v3 ...'. Instead, XYZ tells us: 'hey, here is an image but I won't tell you the dimensions or the distance between individual pixels, instead here is pixel 1: x1 y1 value1, and here is pixel 2: x2 y2 value2 (might be in a different corner), and here is pixel 3: x3 y3 value3 (might be in the middle of the image), etc, now go and make an image out of that'. So we end up reading the file into XYV tuples, then figuring out what is the minimum and maximum X and what are the distances between individual Xs and what is likely the intended pixel size (the minimum distance between subsequent Xs is a bad choice and sometimes backfires heavily), same for Y, then placing XYV tuples onto the image.

All that said, if you send the file you were trying to read to tech (tech@manifold.net), we will look into whether the time to read it is reasonable or whether some of the heuristics were way off and could be adjusted to work better.
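
The steps described above (read XYV tuples, work out the extent and a likely pixel size, then place the tuples onto an image) can be sketched in a few lines of Python. This is not Manifold's dataport code, only an illustration under stated assumptions: the names `parse_xyz`, `guess_step` and `build_grid` are hypothetical, everything is held in memory, and the pixel size is guessed as the most common gap between sorted unique coordinates, which is just one possible alternative to the minimum gap that the post says can backfire.

```python
from collections import Counter


def parse_xyz(text):
    """Read 'x y value' lines into a list of (x, y, v) float tuples."""
    tuples = []
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        if len(parts) < 3:
            continue  # skip blank or short lines
        tuples.append((float(parts[0]), float(parts[1]), float(parts[2])))
    return tuples


def guess_step(values):
    """Guess pixel size as the most common gap between unique coordinates."""
    unique = sorted(set(values))
    if len(unique) < 2:
        return 1.0  # a single row or column: any step will do
    # Round gaps so floating-point noise does not split identical gaps.
    gaps = Counter(round(b - a, 6) for a, b in zip(unique, unique[1:]))
    return gaps.most_common(1)[0][0]


def build_grid(tuples):
    """Place (x, y, v) tuples on a grid; row 0 is the top (largest y).

    Cells with no tuple stay None, which is how sparse tiles arise.
    """
    xs = [t[0] for t in tuples]
    ys = [t[1] for t in tuples]
    dx, dy = guess_step(xs), guess_step(ys)
    x_min, y_max = min(xs), max(ys)
    ncols = round((max(xs) - x_min) / dx) + 1
    nrows = round((y_max - min(ys)) / dy) + 1
    grid = [[None] * ncols for _ in range(nrows)]
    for x, y, v in tuples:
        col = round((x - x_min) / dx)
        row = round((y_max - y) / dy)
        grid[row][col] = v
    return grid, dx, dy


if __name__ == "__main__":
    # Records deliberately out of order, with two cells missing.
    sample = "0 0 1.0\n3 0 2.0\n0 3 3.0\n6 3 4.0\n"
    grid, dx, dy = build_grid(parse_xyz(sample))
    for row in grid:
        print(row)
```

Note that nothing can be placed until the whole file has been read, because the extent and the step are not known until the last record has been seen.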

Dimitri

7,025 post(s)
#26-May-22 02:12

I trust you're using Release 9, right? I think the "Importing and Linking" topic says it well...

Importing large files can take a long time because the imported data will be analyzed and stored in special, pre-computed data structures within the Manifold file that allow subsequent reads and writes to be very fast. It pays to be patient with such imports as once the data is imported and stored within a Manifold project file access to that data will usually be far faster than it was in the original format. Once imported the data will open instantly thereafter.

Having an ultra fast format is part of getting super Manifold speed. Manifold itself is a parallel, spatial database system. When you import data into a Manifold project you're importing that data into a really fast, parallel, spatial database.

The import takes some time initially, for the reasons oeaulong has given. Some interchange formats, which are brutally inefficient for operational work because they use text, etc., are indeed very slow to convert into high speed data form. They'd be very slow to load into Oracle, or SQL Server, or PostgreSQL too. Other formats, like sensible GIS formats, are much faster to import. But once the data is in Manifold you've cut the cord to slow formats and it is always fast.

If you doubt that, visit the examples web page and download the 200+ MB Aus_Hydro project with all hydrology for Australia. It's in compressed .mxb format. The first time you open it, Manifold will decompress it into .map format. Thereafter .map format is used.

The .map project file is over 600 MB in size. It opens instantly, as in 1/2 second if not faster.

cycloxslug65 post(s)
#27-May-22 19:59

Yes, I'm using Release 9. I was mostly surprised that I was getting no information that it was actually doing anything for the first couple minutes. I can open the .xyz in Notepad+ basically instantly and as far as I can tell .xyz is just a text file, no headers, etc. (which then requires processing).

It seems like Manifold would import the data quickly, then go slow on the processing to pixels. If I recall correctly, the progress bar typically seems to show multiple events (like copying then processing). I recognize it is hard to make progress info available...but when nothing shows up for minutes, it's hard to know at what point to give up or let it ride.

Dimitri

7,025 post(s)
#28-May-22 09:18

I was getting no information that it was actually doing anything for the first couple minutes.

? Didn't it show the import dialog?

I can open the .xyz in Notepad+ basically instantly

If you wanted Manifold to do only what Notepad+ does, Manifold also would open that .xyz for text editing very fast. But you want to do more than text editing, right?

Notepad+ is literally about a thousand times simpler than what is necessary for GIS, so the data structures it uses internally can be much simpler. When it comes to something like 9, the data structures are even more sophisticated than those of, say, PostgreSQL, which is terribly slow with rasters, or ArcGIS Pro, which takes forever to open a big project with many components and is also terribly slow with vectors and rasters.

It seems like Manifold would import the data quickly, then go slow on the processing to pixels.

That could be done if the path to very high speed storage was a simple matter of importing the data all at once and then doing some processing. Alas, that's not the case.

There's a lot of pre-computation that goes into storing data so that a project with hundreds of components that total hundreds of gigabytes can open instantly, store changes almost instantly, and instantly pop open components that are themselves over 100 gigabytes in size. Add to that the ability to efficiently do parallel work, for example, throwing a thousand GPU cores to do a task in seconds that takes hours or days in ArcGIS Pro, and all that is asking a lot of the data structures and access methods within Manifold's internal database. Converting data from ordinary formats into those data structures takes time.

That's especially true of plain text formats that have no intelligence to them, no spatial indexes, etc., that can be used to get a head start on organizing the data. The best way to process all that is not necessarily to first read all of the text, either.

But you only have to do all that once, when the data is imported. After that, it's much, much faster than leaving the data in dumb formats.

when nothing shows up for minutes, it's hard to know at what point to give up or let it ride.

The job for the dataport is to import data from a format accurately and to build internal structures within Manifold's database to allow that data to be operated on with full performance, and to do that job as fast as possible. Anything that slows the process down should be avoided.

For example, while it's nice to reassure beginners with a new format that 9 hasn't crashed, that's not worth doing if it slows the process down. It's not safe to assume that surfacing from what might be very complex internal processes, like recursion, to give an honest report of "Still working!" costs nothing.

A more efficient approach is to simply do the job as fast as possible. Most of the time imports happen so fast it doesn't matter. In those few cases of long imports, new users very quickly learn that 9 doesn't crash. They know if it pops open the dialog to start importing something, it's on the job and it will get it done.

As for reassuring people by issuing progress reports on different steps in the process, unless somebody understands how all the internal data structures work it seems unproductive to fire incomprehensible phrases at them. May as well just show a rotating circle "in progress" graphic or alternate between various phrases ("Still working!", "Gosh, this is taking a while!", "I'm doing my best!", "Good news! There's time for a coffee break!") so people are reassured it hasn't stopped working.

I'm personally not a big fan of progress bars, because they invite misunderstandings: most people look at a progress bar and expect to see a linear representation of a linear process, but many tasks in programs are not linear. You see that effect with Windows updates, where it may go from 5% done to 95% done in seconds and then you stare at 95% done for the next ten minutes.

That progress bars report non-linear phenomena is especially true when the same progress bar interface is used for what are hundreds of different dataports working on wildly varying data formats and servers, where there is very great variation in what has to be done to extract data and to structure it within Manifold's internal high performance database.

One more thing: if you do very many imports from some particular format that might be a rare format for other people, don't hesitate to read the advice on suggestions and then send in a request to speed that format import up. Something like "XYZ" format is really a family of formats where there might be opportunities for optimization that are tuned to specific types of vector or raster data stored in one of those formats.

The process for developing new dataports is also non-linear, especially for those dataports that are rarely used or which have very little sample data to them. The first versions of rarely-used dataports tend to focus on reliability and quality. They don't invest vast amounts of engineering time to increase the speed of something that is rarely used. When many people use a dataport, it's natural to apply extra effort for optimizations which might speed up the dataport.

PBF, for example, is a notoriously slow format but was rarely used. The dataport for importing PBF started out being very reliable but slow. Once people started using PBF a lot more, Manifold returned to the PBF dataport and tuned it. That happened twice, if I recall correctly, with each new iteration increasing speed. The latest PBF dataport is much faster than the original version.

mdsumner


4,246 post(s)
#02-Jun-22 12:00

that's crazy slow 🙏


https://github.com/mdsumner

mdsumner


4,246 post(s)
#03-Jun-22 01:46

slower than I expected, but 45s in R to read and convert from 2.1Gb CSV of ETOPO2 10800x5400, all assumed double floating point columns 'x, y, etopo', converted to an in memory raster array of 450Mb

I expect it comes down to whether the grid really is regular and collapses to a simple extent, but 10 min is a lot slower than this for a smaller data set.
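
To make the "regular grid" point concrete, here is a small Python sketch, not the R code used for the timing above. It checks whether a coordinate column is evenly spaced, in which case the column collapses to an origin, a step and a count, and it works out the in-memory size of a 10800x5400 array of 8-byte doubles. The names `is_regular` and `raster_bytes` are hypothetical, and the tolerance is an arbitrary assumption.

```python
def is_regular(values, tol=1e-9):
    """True if the unique coordinate values are evenly spaced (within tol)."""
    unique = sorted(set(values))
    if len(unique) < 2:
        return True
    step = unique[1] - unique[0]
    return all(abs((b - a) - step) <= tol for a, b in zip(unique, unique[1:]))


def raster_bytes(ncols, nrows, bytes_per_cell=8):
    """Memory needed for a dense raster; 8 bytes per cell assumes doubles."""
    return ncols * nrows * bytes_per_cell


if __name__ == "__main__":
    print(is_regular([0.0, 3.0, 6.0, 3.0, 0.0]))  # evenly spaced columns
    print(is_regular([0.0, 3.0, 7.0]))            # irregular spacing
    size = raster_bytes(10800, 5400)
    print(size, "bytes, about", round(size / 2**20), "MiB")
```

The size works out to 466,560,000 bytes, which is consistent with the roughly 450 MB in-memory array quoted above.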



Manifold User Community Use Agreement Copyright (C) 2007-2021 Manifold Software Limited. All rights reserved.