must be to take an arbitrary-though-not-random-nor-representative subset of 50000 records.
Adam has discussed this but let me take a cut at it as well... Suppose you have a table with 50 million records or a few billion records. Whatever the number, just make it big enough so everyone agrees (like in that thread about tables the size of the Earth) that it is completely out of the question to get your head around the table by browsing. Let's call that a "big table." What is "representative" of a big table? Can you tell by browsing a window that claims to include all of the records in that big table? No, because by definition you cannot characterize a big table by browsing it. So, how do you get a "representative" sample of a big table if browsing won't do the trick? That depends upon what you mean by "representative." "Representative" means different things to different people, usually at different times based upon the different things they happen to be doing. Quite often, when engaged in a particular analytic task a key part of that task itself is to decide exactly what is "representative" of the data, distilling the essence of it. In such cases you don't know what is "representative" unless you accomplish much of the task you have set out to do. People might apply various sophisticated means, like Adam discusses, to construct what they consider to be a "representative" subset. But with a big table they're not going to do that by interactive browsing. By definition a big table is too big for that. Reduce it by some means, such as SQL, to a mere 500,000 records and you still won't know by browsing if that is "representative" or not. You just cannot humanly browse enough records to say. The idea of tables being so big that methods we as ordinary humans expect to be useful are not useful at all, goes against our ordinary human experience. It really is impossible to get your head around a big table to any degree at all by browsing it, and that is one of the issues in this thread. People simply won't accept that. 
They naturally think, all their prior experience working against them, that "well, maybe I can't completely understand it but I can get some useful impressions by browsing it. Suppose it has a random sample, etc." But that isn't true at all. The only impressions you can get by browsing it are fake impressions that are as likely to mislead you as to be truthful. For example, you might look at Art's data set that on the basis of a nano-scale browsing sample seems to be ordered by zip code. But that nano-scale sample is just as likely to miss huge swaths of the table where it is not ordered by zip code, and thus mislead you. Tables are ordered only for that brief moment when you construct a result using ORDER BY. Put that ordered result into a table and the next millisecond you don't know if that order still holds, because other people or other processes may be deleting, editing or otherwise changing records in that new table. The beginner makes that mistake and despite reading a hundred times "tables are not ordered" will proceed to write scripts and other workflows that assume they are ordered, forever. The experienced DBMS person knows that tables are not ordered and routinely applies order when that is desired. Getting back to grabbing a "representative" subset of a table which you can use to make decisions as to how you might use the data in the table or systematically modify it, learn from it, etc.: As you get into thinking what you consider to be "representative" in whatever particular task you are doing, with big tables you are going to use queries so you can take advantage of the highly refined, endlessly powerful toolset that queries provide.

---

About randomness and order and teleporting into a "sample" table that is skewed by pseudo-order of the table that is a latent effect of how it was first loaded: The fundamental notion not to lose sight of is that if you are working with a big table whatever you see in a table window means nothing in terms of order. 
If you think whatever you see by browsing a big table means something about the order of that table, you haven't gotten your head around what "big table" really means. Tables the size of the Earth really are a different deal than tables which can be listed on a roll of paper only a few hundred feet long. In a database of people giving a location for each person, if all of the records for people happen to be from the state of Alabama in the first page of the display of 150 million records that doesn't mean anything at all about the content of the table; for example, it does not imply that all 150 million records should be assumed to be in Alabama. Big tables are unordered. Assuming they are ordered in any way is a mistake, and concluding they are ordered because of what is seen browsing a few screens, a few feet of screens out of a table that is larger than the Earth, as if that were somehow a proof, is simply a blunder. If you want to see order use ORDER BY. I understand that some data sets in some formats as a side effect of the format or the way they were initially loaded, really do appear to be ordered, as in Art's excellent example of health care data originally stored by zip code. But no matter if the data started that way or however it appears when you first touch it, the moment that data gets loaded into any modern DBMS (Oracle, DB2, etc, etc.) or modern data handling software like Radian that can work with big tables, it no longer can be assumed to be ordered. There are exceptions, of course, but big tables tend to come from data sources that do not store them as ordered data. For order you use ORDER BY, and you don't take it for granted that you will get reliably lucky with pseudo-order as in the health care data originally stored by zip code, because that's apparently how the records were first loaded into the database system. 
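To make the "use ORDER BY, don't trust pseudo-order" point concrete, here is a minimal sketch in Python using the standard sqlite3 module. Everything in it is made up for illustration: the `patients` table, the `load_seq` column that records load order, and the zip values. It assumes a table that was loaded in zip order and then had a few records added later, and it asks the question with a query instead of by browsing.

```python
import sqlite3

def build_demo(conn, zips):
    """Create a hypothetical 'patients' table; load_seq records load order."""
    conn.execute(
        "CREATE TABLE patients (load_seq INTEGER PRIMARY KEY, zip TEXT)"
    )
    conn.executemany(
        "INSERT INTO patients (zip) VALUES (?)", [(z,) for z in zips]
    )

def count_order_breaks(conn):
    """Count adjacent pairs (by load order) where the zip code goes backwards."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM patients AS a
        JOIN patients AS b ON b.load_seq = a.load_seq + 1
        WHERE b.zip < a.zip
        """
    ).fetchone()
    return row[0]

def zips_in_order(conn):
    """The only reliable way to see order: ask for it explicitly."""
    cur = conn.execute("SELECT zip FROM patients ORDER BY zip, load_seq")
    return [r[0] for r in cur]

conn = sqlite3.connect(":memory:")
# 1000 records that look perfectly ordered when you browse the first screens...
zips = [f"{90000 + i:05d}" for i in range(1000)]
# ...followed by a few later inserts that are not in zip order.
zips += ["90012", "94110", "90001"]
build_demo(conn, zips)

breaks = count_order_breaks(conn)   # places where the apparent order fails
ordered = zips_in_order(conn)       # order imposed with ORDER BY
print(breaks, ordered[:3])
```

Browsing the first thousand rows of this toy table would suggest it is ordered. The query finds the breaks without anyone looking at a single screen, and the ORDER BY result is ordered regardless of how the rows were loaded.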
I note that even in Art's example, counting on that ordering could be bad medicine because if there are any changes to the table, and new or edited records are inserted in an unordered fashion that is different from the original load, the DBMS could just as easily put a swath of Lost Angeles (that was a typo, but since it seems to fit let's leave it...) records intermixed with zip codes for San Francisco, which is a few hundred kilometers to the North of LA. Because some apparent order can persist through what seem to be large swaths of records the trap for beginners is to browse a few sections of the table and declare, "well, of course it is ordered! Everywhere I look it is ordered!" If you really understand what a big table is you know that is nonsense. A big table is like a paper printout of records that stretches from San Francisco all the way across America, all the way across the Atlantic, all the way across Europe, Eastern Europe, Russia, Central Asia and all the way across China to Beijing... a huge, long roll of paper with several records per inch for all of those many thousands of miles. Are you really going to crawl on your knees from San Francisco to New York scanning a few records on that printout every inch of the way? Be honest with yourself and admit you'd give up after a few blocks of crawling on your knees and would never even make it to the on-ramp of the Bay Bridge to Oakland, let alone get past Oakland, the Livermore Valley or into the central valley of California. And then you'd still have a few years of crawling on your knees to get to New York, not much progress toward Beijing. Even if you didn't crawl every inch of the way, just think how long it would take you to walk the land part of that and to row the ocean part. That is what a big table is. You might think, "well, I'm not going to crawl on my hands and knees to look at every inch of that printout. 
I'll just sample a bit here and there," thinking you can get your head around that by looking at a few feet of printout in San Francisco, a few feet near a favorite barbecue restaurant in Kansas City, a few feet in the industrial swamps near Newark, and then a few feet of printout in Dusseldorf, Minsk, Rostov, Alma Ata, Tashkent and so on. But that is just crazy... what about all those records in the many, many thousands of kilometers in between? But that's not what is at the forefront of somebody's mind who has looked at a few dozen screens and thinks he understands the big table. The beginner then goes on to write code that assumes it is ordered and the results might even look OK. If the results are also a big table, you might never know you have trash mixed in, because you won't be able to tell by browsing that trash is in there. And, if you make the initial conceptual mistake of thinking a big table is ordered, despite the frequently-repeated admonitions of DBMS masters that assuming it to be ordered is a mistake, well, you will never write the queries to determine if there is trash mixed in.

---

Shifting gears away from big tables: Suppose you have a table window showing a selected set of records that is small enough to browse. Let's call that a "small table," meaning it is small enough so that user interface controls like scroll bars make sense, you can browse realistically with page up / down and so on. With small tables you can indeed get your head around the table by browsing it using casual tools. That is where our experience as GIS people works against us in this new world of big tables. We have been trained by our experience with small tables that things like scroll bars should always make sense. 
Manifold has helped train people into expecting that with fine, convenient tools in Release 8 that really do make life very simple and convenient, so long as you don't notice the implied deal with the devil that they assume the tables involved are "small tables." It's like people who have been trained within simple, small scale economies to always use cash. They have tools to help them, like wallets that can carry different denominations of bills, strong boxes and safes, and even tools like money-counting machines where if you need to count out ten thousand in twenties you just set up the machine, push a button and it counts out the requisite number of bills. Easy. And then one day something changes, like you become a Google billionaire or you are an ordinary person in a country where hyperinflation hits. Suddenly, instead of dealing with hundreds or thousands you must deal with billions. You discover that to finance that new space exploration company you've wanted it is not really physically realistic to be shipping around warehouses full of paper money or in the hyperinflation case to go shopping for a loaf of bread with a wheelbarrow full of stacks of paper cash. The solution is electronic cash, which is not comforting to those who have been trained to not trust what they don't have "filled" into their strongbox or wallet in the form of tangible bills they can touch, just like people getting hung up on what an interactive interface can show them in the form of tangible records that have filled a particular window. But just like it is not realistic for a hyper-rich person to personally count out a billion dollars for his next hyper-yacht, it makes no sense to think you can browse a billion record table with interface tools, like scroll bars, that will work beautifully at the smaller scales for which they are designed.
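
For what it's worth, here is a small sketch of the query alternative to browsing, using the Alabama example from earlier. It is Python with the standard sqlite3 module, and the `people` table, its `load_seq` column and the row counts are all invented for illustration. The first "screen" of 150 records shows nothing but Alabama, while an aggregate query over the whole table tells a different story.

```python
import sqlite3

def load_people(conn, states):
    """Create a hypothetical 'people' table; load_seq records load order."""
    conn.execute(
        "CREATE TABLE people (load_seq INTEGER PRIMARY KEY, state TEXT)"
    )
    conn.executemany(
        "INSERT INTO people (state) VALUES (?)", [(s,) for s in states]
    )

def first_screen(conn, rows=150):
    """States visible on the first 'screen' of records, in load order."""
    cur = conn.execute(
        "SELECT state FROM people ORDER BY load_seq LIMIT ?", (rows,)
    )
    return {r[0] for r in cur}

def state_counts(conn):
    """Summarize the whole table with one aggregate query."""
    cur = conn.execute(
        """
        SELECT state, COUNT(*)
        FROM people
        GROUP BY state
        ORDER BY COUNT(*) DESC, state
        """
    )
    return dict(cur.fetchall())

conn = sqlite3.connect(":memory:")
# The first 150 records happen to be Alabama; the rest are from elsewhere.
states = ["AL"] * 150 + ["CA", "NY", "TX"] * 5000
load_people(conn, states)

seen = first_screen(conn)      # what browsing the first screen suggests
counts = state_counts(conn)    # what the table actually contains
print(seen, counts)
```

The toy table is tiny, but the same GROUP BY works on a table far too big to scroll through, which is the whole point: the query looks at every record and you look only at the summary.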