<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:og="http://ogp.me/ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:schema="http://schema.org/" xmlns:sioc="http://rdfs.org/sioc/ns#" xmlns:sioct="http://rdfs.org/sioc/types#" xmlns:skos="http://www.w3.org/2004/02/skos/core#" xmlns:xsd="http://www.w3.org/2001/XMLSchema#" version="2.0" xml:base="https://www.linuxjournal.com/tag/data-science">
  <channel>
    <title>Data Science</title>
    <link>https://www.linuxjournal.com/tag/data-science</link>
    <description/>
    <language>en</language>
    
    <item>
  <title>Shall We Study Amazon's Pricing Together?</title>
  <link>https://www.linuxjournal.com/content/shall-we-study-amazons-pricing-together</link>
  <description>  &lt;div data-history-node-id="1340108" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/pricinggun_0.jpg" width="800" height="400" alt="pricing gun" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/doc-searls" lang="" about="https://www.linuxjournal.com/users/doc-searls" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Doc Searls&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;&lt;em&gt;Is it possible to figure out how we're being profiled online?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This past July, I spent a quality week getting rained out in a series of brainstorms by alpha data geeks at the &lt;a href="https://www.strategic-pr.com/bi-analyst-summit/"&gt;Pacific Northwest BI &amp; Analytics Summit&lt;/a&gt; in Rogue River, Oregon. Among the many things I failed to understand fully there was how much, or how well, we could know about how the commercial sites and services of the online world deal with us, based on what they gather about us, on the fly or over time, as we interact with them.&lt;/p&gt;

&lt;p&gt;The short answer was "not much". But none of the experts I talked to said "Don't bother trying." On the contrary, the consensus was that the sums of data gathered by most companies are (in the words of one expert) "spaghetti balls" that are hard, if not impossible, to unravel completely. More to my mission in life and work, they said it wouldn't hurt to have humans take some interest in the subject.&lt;/p&gt;

&lt;p&gt;In fact, that was pretty much why I was invited there, as a Special Guest. My topic was "&lt;a href="https://www.strategic-pr.com/doc-searls"&gt;When customers are in full command of what companies do with their data—and data about them&lt;/a&gt;". As it says at that link, "The end of this story...is a new beginning for business, in a world where customers are fully in charge of their lives in the marketplace—both online and off: a world that was implicit in both the peer-to-peer design of the Internet and the nature of public markets in the pre-industrial world."&lt;/p&gt;

&lt;p&gt;Obviously, this hasn't happened yet.&lt;/p&gt;

&lt;p&gt;This became even more obvious during a break when I drove to our Airbnb nearby. By chance, my rental car radio was tuned to a program called &lt;a href="http://blogs.wgbh.org/innovation-hub/2018/7/27/scurvy-surgery-history-randomized-trials/"&gt;From Scurvy to Surgery: The History Of Randomized Trials&lt;/a&gt;. It was an &lt;a href="http://blogs.wgbh.org/innovation-hub/"&gt;Innovation Hub&lt;/a&gt; interview with &lt;a href="https://en.wikipedia.org/wiki/Andrew_Leigh"&gt;Andrew Leigh&lt;/a&gt;, Ph.D. (&lt;a href="https://twitter.com/ALeighMP"&gt;@ALeighMP&lt;/a&gt;), economist and member of the Australian Parliament, discussing his new book, &lt;a href="https://yalebooks.yale.edu/book/9780300236125/randomistas"&gt;&lt;em&gt;Randomistas: How Radical Researchers Are Changing Our World&lt;/em&gt;&lt;/a&gt; (Yale University Press, 2018). At one point, Leigh reported that "One expert says, 'Every pixel on Amazon's home page has had to justify its existence through a randomized trial.'"&lt;/p&gt;

&lt;p&gt;I thought, &lt;em&gt;Wow. How much of my own experience of Amazon has been as a randomized test subject? And can I possibly be in anything even remotely close to full charge of my own life inside Amazon's vast silo?&lt;/em&gt;&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/shall-we-study-amazons-pricing-together" hreflang="en"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 02 Oct 2018 11:30:00 +0000</pubDate>
    <dc:creator>Doc Searls</dc:creator>
    <guid isPermaLink="false">1340108 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Learning Data Science</title>
  <link>https://www.linuxjournal.com/content/learning-data-science</link>
  <description>  &lt;div data-history-node-id="1339530" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/binary-code-507786_640.jpg" width="640" height="452" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've written about data science and
machine learning. In case my enthusiasm wasn't obvious from my
writing, let me say it plainly: it has been a long time since I last
encountered a technology that was so poised to revolutionize the world
in which we live.
&lt;/p&gt;

&lt;p&gt;
Think about it: you can download, install and use open-source data science libraries, for free. You can download rich data sets on nearly
every possible topic you can imagine, for free. You can analyze that
data, publish it on a blog, and get reactions from governments and
companies.
&lt;/p&gt;

&lt;p&gt;
I remember learning in high school that the difference between freedom
of speech and freedom of the press is that not everyone has a printing
press. Not only has the internet provided everyone with the
equivalent of a printing press, but it has given us the power to
perform the sort of analysis that until recently was exclusively
available to governments and wealthy corporations.
&lt;/p&gt;

&lt;p&gt;
During the past year, I have increasingly heard that data science is
the sexiest profession of the 21st century and the one that will
be in greatest demand. Needless to say, those two things make for a very
appealing combination! It's no surprise that I've seen a major uptick
in the number of companies inviting me to teach on this subject.
&lt;/p&gt;

&lt;p&gt;
The upshot is that you—yes, you, dear reader—should spend time
in the coming weeks, months and years learning whatever you can
about data science. This isn't because you will change jobs and
become a data scientist. Rather, it's because everyone is going to become a data scientist. No matter what work you do, you'll be better at
it, because you will be able to use the tools of data science to analyze
past performance and make predictions based on it.
&lt;/p&gt;

&lt;p&gt;
Back when I started to develop web applications, it was the norm to
have a database team that created the tables and queries. Nowadays,
although there certainly are places that have a full-time database staff, the
assumption is that every developer has at least a passing familiarity
with relational (or even NoSQL) databases and how to work with
them. In the same way that developers who understand databases are
more powerful than those who don't, people in the computer field who
understand data science are more powerful than those who don't.
&lt;/p&gt;

&lt;p&gt;
There is a bit of bad news on this front, though. If you thought that
technological change in programming and the web moved at a
breakneck pace, you haven't seen anything yet! The world of data
science (its tools, algorithms and applications) is moving
at an overwhelming speed. The good news is that everyone is
struggling to keep up, which means if you find yourself
overwhelmed, you're probably in very good company. Just be sure to keep
moving ahead, aiming to increase your understanding of the theory,
algorithms, techniques and software that data scientists use.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/learning-data-science" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Tue, 24 Oct 2017 12:19:27 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339530 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>Unsupervised Learning</title>
  <link>https://www.linuxjournal.com/content/unsupervised-learning</link>
  <description>  &lt;div data-history-node-id="1339461" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/BinaryData50_2.jpg" width="500" height="414" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/reuven-m-lerner" lang="" about="https://www.linuxjournal.com/users/reuven-m-lerner" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;Reuven M. Lerner&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
In my last few articles, I've looked into machine learning and
how you can build a model that describes the world in some way. All of
the examples I looked at were of "supervised learning", meaning
that you loaded data that already had been categorized or classified in
some way, and then created a model that "learned" the ways
the inputs mapped to the outputs. With a good model, you then
were able to predict the output for a new set of inputs.
&lt;/p&gt;
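&lt;p&gt;
To make the idea of supervised learning concrete, here is a minimal sketch in plain Python rather than scikit-learn. It uses a one-nearest-neighbor rule, which is only one of many possible models, and the measurements, labels and function names are all invented for illustration.
&lt;/p&gt;
&lt;pre&gt;
```python
# Labeled training data: (height_cm, weight_kg) paired with a known category.
# Every value here is made up for illustration.
training = [
    ((150.0, 50.0), "small"),
    ((155.0, 55.0), "small"),
    ((180.0, 85.0), "large"),
    ((185.0, 90.0), "large"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(inputs):
    """Return the label of the training example closest to `inputs`."""
    nearest = min(training, key=lambda pair: squared_distance(pair[0], inputs))
    return nearest[1]
```
&lt;/pre&gt;
&lt;p&gt;
Calling predict((152.0, 52.0)) returns "small", because the closest labeled example carries that label. The "learning" here is nothing more than remembering the categorized data, but the shape of the problem (known inputs, known outputs, a prediction for new inputs) is the same one a real model deals with.
&lt;/p&gt;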
&lt;p&gt;
Supervised learning is a very useful technique and is quite
widespread. But there is another set of techniques in machine
learning known as &lt;em&gt;unsupervised learning&lt;/em&gt;. These techniques, broadly
speaking, ask the computer to find the hidden structure in the
data—in other words, to "learn" what the meaning of the data is, what
relationships it contains, which features are of importance, and which
data records should be considered to be outliers or anomalies.
&lt;/p&gt;
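&lt;p&gt;
As a rough illustration of "finding hidden structure", the following sketch groups a list of numbers into two clusters without being given any labels. It is a bare-bones k-means loop with k fixed at 2, written in plain Python; the function name, the starting centers and the fixed number of rounds are my own assumptions, and it expects a non-empty list.
&lt;/p&gt;
&lt;pre&gt;
```python
def two_means(values, rounds=10):
    """Split numbers into two clusters with a simple k-means loop (k=2)."""
    centers = [min(values), max(values)]  # crude but deterministic start
    clusters = [[], []]
    for _ in range(rounds):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearer center (ties go to the first).
            index = min(range(2), key=lambda i: abs(v - centers[i]))
            clusters[index].append(v)
        # Move each center to the mean of its cluster; keep it if empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters
```
&lt;/pre&gt;
&lt;p&gt;
Given readings that bunch around 1 and around 10, the function separates them into those two bunches on its own. Nobody told it what the groups mean; deciding whether the split is interesting is still up to you.
&lt;/p&gt;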

&lt;p&gt;
Unsupervised learning also can be used for what's known as
"dimensionality reduction", in which the model functions as a
preprocessing step, reducing the number of features in order to
simplify the inputs that you'll hand to another model.
&lt;/p&gt;

&lt;p&gt;
In other words, in supervised learning, you teach the computer about
your data and hope that it understands the relationships and
categorization well enough to successfully categorize data it
hasn't seen before.
&lt;/p&gt;

&lt;p&gt;
In unsupervised learning, by contrast, you're asking the computer to
tell you something interesting about the data.
&lt;/p&gt;

&lt;p&gt;
This month, I take an initial look at the world of unsupervised
learning. Can a computer categorize data as well as a human? How can
you use Python's scikit-learn to create such models?
&lt;/p&gt;

&lt;h3&gt;
Unsupervised Learning&lt;/h3&gt;

&lt;p&gt;
There's a children's card game called &lt;em&gt;Set&lt;/em&gt; that is a useful way to
think about machine learning. Each card in the game contains a
picture. The picture contains one, two or three shapes. There are
several different shapes, and each shape has a color and a fill
pattern. In the game, players are supposed to identify three-card
groups using any one of those properties. Thus, you could
create a group based on the color green, in which all cards are green
(but differ in the number of shapes, the kind of shape and the fill
pattern). You could create a group based on the number of shapes, in
which every card has two shapes, but those shapes can be of any
color, any kind and any fill pattern.
&lt;/p&gt;
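&lt;p&gt;
The grouping described above can be sketched in a few lines of Python. The card values and the group_by function are hypothetical, invented only to show that the same cards fall into different groups depending on which property you pick.
&lt;/p&gt;
&lt;pre&gt;
```python
from collections import defaultdict

# Each card is a dict of properties; these sample cards are invented.
cards = [
    {"count": 1, "shape": "oval", "color": "green", "fill": "solid"},
    {"count": 2, "shape": "diamond", "color": "green", "fill": "striped"},
    {"count": 2, "shape": "oval", "color": "red", "fill": "empty"},
    {"count": 3, "shape": "squiggle", "color": "green", "fill": "empty"},
]


def group_by(cards, prop):
    """Map each value of `prop` to the list of cards that share it."""
    groups = defaultdict(list)
    for card in cards:
        groups[card[prop]].append(card)
    return dict(groups)
```
&lt;/pre&gt;
&lt;p&gt;
Grouping these cards by "color" puts three of them together under green, while grouping by "count" pairs up the two cards that show two shapes. Neither grouping is more correct than the other, which is the point of the analogy.
&lt;/p&gt;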
                 
&lt;p&gt;
The idea behind the game is that players can create a variety of
different groups and should take advantage of this in order to win
the game.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/unsupervised-learning" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Thu, 10 Aug 2017 12:13:28 +0000</pubDate>
    <dc:creator>Reuven M. Lerner</dc:creator>
    <guid isPermaLink="false">1339461 at https://www.linuxjournal.com</guid>
    </item>
<item>
  <title>iguaz.io</title>
  <link>https://www.linuxjournal.com/content/iguazio</link>
  <description>  &lt;div data-history-node-id="1339148" class="layout layout--onecol"&gt;
    &lt;div class="layout__region layout__region--content"&gt;
      
            &lt;div class="field field--name-field-node-image field--type-image field--label-hidden field--item"&gt;  &lt;img src="https://www.linuxjournal.com/sites/default/files/nodeimage/story/12049f7.jpg" width="573" height="369" alt="" typeof="foaf:Image" class="img-responsive" /&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-author field--type-ds field--label-hidden field--item"&gt;by &lt;a title="View user profile." href="https://www.linuxjournal.com/users/james-gray" lang="" about="https://www.linuxjournal.com/users/james-gray" typeof="schema:Person" property="schema:name" datatype="" xml:lang=""&gt;James Gray&lt;/a&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-body field--type-text-with-summary field--label-hidden field--item"&gt;&lt;p&gt;
An IT megatrend in progress involves the shift from legacy monolithic apps
running on enterprise storage to systems of engagement that interact with
users, collect real-time data from many sources and store it in elastic and
shared data services. A self-described "disruptive" enterprise
seeking to push this vision forward is &lt;a href="http://iguaz.io"&gt;iguaz.io&lt;/a&gt;, which recently announced a
virtualized data services architecture for revolutionizing both private and
public clouds. 
&lt;/p&gt;
&lt;img src="http://www.linuxjournal.com/files/linuxjournal.com/ufiles/imagecache/large-550px-centered/u1000009/12049f7.jpg" alt="" title="" class="imagecache-large-550px-centered" /&gt;&lt;p&gt;
In contrast to the current siloed approach, iguaz.io
consolidates data into a high-volume, real-time data repository that
virtualizes and presents it as streams, messages, files, objects or data
records. All data types are stored consistently on different memory or
storage tiers, and popular application frameworks (Hadoop, ELK, Spark or
Docker containers) are accelerated.
&lt;/p&gt;

&lt;p&gt;
In addition to a 10x to 100x
improvement in time to insights at lower costs, iguaz.io says that its
architecture provides best-in-class data security based on a real-time
classification engine, a critical need for data sharing among users and
business units.
&lt;/p&gt;&lt;/div&gt;
      
            &lt;div class="field field--name-node-link field--type-ds field--label-hidden field--item"&gt;  &lt;a href="https://www.linuxjournal.com/content/iguazio" hreflang="und"&gt;Go to Full Article&lt;/a&gt;
&lt;/div&gt;
      
    &lt;/div&gt;
  &lt;/div&gt;

</description>
  <pubDate>Fri, 02 Sep 2016 15:30:00 +0000</pubDate>
    <dc:creator>James Gray</dc:creator>
    <guid isPermaLink="false">1339148 at https://www.linuxjournal.com</guid>
    </item>

  </channel>
</rss>
