# American Institute of Mathematical Sciences

ISSN: 2380-6966

eISSN: 2380-6974

## Big Data & Information Analytics

July 2016, Volume 1, Issue 2&3

2016, 1(2&3): 139-161 doi: 10.3934/bdia.2016001
Abstract:
Nowadays we are in the big data era. The high dimensionality of data poses a big challenge to processing them effectively and efficiently. Fortunately, in practice data are not unstructured. Their samples usually lie around low-dimensional manifolds and are highly correlated. Such characteristics can be effectively captured by low rankness. As an extension of the sparsity of first-order data, such as voices, low rankness is also an effective measure of the sparsity of second-order data, such as images. In this paper, I review the representative theories, algorithms and applications of low-rank subspace recovery models in data processing.
2016, 1(2&3): 163-169 doi: 10.3934/bdia.2016002
Abstract:
Big Data and Big Graphs have become landmarks of current cross-border research, destined to remain so for a long time. While we try to optimize our ability to assimilate both, novel methods continue to inspire new applications, and vice versa. Clearly these two big things, data and graphs, are connected, but can we ensure management of their complexities, computational efficiency, and robust inference? Critical bridging features are addressed here to identify grand challenges and bottlenecks.
2016, 1(2&3): 171-183 doi: 10.3934/bdia.2016003
Abstract:
Urban air pollution poses a great threat to human health and has been a major concern of many metropolises in developing countries. Lately, a few air quality monitoring stations have been established to inform the public of real-time air quality indices based on fine particulate matter, e.g. $PM_{2.5}$, in countries suffering from air pollution. Air quality, unfortunately, is fairly difficult to manage because of multiple complex human activities, from driving to smelting. We observe that the hidden regular patterns of human activities make prediction possible, and this motivates us to infer urban air conditions from the perspective of time series. In this paper, we focus on $PM_{2.5}$-based urban air quality and introduce two kinds of time-series methods for real-time and fine-grained air quality prediction, harnessing historical air quality data reported by existing monitoring stations. The methods are evaluated on real-life $PM_{2.5}$ concentration data from 2013 (January - December) in Wuhan, China.
2016, 1(2&3): 185-216 doi: 10.3934/bdia.2016004
Abstract:
With the advent of the Internet of Things (IoT) and cloud computing, the need for data stores that would be able to store and process big data in an efficient and cost-effective manner has increased dramatically. Traditional data stores seem to have numerous limitations in addressing such requirements. NoSQL data stores have been designed and implemented to address the shortcomings of relational databases by compromising on ACID and transactional properties to achieve high scalability and availability. These systems are designed to scale to thousands or millions of users performing updates, as well as reads, in contrast to traditional RDBMSs and data warehouses. Although there is a plethora of potential NoSQL implementations, there is no one-size-fits-all solution to satisfy even main requirements. In this paper, we explore popular and commonly used NoSQL technologies and elaborate on their documentation, existing literature and performance evaluation. More specifically, we will describe the background, characteristics, classification, data model and evaluation of NoSQL solutions that aim to provide the capabilities for big data analytics. This work is intended to help users, individuals or organizations, to obtain a clear view of the strengths and weaknesses of well-known NoSQL data stores and select the right technology for their applications and use cases. To do so, we first present a systematic approach to narrow down the proper NoSQL candidates and then adopt an experimental methodology that can be repeated by anyone to find the best among short listed candidates considering their specific requirements.
2016, 1(2&3): 217-225 doi: 10.3934/bdia.2016005
Abstract:
Given a data set with one categorical response variable and multiple categorical or continuous explanatory variables, some applications require discretizing the continuous explanatory variables. A proper supervised discretization usually achieves a better result than an unsupervised one. Rather than discretizing each variable individually, as recently proposed by Huang, Pan and Wu in [12,13], we suggest a forward supervised discretization algorithm to capture a higher association from the multiple explanatory variables to the response variable. Experiments with the GK-tau and the GK-lambda are presented to support the statement.
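
To make the association measure concrete, the following is a minimal sketch of the Goodman-Kruskal tau for two categorical variables, together with a naive single-variable cut-point search that maximizes it. It is not the forward algorithm of the paper, whose details the abstract does not give; the names `gk_tau` and `best_cut` and the midpoint candidate cuts are assumptions made for illustration.

```python
from collections import Counter


def gk_tau(xs, ys):
    """Goodman-Kruskal tau of the response ys given the explanatory xs.

    Both arguments are equal-length sequences of categorical values.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_marg = Counter(xs)
    y_marg = Counter(ys)
    if len(y_marg) < 2:
        return 0.0  # a constant response leaves nothing to explain
    # Sum over response categories j of p_{.j}^2.
    sum_y_sq = sum((c / n) ** 2 for c in y_marg.values())
    # Sum over cells (i, j) of p_{ij}^2 / p_{i.}.
    cond = sum((c / n) ** 2 / (x_marg[x] / n) for (x, _), c in joint.items())
    return (cond - sum_y_sq) / (1.0 - sum_y_sq)


def best_cut(values, ys):
    """Hypothetical helper: binary cut of one continuous variable maximizing tau.

    Candidate cuts are midpoints between consecutive distinct values.
    Returns (cut, tau), or None if there are fewer than two distinct values.
    """
    candidates = sorted(set(values))
    best = None
    for lo, hi in zip(candidates, candidates[1:]):
        cut = (lo + hi) / 2
        bins = [v > cut for v in values]  # two-bin discretization
        score = gk_tau(bins, ys)
        if best is None or score > best[1]:
            best = (cut, score)
    return best
```

For instance, `best_cut([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], list("aaabbb"))` selects the cut between 3.0 and 10.0, where the two bins separate the response classes completely.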
2016, 1(2&3): 227-245 doi: 10.3934/bdia.2016006
Abstract:
Coalition attacks are nowadays one of the most common types of attack in the online advertising industry. In this paper, we attempt to mitigate the problem of fraud by proposing a hybrid framework that detects coalition attacks based on multiple metrics. We also articulate the theoretical basis for integrating these metrics into the hybrid framework. Furthermore, we instantiate the framework with two metrics and develop a detection system that identifies coalition attacks from two distinct perspectives.
2016, 1(2&3): 247-259 doi: 10.3934/bdia.2016007
Abstract:
Most existing clustering algorithms are slow for dividing a large dataset into a large number of clusters. In this paper, we propose a truncated FCM algorithm to address this problem. The main idea behind our proposed algorithm is to keep only a small number of cluster centers during the iterative process of the FCM algorithm. Our numerical experiments on both synthetic and real datasets show that the proposed algorithm is much faster than the original FCM algorithm and the accuracy is comparable to that of the original FCM algorithm.
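
One possible reading of "keeping only a small number of cluster centers" is that each point retains fuzzy memberships only for its nearest few centers. The sketch below implements one fuzzy c-means update under that assumption. The function names, the `keep` parameter and the nearest-center rule are illustrative choices, not details confirmed by the abstract.

```python
import math


def truncated_memberships(point, centers, keep, m=2.0):
    """Fuzzy memberships of one point, restricted to its `keep` nearest centers.

    Returns a dict mapping center index to membership; the values sum to 1.
    """
    dists = sorted((math.dist(point, c), k) for k, c in enumerate(centers))[:keep]
    # A point lying exactly on a center belongs to that center fully.
    if dists[0][0] == 0.0:
        return {dists[0][1]: 1.0}
    power = 2.0 / (m - 1.0)
    # Standard FCM membership, normalized over the retained centers only.
    inv = {k: d ** -power for d, k in dists}
    total = sum(inv.values())
    return {k: w / total for k, w in inv.items()}


def update_centers(points, centers, keep, m=2.0):
    """One FCM center update using the truncated memberships."""
    dim = len(centers[0])
    num = [[0.0] * dim for _ in centers]
    den = [0.0] * len(centers)
    for p in points:
        for k, u in truncated_memberships(p, centers, keep, m).items():
            w = u ** m
            den[k] += w
            for a in range(dim):
                num[k][a] += w * p[a]
    # A center that no point retained stays where it was.
    return [
        tuple(n / d for n in row) if d > 0 else c
        for row, d, c in zip(num, den, centers)
    ]
```

With `keep` much smaller than the number of centers, each point touches only `keep` membership entries per iteration instead of one per center. Note that this sketch still computes every distance to find the nearest centers, so it illustrates the truncation itself rather than the reported speed-up.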
2016, 1(2&3): 261-274 doi: 10.3934/bdia.2016008
Abstract:
News recommender systems efficiently handle the overwhelming number of news articles, simplify navigation, and retrieve relevant information. Many conventional news recommender systems use collaborative filtering to make recommendations based on the behavior of users in the system. In this approach, the introduction of new users or new items can cause the cold start problem, as there will be insufficient data on these new entries for the collaborative filtering to draw any inferences for new users or items. Content-based news recommender systems emerged to address the cold start problem. However, many content-based news recommender systems treat documents as a bag of words, neglecting the hidden themes of the news articles. In this paper, we propose a news recommender system leveraging topic models and the time spent on each article. We build an automated recommender system that is able to filter news articles and make recommendations based on users' preferences. We use topic models to identify the thematic structure of the corpus. These themes are incorporated into a content-based recommender system to filter out news articles that contain themes of less interest to users and to recommend articles that are thematically similar to users' preferences. Our experimental studies show that utilizing topic modeling and the time spent on a single article can outperform state-of-the-art recommendation techniques. The resulting recommender system based on the proposed method is currently operational at The Globe and Mail (http://www.theglobeandmail.com/).
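
The idea of combining topic distributions with reading time can be sketched as follows. This is an illustrative assumption about how the two signals might be combined, not the system described in the paper. The topic vectors are taken as given (for example, from any topic model), and `user_profile`, `cosine` and `recommend` are hypothetical names.

```python
import math


def user_profile(history):
    """Time-weighted mean of the topic vectors of the articles a user read.

    history: non-empty list of (topic_vector, seconds_spent) pairs
    with a positive total reading time.
    """
    total = sum(seconds for _, seconds in history)
    dim = len(history[0][0])
    return [sum(vec[a] * seconds for vec, seconds in history) / total
            for a in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(profile, candidates, top_n=1):
    """Rank (name, topic_vector) candidates by similarity to the profile."""
    ranked = sorted(candidates,
                    key=lambda item: cosine(profile, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_n]]
```

A user who spent 300 seconds on an article dominated by the first topic and 10 seconds on one dominated by the second gets a profile weighted toward the first topic, so candidates heavy in that topic rank higher.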
2016, 1(2&3): 275-276 doi: 10.3934/bdia.2016009
Abstract:
This note introduces the research and development capacity of a data mining leader in Canada, Manifold Data Mining Inc. (Manifold), and its collaboration with the academic community.