[Faculty] Fwd: [CSRC.COLLOQUIUM] "Anomaly Detection and Explanation in Big Data"

Jose Castillo jcastillo at sdsu.edu
Tue Oct 26 15:29:21 PDT 2021


DATE:
*Friday, October 29, 2021*


TITLE:
*Anomaly Detection and Explanation in Big Data*

TIME:
*3:30-4:30PM*



LOCATION:
*In Person - GMCS 314*
or
Join Zoom Meeting - https://SDSU.zoom.us/j/86808277973


SPEAKER/BIO:
*Hajar Homayouni, Computer Science, San Diego State University*


ABSTRACT:

Data quality tests validate the data stored in databases and data
warehouses and detect violations of syntactic and semantic constraints.
Domain experts struggle to capture all the important constraints and to
check that they are satisfied. The constraints are often identified in an
ad hoc manner, based on knowledge of the application domain and the needs
of the stakeholders. Constraints can span a single attribute, multiple
attributes, or sequences of records such as time series. Constraints over
multiple attributes can involve both linear and non-linear relationships
among the attributes.

We propose ADQuaTe as a data quality test framework that automatically (1)
discovers different types of constraints from the data, (2) marks records
that violate the constraints as suspicious, and (3) explains the
violations. Domain knowledge is required to determine whether or not the
suspicious records are actually faulty. The framework can incorporate
feedback from domain experts to improve the accuracy of constraint
discovery and anomaly detection. We instantiate ADQuaTe in two ways to
detect anomalies in non-sequence and sequence data.
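
As a rough illustration of the three steps only (this is not the ADQuaTe
implementation), the sketch below swaps in a much simpler technique: a
per-attribute range learned from the median and the median absolute
deviation. The record fields, the values, the function names, and the
factor k are all assumptions made up for the example.

```python
import statistics

def discover_constraints(records, k=3.0):
    """Step 1: learn a [low, high] range per attribute from the data itself."""
    constraints = {}
    for attr in records[0]:
        values = [r[attr] for r in records]
        centre = statistics.median(values)
        # Median absolute deviation; fall back to 1.0 if the data is constant.
        spread = statistics.median([abs(v - centre) for v in values]) or 1.0
        constraints[attr] = (centre - k * spread, centre + k * spread)
    return constraints

def mark_and_explain(records, constraints):
    """Steps 2 and 3: flag violating records and name the violated range."""
    report = []
    for i, r in enumerate(records):
        reasons = [f"{a}={r[a]} outside [{lo:.1f}, {hi:.1f}]"
                   for a, (lo, hi) in constraints.items()
                   if not lo <= r[a] <= hi]
        if reasons:
            report.append((i, reasons))
    return report

# Hypothetical health-style records; the last heart rate is implausible.
records = [{"age": a, "heart_rate": h} for a, h in
           [(34, 72), (41, 68), (29, 75), (48, 70), (38, 74), (45, 69), (36, 310)]]

report = mark_and_explain(records, discover_constraints(records))
for index, reasons in report:
    print(index, reasons)
```

A flagged record is only suspicious. As the abstract notes, a domain expert
still has to decide whether it is actually faulty.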

The first instantiation (ADQuaTe2) uses an unsupervised model, an
autoencoder, for constraint discovery in non-sequence data. ADQuaTe2
analyzes each record in isolation to discover constraints among the
attributes. We evaluate the effectiveness of ADQuaTe2 using real-world
non-sequence datasets from the human health and plant diagnosis domains. We
demonstrate that ADQuaTe2 can discover new constraints that were previously
unspecified in existing data quality tests and can report both previously
detected and new faults in the data. We also use non-sequence datasets from
the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2
after incorporating ground truth knowledge and retraining the autoencoder
model.
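
The general idea behind autoencoder-based detection is to compress each
record, reconstruct it, and treat a large reconstruction error as a sign
that the record breaks a relationship the model has learned. The sketch
below is a stand-in under stated assumptions, not ADQuaTe2: instead of a
trained neural autoencoder it uses a rank-1 linear encoder and decoder
(the first principal direction, found by power iteration), and the
records, which mostly satisfy y = 2x, are hypothetical.

```python
import math

def fit_rank1(records, iters=200):
    """Fit a rank-1 linear encoder/decoder: the first principal direction."""
    n, d = len(records), len(records[0])
    mean = [sum(r[j] for r in records) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in records]
    w = [1.0] * d  # arbitrary starting direction
    for _ in range(iters):
        # Power iteration step: w <- X^T (X w), then normalise.
        proj = [sum(c[j] * w[j] for j in range(d)) for c in centered]
        w = [sum(p * c[j] for p, c in zip(proj, centered)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    return mean, w

def reconstruction_errors(records, mean, w):
    """Encode each record to one number, decode it, and measure the gap."""
    errors = []
    for r in records:
        c = [x - m for x, m in zip(r, mean)]
        code = sum(ci * wi for ci, wi in zip(c, w))   # encode
        recon = [code * wi for wi in w]               # decode
        errors.append(math.sqrt(sum((ci - ri) ** 2 for ci, ri in zip(c, recon))))
    return errors

# Hypothetical records (x, y): ten satisfy y = 2x, the last one does not.
records = [(float(x), 2.0 * x) for x in range(1, 11)] + [(5.0, 20.0)]

mean, w = fit_rank1(records)
errors = reconstruction_errors(records, mean, w)
suspicious = max(range(len(records)), key=lambda i: errors[i])
print("most suspicious record:", records[suspicious])
```

A linear model like this can only capture linear relationships among the
attributes. Capturing non-linear ones is the reason to use a neural
autoencoder instead.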

The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for
constraint discovery in sequence data. IDEAL analyzes the correlations and
dependencies among data records to discover constraints. We evaluate the
effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and
Colorado State University Energy Institute. We demonstrate that IDEAL can
detect previously known anomalies from these datasets. Using mutation
analysis, we show that IDEAL can detect different types of injected faults.
We also demonstrate that the accuracy of the approach improves after
incorporating ground truth knowledge about the injected faults and
retraining the LSTM-autoencoder model.
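
The mutation-analysis idea, injecting known faults into clean data and
counting how many the detector catches, can be sketched as follows. This
is a toy under stated assumptions rather than IDEAL: the detector is a
simple jump threshold instead of an LSTM-autoencoder, and the sine-wave
series, the two mutation operators, and the injection positions are all
hypothetical.

```python
import math

def detect(series, threshold=0.5):
    """Stand-in detector: flag index t when the jump from t-1 exceeds threshold."""
    return {t for t in range(1, len(series))
            if abs(series[t] - series[t - 1]) > threshold}

# Hypothetical mutation operators, each mapping a value to a faulty value.
MUTATIONS = {
    "spike": lambda v: v + 3.0,
    "sign_flip": lambda v: -v,
}

def kill_rates(clean, positions):
    """Inject one fault per mutant; report the fraction detected per operator."""
    rates = {}
    for name, mutate in MUTATIONS.items():
        killed = 0
        for pos in positions:
            mutant = list(clean)
            mutant[pos] = mutate(mutant[pos])
            if pos in detect(mutant):
                killed += 1
        rates[name] = killed / len(positions)
    return rates

# Clean series: a slow sine wave whose consecutive values differ only slightly.
clean = [math.sin(2 * math.pi * t / 50) for t in range(200)]
positions = list(range(10, 200, 7))

rates = kill_rates(clean, positions)
for name, rate in rates.items():
    print(f"{name}: {rate:.2f} of injected faults detected")
```

Comparing detection rates across operators shows which fault types a
detector misses. Here the jump threshold catches every spike but misses
sign flips near the zero crossings, where the flipped value barely moves.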

The novelty of this research lies in the development of a
domain-independent framework that effectively and efficiently discovers
different types of constraints from the data, detects and explains
anomalous data, and minimizes false alarms through an interactive learning
process.


Host:
*Wei Wang*

Note: Videos of previous colloquium talks can be seen on the CSRC website
in the colloquium archive section or on the CSRC YouTube page here
<https://www.youtube.com/channel/UCN0ZEztlmyDqG2pm-Rle_Eg/feed>.


